Fig. 4: Building AIdit_OFF model to predict SpCas9/gRNA off-target activity.
From: Deep sampling of gRNA in the human genome and deep-learning-informed prediction of gRNA activities

a Schematic representation of the sequence design process for the 180k synthetic gRNA-off-target library. The 180k library included 184,561 carefully designed off-target sequences. These sequences included: 89,730 single-mutation off-targets, which could be further divided into single mismatches (OFF_Mis), single deletions (OFF_Del), and single insertions (OFF_Ins); 93,579 off-targets with multiple mismatches (OFF_Mul), which were generated via a traversal strategy and a predictive strategy; and 1,252 pairs collected from computational predictions or experimental assays (e.g., GUIDE-seq). The first two groups were used to quantify indel frequencies associated with different off-target types at a large scale. The last group was used to validate our method. b Heatmap of average relative indel frequencies between the matched targets and off-targets with 1-bp mismatches. At each position along the target region, the columns represent the nucleotides of the target sequences, and the rows represent the mismatched nucleotides. The relative indel frequencies are color-coded. c Influence of the insertion position on off-target sequences with 1-bp bulges. The relative editing activities, which are the ratios of indel frequencies between the off-target sequences and the corresponding matched targets, were plotted on the y-axis. Positions 1–3 were excluded from this analysis due to data filtering. d Influence of the deletion position on off-target sequences with deletions. The relative editing activities, defined as in c, were plotted on the y-axis. e Schematic representation of the workflow of the AIdit_OFF models for indel frequency prediction in off-targets for SpCas9/gRNA.
The input of AIdit_OFF included the one-hot-encoded 23-bp sequences of both the matched and mismatched target sequences of gRNAs (184 features), position-dependent substitution types (240 features), the PAM types of targets (8 features), the mismatch number, and the prediction values of AIdit_ON for both the matched and mismatched target sequences. These features were merged to serve as the input of a multilayer perceptron network with five hidden layers whose hidden sizes were 500, 650, 380, 110, and 30, respectively. Finally, the output of the multilayer perceptron network was used to predict off-target activity. f Comparison of the model performances in terms of predicting cleavage activities on off-target sequences. The benchmark compared different models (AIdit_OFF, Elevation_score, CFD score, CCTop score, and Hsu score) on endogenous off-target datasets generated using GUIDE-seq. Two metrics were compared: the area under the curve (AUC; left) for examining the false-positive rate and the area under the precision-recall curve (PR-AUC; right) for examining the recall rate.
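
The feature layout and network shape described in panel e can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. Only the feature group sizes (184, 240, 8) and the hidden sizes (500, 650, 380, 110, 30) come from the caption. Everything else is an assumption made for illustration: the 240 substitution features are taken to be 20 protospacer positions times 12 substitution types, the 8 PAM features are taken to be a one-hot over 8 hypothetical PAM categories, the mismatch number and the two AIdit_ON predictions are taken to be one feature each (435 inputs in total), and the ReLU activations, linear output, and random weights are placeholders. The example sequences and AIdit_ON values are made up.

```python
import random

BASES = "ACGT"
HIDDEN_SIZES = [500, 650, 380, 110, 30]  # hidden sizes stated in the caption


def one_hot(seq):
    """Flatten a DNA sequence into a one-hot vector with 4 entries per base."""
    vec = []
    for base in seq:
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def substitution_features(on_target, off_target, n_pos=20):
    """ASSUMED layout: 20 protospacer positions x 12 (from, to) substitution types."""
    pairs = [(a, b) for a in BASES for b in BASES if a != b]  # 12 ordered pairs
    vec = [0.0] * (n_pos * len(pairs))
    for i in range(n_pos):
        a, b = on_target[i], off_target[i]
        if a != b:
            vec[i * len(pairs) + pairs.index((a, b))] = 1.0
    return vec


def build_features(on_target, off_target, pam_index, on_pred, off_pred):
    """Concatenate the feature groups: 184 + 240 + 8 + 1 + 2 = 435 values."""
    assert len(on_target) == len(off_target) == 23
    pam = [0.0] * 8          # ASSUMED: one-hot over 8 hypothetical PAM categories
    pam[pam_index] = 1.0
    # ASSUMED: mismatches are counted in the 20-nt protospacer only.
    n_mismatch = sum(a != b for a, b in zip(on_target[:20], off_target[:20]))
    return (one_hot(on_target) + one_hot(off_target)
            + substitution_features(on_target, off_target)
            + pam + [float(n_mismatch), on_pred, off_pred])


def init_mlp(n_in, hidden=HIDDEN_SIZES, seed=0):
    """Random placeholder weights; a trained model would load fitted parameters."""
    rng = random.Random(seed)
    sizes = [n_in] + list(hidden) + [1]
    layers = []
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        w = [[rng.uniform(-0.05, 0.05) for _ in range(fan_in)]
             for _ in range(fan_out)]
        layers.append((w, [0.0] * fan_out))
    return layers


def forward(layers, x):
    """Forward pass: ReLU on hidden layers (assumed), linear scalar output."""
    for k, (w, b) in enumerate(layers):
        x = [sum(wi * xi for wi, xi in zip(row, x)) + bj
             for row, bj in zip(w, b)]
        if k < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x[0]


# Made-up example: a single C>T mismatch at protospacer position 20.
on_target = "GACGCATAAAGATGAGACGCTGG"
off_target = "GACGCATAAAGATGAGACGTTGG"
features = build_features(on_target, off_target, pam_index=0,
                          on_pred=0.62, off_pred=0.35)  # placeholder AIdit_ON values
layers = init_mlp(len(features))
score = forward(layers, features)  # meaningless until the weights are trained
```

Because the weights are random, `score` carries no biological meaning; the sketch only shows how the feature groups concatenate into one input vector and pass through five hidden layers to a single output.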