
Extended Data Fig. 3: Machine learning models to predict PAM profile from amino acid sequence.

From: Custom CRISPR–Cas9 PAM variants via scalable engineering and machine learning

Extended Data Fig. 3

a, Comparison of machine learning model architectures (linear regression, random forest, and neural network) and amino acid encodings (one-hot, one-hot plus all pairwise amino acid combinations, and Georgiev (ref. 47)). The R² value is shown between the experimentally determined k (via HT-PAMDA) and the predicted k (via each ML model) for an internal 5-fold cross-validation on the training set. Each validation set is subdivided according to the minimum Hamming distance (HD) of each variant to its nearest neighbor in the corresponding training set; thus, validation sets become more challenging as HD increases. b, Performance of the optimal PAM machine learning algorithm (PAMmla; a neural network with one-hot encoding) on two additional 80%/20% random train-test splits. c, Proportion of test set SpCas9 enzymes that have a predominant preference for A, C, G, or T at the 3rd position of the PAM, or are inactive (based on HT-PAMDA data). d, Comparison of test set ks broken down by nucleotide preference of each test variant at the 3rd position of the PAM (comparing ks experimentally determined by HT-PAMDA versus predicted by PAMmla). Nucleotide preference is defined as the 3rd position nucleotide of each enzyme variant’s most preferred PAM by HT-PAMDA. e, Proportion of test set SpCas9 enzymes that have a predominant preference for A, C, G, or T at the 4th position of the PAM, or are inactive (based on HT-PAMDA data). f, Comparison of test set ks broken down by preference of each test set variant at the 4th position of the PAM (comparing ks experimentally determined by HT-PAMDA versus predicted by PAMmla). Nucleotide preference is defined as the 4th position nucleotide of each enzyme variant’s most preferred PAM by HT-PAMDA. g, Effect of random over-sampling by most active PAM. The PAMmla model was trained with and without randomly over-sampling the training set to balance the number of enzyme variants with different PAM preferences. R² values for the two models were compared on subsets of variants within the test set with different preferences at the 3rd and 4th positions of the PAM. Over-sampling improved performance, particularly for under-represented PAM classes (see panels c and e). h, Pearson’s correlations between HT-PAMDA replicates performed with distinct spacer sequences for a set of 28 inactive versus 28 active enzymes within the test set. Dashed line = data median. True labels for active versus inactive enzymes were determined using a cutoff value of 10^−4.3 for the maximum k on any PAM. Enzymes separated into active and inactive classes based on this criterion showed correlation between replicates only for active enzymes, indicating that HT-PAMDA data for enzymes with maximum ks below this cutoff are likely dominated by non-reproducible noise in the HT-PAMDA assay. i, Correlation between ks experimentally determined by HT-PAMDA versus predicted by PAMmla for inactive variants (maximum HT-PAMDA k < 10^−4.3) within the test set; PAMmla is not predictive of the background noise in the HT-PAMDA-determined PAM profiles of inactive enzymes. For all panels that utilize HT-PAMDA data, the log10 rate constants (k) are the mean of n = 2 replicate HT-PAMDA experiments using two distinct spacer sequences. For all scatterplots, each data point represents the rate constant of one enzyme variant against one of the 64 possible NNNN PAMs.
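To make the workflow in panels a, b, and g concrete, the sketch below shows how one could one-hot encode variant amino acid sequences, fit a neural network regressor on HT-PAMDA log10 rate constants, score an internal 5-fold cross-validation by R² within minimum-Hamming-distance bins, and randomly over-sample the training set by preferred-PAM class. This is a minimal illustration, not the published PAMmla implementation; the scikit-learn MLP architecture, the helper names (one_hot_encode, min_hamming_to_train, oversample_by_pam_class, cross_validate), and the assumed input format (equal-length sequences plus a variants × 64 matrix of log10 k values) are illustrative assumptions only.

```python
# Minimal sketch (not the released PAMmla code) of the modelling workflow in panels a, b, and g.
# Assumptions: variant amino acid sequences are equal-length strings, y is a (n_variants x 64)
# matrix of log10 HT-PAMDA rate constants (one column per NNNN PAM), and the MLP size is arbitrary.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

AA = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AA)}

def one_hot_encode(sequences):
    """One-hot encode equal-length amino acid sequences into a (variants, positions*20) matrix."""
    n, L = len(sequences), len(sequences[0])
    X = np.zeros((n, L * len(AA)), dtype=np.float32)
    for i, seq in enumerate(sequences):
        for j, aa in enumerate(seq):
            X[i, j * len(AA) + AA_INDEX[aa]] = 1.0
    return X

def min_hamming_to_train(val_seqs, train_seqs):
    """Minimum Hamming distance of each validation variant to its nearest training variant."""
    train = np.array([list(s) for s in train_seqs])
    return np.array([(np.array(list(s)) != train).sum(axis=1).min() for s in val_seqs])

def oversample_by_pam_class(X, y, pam_classes, seed=0):
    """Randomly over-sample so each preferred-PAM class (e.g. A/C/G/T at PAM position 3) is equally represented."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(pam_classes, return_counts=True)
    idx = np.concatenate([
        rng.choice(np.where(pam_classes == c)[0], size=counts.max(), replace=True)
        for c in classes
    ])
    return X[idx], y[idx]

def cross_validate(sequences, y, n_splits=5, seed=0):
    """Internal 5-fold CV: fit an MLP on one-hot features and report R2 per minimum-Hamming-distance bin."""
    X = one_hot_encode(sequences)
    y = np.asarray(y)
    for fold, (tr, va) in enumerate(KFold(n_splits, shuffle=True, random_state=seed).split(X)):
        model = MLPRegressor(hidden_layer_sizes=(256, 128), max_iter=500, random_state=seed)
        model.fit(X[tr], y[tr])
        pred = model.predict(X[va])
        hd = min_hamming_to_train([sequences[i] for i in va], [sequences[i] for i in tr])
        for d in np.unique(hd):
            mask = hd == d
            if mask.sum() < 2:  # skip bins with too few variants for a meaningful R2
                continue
            r2 = r2_score(y[va][mask].ravel(), pred[mask].ravel())
            print(f"fold {fold}, minimum HD {d}: R2 = {r2:.3f}")
```

The over-sampling step corresponding to panel g would be applied to the training fold only, for example by replacing X[tr], y[tr] with oversample_by_pam_class(X[tr], y[tr], pam_classes[tr]) before fitting, so that under-represented preferred-PAM classes contribute more training examples without touching the held-out variants.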
