Fig. 3: Models can accurately predict toehold performance with sparse data. | Nature Communications

From: Sequence-to-function deep learning frameworks for engineered riboregulators

a The language model (LM) trained on toehold embeddings accurately classifies only real toeholds (blue circle), not shuffled (gray triangle, p = 2.45 × 10⁻¹⁴) or scrambled sequences (gray square, p = 5.03 × 10⁻¹⁵), as assessed by Matthews Correlation Coefficient (MCC). Language models trained on shuffled toeholds and scrambled toeholds are also more accurate for real than for shuffled toeholds (p = 1.18 × 10⁻⁶, p = 4.41 × 10⁻⁴) and more accurate for real than for scrambled toeholds (p = 1.58 × 10⁻⁸, p = 2.01 × 10⁻⁵), but they fail to achieve the accuracy of the LM trained on real toeholds. b In five-fold cross-validation, the convolutional neural network (CNN)-based model's predictions for both ON and OFF score significantly better than predictions on scrambled toeholds, as evaluated by R² for ON and OFF (p = 2.97 × 10⁻¹⁴, p = 4.19 × 10⁻¹²), c Spearman correlation (p = 1.46 × 10⁻¹², p = 9.09 × 10⁻¹³), and d mean squared error (MSE) (p = 9.59 × 10⁻¹², p = 2.12 × 10⁻⁹). For (a–d), N = 5 cross-validation folds. e Data ablation studies were performed with both the LM and the CNN-based model, evaluating real toeholds (blue circle), shuffled sequences (gray triangle), and scrambled sequences (gray square). LM performance does not drop off steeply with half as much data, and the LM continues to perform better on real vs. shuffled (p = 9.23 × 10⁻⁹) and real vs. scrambled (p = 6.65 × 10⁻¹⁰) sequences with just 736 training examples. For (e), N = 5 trials. f Similarly, the CNN-based model's performance does not drop off steeply for either ON or g OFF prediction with half as much data, and its R² values are significantly different at a training size of 736 samples for both ON and OFF predictions (p = 7.23 × 10⁻⁷, p = 0.028). For (f, g), N = 10 trials. For all panels, error bars represent mean ± standard deviation and all tests are two-tailed t-tests.
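As a rough illustration of the evaluation described in the caption, the sketch below computes MCC for one set of binary predictions and a two-sample t statistic between per-fold scores. It is a minimal sketch under stated assumptions, not the authors' code: the function names `mcc` and `t_statistic` are hypothetical, the fold scores are made-up placeholders rather than values from the paper, and the pooled-variance (Student) form of the t-test is assumed because the caption does not say which variant was used.

```python
import math
from statistics import mean, stdev


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when a marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def t_statistic(a, b):
    """Two-sample t statistic with pooled variance (assumed variant).

    Returns (t, degrees_of_freedom). A two-tailed p-value would then be
    read from the t distribution with that many degrees of freedom.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2


# Placeholder per-fold MCC scores (N = 5 folds); NOT data from the paper.
real_folds = [0.50, 0.52, 0.48, 0.51, 0.49]
scrambled_folds = [0.02, 0.00, 0.01, -0.01, 0.03]

t, df = t_statistic(real_folds, scrambled_folds)
print(f"t = {t:.2f}, df = {df}")
```

With SciPy available, `scipy.stats.ttest_ind(real_folds, scrambled_folds)` would return the same statistic together with the two-tailed p-value.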
