Fig. 4: Prediction and optimization of toeholds can be used for diagnostic sensor development. | Nature Communications


From: Sequence-to-function deep learning frameworks for engineered riboregulators


a The diagnostic sensor optimization pipeline involves tiling the genome of a pathogen, creating an in silico corpus, and using both the natural language processing (NLP)-based and convolutional neural network (CNN)-based models to generate a list of sensors for experimental validation. b To balance the utility of the large Angenent-Mari et al. dataset (ref. 30) with the nature of the Green et al. dataset (ref. 8), which tested toeholds with free RNA as opposed to switch-trigger fusions, we used transfer learning to fine-tune the existing model and achieve a higher degree of correlation with the predictions on an external validation set of free-trigger Zika toeholds (ref. 10; N = 24). Data are shown for the CNN-based model rank predictions. c The NuSpeak optimization pipeline was designed to maintain base-pairing complementarity in the switch while producing all possible variants of positions 21–30. Sequences are then run through the consensus pipeline with the fine-tuned models described above to produce a ranked list of sensors for a given region. d Traditional model training uses gradient descent to optimize the model's weights from randomly initialized values (dark blue, time = 0) so that the model predicts the ON and OFF values for any given sequence. In contrast, the CNN-based model can be inverted for gradient ascent on the sequence: target ON and OFF values are set, and the sequence space is explored from a given initial sequence (dark blue, time = 0) to achieve those target values while the trained model remains fixed. Both optimization pipelines effectively improve the in silico ON/OFF ratio. e NuSpeak was used to optimize experimentally tested sequences, increasing the in silico ON/OFF ratio (blue circle) relative to the original ON/OFF ratio (orange triangle) (N = 354, p = 4.54 × 10⁻³³).
f Likewise, the Sequence-based Toehold Optimization and Redesign Model (STORM) was used to optimize experimental sequences, with the center line indicating the median and the box limits indicating the 25th and 75th percentiles (N = 20, p = 3.33 × 10⁻⁸). For (e, f), a two-tailed Mann–Whitney U test was used. g, h Both pipelines can be used to predict and optimize SARS-CoV-2 viral RNA sensors, with experimentally validated toeholds performing as predicted. There is a significant increase from NuSpeak predicted bad (N = 8) to predicted good (N = 8) values (p = 0.026), as well as from predicted good to optimized (N = 8) values (p = 0.020) and from predicted bad to optimized values (p = 0.004). Likewise, there is a significant increase from STORM predicted bad (N = 4) to predicted good (N = 5) values (p = 0.010), as well as from predicted good to optimized (N = 4) values (p = 0.089) and from predicted bad to optimized values (p = 0.015). For (g, h), error bars indicate mean ± S.E.M. and a one-tailed Mann–Whitney U test was used. Background was calculated by measuring the initial ON/OFF ratio. Source data are provided as a Source Data file.
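
The comparisons in panels e–h rely on the Mann–Whitney U test. As a rough illustration of what that test computes, the sketch below ranks two pooled samples and derives a two-sided p-value from the normal approximation. It is not the authors' analysis code: the function name `mann_whitney_u` and the ON/OFF values are hypothetical placeholders, and for small samples such as those in panels g and h an exact p-value is usually preferable to this approximation.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic of sample x, two-sided p-value). Tied values get
    average ranks and the variance uses the standard tie correction.
    No continuity correction is applied.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # every value is identical; no evidence of a shift
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_two_sided


# Hypothetical in silico ON/OFF ratios, invented purely for illustration.
original = [1.2, 1.5, 2.0, 2.3, 1.8, 1.1]
optimized = [3.1, 2.9, 4.0, 3.5, 2.7, 3.8]
u_stat, p_value = mann_whitney_u(original, optimized)
print(f"U = {u_stat}, two-sided p = {p_value:.4g}")
```

Under this approximation, a one-tailed p-value of the kind reported for panels g and h would be half the two-sided value when the observed shift is in the hypothesized direction.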
