Fig. 5: Interpretation of uASPIre data and SAPIENs. | Nature Communications

From: Large-scale DNA-based phenotypic recording and deep learning enable highly accurate sequence-function mapping

a, b Influence of Shine–Dalgarno-like motifs (a) and AUG codons (b) in the 5′-UTR on the RBS activity of N17 library members. Black horizontal lines indicate the median IFP0–480 min in the data set. Boxplots (a) contain a variable number of RBSs (n between 24 and 1246) depending on the occurrence of the respective motif, and boxes range from first (lower line) to third (upper line) quartile with median (red center line) and percentiles 20/80 (whiskers). Circles (b) represent median IFP0–480 min with percentiles 20/80 (shaded areas) and in-frame positions (highlighted red).

c Importance of ResNet filters for the prediction. Pearson correlation between filter activation and RBS activities of all held-out sequences is displayed per filter and position for the first convolutional layer of one randomly selected ResNet. Five filter stacks with apparent high significance are framed in bold and the average weight per base and position of the corresponding centroid filter is shown (right).

d Visualization of integrated gradients scores of SAPIENs in a low-dimensional space. T-distributed stochastic neighbor embedding (t-SNE) is applied to the integrated gradient scores of test set RBSs. t-SNE dim1/2 are the two dimensions resulting from the t-SNE algorithm.

e Impact of 5′-UTR bases and positions on RBS activity. Using an all-zeroes input as baseline, the average attribution score per base and position is displayed as determined for the test-set sequences. Letter size corresponds to the importance score and orientation to the direction of effect (i.e. upward/downward corresponding to a tendency to increase/decrease IFP0–480 min).

f Attribution of bases and positions to strong RBSs. The strongest 5% of sequences in the test set were distributed into five clusters using the k-means algorithm. The displayed motifs are the medoids of each cluster (i.e. the sequences closest to the respective cluster centroid).

g, h In silico evolution of RBSs. Starting from the sequence with the lowest (g) and highest (h) predicted IFP0–480 min in the test set, pairwise mutations (underlined) are greedily applied until no further increase (g) or decrease (h) in IFP0–480 min is observed (total of 10 and 8 rounds for g and h, respectively).

Source data for a–c, e, and f are available as a Source Data file.
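
The greedy in silico evolution described for panels g and h can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "pairwise mutation" means changing exactly two positions per round, and it replaces the SAPIENs predictor with a hypothetical stand-in, `toy_score`, that simply counts G bases. The names `pairwise_mutants`, `greedy_evolve`, and `toy_score` are illustrative only.

```python
from itertools import combinations, product
from typing import Callable, Iterator, List, Tuple

BASES = "ACGU"


def pairwise_mutants(seq: str) -> Iterator[str]:
    """Yield every sequence differing from `seq` at exactly two positions.

    Assumption: one 'pairwise mutation' changes two bases at once.
    """
    for i, j in combinations(range(len(seq)), 2):
        for a, b in product(BASES, repeat=2):
            if a != seq[i] and b != seq[j]:
                yield seq[:i] + a + seq[i + 1:j] + b + seq[j + 1:]


def greedy_evolve(
    start: str,
    score: Callable[[str], float],
    maximize: bool = True,
) -> List[Tuple[str, float]]:
    """Apply the best double mutant each round until the score stops improving.

    Returns the trajectory as (sequence, score) pairs, starting sequence first.
    """
    sign = 1.0 if maximize else -1.0
    current, current_score = start, score(start)
    trajectory = [(current, current_score)]
    while True:
        candidates = [(m, score(m)) for m in pairwise_mutants(current)]
        if not candidates:
            return trajectory
        best, best_score = max(candidates, key=lambda c: sign * c[1])
        # Stop when no double mutant strictly improves on the current sequence.
        if sign * best_score <= sign * current_score:
            return trajectory
        current, current_score = best, best_score
        trajectory.append((current, current_score))


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a trained predictor: number of G bases."""
    return float(seq.count("G"))


up = greedy_evolve("AAAAAA", toy_score, maximize=True)
down = greedy_evolve("GGGGGG", toy_score, maximize=False)
print(len(up) - 1, up[-1])
print(len(down) - 1, down[-1][1])
```

With a real model, `score` would wrap the predictor's output for a single 5′-UTR sequence. The number of rounds then depends on the model and the starting sequence, so the round counts printed for this toy scorer carry no biological meaning.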
