Fig. 6: Sequence-to-expression models using promoter data from Saccharomyces cerevisiae.

From: Accuracy and data efficiency in deep learning models of protein expression

A Genotypic space of the yeast promoter data from Vaishnav et al.25, visualized with the UMAP27 algorithm for dimensionality reduction; sequences were featurized using counts of overlapping 4-mers, as in Fig. 1B. The dataset contains 3929 promoter variants (80 nt long) of 199 native genes, together with fluorescence measurements of a yellow fluorescent protein (YFP) reporter; the inset shows the distribution of variants per gene across the whole dataset. B Bubble plots show the accuracy of five random forest (RF) models trained on datasets of constant size and increasing sequence diversity, following a strategy similar to that in Fig. 5A. We first aggregated variant clusters into twelve groups and then trained RF models on datasets assembled from fractions of randomly chosen groups; the total size of each training set was kept constant at 400 sequences. Accuracy was quantified with the R² score averaged across test sets from each group (~30 sequences/group) that were held out from training; the inset shows model accuracy on each test set. In line with the results in Fig. 5A, model coverage can be improved by adding small fractions of each group to the training set; we observed similar trends for other random choices of groups included in the training set (Supplementary Fig. S13). Details on data processing and model training can be found in the Methods and Supplementary Text. Sequence diversity was quantified as in Fig. 5B.
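
A minimal sketch of the panel A featurization and embedding, assuming a list of 80-nt promoter variant strings; the function and variable names (kmer_counts, promoter_seqs) and the UMAP hyperparameters are illustrative placeholders, not the analysis code used in the paper.

```python
from itertools import product

import numpy as np
import umap  # umap-learn package


def kmer_counts(seq, k=4):
    """Count overlapping k-mers in a DNA sequence (4**k = 256 features for k = 4)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip windows containing ambiguous bases such as N
            counts[index[kmer]] += 1
    return counts


# Placeholder data: random 80-nt sequences stand in for the promoter variants.
rng = np.random.default_rng(0)
promoter_seqs = ["".join(rng.choice(list("ACGT"), size=80)) for _ in range(200)]

# Featurize each sequence by its overlapping 4-mer counts (200 x 256 matrix).
X = np.vstack([kmer_counts(s) for s in promoter_seqs])

# 2-D UMAP embedding of the k-mer counts; default n_neighbors/min_dist,
# which are not necessarily the settings used for the published figure.
embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(X)
```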
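A hedged sketch of the panel B training scheme, assuming the k-mer feature matrix X, expression measurements y, and a per-sequence group label vector already exist; train_and_score and all parameter choices are illustrative, not the authors' code. Splitting the 400-sequence budget evenly across the chosen groups is one plausible reading of "aggregating fractions of randomly chosen groups".

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)


def train_and_score(X, y, groups, chosen_groups, train_size=400, test_per_group=30):
    """Train an RF on `train_size` sequences drawn from `chosen_groups` and
    return the R2 score on a held-out test set from every group."""
    test_idx, train_pools = {}, []
    for g in np.unique(groups):
        idx = rng.permutation(np.where(groups == g)[0])
        test_idx[g] = idx[:test_per_group]          # held out from all training
        if g in chosen_groups:
            train_pools.append(idx[test_per_group:])
    per_group = train_size // len(train_pools)      # even split of the budget
    train_idx = np.concatenate([pool[:per_group] for pool in train_pools])

    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(
        X[train_idx], y[train_idx]
    )
    return {g: r2_score(y[i], model.predict(X[i])) for g, i in test_idx.items()}


# Synthetic stand-in data: 12 groups of 200 sequences with 256 k-mer features.
groups = np.repeat(np.arange(12), 200)
X = rng.random((groups.size, 256))
y = rng.random(groups.size)
per_group_r2 = train_and_score(X, y, groups, chosen_groups={0, 1, 2})
```

Averaging the per-group R² values returned by this sketch corresponds to the accuracy summary described in the legend, while the per-group dictionary corresponds to the inset showing accuracy on each test set.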
