Fig. 1: Characterization of the training data. | Nature Communications

From: Accuracy and data efficiency in deep learning models of protein expression

A We employed a large phenotypic screen in Escherichia coli (ref. 23) of an sfGFP-coding gene preceded by a variable 96-nt sequence. The variable region was designed on the basis of eight sequence properties previously described as impacting translational efficiency: nucleotide content (%AT), patterns of codon usage (codon adaptation index, CAI; codon ramp bottleneck position, BtlP, and strength, BtlS), hydrophobicity of the polypeptide (mean hydrophobicity index, MHI), and stability of three secondary structures tiled along the transcript (MFE-1, MFE-2, and MFE-3). A total of 56 seed sequences were designed to provide broad coverage of the sequence space, and then subjected to controlled randomization to create 56 mutational series of ~4000 sequences each. After removal of variants with missing measurements, the dataset contains 228,000 sequences. Violin plots show the distribution of the average value of the eight properties across the 56 mutational series; the biophysical properties were normalized to the range [0, 1] and then averaged across series. For all violins, the white circle indicates the median, box edges are at the 25th and 75th percentiles, and whiskers show the 95% confidence interval.

B Two-dimensional UMAP (ref. 27) visualization of overlapping 4-mers computed for all 228,000 sequences; this representation reveals 56 clusters, with each cluster corresponding to a mutational series that locally explores the sequence space around its seed. We have highlighted five series with markedly distinct phenotype distributions (labels denote the series number). Other UMAP projections for overlapping 3-mers and 5-mers are shown in Supplementary Fig. S1.

C Mutational series with qualitatively distinct phenotypic distributions, as measured by FACS-sequencing of sfGFP fluorescence normalized to its maximal measured value; solid lines are Gaussian kernel density estimates of the fluorescence distribution. Measurements are normalized to the maximum sfGFP fluorescence across cells transformed with the same construct, averaged over 4 experimental replicates of the whole library (ref. 23). Fluorescence distributions for all mutational series are shown in Supplementary Fig. S2.
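
As an illustration of the overlapping 4-mer representation mentioned in panel B, the following is a minimal Python sketch of how a nucleotide sequence could be turned into a fixed-length vector of overlapping k-mer counts. The function name `kmer_counts`, the use of raw counts (rather than frequencies or another encoding), and the toy sequence are assumptions made for illustration; the caption does not specify how the authors encoded the k-mers before applying UMAP.

```python
from itertools import product


def kmer_counts(seq, k=4, alphabet="ACGT"):
    """Count overlapping k-mers in `seq`.

    Returns a list of length len(alphabet)**k, ordered by the
    product order of `alphabet` (AAAA, AAAC, AAAG, ... for k=4).
    This is a hypothetical helper, not the authors' code.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    # Slide a window of width k one nucleotide at a time,
    # so consecutive k-mers overlap by k - 1 positions.
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        if window in index:  # skip windows containing other symbols, e.g. N
            counts[index[window]] += 1
    return counts


# Toy sequence for illustration only (not taken from the dataset).
toy = "ATGCATGCAT"
vec = kmer_counts(toy)
```

Under these assumptions, a 96-nt variable region yields 93 overlapping 4-mers distributed over 256 possible entries. Stacking one such vector per sequence would give a matrix that could then be passed to a dimensionality-reduction method such as UMAP.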
