Fig. 2: Deep learning-generated sequences exhibit properties of natural regulatory DNA. | Nature Communications

From: Controlling gene expression with deep generative design of regulatory DNA

a, b Cumulative positional distribution of known DNA regulatory grammar elements (see Fig. 1a) across the regulatory regions of a generated synthetic and b natural sequences (n = 425 each). Shown are yeast TFBS [48] identified (q-value < 0.05) using FIMO [49] (blue) and TATA core promoter elements [50,51] (green) in promoters, Kozak sequences [52,53] in 5′ UTRs (yellow), termination-related motifs (positioning, efficiency and poly-AT motifs) [6,54] in 3′ UTRs and terminators (orange), and deep learning-uncovered expression-related motifs and motif association rules [7] (red) as well as nucleosome depletion [55,56] (gray) across all regions. Note that the numbers of Kozak sequences and nucleosome-depleted positions are not shown to scale: they are diluted 4-fold and 200-fold, respectively, to improve visualization (see separate comparisons across elements in Supplementary Fig. 3). TSS denotes the transcription start site, Start/Stop the coding sequence start/stop positions and TTS the transcription termination site. c GC content in equal-sized subsets of generated synthetic (red) and natural test (blue) sequences across the regulatory regions (n = 425 each). d Distribution of 5′ UTR lengths in the synthetic (red) and natural (blue) sequences. Boxes denote interquartile ranges (IQR), centers mark medians and whiskers extend to 1.5 × IQR from the quartiles. e Distribution of 3′ UTR lengths in the synthetic (red) and natural (blue) sequences. f T-distributed stochastic neighbor embedding (t-SNE) dimensionality reduction [60] over the sequence identity distance matrix among equal numbers of combined generated (red) and natural (blue) sequences (n = 2000 each). Source data are provided as a Source Data file.
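The quantities summarized in panels c to f can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `gc_content`, `box_stats` and `identity_distance` are hypothetical, the quartile method and the whisker convention (whiskers end at the most extreme data point within 1.5 × IQR of the quartiles) are assumptions, and the identity distance assumes equal-length sequences compared position by position, whereas the published analysis may rely on alignment.

```python
from statistics import quantiles


def gc_content(seq: str) -> float:
    """Fraction of G or C bases in a DNA sequence (case-insensitive)."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def box_stats(values):
    """Box-plot summary: quartiles, median and whisker ends.

    Assumption: whiskers stop at the most extreme observation that lies
    within 1.5 * IQR of the lower and upper quartiles.
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(v for v in values if v >= low_fence),
        "whisker_high": max(v for v in values if v <= high_fence),
    }


def identity_distance(a: str, b: str) -> float:
    """1 minus the fraction of identical positions.

    Assumption: both sequences have the same length and are compared
    position by position, without alignment.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 1.0 - matches / len(a)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise identity distances."""
    return [[identity_distance(a, b) for b in seqs] for a in seqs]


if __name__ == "__main__":
    utr_lengths = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # toy data, not from the paper
    print(gc_content("ATGCGC"))
    print(box_stats(utr_lengths))
    print(distance_matrix(["ACGT", "ACGA", "TTTT"]))
```

A matrix of this kind could then be passed to any t-SNE implementation that accepts precomputed distances, for example scikit-learn's `TSNE` with `metric="precomputed"` and `init="random"`.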
