Fig. 5: Comparison of the predictability of expression-predictive motifs (EPMs) in 15 different Solanum genotypes with structural variations (SVs) in the transcription start site (TSS) and transcription termination site (TTS) identified and characterised by Alonge and colleagues (2020)³⁶.
From: Deep learning the cis-regulatory code for gene expression in selected model plants

a The taxonomic grouping of the fifteen Solanum genotypes based on SVs inferred by Alonge and colleagues (left) and the hierarchically clustered predictions of gene expression (right) differ in topology for the groups of S. lycopersicum var. cerasiforme (SLC, 23) and vintage (SLL, 27). b A total of 314 genes are shared between two sets: genes with SVs in their upstream or downstream 5 kbp regions that showed a log-fold change in gene expression levels in any of the fifteen Solanum genotypes, and genes whose predicted probabilities from the Solanum MSR leaf model have a variance exceeding 0.005, indicating differential expression across the 15 genotypes. Six random examples were selected from this intersection for detailed examination, as shown in panels (d–i), with further material provided in Supplementary Data 9. c EPM variation across the 15 Solanum genotypes, comparing genes with conserved or mutated EPMs alongside shifts in gene expression levels. Genes with homogeneous gene expression (n = 27,993) showed higher rates of conserved EPMs (blue boxplots), while those with differential gene expression (n = 2053) exhibited higher rates of mutated EPMs (yellow boxplots). Gene expression heterogeneity was determined from the MSR leaf model probabilities, with a variance larger than 0.005 indicating heterogeneous expression. Predicted probabilities below or equal to 0.5 indicated low gene expression rates, and probabilities above 0.5 indicated high gene expression rates (Supplementary Data 7). EPM occurrence and predicted gene expression levels were linked using BLAMM³⁷. EPMs were classified as conserved or mutated if present in all genotypes or absent in one genotype per gene, respectively (Supplementary Data 8). Boxplots depict samples after bootstrap repetition, showing the 25th, 50th (median) and 75th percentiles; the box spans the interquartile range, representing the central 50% of the data. Whiskers extend from the minimum to the maximum values, showing the spread of the dataset.
The two-sided F1 and Chi-squared tests (p value <0.0001) support statistical significance (Source Data). d Structural variations in the flanking regions altered the gene expression measured in the species indicated with asterisks, shown here for the example genes Solyc02g087170 (d) and Solyc02g080300 (g). Congruently, high (red) and low (blue) probability scores of the multi-species reference (MSR) model indicate differential gene expression across the genotypes. Boxplots depict sample characteristics, showing the 25th, 50th (median) and 75th percentiles; the box spans the interquartile range, representing the central 50% of the data. Whiskers extend from the minimum to the maximum values, showing the spread of the dataset. Outliers are depicted as dots (Source Data). e Gene maps of the upstream regions around the TSS show the UTR region (striped boxes), exons (“ATG” + filled boxes) and introns (white boxes), along with the location of significant EPM matches (e-value <0.00001) inside (white arrow) and outside (grey arrow) of their positional preference and the location of SVs (black arrows). f For closer inspection, only EPMs located at their preferred position were chosen. For sequence comparison, ITAG.3, representing genetic variant A, and PAS104479, representing genetic variant B, are displayed. The EPMs M006 p0m02 and M032 p0m15 of the S. lycopersicum MSR model lie within the 5′ UTR of Solyc02g087170. An SV indel mutation of 178 bp, located only 37 bp downstream of the TSS, disrupts epmSoly-M006-p0m02 in genotypes of variant B; in genotypes of variant A, this EPM starts 30 bp downstream of the expected TSS. Due to the same SV, however, epmSoly-M032-p0m15 is now located within its preferred positional range in species of genetic variant B. g Probability scores of the MSR model indicate differential gene expression for Solyc02g080300 (see caption for panel d). h Gene maps for Solyc02g080300 (see caption for panel e). i Within the first intron, 233 bp downstream of the expected TSS of Solyc02g080300, epmSoly-M009-p0m04 was identified.
An SV indel mutation of 10 bp lies 7 bp downstream of epmSoly-M009-p0m04 and does not disrupt the EPM. In contrast to the previous example, a difference of two point mutations within epmSoly-M009-p0m04 coincides with low probabilities of high gene expression for genotypes of variant A.
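
The thresholds described for panels b and c can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the gene names and probabilities are hypothetical toy values, the use of population variance is an assumption (the caption does not say which variance estimator was used), and "mutated" is interpreted here as an EPM missing from at least one genotype. Only the variance threshold (0.005) and the probability cutoff (0.5) are taken from the caption.

```python
from statistics import pvariance

VARIANCE_THRESHOLD = 0.005    # from the caption: variance above this = differential
HIGH_EXPRESSION_CUTOFF = 0.5  # from the caption: <= 0.5 low, > 0.5 high


def is_differential(probabilities, threshold=VARIANCE_THRESHOLD):
    """True if predicted probabilities vary across genotypes beyond the threshold.

    Assumption: population variance; the caption does not name the estimator.
    """
    return pvariance(probabilities) > threshold


def expression_class(probability, cutoff=HIGH_EXPRESSION_CUTOFF):
    """Label a predicted probability as 'high' or 'low' expression."""
    return "high" if probability > cutoff else "low"


def classify_epm(present_by_genotype):
    """'conserved' if the EPM is found in every genotype, else 'mutated'.

    Assumption: 'mutated' means absent from at least one genotype.
    """
    return "conserved" if all(present_by_genotype) else "mutated"


# Hypothetical toy data: predicted probabilities per gene across four genotypes.
predicted = {
    "geneA": [0.9, 0.1, 0.9, 0.1],
    "geneB": [0.60, 0.61, 0.59, 0.60],
    "geneC": [0.2, 0.8, 0.2, 0.8],
}
# Hypothetical set of genes with SVs in their 5 kbp flanking regions.
genes_with_svs = {"geneA", "geneB"}

differential = {gene for gene, probs in predicted.items() if is_differential(probs)}
overlap = differential & genes_with_svs  # analogous to the 314-gene intersection

print(sorted(differential))
print(sorted(overlap))
```

In this toy example, geneA and geneC exceed the variance threshold, and only geneA is also in the SV set, mirroring how the intersection in panel b is formed.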