Fig. 4: Expression predictive motifs (EPMs) identified with DeepLIFT and TF-MoDISco from convolutional neural networks (CNNs) trained as single-species reference (SSR) and multi-species reference (MSR) models of A. thaliana, S. lycopersicum, S. bicolor and Z. mays.
From: Deep learning the cis-regulatory code for gene expression in selected model plants

a Importance scores (IS) across the selected 1.5 kbp upstream and downstream regions of the example gene AT1G01650 of A. thaliana, which the SSR leaf model predicts with a positive sum IS of 3.86. The region with the maximum IS lies upstream, at 1097-1155 bp, and includes two cytosine-thymidine hexamers (6CT).
b The 6CT motifs match an expression predictive motif (EPM) inferred with TF-MoDISco. For clarity, we propose an EPM nomenclature that abbreviates the plant species by genus and epithet, followed by the model used to produce the EPM (SSR or MSR). The physiological condition of the plant is indicated by a number (0 for standard conditions). The predictability of each motif is indicated by a ‘p’ followed by several 1s and 0s for low and high rates of gene expression, respectively. The delimiter is followed by the motif number within the metacluster and its orientation (forward or reverse). Finally, the number of seqlets included, the information content and the consensus sequence are appended to the EPM name. For example, epmArth-S019-p0m06 has a sum importance score of 0.31, a maximum importance score of 0.03 and a minimum importance score of −0.001 (Supplementary Data 4). It occurs three times in the upstream region of AT1G01650. In addition, epmArth-S019-p0m06 matches the transcription factor binding site (TFBS) of A. thaliana BPC5 of the BRR/BPC class with 99% similarity, measured with the Pearson correlation coefficient (PCC), and an e-value of 0.002 (Supplementary Data 4, JASPAR accession MA1403.1). According to the proposed nomenclature, this EPM was identified in A. thaliana (Arth) by the SSR model (S) under standard conditions in leaf (0), predicts high gene expression rates (p0), and was inferred from 443 seqlets with an information content of 19.4, which reflects nucleotide frequency, specificity and motif heterogeneity; its consensus sequence is CTCTCT.
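The sum IS and the maximum-IS region described in panel a can be illustrated with a minimal sketch. The caption does not state how the 1097-1155 bp region was delimited; a fixed-width sliding window is one simple possibility and is assumed here. The function names and the toy scores are hypothetical and are not taken from the figure or from DeepLIFT.

```python
def sum_importance(scores):
    """Total importance score over a region (the 'sum IS')."""
    return sum(scores)


def max_importance_window(scores, width):
    """Return (start, end, total) for the contiguous window of `width`
    positions with the largest summed importance. `end` is exclusive."""
    if not 0 < width <= len(scores):
        raise ValueError("width must be between 1 and len(scores)")
    window = sum(scores[:width])
    best_start, best_total = 0, window
    for start in range(1, len(scores) - width + 1):
        # Slide by one base: add the entering score, drop the leaving one.
        window += scores[start + width - 1] - scores[start - 1]
        if window > best_total:
            best_start, best_total = start, window
    return best_start, best_start + width, best_total


# Toy per-base scores (made up for illustration only).
toy_scores = [0.0, 0.1, -0.2, 0.5, 0.7, 0.6, -0.1, 0.0, 0.2, -0.3]
start, end, total = max_importance_window(toy_scores, width=3)
print(f"sum IS = {sum_importance(toy_scores):.2f}, "
      f"max window = [{start}, {end}), window IS = {total:.2f}")
```

Ties are resolved in favour of the leftmost window; a real analysis might instead report every window above a threshold.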
c The EPMs are assigned to 17 clusters based on similarity, using the Smith–Waterman algorithm followed by manual inspection of the aligned consensus sequences. The clusters are named after conserved DNA motifs, written in IUPAC nucleotide code, together with the minimum number of motif repeats (numerals) and potential additions (+). Clusters containing EPMs that significantly match TFBSs from the JASPAR database (e-value <0.05, compared using PCC) are marked with black triangles. EPMs identified by the MSR model are highlighted by grey boxes in the dendrogram, while predictions of high and low gene expression are shown by red and blue branches, respectively. Underlined clusters, with selected representative EPMs, are shown as examples in panels (d) and (e). The full-scale version of the dendrogram and the consensus sequence alignment of the EPMs can be found in Supplementary Fig. 9 and Supplementary Data 5.
d EPMs of the 2CWY+ cluster uniformly predict low rates of gene expression (blue tips). EPMs of this cluster are identified by both the SSR and MSR (grey boxes) models. In contrast to the SSR model, the MSR models identified 2CWY+ motifs for all four reference species. The inverted web-logos show the EPMs' negative importance scores; the scale ranges from 0 to 0.05 or −0.05 for DeepLIFT and TF-MoDISco metacluster 0 or 1 (p0, p1), respectively. Histograms display the positional preference of each EPM, inferred from the number of seqlets relative to the transcription start site (TSS) and transcription termination site (TTS). EPMs of the 2CWY+ type display significant similarity to the transcription factor binding site of AGL42 (JASPAR accession MA1201.1) of A. thaliana.
e The 2GCB+ and 2CT+ clusters contain EPMs predicting both low and high gene expression rates, identified by the SSR and MSR models, corresponding to their positional occurrence relative to the TSS or TTS. The 2GCB+ and 2CT+ clusters closely resemble the previously determined transcription factor binding motifs MA1820.1 and MA1403.1, respectively.
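The similarity-based clustering in panel c relies on Smith–Waterman local alignment of consensus sequences. The following is a minimal sketch of the scoring step only, not the authors' pipeline: the match, mismatch and gap values are assumptions, a linear gap penalty is used, and IUPAC ambiguity codes are compared as plain characters.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Scoring parameters are illustrative defaults, not values from the paper.
    Only two rows of the dynamic-programming matrix are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Compare the CTCTCT consensus from panel b with a longer, made-up repeat.
print(smith_waterman_score("CTCTCT", "CTCTCTCT"))
```

Pairwise scores of this kind could feed a distance matrix for hierarchical clustering; the caption notes that the final cluster assignment also involved manual inspection.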