Fig. 1: Gene expression prediction models required the extraction of gene-proximal sequences from crop plant reference genomes, the estimation and classification of transcript levels, and the conversion of nucleotide sequences via one-hot encoding to generate training data for modelling with a convolutional neural network.
From: Deep learning the cis-regulatory code for gene expression in selected model plants

a Per gene, two proximal regions of 1.5 kbp each were extracted, one at the transcription start site (TSS) and one at the transcript termination site (TTS), and joined with a 20 nt padding of Ns between them. Each region covers 1 kbp of non-transcribed intergenic DNA flanking the gene (upstream of the TSS or downstream of the TTS) plus 500 bp of the transcribed 5′ or 3′ end of the gene, covering e.g. UTRs. The extracted sequences were converted into matrices by one-hot encoding. b Genes were assigned to low (dark orange), medium (blue) and high (red) expression classes based on the lower and upper quartiles of the logMaxTPM distribution, shown here for A. thaliana as an example. Histograms for transcript profiles of S. lycopersicum, S. bicolor and Z. mays are shown in Supplementary Fig. 1. The threshold values for leaf transcript profiles of A. thaliana, S. lycopersicum, S. bicolor and Z. mays were 0.199, 0.000, 0.153 and 0.113 for the lower quartile and 1.621, 1.051, 1.389 and 1.465 for the upper quartile, respectively (Supplementary Data 1, Source Data). c End-to-end depiction of model training: one-hot-encoded sequences were used as training and test data for the convolutional neural network (CNN). The CNN architecture consisted of three convolutional blocks, each containing two convolutional layers followed by a pooling layer and a dropout layer. The final convolutional block was followed by two fully connected layers separated by a dropout layer, and by a final output layer with sigmoid activation.
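
The data preparation in panels a and b can be illustrated with a minimal Python sketch. It is not the authors' code. All function and constant names are hypothetical, and several details are assumptions made for illustration: a forward-strand gene with 0-based coordinates, an A/C/G/T channel order, N encoded as an all-zero vector, and class boundaries that include the threshold values. The default thresholds are the A. thaliana leaf quartiles given in the caption.

```python
# Illustrative sketch only; names and conventions are assumptions,
# not the published implementation.
FLANK_OUT = 1000  # bp of intergenic sequence kept outside the gene
FLANK_IN = 500    # bp of transcribed sequence kept inside the gene
PAD = 20          # number of Ns placed between the two regions

# Assumed channel order A, C, G, T; any other symbol (e.g. N) maps to zeros.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def extract_regions(chrom, tss, tts):
    """Join the TSS-proximal and TTS-proximal regions with N padding.

    Assumes a forward-strand gene, 0-based coordinates, tss < tts, and
    both windows lying fully inside the chromosome string.
    """
    start_region = chrom[tss - FLANK_OUT:tss + FLANK_IN]
    end_region = chrom[tts - FLANK_IN:tts + FLANK_OUT]
    return start_region + "N" * PAD + end_region


def one_hot_encode(seq):
    """Convert a nucleotide string into a list of 4-element tuples."""
    return [ONE_HOT.get(base, (0, 0, 0, 0)) for base in seq.upper()]


def expression_class(log_max_tpm, lower=0.199, upper=1.621):
    """Assign low/medium/high using lower and upper quartile thresholds.

    Whether the thresholds themselves count as low/high is an assumption.
    """
    if log_max_tpm <= lower:
        return "low"
    if log_max_tpm >= upper:
        return "high"
    return "medium"


# Toy example: a 4 kbp artificial chromosome with one gene.
chrom = "ACGT" * 1000
fused = extract_regions(chrom, tss=1000, tts=3000)
encoded = one_hot_encode(fused)
print(len(fused), len(encoded), expression_class(1.0))
```

Under these assumptions each gene yields a matrix of 3,020 positions by 4 channels (1,500 + 20 + 1,500). Reverse-strand genes, genes shorter than 500 bp and genes near chromosome ends would need extra handling that the sketch omits.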