Extended Data Fig. 6: Deep learning predicts de novo key transcriptional activators and repressors.

a. t-SNE from the cisTopic analysis on the subset of 15 cell types used for the deep learning (DL) analysis. b. Accessibility of topic regions near marker genes, calculated as the average region probability for topic regions linked to each set of marker genes (markers from the transcriptome atlas). c. Comparison of topic coherence and DL classification performance (area under the ROC curve (auROC) for the classification of the left-out test regions). The topic coherence represents how likely the regions of a topic are to co-occur (higher values are better). d. Box plot of TF motif enrichment in the topics (average enrichment score) split by topic annotation (i.e., assigned to one cell type, assigned to multiple cell types, or marked as low contribution). The box plot marks the median, upper and lower quartiles and 1.5× interquartile range (whiskers); outliers are shown individually. Number of topics per category shown above plot. e. Topic heatmap showing cell-type-specific topics. Bar plots show the number of regions per topic (cutoff p=0.995) and the area under the precision-recall curve (auPR) of the DL model. f. Contributions of the patterns identified by DeepFlyBrain to the classification of glial regions reveal activators and repressors (negative nucleotide importance). These motifs can be matched to known factors with concordant expression. g–h. Conservation of the regions centred on the motif (blue) or the ATAC peak (orange) for (g) KC and (h) T neuron motif instances. The location of the motifs is shown with dashed lines. i. Heatmap showing the Jaccard index between TF binding site predictions from DL and regions from conventional motif discovery. j. Box plots showing higher conservation for overlapping regions compared to DL-only regions. The box plot marks the median, upper and lower quartiles and 1.5× interquartile range (whiskers); outliers are shown individually. Number of enhancers per category shown above plot. k. Box plots showing higher enhancer-gene link scores for overlapping regions compared to DL-only regions. The box plot marks the median, upper and lower quartiles and 1.5× interquartile range (whiskers); outliers are shown individually. Number of enhancers per category shown above plot. l. Bulk ATAC-seq was performed on brains of 44 different genotypes, leading to the identification of caQTLs. caQTLs affecting each of 28k motifs (adjusted p-value of Fisher test versus the difference between the number of motifs increasing and decreasing accessibility, see Methods). Dots in the same colour affect similar motifs (black: not significant). m. The fraction of caQTLs predicted to affect chromatin accessibility at different false positive rates (random SNPs). The 5% false positive rate is shown as a dashed grey line. n–p. Effect of SNPs in Mamo (n), Lola-PF (o) and Lola-N (p) motifs on chromatin accessibility. Top-left: DeepExplainer plot for the reference (G) and alternative (T) alleles, showing a loss of a repressor site for Mamo and Lola-N, and a gain of a repressor site for Lola-PF. Top-right: Candle plots showing the predicted accessibility change caused by the SNP for different cell types (increase shown in blue, decrease in red). Bottom-left: Box plot showing bulk accessibility of 44 DGRP lines, split by genotype at this SNP, highlighting an increase in accessibility for the alternative allele. The box plot marks the median, upper and lower quartiles and 1.5× interquartile range (whiskers); all data points are shown. Number of genotypes associated with either reference (Ref) or alternative (Alt) alleles shown. Bottom-right: Single-cell aggregates over the SNP. q. Overexpression of lola isoform N (lola-N) in glia (repo driver) versus neurons (elav driver, control) leads to the closing of 250 regions with the GATC motif. r. Example of a region in perineurial glia (PNG) and subperineurial glia (SUB) that closes upon overexpression of lola-N in glia. The region is also part of the PNG eGRN (see Fig. 5). s. caQTLs affecting DL motifs. Columns nUP/nDw: number of SNPs overlapping the motif that increase/decrease accessibility. The FDR is estimated from 1,000 random draws of caQTLs with the same number of SNPs (e.g., for ey: draw 6 random caQTLs 1,000 times and count how many of the 1,000 repetitions have at least 5 SNPs increasing accessibility).
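The resampling check described for panel s can be illustrated with a minimal sketch. This is not the authors' implementation (see Methods for the actual procedure). It assumes that each caQTL is summarised by the sign of its accessibility effect (+1 for an increase, -1 for a decrease) and that random sets are drawn without replacement from a pool of all caQTLs. The function name `empirical_fdr`, the sign encoding and the toy pool below are hypothetical.

```python
import random


def empirical_fdr(effect_signs, n_snps, n_up_observed, n_repeats=1000, seed=0):
    """Fraction of random draws that are at least as extreme as the observation.

    effect_signs  : list of +1 / -1 values, one per caQTL in the background pool
                    (assumed encoding: +1 increases, -1 decreases accessibility)
    n_snps        : number of caQTLs overlapping the motif (e.g. 6 for ey)
    n_up_observed : observed number of SNPs increasing accessibility (e.g. 5)
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for _ in range(n_repeats):
        # Draw a random set of caQTLs of the same size as the motif's set.
        draw = rng.sample(effect_signs, n_snps)
        n_up = sum(1 for sign in draw if sign > 0)
        if n_up >= n_up_observed:
            hits += 1
    return hits / n_repeats


# Toy background pool (hypothetical): equal numbers of increasing and
# decreasing caQTLs. Real data would come from the caQTL analysis.
toy_pool = [1] * 500 + [-1] * 500

# ey example from the legend: 6 caQTLs, at least 5 increasing accessibility.
fdr_ey = empirical_fdr(toy_pool, n_snps=6, n_up_observed=5)
print(f"empirical FDR for the toy ey example: {fdr_ey:.3f}")
```

With a balanced toy pool, the value approximates the chance of seeing at least 5 of 6 increasing SNPs at random; with real data the result depends on the actual proportion of increasing and decreasing caQTLs in the background.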