Abstract
Transcription of genes is regulated by DNA elements such as promoters and enhancers, the activity of which is in turn controlled by many transcription factors. Owing to the highly complex combinatorial logic involved, it has been difficult to construct computational models that predict gene activity from DNA sequence. Recently, deep learning techniques applied to data from epigenome mapping and high-throughput reporter assays have made substantial progress towards addressing this complexity. Such models can capture the regulatory grammar with remarkable accuracy and show great promise in predicting the effects of non-coding variants, uncovering detailed molecular mechanisms of gene regulation and designing synthetic regulatory elements for biotechnology. Here, we discuss the principles of these approaches, the types of training data sets that are available, and the strengths and limitations of the different methods.
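To illustrate the core logic that these sequence-to-expression models share — encode a DNA sequence numerically, score it with a trained model, and estimate a variant's effect as the difference between predictions for the alternative and reference alleles — a minimal sketch is shown below. The function `predict_activity` is a hypothetical stand-in for any trained deep learning model (it is not code from the studies discussed here), and the example sequence and variant are invented.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA sequence as a (length x 4) one-hot matrix."""
    mat = np.zeros((len(seq), 4))
    for i, base in enumerate(seq.upper()):
        if base in BASES:                     # ambiguous bases (e.g. N) stay all-zero
            mat[i, BASES.index(base)] = 1.0
    return mat

def predict_activity(onehot: np.ndarray) -> float:
    """Hypothetical stand-in for a trained sequence-to-expression model.
    Here: a fixed random linear scorer, used only to make the sketch runnable."""
    rng = np.random.default_rng(0)            # fixed seed, so the 'weights' are constant
    weights = rng.normal(size=onehot.shape)
    return float((onehot * weights).sum())

def variant_effect(ref_seq: str, pos: int, alt: str) -> float:
    """Predicted effect of a single-nucleotide variant:
    prediction for the alternative allele minus prediction for the reference."""
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    return predict_activity(one_hot(alt_seq)) - predict_activity(one_hot(ref_seq))

ref = "ACGTGGGCGCGCCTATAAAAGGCAGT"            # toy promoter-like sequence
print(variant_effect(ref, pos=13, alt="C"))   # effect of a T>C change in the TATA box
```

In practice, the stand-in scorer would be replaced by a deep neural network trained on epigenomic or reporter-assay data, but the reference-versus-alternative comparison works the same way.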
Acknowledgements
The authors acknowledge V. Franceschini-Santos for extensive discussion and help with the generation of figures. Research at the Netherlands Cancer Institute is supported by an institutional grant of the Dutch Cancer Society and of the Dutch Ministry of Health, Welfare and Sport. The Oncode Institute is partially funded by the Dutch Cancer Society.
Author information
Contributions
All authors contributed substantially to discussion of the content, wrote the article and reviewed and/or edited the manuscript before submission. L.B.-M. and N.K. designed and generated figures.
Ethics declarations
Competing interests
The authors have applied for a patent related to the sequence-to-expression (S2E) model PARM.
Peer review
Peer review information
Nature Reviews Genetics thanks Julia Zeitlinger and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Glossary
- Attribution map. Interpretation technique that visualizes the contribution of each nucleotide in a sequence to the model's prediction (a worked sketch follows this glossary).
- Convolutional neural networks (CNNs). A neural network architecture that uses convolutional layers to extract local patterns at different spatial hierarchies.
- Deep learning (DL). A class of machine-learning approaches capable of identifying highly complex patterns in large data sets. Unlike classical machine learning, DL methods can automatically learn suitable representations by stacking artificial neural networks in multiple layers, thereby minimizing the need for manual feature engineering.
- Dot product. A basic linear algebra operation that multiplies two vectors element by element and sums the results; it is the building block of matrix multiplication.
- Expression quantitative trait loci (eQTL). Genomic loci at which genetic variants are associated with expression levels of mRNAs or proteins.
- Fully connected layer. A neural network layer in which every input contributes to the computation of every output. These are typically the last layers in a model.
- Hyperparameters. Configuration settings that control how the model learns, such as model size and learning rate; they are set before training and may be optimized during validation.
- Kernel. Also referred to as a filter, a kernel is a small matrix that detects specific patterns in the input sequence through the process of convolution. Which features each kernel recognizes is determined by the values in the matrix, which are learned during training.
- k-mer. A short substring of length 'k' within a larger biological sequence (such as DNA or RNA).
- Machine learning. A wide range of algorithms that learn from data to make predictions on new, unseen data. Examples include random forests, support vector machines and gradient boosting. Machine-learning methods often require manual feature engineering.
- Parameters. Also referred to as weights, parameters are the values in a model that are learned during training to optimize performance.
- Position weight matrices. Representations of motifs in DNA sequences as matrices that indicate the frequency or importance of each nucleotide at each position.
- Receptive fields. The region of the DNA sequence that influences the model's prediction at a given position.
- Self-attention. A mechanism used in transformer deep learning models that weighs the importance of different parts of an input sequence as it is processed, effectively allowing the model to 'focus' on the most relevant parts of the input DNA sequence.
- Transfer learning. A technique whereby a model trained on one task is used as the starting point for a model on a different but related task, leveraging previously learned knowledge to improve performance.
- Transformers. A neural network architecture based on self-attention, which weights the importance of different parts of the input and allows long-range dependencies to be captured.
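To make some of these terms concrete — one-hot encoding, a position weight matrix acting as a convolutional kernel, the dot product underlying the scan, and an attribution map computed by in silico saturation mutagenesis — here is a small, self-contained sketch in NumPy. The matrix values and the toy scoring model are invented for illustration and do not correspond to any real transcription factor or published model.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """(length x 4) one-hot encoding of a DNA sequence (columns A, C, G, T)."""
    mat = np.zeros((len(seq), 4))
    for i, b in enumerate(seq.upper()):
        mat[i, BASES.index(b)] = 1.0
    return mat

# A toy position weight matrix (4 positions x 4 bases) favouring the motif 'TGAC'.
# In a CNN, these numbers would be the learned parameters of one kernel.
pwm = np.array([
    #   A     C     G     T
    [-1.0, -1.0, -1.0,  2.0],   # position 1 favours T
    [-1.0, -1.0,  2.0, -1.0],   # position 2 favours G
    [ 2.0, -1.0, -1.0, -1.0],   # position 3 favours A
    [-1.0,  2.0, -1.0, -1.0],   # position 4 favours C
])

def scan(onehot: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide the kernel along the sequence (a 1D convolution): at each offset,
    take the dot product between the kernel and the matching one-hot window."""
    k = kernel.shape[0]
    return np.array([
        float((onehot[i:i + k] * kernel).sum())
        for i in range(onehot.shape[0] - k + 1)
    ])

def score(seq: str) -> float:
    """Toy 'model': the maximum kernel activation anywhere in the sequence."""
    return float(scan(one_hot(seq), pwm).max())

def ism_attribution(seq: str) -> np.ndarray:
    """Attribution map by in silico saturation mutagenesis:
    the change in score for every possible single-base substitution."""
    ref = score(seq)
    attr = np.zeros((len(seq), 4))
    for i in range(len(seq)):
        for j, b in enumerate(BASES):
            mutant = seq[:i] + b + seq[i + 1:]
            attr[i, j] = score(mutant) - ref
    return attr

seq = "AATTGACGTA"
print(scan(one_hot(seq), pwm))   # per-position kernel activations; peak at the TGAC match
print(ism_attribution(seq))      # positions within the TGAC match dominate the map
```

In a real model, the kernel weights are learned from data rather than written by hand, and attribution maps for long sequences are typically computed with gradient-based methods (such as DeepLIFT or integrated gradients) rather than exhaustive mutagenesis.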
About this article
Cite this article
Barbadilla-Martínez, L., Klaassen, N., van Steensel, B. et al. Predicting gene expression from DNA sequence using deep learning models. Nat Rev Genet 26, 666–680 (2025). https://doi.org/10.1038/s41576-025-00841-2