Predicting gene expression from DNA sequence using deep learning models

Barbadilla-Martínez, Lucía; Klaassen, Noud; van Steensel, Bas; de Ridder, Jeroen

doi:10.1038/s41576-025-00841-2

Review Article
Published: 13 May 2025

Nature Reviews Genetics volume 26, pages 666–680 (2025)


Abstract

Transcription of genes is regulated by DNA elements such as promoters and enhancers, the activity of which is in turn controlled by many transcription factors. Owing to the highly complex combinatorial logic involved, it has been difficult to construct computational models that predict gene activity from DNA sequence. Recent advances in deep learning techniques applied to data from epigenome mapping and high-throughput reporter assays have made substantial progress towards addressing this complexity. Such models can capture the regulatory grammar with remarkable accuracy and show great promise in predicting the effects of non-coding variants, uncovering detailed molecular mechanisms of gene regulation and designing synthetic regulatory elements for biotechnology. Here, we discuss the principles of these approaches, the types of training data sets that are available and the strengths and limitations of different approaches.


**Fig. 1: Sequence-to-expression architectures and training schemes.**

**Fig. 2: Various data types are used for training sequence-to-expression models.**

**Fig. 3: Design of massively parallel reporter assays.**

**Fig. 4: Interpreting deep learning models.**

**Fig. 5: Main applications of sequence-to-expression models.**



Acknowledgements

The authors acknowledge V. Franceschini-Santos for extensive discussion and help with the generation of figures. Research at the Netherlands Cancer Institute is supported by an institutional grant of the Dutch Cancer Society and of the Dutch Ministry of Health, Welfare and Sport. The Oncode Institute is partially funded by the Dutch Cancer Society.

Author information

Authors and Affiliations

Oncode Institute, Utrecht, The Netherlands
Lucía Barbadilla-Martínez, Noud Klaassen, Bas van Steensel & Jeroen de Ridder
Center for Molecular Medicine, UMC Utrecht, Utrecht, The Netherlands
Lucía Barbadilla-Martínez & Jeroen de Ridder
Division of Molecular Genetics, Netherlands Cancer Institute, Amsterdam, The Netherlands
Noud Klaassen & Bas van Steensel

Authors

Lucía Barbadilla-Martínez
Noud Klaassen
Bas van Steensel
Jeroen de Ridder

Contributions

All authors contributed substantially to discussion of the content, wrote the article and reviewed and/or edited the manuscript before submission. L.B.-M. and N.K. designed and generated figures.

Corresponding authors

Correspondence to Bas van Steensel or Jeroen de Ridder.

Ethics declarations

Competing interests

The authors have applied for a patent related to the S2E model PARM.

Peer review

Peer review information

Nature Reviews Genetics thanks Julia Zeitlinger and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Glossary

Attribution map: Interpretation technique that visualizes the importance of each nucleotide in a sequence to the prediction of the model.
Convolutional neural networks (CNNs): A neural network architecture using convolutional layers that extract local patterns at different spatial hierarchies.
Deep learning (DL): A class of machine-learning approaches capable of identifying highly complex patterns in large data sets. Unlike classical machine learning, DL methods can automatically learn the best representation through stacking artificial neural networks in multiple layers, thereby minimizing the need for manual feature engineering.
Dot product: A basic linear algebra operation that multiplies two equal-length vectors element by element and sums the results; matrix multiplication consists of many such dot products.
Expression quantitative trait loci (eQTL): Genomic loci at which genetic variation is associated with the expression levels of mRNA or proteins.
Fully connected layer: A neural network layer in which every input contributes to the computation of every output. These are typically the last layers in a model.
Hyperparameters: Configuration settings that control how the model learns, such as model size and learning rate, which are set before training and may be optimized during validation.
Kernel: Also referred to as a filter, a kernel is a small matrix that can detect specific patterns in the input sequence through the process of convolution. Which features each kernel recognizes is determined by the parameters in the matrix, which are learned during training.
k-mer: A short substring of length ‘k’ found within a larger biological sequence (such as DNA or RNA).
Machine learning: Describes a wide range of algorithms that can learn from data to make predictions on new and unseen data. Examples include random forest, support vector machines and gradient boosting. Machine-learning methods often require manual feature engineering.
Parameters: Also referred to as weights, parameters are values in the model that are learned during training to optimize performance.
Position weight matrices: Representation of motifs in a DNA sequence as matrices that indicate the frequency or importance of each nucleotide at each position.
Receptive fields: Region of DNA sequence that influences the prediction of the model at a given position.
Self-attention: A mechanism used in transformer deep learning models that allows the model to weigh the importance of different parts of an input sequence when processing it, effectively allowing the model to ‘focus’ on the most relevant part of the input DNA sequence.
Transfer learning: A technique whereby a model trained on one task is used as a starting point for a model on a different but related task, leveraging previously learned knowledge to improve performance.
Transformers: A neural network architecture based on the concept of self-attention that weights the importance of different parts of the input, allowing long-range dependencies to be captured.
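
Several of the glossary terms above (k-mer, dot product, kernel and position weight matrix) can be tied together in a small worked example. The following is a minimal sketch in plain Python, not code from any model discussed in this article: the function names, the toy weights and the example sequence are all hypothetical, and a real convolutional layer would additionally learn its weights during training and apply a bias and a non-linearity.

```python
# Illustrative sketch only: all names and weights below are hypothetical.
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as one row of four values (A, C, G, T) per base."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def kmers(seq, k):
    """Return every overlapping substring of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def dot(u, v):
    """Dot product: sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def scan(seq, kernel):
    """Slide a kernel (one row of four weights per position) along a sequence.

    Each score is the dot product of the flattened kernel with the flattened
    one-hot window, i.e. the core operation of a one-dimensional convolution.
    """
    width = len(kernel)
    flat_kernel = [w for row in kernel for w in row]
    encoded = one_hot(seq)
    scores = []
    for i in range(len(seq) - width + 1):
        window = [x for row in encoded[i:i + width] for x in row]
        scores.append(dot(flat_kernel, window))
    return scores


# Toy weight matrix that rewards the 4-mer "TATA" (made-up weights, used
# here in the role of a position weight matrix or a learned kernel).
TATA_KERNEL = [
    [-1.0, -1.0, -1.0, 1.0],  # position 1 prefers T
    [1.0, -1.0, -1.0, -1.0],  # position 2 prefers A
    [-1.0, -1.0, -1.0, 1.0],  # position 3 prefers T
    [1.0, -1.0, -1.0, -1.0],  # position 4 prefers A
]

sequence = "GCTATAGC"
scores = scan(sequence, TATA_KERNEL)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores)                    # one score per 4-mer window
print(kmers(sequence, 4)[best])  # the window with the highest score
```

In this toy case the highest score falls on the window that matches the motif exactly. The sketch uses fixed weights for readability, whereas in the models reviewed here the kernel weights are parameters learned from data.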

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Barbadilla-Martínez, L., Klaassen, N., van Steensel, B. et al. Predicting gene expression from DNA sequence using deep learning models. Nat Rev Genet 26, 666–680 (2025). https://doi.org/10.1038/s41576-025-00841-2


Accepted: 01 April 2025
Published: 13 May 2025
Issue date: October 2025
DOI: https://doi.org/10.1038/s41576-025-00841-2

This article is cited by

The distinct roles of genome, methylation, transcription, and translation on protein expression in Arabidopsis thaliana resolve the Central Dogma’s information flow
- Ziming Zhong
- Mark Bailey
- Richard Mott
Genome Biology (2025)
Making sense of the regulatory genome

Nature Reviews Genetics (2025)
Harnessing functional annotation to improve the accuracy and transferability of polygenic scores
- Jian Zeng
- Peter M. Visscher
Nature Reviews Genetics (2025)
