  • Review Article

Predicting gene expression from DNA sequence using deep learning models

Abstract

Transcription of genes is regulated by DNA elements such as promoters and enhancers, the activity of which is in turn controlled by many transcription factors. Owing to the highly complex combinatorial logic involved, it has been difficult to construct computational models that predict gene activity from DNA sequence. Recent advances in deep learning techniques applied to data from epigenome mapping and high-throughput reporter assays have made substantial progress towards addressing this complexity. Such models can capture the regulatory grammar with remarkable accuracy and show great promise in predicting the effects of non-coding variants, uncovering detailed molecular mechanisms of gene regulation and designing synthetic regulatory elements for biotechnology. Here, we discuss the principles of these approaches, the types of training data sets that are available and the strengths and limitations of the different models.

Fig. 1: Sequence-to-expression architectures and training schemes.
Fig. 2: Various data types are used for training sequence-to-expression models.
Fig. 3: Design of massively parallel reporter assays.
Fig. 4: Interpreting deep learning models.
Fig. 5: Main applications of sequence-to-expression models.

References

  1. Libbrecht, M. W. & Noble, W. S. Machine learning applications in genetics and genomics. Nat. Rev. Genet. 16, 321–332 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Fulco, C. P. et al. Activity-by-contact model of enhancer–promoter regulation from thousands of CRISPR perturbations. Nat. Genet. 51, 1664–1669 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Ying, P. et al. Genome-wide enhancer-gene regulatory maps link causal variants to target genes underlying human cancer risk. Nat. Commun. 14, 5958 (2023).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Sokolova, K., Chen, K. M., Hao, Y., Zhou, J. & Troyanskaya, O. G. Deep learning sequence models for transcriptional regulation. Annu. Rev. Genomics Hum. Genet. 25, 105–122 (2024).

    Article  CAS  PubMed  Google Scholar 

  5. Novakovsky, G., Dexter, N., Libbrecht, M. W., Wasserman, W. W. & Mostafavi, S. Obtaining genetics insights from deep learning via explainable artificial intelligence. Nat. Rev. Genet. 24, 125–137 (2023). This Review provides a detailed description of interpretation methods for sequence-to-expression models.

    Article  CAS  PubMed  Google Scholar 

  6. Capitanchik, C., Wilkins, O. G., Wagner, N., Gagneur, J. & Ule, J. From computational models of the splicing code to regulatory mechanisms and therapeutic implications. Nat. Rev. Genet. https://doi.org/10.1038/s41576-024-00774-2 (2024).

  7. La Fleur, A., Shi, Y. & Seelig, G. Decoding biology with massively parallel reporter assays and machine learning. Genes Dev. 38, 843–865 (2024).

    Article  PubMed  PubMed Central  Google Scholar 

  8. van Helden, J., Andre, B. & Collado-Vides, J. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol. 281, 827–842 (1998).

    Article  PubMed  Google Scholar 

  9. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).

    Article  CAS  PubMed  Google Scholar 

  10. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).

    Article  Google Scholar 

  11. Vaswani, A. et al. Attention is all you need. Preprint at https://arxiv.org/abs/1706.03762 (2017).

  12. Eraslan, G., Avsec, Z., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).

    Article  CAS  PubMed  Google Scholar 

  13. Stormo, G. D. Modeling the specificity of protein–DNA interactions. Quant. Biol. 1, 115–130 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  14. Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015). Describing DeepSEA, this pioneering paper uses a convolutional neural network to predict the effects of non-coding variants on epigenomic tracks using only DNA sequence as input.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  15. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).

    Article  CAS  PubMed  Google Scholar 

  16. Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  18. Avsec, Z. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021). Describing Enformer, this study was among the first to effectively use transformers to capture long-distance enhancer–promoter interactions, enabling the prediction of epigenomic and expression tracks across multiple cell types.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. Karollus, A., Mauermeier, T. & Gagneur, J. Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers. Genome Biol. 24, 56 (2023). This article highlights some limitations of current transformer sequence-to-expression models.

    Article  PubMed  PubMed Central  Google Scholar 

  20. He, A. Y., Palamuttam, N. P. & Danko, C. G. Training deep learning models on personalized genomic sequences improves variant effect prediction. Preprint at bioRxiv https://doi.org/10.1101/2024.10.15.618510 (2024).

  21. Linder, J., Srivastava, D., Yuan, H., Agarwal, V. & Kelley, D. R. Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Nat. Genet. https://doi.org/10.1038/s41588-024-02053-6 (2025). This paper describes Borzoi, the current state-of-the-art transformer-based model that predicts RNA-seq, among other tracks, by capturing transcription, splicing and poly-adenylation signals.

  22. Toneyan, S. & Koo, P. K. Interpreting cis-regulatory interactions from large-scale deep neural networks. Nat. Genet. 56, 2517–2527 (2024).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Penzar, D. et al. LegNet: a best-in-class deep learning model for short DNA regulatory regions. Bioinformatics https://doi.org/10.1093/bioinformatics/btad457 (2023).

  24. Rafi, A. M. et al. A community effort to optimize sequence-based deep learning models of gene regulation. Nat. Biotechnol. https://doi.org/10.1038/s41587-024-02414-w (2024).

    Article  PubMed  PubMed Central  Google Scholar 

  25. Cochran, K. et al. Dissecting the cis-regulatory syntax of transcription initiation with deep learning. Preprint at bioRxiv https://doi.org/10.1101/2024.05.28.596138 (2024).

    Article  PubMed  PubMed Central  Google Scholar 

  26. Dudnyk, K., Cai, D., Shi, C., Xu, J. & Zhou, J. Sequence basis of transcription initiation in the human genome. Science 384, eadj0116 (2024).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  27. He, A. Y. & Danko, C. G. Dissection of core promoter syntax through single nucleotide resolution modeling of transcription initiation. Preprint at bioRxiv https://doi.org/10.1101/2024.03.13.583868 (2024).

    Article  PubMed  PubMed Central  Google Scholar 

  28. Naqvi, S. et al. Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage. Cell Genom. 5, 100780 (2025).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  29. Zrimec, J. et al. Deep learning suggests that gene expression is encoded in all parts of a co-evolving interacting gene regulatory structure. Nat. Commun. 11, 6141 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. Agarwal, V. & Shendure, J. Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks. Cell Rep. 31, 107663 (2020).

    Article  CAS  PubMed  Google Scholar 

  31. Lee, B. H. & Rhie, S. K. Molecular and computational approaches to map regulatory elements in 3D chromatin structure. Epigenet. Chromat. 14, 14 (2021).

    Article  CAS  Google Scholar 

  32. Zhang, Y. et al. MLSNet: a deep learning model for predicting transcription factor binding sites. Brief Bioinform. https://doi.org/10.1093/bib/bbae489 (2024).

  33. Avsec, Z. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  34. Zhang, Q. et al. Base-resolution prediction of transcription factor binding signals by a deep learning framework. PLoS Comput. Biol. 18, e1009941 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  35. Brennan, K. J. et al. Chromatin accessibility in the Drosophila embryo is determined by transcription factor pioneering and enhancer activation. Dev. Cell 58, 1898–1916.e9 (2023).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  36. Gasperini, M., Tome, J. M. & Shendure, J. Towards a comprehensive catalogue of validated and target-linked human enhancers. Nat. Rev. Genet. 21, 292–310 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  37. Marr, L. T., Jaya, P., Mishra, L. N. & Hayes, J. J. Whole-genome methods to define DNA and histone accessibility and long-range interactions in chromatin. Biochem. Soc. Trans. 50, 199–212 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  38. Liu, Q., Xia, F., Yin, Q. & Jiang, R. Chromatin accessibility prediction via a hybrid deep convolutional neural network. Bioinformatics 34, 732–738 (2018).

    Article  PubMed  Google Scholar 

  39. Minnoye, L. et al. Cross-species analysis of enhancer logic using deep learning. Genome Res. 30, 1815–1834 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  40. Rada-Iglesias, A. et al. A unique chromatin signature uncovers early developmental enhancers in humans. Nature 470, 279–283 (2011).

    Article  CAS  PubMed  Google Scholar 

  41. Noguchi, S. et al. FANTOM5 CAGE profiles of human and mouse samples. Sci. Data 4, 170112 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  42. Min, X. et al. Predicting enhancers with deep convolutional neural networks. BMC Bioinform. 18, 478 (2017).

    Article  Google Scholar 

  43. Li, Y., Shi, W. & Wasserman, W. W. Genome-wide prediction of cis-regulatory regions using supervised deep learning methods. BMC Bioinform. 19, 202 (2018).

    Article  Google Scholar 

  44. Cappelletti, L. et al. Boosting tissue-specific prediction of active cis-regulatory regions through deep learning and Bayesian optimization techniques. BMC Bioinform. 23, 154 (2022).

    Article  CAS  Google Scholar 

  45. Serebreni, L. & Stark, A. Insights into gene regulation: from regulatory genomic elements to DNA–protein and protein–protein interactions. Curr. Opin. Cell Biol. 70, 58–66 (2021).

    Article  CAS  PubMed  Google Scholar 

  46. Kim, S. & Wysocka, J. Deciphering the multi-scale, quantitative cis-regulatory code. Mol. Cell 83, 373–392 (2023).

    Article  CAS  PubMed  Google Scholar 

  47. Zhou, J. et al. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. 50, 1171–1179 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  48. Chen, K. M., Wong, A. K., Troyanskaya, O. G. & Zhou, J. A sequence-based global map of regulatory activity for deciphering human genetics. Nat. Genet. 54, 940–949 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  49. Li, H. & Guan, Y. Fast decoding cell type-specific transcription factor binding landscape at single-nucleotide resolution. Genome Res. 31, 721–731 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  50. Kaiser, L. et al. One model to learn them all. Preprint at https://arxiv.org/abs/1706.05137 (2017).

  51. Vandenhende, S. et al. Multi-task learning for dense prediction tasks: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 44, 3614–3633 (2022).

    PubMed  Google Scholar 

  52. Kathail, P. et al. Current genomic deep learning models display decreased performance in cell type-specific accessible regions. Genome Biol. 25, 202 (2024).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  53. Sasse, A. et al. Benchmarking of deep neural networks for predicting personal gene expression from DNA sequence highlights shortcomings. Nat. Genet. 55, 2060–2064 (2023).

    Article  CAS  PubMed  Google Scholar 

  54. Lakkapragada, A., Sleiman, E., Surabhi, S. & Wall, D. P. Mitigating negative transfer in multi-task learning with exponential moving average loss weighting strategies. Preprint at https://arxiv.org/abs/2211.12999 (2022).

  55. Schwessinger, R., Deasy, J., Woodruff, R. T., Young, S. & Branson, K. M. Single-cell gene expression prediction from DNA sequence at large contexts. Preprint at bioRxiv https://doi.org/10.1101/2023.07.26.550634 (2023).

    Article  Google Scholar 

  56. Lal, A. et al. Decoding sequence determinants of gene expression in diverse cellular and disease states. Preprint at bioRxiv https://doi.org/10.1101/2024.10.09.617507 (2025).

  57. Novakovsky, G., Saraswat, M., Fornes, O., Mostafavi, S. & Wasserman, W. W. Biologically relevant transfer learning improves transcription factor binding prediction. Genome Biol. 22, 280 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  58. de Almeida, B. P. et al. Targeted design of synthetic enhancers for selected tissues in the Drosophila embryo. Nature 626, 207–211 (2024).

    Article  PubMed  Google Scholar 

  59. Bravo Gonzalez-Blas, C. et al. Single-cell spatial multi-omics and deep learning dissect enhancer-driven gene regulatory networks in liver zonation. Nat. Cell Biol. 26, 153–167 (2024).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  60. Hingerl, J. C. et al. scooby: modeling multi-modal genomic profiles from DNA sequence at single-cell resolution. Preprint at bioRxiv https://doi.org/10.1101/2024.09.19.613754 (2024).

    Article  Google Scholar 

  61. Drusinsky, S., Whalen, S. & Pollard, K. S. Deep-learning prediction of gene expression from personal genomes. Preprint at bioRxiv https://doi.org/10.1101/2024.07.27.605449 (2024).

    Article  Google Scholar 

  62. Kathail, P., Bajwa, A. & Ioannidis, N. M. Leveraging genomic deep learning models for non-coding variant effect prediction. Preprint at https://arxiv.org/abs/2411.11158 (2024).

  63. Murphy, A. E., Beardall, W., Rei, M., Phuycharoen, M. & Skene, N. G. Predicting cell type-specific epigenomic profiles accounting for distal genetic effects. Nat. Commun. 15, 9951 (2024).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  64. Waszak, S. M. et al. Population variation and genetic control of modular chromatin architecture in humans. Cell 162, 1039–1050 (2015).

    Article  CAS  PubMed  Google Scholar 

  65. Agarwal, V. et al. Massively parallel characterization of transcriptional regulatory elements. Nature 639, 411–420 (2025).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  66. Liu, Y. et al. Functional assessment of human enhancer activities using whole-genome STARR-sequencing. Genome Biol. 18, 219 (2017).

    Article  PubMed  PubMed Central  Google Scholar 

  67. Sahu, B. et al. Sequence determinants of human gene regulatory elements. Nat. Genet. 54, 283–294 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  68. Deng, C. et al. Massively parallel characterization of regulatory elements in the developing human cortex. Science 384, eadh0559 (2024).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  69. Trauernicht, M., Martinez-Ara, M. & van Steensel, B. Deciphering gene regulation using massively parallel reporter assays. Trends Biochem. Sci. 45, 90–91 (2020).

    Article  CAS  PubMed  Google Scholar 

  70. Gallego Romero, I. & Lea, A. J. Leveraging massively parallel reporter assays for evolutionary questions. Genome Biol. 24, 26 (2023).

    Article  PubMed  PubMed Central  Google Scholar 

  71. Zheng, Y. & VanDusen, N. J. Massively parallel reporter assays for high-throughput in vivo analysis of cis-regulatory elements. J. Cardiovasc. Dev. Dis. https://doi.org/10.3390/jcdd10040144 (2023).

  72. Movva, R. et al. Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays. PLoS ONE 14, e0218073 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  73. de Almeida, B. P., Reiter, F., Pagani, M. & Stark, A. DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers. Nat. Genet. 54, 613–624 (2022). This study applied a convolutional neural network model to predict enhancer activity in a plasmid-based assay to understand the grammatical rules of enhancers in Drosophila melanogaster cells.

    Article  PubMed  Google Scholar 

  74. Barbadilla-Martínez, L. et al. The regulatory grammar of human promoters uncovered by MPRA-trained deep learning. Preprint at bioRxiv https://doi.org/10.1101/2024.07.09.602649 (2024).

    Article  Google Scholar 

  75. Duttke, S. H. et al. Position-dependent function of human sequence-specific transcription factors. Nature 631, 891–898 (2024).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  76. de Boer, C. G. & Taipale, J. Hold out the genome: a roadmap to solving the cis-regulatory code. Nature 625, 41–50 (2024).

    Article  PubMed  Google Scholar 

  77. de Boer, C. G. et al. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nat. Biotechnol. 38, 56–65 (2020).

    Article  PubMed  Google Scholar 

  78. Vaishnav, E. D. et al. The evolution, evolvability and engineering of gene regulatory DNA. Nature 603, 455–463 (2022). This study was one of the first to design functional synthetic sequences and investigate regulatory evolution in Saccharomyces cerevisiae using sequence-to-expression models.

    Article  CAS  PubMed  Google Scholar 

  79. Akhtar, W. et al. Chromatin position effects assayed by thousands of reporters integrated in parallel. Cell 154, 914–927 (2013).

    Article  CAS  PubMed  Google Scholar 

  80. Klein, J. C. et al. A systematic evaluation of the design and context dependencies of massively parallel reporter assays. Nat. Methods 17, 1083–1091 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  81. Badis, G. et al. Diversity and complexity in DNA recognition by transcription factors. Science 324, 1720–1723 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  82. Alexandari, A. M. et al. De novo distillation of thermodynamic affinity from deep learning regulatory sequence models of in vivo protein–DNA binding. Preprint at bioRxiv https://doi.org/10.1101/2023.05.11.540401 (2023).

  83. Nair, S., Shrikumar, A., Schreiber, J. & Kundaje, A. fastISM: performant in silico saturation mutagenesis for convolutional neural networks. Bioinformatics 38, 2397–2403 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  84. Schreiber, J., Nair, S., Balsubramani, A. & Kundaje, A. Accelerating in silico saturation mutagenesis using compressed sensing. Bioinformatics 38, 3557–3564 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  85. Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: visualising image classification models and saliency maps. Preprint at https://arxiv.org/abs/1312.6034 (2013).

  86. Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. Preprint at https://arxiv.org/abs/1704.02685 (2017). Describing DeepLIFT, this study represents pioneering work on attribution methods for interrogating sequence-to-expression models.

  87. Lundberg, S. & Lee, S.-I. A unified approach to interpreting model predictions. Preprint at https://arxiv.org/abs/1705.07874 (2017).

  88. Sundararajan, M., Taly, A. & Yan, Q. Axiomatic attribution for deep networks. Preprint at https://arxiv.org/abs/1703.01365 (2017).

  89. Shrikumar, A. et al. Technical note on transcription factor motif discovery from importance scores (TF-MoDISco) version 0.5.6.5. Preprint at https://arxiv.org/abs/1811.00416 (2018).

  90. Janssens, J. et al. Decoding gene regulation in the fly brain. Nature 601, 630–636 (2022).

    Article  CAS  PubMed  Google Scholar 

  91. Yuan, H. & Kelley, D. R. scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks. Nat. Methods 19, 1088–1096 (2022).

    Article  CAS  PubMed  Google Scholar 

  92. Ribeiro, M. T., Singh, S. & Guestrin, C. “Why Should I Trust You?”: explaining the predictions of any classifier. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 1135–1144 (Association for Computing Machinery, 2016).

  93. Novakovsky, G., Fornes, O., Saraswat, M., Mostafavi, S. & Wasserman, W. W. ExplaiNN: interpretable and transparent neural networks for genomics. Genome Biol. 24, 154 (2023).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  94. Seitz, E. E., McCandlish, D. M., Kinney, J. B. & Koo, P. K. Interpreting cis-regulatory mechanisms from genomic deep neural networks using surrogate models. Nat. Mach. Intell. 6, 701–713 (2024).

    Article  PubMed  PubMed Central  Google Scholar 

  95. Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. Ser. B Stat. Methodol. 82, 1273–1300 (2020).

    Article  Google Scholar 

  96. Siraj, L. et al. Functional dissection of complex and molecular trait variants at single nucleotide resolution. Preprint at bioRxiv https://doi.org/10.1101/2024.05.05.592437 (2024).

    Article  PubMed  PubMed Central  Google Scholar 

  97. Shigaki, D. et al. Integration of multiple epigenomic marks improves prediction of variant impact in saturation mutagenesis reporter assay. Hum. Mutat. 40, 1280–1291 (2019).

    Article  CAS  PubMed  Google Scholar 

  98. Chen, X. D. et al. Helicase-assisted continuous editing for programmable mutagenesis of endogenous genomes. Science 386, eadn5876 (2024).

    Article  CAS  PubMed  Google Scholar 

  99. Yao, D. et al. Multicenter integrated analysis of noncoding CRISPRi screens. Nat. Methods 21, 723–734 (2024).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  100. Schraivogel, D. et al. Targeted Perturb-seq enables genome-scale genetic screens in single cells. Nat. Methods 17, 629–635 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  101. Eder, M., Moene, C. J. I., Dauban, L., Leemans, C. & van Steensel, B. Functional maps of a genomic locus reveal confinement of an enhancer by its target gene. Preprint at bioRxiv https://doi.org/10.1101/2024.08.26.609360 (2024).

  102. Horton, C. A. et al. Short tandem repeats bind transcription factors to tune eukaryotic gene expression. Science 381, eadd1250 (2023).

    Article  CAS  PubMed  Google Scholar 

  103. Reiter, F., de Almeida, B. P. & Stark, A. Enhancers display constrained sequence flexibility and context-specific modulation of motif function. Genome Res. 33, 346–358 (2023).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  104. Dey, K. K. et al. Evaluating the informativeness of deep learning annotations for human complex diseases. Nat. Commun. 11, 4703 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  105. Trevino, A. E. et al. Chromatin and gene-regulatory dynamics of the developing human cerebral cortex at single-cell resolution. Cell 184, 5053–5069.e23 (2021).

    Article  CAS  PubMed  Google Scholar 

  106. Wang, S. K. et al. Single-cell multiome of the human retina and deep learning nominate causal variants in complex eye diseases. Cell Genom. https://doi.org/10.1016/j.xgen.2022.100164 (2022).

  107. Huang, C. et al. Personal transcriptome variation is poorly explained by current genomic deep learning models. Nat. Genet. 55, 2056–2059 (2023).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  108. Chen, L., Fish, A. E. & Capra, J. A. Prediction of gene regulatory enhancers across species reveals evolutionarily conserved sequence properties. PLoS Comput. Biol. 14, e1006484 (2018).

    Article  PubMed  PubMed Central  Google Scholar 

  109. Kelley, D. R. Cross-species regulatory sequence activity prediction. PLoS Comput. Biol. 16, e1008050 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  110. Kaplow, I. M. et al. Inferring mammalian tissue-specific regulatory conservation by predicting tissue-specific differences in open chromatin. BMC Genom. 23, 291 (2022).

    Article  CAS  Google Scholar 

  111. Kaplow, I. M. et al. Relating enhancer genetic variation across mammals to complex phenotypes using machine learning. Science 380, eabm7993 (2023).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  112. Hecker, N. et al. Enhancer-driven cell type comparison reveals similarities between the mammalian and bird pallium. Science 387, eadp3957 (2025).

    Article  CAS  PubMed  Google Scholar 

  113. Taskiran, I. I. et al. Cell-type-directed design of synthetic enhancers. Nature 626, 212–220 (2024). In this paper, several approaches are used to construct cell-type-specific enhancers using sequence-to-expression models.

    Article  CAS  PubMed  Google Scholar 

  114. Gosai, S. J. et al. Machine-guided design of cell-type-targeting cis-regulatory elements. Nature 634, 1211–1220 (2024).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  115. Lal, A., Garfield, D., Biancalani, T. & Eraslan, G. Designing realistic regulatory DNA with autoregressive language models. Genome Res. 34, 1411–1420 (2024).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  116. Sarkar, A., Tang, Z., Zhao, C. & Koo, P. K. Designing DNA with tunable regulatory activity using discrete diffusion. Preprint at bioRxiv https://doi.org/10.1101/2024.05.23.595630 (2024).

    Article  PubMed  PubMed Central  Google Scholar 

  117. Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).

    Article  CAS  PubMed  Google Scholar 

  118. Ji, Y., Zhou, Z., Liu, H. & Davuluri, R. V. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics 37, 2112–2120 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  119. Dalla-Torre, H. et al. Nucleotide transformer: building and evaluating robust foundation models for human genomics. Nat. Methods 22, 287–297 (2025).

    Article  CAS  PubMed  Google Scholar 

  120. Benegas, G., Batra, S. S. & Song, Y. S. DNA language models are powerful predictors of genome-wide variant effects. Proc. Natl Acad. Sci. USA 120, e2311219120 (2023).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  121. Nguyen, E. et al. HyenaDNA: long-range genomic sequence modeling at single nucleotide resolution. (2023).

  122. Tang, Z., Somia, N., Yu, Y. & Koo, P. K. Evaluating the representational power of pre-trained DNA language models for regulatory genomics. Preprint at bioRxiv https://doi.org/10.1101/2024.02.29.582810 (2024).

    Article  PubMed  PubMed Central  Google Scholar 

  123. Isa Marin, F. et al. BEND: benchmarking DNA language models on biologically meaningful tasks. Preprint at https://arxiv.org/abs/2311.12570 (2023).

  124. Friedman, R. Z. et al. Active learning of enhancer and silencer regulatory grammar in photoreceptors. Preprint at bioRxiv https://doi.org/10.1101/2023.08.21.554146 (2023).

    Article  PubMed  PubMed Central  Google Scholar 

  125. Duncan, A. G., Mitchell, J. A. & Moses, A. M. Improving the performance of supervised deep learning for regulatory genomics using phylogenetic augmentation. Bioinformatics https://doi.org/10.1093/bioinformatics/btae190 (2024).

  126. Lee, N. K., Tang, Z., Toneyan, S. & Koo, P. K. EvoAug: improving generalization and interpretability of genomic deep neural networks with evolution-inspired data augmentations. Genome Biol. 24, 105 (2023).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  127. Rastogi, R., Reddy, A. J., Chung, R. & Ioannidis, N. M. Fine-tuning sequence-to-expression models on personal genome and transcriptome data. Preprint at bioRxiv https://doi.org/10.1101/2024.09.23.614632 (2024).

    Article  PubMed  PubMed Central  Google Scholar 

  128. Fu, X. et al. A foundation model of transcription across human cell types. Nature 637, 965–973 (2025).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  129. Johnson, D. S., Mortazavi, A., Myers, R. M. & Wold, B. Genome-wide mapping of in vivo protein–DNA interactions. Science 316, 1497–1502 (2007).

    Article  CAS  PubMed  Google Scholar 

  130. Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods 10, 1213–1218 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  131. Barski, A. et al. High-resolution profiling of histone methylations in the human genome. Cell 129, 823–837 (2007).

    Article  CAS  PubMed  Google Scholar 

  132. Shiraki, T. et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc. Natl Acad. Sci. USA 100, 15776–15781 (2003).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  133. Kruesi, W. S., Core, L. J., Waters, C. T., Lis, J. T. & Meyer, B. J. Condensin controls recruitment of RNA polymerase II to achieve nematode X-chromosome dosage compensation. eLife 2, e00808 (2013).

    Article  PubMed  PubMed Central  Google Scholar 

  134. Tsuchihara, K. et al. Massive transcriptional start site analysis of human genes in hypoxia cells. Nucleic Acids Res. 37, 2249–2263 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  135. Policastro, R. A. & Zentner, G. E. Global approaches for profiling transcription initiation. Cell Rep. Methods https://doi.org/10.1016/j.crmeth.2021.100081 (2021).

  136. van Arensbergen, J. et al. Genome-wide mapping of autonomous promoter activity in human cells. Nat. Biotechnol. 35, 145–153 (2017).

    Article  PubMed  Google Scholar 

  137. Arnold, C. D. et al. Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science 339, 1074–1077 (2013).

    Article  CAS  PubMed  Google Scholar 

  138. Bajwa, A., Rastogi, R., Kathail, P., Shuai, R. W. & Ioannidis, N. M. Characterizing uncertainty in predictions of genomic sequence-to-activity models. Preprint at bioRxiv https://doi.org/10.1101/2023.12.21.572730 (2023).

Acknowledgements

The authors acknowledge V. Franceschini-Santos for extensive discussion and help with the generation of figures. Research at the Netherlands Cancer Institute is supported by an institutional grant of the Dutch Cancer Society and of the Dutch Ministry of Health, Welfare and Sport. The Oncode Institute is partially funded by the Dutch Cancer Society.

Author information

Contributions

All authors contributed substantially to discussion of the content, wrote the article and reviewed and/or edited the manuscript before submission. L.B.-M. and N.K. designed and generated figures.

Corresponding authors

Correspondence to Bas van Steensel or Jeroen de Ridder.

Ethics declarations

Competing interests

The authors have applied for a patent related to the S2E model PARM.

Peer review

Peer review information

Nature Reviews Genetics thanks Julia Zeitlinger and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Glossary

Attribution map

Interpretation technique that visualizes the importance of each nucleotide in a sequence to the prediction of the model.
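
As an illustration of how such a map can be obtained, the sketch below uses in silico mutagenesis, one of several possible attribution techniques: each nucleotide is substituted in turn and the resulting change in the prediction is recorded. The function `toy_model` is a hypothetical stand-in for a trained network (it simply counts occurrences of "TATA"), and `ism_attribution` is an illustrative name, not part of any published tool.

```python
# Hypothetical stand-in for a trained model (assumption for illustration):
# the "prediction" is the number of TATA occurrences in the sequence.
def toy_model(seq):
    return float(sum(seq[i:i + 4] == "TATA" for i in range(len(seq) - 3)))


def ism_attribution(seq, model=toy_model):
    """Return one importance score per nucleotide.

    The score is the mean drop in the model prediction over the three
    possible single-nucleotide substitutions at that position.
    """
    reference = model(seq)
    scores = []
    for i, base in enumerate(seq):
        drops = []
        for alt in "ACGT":
            if alt == base:
                continue  # skip the reference nucleotide
            mutated = seq[:i] + alt + seq[i + 1:]
            drops.append(reference - model(mutated))
        scores.append(sum(drops) / len(drops))
    return scores


scores = ism_attribution("GGTATAGG")
```

With this toy model, only the four positions inside the TATA motif receive a non-zero score, which is the pattern an attribution map is meant to reveal.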

Convolutional neural networks

(CNNs). A neural network architecture using convolutional layers that extract local patterns at different spatial hierarchies.

Deep learning

(DL). A class of machine-learning approaches capable of identifying highly complex patterns in large data sets. Unlike classical machine learning, DL methods can automatically learn the best representation through stacking artificial neural networks in multiple layers, thereby minimizing the need for manual feature engineering.

Dot product

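A basic linear algebra operation that multiplies the corresponding entries of two vectors and sums the products; it is the elementary step of matrix multiplication.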
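
A minimal sketch of the operation, written in plain Python for clarity rather than speed. The helper names `dot` and `matmul` are illustrative; the second function shows how matrix multiplication can be expressed as repeated dot products of rows and columns.

```python
def dot(u, v):
    """Dot product: multiply matching entries of two vectors and sum them."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def matmul(a, b):
    """Matrix product built from dot products of rows of a and columns of b."""
    columns = list(zip(*b))  # transpose b to iterate over its columns
    return [[dot(row, col) for col in columns] for row in a]


example_dot = dot([1, 2, 3], [4, 5, 6])
example_product = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```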

Expression quantitative trait loci

(eQTL). Genomic loci at which genetic variation is associated with differences in the expression levels of mRNA or proteins.

Fully connected layer

A neural network layer in which every input contributes to the computation of every output. These are typically the last layers in a model.

Hyperparameters

Configuration settings that control how the model learns, such as model size and learning rate, which are set before training and may be optimized during validation.

Kernel

Also referred to as a filter, a kernel is a small matrix that can detect specific patterns in the input sequence through the process of convolution. Which features each kernel recognizes is determined by the parameters in the matrix, which are learned during training.
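
The scanning step can be sketched as follows, assuming a one-hot encoding of the DNA sequence with columns ordered A, C, G, T. The kernel here is set by hand to respond to "TATA" purely for illustration; in a trained model its values would be learned. The names `one_hot`, `conv1d` and `tata_kernel` are hypothetical.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]


def conv1d(seq, kernel):
    """Slide a (width x 4) kernel along the sequence, no padding, stride 1.

    Each output value is the sum of elementwise products between the
    kernel and the one-hot window it currently covers.
    """
    x = one_hot(seq)
    width = len(kernel)
    activations = []
    for start in range(len(x) - width + 1):
        act = sum(
            kernel[j][c] * x[start + j][c]
            for j in range(width)
            for c in range(len(BASES))
        )
        activations.append(act)
    return activations


# Hand-set kernel (assumption): a perfect "TATA" detector.
tata_kernel = one_hot("TATA")
activations = conv1d("GGTATAGG", tata_kernel)
```

The activation peaks at the window that contains the motif, and partially matching windows give intermediate values.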

k-mer

A short substring of length ‘k’ found within a larger biological sequence (such as DNA or RNA).
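
For illustration, overlapping k-mers can be counted with a sliding window. This is a minimal sketch; the function name `count_kmers` is an assumption and does not refer to a specific tool.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping substring of length k in the sequence."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


counts = count_kmers("ATATAT", 2)
```

A sequence of length n contains n - k + 1 overlapping k-mers, so the six-nucleotide example yields five 2-mers.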

Machine learning

Describes a wide range of algorithms that can learn from data to make predictions on new and unseen data. Examples include random forest, support vector machines and gradient boosting. Machine-learning methods often require manual feature engineering.

Parameters

Also referred to as weights, parameters are values in the model that are learned during training to optimize performance.

Position weight matrices

Representation of motifs in a DNA sequence as matrices that indicate the frequency or importance of each nucleotide at each position.
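
One common variant, sketched here under stated assumptions, stores log-odds scores: the log2 ratio of the observed nucleotide frequency at each position to a background frequency. The aligned binding sites, the pseudocount of 1 and the uniform background of 0.25 are all illustrative choices, and `build_pwm`, `score` and `best_match` are hypothetical names.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds matrix from equal-length aligned sites."""
    length = len(sites[0])
    total = len(sites) + len(BASES) * pseudocount
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def score(pwm, seq):
    """Sum the per-position log-odds for a sequence as long as the matrix."""
    return sum(col[b] for col, b in zip(pwm, seq))


def best_match(pwm, seq):
    """Return the start index of the highest-scoring window in seq."""
    width = len(pwm)
    return max(
        range(len(seq) - width + 1),
        key=lambda i: score(pwm, seq[i:i + width]),
    )


# Made-up aligned sites, for illustration only.
sites = ["TATAA", "TATAT", "TATAA", "TTTAA"]
pwm = build_pwm(sites)
hit = best_match(pwm, "GGCTATAAGC")
```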

Receptive fields

Region of DNA sequence that influences the prediction of the model at a given position.
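
For a stack of convolutional layers, the size of the receptive field can be computed directly. The sketch below assumes stride-1 convolutions without pooling, where each layer adds (kernel size - 1) × dilation positions; the function name `receptive_field` and the example layer settings are illustrative.

```python
def receptive_field(layers):
    """Receptive field (in nucleotides) of stacked stride-1 convolutions.

    layers: list of (kernel_size, dilation) tuples, ordered from input
    to output. Pooling and strides greater than 1 are not handled.
    """
    size = 1  # a single output position sees at least one input position
    for kernel_size, dilation in layers:
        size += (kernel_size - 1) * dilation
    return size


# Three layers of width-3 kernels with dilation doubling at each layer.
rf = receptive_field([(3, 1), (3, 2), (3, 4)])
```

Increasing the dilation at each layer widens the receptive field much faster than stacking undilated layers of the same width.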

Self-attention

A mechanism used in transformer deep learning models that allows the model to weigh the importance of different parts of an input sequence when processing it, effectively allowing the model to ‘focus’ on the most relevant part of the input DNA sequence.
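
A minimal sketch of the weighting step, using scaled dot-product attention. To keep the example short it assumes identity projections, meaning that queries, keys and values all equal the input vectors; real transformer layers learn separate projection matrices. The names `softmax` and `self_attention` are illustrative.

```python
import math


def softmax(values):
    """Turn scores into positive weights that sum to 1."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (assumption).

    x: list of equal-length position vectors.
    Returns (outputs, weights), where weights[i][j] is how strongly
    position i attends to position j.
    """
    dim = len(x[0])
    outputs, weights = [], []
    for query in x:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in x
        ]
        w = softmax(scores)
        weights.append(w)
        outputs.append([
            sum(w[j] * x[j][c] for j in range(len(x))) for c in range(dim)
        ])
    return outputs, weights


# Positions 0 and 2 carry the same vector, so they attend to each other.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens)
```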

Transfer learning

A technique whereby a model trained on one task is used as a starting point for a model on a different but related task, leveraging previously learned knowledge to improve performance.

Transformers

A neural network architecture based on the concept of self-attention that weights the importance of different parts of the input, allowing long-range dependencies to be captured.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Barbadilla-Martínez, L., Klaassen, N., van Steensel, B. et al. Predicting gene expression from DNA sequence using deep learning models. Nat Rev Genet 26, 666–680 (2025). https://doi.org/10.1038/s41576-025-00841-2

  • DOI: https://doi.org/10.1038/s41576-025-00841-2
