
Review Article

Transformers and genome language models

Abstract

Large language models based on the transformer deep learning architecture have revolutionized natural language processing. Motivated by the analogy between human language and the genome’s biological code, researchers have begun to develop genome language models (gLMs) based on transformers and related architectures. This Review explores the use of transformers and language models in genomics. We survey open questions in genomics that are amenable to gLMs and motivate the use of gLMs and the transformer architecture for these problems. We discuss the potential of gLMs for modelling the genome through unsupervised pretraining tasks, focusing in particular on the power of zero- and few-shot learning. We examine the strengths and limitations of the transformer architecture, as well as those of current gLMs more broadly. Finally, we consider the future of genomic modelling beyond the transformer architecture, based on current trends in research. This Review serves as a guide for computational biologists and computer scientists interested in transformers and language models for genomic data.
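As a concrete illustration of the kind of unsupervised pretraining task mentioned above, the following is a minimal sketch of a BERT-style masked-nucleotide objective for a toy transformer encoder. The vocabulary, model dimensions and random ‘DNA’ batch are illustrative assumptions for this sketch only and do not correspond to any specific gLM covered in this Review.

```python
# Minimal, hypothetical sketch of masked-token pretraining on DNA.
# All sizes and the toy sequence batch are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "[MASK]": 4}

class TinyGenomeLM(nn.Module):
    def __init__(self, vocab_size=5, d_model=64, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)  # predict each token's identity

    def forward(self, tokens):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        h = self.embed(tokens) + self.pos(positions)
        return self.head(self.encoder(h))

# Toy example: mask ~15% of nucleotides, train the model to reconstruct them.
seq = torch.randint(0, 4, (8, 128))                    # batch of random "DNA"
mask = torch.rand(seq.shape) < 0.15
inputs = seq.masked_fill(mask, VOCAB["[MASK]"])
model = TinyGenomeLM()
logits = model(inputs)
# Loss is computed only on masked positions, so no labels are required.
loss = nn.functional.cross_entropy(logits[mask], seq[mask])
loss.backward()
```

Because the objective needs no annotations, pretrained representations of this kind can later be probed in zero- or few-shot settings, which is the usage pattern the Review focuses on.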


Fig. 1: A big-picture look at the diverse applications of gLMs.
Fig. 2: A comparison of how different genomic deep learning models operate on DNA sequence data.
Fig. 3: The total amount of compute, in petaflop/s-days (PFS-days), used to train the various models discussed in the Review (all models for which parameter count, training time and GPU usage were available).
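For readers unfamiliar with the unit in Fig. 3: one petaflop/s-day is the compute delivered by sustaining 10^15 floating-point operations per second for one day. The short sketch below shows how such an estimate can be assembled from GPU count, per-GPU throughput, utilization and training time; the hardware numbers are illustrative assumptions, not values taken from any model in the Review.

```python
# Hypothetical back-of-the-envelope estimate of training compute in
# petaflop/s-days (PFS-days); the hardware figures are illustrative assumptions.
PFS_DAY_IN_FLOPS = 1e15 * 86_400   # FLOPs in one petaflop/s-day

def pfs_days(num_gpus, peak_tflops_per_gpu, utilization, training_days):
    """Total training compute, assuming a sustained fraction of peak throughput."""
    sustained_flops_per_sec = num_gpus * peak_tflops_per_gpu * 1e12 * utilization
    total_flops = sustained_flops_per_sec * training_days * 86_400
    return total_flops / PFS_DAY_IN_FLOPS

# e.g. 64 GPUs at a nominal 100 TFLOP/s each, 30% utilization, for 10 days:
print(f"{pfs_days(64, 100, 0.30, 10):.1f} PFS-days")  # ~19.2 PFS-days
```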



Acknowledgements

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC).

Author information


Contributions

M.E.C. selected the papers to review, summarized contributions from all papers, performed analysis, and designed all figures. A.M., B.W., M.W. and D.F. helped with figure design. C.D. contributed to paper selection and summarizing contributions; A.M., M.W., M.K., F.J.T. and H.G. contributed to manuscript writing. A.M. supervised and B.W. conceived and supervised the project.

Corresponding author

Correspondence to Bo Wang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Jesper Tegner, Fan Yang, Xuegong Zhang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Appendices A–C, Table 1, Figs. 1 and 2.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Consens, M.E., Dufault, C., Wainberg, M. et al. Transformers and genome language models. Nat Mach Intell 7, 346–362 (2025). https://doi.org/10.1038/s42256-025-01007-9

