Abstract
Transcriptional regulation, critical for cellular differentiation and adaptation to environmental changes, involves coordinated interactions among DNA sequences, regulatory proteins, and chromatin architecture. Despite extensive chromatin profiles and gene expression data from consortia, understanding the dynamics of cis-regulatory elements in gene expression remains challenging. Deep learning is a powerful tool for learning gene expression and epigenomic profiles from DNA sequences, exhibiting superior performance compared to conventional machine learning approaches. However, even the most advanced deep learning-based methods may fall short in capturing the regulatory effects of distal elements such as enhancers, limiting their predictive accuracy. In addition, these methods may require significant resources to train or adapt to newly generated data. To address these challenges, we present EPInformer, a scalable deep-learning framework for predicting gene expression by integrating promoter-enhancer interactions with their sequences, epigenomic profiles, and chromatin contacts. Our model outperforms existing gene expression prediction models in rigorous cross-chromosome validation, accurately recapitulates enhancer-gene interactions validated by genome editing experiments, and identifies crucial transcription factor motifs within regulatory sequences.
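As an illustration of the cross-chromosome validation mentioned above, the following minimal sketch assigns whole chromosomes to folds so that no chromosome contributes genes to both training and test sets. The function name, the number of folds, and the greedy balancing rule are assumptions made for illustration and are not taken from the EPInformer code.

```python
from collections import defaultdict


def cross_chromosome_folds(gene_to_chrom, n_folds=3):
    """Assign whole chromosomes to folds (hypothetical helper, not the authors' code).

    Returns (folds, fold_chroms): the genes in each fold and the chromosomes
    each fold contains. Every chromosome lands in exactly one fold.
    """
    # Group genes by the chromosome they sit on.
    chrom_genes = defaultdict(list)
    for gene, chrom in gene_to_chrom.items():
        chrom_genes[chrom].append(gene)

    folds = [[] for _ in range(n_folds)]
    fold_chroms = [set() for _ in range(n_folds)]

    # Greedy balancing (an assumption): place the largest chromosomes first,
    # each into the fold that currently holds the fewest genes.
    for chrom in sorted(chrom_genes, key=lambda c: (-len(chrom_genes[c]), c)):
        smallest = min(range(n_folds), key=lambda k: len(folds[k]))
        folds[smallest].extend(chrom_genes[chrom])
        fold_chroms[smallest].add(chrom)
    return folds, fold_chroms
```

Holding out one fold at a time then yields test genes whose chromosomes were never seen during training, which is the property that distinguishes this scheme from a random gene-level split.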
Data availability
The genomic datasets analyzed during the current study are available in the ENCODE Project repository (https://www.encodeproject.org/) under the following accession codes: DNase-seq (K562: ENCFF425WDA, ENCFF205FNC; GM12878: ENCFF020WZB, ENCFF729UYK; H1: ENCFF761ZRE; HepG2: ENCFF691HJY; HUVEC: ENCFF091KTX; NHEK: ENCFF117RNM); H3K27ac ChIP-seq (K562: ENCFF600THN, ENCFF232RQF, ENCFF704LGA; GM12878: ENCFF269GKF, ENCFF201OHW; H1: ENCFF693IFG, ENCFF860ABR; HepG2: ENCFF745JCH, ENCFF862NDZ, ENCFF926NHE; HUVEC: ENCFF374DGO, ENCFF609TUB; NHEK: ENCFF051NTC, ENCFF770JWP); and reference Hi-C (ENCFF134PUN [https://www.encodeproject.org/files/ENCFF134PUN]). Additional Hi-C contact matrices are available from the 4D Nucleome Data Portal (https://data.4dnucleome.org/) under accession codes 4DNFITUOMFUQ and 4DNFI1UEG1HD. CAGE data are available from FANTOM5 (https://fantom.gsc.riken.jp/5/sstar/Main_Page) under accession codes CNhs11250 [https://fantom.gsc.riken.jp/5/sstar/FF:10454-106G4] and CNhs12333 [https://fantom.gsc.riken.jp/5/sstar/FF:10823-111C4]. RNA-seq expression profiles are available from the Roadmap Epigenomics Consortium (https://egg2.wustl.edu/roadmap/data/byDataType/rna/expression/57epigenomes.RPKM.pc.gz). The enhancer-gene linkage benchmarking datasets are available in the Engreitz Lab GitHub repositories (https://github.com/EngreitzLab/CRISPR_comparison and https://github.com/EngreitzLab/eQTLEnrichment) and are included in Supplementary Data 2 and 3. The enhancer-gene pair data generated in this study have been deposited in Zenodo (https://zenodo.org/records/17167181). Source data are provided with this paper.
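For readers who want to look up these files programmatically, the sketch below collects the DNase-seq accessions listed above into a dictionary and builds the corresponding ENCODE file page URLs. The URL pattern is the one shown above for ENCFF134PUN; that the other accessions resolve under the same pattern is an assumption, and the names `ENCODE_FILE_PAGE`, `DNASE_SEQ`, and `file_pages` are hypothetical. No network request is made.

```python
# URL pattern taken from the ENCFF134PUN link in the statement above;
# applying it to the other accessions is an assumption.
ENCODE_FILE_PAGE = "https://www.encodeproject.org/files/{accession}"

# DNase-seq accessions per cell line, copied from the data availability statement.
DNASE_SEQ = {
    "K562": ["ENCFF425WDA", "ENCFF205FNC"],
    "GM12878": ["ENCFF020WZB", "ENCFF729UYK"],
    "H1": ["ENCFF761ZRE"],
    "HepG2": ["ENCFF691HJY"],
    "HUVEC": ["ENCFF091KTX"],
    "NHEK": ["ENCFF117RNM"],
}


def file_pages(accessions_by_cell):
    """Map each cell line to the ENCODE file page URLs of its accessions."""
    return {
        cell: [ENCODE_FILE_PAGE.format(accession=acc) for acc in accessions]
        for cell, accessions in accessions_by_cell.items()
    }


if __name__ == "__main__":
    for cell, urls in file_pages(DNASE_SEQ).items():
        print(cell, *urls, sep="\n  ")
```

The same dictionary layout could hold the H3K27ac ChIP-seq accessions, keeping each assay's files keyed by cell line.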
Code availability
The code used to develop EPInformer, perform the analyses, and generate the results in this study is publicly available at https://github.com/pinellolab/EPInformer (release version 0.1.1) under the MIT License. The specific version of the code associated with this publication is archived in Zenodo and is accessible via https://doi.org/10.5281/zenodo.17167181 (ref. 70).
References
Oudelaar, A. M. & Higgs, D. R. The relationship between genome structure and function. Nat. Rev. Genet. 22, 154–168 (2021).
Gasperini, M., Tome, J. M. & Shendure, J. Towards a comprehensive catalogue of validated and target-linked human enhancers. Nat. Rev. Genet. 21, 292–310 (2020).
Andersson, R. & Sandelin, A. Determinants of enhancer and promoter activities of regulatory elements. Nat. Rev. Genet. 21, 71–87 (2020).
de Boer, C. G. & Taipale, J. Hold out the genome: a roadmap to solving the cis-regulatory code. Nature 625, 41–50 (2024).
Dunham, I. et al. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Djebali, S. et al. Landscape of transcription in human cells. Nature 489, 101–108 (2012).
Lizio, M. et al. Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome Biol. 16, 22 (2015).
de Hoon, M., Shin, J. W. & Carninci, P. Paradigm shifts in genomics through the FANTOM projects. Mamm. Genome 26, 391–402 (2015).
Reiff, S. B. et al. The 4D nucleome data portal as a resource for searching and visualizing curated nucleomics data. Nat. Commun. 13, 2365 (2022).
Dekker, J. et al. The 4D nucleome project. Nature 549, 219–226 (2017).
Chen, K. M., Wong, A. K., Troyanskaya, O. G. & Zhou, J. A sequence-based global map of regulatory activity for deciphering human genetics. Nat. Genet. 54, 940–949 (2022).
Avsec, Ž. et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).
Gao, Z., Liu, Q., Zeng, W., Jiang, R. & Wong, W. H. EpiGePT: a pretrained transformer-based language model for context-specific human epigenomics. Genome Biol. 25, 310 (2024).
Li, Z. et al. Applications of deep learning in understanding gene regulation. Cell Rep. Methods 3, 100384 (2023).
Zrimec, J. et al. Deep learning suggests that gene expression is encoded in all parts of a co-evolving interacting gene regulatory structure. Nat. Commun. 11, 6141 (2020).
Zhang, Z., Feng, F., Qiu, Y. & Liu, J. A generalizable framework to comprehensively predict epigenome, chromatin organization, and transcriptome. Nucleic Acids Res. 51, 5931–5947 (2023).
Salvatore, M., Horlacher, M., Marsico, A., Winther, O. & Andersson, R. Transfer learning identifies sequence determinants of cell-type specific regulatory element accessibility. NAR Genom. Bioinform. 5, lqad026 (2023).
Seitz, E. E., McCandlish, D. M., Kinney, J. B. & Koo, P. K. Interpreting cis-regulatory mechanisms from genomic deep neural networks using surrogate models. bioRxiv https://doi.org/10.1101/2023.11.14.567120 (2024).
Tan, J. et al. Cell-type-specific prediction of 3D chromatin organization enables high-throughput in silico genetic screening. Nat. Biotechnol. 41, 1140–1150 (2023).
Linder, J., Srivastava, D., Yuan, H., Agarwal, V. & Kelley, D. R. Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Nat. Genet. 57, 949–961 (2025).
Zhou, J. et al. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat. Genet. 50, 1171–1179 (2018).
Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).
Consens, M. E. et al. Transformers and large language models for genomics. Nat. Mach. Intell. 7, 346–362 (2025).
Zhang, S. et al. Applications of transformer-based language models in bioinformatics: a survey. Bioinform. Adv. 3, vbad001 (2023).
Lee, D., Yang, J. & Kim, S. Learning the histone codes with large genomic windows and three-dimensional chromatin interactions using transformer. Nat. Commun. 13, 6678 (2022).
Tang, Z., Toneyan, S. & Koo, P. K. Current approaches to genomic deep learning struggle to fully capture human genetic variation. Nat. Genet. 55, 2021–2022 (2023).
Karollus, A., Mauermeier, T. & Gagneur, J. Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers. Genome Biol. 24, 56 (2023).
Li, Y. et al. CREaTor: zero-shot cis-regulatory pattern modeling with attention mechanisms. Genome Biol. 24, 266 (2023).
Karbalayghareh, A., Sahin, M. & Leslie, C. S. Chromatin interaction-aware gene regulatory modeling with graph attention networks. Genome Res. 32, 930–944 (2022).
Zhou, Z. et al. DNABERT-2: Efficient foundation model and benchmark for multi-species genome. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR, 2024).
Dalla-Torre, H. et al. Nucleotide transformer: building and evaluating robust foundation models for human genomics. Nat. Methods 1–11, https://doi.org/10.1038/s41592-024-02523-z (2024).
Nguyen, E. et al. HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution. In Advances in Neural Information Processing Systems (NeurIPS, 2023).
Marin, F. I. et al. BEND: Benchmarking DNA language models on biologically meaningful tasks. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR, 2024).
Wang, Y. et al. Genomic touchstone: benchmarking genomic language models in the context of the central dogma. bioRxiv https://doi.org/10.1101/2025.06.25.661622 (2025).
Feng, H. et al. Benchmarking DNA foundation models for genomic and genetic tasks. Nat. Commun. 16, 10780 (2025).
Shrikumar, A. et al. Technical note on transcription factor motif discovery from importance scores (TF-MoDISco) version 0.5.6.5. Preprint at arXiv https://doi.org/10.48550/arXiv.1811.00416 (2018).
Schreiber, J. tangermeme: A toolkit for understanding cis-regulatory logic using deep learning models. bioRxiv https://doi.org/10.1101/2025.08.08.669296 (2025).
Yuan, H. & Kelley, D. R. scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks. Nat. Methods 19, 1088–1096 (2022).
Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
Pampari, A. et al. ChromBPNet: Bias Factorized, Base-Resolution Deep Learning Models of Chromatin Accessibility Reveal Cis-Regulatory Sequence Syntax, Transcription Factor Footprints and Regulatory Variants. bioRxiv https://doi.org/10.1101/2024.12.25.630221 (2024).
Ji, Y., Zhou, Z., Liu, H. & Davuluri, R. V. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics 37, 2112–2120 (2021).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT, 2019).
Agarwal, V. & Shendure, J. Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks. Cell Rep. 31, 107663 (2020).
Fulco, C. P. et al. Activity-by-contact model of enhancer–promoter regulation from thousands of CRISPR perturbations. Nat. Genet. 51, 1664–1669 (2019).
Kruse, K., Hug, C. B. & Vaquerizas, J. M. FAN-C: a feature-rich framework for the analysis and visualisation of chromosome conformation capture data. Genome Biol. 21, 303 (2020).
Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
Linder, J., Srivastava, D., Yuan, H., Agarwal, V. & Kelley, D. R. Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Nat. Genet. 57, 949–961 (2025).
Koido, M. et al. Prediction of the cell-type-specific transcription of non-coding RNAs from genome sequences via machine learning. Nat. Biomed. Eng. 7, 830–844 (2023).
Nasser, J. et al. Genome-wide enhancer maps link risk variants to disease genes. Nature 593, 238–243 (2021).
Gschwind, A. R. et al. An encyclopedia of enhancer-gene regulatory interactions in the human genome. bioRxiv https://doi.org/10.1101/2023.11.09.563812 (2023).
GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning (ICML, 2017).
Gupta, S., Stamatoyannopoulos, J. A., Bailey, T. L. & Noble, W. S. Quantifying similarity between motifs. Genome Biol. 8, R24 (2007).
Schreiber, J. Tomtom-lite: accelerating Tomtom enables large-scale and real-time motif similarity scoring. bioRxiv https://doi.org/10.1101/2025.05.27.656386 (2025).
Doré, L. C. & Crispino, J. D. Transcription factor networks in erythroid cell and megakaryocyte development. Blood 118, 231–239 (2011).
Martin-Rufino, J. D. et al. Transcription factor networks disproportionately enrich for heritability of blood cell phenotypes. Science 388, 52–59 (2025).
Grant, C. E., Bailey, T. L. & Noble, W. S. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).
Agarwal, V. et al. Massively parallel characterization of transcriptional regulatory elements in three diverse human cell types. Nature 639, 411–420 (2025).
Fulco, C. P. et al. Systematic mapping of functional enhancer–promoter connections with CRISPR interference. Science 354, 769–773 (2016).
De Braekeleer, E. et al. ETV6 fusion genes in hematological malignancies: a review. Leuk. Res. 36, 945–961 (2012).
Bloom, M. et al. ETV6 represses TNF during stress hematopoiesis and regulates HSC self renewal. Blood 140, 2849–2850 (2022).
Kaczynski, J., Cook, T. & Urrutia, R. Sp1- and Krüppel-like transcription factors. Genome Biol. 4, 206 (2003).
Paszke, A. et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS, 2019).
Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. In Proceedings of the Seventh International Conference on Learning Representations (ICLR, 2019).
Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV, 2015).
Moore, J. E. et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710 (2020).
Miglani, V., Yang, A., Markosyan, A., Garcia-Olano, D. & Kokhlikyan, N. Using Captum to explain generative language models. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS, 2023).
Rauluseviciute, I. et al. JASPAR 2024: 20th anniversary of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 52, D174–D182 (2024).
Lin, J. Pinellolab/EPInformer: Release and Storing Also on Zenodo. Zenodo, https://doi.org/10.5281/ZENODO.17167181 (2025).
Acknowledgements
We gratefully acknowledge Simon Senan, Lucas Ferreira DaSilva, and other members of the Pinello Lab for their insightful feedback and discussions. We would also like to thank Maya Sheth and Jesse Engreitz for sharing the data and code for eQTL enrichment analysis. L.P. was partially supported by 1R35HG010717-01 and the Rappaport MGH Research Scholar Award 2024-2029. R.L. was supported by Hong Kong Research Grants Council grants GRF (17113721), TRS (T21-708705/20-N) and the URC fund from HKU.
Author information
Contributions
L.P. and R.L. conceived the study; L.P. supervised the project. J.L. developed EPInformer and performed computational downstream analysis, including model benchmarking and case studies. J.L., Z.L., Y.Z., and R.L. evaluated the benchmarking results. J.L. and L.P. wrote the manuscript with contributions from all authors.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Communications thanks Chikashi Terao and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Lin, J., Li, Z., Zhao, Y. et al. EPInformer: scalable and integrative prediction of gene expression from promoter-enhancer sequences with multimodal epigenomic profiles. Nat Commun (2026). https://doi.org/10.1038/s41467-026-70535-8
DOI: https://doi.org/10.1038/s41467-026-70535-8


