Large-scale data-driven pre-trained DNA models enhance performance across diverse genomics tasks

Sun, Canzhuang; He, Zhijie; Zhang, Shifei; Xu, Kang; Sun, Yu; Wang, Yuyang; Hu, Pengzhen; Bo, Xiaochen; Liao, Mingzhi; Li, Hao; Chen, Hebing

doi:10.1038/s41467-026-73129-6

Article
Open access
Published: 14 May 2026

Nature Communications (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Abstract

Sequence-based deep learning has advanced genome interpretation, yet most models remain task-specific and rely on retraining, limiting scalability across biological contexts. Here we present SUCCEED, a supervised multi-task DNA foundation model pretrained on 6,389 ENCODE functional genomics tracks to learn transferable regulatory representations. By integrating convolutional layers with a Transformer architecture, SUCCEED captures both local sequence motifs and long-range regulatory dependencies, achieving performance comparable to or exceeding Enformer across benchmark tasks. Through transfer learning, it predicts cell-type-specific epigenomic profiles, denoises sparse chromatin accessibility signals, and predicts three-dimensional chromatin contacts without CTCF input across data scales and cell types. Across diverse genomics tasks, SUCCEED performs comparably to supervised foundation models such as Sei and outperforms self-supervised models trained solely on DNA sequence. Overall, SUCCEED is a transferable and scalable foundation model that provides a unified framework for genome-scale regulatory modeling in complex biological contexts.
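
The abstract names two architectural ingredients: convolutional layers for local sequence motifs and a Transformer for long-range dependencies. The toy sketch below illustrates only that general idea and is not the SUCCEED implementation. The function names, the one-hot input encoding, the hand-written TATA kernel, and the single-head attention with identity projections are all assumptions made for illustration; a real model would learn its kernels and projections and use many channels.

```python
import math

BASES = "ACGT"  # assumed channel order for the one-hot encoding


def one_hot(seq):
    """Encode a DNA string as rows of 4 floats (A, C, G, T); other letters become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def conv1d_relu(x, kernel):
    """Slide a k x 4 kernel along an L x 4 input (no padding) and apply ReLU.

    Each output value scores how well the local window matches the kernel,
    which is how a convolutional layer responds to a sequence motif.
    """
    k = len(kernel)
    out = []
    for i in range(len(x) - k + 1):
        s = sum(x[i + j][c] * kernel[j][c] for j in range(k) for c in range(4))
        out.append(max(0.0, s))
    return out


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(h):
    """Single-head self-attention over scalar features.

    Queries, keys and values are the features themselves (identity
    projections, a hypothetical simplification), so every position can
    draw on every other position regardless of distance.
    """
    out = []
    for q in h:
        weights = softmax([q * k for k in h])
        out.append(sum(w * v for w, v in zip(weights, h)))
    return out


# Hypothetical example: two TATA motifs separated by a spacer.
sequence = "GGTATAGGGGTATAGG"
motif_kernel = one_hot("TATA")  # hand-written kernel, not a learned one

scores = conv1d_relu(one_hot(sequence), motif_kernel)  # local motif evidence
mixed = self_attention(scores)                          # long-range mixing
```

In this sketch the convolution peaks at the two window positions where TATA occurs, and the attention step lets each position weight those distant peaks when forming its own output. The split mirrors the division of labour the abstract describes, under the simplifications stated above.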

Funding

This work was supported by the National Natural Science Foundation of China (Nos. 62422318 to H.C. and 62472360 to M.L.), the Beijing Nova Program of Science and Technology (20250484974 to H.C.), the State Key Laboratory of Medical Proteomics (SKLP-K202407 to H.C.), and the National Key Research and Development Program of China (2023YFF0725500 to H.C. and 2024YFA1307700 to X.B.). Additional support was provided by the Science Fund for Distinguished Young Scholars of Shaanxi Province (grant no. 2024JC-JCQN-29 to M.L.).

Author information

These authors contributed equally: Canzhuang Sun, Zhijie He, Shifei Zhang.

Authors and Affiliations

College of Life Sciences, Center of Bioinformatics, Northwest A&F University, Yangling, China
Canzhuang Sun, Zhijie He, Shifei Zhang & Mingzhi Liao
Academy of Military Medical Sciences, Beijing, China
Kang Xu, Yu Sun, Yuyang Wang, Pengzhen Hu, Xiaochen Bo, Hao Li & Hebing Chen

Corresponding authors

Correspondence to Xiaochen Bo, Mingzhi Liao, Hao Li or Hebing Chen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (PDF)

Transparent Peer Review file (PDF)

Reporting Summary (PDF)

Description of Additional Supplementary Files (PDF)

Supplementary Data 1-5 (XLSX)

Source data

Source Data (XLSX)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Sun, C., He, Z., Zhang, S. et al. Large-scale data-driven pre-trained DNA models enhance performance across diverse genomics tasks. Nat Commun (2026). https://doi.org/10.1038/s41467-026-73129-6

Received: 28 September 2025
Accepted: 01 May 2026
Published: 14 May 2026
DOI: https://doi.org/10.1038/s41467-026-73129-6