DDA-BERT: end-to-end training for data-dependent acquisition mass spectrometry-based proteomics

A, Jun; Liu, Pu; Sun, Yingying; Lin, Jiaying; Zhang, Xiaofan; Nie, Zongxiang; Liu, Jingnan; Yu, Zhiguo; Zhang, Yuqi; Xing, Ziyuan; Chen, Yi; Guo, Tiannan

doi:10.1038/s41467-026-72246-6

Download PDF

Article
Open access
Published: 27 April 2026

DDA-BERT: end-to-end training for data-dependent acquisition mass spectrometry-based proteomics

Jun A^1,2,3^na1,
Pu Liu⁴^na1,
Yingying Sun^1,2,3^na1,
Jiaying Lin^1,2,3,
Xiaofan Zhang^1,2,3,
Zongxiang Nie²,
Jingnan Liu^1,2,3,5,
Zhiguo Yu^1,2,3,6,
Yuqi Zhang^1,2,3,
Ziyuan Xing^1,2,3,
Yi Chen² &
…
Tiannan Guo ORCID: orcid.org/0000-0003-3869-7651^1,2,3

Nature Communications (2026) Cite this article

5332 Accesses
2 Altmetric
Metrics details

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

Abstract

Peptide-spectrum match (PSM) rescoring is critical for accurate peptide identification in data-dependent acquisition (DDA)-based proteomics. Existing rescoring frameworks typically combine search-engine scores with heuristic or learned auxiliary features to refine PSM ranking and confidence estimation. Although recent approaches incorporate deep learning-derived representations of spectra, retention time, or ion mobility, the final decision stage still commonly relies on separately trained shallow classifiers, constraining the expressive capacity of the overall scoring framework. Here, we introduce DDA-BERT, a transformer-based end-to-end deep learning model trained with ~271 million PSMs from 11 species. DDA-BERT consistently outperforms existing tools across species-specific benchmarks, achieving 2.24%–269.35%, 3.73%–141.46%, 5.53%–45.64%, and 3.68%–62.77% increases in peptide identifications on human, yeast, Drosophila, and Arabidopsis datasets, respectively. The model retains high sensitivity in trace-level proteomics samples. On HLA immunopeptidomics data, DDA-BERT further increases peptide identifications by 4.14%–87.47%. The main limitations of DDA-BERT include the requirement for GPU-based computing and the need for substantial, diverse training datasets to achieve optimal model performance. This study introduces an alternative DDA rescoring approach and establishes a methodological foundation for scalable, AI-driven peptide identification in DDA proteomics.

A streamlined platform for analyzing tera-scale DDA and DIA mass spectrometry data enables highly sensitive immunopeptidomics

Article Open access 07 June 2022

DIA-BERT: pre-trained end-to-end transformer models for enhanced DIA proteomics data analysis

Article Open access 14 April 2025

Artificial intelligence-driven approaches for the rational design of peptides with predictable aggregation propensity

Article Open access 25 September 2025

Acknowledgements

This work is supported by grants from National Natural Science Foundation of China (Key Joint Research Program, grant no. U24A20476) (T.G.), Noncommunicable Chronic Diseases-National Science and Technology Major Project (grant no. 2024ZD0533300) (T.G.), “Pioneer” and “Leading Goose” R&D Program of Zhejiang (2024SSYS0035) (T.G.), and State Key Laboratory of Medical Proteomics (SKLP-K202406) (T.G.). We gratefully acknowledge the Westlake University Supercomputer Center for assistance in data analysis and storage. During the preparation of this work, the authors used ChatGPT to improve language and readability. The authors reviewed and edited the output as needed and take full responsibility for the content of the publication.

Author information

These authors contributed equally: Jun A, Pu Liu, Yingying Sun.

Authors and Affiliations

Affiliated Hangzhou First People’s Hospital, State Key Laboratory of Medical Proteomics, School of Medicine, School of Future Biomedicine, Westlake University, Hangzhou, China
Jun A, Yingying Sun, Jiaying Lin, Xiaofan Zhang, Jingnan Liu, Zhiguo Yu, Yuqi Zhang, Ziyuan Xing & Tiannan Guo
Westlake Center for Intelligent Proteomics, Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, China
Jun A, Yingying Sun, Jiaying Lin, Xiaofan Zhang, Zongxiang Nie, Jingnan Liu, Zhiguo Yu, Yuqi Zhang, Ziyuan Xing, Yi Chen & Tiannan Guo
Research Center for Industries of the Future, School of Life Sciences, Westlake University, Hangzhou, China
Jun A, Yingying Sun, Jiaying Lin, Xiaofan Zhang, Jingnan Liu, Zhiguo Yu, Yuqi Zhang, Ziyuan Xing & Tiannan Guo
Westlake Omics (Hangzhou) Biotechnology Co., Ltd., Hangzhou, China
Pu Liu
School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
Jingnan Liu
School of Informatics, Hunan University of Chinese Medicine, Changsha, China
Zhiguo Yu

Authors

Jun A
View author publications
Search author on:PubMed Google Scholar
Pu Liu
View author publications
Search author on:PubMed Google Scholar
Yingying Sun
View author publications
Search author on:PubMed Google Scholar
Jiaying Lin
View author publications
Search author on:PubMed Google Scholar
Xiaofan Zhang
View author publications
Search author on:PubMed Google Scholar
Zongxiang Nie
View author publications
Search author on:PubMed Google Scholar
Jingnan Liu
View author publications
Search author on:PubMed Google Scholar
Zhiguo Yu
View author publications
Search author on:PubMed Google Scholar
Yuqi Zhang
View author publications
Search author on:PubMed Google Scholar
Ziyuan Xing
View author publications
Search author on:PubMed Google Scholar
Yi Chen
View author publications
Search author on:PubMed Google Scholar
Tiannan Guo
View author publications
Search author on:PubMed Google Scholar

Corresponding authors

Correspondence to Yi Chen or Tiannan Guo.

Ethics declarations

Competing interests

T.G. is the shareholder of Westlake Omics Inc. P.L. is the employee of Westlake Omics Inc. The remaining authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information (download PDF )

Description of Additional Supplementary Files (download PDF )

Supplementary Data 1 (download CSV )

Reporting Summary (download PDF )

Transparent Peer Review File (download PDF )

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

A, J., Liu, P., Sun, Y. et al. DDA-BERT: end-to-end training for data-dependent acquisition mass spectrometry-based proteomics. Nat Commun (2026). https://doi.org/10.1038/s41467-026-72246-6

Download citation

Received: 21 November 2024
Accepted: 09 April 2026
Published: 27 April 2026
DOI: https://doi.org/10.1038/s41467-026-72246-6