Abstract
Here we propose a search algorithm for proteoform identification that computes the largest-size error-correction alignments between a protein mass graph and a spectrum mass graph. Our combined method first uses a filtering algorithm to identify candidates and then applies the search algorithm to report the final results. Our exact searching method is 3.9 to 9.0 times faster than popular methods such as TopMG and TopPIC. Our combined method further reduces the running time of sTopMG without affecting the search accuracy. We also develop a pipeline for generating simulated top-down spectra from input protein sequences with modifications. Experiments on simulated datasets show that our combined method achieves 95% accuracy, exceeding that of existing methods. Experiments on real annotated datasets show that our method achieves ≥97.1% accuracy when the deconvolution method FLASHDeconv is used.
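The two-stage idea described above (a cheap filter that shortlists candidate proteins, followed by a costlier search on the shortlist) can be illustrated with a minimal sketch. This is not the sTopMG algorithm: it does not build mass graphs or perform error-correction alignment. It assumes a toy scoring scheme in which a spectrum peak matches a protein when it lies within a tolerance `delta` of a prefix mass, and the second stage additionally allows one unknown mass shift. The function names, the tiny residue-mass table and the example sequences are all hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

# Monoisotopic residue masses (Da) for five amino acids only; a real tool needs all 20.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                "V": 99.06841, "L": 113.08406}


def prefix_masses(seq):
    """Return the cumulative residue masses of every prefix of seq (ascending)."""
    return list(accumulate(RESIDUE_MASS[r] for r in seq))


def count_matches(spectrum, masses, delta, shifts=(0.0,)):
    """Count peaks that fall within delta of a prefix mass under any allowed shift."""
    hits = 0
    for peak in spectrum:
        for shift in shifts:
            target = peak - shift
            i = bisect_left(masses, target - delta)
            if i < len(masses) and masses[i] <= target + delta:
                hits += 1
                break  # count each peak at most once
    return hits


def filter_candidates(spectrum, database, delta, keep):
    """Stage 1 (toy filter): rank proteins by unshifted matches, keep the top few."""
    ranked = sorted(
        database,
        key=lambda name: count_matches(spectrum, prefix_masses(database[name]), delta),
        reverse=True,
    )
    return ranked[:keep]


def search(spectrum, database, candidates, delta):
    """Stage 2 (toy search): for each candidate, try every single mass shift
    implied by a peak/prefix pair and keep the best-scoring protein."""
    best_score, best_name = -1, None
    for name in candidates:
        masses = prefix_masses(database[name])
        candidate_shifts = {peak - mass for peak in spectrum for mass in masses}
        score = max(
            count_matches(spectrum, masses, delta, (0.0, shift))
            for shift in candidate_shifts
        )
        if score > best_score:
            best_score, best_name = score, name
    return best_score, best_name


if __name__ == "__main__":
    # Hypothetical three-protein database and a spectrum simulated from P1
    # with a +15.99491 Da shift on the fourth residue.
    database = {"P1": "GASVL", "P2": "LVSAG", "P3": "AAGGS"}
    base = prefix_masses(database["P1"])
    spectrum = base[:3] + [m + 15.99491 for m in base[3:]]
    shortlist = filter_candidates(spectrum, database, delta=0.01, keep=2)
    print(shortlist, search(spectrum, database, shortlist, delta=0.01))
```

In this toy example the filter keeps P1 because three peaks match without any shift, and the second stage recovers all five peaks once the shift is allowed. The design point it illustrates is only the division of labour: the expensive step runs on the shortlist rather than on the whole database.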
Data availability
The SULIS spectra dataset (ref. 28) was obtained from PRIDE (ref. 30) via accession code PXD003074. The SULIS reference proteome database and the ECOLI reference proteome database can be downloaded from UniProt (ref. 12) using Proteome IDs UP000013006 and UP000000625, respectively. A further four datasets and two databases are available via ref. 31. Specifically, the four datasets are the top-down mass spectrum dataset of 2,025 collision-induced dissociation (CID) MS/MS spectra (ref. 8); the dataset of 100 simulated top-down mass spectra used for the evaluation of the search accuracy; and two antibody top-down mass spectra datasets that were generated from Waters and HB100, respectively. The two databases are the database of 800 antibody sequences used for the evaluation of the search accuracy and the database of 500 decoy antibody sequences used for the FDR control experiments. Source data are provided with this paper.
Code availability
The source code of sTopMG and the proposed top-down spectra simulator are available via Zenodo at https://doi.org/10.5281/zenodo.16597445 (ref. 32) and https://doi.org/10.5281/zenodo.16597493 (ref. 33), respectively.
References
1. Roberts, D. S. et al. Top-down proteomics. Nat. Rev. Methods Primers 4, 38 (2024).
2. Schaffer, L. V. et al. Identification and quantification of proteoforms by mass spectrometry. Proteomics 19, 1800361 (2019).
3. Larsen, S. C. et al. Proteome-wide analysis of arginine monomethylation reveals widespread occurrence in human cells. Sci. Signal. 9, aaf7329 (2016).
4. Karabacak, N. M. et al. Sensitive and specific identification of wild type and variant proteins from 8 to 669 kDa using top-down mass spectrometry. Mol. Cell. Proteom. 8, 846–856 (2009).
5. Zamdborg, L. et al. ProSight PTM 2.0: improved protein identification and characterization for top down mass spectrometry. Nucleic Acids Res. 35, W701–W706 (2007).
6. Liu, X. et al. Protein identification using top-down spectra. Mol. Cell. Proteom. 11, M111.008524 (2012).
7. Liu, X. et al. Identification of ultramodified proteins using top-down tandem mass spectra. J. Proteome Res. 12, 5830–5838 (2013).
8. Kou, Q. et al. A mass graph-based approach for the identification of modified proteoforms using top-down tandem mass spectra. Bioinformatics 33, 1309–1316 (2017).
9. Kou, Q., Xun, L. & Liu, X. TopPIC: a software tool for top-down mass spectrometry-based proteoform identification and characterization. Bioinformatics 32, 3495–3497 (2016).
10. Zhan, Z. & Wang, L. Proteoform identification based on top-down tandem mass spectra with peak error corrections. Brief. Bioinformatics 23, bbab599 (2022).
11. Zhan, Z. & Wang, L. Fast peak error correction algorithms for proteoform identification using top-down tandem mass spectra. Bioinformatics 40, btae149 (2024).
12. UniProt Consortium. UniProt: the universal protein knowledgebase in 2025. Nucleic Acids Res. 53, D609–D617 (2025).
13. Kou, Q., Wu, S. & Liu, X. Systematic evaluation of protein sequence filtering algorithms for proteoform identification using top-down mass spectrometry. Proteomics 18, 1700306 (2018).
14. Chick, J. M. et al. A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. Nat. Biotechnol. 33, 743–749 (2015).
15. Mann, M. & Wilm, M. Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal. Chem. 66, 4390–4399 (1994).
16. Yang, R. et al. A spectrum graph-based protein sequence filtering algorithm for proteoform identification by top-down mass spectrometry. In 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 222–229 (IEEE, 2017).
17. Deng, F., Wang, L. & Liu, X. An efficient algorithm for the blocked pattern matching problem. Bioinformatics 31, 532–538 (2014).
18. Sun, R.-X. et al. pTop 1.0: a high-accuracy and high-efficiency search engine for intact protein identification. Anal. Chem. 88, 3082–3090 (2016).
19. Park, J. et al. Informed-Proteomics: open-source software package for top-down proteomics. Nat. Methods 14, 909–914 (2017).
20. Tabb, D. L. et al. Comparing top-down proteoform identification: deconvolution, PrSM overlap, and PTM detection. J. Proteome Res. 22, 2199–2217 (2023).
21. Awan, M. G. & Saeed, F. MaSS-Simulator: a highly configurable simulator for generating MS/MS datasets for benchmarking of proteomics algorithms. Proteomics 18, 1800206 (2018).
22. Liu, X. et al. Deconvolution and database search of complex tandem mass spectra of intact proteins. Mol. Cell. Proteom. 9, 2772–2782 (2010).
23. Young, A., Röst, H. & Wang, B. Tandem mass spectrum prediction for small molecules using graph transformers. Nat. Mach. Intell. 6, 404–416 (2024).
24. TopPIC Suite (TopPic, 2023).
25. Chambers, M. C. et al. A cross-platform toolkit for mass spectrometry and proteomics. Nat. Biotechnol. 30, 918–920 (2012).
26. Jeong, K. et al. FLASHDeconv: ultrafast, high-quality feature deconvolution for top-down proteomics. Cell Syst. 10, 213–218 (2020).
27. Kim, S., Gupta, N. & Pevzner, P. A. Spectral probabilities and generating functions of tandem mass spectra: a strike against decoy databases. J. Proteome Res. 7, 3354–3363 (2008).
28. Vorontsov, E. A., Rensen, E., Prangishvili, D., Krupovic, M. & Chamot-Rooke, J. Abundant lysine methylation and N-terminal acetylation in Sulfolobus islandicus revealed by bottom-up and top-down proteomics. Mol. Cell. Proteom. 15, 3388–3404 (2016).
29. Mus musculus (Mouse) (UniProt, 2025); https://www.uniprot.org/uniprotkb?dir=ascend&facets=model_organism%3A10090&query=antibody&sort=accession
30. Vizcaíno, J. A. et al. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res. 44, D447–D456 (2016).
31. Li, K., Shan, B., Xin, L., Li, M. & Wang, L. Datasets and databases for proteoform search from protein database with top-down mass spectra. figshare https://doi.org/10.6084/m9.figshare.29634140.v2 (2025).
32. Li, K., Shan, B., Xin, L., Li, M. & Wang, L. sTopMG. Zenodo https://doi.org/10.5281/zenodo.16597445 (2025).
33. Li, K., Shan, B., Xin, L., Li, M. & Wang, L. TopDown-MaSS-Simulator. Zenodo https://doi.org/10.5281/zenodo.16597493 (2025).
Acknowledgements
We thank the reviewers for their valuable suggestions and comments. L.W. was supported by the National Science Foundation of China (NSFC grant no. 61972329) and by General Research Fund (GRF) grants of the Hong Kong Special Administrative Region, P. R. China (grant nos. CityU 11218821 and CityU 11218423). M.L. was supported by the National Key Research and Development Program of China (grant no. 2024YFA1306400) and the Natural Sciences and Engineering Research Council of Canada (NSERC) (grant no. OGP0046506).
Author information
Contributions
K.L. focused on computer programming and was heavily involved in the design of the algorithm, the development of the pipeline for the top-down spectra simulator, and the computation of the experimental results. L.W. oversaw all aspects of the paper, particularly algorithm design, paper writing, design of the pipeline for the top-down spectra simulator, and experimental design. M.L. was involved in all aspects of the paper and oversaw the generation of the annotated real datasets by B.S. and L.X.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks Jianxin Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Kaitlin McCardle, in collaboration with the Nature Computational Science team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 The match gap distributions for real dataset and simulated dataset.
The span of 1 to 500 residues is divided into 50 ranges of 10 residues each, and 40 simulated spectra are generated for each range. a, Match gap distributions of the real and simulated datasets for matched segment lengths of 20–150 residues, comprising 1,212 real spectra and 520 simulated spectra (40 for each of 13 ranges). b, Match gap distributions of the real and simulated datasets for alignment-covered segment lengths of 150–420 residues, comprising 423 real spectra and 1,080 simulated spectra (40 for each of 27 ranges). The horizontal axes indicate the match gaps (up to 60 amino acids). Each curve shows the percentage of all matches in the alignments that have the given match gap. The yellow curves represent the match gap distribution of the real dataset, and the red curves represent that of the simulated dataset. For match gaps larger than 60 amino acids, the values of the curves are 0 or very close to 0.
Extended Data Fig. 2 The MMEs of alignments obtained by different methods.
MME comparison curves of sTopMG, TopMG, TopMGFast and MS-Align-E for 100 alignments of the simulated dataset, with the predefined error tolerance δ = 0.1 Dalton. The alignments are sorted by the MMEs of sTopMG in non-descending order.
Extended Data Fig. 3 Match gap distributions of spectra in real dataset and 3 simulated subsets.
Subset 1 has a percentage of small match gaps (⩽ 6) similar to that of the real data. Subset 2 has a lower percentage of small match gaps (⩽ 6). Subset 3 does not contain any small match gaps (⩽ 6).
Supplementary information
Supplementary Information
Supplementary Notes 1–2, Figs. 1–9 and Tables 1–20.
Supplementary Data 1
Means and variances of the IGPs for each ion and residue.
Source data
Source Data Fig. 1
Statistical source data.
Source Data Fig. 2
Statistical source data.
Source Data Extended Data Fig. 1
Statistical source data.
Source Data Extended Data Fig. 2
Statistical source data.
Source Data Extended Data Fig. 3
Statistical source data.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, K., Shan, B., Xin, L. et al. Proteoform search from protein database with top-down mass spectra. Nat Comput Sci 5, 998–1009 (2025). https://doi.org/10.1038/s43588-025-00880-z
DOI: https://doi.org/10.1038/s43588-025-00880-z


