  • Brief Communication

Proteoform search from protein database with top-down mass spectra

Abstract

Here we propose a search algorithm for proteoform identification that computes the largest-size error-correction alignments between a protein mass graph and a spectrum mass graph. Our combined method uses a filtering algorithm to identify candidates and then applies a search algorithm to report the final results. Our exact searching method is 3.9 to 9.0 times faster than popular methods such as TopMG and TopPIC. Our combined method further reduces the running time of sTopMG without affecting the search accuracy. We also develop a pipeline for generating simulated top-down spectra from input protein sequences with modifications. Experiments on simulated datasets show that our combined method has 95% accuracy, which exceeds that of existing methods. Experiments on real annotated datasets show that our method has ≥97.1% accuracy when using the deconvolution method FLASHDeconv.
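
The filter-then-search structure described above can be illustrated with a minimal sketch. This is not the sTopMG algorithm: the shared-peak count used here is only a stand-in for both the filtering score and the alignment step, and every name (`prefix_masses`, `matched_peaks`, `filter_candidates`, `two_stage_search`) as well as the tiny database is hypothetical. Residue masses are approximate monoisotopic values for a handful of amino acids.

```python
from bisect import bisect_left

# Approximate monoisotopic residue masses (Da) for a few amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259,
}

def prefix_masses(sequence):
    """Cumulative residue masses of every prefix of the sequence."""
    total, masses = 0.0, []
    for residue in sequence:
        total += RESIDUE_MASS[residue]
        masses.append(total)
    return masses

def matched_peaks(theoretical, observed, tol):
    """Count theoretical masses that have an observed mass within +/- tol."""
    observed = sorted(observed)
    count = 0
    for mass in theoretical:
        i = bisect_left(observed, mass - tol)
        if i < len(observed) and observed[i] <= mass + tol:
            count += 1
    return count

def filter_candidates(database, observed, tol, top_k):
    """Stage 1 (assumed, cheap): keep the top_k proteins by shared-peak count."""
    ranked = sorted(
        database,
        key=lambda name: matched_peaks(prefix_masses(database[name]), observed, tol),
        reverse=True,
    )
    return ranked[:top_k]

def two_stage_search(database, observed, tol, top_k, detailed_score=None):
    """Stage 2: rescore only the filtered candidates with a costlier scorer.

    detailed_score is a placeholder for an alignment-based scorer; by default
    it reuses the shared-peak count, which is NOT an error-correction alignment.
    """
    if detailed_score is None:
        detailed_score = lambda seq: matched_peaks(prefix_masses(seq), observed, tol)
    candidates = filter_candidates(database, observed, tol, top_k)
    return max(candidates, key=lambda name: detailed_score(database[name]))

# Hypothetical three-protein database and a spectrum derived from "VTLK".
database = {"P1": "GASP", "P2": "VTLK", "P3": "KEGA"}
spectrum = [m + 0.01 for m in prefix_masses("VTLK")]
best = two_stage_search(database, spectrum, tol=0.1, top_k=2)
print(best)
```

The point of the two stages is that the expensive scorer runs only on `top_k` candidates rather than on the whole database; accuracy is preserved only if the filter rarely discards the true protein.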

Fig. 1: The distribution of the number of amino acids for the protein segments covered by the 1,684 alignments.
Fig. 2: The correlation between the total number of peaks and the precursor mass of the spectra for 1,684 PSMs.

Data availability

The SULIS spectra dataset (ref. 28) was obtained from PRIDE (ref. 30) via accession code PXD003074. The SULIS reference proteome database and the ECOLI reference proteome database can be downloaded from UniProt (ref. 12) using Proteome IDs UP000013006 and UP000000625, respectively. A further four datasets and two databases are available via ref. 31. Specifically, the four datasets are the top-down mass spectrum dataset of 2,025 collision-induced dissociation (CID) MS/MS spectra (ref. 8); the dataset of 100 simulated top-down mass spectra used for the evaluation of the search accuracy; and two antibody top-down mass spectra datasets that were generated from Waters and HB100, respectively. The two databases are the database of 800 antibody sequences used for the evaluation of the search accuracy and the database of 500 decoy antibody sequences used for the FDR control experiments. Source data are provided with this paper.
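
As a convenience, the two reference proteomes could be fetched programmatically. The sketch below assumes the UniProt REST streaming endpoint at `rest.uniprot.org/uniprotkb/stream` and the `proteome:` query field; both should be checked against the current UniProt API documentation. The helper names and output file names are hypothetical, and the download function is not called here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed UniProt REST endpoint for streaming UniProtKB query results.
UNIPROT_STREAM = "https://rest.uniprot.org/uniprotkb/stream"

def build_proteome_fasta_url(proteome_id):
    """Build a URL requesting all UniProtKB entries of one proteome as FASTA."""
    params = {"query": f"proteome:{proteome_id}", "format": "fasta"}
    return f"{UNIPROT_STREAM}?{urlencode(params)}"

def download_proteome(proteome_id, out_path):
    """Download the FASTA file for a proteome (requires network access)."""
    with urlopen(build_proteome_fasta_url(proteome_id)) as response:
        data = response.read()
    with open(out_path, "wb") as handle:
        handle.write(data)

# Proteome IDs as given in the data availability statement.
PROTEOMES = {"SULIS": "UP000013006", "ECOLI": "UP000000625"}
for name, proteome_id in PROTEOMES.items():
    print(name, build_proteome_fasta_url(proteome_id))
    # download_proteome(proteome_id, f"{name}.fasta")  # uncomment to fetch
```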

Code availability

The source code of sTopMG and the proposed top-down spectra simulator are available via Zenodo at https://doi.org/10.5281/zenodo.16597445 (ref. 32) and https://doi.org/10.5281/zenodo.16597493 (ref. 33), respectively.

References

  1. Roberts, D. S. et al. Top-down proteomics. Nat. Rev. Methods Primers 4, 38 (2024).

  2. Schaffer, L. V. et al. Identification and quantification of proteoforms by mass spectrometry. Proteomics 19, 1800361 (2019).

  3. Larsen, S. C. et al. Proteome-wide analysis of arginine monomethylation reveals widespread occurrence in human cells. Sci. Signal. 9, aaf7329 (2016).

  4. Karabacak, N. M. et al. Sensitive and specific identification of wild type and variant proteins from 8 to 669 kDa using top-down mass spectrometry. Mol. Cell. Proteom. 8, 846–856 (2009).

  5. Zamdborg, L. et al. ProSight PTM 2.0: improved protein identification and characterization for top down mass spectrometry. Nucleic Acids Res. 35, W701–W706 (2007).

  6. Liu, X. et al. Protein identification using top-down spectra. Mol. Cell. Proteom. 11, M111.008524 (2012).

  7. Liu, X. et al. Identification of ultramodified proteins using top-down tandem mass spectra. J. Proteome Res. 12, 5830–5838 (2013).

  8. Kou, Q. et al. A mass graph-based approach for the identification of modified proteoforms using top-down tandem mass spectra. Bioinformatics 33, 1309–1316 (2017).

  9. Kou, Q., Xun, L. & Liu, X. TopPIC: a software tool for top-down mass spectrometry-based proteoform identification and characterization. Bioinformatics 32, 3495–3497 (2016).

  10. Zhan, Z. & Wang, L. Proteoform identification based on top-down tandem mass spectra with peak error corrections. Brief. Bioinformatics 23, bbab599 (2022).

  11. Zhan, Z. & Wang, L. Fast peak error correction algorithms for proteoform identification using top-down tandem mass spectra. Bioinformatics 40, btae149 (2024).

  12. UniProt Consortium. UniProt: the universal protein knowledgebase in 2025. Nucleic Acids Res. 53, D609–D617 (2025).

  13. Kou, Q., Wu, S. & Liu, X. Systematic evaluation of protein sequence filtering algorithms for proteoform identification using top-down mass spectrometry. Proteomics 18, 1700306 (2018).

  14. Chick, J. M. et al. A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. Nat. Biotechnol. 33, 743–749 (2015).

  15. Mann, M. & Wilm, M. Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal. Chem. 66, 4390–4399 (1994).

  16. Yang, R. et al. A spectrum graph-based protein sequence filtering algorithm for proteoform identification by top-down mass spectrometry. In 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 222–229 (IEEE, 2017).

  17. Deng, F., Wang, L. & Liu, X. An efficient algorithm for the blocked pattern matching problem. Bioinformatics 31, 532–538 (2014).

  18. Sun, R.-X. et al. pTop 1.0: A high-accuracy and high-efficiency search engine for intact protein identification. Anal. Chem. 88, 3082–3090 (2016).

  19. Park, J. et al. Informed-proteomics: open-source software package for top-down proteomics. Nat. Methods 14, 909–914 (2017).

  20. Tabb, D. L. et al. Comparing top-down proteoform identification: deconvolution, PrSM overlap, and PTM detection. J. Proteome Res. 22, 2199–2217 (2023).

  21. Awan, M. G. & Saeed, F. Mass-simulator: a highly configurable simulator for generating MS/MS datasets for benchmarking of proteomics algorithms. Proteomics 18, 1800206 (2018).

  22. Liu, X. et al. Deconvolution and database search of complex tandem mass spectra of intact proteins. Mol. Cell. Proteom. 9, 2772–2782 (2010).

  23. Young, A., Röst, H. & Wang, B. Tandem mass spectrum prediction for small molecules using graph transformers. Nat. Mach. Intell. 6, 404–416 (2024).

  24. TopPIC Suite (TopPic, 2023).

  25. Chambers, M. C. et al. A cross-platform toolkit for mass spectrometry and proteomics. Nat. Biotechnol. 30, 918–920 (2012).

  26. Jeong, K. et al. FLASHDeconv: ultrafast, high-quality feature deconvolution for top-down proteomics. Cell Syst. 10, 213–218 (2020).

  27. Kim, S., Gupta, N. & Pevzner, P. A. Spectral probabilities and generating functions of tandem mass spectra: a strike against decoy databases. J. Proteome Res. 7, 3354–3363 (2008).

  28. Vorontsov, E. A., Rensen, E., Prangishvili, D., Krupovic, M. & Chamot-Rooke, J. Abundant lysine methylation and N-terminal acetylation in Sulfolobus islandicus revealed by bottom-up and top-down proteomics. Mol. Cell. Proteom. 15, 3388–3404 (2016).

  29. Mus musculus (Mouse) (UniProt, 2025); https://www.uniprot.org/uniprotkb?dir=ascend&facets=model_organism%3A10090&query=antibody&sort=accession

  30. Vizcaíno, J. A. et al. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res. 44, D447–D456 (2016).

  31. Li, K., Shan, B., Xin, L., Li, M. & Wang, L. Datasets and databases for proteoform search from protein database with top-down mass spectra. figshare https://doi.org/10.6084/m9.figshare.29634140.v2 (2025).

  32. Li, K., Shan, B., Xin, L., Li, M. & Wang, L. sTopMG. Zenodo https://doi.org/10.5281/zenodo.16597445 (2025).

  33. Li, K., Shan, B., Xin, L., Li, M. & Wang, L. TopDown-MaSS-Simulator. Zenodo https://doi.org/10.5281/zenodo.16597493 (2025).

Acknowledgements

We thank the reviewers for their valuable suggestions and comments. L.W. was supported by the National Science Foundation of China (NSFC grant no. 61972329) and the General Research Fund (GRF) grants for Hong Kong Special Administrative Region, P. R. China (grant nos. CityU 11218821 and CityU 11218423). M.L. was supported by the National Key Research and Development Program of China (grant no. 2024YFA1306400) and the Natural Sciences and Engineering Research Council of Canada (NSERC) (grant no. OGP0046506).

Author information

Contributions

K.L. focused on computer programming and was heavily involved in the design of the algorithm, the development of the pipeline for the top-down spectra simulator, and the computation of the experimental results. L.W. oversaw all aspects of the paper, particularly algorithm design, paper writing, design of the pipeline for the top-down spectra simulator, and experimental design. M.L. was involved in all aspects of the paper and oversaw the generation of the annotated real datasets by B.S. and L.X.

Corresponding authors

Correspondence to Ming Li or Lusheng Wang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Jianxin Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Kaitlin McCardle, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 The match gap distributions for real dataset and simulated dataset.

Segment lengths from 1 to 500 residues are divided into 50 ranges of 10 residues each, and 40 simulated spectra are generated for each range. a, The match gap distributions of the real and simulated datasets for matched segment lengths of 20–150 residues; there are 1,212 real spectra and 520 simulated spectra in total. b, The match gap distributions of the real and simulated datasets for alignment-covered segment lengths of 150–420 residues; there are 423 real spectra and 1,080 simulated spectra in total. The horizontal axes indicate the match gaps (up to 60 amino acids). Each curve displays the percentage of the total number of matches in the alignments that have the specific match gap. The yellow curves represent the match gap distribution of the real dataset, and the red curves represent the match gap distribution of the simulated dataset. For match gaps larger than 60 amino acids, the vertical values of the curves are 0 or very close to 0.
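
The simulated spectrum counts in this caption can be checked with a short calculation. The sketch below assumes that each panel's length interval excludes its lower bound and includes its upper bound (so panel a covers the ranges 21–30 through 141–150); the caption does not state the boundary convention, and the function names are hypothetical.

```python
SPECTRA_PER_RANGE = 40  # simulated spectra generated per length range

def length_ranges(max_len=500, width=10):
    """Split 1..max_len into consecutive inclusive ranges of the given width."""
    return [(lo, lo + width - 1) for lo in range(1, max_len + 1, width)]

def ranges_within(ranges, lower_exclusive, upper_inclusive):
    """Ranges lying entirely inside (lower_exclusive, upper_inclusive]."""
    return [(lo, hi) for lo, hi in ranges
            if lo > lower_exclusive and hi <= upper_inclusive]

ranges = length_ranges()
panel_a = ranges_within(ranges, 20, 150)   # segment lengths 20-150
panel_b = ranges_within(ranges, 150, 420)  # segment lengths 150-420

print(len(ranges))                         # number of ranges
print(len(panel_a) * SPECTRA_PER_RANGE)    # simulated spectra in panel a
print(len(panel_b) * SPECTRA_PER_RANGE)    # simulated spectra in panel b
```

Under this boundary assumption the calculation gives 50 ranges, 13 × 40 = 520 simulated spectra for panel a and 27 × 40 = 1,080 for panel b, matching the numbers in the caption.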

Source data

Extended Data Fig. 2 The MMEs of alignments obtained by different methods.

MME comparison curves of sTopMG, TopMG, TopMGFast and MS-Align-E for 100 alignments of simulated dataset when the predefined error tolerance δ = 0.1 Dalton. The alignments are sorted by the MMEs of sTopMG in non-descending order.

Source data

Extended Data Fig. 3 Match gap distributions of spectra in real dataset and 3 simulated subsets.

Subset 1 has a similar percentage of small match gaps ( 6) to the real data. Subset 2 has a lower percentage of small match gaps ( 6). Subset 3 does not contain any small match gaps ( 6).

Source data

Supplementary information

Supplementary Information

Supplementary Notes 1–2, Figs. 1–9 and Tables 1–20.

Reporting Summary

Peer Review File

Supplementary Data 1

Means and variances of the IGPs for each ion and residue.

Source data

Source Data Fig. 1

Statistical source data.

Source Data Fig. 2

Statistical source data.

Source Data Extended Data Fig. 1

Statistical source data.

Source Data Extended Data Fig. 2

Statistical source data.

Source Data Extended Data Fig. 3

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, K., Shan, B., Xin, L. et al. Proteoform search from protein database with top-down mass spectra. Nat Comput Sci 5, 998–1009 (2025). https://doi.org/10.1038/s43588-025-00880-z

  • DOI: https://doi.org/10.1038/s43588-025-00880-z
