Scientific Reports
End-to-end deep attention-based multitask pipeline for predicting uncertainty-quantified peptide properties from mass spectrometry data
  • Article
  • Open access
  • Published: 13 March 2026

  • Usman Tariq1,
  • Bilal Shabbir1 &
  • Fahad Saeed1,2,3 

Scientific Reports (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Biomarkers
  • Computational biology and bioinformatics

Abstract

Mass-based filtering significantly reduces the peptide candidate pool for subsequent scoring in database search algorithms. While useful, filtering on a single property may exclude non-abundant spectra and uncharacterized peptides, potentially exacerbating the streetlight effect. Here we present ProteoRift, a novel attention-based multitask deep network that predicts multiple peptide properties (length, missed cleavages, and modification status) directly from spectra 77.8% of the time. Integrating ProteoRift into an end-to-end pipeline significantly reduces the search space compared to mass-only filtering, delivering 8x to 12x speedups while maintaining peptide deduction accuracy comparable to established algorithmic techniques. We also developed uncertainty estimation metrics that distinguish between in-distribution and out-of-distribution data (ROC-AUC 0.99) and predict which mass spectra will score highly against the correct peptide (ROC-AUC 0.94). These models and metrics are integrated in an end-to-end pipeline available at https://github.com/pcdslab/ProteoRift.
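
The contrast between mass-only filtering and filtering on several predicted properties can be illustrated with a minimal Python sketch. It is not taken from the ProteoRift code base. The `Candidate` class, the `mass_filter` and `property_filter` helpers, the tolerance, and the toy masses are all hypothetical and chosen only for illustration. The predicted properties are passed in as plain values, whereas in the pipeline they would come from the network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical database entry; field names are illustrative assumptions."""
    sequence: str
    mass: float            # toy value in Da, not a real computed peptide mass
    missed_cleavages: int
    modified: bool


def mass_filter(candidates, precursor_mass, tol_da):
    """Mass-only filtering: keep candidates within +/- tol_da of the precursor."""
    return [c for c in candidates if abs(c.mass - precursor_mass) <= tol_da]


def property_filter(candidates, length, missed_cleavages, modified):
    """Keep candidates that agree with all three predicted properties."""
    return [
        c for c in candidates
        if len(c.sequence) == length
        and c.missed_cleavages == missed_cleavages
        and c.modified == modified
    ]


# Toy candidate pool (assumed values, for illustration only).
pool = [
    Candidate("PEPTIDEK", 900.40, 0, False),
    Candidate("AKPEPTIDR", 900.45, 1, False),
    Candidate("PEPTIDER", 900.42, 0, True),
    Candidate("LONGERPEPTIDEK", 1500.70, 0, False),
]

# Step 1: mass-only filtering narrows the pool.
by_mass = mass_filter(pool, precursor_mass=900.42, tol_da=0.05)

# Step 2: properties that a model might predict from the spectrum
# (hard-coded here) narrow it further.
by_props = property_filter(by_mass, length=8, missed_cleavages=0, modified=False)

print(len(pool), len(by_mass), len(by_props))
```

In this toy pool the mass window alone leaves three candidates to score, while adding the three properties leaves one. A wrong property prediction would discard the correct peptide, which is presumably where the uncertainty estimates described in the abstract become relevant.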

Data availability

No new data were collected for this study. For experimentation and evaluation of proteomics search, we collected datasets from ProteomeXchange via the PRIDE Archive (PXD000612, PXD001468, PXD009861, PXD019774, PXD026295, PXD010595) and from MassIVE (msv000082031), then pre-processed and wrangled them into a format suitable for machine-learning workflows. Proteome database files were downloaded from the RefUP++ database (https://github.com/miinslin/ProteoStorm) and from UniProt (proteome ID UP000005640, H. sapiens). The parameter, log, and result files from this study are available at https://osf.io/puefz/. Peptides with 1% FDR: https://osf.io/2sndq; peptides with 5% FDR: https://osf.io/vq4ny; unique ProteoRift peptides: https://osf.io/ke9dy; mismatch analysis: https://osf.io/t3dch; PXD019774_MSFragger_Crux_Analysis: https://osf.io/n527v; entrapment analysis: https://osf.io/xhztm, https://osf.io/6s9kc. All code, associated models, and weights are open source at https://github.com/pcdslab/ProteoRift.

Funding

This research was supported by the NIGMS of the National Institutes of Health (NIH) under award number R35GM153434. The authors were further supported by the National Science Foundation (NSF) under award number NSF OAC-2312599. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health and/or the National Science Foundation. This work used the NSF Extreme Science and Engineering Discovery Environment (XSEDE) supercomputers through allocations TG-CCR150017 and TG-ASC200004.

Author information

Authors and Affiliations

  1. Knight Foundation School of Computing and Information Sciences, Florida International University (FIU), Miami, FL, USA

    Usman Tariq, Bilal Shabbir & Fahad Saeed

  2. Biomolecular Sciences Institute (BSI), Florida International University (FIU), Miami, FL, USA

    Fahad Saeed

  3. Department of Human and Molecular Genetics, Herbert Wertheim School of Medicine, Florida International University, Miami, FL, USA

    Fahad Saeed

Contributions

FS and UT conceived and designed the project. FS, BS, and UT acquired the MS data, then wrangled and processed it to make it suitable for ML models. FS, UT, and BS analyzed the data, designed and conducted the experiments, and presented the results in a comprehensive manner. UT wrote the ML model code, scripts, and pipelines, which BS modified for further investigation and results. FS, UT, and BS wrote the paper, and all authors contributed to the revisions and approved the final version. FS contributed to all aspects of the project.

Corresponding author

Correspondence to Fahad Saeed.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

  • Supplementary Material 1 (XLSX)
  • Supplementary Material 2 (PDF)
  • Supplementary Material 3 (XLSX)
  • Supplementary Material 4 (XLSX)
  • Supplementary Material 5 (XLSX)
  • Supplementary Material 6 (XLSX)
  • Supplementary Material 7 (DOCX)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Tariq, U., Shabbir, B. & Saeed, F. End-to-end deep attention-based multitask pipeline for predicting uncertainty-quantified peptide properties from mass spectrometry data. Sci Rep (2026). https://doi.org/10.1038/s41598-026-43215-2

  • Received: 19 September 2025

  • Accepted: 02 March 2026

  • Published: 13 March 2026

  • DOI: https://doi.org/10.1038/s41598-026-43215-2

Keywords

  • Deep learning
  • Bioinformatics
  • Mass spectrometry
  • Uncertainty