Abstract
Mass-based filtering significantly reduces the peptide candidate pool for subsequent scoring in database search algorithms. While useful, filtering on a single property may exclude non-abundant spectra and uncharacterized peptides, potentially exacerbating the streetlight effect. Here we present ProteoRift, a novel attention-based multitask deep network that predicts multiple peptide properties (length, missed cleavages, and modification status) directly from spectra, doing so correctly 77.8% of the time. Integrating ProteoRift into an end-to-end pipeline significantly reduces the search space compared to mass-only filtering, delivering 8x to 12x speedups while maintaining peptide deduction accuracy comparable to established algorithmic techniques. We also developed uncertainty estimation metrics that distinguish in-distribution from out-of-distribution data (ROC-AUC 0.99) and predict which mass spectra will score highly against the correct peptide (ROC-AUC 0.94). These models and metrics are integrated in an end-to-end pipeline available at https://github.com/pcdslab/ProteoRift.
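The search-space reduction described above can be illustrated with a minimal sketch. It is not ProteoRift's implementation: the function names, the 500 Da open-search window, and the toy candidate list are assumptions made for illustration. The "predicted" length and missed-cleavage count are hard-coded stand-ins for network outputs, and modification status is omitted for brevity.

```python
# Illustrative sketch only: compares mass-only filtering with filtering on
# additional (here hard-coded) predicted peptide properties.

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the residue sum


def peptide_mass(seq: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def missed_cleavages(seq: str) -> int:
    """Count internal tryptic sites (K or R not followed by P)."""
    return sum(
        1 for i in range(len(seq) - 1)
        if seq[i] in "KR" and seq[i + 1] != "P"
    )


def mass_filter(candidates, precursor_mass, tol_da):
    """Keep candidates whose mass lies within tol_da of the precursor."""
    return [p for p in candidates
            if abs(peptide_mass(p) - precursor_mass) <= tol_da]


def property_filter(candidates, pred_len, pred_mc, len_tol=0):
    """Keep candidates matching the predicted length and missed cleavages."""
    return [p for p in candidates
            if abs(len(p) - pred_len) <= len_tol
            and missed_cleavages(p) == pred_mc]


# Toy database and a query whose true peptide is PEPTIDEK.
candidates = ["PEPTIDEK", "PEPTKIDEK", "GASPVTCK", "AAAK", "WWWWWWWWWWK"]
precursor = peptide_mass("PEPTIDEK")

# Wide (open-search style) window: mass alone leaves several candidates.
after_mass = mass_filter(candidates, precursor, tol_da=500.0)

# Hypothetical predictions standing in for model outputs: length 8, 0 missed cleavages.
after_props = property_filter(after_mass, pred_len=8, pred_mc=0)

print(len(candidates), len(after_mass), len(after_props))
```

In this toy case the property filter removes a candidate that the mass window alone keeps, which is the kind of pruning that reduces the number of peptides passed to scoring. Real gains depend on the database, the tolerance, and the accuracy of the predictions.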
Data availability
No new data were collected for this study. We obtained publicly available datasets from ProteomeXchange via the PRIDE Archive (PXD000612, PXD001468, PXD009861, PXD019774, PXD026295, PXD010595) and from MassIVE (MSV000082031), then pre-processed and wrangled them into a format suitable for machine-learning workflows for our experiments and evaluation of proteomics search. Proteome database files were downloaded from the RefUP++ database (https://github.com/miinslin/ProteoStorm) and from UniProt (proteome ID UP000005640, H. sapiens). The parameter, log, and result files for this study are available at https://osf.io/puefz/. Peptides with 1% FDR: https://osf.io/2sndq; Peptides with 5% FDR: https://osf.io/vq4ny; Unique Proteorift Peptides: https://osf.io/ke9dy; Mismatch Analysis: https://osf.io/t3dch; PXD019774_MSFragger_Crux_Analysis: https://osf.io/n527v; Entrapment analysis: https://osf.io/xhztm, https://osf.io/6s9kc. All code, associated models, and weights are open source at https://github.com/pcdslab/ProteoRift.
Funding
This research was supported by the NIGMS of the National Institutes of Health (NIH) under award number R35GM153434. The authors were further supported by the National Science Foundation (NSF) under award number NSF OAC-2312599. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health and/or the National Science Foundation. This work used the NSF Extreme Science and Engineering Discovery Environment (XSEDE) supercomputers through allocations TG-CCR150017 and TG-ASC200004.
Author information
Contributions
FS and UT conceived and designed the project. FS, BS and UT acquired the MS data, wrangled and processed the data to make it suitable for ML models. FS, UT and BS analyzed the data, designed the experiments, conducted the experiments, and presented the results in a comprehensive manner. UT wrote the ML model code, scripts and pipelines which were modified by BS for further investigation and results. FS, UT, and BS wrote the paper, and all authors contributed to the revisions and approved the final version. FS contributed to all aspects of the project.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Tariq, U., Shabbir, B. & Saeed, F. End-to-end deep attention-based multitask pipeline for predicting uncertainty-quantified peptide properties from mass spectrometry data. Sci Rep (2026). https://doi.org/10.1038/s41598-026-43215-2
DOI: https://doi.org/10.1038/s41598-026-43215-2