Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • Brief Communication
  • Published:

ProHap enables human proteomic database generation accounting for population diversity

Abstract

Amid the advances in genomics, the availability of large reference panels of human haplotypes is key to account for human diversity within and across populations. However, mass spectrometry-based proteomics does not benefit from this information. To address this gap, we introduce ProHap, a Python-based tool that constructs protein sequence databases from phased genotypes of reference panels. ProHap enables researchers to account for haplotype diversity in proteomic searches.

This is a preview of subscription content, access via your institution

Access options

Buy this article

USD 39.95

Prices may be subject to local taxes which are calculated during checkout

Fig. 1: Overview of the ProHap workflow.
Fig. 2: Impact of genetic variation in the populations of the 1000 Genomes Project on the proteomic search space.

Similar content being viewed by others

Data availability

The six databases of protein sequences derived from the phased genotypes of the 1000 Genomes Project, along with all necessary metadata, are available at https://doi.org/10.5281/zenodo.10149277. The database of protein sequences derived from Release 1.1 of the Haplotype Reference Consortium, along with metadata, is available at https://doi.org/10.5281/zenodo.12671301. The databases of protein sequences derived from the first release of the 44 samples of the Human Pangenome Reference Consortium, along with metadata, are available at https://doi.org/10.5281/zenodo.12686818. The dataset of the 1000 Genomes Project phase 3 aligned with the GRCh38 genome reference is available from https://www.internationalgenome.org/data-portal/data-collection/grch38. The first release of the HPRC is available in various formats from https://github.com/human-pangenomics/hpp_pangenome_resources. The decomposed VCF file generated using the Minigraph-Cactus method aligned with the GRCh38 genome reference was used in this work. The Haplotype Reference Consortium Release 1.1 is accessible by request from the European Genome-phenome Archive at the European Bioinformatics Institute (https://www.ebi.ac.uk/ega/studies/EGAS00001001710). Data access requests are reviewed by the Data Access Committee at the Wellcome Trust Sanger Institute. The UniProt and SwissProt databases are available from https://www.uniprot.org/uniprotkb?facets=model_organism%3A9606&query=*. The proteomic dataset of blood plasma samples is available through ProteomeXchange with the dataset identifier PXD004242. The complete list of peptide–spectrum matches obtained by the reprocessing is available at https://doi.org/10.5281/zenodo.12725745. The dataset of mass spectra from healthy donor hiPSC cultures is available through ProteomeXchange with the dataset identifier PXD053601. The experimental metadata have been generated using lesSDRF42 and are available through ProteomeXchange with the dataset identifier PXD053601.

Code availability

Source code, along with all documentation, is available at https://github.com/ProGenNo/ProHap (https://doi.org/10.5281/zenodo.12749956). Source code of PeptideAnnotator, along with documentation, is available at https://github.com/ProGenNo/ProHap_PeptideAnnotator (https://doi.org/10.5281/zenodo.12759867). The pipeline used downstream of ProHap to obtain the present results is available at https://github.com/ProGenNo/ProHapDatabaseAnalysis (https://doi.org/10.5281/zenodo.13910718).

References

  1. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. McCarthy, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 48, 1279–1283 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Menschaert, G. & Fenyö, D. Proteogenomics from a bioinformatics angle: a growing field. Mass Spectrom. Rev. 36, 584–599 (2017).

    Article  CAS  PubMed  Google Scholar 

  4. Nesvizhskii, A. I. Proteogenomics: concepts, applications and computational strategies. Nat. Methods 11, 1114–1125 (2014).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Wang, X. & Zhang, B. customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search. Bioinformatics 29, 3235–3237 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Umer, H. M. et al. Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides. Bioinformatics 38, 1470–1472 (2022).

    Article  CAS  PubMed  Google Scholar 

  7. Spooner, W. et al. Haplosaurus computes protein haplotypes for use in precision drug design. Nat. Commun. 9, 4128 (2018).

    Article  PubMed  PubMed Central  Google Scholar 

  8. Cao, X. & Xing, J. PrecisionProDB: improving the proteomics performance for precision medicine. Bioinformatics 37, 3361–3363 (2021).

    Article  CAS  PubMed  Google Scholar 

  9. MIT License (2024) https://choosealicense.com/licenses/mit/ (accessed 24 January 2024).

  10. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).

    Article  PubMed  Google Scholar 

  11. Lowy-Gallego, E. et al. Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project. Wellcome Open Res 4, 50 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  12. Vašíček, J. Protein haplotype sequences obtained by ProHap from the 1000 Genomes Project data set. Zenodo https://zenodo.org/records/12671237 (2024).

  13. Vašíček, J. Protein haplotype sequences obtained by ProHap from the Haplotype Reference Consortium Release 1.1 dataset. Zenodo https://doi.org/10.5281/zenodo.12671302 (2024).

  14. Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  15. Vašíček, J. Protein haplotype sequences obtained by ProHap from the Human Pangenome Reference Consortium dataset. Zenodo https://doi.org/10.5281/zenodo.12686819 (2024).

  16. Vašíček, J. et al. Finding haplotypic signatures in proteins. GigaScience 12, giad093 (2023).

    Article  PubMed Central  Google Scholar 

  17. Geyer, P. E. et al. Proteomics reveals the effects of sustained weight loss on the human plasma proteome. Mol. Syst. Biol. 12, 901 (2016).

    Article  PubMed  PubMed Central  Google Scholar 

  18. Vašíček, J. & Skiadopoulou, D. Reprocessing of the dataset ‘Plasma Proteome Profiling Reveals the Effects of Weight Loss on the Apolipoprotein Family and Systemic Inflammation Status.’ Zenodo https://doi.org/10.5281/zenodo.12725746 (2024).

  19. Bader, J. M., Albrecht, V. & Mann, M. MS-based proteomics of body fluids: the end of the beginning. Mol. Cell. Proteomics 22, 100577 (2023).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. Skiadopoulou, D. et al. Retention time and fragmentation predictors increase confidence in identification of common variant peptides. J. Proteome Res. 22, 3190–3199 (2023).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. Moreno-Estrada, A. et al. The genetics of Mexico recapitulates Native American substructure and affects biomedical traits. Science 344, 1280–1285 (2014).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  22. Fox, K. The illusion of inclusion: the ‘All of Us’ research program and Indigenous peoples’ DNA. N. Engl. J. Med. 383, 411–413 (2020).

    Article  PubMed  Google Scholar 

  23. Hudson, M. et al. Rights, interests and expectations: Indigenous perspectives on unrestricted access to genomic data. Nat. Rev. Genet. 21, 377–384 (2020).

    Article  CAS  PubMed  Google Scholar 

  24. Birney, E., Inouye, M., Raff, J., Rutherford, A. & Scally, A. The language of race, ethnicity, and ancestry in human genetic research. Preprint at https://doi.org/10.48550/arXiv.2106.10041 (2021).

  25. Cunningham, F. et al. Ensembl 2022. Nucleic Acids Res. 50, D988–D995 (2022).

    Article  CAS  PubMed  Google Scholar 

  26. Kowalski, M. H. et al. Use of >100,000 NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium whole genome sequences improves imputation quality and detection of rare variant associations in admixed African and Hispanic/Latino populations. PLoS Genet. 15, e1008500 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  27. Morales, J. et al. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature 604, 310–315 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  28. Declercq, A. et al. MS2Rescore: data-driven rescoring dramatically boosts immunopeptide identification rates. Mol. Cell. Proteomics 21(8), 100266 (2022).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  29. Hickey, G. et al. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat. Biotechnol. 42, 663–673 (2024).

    Article  CAS  PubMed  Google Scholar 

  30. Stawiński, P. & Płoski, R. Genebe.net: implementation and validation of an automatic ACMG variant pathogenicity criteria assignment. Clin. Genet. 106, 119–126 (2024).

    Article  PubMed  Google Scholar 

  31. Vaudel, M., Barsnes, H., Berven, F. S., Sickmann, A. & Martens, L. SearchGUI: an open-source graphical user interface for simultaneous OMSSA and X!Tandem searches. Proteomics 11, 996–999 (2011).

    Article  CAS  PubMed  Google Scholar 

  32. Vaudel, M. et al. PeptideShaker enables reanalysis of MS-derived proteomics data sets. Nat. Biotechnol. 33, 22–24 (2015).

    Article  CAS  PubMed  Google Scholar 

  33. Fenyö, D. & Beavis, R. C. A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Anal. Chem. 75, 768–774 (2003).

    Article  PubMed  Google Scholar 

  34. Park, C. Y., Klammer, A. A., Käll, L., MacCoss, M. J. & Noble, W. S. Rapid and accurate peptide identification from tandem mass spectra. J. Proteome Res. 7, 3022–3027 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  35. Käll, L., Canterbury, J. D., Weston, J., Noble, W. S. & MacCoss, M. J. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 4, 923–925 (2007).

    Article  PubMed  Google Scholar 

  36. Bouwmeester, R., Gabriels, R., Hulstaert, N., Martens, L. & Degroeve, S. DeepLC can predict retention times for peptides that carry as-yet unseen modifications. Nat. Methods 18, 1363–1369 (2021).

    Article  PubMed  Google Scholar 

  37. Declercq, A. et al. Updated MS2PIP web server supports cutting-edge proteomics applications. Nucleic Acids Res. 51, W338–W342 (2023).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  38. Unger, L., Mathisen, A. F., Chera, S., Legøy, T. A. & Ghila, L. The GLI code controls HNF1A levels during foregut differentiation. Int. J. Dev. Biol. https://doi.org/10.1387/ijdb.230220lg (2024).

  39. Kong, A. T., Leprevost, F. V., Avtonomov, D. M., Mellacheruvu, D. & Nesvizhskii, A. I. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat. Methods 14, 513–520 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  40. Teo, G. C., Polasky, D. A., Yu, F. & Nesvizhskii, A. I. Fast deisotoping algorithm and its implementation in the MSFragger search engine. J. Proteome Res. 20, 498–505 (2021).

    Article  CAS  PubMed  Google Scholar 

  41. Yang, K. L. et al. MSBooster: improving peptide identification rates using deep learning-based features. Nat. Commun. 14, 4539 (2023).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  42. Claeys, T. et al. lesSDRF is more: maximizing the value of proteomics data through streamlined metadata annotation. Nat. Commun. 14, 6743 (2023).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  43. Käll, L., Storey, J. D. & Noble, W. S. Non-parametric estimation of posterior error probabilities associated with peptides identified by tandem mass spectrometry. Bioinformatics 24, i42–i48 (2008).

    Article  PubMed  PubMed Central  Google Scholar 

Download references

Acknowledgements

This work was supported by the Research Council of Norway (project 301178 to M.V.), Stiftelsen Trond Mohn Foundation (Mohn Center of Diabetes Precision Medicine), and the University of Bergen. P.R.N. was supported by grants from the European Research Council (AdG 293574), Stiftelsen Trond Mohn Foundation (Mohn Center of Diabetes Precision Medicine), the University of Bergen, Haukeland University Hospital, the Research Council of Norway (FRIPRO grant 240413) and the Novo Nordisk Foundation (grant 54741). L.K. was supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. S.C. was supported by grants from Novo Nordisk Foundation (grant NNF21OC0067325) and Research Council of Norway (NFR 314397). N.B. was partially supported by the National Institutes of Health (award 5R24GM148372). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. The computations on publicly available data were performed on the Norwegian Research and Education Cloud (NREC), using resources provided by the University of Bergen and the University of Oslo (https://www.nrec.no). All analyses on HRC data were performed using digital laboratories in HUNT Cloud at the Norwegian University of Science and Technology, Trondheim, Norway. We are grateful for outstanding support from the HUNT Cloud community. The HRC data were used in a form agreed by the University of Bergen and the Wellcome Sanger Institute. This research was funded, in whole or in part, by the Research Council of Norway 301178. Mass spectrometry analyses were performed by the Proteomics Research Infrastructure (PRI) at the University of Copenhagen (UCPH), supported by the Novo Nordisk Foundation (NNF) (grant agreement number NNF19SA0059305).

Author information

Authors and Affiliations

Authors

Contributions

J.V. and M.V. designed the study, J.V. developed the software, D.S. and K.G.K. contributed to the software development and testing. L.U., S.C. and L.M.G. provided material for the stem cell experiment, K.G.K. designed the mass spectrometry experiment on stem cells, J.V., D.S., K.G.K. and M.V. analyzed the data. S.C., N.B., P.R.N., S.J., S.B., L.K. and M.V. supervised the work and contributed to the manuscript. J.V. and M.V. wrote the manuscript with input from all authors. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Marc Vaudel.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Juan Antonio Vizcaíno and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Examples of variant peptide–spectrum matches from the reanalysis of the public dataset of blood plasma samples.

Amino acid substitutions in sequences are marked in bold, posterior error probabilities are estimated by Percolator43. The top part of each plot shows the spectrum measured from the sample, with the blue peaks highlighting ions that match the theoretical spectrum. The bottom part of each plot shows the fragmentation and intensities predicted by MS2PIP37, with the green peak highlighting matched predictions, and the red peaks showing predictions that do not have a corresponding peak in the observed spectrum. Please follow these links to view the annotated spectra using the Universal Spectrum Identifier in MassIVE: Spectrum 1 (https://massive.ucsd.edu/ProteoSAFe/usi.jsp#%7B%22usi%22%3A%22mzspec%3AMSV000080596%3A20150907_QEp1_LC7_PhGe_SA_Plate2D_38_7%3Ascan%3A16943%3AWLNVQLSPR%2F2%22%7D), Spectrum 2 (https://massive.ucsd.edu/ProteoSAFe/usi.jsp#%7B%22usi%22%3A%22mzspec%3AMSV000080596%3A20150827_QEp1_LC7_PhGe_SA_Plate2C_06_4%3Ascan%3A22960%3AHPTIQGFASVAQEETEILTADMDAER%2F3%22%7D), Spectrum 3 (https://massive.ucsd.edu/ProteoSAFe/usi.jsp#%7B%22usi%22%3A%22mzspec%3AMSV000080596%3A20150618_QEp1_LC7_PhGe_SA_Plate1B_08_3%3Ascan%3A21099%3ASSMSPTTNVLLSPLSVATALSALSLGAEQR%2F3%22%7D), Spectrum 4 (https://massive.ucsd.edu/ProteoSAFe/usi.jsp#%7B%22usi%22%3A%22mzspec%3AMSV000080596%3A20150624_QEp1_LC7_PhGe_SA_Plate2A_06_2%3Ascan%3A18682%3AMSLSSFSVNRPFLFFIFEDTTGLPLFVGSVK%2F3%22%7D).

Extended Data Fig. 2 Examples of variant peptide–spectrum matches from the analysis of healthy donor stem cells.

Amino acid substitutions in sequences are marked in bold, posterior error probabilities are estimated by Percolator43. The top part of each plot shows the spectrum measured from the sample, with the blue peaks highlighting ions that match the theoretical spectrum. The bottom part of each plot shows the fragmentation and intensities predicted by MS2PIP37, with the green peak highlighting matched predictions, and the red peaks showing predictions that do not have a corresponding peak in the observed spectrum. A: Two peptide–spectrum matches covering the reference and alternative allele for a variant (rs1132979). Please follow these links to view the annotated spectra using the Universal Spectrum Identifier: Spectrum 1 (https://massive.ucsd.edu/ProteoSAFe/usi.jsp#%7B%22usi%22%3A%22mzspec%3APXD053601%3A20230601_ASC2_NEO3_X866_P244_R771_180min_uPAC110cm_27ms_DDA_S3.mzML%3Ascan%3A96472%3ANTVLATWQPYTTSK%2F2%22%7D), Spectrum 2 (https://massive.ucsd.edu/ProteoSAFe/usi.jsp#%7B%22usi%22%3A%22mzspec%3APXD053601%3A20230601_ASC2_NEO3_X866_P244_R771_180min_uPAC110cm_27ms_DDA_S3.mzML%3Ascan%3A92494%3ANTVLATWQPYSTSK%2F2%22%7D) B: Two multi-variant peptide–spectrum matches, covering the alternative allele for multiple variants co-occurring in the same haplotype of the donor. Please follow these links to view the annotated spectra using the Universal Spectrum Identifier: Spectrum 3 (https://massive.ucsd.edu/ProteoSAFe/usi.jsp#%7B%22usi%22%3A%22mzspec%3APXD053601%3A20230601_ASC2_NEO3_X866_P244_R771_180min_uPAC110cm_27ms_DDA_S2.mzML%3Ascan%3A107817%3AELSGLPSGPSAGSGPPPPPPGPPPPPVSTSSGSDESASR%2F3%22%7D), Spectrum 4 (https://massive.ucsd.edu/ProteoSAFe/usi.jsp#%7B%22usi%22%3A%22mzspec%3APXD053601%3A20230601_ASC2_NEO3_X866_P244_R771_180min_uPAC110cm_27ms_DDA_S2.mzML%3Ascan%3A147106%3ALLGLELSEAEALGADSAR%2F2%22%7D).

Extended Data Fig. 3 Overlap in peptides contained in ProHap databases and UniProt.

Bar chart showing overlap between databases generated using ProHap on diverse population panels and the canonical sequences available through UniProt. The percentages denote the proportion to the total number of tryptic peptides contained in each database generated by ProHap.

Extended Data Fig. 4 Example of a protein haplotype including four peptide categories that are expected.

Peptides are marked by the color representing the category, amino acid substitutions resulting from the alternative allele of three co-occurring genetic variants are marked by red squares. The asterisk sign (*) represents a stop codon.

Extended Data Fig. 5 Types of variants included in the ProHap database of the 1000 Genomes.

A: Histogram showing the number of included variants per type. B: Two examples of a context-dependent variant consequence. The variant synonymous in the translation of transcript 1 results in the amino acid substitution in the translation of transcript 2, or when a frameshift is introduced upstream.

Extended Data Table 1 Four examples of peptides carrying commonly detected variants
Extended Data Table 2 Number of samples per superpopulation in the 1000 Genomes Project dataset
Extended Data Table 3 Comparison between the features supported by selected proteogenomic database-generation tools

Supplementary information

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Vašíček, J., Kuznetsova, K.G., Skiadopoulou, D. et al. ProHap enables human proteomic database generation accounting for population diversity. Nat Methods 22, 273–277 (2025). https://doi.org/10.1038/s41592-024-02506-0

Download citation

  • Received:

  • Accepted:

  • Published:

  • Version of record:

  • Issue date:

  • DOI: https://doi.org/10.1038/s41592-024-02506-0

Search

Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing