Abstract
Amid the advances in genomics, the availability of large reference panels of human haplotypes is key to account for human diversity within and across populations. However, mass spectrometry-based proteomics does not benefit from this information. To address this gap, we introduce ProHap, a Python-based tool that constructs protein sequence databases from phased genotypes of reference panels. ProHap enables researchers to account for haplotype diversity in proteomic searches.
This is a preview of subscription content, access via your institution
Access options
Access Nature and 54 other Nature Portfolio journals
Get Nature+, our best-value online-access subscription
$32.99 / 30 days
cancel any time
Subscribe to this journal
Receive 12 print issues and online access
$259.00 per year
only $21.58 per issue
Buy this article
- Purchase on SpringerLink
- Instant access to the full article PDF.
USD 39.95
Prices may be subject to local taxes which are calculated during checkout


Similar content being viewed by others
Data availability
The six databases of protein sequences derived from the phased genotypes of the 1000 Genomes Project, along with all necessary metadata, are available at https://doi.org/10.5281/zenodo.10149277. The database of protein sequences derived from Release 1.1 of the Haplotype Reference Consortium, along with metadata, is available at https://doi.org/10.5281/zenodo.12671301. The databases of protein sequences derived from the first release of the 44 samples of the Human Pangenome Reference Consortium, along with metadata, are available at https://doi.org/10.5281/zenodo.12686818. The dataset of the 1000 Genomes Project phase 3 aligned with the GRCh38 genome reference is available from https://www.internationalgenome.org/data-portal/data-collection/grch38. The first release of the HPRC is available in various formats from https://github.com/human-pangenomics/hpp_pangenome_resources. The decomposed VCF file generated using the Minigraph-Cactus method aligned with the GRCh38 genome reference was used in this work. The Haplotype Reference Consortium Release 1.1 is accessible by request from the European Genome-phenome Archive at the European Bioinformatics Institute (https://www.ebi.ac.uk/ega/studies/EGAS00001001710). Data access requests are reviewed by the Data Access Committee at the Wellcome Trust Sanger Institute. The UniProt and SwissProt databases are available from https://www.uniprot.org/uniprotkb?facets=model_organism%3A9606&query=*. The proteomic dataset of blood plasma samples is available through ProteomeXchange with the dataset identifier PXD004242. The complete list of peptide–spectrum matches obtained by the reprocessing is available at https://doi.org/10.5281/zenodo.12725745. The dataset of mass spectra from healthy donor hiPSC cultures is available through ProteomeXchange with the dataset identifier PXD053601. The experimental metadata have been generated using lesSDRF42 and are available through ProteomeXchange with the dataset identifier PXD053601.
Code availability
Source code, along with all documentation, is available at https://github.com/ProGenNo/ProHap (https://doi.org/10.5281/zenodo.12749956). Source code of PeptideAnnotator, along with documentation, is available at https://github.com/ProGenNo/ProHap_PeptideAnnotator (https://doi.org/10.5281/zenodo.12759867). The pipeline used downstream of ProHap to obtain the present results is available at https://github.com/ProGenNo/ProHapDatabaseAnalysis (https://doi.org/10.5281/zenodo.13910718).
References
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).
McCarthy, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 48, 1279–1283 (2016).
Menschaert, G. & Fenyö, D. Proteogenomics from a bioinformatics angle: a growing field. Mass Spectrom. Rev. 36, 584–599 (2017).
Nesvizhskii, A. I. Proteogenomics: concepts, applications and computational strategies. Nat. Methods 11, 1114–1125 (2014).
Wang, X. & Zhang, B. customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search. Bioinformatics 29, 3235–3237 (2013).
Umer, H. M. et al. Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides. Bioinformatics 38, 1470–1472 (2022).
Spooner, W. et al. Haplosaurus computes protein haplotypes for use in precision drug design. Nat. Commun. 9, 4128 (2018).
Cao, X. & Xing, J. PrecisionProDB: improving the proteomics performance for precision medicine. Bioinformatics 37, 3361–3363 (2021).
MIT License (2024) https://choosealicense.com/licenses/mit/ (accessed 24 January 2024).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Lowy-Gallego, E. et al. Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project. Wellcome Open Res 4, 50 (2019).
Vašíček, J. Protein haplotype sequences obtained by ProHap from the 1000 Genomes Project data set. Zenodo https://zenodo.org/records/12671237 (2024).
Vašíček, J. Protein haplotype sequences obtained by ProHap from the Haplotype Reference Consortium Release 1.1 dataset. Zenodo https://doi.org/10.5281/zenodo.12671302 (2024).
Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023).
Vašíček, J. Protein haplotype sequences obtained by ProHap from the Human Pangenome Reference Consortium dataset. Zenodo https://doi.org/10.5281/zenodo.12686819 (2024).
Vašíček, J. et al. Finding haplotypic signatures in proteins. GigaScience 12, giad093 (2023).
Geyer, P. E. et al. Proteomics reveals the effects of sustained weight loss on the human plasma proteome. Mol. Syst. Biol. 12, 901 (2016).
Vašíček, J. & Skiadopoulou, D. Reprocessing of the dataset ‘Plasma Proteome Profiling Reveals the Effects of Weight Loss on the Apolipoprotein Family and Systemic Inflammation Status.’ Zenodo https://doi.org/10.5281/zenodo.12725746 (2024).
Bader, J. M., Albrecht, V. & Mann, M. MS-based proteomics of body fluids: the end of the beginning. Mol. Cell. Proteomics 22, 100577 (2023).
Skiadopoulou, D. et al. Retention time and fragmentation predictors increase confidence in identification of common variant peptides. J. Proteome Res. 22, 3190–3199 (2023).
Moreno-Estrada, A. et al. The genetics of Mexico recapitulates Native American substructure and affects biomedical traits. Science 344, 1280–1285 (2014).
Fox, K. The illusion of inclusion: the ‘All of Us’ research program and Indigenous peoples’ DNA. N. Engl. J. Med. 383, 411–413 (2020).
Hudson, M. et al. Rights, interests and expectations: Indigenous perspectives on unrestricted access to genomic data. Nat. Rev. Genet. 21, 377–384 (2020).
Birney, E., Inouye, M., Raff, J., Rutherford, A. & Scally, A. The language of race, ethnicity, and ancestry in human genetic research. Preprint at https://doi.org/10.48550/arXiv.2106.10041 (2021).
Cunningham, F. et al. Ensembl 2022. Nucleic Acids Res. 50, D988–D995 (2022).
Kowalski, M. H. et al. Use of >100,000 NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium whole genome sequences improves imputation quality and detection of rare variant associations in admixed African and Hispanic/Latino populations. PLoS Genet. 15, e1008500 (2019).
Morales, J. et al. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature 604, 310–315 (2022).
Declercq, A. et al. MS2Rescore: data-driven rescoring dramatically boosts immunopeptide identification rates. Mol. Cell. Proteomics 21(8), 100266 (2022).
Hickey, G. et al. Pangenome graph construction from genome alignments with Minigraph-Cactus. Nat. Biotechnol. 42, 663–673 (2024).
Stawiński, P. & Płoski, R. Genebe.net: implementation and validation of an automatic ACMG variant pathogenicity criteria assignment. Clin. Genet. 106, 119–126 (2024).
Vaudel, M., Barsnes, H., Berven, F. S., Sickmann, A. & Martens, L. SearchGUI: an open-source graphical user interface for simultaneous OMSSA and X!Tandem searches. Proteomics 11, 996–999 (2011).
Vaudel, M. et al. PeptideShaker enables reanalysis of MS-derived proteomics data sets. Nat. Biotechnol. 33, 22–24 (2015).
Fenyö, D. & Beavis, R. C. A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Anal. Chem. 75, 768–774 (2003).
Park, C. Y., Klammer, A. A., Käll, L., MacCoss, M. J. & Noble, W. S. Rapid and accurate peptide identification from tandem mass spectra. J. Proteome Res. 7, 3022–3027 (2008).
Käll, L., Canterbury, J. D., Weston, J., Noble, W. S. & MacCoss, M. J. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 4, 923–925 (2007).
Bouwmeester, R., Gabriels, R., Hulstaert, N., Martens, L. & Degroeve, S. DeepLC can predict retention times for peptides that carry as-yet unseen modifications. Nat. Methods 18, 1363–1369 (2021).
Declercq, A. et al. Updated MS2PIP web server supports cutting-edge proteomics applications. Nucleic Acids Res. 51, W338–W342 (2023).
Unger, L., Mathisen, A. F., Chera, S., Legøy, T. A. & Ghila, L. The GLI code controls HNF1A levels during foregut differentiation. Int. J. Dev. Biol. https://doi.org/10.1387/ijdb.230220lg (2024).
Kong, A. T., Leprevost, F. V., Avtonomov, D. M., Mellacheruvu, D. & Nesvizhskii, A. I. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat. Methods 14, 513–520 (2017).
Teo, G. C., Polasky, D. A., Yu, F. & Nesvizhskii, A. I. Fast deisotoping algorithm and its implementation in the MSFragger search engine. J. Proteome Res. 20, 498–505 (2021).
Yang, K. L. et al. MSBooster: improving peptide identification rates using deep learning-based features. Nat. Commun. 14, 4539 (2023).
Claeys, T. et al. lesSDRF is more: maximizing the value of proteomics data through streamlined metadata annotation. Nat. Commun. 14, 6743 (2023).
Käll, L., Storey, J. D. & Noble, W. S. Non-parametric estimation of posterior error probabilities associated with peptides identified by tandem mass spectrometry. Bioinformatics 24, i42–i48 (2008).
Acknowledgements
This work was supported by the Research Council of Norway (project 301178 to M.V.), Stiftelsen Trond Mohn Foundation (Mohn Center of Diabetes Precision Medicine), and the University of Bergen. P.R.N. was supported by grants from the European Research Council (AdG 293574), Stiftelsen Trond Mohn Foundation (Mohn Center of Diabetes Precision Medicine), the University of Bergen, Haukeland University Hospital, the Research Council of Norway (FRIPRO grant 240413) and the Novo Nordisk Foundation (grant 54741). L.K. was supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. S.C. was supported by grants from Novo Nordisk Foundation (grant NNF21OC0067325) and Research Council of Norway (NFR 314397). N.B. was partially supported by the National Institutes of Health (award 5R24GM148372). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. The computations on publicly available data were performed on the Norwegian Research and Education Cloud (NREC), using resources provided by the University of Bergen and the University of Oslo (https://www.nrec.no). All analyses on HRC data were performed using digital laboratories in HUNT Cloud at the Norwegian University of Science and Technology, Trondheim, Norway. We are grateful for outstanding support from the HUNT Cloud community. The HRC data were used in a form agreed by the University of Bergen and the Wellcome Sanger Institute. This research was funded, in whole or in part, by the Research Council of Norway 301178. Mass spectrometry analyses were performed by the Proteomics Research Infrastructure (PRI) at the University of Copenhagen (UCPH), supported by the Novo Nordisk Foundation (NNF) (grant agreement number NNF19SA0059305).
Author information
Authors and Affiliations
Contributions
J.V. and M.V. designed the study, J.V. developed the software, D.S. and K.G.K. contributed to the software development and testing. L.U., S.C. and L.M.G. provided material for the stem cell experiment, K.G.K. designed the mass spectrometry experiment on stem cells, J.V., D.S., K.G.K. and M.V. analyzed the data. S.C., N.B., P.R.N., S.J., S.B., L.K. and M.V. supervised the work and contributed to the manuscript. J.V. and M.V. wrote the manuscript with input from all authors. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Juan Antonio Vizcaíno and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Examples of variant peptide–spectrum matches from the reanalysis of the public dataset of blood plasma samples.
Amino acid substitutions in sequences are marked in bold, posterior error probabilities are estimated by Percolator43. The top part of each plot shows the spectrum measured from the sample, with the blue peaks highlighting ions that match the theoretical spectrum. The bottom part of each plot shows the fragmentation and intensities predicted by MS2PIP37, with the green peak highlighting matched predictions, and the red peaks showing predictions that do not have a corresponding peak in the observed spectrum. Please follow these links to view the annotated spectra using the Universal Spectrum Identifier in MassIVE: Spectrum 1 (https://massive.ucsd.edu/ProteoSAFe/usi.jsp#%7B%22usi%22%3A%22mzspec%3AMSV000080596%3A20150907_QEp1_LC7_PhGe_SA_Plate2D_38_7%3Ascan%3A16943%3AWLNVQLSPR%2F2%22%7D), Spectrum 2 (https://massive.ucsd.edu/ProteoSAFe/usi.jsp#%7B%22usi%22%3A%22mzspec%3AMSV000080596%3A20150827_QEp1_LC7_PhGe_SA_Plate2C_06_4%3Ascan%3A22960%3AHPTIQGFASVAQEETEILTADMDAER%2F3%22%7D), Spectrum 3 (https://massive.ucsd.edu/ProteoSAFe/usi.jsp#%7B%22usi%22%3A%22mzspec%3AMSV000080596%3A20150618_QEp1_LC7_PhGe_SA_Plate1B_08_3%3Ascan%3A21099%3ASSMSPTTNVLLSPLSVATALSALSLGAEQR%2F3%22%7D), Spectrum 4 (https://massive.ucsd.edu/ProteoSAFe/usi.jsp#%7B%22usi%22%3A%22mzspec%3AMSV000080596%3A20150624_QEp1_LC7_PhGe_SA_Plate2A_06_2%3Ascan%3A18682%3AMSLSSFSVNRPFLFFIFEDTTGLPLFVGSVK%2F3%22%7D).
Extended Data Fig. 2 Examples of variant peptide–spectrum matches from the analysis of healthy donor stem cells.
Amino acid substitutions in sequences are marked in bold, posterior error probabilities are estimated by Percolator43. The top part of each plot shows the spectrum measured from the sample, with the blue peaks highlighting ions that match the theoretical spectrum. The bottom part of each plot shows the fragmentation and intensities predicted by MS2PIP37, with the green peak highlighting matched predictions, and the red peaks showing predictions that do not have a corresponding peak in the observed spectrum. A: Two peptide–spectrum matches covering the reference and alternative allele for a variant (rs1132979). Please follow these links to view the annotated spectra using the Universal Spectrum Identifier: Spectrum 1 (https://massive.ucsd.edu/ProteoSAFe/usi.jsp#%7B%22usi%22%3A%22mzspec%3APXD053601%3A20230601_ASC2_NEO3_X866_P244_R771_180min_uPAC110cm_27ms_DDA_S3.mzML%3Ascan%3A96472%3ANTVLATWQPYTTSK%2F2%22%7D), Spectrum 2 (https://massive.ucsd.edu/ProteoSAFe/usi.jsp#%7B%22usi%22%3A%22mzspec%3APXD053601%3A20230601_ASC2_NEO3_X866_P244_R771_180min_uPAC110cm_27ms_DDA_S3.mzML%3Ascan%3A92494%3ANTVLATWQPYSTSK%2F2%22%7D) B: Two multi-variant peptide–spectrum matches, covering the alternative allele for multiple variants co-occurring in the same haplotype of the donor. Please follow these links to view the annotated spectra using the Universal Spectrum Identifier: Spectrum 3 (https://massive.ucsd.edu/ProteoSAFe/usi.jsp#%7B%22usi%22%3A%22mzspec%3APXD053601%3A20230601_ASC2_NEO3_X866_P244_R771_180min_uPAC110cm_27ms_DDA_S2.mzML%3Ascan%3A107817%3AELSGLPSGPSAGSGPPPPPPGPPPPPVSTSSGSDESASR%2F3%22%7D), Spectrum 4 (https://massive.ucsd.edu/ProteoSAFe/usi.jsp#%7B%22usi%22%3A%22mzspec%3APXD053601%3A20230601_ASC2_NEO3_X866_P244_R771_180min_uPAC110cm_27ms_DDA_S2.mzML%3Ascan%3A147106%3ALLGLELSEAEALGADSAR%2F2%22%7D).
Extended Data Fig. 3 Overlap in peptides contained in ProHap databases and UniProt.
Bar chart showing overlap between databases generated using ProHap on diverse population panels and the canonical sequences available through UniProt. The percentages denote the proportion to the total number of tryptic peptides contained in each database generated by ProHap.
Extended Data Fig. 4 Example of a protein haplotype including four peptide categories that are expected.
Peptides are marked by the color representing the category, amino acid substitutions resulting from the alternative allele of three co-occurring genetic variants are marked by red squares. The asterisk sign (*) represents a stop codon.
Extended Data Fig. 5 Types of variants included in the ProHap database of the 1000 Genomes.
A: Histogram showing the number of included variants per type. B: Two examples of a context-dependent variant consequence. The variant synonymous in the translation of transcript 1 results in the amino acid substitution in the translation of transcript 2, or when a frameshift is introduced upstream.
Supplementary information
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Vašíček, J., Kuznetsova, K.G., Skiadopoulou, D. et al. ProHap enables human proteomic database generation accounting for population diversity. Nat Methods 22, 273–277 (2025). https://doi.org/10.1038/s41592-024-02506-0
Received:
Accepted:
Published:
Version of record:
Issue date:
DOI: https://doi.org/10.1038/s41592-024-02506-0


