Abstract
The National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) recently generated harmonized genomic, transcriptomic, proteomic, and clinical data for over 1,000 tumors across 10 cohorts to facilitate pan-cancer discovery research. However, comparing protein expression across CPTAC cohorts remains challenging because missing data are non-uniform and protein expression distributions vary across tumor types. To enable the cancer research community to conduct robust cross-cohort protein expression analysis, we present a curated and normalized pan-cancer protein expression dataset derived from the CPTAC pan-cancer study. Our workflow integrates systematic filtering, missing-data handling, and normalization. Specifically, we (1) developed a novel algorithm to select proteins that are robustly expressed in tumors within any CPTAC cohort; (2) applied a cohort hybrid imputation approach, based on protein expression distribution patterns, to the FragPipe protein abundance values within each cohort; and (3) calculated intensity-based absolute quantification (iBAQ) values from the protein abundance values and applied both global and smooth quantile normalization. Our analysis shows that global quantile normalization outperforms both smooth quantile normalization and no normalization, as evidenced by higher rank correlation between CPTAC and The Cancer Genome Atlas (TCGA) across cancer cohorts for selected proteins. These findings suggest that combining cohort hybrid imputation with global quantile normalization is an effective way to create a normalized CPTAC pan-cancer protein dataset, which can facilitate the study of protein expression across cancer types and accelerate cancer research.
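As an illustration of the normalization step named above, the following is a minimal sketch of standard global quantile normalization on a proteins-by-samples matrix. It is not the authors' code: the function name `global_quantile_normalize`, the toy matrix, and the tie-handling choice (ties broken by order of appearance rather than averaged) are assumptions made for this example, and the input is assumed to be a complete matrix, that is, one with missing values already imputed.

```python
import numpy as np


def global_quantile_normalize(x: np.ndarray) -> np.ndarray:
    """Force every column (sample) of a proteins x samples matrix
    onto one shared reference distribution.

    Assumption for this sketch: x contains no missing values.
    """
    x = np.asarray(x, dtype=float)
    if np.isnan(x).any():
        raise ValueError("impute missing values before normalizing")

    # 0-based rank of each value within its own column.
    # Ties are broken by order of appearance (a simplification).
    order = np.argsort(x, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable")

    # Reference distribution: mean across samples at each sorted position.
    reference = np.sort(x, axis=0).mean(axis=1)

    # Replace each value by the reference value at its rank.
    return reference[ranks]


# Toy example: 4 proteins (rows) measured in 3 samples (columns).
toy = np.array(
    [
        [5.0, 4.0, 3.0],
        [2.0, 1.0, 4.0],
        [3.0, 4.0, 6.0],
        [4.0, 2.0, 8.0],
    ]
)
normalized = global_quantile_normalize(toy)
print(normalized)
```

After this transformation every sample has the same set of values, and only the within-sample ordering of proteins differs. Cross-cohort rank comparisons depend on that within-sample ordering.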
Data availability
The following can be downloaded from https://doi.org/10.5281/zenodo.17203216: the FragPipe search output and reports; protein-level data without normalization or imputation; protein-level data after imputation (no normalization); protein-level data after imputation of protein abundance estimates under each normalization method, for the 10 CPTAC indications used in this work; and the scripts.
Acknowledgements
We gratefully acknowledge the CPTAC for providing open-source proteomics data and Deborah Shuman of AstraZeneca for editing the manuscript and formatting the figures. We also acknowledge the presentation of a portion of this work at the AACR Annual Meeting 2024 (April 5–10, 2024, San Diego; https://aacrjournals.org/cancerres/issue/84/7_Supplement).
Funding
This study was funded by AstraZeneca.
Author information
Contributions
JW and WZ conceived and designed the study. JW, XT, YW, BSP, JB, EH, and WZ analyzed and interpreted the data. All authors contributed to the writing, review, and/or revision of the manuscript, have approved the final version of the manuscript, and agree to be accountable for all aspects of the work.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, J., Tian, X., Yu, W. et al. Enabling cross-indication protein expression analysis using a curated pan-cancer dataset and a tailored workflow. Sci Rep (2026). https://doi.org/10.1038/s41598-026-44872-z