Scientific Reports
Enabling cross-indication protein expression analysis using a curated pan-cancer dataset and a tailored workflow
  • Article
  • Open access
  • Published: 23 March 2026

  • Jixin Wang1,
  • Xiaowen Tian2,
  • Wen Yu3,
  • Benjamin S. Pullman4,
  • John Bullen Jr.5,
  • Elaine Hurt5 &
  • Wenyan Zhong6 

Scientific Reports (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Cancer
  • Computational biology and bioinformatics

Abstract

The National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) recently generated harmonized genomic, transcriptomic, proteomic, and clinical data for over 1,000 tumors across 10 cohorts to facilitate pan-cancer discovery research. However, comparing protein expression across CPTAC cohorts remains challenging because missing data are non-uniform and protein expression distributions vary across tumor types. To enable the cancer research community to conduct robust cross-cohort protein expression analysis, we present a curated and normalized pan-cancer protein expression dataset derived from the CPTAC pan-cancer study. Our workflow integrates systematic filtering with several strategies for handling missing data and for normalization. We developed a novel algorithm to select proteins that are robustly expressed in tumors within any CPTAC cohort; applied a cohort hybrid imputation approach, based on protein expression distribution patterns, to FragPipe protein abundance values within each cohort; calculated intensity-based absolute quantification (iBAQ) from the protein abundance values; and applied both global and smooth quantile normalization. Our analysis shows that global quantile normalization outperforms both smooth quantile normalization and no normalization, as evidenced by higher rank correlation between CPTAC and TCGA across cancer cohorts for selected proteins. These findings suggest that combining cohort hybrid imputation with global quantile normalization is an effective way to create a normalized CPTAC pan-cancer protein dataset, which can facilitate the study of protein expression across cancer types and accelerate cancer research.
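
To illustrate the global quantile normalization step named in the abstract, the following minimal Python sketch forces every sample (column) of a complete proteins-by-samples matrix onto a shared reference distribution, taken as the rank-wise mean of the sorted columns. It is not the authors' code, which is distributed separately with the dataset. The function name `global_quantile_normalize` and the toy matrix are hypothetical, the input is assumed to hold already imputed log-scale abundances, and ties are broken by position rather than averaged.

```python
import numpy as np


def global_quantile_normalize(x):
    """Quantile-normalize the columns (samples) of a complete matrix.

    x has shape (n_proteins, n_samples) and must contain no missing
    values, e.g. log-scale abundances after imputation (assumed input).
    """
    x = np.asarray(x, dtype=float)
    if np.isnan(x).any():
        raise ValueError("impute missing values before normalizing")
    # Sort each sample independently and remember where each value came from.
    order = np.argsort(x, axis=0, kind="stable")
    sorted_x = np.take_along_axis(x, order, axis=0)
    # Reference distribution: mean across samples at each rank.
    reference = sorted_x.mean(axis=1)
    # Write the reference value for each rank back to its original position.
    out = np.empty_like(x)
    np.put_along_axis(out, order, reference[:, None], axis=0)
    return out


if __name__ == "__main__":
    # Toy matrix: 4 proteins (rows) x 3 samples (columns).
    toy = np.array([[5.0, 4.0, 3.0],
                    [2.0, 1.0, 4.0],
                    [3.0, 4.0, 6.0],
                    [4.0, 2.0, 8.0]])
    print(global_quantile_normalize(toy))
```

After this transformation every sample has an identical set of values, so only the within-sample ranking of proteins carries information. This is one reason a rank-based comparison against an external resource, such as the CPTAC versus TCGA rank correlation mentioned in the abstract, is a natural way to evaluate it.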

Data availability

The FragPipe search output and reports, the protein-level data without normalization or imputation, the protein-level data after imputation (no normalization), and the protein-level data for each normalization method applied after imputation of the protein abundance estimates, all covering the 10 CPTAC indications used in this work, can be downloaded together with the scripts from https://doi.org/10.5281/zenodo.17203216.

Acknowledgements

We gratefully acknowledge the CPTAC for providing open-source proteomics data and Deborah Shuman of AstraZeneca for editing the manuscript and formatting the figures. We also acknowledge the presentation of a portion of this work at the AACR Annual Meeting 2024 (April 5–10, 2024, San Diego; https://aacrjournals.org/cancerres/issue/84/7_Supplement).

Funding

This study was funded by AstraZeneca.

Author information

Authors and Affiliations

  1. Oncology Data Science & AI, AstraZeneca, Gaithersburg, MD, USA

    Jixin Wang

  2. Oncology Biometrics, Oncology R&D, AstraZeneca, Gaithersburg, MD, USA

    Xiaowen Tian

  3. Data Science and AI, BioPharmaceuticals R&D, AstraZeneca, Gaithersburg, MD, USA

    Wen Yu

  4. Centre for Genomics Research, Discovery Sciences, R&D, AstraZeneca, Gaithersburg, MD, USA

    Benjamin S. Pullman

  5. Oncology Targeted Delivery, Research and Early Development, Oncology R&D, AstraZeneca, Gaithersburg, MD, USA

    John Bullen Jr. & Elaine Hurt

  6. Oncology Data Science & AI, Oncology R&D, AstraZeneca, 350 5th Ave, New York, NY, 10118, USA

    Wenyan Zhong

Contributions

JW and WZ conceived and designed the study. JW, XT, WY, BSP, JB, EH, and WZ analyzed and interpreted the data. All authors contributed to the writing, review, and/or revision of the manuscript, have approved the final version of the manuscript, and agree to be accountable for all aspects of the work.

Corresponding author

Correspondence to Wenyan Zhong.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The electronic supplementary material comprises Supplementary Materials 1–9 (CSV) and Supplementary Material 10 (DOCX).

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Wang, J., Tian, X., Yu, W. et al. Enabling cross-indication protein expression analysis using a curated pan-cancer dataset and a tailored workflow. Sci Rep (2026). https://doi.org/10.1038/s41598-026-44872-z

  • Received: 30 October 2024

  • Accepted: 16 March 2026

  • Published: 23 March 2026

  • DOI: https://doi.org/10.1038/s41598-026-44872-z

Keywords

  • CPTAC
  • Proteomics
  • iBAQ
  • Normalization
  • Differential analysis
  • TCGA