Development and validation of a gene expression-based Breast Cancer Purity Score

Barreca, Marco; Dugo, Matteo; Galbardi, Barbara; Győrffy, Balázs; Valagussa, Pinuccia; Besozzi, Daniela; Viale, Giuseppe; Bianchini, Giampaolo; Gianni, Luca; Callari, Maurizio

doi:10.1038/s41698-024-00730-7

Published: 24 October 2024

npj Precision Oncology volume 8, Article number: 242 (2024)

Abstract

The prevalence of malignant cells in clinical specimens, or tumour purity, is affected by both intrinsic biological factors and extrinsic sampling bias. Molecular characterization of large clinical cohorts is typically performed on bulk samples; data analysis and interpretation can be biased by tumour purity variability. Transcription-based strategies to estimate tumour purity have been proposed, but no breast cancer specific method is available yet. We interrogated over 6000 expression profiles from 10 breast cancer datasets to develop and validate a 9-gene Breast Cancer Purity Score (BCPS). BCPS outperformed existing methods for estimating tumour content. Adjusting transcriptomic profiles using the BCPS reduces sampling bias and aids data interpretation. BCPS-estimated tumour purity improved prognostication in luminal breast cancer, correlated with pathologic complete response in on-treatment biopsies from triple-negative breast cancer patients undergoing neoadjuvant treatment and effectively stratified the risk of relapse in HER2+ residual disease post-neoadjuvant treatment.

Introduction

Clinical tumours are complex ecosystems, including neoplastic, immune and endothelial cells, fibroblasts and normal epithelium. Tumour microenvironment (TME) components functionally interact with one another and with the tumour cells, in a complex network of signals impacting tumour growth and progression, response to treatment and prevalence of the different cell types^1,2.

As a consequence, tumour purity (i.e. the relative abundance of cancer cells) could represent a biologically relevant, intrinsic tumour feature. For example, a recent study in gastric cancer showed how tumour purity can predict response to chemotherapy, providing novel insights that could help improve prognostic risk stratification and facilitate treatment decision-making³. Similarly, association of purity with major clinical and molecular features was reported in glioma⁴. On the other hand, tissue organisation and tumour heterogeneity cause spatial variability in tumour purity. In the molecular analysis of clinical tumours, only small regions of the neoplastic lesion are typically investigated (e.g. tissue sections from a core biopsy). This introduces an extrinsic variation in tumour purity caused by the sampling procedure, which, in a pan-cancer analysis, was reported to outweigh intrinsic factors⁵.

In translational research, most high throughput experiments are performed on bulk tissue samples; consequently, variability in tumour content can influence the interpretation of molecular data and clinical decisions^5,6,7,8. In genomic analyses, purity can impact detectability of somatic mutations⁹ and copy number alteration events¹⁰. In transcriptomic profiling of bulk clinical specimens, the observed profile is the result of mRNAs expressed by all different cell types present in the tumour ecosystem.

A number of gene expression-based biomarkers have been developed, with some reaching clinical implementation. In breast cancer, the transcriptomics-based PAM50 molecular subtyping has been proposed and validated¹¹ and is now used to aid clinical decisions¹². Remarkably, the Normal-like subtype is often regarded as an artifact caused by low tumour purity rather than an actual cancer cell phenotype. Similarly, the classification into all other subtypes could be biased by variability in tumour purity. We and others have previously proposed the quantification of estrogen signalling and tumour proliferation as prognostic and predictive biomarkers in breast cancer^12,13. In this case too, tumour purity could introduce a relevant bias in their quantification.

Additionally, it is becoming increasingly common to study serial samples collected during treatment¹⁴. Treatment is expected to have a major and patient-specific impact on tumour purity. Consequently, interpretation of differences observed between pre- and post-treatment samples could be significantly biased by changes in tumour content. At the same time, treatment-induced changes in tumour purity could represent valuable information in predicting treatment response and outcome.

Tumour purity is generally estimated by the pathologists through visual or image analysis of tumour sections and, while this method is considered the gold standard, high inter-pathologist variability and discordance have been reported in different studies^6,15.

A few computational methods have been proposed to use molecular information for tumour purity estimation, such as gene expression^16,17, genomic⁹ or DNA methylation profiles⁹. These methods have the potential to bypass pathology-assessment of tumour cellularity and estimate tumour purity in the very same sample used to derive the molecular profile. One of the commonly used transcriptomics-based methods is ESTIMATE¹⁸, which was developed pan-cancer by combining an immune and a stroma score, but not including any tumour related genes.

In this study, we exploited over 6000 samples from 10 breast cancer datasets to systematically quantify the impact of intrinsic and extrinsic factors on cellularity and to generate and validate a Breast Cancer Purity Score (BCPS). BCPS outperformed ESTIMATE in quantifying tumour purity and can be successfully used to adjust for tumour purity variability when extrinsic factors are prevalent. BCPS can also capture treatment-induced changes carrying predictive and prognostic information.

Methods

Dataset collection and processing

To derive and validate the BCPS, a total of 10 gene expression datasets were collected and interrogated. Information on dataset source, sample usage, sample features and data processing is detailed in Table 1.

Table 1 Description of gene expression datasets interrogated to derive, validate and apply the BCPS

All ethical approvals and consents to participate have been obtained as previously reported^19,20,21,22.

BCPS identification and quantification

To identify a consensus list of genes to be used as reporters of tumour cellularity, we applied tailored statistical analysis to four distinct breast cancer datasets as summarized in Fig. 2. From the intersection of candidate genes obtained from each of the four interrogated datasets (i.e. Bruna PDX²⁴, NA-PHER2¹⁹, NeoTRIP²¹, Metzger-Filho²⁵), 5 genes consistently associated with tumour content and 4 consistently associated with stroma content were selected. The two genesets were used together to perform a directional single-sample geneset enrichment analysis as implemented in the singscore (version 1.14.0) R package²⁶. Tumour genes were considered as ‘up’ and stroma genes as ‘down’; this way, high expression of tumour genes and low expression of stroma genes will lead to the highest score. Consequently, a higher score indicates higher tumour purity.
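The directional logic described above can be illustrated with a minimal Python sketch. It is a simplified rank-based score written for illustration and does not reproduce the singscore implementation: genes are ranked within a single sample, and the score is the mean scaled rank of the tumour ('up') genes minus the mean scaled rank of the stroma ('down') genes. The gene names are those reported in the Results; the function names, the background genes and all expression values are hypothetical.

```python
# Illustrative only: a simplified directional rank score, not the singscore algorithm itself.
TUMOUR_GENES = ["AP1M2", "CDK5", "PAFAH1B3", "SLC25A10", "SMG5"]  # treated as 'up'
STROMA_GENES = ["CXCL12", "IFFO1", "MFAP4", "TGFBR2"]             # treated as 'down'


def scaled_ranks(expression):
    """Rank genes within one sample (1 = lowest expression) and divide by the gene count.

    Ties are broken by sort order, which is adequate for a sketch.
    """
    ordered = sorted(expression, key=expression.get)
    n = len(ordered)
    return {gene: (position + 1) / n for position, gene in enumerate(ordered)}


def directional_score(expression, up_genes=TUMOUR_GENES, down_genes=STROMA_GENES):
    """Mean scaled rank of the 'up' genes minus mean scaled rank of the 'down' genes."""
    ranks = scaled_ranks(expression)
    up = [ranks[g] for g in up_genes if g in ranks]
    down = [ranks[g] for g in down_genes if g in ranks]
    if not up or not down:
        raise ValueError("sample lacks genes from one of the two sets")
    return sum(up) / len(up) - sum(down) / len(down)


# Toy expression values (invented for illustration, not taken from any dataset).
background = {f"BG{i}": float(i) for i in range(1, 12)}  # hypothetical background genes
pure_sample = {**background,
               **{g: 20.0 + i for i, g in enumerate(TUMOUR_GENES)},
               **{g: 0.1 * (i + 1) for i, g in enumerate(STROMA_GENES)}}
stromal_sample = {**background,
                  **{g: 0.1 * (i + 1) for i, g in enumerate(TUMOUR_GENES)},
                  **{g: 20.0 + i for i, g in enumerate(STROMA_GENES)}}

pure_score = directional_score(pure_sample)        # tumour genes high, stroma genes low
stromal_score = directional_score(stromal_sample)  # the opposite pattern
```

Because only within-sample ranks are used, a score of this kind can be computed one sample at a time and does not depend on the other samples in the cohort.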

ESTIMATE score computation

ESTIMATE score¹⁸ was computed using the estimate (version 1.0.13) R package. The score is computed by combining a stromal and an immune score. To improve readability and comparability with the BCPS, ESTIMATE score was multiplied by −1 to have higher values indicating higher tumour purity.

Statistical analyses

Spearman’s correlation was calculated to evaluate the association between two continuous variables. Differences between correlation values obtained on the same set of data were compared using cocor (version 1.1.4) R package. Student’s t test (stats R package, version 4.1.2) and analysis of variance or ANOVA (car R package, version 3.0.14) were calculated to evaluate the statistical significance of association of a continuous variable with two or multiple classes, respectively.
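As a reminder of what Spearman's correlation measures, the following Python sketch computes it as the Pearson correlation of the ranks, with tied values receiving their average rank. It is a didactic re-implementation with hypothetical function names, not the R code used in the study.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values: a monotonic but non-linear relationship still gives rho = 1.
rho_monotonic = spearman_rho([1, 2, 3, 4], [1, 8, 27, 64])
rho_reversed = spearman_rho([1, 2, 3, 4], [9, 7, 5, 1])
```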

Variance component analysis (VCA) was computed to evaluate the impact of clinico-pathological features on tumour purity as implemented in the VCA (version 1.4.3) R package. Samples with missing values in any of the variables were excluded.

To evaluate the ability of a variable to discriminate between two classes, Receiver Operating Characteristic curve (or ROC curve) and associated area under the ROC curve (AUC) were computed, as implemented in the ROCR (version 1.1.11) R package. Differences in AUC values were statistically evaluated using pROC (version 1.18.0).
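The AUC has a direct probabilistic reading that a short Python sketch can make explicit: it is the probability that a randomly chosen sample from one class has a higher score than a randomly chosen sample from the other class, with ties counted as one half. The function name and the toy scores are hypothetical; this is not the ROCR or pROC implementation.

```python
def roc_auc(scores_positive, scores_negative):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    wins = 0.0
    for p in scores_positive:
        for q in scores_negative:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


# Toy purity scores (invented): 'high' tumour content samples versus 'low' ones.
high_content = [0.9, 0.8, 0.4]
low_content = [0.5, 0.3, 0.2]
auc = roc_auc(high_content, low_content)  # 8 of the 9 pairs are ordered correctly
```

An AUC of 0.5 corresponds to no discrimination, while values close to 0 indicate that the score separates the classes in the opposite direction.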

Differential expression analysis was performed using the limma (version 3.50.0) R package. Nominal p values were corrected for multiple testing using the Benjamini–Hochberg method. Genes with False Discovery Rate (FDR) <0.05 and absolute log fold change >1 were considered significantly differentially expressed, unless otherwise specified.
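The multiple-testing correction and the selection thresholds can be sketched in Python as follows. This is a generic implementation of the Benjamini–Hochberg step-up adjustment followed by the FDR < 0.05 and absolute log fold change > 1 filter stated above; the gene names, p values and fold changes are invented, and the code does not reproduce limma.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def significant_genes(genes, p_values, log_fc, fdr_cutoff=0.05, lfc_cutoff=1.0):
    """Keep genes with FDR below the cutoff and absolute log fold change above the cutoff."""
    fdr = benjamini_hochberg(p_values)
    return [g for g, q, lfc in zip(genes, fdr, log_fc)
            if q < fdr_cutoff and abs(lfc) > lfc_cutoff]


# Toy inputs (hypothetical genes and statistics).
genes = ["G1", "G2", "G3", "G4"]
p_values = [0.01, 0.04, 0.03, 0.005]
log_fc = [1.5, -2.0, 0.5, -0.2]
fdr = benjamini_hochberg(p_values)
selected = significant_genes(genes, p_values, log_fc)
```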

To evaluate the association with survival for single or multiple variables, univariate or multivariate Cox regression analysis as implemented in the survival (version 3.2.15) R package was performed. In the TCGA and Brueffer²⁷ datasets, overall survival (OS) was the clinical endpoint. In the NeoSPHERE²² dataset, distant event-free survival (DEFS) was considered. To quantify and compare the predictive power of distinct models, the concordance index, as implemented by stat (version 4.1.3) R package, and the 7-year area under the ROC curve (AUC), as implemented by survivalROC (version 1.0.3) R package, were computed.
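For readers unfamiliar with the concordance index, the Python sketch below shows a simplified version of Harrell's C for right-censored data: among comparable pairs (the patient with the shorter follow-up experienced the event), it counts how often the model assigns that patient the higher risk. Tied event times are ignored for brevity, all names and values are hypothetical, and the code is not the R implementation used in the study.

```python
def concordance_index(times, events, risk_scores):
    """Simplified Harrell's C: fraction of comparable pairs ordered correctly by risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy follow-up data (invented): time in years, event indicator, model risk score.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.5, 0.1]
c_good = concordance_index(times, events, risk)        # risk falls as survival lengthens
c_bad = concordance_index(times, events, risk[::-1])   # the reverse ordering
```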

Results

Association of pathology-assessed cellularity with clinico-pathological features and patient’s outcome

While tumour purity can significantly impact quantification and interpretation of transcriptomic data, it is important to appreciate that both intrinsic and extrinsic factors can influence it. To obtain insights into the relative impact of these two aspects on breast cancer sample tumour content, we investigated whether pathology-assessed tumour cellularity is associated with biologically and clinically relevant features in breast cancer. Associations were evaluated in the TCGA, Metzger-Filho, Park and NA-PHER2 datasets, where tumour purity quantification by expert pathologists was available (Table 1). We found multiple significant associations as reported in Fig. 1a. In the TCGA, the invasive lobular cancers (ILC) had lower purity than other subtypes (Fig. 1a and Supplementary Fig. 1A). At the same time, in the Metzger-Filho dataset, including only ILCs, cellularity was significantly different between distinct ILC subtypes, with the lobular Classic subtype showing the lowest cellularity (Supplementary Fig. 3B). The proliferation marker Ki67 had a weak correlation with cellularity in the Metzger-Filho dataset (ρ = 0.19, p = 0.047), but the same association was not confirmed in Park and NA-PHER2 datasets (Fig. 1a, Supplementary Figs. 2M, 3C and 4B). Notably, different subtypes were included in the different datasets. In contrast, high grade was consistently associated with higher cellularity (Fig. 1a, Supplementary Figs. 2A, 3A). Breast cancer subtypes, either defined by the combination of ER and HER2 status or according to PAM50 classification, showed a significant association with cellularity in both TCGA and Park datasets. In particular, triple-negative or basal-like tumours had, on average, the highest cellularity (Fig. 1a, Supplementary Figs. 1E, F and 2E, F).
Finally, in NA-PHER2 samples, stromal tumour-infiltrating lymphocytes (sTILs), but not intraepithelial tumour-infiltrating lymphocytes (iTILs), were negatively correlated with tumour cellularity (ρ = −0.33, p = 0.018) (Fig. 1a and Supplementary Fig. 4C, D).

**Fig. 1: Association of pathologist’s estimated tumour purity with molecular and clinico-pathological variables in breast cancer.**

To quantify the contribution of each clinico-pathological variable to the overall tumour cellularity variance, we performed a variance component analysis (VCA) for each dataset (Fig. 1b). In line with the analysis above, the histological type and grade, together with the molecular subtypes, explained the highest percentage of total variance. However, 52–100% of cellularity variance in each dataset was not explained by the included variables.

Next, we evaluated whether tumour cellularity was associated with distant metastasis free survival in the TCGA and Metzger-Filho datasets, overall and for each subtype defined by ER and HER2 status (Fig. 1c). Cellularity was significantly associated with prognosis only in lobular luminal cases of the Metzger-Filho dataset (p = 0.04).

Finally, in the Park and NA-PHER2 datasets, transcriptomic profiles were obtained from pre- and on-treatment biopsies. Because of treatment-induced tumour cell death, an overall reduction in tumour cellularity could be expected and was observed (p = 0.009 and p = 0.003, respectively). The biopsy timepoint explained 6.12% of the variance in the Park dataset and 16.3% in the NA-PHER2 dataset (Fig. 1d, e).

In summary, this analysis indicated that, in breast cancer, intrinsic tumour biology factors can affect tumour cellularity. However, over half of the variability observed in clinical specimens undergoing molecular characterization was not explained by the main clinico-pathological features and could be related to tumour sampling.

Development of a Breast Cancer Purity Score (BCPS)

Cellularity estimation by the pathologist is not always available and might not refer to the same tumour region undergoing molecular characterization. Consequently, we aimed at identifying a gene expression signature able to estimate tumour purity in a bulk transcriptomic analysis of clinical breast cancer samples. As detailed hereafter, to generate the BCPS we interrogated four distinct datasets: Bruna, NA-PHER2, NeoTRIP²¹ (surgical samples) and Metzger-Filho (Fig. 2 and Table 1).

We correlated gene expression with tumour cellularity in the NA-PHER2 (n = 92) dataset, identifying 733 genes with a positive correlation with tumour cellularity (ρ > 0.35) and 1009 genes with a negative correlation (ρ < −0.35). The same analysis was performed in the Metzger-Filho (n = 117) dataset, identifying 49 genes with a positive (ρ > 0.35) and 284 genes with a negative (ρ < −0.35) correlation with cellularity. Next, we considered surgical samples from the NeoTRIP dataset grouped into two categories based on expert pathologist evaluation: high/mid tumour content or low/no tumour content. We assessed the ability of each gene to distinguish between the two classes by ROC curve analysis, identifying 570 genes with AUC ≥ 0.7 and 330 genes with AUC < 0.3. Results from the analysis in Fig. 1 helped us define the BCPS development strategy. Indeed, the association between cellularity and breast cancer subtypes suggested performing the correlation with cellularity separately in datasets including only one subtype (Table 1).

As a complementary strategy, in the Bruna dataset, transcriptomic profiles of patient-derived xenografts (PDXs) were compared with matched clinical samples from which the xenografts originated. Since the human stroma is completely lost during engraftment and replaced by mouse stroma²⁴, genes expressed by the human tumour microenvironment are expected to be downregulated in such comparison. At the same time, tumour specific genes are expected to be similarly expressed or upregulated in PDXs. The fold changes for selected genesets, expected to be tumour specific or stroma-specific, are reported in Supplementary Figure 5A–C as proof of concept. In total, 236 stroma-specific (with logFC < −1.5) and 5679 tumour specific genes (with logFC > 0) were identified as candidate reporters of tumour content. All the statistical estimates are available in Supplementary Data 1.

From the candidate genes selected in each dataset (Supplementary Fig. 5D–G), we derived a consensus list of 5 tumour-associated (AP1M2, CDK5, PAFAH1B3, SLC25A10, SMG5) and 4 stroma-associated (CXCL12, IFFO1, MFAP4, TGFBR2) genes (Fig. 2). Their main known biological functions are summarised in Supplementary Data 2. This set of genes was used for a single-sample geneset enrichment analysis providing the BCPS, proportional to sample tumour purity, as characterised in the following section.
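The consensus step can be pictured as a per-dataset filter followed by a set intersection. The Python sketch below uses the tumour-side thresholds quoted in the text (ρ > 0.35, AUC ≥ 0.7, logFC > 0); that AUC ≥ 0.7 marks the tumour-associated side is an assumption, and the gene names, statistics and function name are placeholders rather than the actual candidate lists in Supplementary Data 1.

```python
# Tumour-side thresholds as quoted in the text; the direction of the AUC cut-off is assumed.
TUMOUR_CRITERIA = {
    "NA-PHER2": lambda rho: rho > 0.35,       # Spearman correlation with cellularity
    "Metzger-Filho": lambda rho: rho > 0.35,  # Spearman correlation with cellularity
    "NeoTRIP": lambda auc: auc >= 0.7,        # ROC AUC, high/mid vs low/no tumour content
    "Bruna": lambda log_fc: log_fc > 0,       # PDX vs matched clinical sample
}


def consensus_genes(stats_by_dataset, criteria):
    """Genes passing the dataset-specific criterion in every dataset."""
    candidate_sets = []
    for dataset, passes in criteria.items():
        stats = stats_by_dataset[dataset]
        candidate_sets.append({gene for gene, value in stats.items() if passes(value)})
    return sorted(set.intersection(*candidate_sets))


# Placeholder statistics for three hypothetical genes (invented values).
toy_stats = {
    "NA-PHER2": {"GENE_A": 0.52, "GENE_B": 0.41, "GENE_C": 0.10},
    "Metzger-Filho": {"GENE_A": 0.40, "GENE_B": 0.20, "GENE_C": 0.45},
    "NeoTRIP": {"GENE_A": 0.81, "GENE_B": 0.75, "GENE_C": 0.72},
    "Bruna": {"GENE_A": 0.3, "GENE_B": 1.2, "GENE_C": -0.4},
}
tumour_consensus = consensus_genes(toy_stats, TUMOUR_CRITERIA)  # only GENE_A passes everywhere
```

The stroma-associated list would follow the same pattern with the opposite thresholds.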

BCPS evaluation of performance and comparison with ESTIMATE score

To evaluate the ability of the BCPS to estimate tumour sample purity and to compare its performance with the commonly used ESTIMATE score¹⁸, we interrogated five additional independent datasets: TCGA, Park, METABRIC, NeoTRIP (on-treatment samples) and Bianchini²⁸ (Fig. 3 and Table 1).

**Fig. 3: BCPS validation and comparison with ESTIMATE score.**

In the TCGA and Park datasets, BCPS had a significantly higher correlation with cellularity than ESTIMATE (ρ = 0.34–0.37 vs ρ = 0.23–0.27 respectively; correlation difference p < 0.001) (Fig. 3a, b). In the METABRIC dataset, where cellularity was grouped into low, moderate and high classes, BCPS better discriminated between the groups and AUC for the prediction of high vs low cellularity was 0.75 for the BCPS and 0.62 for ESTIMATE (Supplementary Fig. 6A, B). ESTIMATE was strongly inversely correlated with a gene expression-based immune score²⁹ (ρ = −0.87 and ρ = −0.90 in TCGA and Park, respectively), while cellularity and BCPS showed weaker correlations (Supplementary Fig. 6C, D). In the TCGA dataset, Aran et al.⁵ computed multiple RNA, DNA and methylation based predictors of tumour purity. We compared their estimates with the BCPS in the set of samples where all the metrics were available. BCPS showed the highest correlation with the pathologist’s cellularity (Supplementary Fig. 6E).

The association with tumour content was also evaluated in the on-treatment biopsies of the NeoTRIP dataset (Fig. 3c, d). Samples were grouped into high/medium tumour content or low/no tumour content, as established by expert pathologists, and the ability of the two scores to discriminate between these two classes was quantified by Student’s t test and ROC curve analysis. BCPS better discriminated between the two groups, achieving a significantly higher AUC (BCPS AUC = 0.77, ESTIMATE AUC = 0.65, p < 0.001).

We then evaluated the BCPS and ESTIMATE score in the Bianchini dataset. This dataset contains matched primary breast cancer samples obtained by either fine-needle aspiration (FNA) or core-biopsy (CBX). FNA is known to enrich for tumour cells, while CBX better preserves the stromal content. Indeed, both BCPS and ESTIMATE score were higher in the FNA samples (paired Student’s t test p = 1.1 × 10⁻⁸ and p = 9 × 10⁻⁶, respectively). ROC curve analysis highlighted a significantly higher AUC for the BCPS compared to ESTIMATE score (BCPS AUC = 0.85, ESTIMATE AUC = 0.78, p = 0.046) (Fig. 3e, f).

In the Bianchini dataset, differences between matched FNA and CBX are expected to be primarily related to sampling differences affecting tumour content. Consequently, it represents a relevant setting in which to quantify such bias and evaluate the ability of a purity score to correct for it. To this aim, we performed a paired differential analysis between FNA and CBX introducing either no correction or normalising the data using the BCPS or ESTIMATE score. BCPS-normalized data were obtained by taking the residuals of the linear regression models evaluating the relationship between BCPS and each gene. An example of gene expression before and after correction is shown in Fig. 3g. Differential analysis between FNA and CBX before correction identified 60 up-regulated genes and 409 down-regulated genes. Data correction using the ESTIMATE score reduced the number of differentially expressed genes to 89 up-regulated and 160 down-regulated, but only 40 up-regulated and 65 down-regulated genes were observed in data corrected using the BCPS, showing the higher performance of the BCPS in adjusting for differences due to tumour content in clinical samples (Fig. 3h). This analysis indicated that in all situations where extrinsic factors are expected to largely outweigh intrinsic factors in affecting tumour purity, BCPS is a useful and effective tool to take tumour purity into consideration and correct for it.
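The residual-based adjustment can be sketched for a single gene in Python: expression is regressed on the purity score by ordinary least squares, and the residuals are kept as the purity-adjusted values. The function name and the toy numbers are hypothetical, and the sketch assumes one simple linear model per gene, as described above.

```python
def purity_adjusted(expression, purity_score):
    """Residuals of the least-squares fit expression ~ intercept + slope * purity_score."""
    n = len(purity_score)
    mean_x = sum(purity_score) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in purity_score)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(purity_score, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(purity_score, expression)]


# Toy data (invented): one gene fully explained by purity, one with extra variation.
score = [0.1, 0.3, 0.5, 0.7, 0.9]
gene_purity_driven = [2.0 * s + 1.0 for s in score]
gene_mixed = [1.2, 1.1, 2.4, 2.2, 3.6]

residuals_driven = purity_adjusted(gene_purity_driven, score)  # essentially all zeros
residuals_mixed = purity_adjusted(gene_mixed, score)           # purity-independent part
```

By construction the residuals have zero mean and are uncorrelated with the purity score, so any remaining difference between groups cannot be attributed to a linear effect of tumour content.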

BCPS recapitulates cellularity associations with clinico-pathological factors

For the TCGA and Park datasets, not used for the BCPS development, we evaluated the association with available clinico-pathological factors, as reported for the pathologist cellularity in Fig. 1.

BCPS was significantly associated with the same variables that were significantly associated with cellularity in both the TCGA and Park datasets (Figs. 4a and 1a). VCA analysis was quantitatively different but qualitatively similar (Figs. 4b and 1b), and the BCPS was significantly lower in on-treatment compared to pre-treatment samples in the Park dataset, as observed for the pathologist’s cellularity (Figs. 4d and 1d). Finally, the BCPS was not significantly associated with survival in any subtype in the TCGA dataset (Figs. 4c and 1c).

**Fig. 4: Association of the BCPS with clinico-pathological factors in the TCGA and Park datasets.**

While such associations would require external validation, these results support the validity of the BCPS in providing an estimate of tumour purity leading to conclusions similar to what could be drawn using the pathologist’s evaluation.

Use of the BCPS for prediction of prognosis and response to treatment

A typical goal in transcriptomic data analysis of clinical samples is the identification of genes or signatures associated with specific clinical endpoints. Quantification of expression could be affected by tumour content⁷. We previously developed and validated ER-related and proliferation-related metagenes as predictors of long-term outcome in ER+/HER2− breast cancer¹³. Here we applied the metagenes to 2277 ER+/HER2− breast cancer samples from the Brueffer dataset (Table 1). A multivariate Cox model with interactions, including the ER- and proliferation metagenes and the BCPS, explained the survival data better than a bivariate model without the BCPS (likelihood ratio test p = 0.035). The improvement was confirmed by a higher c-index and higher 7-year AUC when the BCPS was included in the model (Fig. 5a, b).

**Fig. 5: Use of the BCPS in breast cancer prognostication.**

Next, we focused on BCPS values in on-treatment biopsies of the NeoTRIP dataset, where the BCPS was significantly lower (p = 1.3 × 10⁻³) in cases eventually achieving pathological complete response or pCR (Fig. 5c). This indicates that, despite a possible sampling bias, evaluation of cellularity in on-treatment biopsies could help early prediction of patients responding or not to neoadjuvant treatments. Remarkably, this mimicked the predictive power of the pathologist’s evaluation in the same cohort, where 66.3% of patients with low/no cellularity had a pCR, versus only 35.3% of patients with mid/high cellularity in the on-treatment biopsy (p = 1.8 × 10⁻⁵).

A third example where evaluation of the BCPS could provide valuable information is in surgical samples obtained after neoadjuvant treatment. Cellularity is one of the factors determining the Residual Cancer Burden score (RCB) and contributes positively to the score³⁰. High RCB associates with worse prognosis. Here we evaluated the BCPS in post-treatment surgical samples from the neoadjuvant clinical trial NeoSPHERE²². In line with the prognostic role of pathologist’s cellularity as part of RCB score, the BCPS significantly stratified patients’ long-term risk by quantifying the amount of residual tumour left. The association remained significant either identifying 4 groups based on BCPS distribution quartiles (Supplementary Fig. 7B) or two groups using the median value (Fig. 5d).
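The two grouping schemes mentioned above (median split and quartiles) can be sketched in Python with the standard library. The function names and scores are hypothetical, and the quartile cut points follow the default method of `statistics.quantiles`, which may differ slightly from the quantile definition used in the original analysis.

```python
import statistics


def median_split(scores):
    """Label each sample 'high' or 'low' relative to the cohort median."""
    cutoff = statistics.median(scores)
    return ["high" if s > cutoff else "low" for s in scores]


def quartile_groups(scores):
    """Assign each sample to quartile group 1 (lowest) to 4 (highest)."""
    q1, q2, q3 = statistics.quantiles(scores, n=4)
    return [1 + (s > q1) + (s > q2) + (s > q3) for s in scores]


# Toy purity scores (invented) for eight post-treatment samples.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
two_groups = median_split(toy_scores)
four_groups = quartile_groups(toy_scores)
```

The resulting labels would then enter a survival comparison, for example as a grouping factor in a Cox model.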

Discussion

Multiple cell types constitute the cancer ecosystem and the prevalence of each cell type can be influenced by intrinsic and extrinsic factors. Prevalence of neoplastic cells, or tumour cellularity, is typically estimated by the pathologist on histological tissue sections. In four distinct breast cancer datasets we found significant associations between pathologist-estimated cellularity and clinico-pathological features. We observed the highest cellularity in Grade 3 and ER-negative breast cancers, and the lowest in low grade luminal A tumours. Additionally, cellularity was affected by the histological type, higher in Invasive Ductal Carcinoma (IDC) compared to Lobular but also variable between distinct Lobular subtypes. A weak negative correlation between cellularity and sTILs was observed in the NA-PHER2 trial. Overall, this confirmed that intrinsic factors have an impact on tumour cellularity in breast cancer, in line with what was previously reported in other cancer types^3,4. However, VCA analysis indicated that more than half of the variance is not explained by the investigated factors, supporting the idea that extrinsic factors, including sampling bias, can affect specimen cellularity.

Bulk transcriptomic data analysis and interpretation is directly affected by the tumour content in the specimen⁷. Depending on the specific context and aims of the analysis, if extrinsic factors are expected to prevail, it is advantageous to consider tumour cellularity in data modelling.

Tumour cellularity evaluated by the pathologist is not always available and a low concordance has been reported between pathologist estimation and estimation based on DNA or RNA profiling data⁶. This could be partially explained by spatial variability between the sample evaluated by the pathologist and the sample or sections undergoing nucleic acid extraction and quantification; an aspect particularly relevant in small core biopsies. Moreover, the pathologist quantification is partially subjective and qualitative, with a significant interobserver variability being previously reported^6,15.

Transcriptomics-based approaches to estimate tumour purity have been proposed. The most used is the ESTIMATE score by Yoshihara and colleagues¹⁸. It merges a stroma and immune signature to estimate the relative abundance of neoplastic cells and TME. It was developed to have a pan-cancer validity, but it is strongly related to immune infiltration and does not include tumour specific genes. Quon and colleagues developed ISOPure³¹, a gene expression method using normal and tumour profiles to correct the latter for normal contamination. ISOPure essentially assumes that the TME would resemble the originating normal tissue, but it is well documented that this is not the case, with stroma, immune and endothelial cells dramatically altering their phenotypes as a result of the interplay with neoplastic cells¹.

The BCPS was developed by interrogating large sample cohorts to identify the best reporter genes in a context-specific data-driven way. We covered all breast cancer subtypes, aiming at including tumour specific genes not primarily affected by the tumour subtype. We also included data obtained by different platforms; consequently, we believe that the selected consensus genes will perform well independently of the technology used.

In four independent datasets we could validate the performance of the BCPS as a reporter of tumour purity. The BCPS systematically outperformed the ESTIMATE score¹⁸. It had significantly higher correlation with pathologist’s cellularity than ESTIMATE, significantly better identified samples with low/no tumour cells and better controlled for the sampling bias in matched CBX and FNA samples. In multiple datasets, ESTIMATE showed a very strong negative correlation (ρ ≤ −0.87) with a signature estimating the immune infiltration. This could partially explain its lower performance, due to a limited ability of estimating low cellularity in cold tumours.

In the TCGA dataset, we did reproduce ESTIMATE correlation values reported by Aran et al.⁵, and the BCPS correlation with the pathologist cellularity was the highest when considering the other purity estimation metrics reported in the study (i.e. ABSOLUTE⁹ and LUMP⁵ based on genomics and methylation data, respectively). Consequently, when multiple omics data are available for the same sample, BCPS could be preferred to estimate tumour purity, for example informing a parallel genomic analysis. Correlation with the pathologist estimation was still moderate, but in line with previous reports for the reasons outlined above.

The BCPS is a simple and easy to compute score, proportional to the quantity of neoplastic cells in a clinical breast cancer specimen. As a reporter of tumour purity, the BCPS could be informative to study intrinsic factors affecting the tumour content, but it can also be used to control for tumour purity in the analysis and interpretation of bulk transcriptomic data. We showed two strategies to include the BCPS in the analyses. The first is to directly adjust genes or signatures for the BCPS-estimated effect of tumour purity on their expression levels. This strategy was effective in remarkably reducing the bias between matched CBX and FNA samples, with FNA known to be enriched in tumour cells compared to CBX. Of note, after adjusting for tumour purity, the expression levels of tumour specific genes or signatures are interpretable as the expression levels in the tumour compartment. However, TME marker genes and signatures, e.g. immune-related, are often interpreted as a proxy for a specific TME cell type prevalence. In this case, adjusting for tumour content could capture cell density more than absolute quantity, possibly providing complementary information.

A second strategy is to include the BCPS in data modelling, for example in the fitting of survival models. In this context, the other included variables could be, to some extent, biologically related to tumour purity. This was the case using ER-related and proliferation-related metagenes to predict survival in ER+/HER2− breast cancer. ER expression is associated with tumour purity, either estimated by the pathologist or using the BCPS. Similarly, the higher tumour purity in Grade 3 and triple-negative tumours, as well as the higher tumour purity in luminal B compared to luminal A tumours, suggests a biological link between proliferation and tumour content. In this case, including the BCPS as a covariate in the model was advantageous compared to a direct BCPS adjustment of the metagenes.

Finally, the BCPS has potential as a prognostic or predictive biomarker in specific contexts. In the NeoTRIP neoadjuvant trial, BCPS-estimated low tumour content in on-treatment biopsies was associated with the achievement of pCR at the end of treatment. We also found in the NeoSPHERE trial that the BCPS-estimated cellularity after neoadjuvant treatment of HER2+ breast cancer was associated with long-term outcome, mimicking the prognostic role of pathologist-estimated cellularity, which is one of the key components of the highly prognostic Residual Cancer Burden³⁰.

A possible limitation of the BCPS is that pathologist’s cellularity itself is not a perfect gold standard, since high inter-pathologist variability has been reported. To mitigate this, we based the BCPS development on the mining of multiple datasets, where the cellularity was evaluated by independent pathologists, and we added an orthogonal strategy based on a matched tumour-PDX dataset, eventually identifying the consensus genes associated with tumour content despite all sources of variability. Moreover, gene expression could be affected by external factors, e.g. treatment, potentially introducing unwanted variability to the BCPS. However, this is likely to affect only a subset of the genes in the BCPS, leading to a limited impact on the overall score.

In this study, we developed and validated a straightforward tool to estimate tumour content from bulk transcriptomic breast cancer data, useful to explore the role of tumour purity, aid data interpretation and improve prognostication. The framework presented here could be successfully applied to other cancer types.

Data availability

Reference code to implement and use the BCPS is available under Academic Free License v. 3.0 at: https://github.com/BarrecaMarco/BCPS. Repositories and IDs of publicly available datasets used in this study are reported in Table 1. All the statistical estimates used for the BCPS development are available in Supplementary Data 1. BCPS values in the NA-PHER2, NeoTRIP and NeoSphere datasets are reported in Supplementary Data 3–5. Additional information can be obtained from the corresponding authors upon reasonable request.

References

1. Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: The next generation. Cell 144, 646–674 (2011).
2. Junttila, M. R. & de Sauvage, F. J. Influence of tumour micro-environment heterogeneity on therapeutic response. Nature 501, 346–354 (2013).
3. Lou, S. et al. Comprehensive Characterization of Tumor Purity and Its Clinical Implications in Gastric Cancer. Front. Cell Dev. Biol. 9, 3843 (2022).
4. Zhang, C. et al. Tumor purity as an underlying key factor in glioma. Clin. Cancer Res. 23, 6279–6291 (2017).
5. Aran, D., Sirota, M. & Butte, A. J. Systematic pan-cancer analysis of tumour purity. Nat. Commun. 6, 8971 (2015).
6. Haider, S. et al. Systematic Assessment of Tumor Purity and Its Clinical Implications. JCO Precis. Oncol. 4, 995–1005 (2020).
7. Fisher, N. C. et al. Biological Misinterpretation of Transcriptional Signatures in Tumor Samples can Unknowingly Undermine Mechanistic Understanding and Faithful Alignment with Preclinical Data. Clin. Cancer Res. 28, OF1–OF14 (2022).
8. Rhee, J. K. et al. Impact of tumor purity on immune gene expression and clustering analyses across multiple cancer types. Cancer Immunol. Res. 6, 87–97 (2018).
9. Carter, S. L. et al. Absolute quantification of somatic DNA alterations in human cancer. Nat. Biotechnol. 30, 413–421 (2012).
10. van Loo, P. et al. Allele-specific copy number analysis of tumors. Proc. Natl Acad. Sci. USA 107, 16910–16915 (2010).
11. Bernard, P. S. et al. Supervised Risk Predictor of Breast Cancer Based on Intrinsic Subtypes. J. Clin. Oncol. 27, 1160 (2009).
12. Sparano, J. A. et al. Clinical and Genomic Risk to Guide the Use of Adjuvant Therapy for Breast Cancer. N. Engl. J. Med. 380, 2395–2405 (2019).
13. Callari, M. et al. Subtype-specific metagene-based prediction of outcome after neoadjuvant and adjuvant treatment in breast cancer. Clin. Cancer Res. 22, 337–345 (2016).
14. Park, Y. H. et al. Chemotherapy induces dynamic immune responses in breast cancers that impact treatment outcome. Nat. Commun. 11, https://doi.org/10.1038/s41467-020-19933-0 (2020).
15. Smits, A. J. J. et al. The estimation of tumor cell percentage for molecular testing by pathologists is not accurate. Mod. Pathol. 27, 168–174 (2014).
16. Anghel, C. V. et al. ISOpureR: an R implementation of a computational purification algorithm of mixed tumour profiles. BMC Bioinformatics 16, https://doi.org/10.1186/S12859-015-0597-X (2015).
17. Ahn, J. et al. DeMix: deconvolution for mixed cancer transcriptomes using raw measured data. Bioinformatics 29, 1865–1871 (2013).
18. Yoshihara, K. et al. Inferring tumour purity and stromal and immune cell admixture from expression data. Nat. Commun. 4, 1–11 (2013).
19. Gianni, L. et al. Neoadjuvant treatment with trastuzumab and pertuzumab plus palbociclib and fulvestrant in HER2-positive, ER-positive breast cancer (NA-PHER2): an exploratory, open-label, phase 2 study. Lancet Oncol. 19, 249–256 (2018).
20. Gianni, L. et al. Effects of neoadjuvant trastuzumab, pertuzumab and palbociclib on Ki67 in HER2 and ER-positive breast cancer. NPJ Breast Cancer 8, https://doi.org/10.1038/s41523-021-00377-8 (2022).
21. Gianni, L. et al. Pathologic complete response (pCR) to neoadjuvant treatment with or without atezolizumab in triple negative, early high-risk and locally advanced breast cancer. NeoTRIP Michelangelo randomized study. Ann. Oncol. https://doi.org/10.1016/j.annonc.2022.02.004 (2022).
22. Bianchini, G. et al. Immune modulation of pathologic complete response after neoadjuvant HER2-directed therapies in the NeoSphere trial. Ann. Oncol. 26, 2429–2436 (2015).
23. Weinstein, J. N. et al. The Cancer Genome Atlas Pan-Cancer Analysis Project. Nat. Genet. 45, 1113 (2013).
24. Bruna, A. et al. A Biobank of Breast Cancer Explants with Preserved Intra-tumor Heterogeneity to Screen Anticancer Compounds. Cell 167, 260–274.e22 (2016).
25. Metzger-Filho, O. et al. Genomic grade adds prognostic value in invasive lobular carcinoma. Ann. Oncol. 24, 377–384 (2013).
26. Foroutan, M. et al. Single sample scoring of molecular phenotypes. BMC Bioinforma. 19, 1–10 (2018).
27. Brueffer, C. et al. Clinical Value of RNA Sequencing-Based Classifiers for Prediction of the Five Conventional Breast Cancer Biomarkers: A Report From the Population-Based Multicenter Sweden Cancerome Analysis Network-Breast Initiative. JCO Precis. Oncol. 2, 1–18 (2018).
28. Bianchini, G. et al. Molecular anatomy of breast cancer stroma and its prognostic value in estrogen receptor-positive and -negative cancers. J. Clin. Oncol. 28, 4316–4323 (2010).
29. Jimenez-Sanchez, A., Cast, O. & Miller, M. L. Comprehensive benchmarking and integration of tumor microenvironment cell estimation methods. Cancer Res. 79, 6238–6246 (2019).
30. Symmans, W. F. et al. Long-Term Prognostic Risk After Neoadjuvant Chemotherapy Associated With Residual Cancer Burden and Breast Cancer Subtype. J. Clin. Oncol. 35, 1049–1060 (2017).
31. Quon, G. et al. Computational purification of individual tumor gene expression profiles leads to significant improvements in prognostic prediction. Genome Med. 5, 29 (2013).
32. Curtis, C. et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486, 346–352 (2012).

Acknowledgements

This work has been supported by Fondazione Michelangelo, Fondazione AIRC per la ricerca sul cancro (grant IG2018 - ID21787 to G.B.), the Breast Cancer Research Foundation (grant BCRF 21-181 to L.G.) and MUR (grant “Dipartimenti di Eccellenza 2023-2027” of the Department of Informatics, Systems and Communication of the University of Milano-Bicocca, Italy).

Author information

Authors and Affiliations

Department of Biotechnology and Biosciences, University of Milano-Bicocca, Milan, Italy
Marco Barreca
Fondazione Michelangelo, Milan, Italy
Marco Barreca, Pinuccia Valagussa, Luca Gianni & Maurizio Callari
IRCCS San Raffaele Hospital, Milan, Italy
Matteo Dugo, Barbara Galbardi, Milvia Zambetti, Stefania Zambelli & Giampaolo Bianchini
Department of Bioinformatics, Semmelweis University, H-1094, Budapest, Hungary
Balázs Győrffy
Department of Biophysics, Medical School, University of Pecs, H-7624, Pecs, Hungary
Balázs Győrffy
Cancer Biomarker Research Group, Institute of Molecular Life Sciences, Research Centre for Natural Sciences, H-1117, Budapest, Hungary
Balázs Győrffy
Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milan, Italy
Daniela Besozzi
Bicocca Bioinformatics, Biostatistics and Bioimaging (B4) Research Centre, Milan, Italy
Daniela Besozzi
European Institute of Oncology, Milan, Italy
Marco Colleoni & Giuseppe Viale
Università Vita-Salute San Raffaele, Milan, Italy
Giampaolo Bianchini
AUSL – IRCCS, Reggio Emilia, Italy
Giancarlo Bisagni
IEO, European Institute of Oncology, IRCCS, Milan, Italy
Marco Colleoni
IRCCS Ospedale Policlinico San Martino - University of Genoa, Genoa, Italy
Lucia Del Mastro
IRCCS Azienda Ospedaliero-Universitaria Bologna, Bologna, Italy
Claudio Zamagni
Udine Academic Hospital, Udine, Italy
Mauro Mansutti
Azienda Ospedaliero-Universitaria S. Anna, Ferrara, Italy
Antonio Frassoldati
Fondazione Piemontese per la Ricerca sul Cancro, Candiolo, Italy
Filippo Montemurro
Policlinico S. Orsola Malpighi SSD Oncologia Medica Addarii, Bologna, Italy
Claudio Zamagni
Università degli Studi di Genoa, Ospedale Policlinico San Martino, Genova, Italy
Lucia Del Mastro
Istituto Toscano Tumori, Grosseto, Italy
Carmelo Bengala
Fondazione IRCCS Istituto Nazionale Tumori, Milan, Italy
Gabriella Mariani
Ospedale Luigi Sacco, Milan, Italy
Anna Gambaro
AUSL IRCCS Arcispedale S. Maria Nuova, Reggio Emilia, Italy
Giancarlo Bisagni
Ospedale S. Maria della Misericordia, Udine, Italy
Stefania Russo
National Coordinator, National Taiwan University Hospital, Taipei, Taiwan
Chiun-Sheng Huang
Changhua Christian Hospital, Changhua City, Taiwan
Shou-Tung Chen
Kaohsiung Medical University Chung-Ho Memorial Hospital Cancer Center, Kaohsiung City, Taiwan
Ming Feng Hou
China Medical University Hospital, Taichung City, Taiwan
Liang-Chih Liu
Taipei Veterans General Hospital, Taipei, Taiwan
Ling Ming Tseng
National Coordinator, Mater Misericordiae University Hospital, Dublin, Ireland
Catherine Kelly
Cork University Hospital, Cork, Ireland
Seamus O’Reilly
Beaumont Hospital, Dublin, Ireland
Patrick Morris
St. James’s Hospital, Dublin, Ireland
John Kennedy
University Hospital Waterford, Waterford, Ireland
Miriam O’Connor
National Coordinator, IIIrd Medical Department with Hematology, Medical Oncology, Hemostaseology, Infectious Diseases and Rheumatology, Oncologic Center, Paracelsus Medical University Salzburg, Salzburg, Austria
Richard Greil
BrustGesundheitZentrum Tirol, Medical University Innsbruck, Innsbruck, Austria
Daniel Egle
National Coordinator, Agaplesion Markus Krankenhaus, Frankfurt, Germany
Mark Thill
Klinikum Augsburg International Patient Service, Augsburg, Germany
Jacqueline Sagasser
Frauenarzt-Zentrum-Zehlendorf, Berlin, Germany
Gerd Graffunder
Augusta-Kranken-Anstalt gGmbH Klinik, Bochum, Germany
Dirk Behringer
Bethanien-Krankenhaus Onkologisches Zentrum, Frankfurt, Germany
Hans Tesch
Gynäkologisch-Onkologische Praxis, Hannover, Germany
Hans-Joachim Lück
NCT Nationales Centrum für Tumorerkrankungen, Universitätsklinikum Heidelberg, Heidelberg, Germany
Andreas Schneeweiss
Brustzentrum St. Elisabeth-Krankenhaus, Köln, Germany
Claudia Schumacher
Uniklinik Köln Klinik und Poliklinik für Frauenheilkunde und Geburtshilfe Brustzentrum, Köln, Germany
Wolfram Malter
National Coordinator, N.N. Petrov Research Inst. Oncol, St. Petersburg, Russia
Vladimir Semiglazov
N.N. Blokhin Medical Research Center of Oncology, Moscow, Russia
Mona Frolova
Road clinical hospital of OJSC «Russian Railways», St. Petersburg, Russia
Alexander Vasiliev Gennadievich
St Petersburg Clinical Scientific Center, St. Petersburg, Russia
Nikita Volkov
National Coordinator, Hospital Clinico Universitario, Valencia, Spain
Begoña Bermejo
Hospital Duran i Reynal Institut Català d’Oncologia, Hospitalet de Llobregat, Spain
Catalina Falo
Hospital Universitario HM Sanchinarro, Madrid, Spain
Elena Sevillano
Hospital Universitario 12 de Octubre, Madrid, Spain
Eva Maria Ciruelos Gil
Hospital Clínico San Carlos, Madrid, Spain
José Ángel García Sáenz
Hospital Miguel Servet, Zaragoza, Spain
Anton Antón-Torres

Authors

Marco Barreca
Matteo Dugo
Barbara Galbardi
Balázs Győrffy
Pinuccia Valagussa
Daniela Besozzi
Giuseppe Viale
Giampaolo Bianchini
Luca Gianni
Maurizio Callari

Consortia

NA-PHER2 consortium

Luca Gianni, Giancarlo Bisagni, Marco Colleoni, Lucia Del Mastro, Claudio Zamagni, Mauro Mansutti, Milvia Zambetti & Antonio Frassoldati

NeoTRIP consortium

Luca Gianni, Filippo Montemurro, Claudio Zamagni, Lucia Del Mastro, Carmelo Bengala, Marco Colleoni, Gabriella Mariani, Anna Gambaro, Stefania Zambelli, Giampaolo Bianchini, Giancarlo Bisagni, Stefania Russo, Chiun-Sheng Huang, Shou-Tung Chen, Ming Feng Hou, Liang-Chih Liu, Ling Ming Tseng, Catherine Kelly, Seamus O’Reilly, Patrick Morris, John Kennedy, Miriam O’Connor, Richard Greil, Daniel Egle, Mark Thill, Jacqueline Sagasser, Gerd Graffunder, Dirk Behringer, Hans Tesch, Hans-Joachim Lück, Andreas Schneeweiss, Claudia Schumacher, Wolfram Malter, Vladimir Semiglazov, Mona Frolova, Alexander Vasiliev Gennadievich, Nikita Volkov, Begoña Bermejo, Catalina Falo, Elena Sevillano, Eva Maria Ciruelos Gil, José Ángel García Sáenz & Anton Antón-Torres

Contributions

M.C. conceived the study and planned the computational and statistical analyses; M.B. performed the computational and statistical analyses; M.D., B.G. and D.B. contributed to the computational analysis; P.V. curated the clinical information for the internal datasets; G.V. evaluated the tumour content in the internal datasets; M.C. and M.B. drafted the manuscript; G.B. and L.G. contributed to study design and clinical interpretation of the results. The NA-PHER2 consortium and the NeoTRIP consortium ran the respective clinical studies, collecting the samples and clinical information. All authors contributed to subsequent drafts and agreed with submission of the manuscript for publication.

Corresponding authors

Correspondence to Giampaolo Bianchini, Luca Gianni or Maurizio Callari.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary figures 1-7

Supplementary Data 1

Supplementary Data 2

Supplementary Data 3

Supplementary Data 4

Supplementary Data 5

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Barreca, M., Dugo, M., Galbardi, B. et al. Development and validation of a gene expression-based Breast Cancer Purity Score. npj Precis. Onc. 8, 242 (2024). https://doi.org/10.1038/s41698-024-00730-7

Received: 20 March 2024
Accepted: 06 October 2024
Published: 24 October 2024
DOI: https://doi.org/10.1038/s41698-024-00730-7