Abstract
The prevalence of malignant cells in clinical specimens, or tumour purity, is affected by both intrinsic biological factors and extrinsic sampling bias. Molecular characterization of large clinical cohorts is typically performed on bulk samples; data analysis and interpretation can be biased by tumour purity variability. Transcription-based strategies to estimate tumour purity have been proposed, but no breast cancer specific method is available yet. We interrogated over 6000 expression profiles from 10 breast cancer datasets to develop and validate a 9-gene Breast Cancer Purity Score (BCPS). BCPS outperformed existing methods for estimating tumour content. Adjusting transcriptomic profiles using the BCPS reduces sampling bias and aids data interpretation. BCPS-estimated tumour purity improved prognostication in luminal breast cancer, correlated with pathologic complete response in on-treatment biopsies from triple-negative breast cancer patients undergoing neoadjuvant treatment and effectively stratified the risk of relapse in HER2+ residual disease post-neoadjuvant treatment.
Introduction
Clinical tumours are complex ecosystems, including neoplastic, immune and endothelial cells, fibroblasts and normal epithelium. Tumour microenvironment (TME) components interact functionally with one another and with the tumour cells, in a complex network of signals impacting tumour growth and progression, response to treatment and prevalence of the different cell types1,2.
As a consequence, tumour purity (i.e. the relative abundance of cancer cells) could represent a biologically relevant, intrinsic tumour feature. For example, a recent study in gastric cancer showed that tumour purity can predict response to chemotherapy, providing novel insights that could help improve prognostic risk stratification and facilitate treatment decision-making3. Similarly, an association of purity with major clinical and molecular features was reported in glioma4. On the other hand, tissue organisation and tumour heterogeneity cause spatial variability in tumour purity. In the molecular analysis of clinical tumours, only small regions of the neoplastic lesion are typically investigated (e.g. tissue sections from a core biopsy). This introduces an extrinsic variation in tumour purity caused by the sampling procedure, which, in a pan-cancer analysis, was reported to outweigh intrinsic factors5.
In translational research, most high throughput experiments are performed on bulk tissue samples; consequently, variability in tumour content can influence the interpretation of molecular data and clinical decisions5,6,7,8. In genomic analyses, purity can impact detectability of somatic mutations9 and copy number alteration events10. In transcriptomic profiling of bulk clinical specimens, the observed profile is the result of mRNAs expressed by all different cell types present in the tumour ecosystem.
A number of gene expression-based biomarkers have been developed, with some reaching clinical implementation. In breast cancer, the transcriptomics-based PAM50 molecular subtyping has been proposed and validated11 and is now used to aid clinical decisions12. Remarkably, the Normal-like subtype is often regarded as an artifact of low tumour purity rather than an actual cancer cell phenotype. Similarly, the classification into all other subtypes could be biased by variability in tumour purity. We and others have previously proposed the quantification of estrogen signalling and tumour proliferation as prognostic and predictive biomarkers in breast cancer12,13. In this case too, tumour purity could introduce a relevant bias in their quantification.
Additionally, it is becoming increasingly common to study serial samples collected during treatment14. Treatment is expected to have a major and patient-specific impact on tumour purity. Consequently, interpretation of differences observed between pre- and post-treatment samples could be significantly biased by changes in tumour content. At the same time, treatment-induced changes in tumour purity could represent valuable information in predicting treatment response and outcome.
Tumour purity is generally estimated by pathologists through visual or image analysis of tumour sections. While this method is considered the gold standard, high inter-pathologist variability and discordance have been reported in different studies6,15.
A few computational methods have been proposed to use molecular information for tumour purity estimation, such as gene expression16,17, genomic9 or DNA methylation profiles9. These methods have the potential to bypass pathology-assessment of tumour cellularity and estimate tumour purity in the very same sample used to derive the molecular profile. One of the commonly used transcriptomics-based methods is ESTIMATE18, which was developed pan-cancer by combining an immune and a stroma score, but not including any tumour related genes.
In this study, we exploited over 6000 samples from 10 breast cancer datasets to systematically quantify the impact of intrinsic and extrinsic factors on cellularity and to generate and validate a Breast Cancer Purity Score (BCPS). BCPS outperformed ESTIMATE in quantifying tumour purity and can be successfully used to adjust for tumour purity variability when extrinsic factors are prevalent. BCPS can also capture treatment-induced changes carrying predictive and prognostic information.
Methods
Dataset collection and processing
To derive and validate the BCPS, a total of 10 gene expression datasets were collected and interrogated. Information on datasets source, sample usage, sample features and data processing are detailed in Table 1.
All ethical approvals and consents to participate have been obtained as previously reported19,20,21,22.
BCPS identification and quantification
To identify a consensus list of genes to be used as reporters of tumour cellularity, we applied tailored statistical analyses to four distinct breast cancer datasets, as summarised in Fig. 2. From the intersection of candidate genes obtained from each of the four interrogated datasets (i.e. Bruna PDX24, NA-PHER219, NeoTRIP21 and Metzger-Filho25), 5 genes consistently associated with tumour content and 4 consistently associated with stroma content were selected. The two genesets were used together to perform a directional single-sample geneset enrichment analysis as implemented in the singscore (version 1.14.0) R package26. Tumour genes were considered as ‘up’ and stroma genes as ‘down’; this way, high expression of tumour genes and low expression of stroma genes leads to the highest score. Consequently, a higher score indicates higher tumour purity.
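As a rough illustration of the directional scoring idea, the Python sketch below ranks the genes within one sample and subtracts the mean rank fraction of the stroma (‘down’) genes from that of the tumour (‘up’) genes. It is a simplified stand-in and not the singscore implementation, whose normalisation differs; the function name and the toy expression values are assumptions made for illustration, while the gene symbols are the consensus genes reported in the Results.

```python
# Consensus genes reported in the Results; everything else here is illustrative.
TUMOUR_GENES = ["AP1M2", "CDK5", "PAFAH1B3", "SLC25A10", "SMG5"]
STROMA_GENES = ["CXCL12", "IFFO1", "MFAP4", "TGFBR2"]


def directional_score(expression, up_genes, down_genes):
    """Mean within-sample rank fraction of 'up' genes minus that of 'down' genes.

    `expression` maps gene symbol -> expression value for a single sample.
    Ties are broken arbitrarily by sort order (a simplification).
    """
    ordered = sorted(expression, key=expression.get)  # lowest expression first
    n = len(ordered)
    rank_fraction = {gene: (i + 1) / n for i, gene in enumerate(ordered)}
    up = [rank_fraction[g] for g in up_genes if g in rank_fraction]
    down = [rank_fraction[g] for g in down_genes if g in rank_fraction]
    if not up or not down:
        raise ValueError("each gene set needs at least one measured gene")
    return sum(up) / len(up) - sum(down) / len(down)


# Toy single-sample profiles (hypothetical values, not study data).
tumour_rich = {"AP1M2": 9, "CDK5": 8, "PAFAH1B3": 7, "SLC25A10": 6, "SMG5": 5,
               "TGFBR2": 4, "MFAP4": 3, "IFFO1": 2, "CXCL12": 1}
stroma_rich = {"CXCL12": 9, "IFFO1": 8, "MFAP4": 7, "TGFBR2": 6, "SMG5": 5,
               "SLC25A10": 4, "PAFAH1B3": 3, "CDK5": 2, "AP1M2": 1}

score_tumour_rich = directional_score(tumour_rich, TUMOUR_GENES, STROMA_GENES)
score_stroma_rich = directional_score(stroma_rich, TUMOUR_GENES, STROMA_GENES)
```

In this toy setting the tumour-rich profile scores higher than the stroma-rich one, matching the convention that a higher score indicates higher tumour purity. In a real profile the ranks would be computed over all measured genes, not only the nine signature genes.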
ESTIMATE score computation
ESTIMATE score18 was computed using the estimate (version 1.0.13) R package. The score is computed by combining a stromal and an immune score. To improve readability and comparability with the BCPS, ESTIMATE score was multiplied by −1 to have higher values indicating higher tumour purity.
Statistical analyses
Spearman’s correlation was calculated to evaluate the association between two continuous variables. Differences between correlation values obtained on the same set of data were compared using cocor (version 1.1.4) R package. Student’s t test (stats R package, version 4.1.2) and analysis of variance or ANOVA (car R package, version 3.0.14) were calculated to evaluate the statistical significance of association of a continuous variable with two or multiple classes, respectively.
Variance component analysis (VCA) was computed to evaluate the impact of clinico-pathological features on tumour purity as implemented in the VCA (version 1.4.3) R package. Samples with missing values in any of the variables were excluded.
To evaluate the ability of a variable to discriminate between two classes, Receiver Operating Characteristic curve (or ROC curve) and associated area under the ROC curve (AUC) were computed, as implemented in the ROCR (version 1.1.11) R package. Differences in AUC values were statistically evaluated using pROC (version 1.18.0).
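For readers less familiar with the AUC, the following Python sketch computes it directly from its probabilistic definition: the chance that a randomly chosen sample from the positive class has a higher value than a randomly chosen sample from the negative class, with ties counted as one half. It is a minimal illustration and not the ROCR or pROC code used in the study; the function name is an assumption.

```python
def roc_auc(positive_values, negative_values):
    """Area under the ROC curve via pairwise comparison of the two classes."""
    if not positive_values or not negative_values:
        raise ValueError("both classes need at least one value")
    wins = 0.0
    for p in positive_values:
        for n in negative_values:
            if p > n:
                wins += 1.0   # positive sample correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties contribute half a win
    return wins / (len(positive_values) * len(negative_values))
```

Under this definition, an AUC near 1 means the variable is higher in the positive class, an AUC near 0 means it is higher in the negative class, and 0.5 indicates no discrimination.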
Differential expression analysis was performed using the limma (version 3.50.0) R package. Nominal p values were corrected for multiple testing using the Benjamini–Hochberg method. Genes with False Discovery Rate (FDR) <0.05 and absolute log fold change >1 were considered significantly differentially expressed, unless otherwise specified.
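The multiple-testing step can be sketched as follows. This Python snippet implements the standard Benjamini–Hochberg adjustment and the significance filter stated above (FDR < 0.05 and absolute log fold change > 1); it is an illustrative re-implementation and not the limma code used in the study, and the function names are assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity of the result.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def is_differentially_expressed(fdr, log_fc, fdr_cutoff=0.05, lfc_cutoff=1.0):
    """Apply the thresholds stated in the text to one gene."""
    return fdr < fdr_cutoff and abs(log_fc) > lfc_cutoff
```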
To evaluate the association with survival for single or multiple variables, univariate or multivariate Cox regression analysis as implemented in the survival (version 3.2.15) R package was performed. In the TCGA and Brueffer27 datasets, overall survival (OS) was the clinical endpoint. In the NeoSPHERE22 dataset, distant event-free survival (DEFS) was considered. To quantify and compare the predictive power of distinct models, the concordance index, as implemented by stat (version 4.1.3) R package, and the 7-year area under the ROC curve (AUC), as implemented by survivalROC (version 1.0.3) R package, were computed.
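To make the concordance index concrete, the sketch below computes Harrell's C for right-censored data in plain Python: among all pairs in which the subject with the shorter follow-up had an observed event, it counts how often that subject also has the higher predicted risk. This is a simplified illustration (pairs with tied times are ignored) and may differ in detail from the R implementation used in the study; the function name is an assumption.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: fraction of comparable pairs ordered correctly by risk.

    times       -- follow-up time per subject
    events      -- True/1 if the event was observed, False/0 if censored
    risk_scores -- model output, higher meaning higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is only comparable if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which two survival models can be compared.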
Results
Association of pathology-assessed cellularity with clinico-pathological features and patient’s outcome
While tumour purity can significantly impact quantification and interpretation of transcriptomic data, it is important to appreciate that both intrinsic and extrinsic factors can influence it. To obtain insights into the relative impact of these two aspects on breast cancer sample tumour content, we investigated whether pathology-assessed tumour cellularity is associated with biologically and clinically relevant features in breast cancer. Associations were evaluated in the TCGA, Metzger-Filho, Park and NA-PHER2 datasets, where tumour purity quantification by expert pathologists was available (Table 1). We found multiple significant associations as reported in Fig. 1a. In the TCGA, the invasive lobular cancers (ILC) had lower purity than other subtypes (Fig. 1a and Supplementary Fig. 1A). At the same time, in the Metzger-Filho dataset, including only ILCs, cellularity was significantly different between distinct ILC subtypes, with the lobular Classic subtype showing the lowest cellularity (Supplementary Fig. S3B). The proliferation marker Ki67 had a weak correlation with cellularity in the Metzger-Filho dataset (ρ = 0.19, p = 0.047), but the same association was not confirmed in Park and NA-PHER2 datasets (Fig. 1a, Supplementary Figs. 2M, 3C and 4B). Notably, different subtypes were included in the different datasets. In contrast, high grade was consistently associated with higher cellularity (Fig. 1a, Supplementary Figs S2A, S3A). Breast cancer subtypes, either defined by the combination of ER and HER2 status or according to PAM50 classification, showed a significant association with cellularity in both TCGA and Park datasets. In particular, triple-negative or basal-like tumours had, on average, the highest cellularity (Fig. 1a, Supplementary Figs. 1E, F and 2E, F).
Finally, in NA-PHER2 samples, stromal tumour-infiltrating lymphocytes (sTILs), but not intraepithelial tumour-infiltrating lymphocytes (iTILs), were negatively correlated with tumour cellularity (ρ = −0.33, p = 0.018) (Fig. 1a and Supplementary Fig. 4C, D).
a Landscape of association of available molecular and clinico-pathological variables with cellularity in four breast cancer datasets: TCGA (n = 1073), Metzger-Filho (n = 117), Park (n = 112) and NA-PHER2 (n = 52) pre-treatment samples. (* association between purity and continuous variables was assessed by Spearman’s correlation, association with two categorical groups was assessed by Student’s t test, and association with multiple categorical groups was evaluated by one-way ANOVA). b Variance component analysis (VCA) for each dataset computed for samples with no missing information (TCGA, n = 690; Park, n = 72; Metzger-Filho, n = 111; NA-PHER2, n = 52). The analysis estimated the proportion of total variance explained by the provided variables. c Forest plot of Cox regression univariate analysis evaluating association of cellularity with overall survival in TCGA (n = 1073) and Metzger-Filho (n = 117) datasets. Samples were evaluated overall and stratified by subtype (TCGA: 426 Luminal, 162 HER2+, 113 TN; Metzger-Filho: 100 Luminal). d Cellularity changes in on-treatment biopsy (n = 86) compared to pre-treatment (n = 112) in the Park dataset. The impact of the timepoint on tumour purity was evaluated by Student’s t test and VCA. e Same analysis as in d for the NA-PHER2 dataset (n = 52, pre-treatment biopsy; n = 40, on-treatment biopsy).
To quantify the contribution of each clinico-pathological variable to the overall tumour cellularity variance, we performed a variance component analysis (VCA) for each dataset (Fig. 1b). In line with the analysis above, the histological type and grade, together with the molecular subtypes, explained the highest percentage of total variance. However, 52–100% of cellularity variance in each dataset was not explained by the included variables.
Next, we evaluated whether tumour cellularity was associated with overall survival in the TCGA and Metzger-Filho datasets, overall and for each subtype defined by ER and HER2 status (Fig. 1c). Cellularity was significantly associated with prognosis only in lobular luminal cases of the Metzger-Filho dataset (p = 0.04).
Finally, in the Park and NA-PHER2 datasets, transcriptomic profiles were obtained from pre- and on-treatment biopsies. Because of treatment-induced tumour cell death, an overall reduction in tumour cellularity could be expected and was observed (p = 0.009 and p = 0.003, respectively). The biopsy timepoint explained 6.12% of the variance in the Park dataset and 16.3% in the NA-PHER2 dataset (Fig. 1d, e).
In summary, this analysis indicated that, in breast cancer, intrinsic tumour biology factors can affect tumour cellularity. However, over half of the variability observed in clinical specimens undergoing molecular characterization was not explained by the main clinico-pathological features and could be related to tumour sampling.
Development of a Breast Cancer Purity Score (BCPS)
Cellularity estimation by the pathologist is not always available and might not refer to the same tumour region undergoing molecular characterization. Consequently, we aimed to identify a gene expression signature able to estimate tumour purity in a bulk transcriptomic analysis of clinical breast cancer samples. As detailed hereafter, to generate the BCPS we interrogated four distinct datasets: Bruna, NA-PHER2, NeoTRIP21 (surgical samples) and Metzger-Filho (Fig. 2 and Table 1).
Workflow involving four distinct datasets leading to the definition of the BCPS. In the NA-PHER2 and Metzger-Filho datasets, the correlation between tumour purity and the expression values of each gene was computed. In the Bruna dataset, primary tumours were compared to matched patient-derived xenografts to identify candidate tumour-specific and stroma-specific genes, exploiting the loss of human stroma during engraftment. In the NeoTRIP dataset, the ROC curve AUC was estimated for each gene considering surgical samples with medium/high or low/no cellularity, as annotated by expert pathologists. By applying for each analysis the indicated thresholds, 5 tumour-associated genes and 4 TME-associated genes were selected to generate the BCPS.
We correlated gene expression with tumour cellularity in the NA-PHER2 (n = 92) dataset, identifying 733 genes with a positive correlation with tumour cellularity (ρ > 0.35) and 1009 genes with a negative correlation (ρ < −0.35). The same analysis was performed in the Metzger-Filho (n = 117) dataset, identifying 49 genes with a positive (ρ > 0.35) and 284 genes with a negative (ρ < −0.35) correlation with cellularity. Next, we considered surgical samples from the NeoTRIP dataset grouped into two categories based on expert pathologist evaluation: high/mid tumour content or low/no tumour content. We assessed the ability of each gene to distinguish between the two classes by ROC curve analysis, identifying 570 genes with AUC ≥ 0.7 and 330 genes with AUC < 0.3. Results from the analysis in Fig. 1 helped us define the BCPS development strategy. Indeed, the association between cellularity and breast cancer subtypes suggested performing the correlation with cellularity separately in datasets including only one subtype (Table 1).
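The correlation-based screening step can be sketched in Python as below: Spearman's ρ is computed between each gene's expression and cellularity, and genes are retained when ρ exceeds 0.35 or falls below −0.35, the thresholds stated above. This is a minimal pure-Python illustration, not the code used in the study; the function names, gene labels and toy values are hypothetical.

```python
from math import nan, sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y need the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return nan  # undefined for a constant variable
    return cov / (sx * sy)


def screen_genes(expression_by_gene, cellularity, threshold=0.35):
    """Split genes into positively and negatively correlated candidates."""
    positive, negative = [], []
    for gene, values in expression_by_gene.items():
        rho = spearman_rho(values, cellularity)
        if rho > threshold:
            positive.append(gene)
        elif rho < -threshold:
            negative.append(gene)
    return positive, negative


# Toy data (hypothetical): five samples with pathologist-estimated cellularity.
cellularity = [10, 30, 50, 70, 90]
toy_expression = {
    "GENE_A": [1, 2, 3, 4, 5],   # rises with cellularity
    "GENE_B": [5, 4, 3, 2, 1],   # falls with cellularity
    "GENE_C": [2, 5, 1, 4, 3],   # weak association, not retained
}
positive_genes, negative_genes = screen_genes(toy_expression, cellularity)
```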
As a complementary strategy, in the Bruna dataset, transcriptomic profiles of patient-derived xenografts (PDXs) were compared with matched clinical samples from which the xenografts originated. Since the human stroma is completely lost during engraftment and replaced by mouse stroma24, genes expressed by the human tumour microenvironment are expected to be downregulated in such a comparison. At the same time, tumour-specific genes are expected to be similarly expressed or upregulated in PDXs. The fold changes for selected genesets, expected to be tumour-specific or stroma-specific, are reported in Supplementary Figure 5A–C as proof of concept. In total, 236 stroma-specific genes (with logFC < −1.5) and 5679 tumour-specific genes (with logFC > 0) were identified as candidate reporters of tumour content. All the statistical estimates are available in Supplementary Data 1.
From the candidate genes selected in each dataset (Supplementary Fig. 5D–G), we derived a consensus list of 5 tumour-associated (AP1M2, CDK5, PAFAH1B3, SLC25A10, SMG5) and 4 stroma-associated (CXCL12, IFFO1, MFAP4, TGFBR2) genes (Fig. 2). Their main known biological functions are summarised in Supplementary Data 2. This set of genes was used for a single-sample geneset enrichment analysis providing the BCPS, proportional to sample tumour purity, as characterised in the following section.
BCPS evaluation of performance and comparison with ESTIMATE score
To evaluate the ability of the BCPS to estimate tumour sample purity and to compare its performance with the commonly used ESTIMATE score18, we interrogated five additional independent datasets: TCGA, Park, METABRIC, NeoTRIP (on-treatment samples) and Bianchini28 (Fig. 3 and Table 1).
Evaluation of the BCPS and comparison with ESTIMATE score in the TCGA, Park, NeoTRIP and Bianchini datasets. a Spearman’s correlation between the pathologist-estimated cellularity and either ESTIMATE or BCPS in TCGA (n = 1073). b Same analysis as in a for the Park dataset (n = 225). c ESTIMATE score and BCPS values measured in samples with high/medium tumour content or low/no tumour content in the NeoTRIP dataset (n = 219, on-treatment biopsy); two-sided Student’s t test. d ESTIMATE score and BCPS ability to discriminate between the two classes in c quantified by AUC. e ESTIMATE score and BCPS values measured in core biopsies (CBX) and matched fine-needle aspirations (FNA, n = 37 pairs) from the Bianchini dataset; two-sided Student’s t test. f ESTIMATE Score and BCPS ability to discriminate between the two classes in e quantified by AUC. g Example of KRT18 gene expression correction using the BCPS and linear regression to remove the impact of tumour purity. The Bianchini dataset was used. h Volcano plots of differential gene expression analysis between FNA and CBX samples of the Bianchini dataset. The analysis was performed without any correction and after normalising gene expression using the BCPS or ESTIMATE scores.
In the TCGA and Park datasets, BCPS had a significantly higher correlation with cellularity than ESTIMATE (ρ = 0.34–0.37 vs ρ = 0.23–0.27 respectively; correlation difference p < 0.001) (Fig. 3a, b). In the METABRIC dataset, where cellularity was grouped into low, moderate and high classes, BCPS better discriminated between the groups and AUC for the prediction of high vs low cellularity was 0.75 for the BCPS and 0.62 for ESTIMATE (Supplementary Fig. 6A, B). ESTIMATE was strongly inversely correlated with a gene expression-based immune score29 (ρ = −0.87 and ρ = −0.90 in TCGA and Park, respectively), while cellularity and BCPS showed weaker correlations (Supplementary Fig. 6C, D). In the TCGA dataset, Aran et al.5 computed multiple RNA, DNA and methylation based predictors of tumour purity. We compared their estimates with the BCPS in the set of samples where all the metrics were available. BCPS showed the highest correlation with the pathologist’s cellularity (Supplementary Fig. 6E).
The association with tumour content was also evaluated in the on-treatment biopsies of the NeoTRIP dataset (Fig. 3c, d). Samples were grouped into high/medium tumour content or low/no tumour content, as established by expert pathologists, and the ability of the two scores to discriminate between these two classes was quantified by Student’s t test and ROC curve analysis. BCPS better discriminated between the two groups, achieving a significantly higher AUC (BCPS AUC = 0.77, ESTIMATE AUC = 0.65, p < 0.001).
We then evaluated the BCPS and ESTIMATE score in the Bianchini dataset. This dataset contains matched primary breast cancer samples obtained by either fine-needle aspiration (FNA) or core-biopsy (CBX). The first sampling procedure is known to enrich for tumour cells while CBX better preserves the stromal content. Indeed, both BCPS and ESTIMATE score were higher in the FNA samples (paired Student’s t test p = 1.1 × 10⁻⁸ and p = 9 × 10⁻⁶, respectively). ROC curve analysis highlighted a significantly higher AUC for the BCPS compared to ESTIMATE score (BCPS AUC = 0.85, ESTIMATE AUC = 0.78, p = 0.046) (Fig. 3e, f).
In the Bianchini dataset, differences between matched FNA and CBX are expected to be primarily related to sampling differences affecting tumour content. Consequently, it represents a relevant setting in which to quantify such bias and evaluate the ability of a purity score to correct for it. To this aim, we performed a paired differential analysis between FNA and CBX, either without correction or after normalising the data using the BCPS or ESTIMATE score. BCPS-normalised data were obtained by taking the residuals of the linear regression models evaluating the relationship between BCPS and each gene. An example of gene expression before and after correction is shown in Fig. 3g. Differential analysis between FNA and CBX before correction identified 60 up-regulated and 409 down-regulated genes. Data correction using the ESTIMATE score yielded 89 up-regulated and 160 down-regulated genes, whereas only 40 up-regulated and 65 down-regulated genes were observed in data corrected using the BCPS, showing the higher performance of the BCPS in adjusting for differences due to tumour content in clinical samples (Fig. 3h). This analysis indicated that, in situations where extrinsic factors are expected to largely outweigh intrinsic factors in affecting tumour purity, the BCPS is a useful and effective tool to account and correct for tumour purity.
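The residual-based correction described above can be illustrated with a short Python sketch: for one gene, an ordinary least-squares line of expression against the purity score is fitted across samples, and the residuals are kept as the purity-adjusted expression. This is a minimal sketch under the assumption of a simple one-predictor linear model per gene; the function name is hypothetical and the study's own code may differ.

```python
def purity_adjusted(expression, purity_score):
    """Residuals of the least-squares fit expression ~ purity_score for one gene."""
    n = len(expression)
    if n != len(purity_score) or n < 2:
        raise ValueError("inputs need the same length (at least 2)")
    mean_x = sum(purity_score) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in purity_score)
    if sxx == 0:
        raise ValueError("purity score is constant across samples")
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(purity_score, expression)) / sxx
    intercept = mean_y - slope * mean_x
    # What remains after removing the part of expression explained by purity.
    return [y - (intercept + slope * x)
            for x, y in zip(purity_score, expression)]
```

By construction the residuals average zero and are uncorrelated with the purity score, so a gene whose expression is fully explained by purity shows no remaining variation after adjustment.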
BCPS recapitulates cellularity associations with clinico-pathological factors
For the TCGA and Park datasets, not used for the BCPS development, we evaluated the association with available clinico-pathological factors, as reported for the pathologist cellularity in Fig. 1.
BCPS was significantly associated with the same variables that were significantly associated with cellularity in both the TCGA and Park datasets (Figs. 4a and 1a). VCA analysis was quantitatively different but qualitatively similar (Figs. 4b and 1b), and the BCPS was significantly lower in on-treatment compared to pre-treatment samples in the Park dataset, as observed for the pathologist’s cellularity (Figs. 4d and 1d). Finally, the BCPS was not significantly associated with survival in any subtype in the TCGA dataset (Figs. 4c and 1c).
a Landscape of association of available molecular and clinico-pathological variables with cellularity in the TCGA (n = 1073) and Park (n = 112) datasets. (* association between purity and continuous variables was assessed by Spearman’s correlation, association with two categorical groups was assessed by Student’s t test, and association with multiple categorical groups was evaluated by one-way ANOVA). b Variance component analysis (VCA) for each dataset computed for samples with no missing information (TCGA, n = 690; Park, n = 72). The analysis estimated the proportion of total variance explained by the provided variables. c Forest plot of Cox regression univariate analysis evaluating cellularity association with overall survival in the TCGA (n = 1082) dataset. Samples were evaluated overall and stratified by subtype (426 Luminal, 162 HER2+, 113 TN). d Cellularity changes in on-treatment biopsy compared to pre-treatment in the Park dataset (T1 = 112, T2 = 86). The impact of the timepoint on tumour purity was evaluated by Student’s t test and VCA.
While such associations would require external validation, these results support the validity of the BCPS in providing an estimate of tumour purity leading to conclusions similar to what could be drawn using the pathologist’s evaluation.
Use of the BCPS for prediction of prognosis and response to treatment
A typical goal in transcriptomic data analysis of clinical samples is the identification of genes or signatures associated with specific clinical endpoints. Quantification of expression could be affected by tumour content7. We previously developed and validated an ER-related and a proliferation-related metagene as predictors of long-term outcome in ER+/HER2− breast cancer13. Here we applied the metagenes to 2277 ER+/HER2− breast cancer samples from the Brueffer dataset (Table 1). A multivariate Cox model with interactions, including the ER and proliferation metagenes and the BCPS, explained the survival data better than a bivariate model without the BCPS (likelihood ratio test p = 0.035). The improvement was confirmed by a higher c-index and a higher 7-year AUC when the BCPS was included in the model (Fig. 5a, b).
a, b In ER+/HER2− samples from the Brueffer dataset (n = 2277), 7-year overall survival was predicted using a multivariate Cox model with interactions including an ER and a Proliferation metagene with or without the BCPS. C-index (a) and 7-year AUC (b) were computed for the two models, highlighting a performance improvement when the BCPS was included. c Association of the BCPS with pCR in on-treatment biopsies from the NeoTRIP dataset (n = 219). d BCPS quantified in the surgical samples of the NeoSPHERE trial was associated with DEFS. Two groups based on the BCPS median were identified and represented by Kaplan–Meier curves; differences were evaluated by log-rank test.
Next, we focused on BCPS values in on-treatment biopsies of the NeoTRIP dataset, where the BCPS was significantly lower (p = 1.3 × 10⁻³) in cases eventually achieving pathological complete response or pCR (Fig. 5c). This indicates that, despite a possible sampling bias, evaluation of cellularity in on-treatment biopsies could help early prediction of patients responding or not to neoadjuvant treatments. Remarkably, this mimicked the predictive power of the pathologist’s evaluation in the same cohort, where 66.3% of patients with low/no cellularity had a pCR, versus only 35.3% of patients with mid/high cellularity in the on-treatment biopsy (p = 1.8 × 10⁻⁵).
A third example where evaluation of the BCPS could provide valuable information is in surgical samples obtained after neoadjuvant treatment. Cellularity is one of the factors determining the Residual Cancer Burden (RCB) score, to which it contributes positively30. High RCB is associated with worse prognosis. Here we evaluated the BCPS in post-treatment surgical samples from the neoadjuvant clinical trial NeoSPHERE22. In line with the prognostic role of the pathologist’s cellularity as part of the RCB score, the BCPS significantly stratified patients’ long-term risk by quantifying the amount of residual tumour. The association remained significant whether identifying 4 groups based on BCPS distribution quartiles (Supplementary Fig. 7B) or two groups using the median value (Fig. 5d).
Discussion
Multiple cell types constitute the cancer ecosystem and the prevalence of each cell type can be influenced by intrinsic and extrinsic factors. Prevalence of neoplastic cells, or tumour cellularity, is typically estimated by the pathologist on histological tissue sections. In four distinct breast cancer datasets we found significant associations between pathologist-estimated cellularity and clinico-pathological features. We observed the highest cellularity in Grade 3 and ER-negative breast cancers, and the lowest in low grade luminal A tumours. Additionally, cellularity was affected by the histological type, higher in Invasive Ductal Carcinoma (IDC) compared to Lobular but also variable between distinct Lobular subtypes. A weak negative correlation between cellularity and sTILs was observed in the NA-PHER2 trial. Overall, this confirmed that intrinsic factors have an impact on tumour cellularity in breast cancer, in line with what was previously reported in other cancer types3,4. However, VCA analysis indicated that more than half of the variance is not explained by the investigated factors, supporting the idea that extrinsic factors, including sampling bias, can affect specimen cellularity.
Bulk transcriptomic data analysis and interpretation is directly affected by the tumour content in the specimen7. Depending on the specific context and aims of the analysis, if extrinsic factors are expected to prevail, it is advantageous to consider tumour cellularity in data modelling.
Tumour cellularity evaluated by the pathologist is not always available and a low concordance has been reported between pathologist estimation and estimation based on DNA or RNA profiling data6. This could be partially explained by spatial variability between the sample evaluated by the pathologist and the sample or sections undergoing nucleic acid extraction and quantification; an aspect particularly relevant in small core biopsies. Moreover, the pathologist quantification is partially subjective and qualitative, with a significant interobserver variability being previously reported6,15.
Transcriptomics-based approaches to estimate tumour purity have been proposed. The most widely used is the ESTIMATE score by Yoshihara and colleagues18. It merges a stroma and an immune signature to estimate the relative abundance of neoplastic cells and TME. It was developed to have pan-cancer validity, but it is strongly related to immune infiltration and does not include tumour-specific genes. Quon and colleagues developed ISOPure31, a gene expression method using normal and tumour profiles to correct the latter for normal contamination. ISOPure essentially assumes that the TME resembles the originating normal tissue, but it is well documented that this is not the case, with stroma, immune and endothelial cells dramatically altering their phenotypes as a result of the interplay with neoplastic cells1.
The BCPS was developed by interrogating large sample cohorts to identify the best reporter genes in a context-specific data-driven way. We covered all breast cancer subtypes, aiming at including tumour specific genes not primarily affected by the tumour subtype. We also included data obtained by different platforms; consequently, we believe that the selected consensus genes will perform well independently of the technology used.
In four independent datasets we could validate the performance of the BCPS as a reporter of tumour purity. The BCPS systematically outperformed the ESTIMATE score18. It had significantly higher correlation with pathologist’s cellularity than ESTIMATE, significantly better identified samples with low/no tumour cells and better controlled for the sampling bias in matched CBX and FNA samples. In multiple datasets, ESTIMATE showed a very strong negative correlation (ρ ≤ −0.87) with a signature estimating the immune infiltration. This could partially explain its lower performance, due to a limited ability to estimate low cellularity in cold tumours.
In the TCGA dataset, we reproduced the ESTIMATE correlation values reported by Aran et al.5, and the BCPS showed the highest correlation with pathologist cellularity among the purity estimation metrics reported in that study (including ABSOLUTE9 and LUMP5, based on genomics and methylation data, respectively). Consequently, when multiple omics data are available for the same sample, the BCPS could be preferred to estimate tumour purity, for example to inform a parallel genomic analysis. Correlation with the pathologist estimation was still moderate, but in line with previous reports, for the reasons outlined above.
The BCPS is a simple, easy-to-compute score, proportional to the quantity of neoplastic cells in a clinical breast cancer specimen. As a reporter of tumour purity, the BCPS could be informative for studying intrinsic factors affecting tumour content, but it can also be used to control for tumour purity in the analysis and interpretation of bulk transcriptomic data. We showed two strategies for including the BCPS in the analyses. The first is to directly adjust genes or signatures for the BCPS-estimated effect of tumour purity on their expression levels. This strategy markedly reduced the bias between matched CBX and FNA samples, with FNA known to be enriched in tumour cells compared to CBX. Of note, after adjusting for tumour purity, the expression levels of tumour-specific genes or signatures can be interpreted as the expression levels in the tumour compartment. However, TME marker genes and signatures, e.g. immune-related ones, are often interpreted as a proxy for the prevalence of a specific TME cell type. In this case, adjusting for tumour content could capture the density of those cells rather than their absolute quantity, possibly providing complementary information.
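As an illustration of the direct adjustment strategy, the sketch below removes the linear effect of a purity score from one gene's expression values by ordinary least squares and re-centres the values at the mean score. It is a minimal sketch under stated assumptions, not the reference implementation: it assumes a linear purity effect fitted one gene at a time, the numbers are toy values, and the function names fit_line and adjust_for_purity are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("purity score has zero variance across samples")
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope


def adjust_for_purity(expression, purity_score):
    """Subtract the fitted linear purity effect from each sample.

    Values are re-centred at the mean purity score, so the adjusted
    expression keeps the original scale of the gene.
    """
    _, slope = fit_line(purity_score, expression)
    reference = mean(purity_score)
    return [e - slope * (p - reference)
            for e, p in zip(expression, purity_score)]


# Hypothetical toy data: four samples, one gene whose expression
# depends only on the purity score (assumption for illustration).
bcps = [0.2, 0.4, 0.6, 0.8]
gene = [1.0, 2.0, 3.0, 4.0]

adjusted = adjust_for_purity(gene, bcps)
print(adjusted)
```

In this toy case all variation in the gene is explained by the purity score, so the adjusted values collapse to a constant; with real data, the residual variation is what remains to be interpreted after controlling for tumour content.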
A second strategy is to include the BCPS in data modelling, for example in the fitting of survival models. In this context, the other variables in the model could be, to some extent, biologically related to tumour purity. This was the case when using ER-related and proliferation-related metagenes to predict survival in ER+HER2− breast cancer. ER expression is associated with tumour purity, whether estimated by the pathologist or using the BCPS. Similarly, the higher tumour purity in Grade 3 and triple-negative tumours, as well as the higher tumour purity in luminal B compared to luminal A tumours, suggests a biological link between proliferation and tumour content. In this case, including the BCPS as a covariate in the model was advantageous compared to a direct BCPS adjustment of the metagenes.
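One way to write the covariate strategy, assuming a Cox proportional hazards model (the text above specifies only survival models), is shown below. The symbols are illustrative and not taken from the paper: h_0(t) is the baseline hazard, m_ER and m_prolif are the ER-related and proliferation-related metagenes, and the β terms are the fitted coefficients.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(
    \beta_{\mathrm{ER}}\, m_{\mathrm{ER}}
  + \beta_{\mathrm{prolif}}\, m_{\mathrm{prolif}}
  + \beta_{\mathrm{BCPS}}\, \mathrm{BCPS}
\right)
```

In this form purity has its own coefficient, so the metagene effects are estimated conditional on tumour content, whereas direct adjustment would replace m_ER and m_prolif with purity-corrected values before fitting.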
Finally, the BCPS has potential as a prognostic or predictive biomarker in specific contexts. In the NeoTRIP neoadjuvant trial, BCPS-estimated low tumour content in on-treatment biopsies was associated with achievement of pCR at the end of treatment. In the NeoSphere trial, we also found that BCPS-estimated cellularity after neoadjuvant treatment of HER2+ breast cancer was associated with long-term outcome, mirroring the prognostic role of pathologist-estimated cellularity, which is one of the key components of the highly prognostic Residual Cancer Burden30.
A possible limitation of the BCPS is that pathologist’s cellularity is itself not a perfect gold standard, since high inter-pathologist variability has been reported. To mitigate this, we based the BCPS development on the mining of multiple datasets in which cellularity was evaluated by independent pathologists, and we added an orthogonal strategy based on a matched tumour-PDX dataset, eventually identifying the consensus genes associated with tumour content despite all sources of variability. Moreover, gene expression could be affected by external factors, e.g. treatment, potentially introducing unwanted variability into the BCPS. However, this is likely to affect only a subset of the genes in the BCPS, leading to a limited impact on the overall score.
In this study, we developed and validated a straightforward tool to estimate tumour content from bulk transcriptomic breast cancer data, useful to explore the role of tumour purity, aid data interpretation and improve prognostication. The framework presented here could be successfully applied to other cancer types.
Data availability
Reference code to implement and use the BCPS is available under Academic Free License v. 3.0 at: https://github.com/BarrecaMarco/BCPS. Repositories and IDs of the publicly available datasets used in this study are reported in Table 1. All the statistical estimates used for the BCPS development are available in Supplementary Data 1. BCPS values in the NA-PHER2, NeoTRIP and NeoSphere datasets are reported in Supplementary Data 3–5. Additional information can be obtained from the corresponding authors upon reasonable request.
References
1. Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: The next generation. Cell 144, 646–674 (2011).
2. Junttila, M. R. & de Sauvage, F. J. Influence of tumour micro-environment heterogeneity on therapeutic response. Nature 501, 346–354 (2013).
3. Lou, S. et al. Comprehensive Characterization of Tumor Purity and Its Clinical Implications in Gastric Cancer. Front. Cell Dev. Biol. 9, 3843 (2022).
4. Zhang, C. et al. Tumor purity as an underlying key factor in glioma. Clin. Cancer Res. 23, 6279–6291 (2017).
5. Aran, D., Sirota, M. & Butte, A. J. Systematic pan-cancer analysis of tumour purity. Nat. Commun. 6, 8971 (2015).
6. Haider, S. et al. Systematic Assessment of Tumor Purity and Its Clinical Implications. JCO Precis. Oncol. 4, 995–1005 (2020).
7. Fisher, N. C. et al. Biological Misinterpretation of Transcriptional Signatures in Tumor Samples can Unknowingly Undermine Mechanistic Understanding and Faithful Alignment with Preclinical Data. Clin. Cancer Res. 28, OF1–OF14 (2022).
8. Rhee, J. K. et al. Impact of tumor purity on immune gene expression and clustering analyses across multiple cancer types. Cancer Immunol. Res. 6, 87–97 (2018).
9. Carter, S. L. et al. Absolute quantification of somatic DNA alterations in human cancer. Nat. Biotechnol. 30, 413–421 (2012).
10. van Loo, P. et al. Allele-specific copy number analysis of tumors. Proc. Natl Acad. Sci. USA 107, 16910–16915 (2010).
11. Bernard, P. S. et al. Supervised Risk Predictor of Breast Cancer Based on Intrinsic Subtypes. J. Clin. Oncol. 27, 1160 (2009).
12. Sparano, J. A. et al. Clinical and Genomic Risk to Guide the Use of Adjuvant Therapy for Breast Cancer. N. Engl. J. Med. 380, 2395–2405 (2019).
13. Callari, M. et al. Subtype-specific metagene-based prediction of outcome after neoadjuvant and adjuvant treatment in breast cancer. Clin. Cancer Res. 22, 337–345 (2016).
14. Park, Y. H. et al. Chemotherapy induces dynamic immune responses in breast cancers that impact treatment outcome. Nat. Commun. 11, https://doi.org/10.1038/s41467-020-19933-0 (2020).
15. Smits, A. J. J. et al. The estimation of tumor cell percentage for molecular testing by pathologists is not accurate. Mod. Pathol. 27, 168–174 (2014).
16. Anghel, C. V. et al. ISOpureR: an R implementation of a computational purification algorithm of mixed tumour profiles. BMC Bioinformatics 16, https://doi.org/10.1186/S12859-015-0597-X (2015).
17. Ahn, J. et al. DeMix: deconvolution for mixed cancer transcriptomes using raw measured data. Bioinformatics 29, 1865–1871 (2013).
18. Yoshihara, K. et al. Inferring tumour purity and stromal and immune cell admixture from expression data. Nat. Commun. 4, 1–11 (2013).
19. Gianni, L. et al. Neoadjuvant treatment with trastuzumab and pertuzumab plus palbociclib and fulvestrant in HER2-positive, ER-positive breast cancer (NA-PHER2): an exploratory, open-label, phase 2 study. Lancet Oncol. 19, 249–256 (2018).
20. Gianni, L. et al. Effects of neoadjuvant trastuzumab, pertuzumab and palbociclib on Ki67 in HER2 and ER-positive breast cancer. NPJ Breast Cancer 8, https://doi.org/10.1038/s41523-021-00377-8 (2022).
21. Gianni, L. et al. Pathologic complete response (pCR) to neoadjuvant treatment with or without atezolizumab in triple negative, early high-risk and locally advanced breast cancer. NeoTRIP Michelangelo randomized study. Ann. Oncol. https://doi.org/10.1016/j.annonc.2022.02.004 (2022).
22. Bianchini, G. et al. Immune modulation of pathologic complete response after neoadjuvant HER2-directed therapies in the NeoSphere trial. Ann. Oncol. 26, 2429–2436 (2015).
23. Weinstein, J. N. et al. The Cancer Genome Atlas Pan-Cancer Analysis Project. Nat. Genet. 45, 1113 (2013).
24. Bruna, A. et al. A Biobank of Breast Cancer Explants with Preserved Intra-tumor Heterogeneity to Screen Anticancer Compounds. Cell 167, 260–274.e22 (2016).
25. Metzger-Filho, O. et al. Genomic grade adds prognostic value in invasive lobular carcinoma. Ann. Oncol. 24, 377–384 (2013).
26. Foroutan, M. et al. Single sample scoring of molecular phenotypes. BMC Bioinformatics 19, 1–10 (2018).
27. Brueffer, C. et al. Clinical Value of RNA Sequencing-Based Classifiers for Prediction of the Five Conventional Breast Cancer Biomarkers: A Report From the Population-Based Multicenter Sweden Cancerome Analysis Network-Breast Initiative. JCO Precis. Oncol. 2, 1–18 (2018).
28. Bianchini, G. et al. Molecular anatomy of breast cancer stroma and its prognostic value in estrogen receptor-positive and -negative cancers. J. Clin. Oncol. 28, 4316–4323 (2010).
29. Jimenez-Sanchez, A., Cast, O. & Miller, M. L. Comprehensive benchmarking and integration of tumor microenvironment cell estimation methods. Cancer Res. 79, 6238–6246 (2019).
30. Symmans, W. F. et al. Long-Term Prognostic Risk After Neoadjuvant Chemotherapy Associated With Residual Cancer Burden and Breast Cancer Subtype. J. Clin. Oncol. 35, 1049–1060 (2017).
31. Quon, G. et al. Computational purification of individual tumor gene expression profiles leads to significant improvements in prognostic prediction. Genome Med. 5, 29 (2013).
32. Curtis, C. et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486, 346–352 (2012).
Acknowledgements
This work has been supported by Fondazione Michelangelo, Fondazione AIRC per la ricerca sul cancro (grant IG2018 - ID21787 to G.B.), the Breast Cancer Research Foundation (grant BCRF 21-181 to L.G.) and MUR (grant “Dipartimenti di Eccellenza 2023-2027” of the Department of Informatics, Systems and Communication of the University of Milano-Bicocca, Italy).
Author information
Contributions
M.C. conceived the study and planned the computational and statistical analyses; M.B. performed the computational and statistical analyses; M.D., B.G. and D.B. contributed to the computational analysis; P.V. curated the clinical information for the internal datasets; G.V. evaluated the tumour content in the internal datasets; M.C. and M.B. drafted the manuscript; G.B. and L.G. contributed to study design and clinical interpretation of the results. The Na-PHER2 consortium and the NeoTRIP consortium ran the respective clinical studies collecting the samples and clinical information. All authors contributed to subsequent drafts and agreed with submission of the manuscript for publication.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Barreca, M., Dugo, M., Galbardi, B. et al. Development and validation of a gene expression-based Breast Cancer Purity Score. npj Precis. Onc. 8, 242 (2024). https://doi.org/10.1038/s41698-024-00730-7