Introduction

Clinical tumours are complex ecosystems comprising neoplastic, immune and endothelial cells, fibroblasts and normal epithelium. Tumour microenvironment (TME) components functionally interact with one another and with the tumour cells in a complex network of signals that affects tumour growth and progression, response to treatment and the prevalence of the different cell types1,2.

As a consequence, tumour purity (i.e. the relative abundance of cancer cells) could represent a biologically relevant, intrinsic tumour feature. For example, a recent study in gastric cancer showed that tumour purity can predict response to chemotherapy, providing novel insights that could help improve prognostic risk stratification and facilitate treatment decision-making3. Similarly, an association of purity with major clinical and molecular features was reported in glioma4. On the other hand, tissue organisation and tumour heterogeneity cause spatial variability in tumour purity. In the molecular analysis of clinical tumours, only small regions of the neoplastic lesion are typically investigated (e.g. tissue sections from a core biopsy). This introduces an extrinsic variation in tumour purity caused by the sampling procedure, which, in a pan-cancer analysis, was reported to outweigh intrinsic factors5.

In translational research, most high throughput experiments are performed on bulk tissue samples; consequently, variability in tumour content can influence the interpretation of molecular data and clinical decisions5,6,7,8. In genomic analyses, purity can impact detectability of somatic mutations9 and copy number alteration events10. In transcriptomic profiling of bulk clinical specimens, the observed profile is the result of mRNAs expressed by all different cell types present in the tumour ecosystem.

A number of gene expression-based biomarkers have been developed, with some reaching clinical implementation. In breast cancer, the transcriptomics-based PAM50 molecular subtyping has been proposed and validated11 and is now used to aid clinical decisions12. Notably, the Normal-like subtype is reportedly an artifact caused by low tumour purity rather than an actual cancer cell phenotype. Similarly, the classification into all other subtypes could be biased by variability in tumour purity. We and others have previously proposed the quantification of estrogen signalling and tumour proliferation as prognostic and predictive biomarkers in breast cancer12,13. In this case too, tumour purity could introduce a relevant bias in their quantification.

Additionally, it is becoming increasingly common to study serial samples collected during treatment14. Treatment is expected to have a major and patient-specific impact on tumour purity. Consequently, interpretation of differences observed between pre- and post-treatment samples could be significantly biased by changes in tumour content. At the same time, treatment-induced changes in tumour purity could represent valuable information in predicting treatment response and outcome.

Tumour purity is generally estimated by pathologists through visual or image analysis of tumour sections. While this method is considered the gold standard, high inter-pathologist variability and discordance have been reported in different studies6,15.

A few computational methods have been proposed to use molecular information for tumour purity estimation, such as gene expression16,17, genomic9 or DNA methylation profiles9. These methods have the potential to bypass pathology-assessment of tumour cellularity and estimate tumour purity in the very same sample used to derive the molecular profile. One of the commonly used transcriptomics-based methods is ESTIMATE18, which was developed pan-cancer by combining an immune and a stroma score, but not including any tumour related genes.

In this study, we exploited over 6000 samples from 10 breast cancer datasets to systematically quantify the impact of intrinsic and extrinsic factors on cellularity and to generate and validate a Breast Cancer Purity Score (BCPS). BCPS outperformed ESTIMATE in quantifying tumour purity and can be successfully used to adjust for tumour purity variability when extrinsic factors are prevalent. BCPS can also capture treatment-induced changes carrying predictive and prognostic information.

Methods

Dataset collection and processing

To derive and validate the BCPS, a total of 10 gene expression datasets were collected and interrogated. Information on dataset source, sample usage, sample features and data processing is detailed in Table 1.

Table 1 Description of gene expression datasets interrogated to derive, validate and apply the BCPS

All ethical approvals and consents to participate have been obtained as previously reported19,20,21,22.

BCPS identification and quantification

To identify a consensus list of genes to be used as reporters of tumour cellularity, we applied tailored statistical analyses to four distinct breast cancer datasets, as summarized in Fig. 2. From the intersection of candidate genes obtained from each of the four interrogated datasets (i.e. Bruna PDX24, NA-PHER219, NeoTRIP21, Metzger-Filho25), 5 genes consistently associated with tumour content and 4 consistently associated with stroma content were selected. The two genesets were used together to perform a directional single-sample geneset enrichment analysis as implemented in the singscore (version 1.14.0) R package26. Tumour genes were considered as ‘up’ and stroma genes as ‘down’; this way, high expression of tumour genes and low expression of stroma genes leads to the highest score. Consequently, a higher score indicates higher tumour purity.
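
As an illustration only, the directional scoring idea can be sketched in a few lines of Python. This is a simplified rank-based score, not the singscore implementation: the package applies its own normalisation and tie handling, which are not reproduced here. The function name and the toy expression values are assumptions made for this example; the gene lists are the consensus genes reported in the Results.

```python
# Consensus genes reported in this study.
TUMOUR_GENES = ["AP1M2", "CDK5", "PAFAH1B3", "SLC25A10", "SMG5"]
STROMA_GENES = ["CXCL12", "IFFO1", "MFAP4", "TGFBR2"]


def directional_rank_score(expression, up_genes, down_genes):
    """Simplified directional rank score for a single sample.

    expression: dict mapping gene symbol -> expression value.
    Returns a value between -1 and 1; it is highest when the 'up' genes
    rank at the top of the sample and the 'down' genes at the bottom.
    """
    # Rank genes from lowest (rank 1) to highest (rank n) expression.
    # Tied values are ordered arbitrarily in this sketch.
    ordered = sorted(expression, key=expression.get)
    n = len(ordered)
    rank = {gene: i + 1 for i, gene in enumerate(ordered)}

    def centred_mean_rank(genes, reverse=False):
        present = [g for g in genes if g in rank]
        if not present:
            raise ValueError("no signature gene found in the sample")
        # For the 'down' set the ranks are reversed, so that low
        # expression contributes a high value.
        ranks = [(n + 1 - rank[g]) if reverse else rank[g] for g in present]
        return sum(ranks) / len(ranks) / n - 0.5

    return centred_mean_rank(up_genes) + centred_mean_rank(down_genes, reverse=True)


# Toy profiles (invented values): 20 filler genes plus the signature genes.
filler = {f"FILLER{i}": 5.0 + 0.01 * i for i in range(20)}

tumour_rich = dict(filler)
tumour_rich.update({g: 10.0 + i for i, g in enumerate(TUMOUR_GENES)})
tumour_rich.update({g: 1.0 + 0.1 * i for i, g in enumerate(STROMA_GENES)})

stroma_rich = dict(filler)
stroma_rich.update({g: 1.0 + 0.1 * i for i, g in enumerate(TUMOUR_GENES)})
stroma_rich.update({g: 10.0 + i for i, g in enumerate(STROMA_GENES)})

high = directional_rank_score(tumour_rich, TUMOUR_GENES, STROMA_GENES)
low = directional_rank_score(stroma_rich, TUMOUR_GENES, STROMA_GENES)
```

In this toy setting the tumour-rich profile receives a positive score and the stroma-rich profile a negative one, mirroring the intended direction of the BCPS.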

ESTIMATE score computation

ESTIMATE score18 was computed using the estimate (version 1.0.13) R package. The score is computed by combining a stromal and an immune score. To improve readability and comparability with the BCPS, ESTIMATE score was multiplied by −1 to have higher values indicating higher tumour purity.

Statistical analyses

Spearman’s correlation was calculated to evaluate the association between two continuous variables. Differences between correlation values obtained on the same set of data were compared using cocor (version 1.1.4) R package. Student’s t test (stats R package, version 4.1.2) and analysis of variance or ANOVA (car R package, version 3.0.14) were calculated to evaluate the statistical significance of association of a continuous variable with two or multiple classes, respectively.
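
For readers less familiar with rank correlation, a minimal sketch of Spearman's ρ is given below. It assigns average ranks to tied values and then computes the Pearson correlation of the ranks; it is intended as an illustration of the statistic, not as the code used in this study, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(vx * vy)
```

Because only ranks are used, any monotonic relationship between cellularity and a gene's expression yields ρ = 1 or ρ = −1, regardless of its shape.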

Variance component analysis (VCA) was computed to evaluate the impact of clinico-pathological features on tumour purity as implemented in the VCA (version 1.4.3) R package. Samples with missing values in any of the variables were excluded.

To evaluate the ability of a variable to discriminate between two classes, Receiver Operating Characteristic curve (or ROC curve) and associated area under the ROC curve (AUC) were computed, as implemented in the ROCR (version 1.1.11) R package. Differences in AUC values were statistically evaluated using pROC (version 1.18.0).
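
The AUC can be read as the probability that a randomly chosen sample from the positive class has a higher score than a randomly chosen sample from the negative class. The sketch below computes it directly from that definition by comparing all pairs, counting ties as one half. It is a didactic illustration with a hypothetical function name and does not reproduce the ROCR or pROC implementations.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores: numeric values (e.g. a purity score or a gene's expression).
    labels: truthy for the positive class, falsy for the negative class.
    """
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("both classes must be represented")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))
```

With this definition, a value near 1 indicates that the variable is higher in the positive class, a value near 0 that it is lower, and 0.5 that it does not discriminate.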

Differential expression analysis was performed using the limma (version 3.50.0) R package. Nominal p values were corrected for multiple testing using the Benjamini–Hochberg method. Genes with False Discovery Rate (FDR) <0.05 and absolute log fold change >1 were considered significantly differentially expressed, unless otherwise specified.
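
The Benjamini–Hochberg adjustment and the significance filter described above can be sketched as follows. The code is a plain re-statement of the standard procedure, not the limma implementation, and the function names are assumptions made for this example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def is_significant(fdr, log_fc, fdr_cutoff=0.05, lfc_cutoff=1.0):
    """Apply the thresholds used here: FDR < 0.05 and |logFC| > 1."""
    return fdr < fdr_cutoff and abs(log_fc) > lfc_cutoff
```

For example, the nominal p values 0.01, 0.04, 0.03 and 0.005 are adjusted to 0.02, 0.04, 0.04 and 0.02, respectively.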

To evaluate the association with survival for single or multiple variables, univariate or multivariate Cox regression analysis as implemented in the survival (version 3.2.15) R package was performed. In the TCGA and Brueffer27 datasets, overall survival (OS) was the clinical endpoint. In the NeoSPHERE22 dataset, distant event-free survival (DEFS) was considered. To quantify and compare the predictive power of distinct models, the concordance index, as implemented by stat (version 4.1.3) R package, and the 7-year area under the ROC curve (AUC), as implemented by survivalROC (version 1.0.3) R package, were computed.
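
As a reminder of what the concordance index measures, a simplified version of Harrell's C is sketched below. It counts, among comparable pairs of patients, how often the patient with the earlier observed event also has the higher predicted risk. The treatment of tied event times is deliberately naive and the function name is hypothetical; the R implementations handle these cases more carefully.

```python
def concordance_index(times, events, risk_scores):
    """Simplified Harrell's C for right-censored survival data.

    times: follow-up time for each patient.
    events: truthy if the event was observed, falsy if censored.
    risk_scores: model output, where a higher value means higher risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier event
        for j in range(n):
            # The pair is comparable when i has an event before time j.
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random predictions and 1 to a perfect ranking of patients by risk.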

Results

Association of pathology-assessed cellularity with clinico-pathological features and patient’s outcome

While tumour purity can significantly impact quantification and interpretation of transcriptomic data, it is important to appreciate that both intrinsic and extrinsic factors can influence it. To obtain insights into the relative impact of these two aspects on breast cancer sample tumour content, we investigated whether pathology-assessed tumour cellularity is associated with biologically and clinically relevant features in breast cancer. Associations were evaluated in the TCGA, Metzger-Filho, Park and NA-PHER2 datasets, where tumour purity quantification by expert pathologists was available (Table 1). We found multiple significant associations, as reported in Fig. 1a. In the TCGA, the invasive lobular cancers (ILC) had lower purity than other subtypes (Fig. 1a and Supplementary Fig. 1A). At the same time, in the Metzger-Filho dataset, which includes only ILCs, cellularity differed significantly between distinct ILC subtypes, with the lobular Classic subtype showing the lowest cellularity (Supplementary Fig. S3B). The proliferation marker Ki67 had a weak correlation with cellularity in the Metzger-Filho dataset (ρ = 0.19, p = 0.047), but the same association was not confirmed in Park and NA-PHER2 datasets (Fig. 1a, Supplementary Figs. 2M, 3C and 4B). Notably, different subtypes were included in the different datasets. In contrast, high grade was consistently associated with higher cellularity (Fig. 1a, Supplementary Figs S2A, S3A). Breast cancer subtypes, either defined by the combination of ER and HER2 status or according to PAM50 classification, showed a significant association with cellularity in both TCGA and Park datasets. In particular, triple-negative or basal-like tumours had, on average, the highest cellularity (Fig. 1a, Supplementary Figs. 1E, F and 2E, F). Finally, in NA-PHER2 samples, stromal tumour-infiltrating lymphocytes (sTILs), but not intraepithelial tumour-infiltrating lymphocytes (iTILs), were negatively correlated with tumour cellularity (ρ = −0.33, p = 0.018) (Fig. 1a and Supplementary Fig. 4C, D).

Fig. 1: Association of pathologist’s estimated tumour purity with molecular and clinico-pathological variables in breast cancer.

a Landscape of association of available molecular and clinico-pathological variables with cellularity in four breast cancer datasets: TCGA (n = 1073), Metzger-Filho (n = 117), Park (n = 112) and NA-PHER2 (n = 52) pre-treatment samples. (* association between purity and continuous variables was assessed by Spearman’s correlation, association with two categorical groups was assessed by Student’s t test, and association with multiple categorical groups was evaluated by one-way ANOVA). b Variance component analysis (VCA) for each dataset computed for samples with no missing information (TCGA, n = 690; Park, n = 72; Metzger-Filho, n = 111; NA-PHER2, n = 52). The analysis estimated the proportion of total variance explained by the provided variables. c Forest plot of Cox regression univariate analysis evaluating association of cellularity with overall survival in TCGA (n = 1073) and Metzger-Filho (n = 117) datasets. Samples were evaluated overall and stratified by subtype (TCGA: 426 Luminal, 162 HER2+, 113 TN; Metzger-Filho: 100 Luminal). d Cellularity changes in on-treatment biopsy (n = 86) compared to pre-treatment (n = 112) in the Park dataset. The impact of the timepoint on tumour purity was evaluated by Student’s t test and VCA. e Same analysis as in d for the NA-PHER2 dataset (n = 52, pre-treatment biopsy; n = 40, on-treatment biopsy).

To quantify the contribution of each clinico-pathological variable to the overall tumour cellularity variance, we performed a variance component analysis (VCA) for each dataset (Fig. 1b). In line with the analysis above, the histological type and grade, together with the molecular subtypes, explained the highest percentage of total variance. However, 52–100% of cellularity variance in each dataset was not explained by the included variables.

Next, we evaluated whether tumour cellularity was associated with overall survival in the TCGA and Metzger-Filho datasets, overall and for each subtype defined by ER and HER2 status (Fig. 1c). Cellularity was significantly associated with prognosis only in lobular luminal cases of the Metzger-Filho dataset (p = 0.04).

Finally, in the Park and NA-PHER2 datasets, transcriptomic profiles were obtained from pre- and on-treatment biopsies. Because of treatment-induced tumour cell death, an overall reduction in tumour cellularity could be expected and was observed (p = 0.009 and p = 0.003, respectively). The biopsy timepoint explained 6.12% of the variance in the Park dataset and 16.3% in the NA-PHER2 dataset (Fig. 1d, e).

In summary, this analysis indicated that, in breast cancer, intrinsic tumour biology factors can affect tumour cellularity. However, over half of the variability observed in clinical specimens undergoing molecular characterization was not explained by the main clinico-pathological features and could be related to tumour sampling.

Development of a Breast Cancer Purity Score (BCPS)

Cellularity estimation by the pathologist is not always available and might not refer to the same tumour region undergoing molecular characterization. Consequently, we aimed to identify a gene expression signature able to estimate tumour purity in a bulk transcriptomic analysis of clinical breast cancer samples. As detailed hereafter, to generate the BCPS we interrogated four distinct datasets: Bruna, NA-PHER2, NeoTRIP21 (surgical samples) and Metzger-Filho (Fig. 2 and Table 1).

Fig. 2: BCPS generation.

Workflow involving four distinct datasets leading to the definition of the BCPS. In the NA-PHER2 and Metzger-Filho datasets, the correlation between tumour purity and the expression values of each gene was computed. In the Bruna dataset, primary tumours were compared to matched patient-derived xenografts to identify candidate tumour-specific and stroma-specific genes, exploiting the loss of human stroma during engraftment. In the NeoTRIP dataset, the ROC curve AUC was estimated for each gene considering surgical samples with medium/high or low/no cellularity, as annotated by expert pathologists. By applying for each analysis the indicated thresholds, 5 tumour-associated genes and 4 TME-associated genes were selected to generate the BCPS.

We correlated gene expression with tumour cellularity in the NA-PHER2 (n = 92) dataset, identifying 733 genes with a positive correlation with tumour cellularity (ρ > 0.35) and 1009 genes with a negative correlation (ρ < −0.35). The same analysis was performed in the Metzger-Filho (n = 117) dataset, identifying 49 genes with a positive (ρ > 0.35) and 284 genes with a negative (ρ < −0.35) correlation with cellularity. Next, we considered surgical samples from the NeoTRIP dataset grouped into two categories based on expert pathologist evaluation: high/mid tumour content or low/no tumour content. We assessed the ability of each gene to distinguish between the two classes by ROC curve analysis, identifying 570 genes with AUC ≥ 0.7 and 330 genes with AUC < 0.3. Results from the analysis in Fig. 1 helped us define the BCPS development strategy. Indeed, the association between cellularity and breast cancer subtypes suggested performing the correlation with cellularity separately in datasets including only one subtype (Table 1).

As a complementary strategy, in the Bruna dataset, transcriptomic profiles of patient-derived xenografts (PDXs) were compared with the matched clinical samples from which the xenografts originated. Since the human stroma is completely lost during engraftment and replaced by mouse stroma24, genes expressed by the human tumour microenvironment are expected to be downregulated in this comparison. At the same time, tumour-specific genes are expected to be similarly expressed or upregulated in PDXs. The fold changes for selected genesets, expected to be tumour-specific or stroma-specific, are reported in Supplementary Figure 5A–C as proof of concept. In total, 236 stroma-specific genes (logFC < −1.5) and 5679 tumour-specific genes (logFC > 0) were identified as candidate reporters of tumour content. All the statistical estimates are available in Supplementary Data 1.

From the candidate genes selected in each dataset (Supplementary Fig. 5D–G), we derived a consensus list of 5 tumour-associated (AP1M2, CDK5, PAFAH1B3, SLC25A10, SMG5) and 4 stroma-associated (CXCL12, IFFO1, MFAP4, TGFBR2) genes (Fig. 2). Their main known biological functions are summarised in Supplementary Data 2. This set of genes was used for a single-sample geneset enrichment analysis providing the BCPS, proportional to sample tumour purity, as characterised in the following section.
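
The consensus step amounts to filtering each dataset's per-gene statistic by its threshold and intersecting the resulting gene sets. A minimal sketch for the tumour-associated direction is shown below, using the thresholds quoted in the text (ρ > 0.35, AUC ≥ 0.7, logFC > 0). The gene names and all numeric values are invented for illustration, and the helper names are hypothetical.

```python
def select(stat_by_gene, keep):
    """Return the set of genes whose statistic passes the `keep` test."""
    return {gene for gene, value in stat_by_gene.items() if keep(value)}


def consensus(candidate_sets):
    """Genes present in every candidate set (empty input gives an empty set)."""
    sets = [set(s) for s in candidate_sets]
    return set.intersection(*sets) if sets else set()


# Invented per-dataset statistics for three hypothetical genes.
rho_na_pher2 = {"GENE_A": 0.52, "GENE_B": 0.41, "GENE_C": 0.10}
rho_metzger = {"GENE_A": 0.44, "GENE_B": 0.20, "GENE_C": 0.38}
auc_neotrip = {"GENE_A": 0.81, "GENE_B": 0.74, "GENE_C": 0.55}
logfc_pdx = {"GENE_A": 0.9, "GENE_B": 1.2, "GENE_C": -2.0}

tumour_candidates = consensus([
    select(rho_na_pher2, lambda r: r > 0.35),   # correlation with cellularity
    select(rho_metzger, lambda r: r > 0.35),    # correlation with cellularity
    select(auc_neotrip, lambda a: a >= 0.7),    # high/mid vs low/no content
    select(logfc_pdx, lambda fc: fc > 0),       # PDX vs matched primary
])
```

Only the hypothetical GENE_A passes all four filters here. Requiring agreement across datasets is what makes the final list short relative to the hundreds of candidates found in each individual analysis.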

BCPS evaluation of performance and comparison with ESTIMATE score

To evaluate the ability of the BCPS to estimate tumour sample purity and to compare its performance with the commonly used ESTIMATE score18, we interrogated five additional independent datasets: TCGA, Park, METABRIC, NeoTRIP (on-treatment samples) and Bianchini28 (Fig. 3 and Table 1).

Fig. 3: BCPS validation and comparison with ESTIMATE score.

Evaluation of the BCPS and comparison with ESTIMATE score in the TCGA, Park, NeoTRIP and Bianchini datasets. a Spearman’s correlation between the pathologist-estimated cellularity and either ESTIMATE or BCPS in TCGA (n = 1073). b Same analysis as in a for the Park dataset (n = 225). c ESTIMATE score and BCPS values measured in samples with high/medium tumour content or low/no tumour content in the NeoTRIP dataset (n = 219, on-treatment biopsy); two-sided Student’s t test. d ESTIMATE score and BCPS ability to discriminate between the two classes in c quantified by AUC. e ESTIMATE score and BCPS values measured in core biopsies (CBX) and matched fine-needle aspirations (FNA, n = 37 pairs) from the Bianchini dataset; two-sided Student’s t test. f ESTIMATE Score and BCPS ability to discriminate between the two classes in e quantified by AUC. g Example of KRT18 gene expression correction using the BCPS and linear regression to remove the impact of tumour purity. The Bianchini dataset was used. h Volcano plots of differential gene expression analysis between FNA and CBX samples of the Bianchini dataset. The analysis was performed without any correction and after normalising gene expression using the BCPS or ESTIMATE scores.

In the TCGA and Park datasets, BCPS had a significantly higher correlation with cellularity than ESTIMATE (ρ = 0.34–0.37 vs ρ = 0.23–0.27 respectively; correlation difference p < 0.001) (Fig. 3a, b). In the METABRIC dataset, where cellularity was grouped into low, moderate and high classes, BCPS better discriminated between the groups and AUC for the prediction of high vs low cellularity was 0.75 for the BCPS and 0.62 for ESTIMATE (Supplementary Fig. 6A, B). ESTIMATE was strongly inversely correlated with a gene expression-based immune score29 (ρ = −0.87 and ρ = −0.90 in TCGA and Park, respectively), while cellularity and BCPS showed weaker correlations (Supplementary Fig. 6C, D). In the TCGA dataset, Aran et al.5 computed multiple RNA, DNA and methylation based predictors of tumour purity. We compared their estimates with the BCPS in the set of samples where all the metrics were available. BCPS showed the highest correlation with the pathologist’s cellularity (Supplementary Fig. 6E).

The association with tumour content was also evaluated in the on-treatment biopsies of the NeoTRIP dataset (Fig. 3c, d). Samples were grouped into high/medium tumour content or low/no tumour content, as established by expert pathologists, and the ability of the two scores to discriminate between these two classes was quantified by Student’s t test and ROC curve analysis. BCPS better discriminated between the two groups, achieving a significantly higher AUC (BCPS AUC = 0.77, ESTIMATE AUC = 0.65, p < 0.001).

We then evaluated the BCPS and ESTIMATE score in the Bianchini dataset. This dataset contains matched primary breast cancer samples obtained by either fine-needle aspiration (FNA) or core-biopsy (CBX). The first sampling procedure is known to enrich for tumour cells while CBX better preserves the stromal content. Indeed, both BCPS and ESTIMATE score were higher in the FNA samples (paired Student’s t test p = 1.1 × 10⁻⁸ and p = 9 × 10⁻⁶, respectively). ROC curve analysis highlighted a significantly higher AUC for the BCPS compared to ESTIMATE score (BCPS AUC = 0.85, ESTIMATE AUC = 0.78, p = 0.046) (Fig. 3e, f).

In the Bianchini dataset, differences between matched FNA and CBX are expected to be primarily related to sampling differences affecting tumour content. Consequently, it represents a relevant setting in which to quantify such bias and evaluate the ability of a purity score to correct for it. To this aim, we performed a paired differential analysis between FNA and CBX, either introducing no correction or normalising the data using the BCPS or ESTIMATE score. BCPS-normalized data were obtained by taking the residuals of the linear regression models evaluating the relationship between BCPS and each gene. An example of gene expression before and after correction is shown in Fig. 3g. Differential analysis between FNA and CBX before correction identified 60 up-regulated genes and 409 down-regulated genes. Data correction using the ESTIMATE score reduced the number of differentially expressed genes to 89 up-regulated and 160 down-regulated, but only 40 up-regulated and 65 down-regulated genes were observed in data corrected using the BCPS, showing the higher performance of the BCPS in adjusting for differences due to tumour content in clinical samples (Fig. 3h). This analysis indicated that, in all situations where extrinsic factors are expected to largely outweigh intrinsic factors in affecting tumour purity, the BCPS is a useful and effective tool to take tumour purity into consideration and correct for it.
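
The residual-based correction can be sketched for a single gene as follows. The example assumes an ordinary least-squares fit of the gene's expression on the purity score with an intercept, which is one straightforward reading of the procedure described above; the function name and the toy values are hypothetical.

```python
def regress_out(gene_expression, purity_score):
    """Return residuals of a least-squares fit of expression on purity.

    Both arguments are equal-length sequences with one value per sample.
    The residuals are the part of the expression not linearly explained
    by the purity score.
    """
    n = len(gene_expression)
    if n != len(purity_score) or n < 2:
        raise ValueError("need two equal-length sequences (at least 2 samples)")
    mean_x = sum(purity_score) / n
    mean_y = sum(gene_expression) / n
    sxx = sum((x - mean_x) ** 2 for x in purity_score)
    if sxx == 0:
        raise ValueError("purity score is constant; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(purity_score, gene_expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x)
            for x, y in zip(purity_score, gene_expression)]


# Toy example (invented values): expression rises with the purity score.
purity = [0.1, 0.4, 0.5, 0.9]
expression = [1.0, 2.5, 2.0, 4.0]
corrected = regress_out(expression, purity)
```

By construction the residuals have zero mean and no remaining linear association with the purity score, so differences that persist after correction cannot be attributed to a linear effect of tumour content.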

BCPS recapitulates cellularity associations with clinico-pathological factors

For the TCGA and Park datasets, not used for the BCPS development, we evaluated the association with available clinico-pathological factors, as reported for the pathologist cellularity in Fig. 1.

BCPS was significantly associated with the same variables that were significantly associated with cellularity in both the TCGA and Park datasets (Figs. 4a and 1a). VCA analysis was quantitatively different but qualitatively similar (Figs. 4b and 1b), and the BCPS was significantly lower in on-treatment compared to pre-treatment samples in the Park dataset, as observed for the pathologist’s cellularity (Figs. 4d and 1d). Finally, the BCPS was not significantly associated with survival in any subtype in the TCGA dataset (Figs. 4c and 1c).

Fig. 4: Association of the BCPS with clinico-pathological factors in the TCGA and Park datasets.

a Landscape of association of available molecular and clinico-pathological variables with cellularity in the TCGA (n = 1073) and Park (n = 112) datasets. (* association between purity and continuous variables was assessed by Spearman’s correlation, association with two categorical groups was assessed by Student’s t test, and association with multiple categorical groups was evaluated by one-way ANOVA). b Variance component analysis (VCA) for each dataset computed for samples with no missing information (TCGA, n = 690; Park, n = 72). The analysis estimated the proportion of total variance explained by the provided variables. c Forest plot of Cox regression univariate analysis evaluating cellularity association with overall survival in the TCGA (n = 1082) dataset. Samples were evaluated overall and stratified by subtype (426 Luminal, 162 HER2+, 113 TN). d Cellularity changes in on-treatment biopsy compared to pre-treatment in the Park dataset (T1 = 112, T2 = 86). The impact of the timepoint on tumour purity was evaluated by Student’s t test and VCA.

While such associations would require external validation, these results support the validity of the BCPS in providing an estimate of tumour purity leading to conclusions similar to what could be drawn using the pathologist’s evaluation.

Use of the BCPS for prediction of prognosis and response to treatment

A typical goal in transcriptomic data analysis of clinical samples is the identification of genes or signatures associated with specific clinical endpoints. Quantification of expression could be affected by tumour content7. We previously developed and validated ER-related and proliferation-related metagenes as predictors of long-term outcome in ER+/HER2− breast cancer13. Here we applied the metagenes to 2277 ER+/HER2− breast cancer samples from the Brueffer dataset (Table 1). A multivariate Cox model with interactions, including the ER and proliferation metagenes and the BCPS, explained the survival data better than a bivariate model without the BCPS (likelihood ratio test p = 0.035). The improvement was confirmed by a higher c-index and a higher 7-year AUC when the BCPS was included in the model (Fig. 5a, b).

Fig. 5: Use of the BCPS in breast cancer prognostication.

a, b In ER+/HER2− samples from the Brueffer dataset (n = 2277) 7-year overall survival was predicted using a multivariate Cox model with interactions including an ER and a Proliferation metagene with or without the BCPS. C-index (a) and 7-year AUC (b) were computed for the two models, highlighting a performance improvement when the BCPS was included. c Association of the BCPS with pCR in on-treatment biopsies from the NeoTRIP dataset (n = 219). d BCPS quantified in the surgical samples of the NeoSPHERE trial was associated with DEFS. Two groups based on the BCPS median were identified and represented by Kaplan–Meier curves; differences were evaluated by log-rank test.

Next, we focused on BCPS values in on-treatment biopsies of the NeoTRIP dataset, where the BCPS was significantly lower (p = 1.3 × 10⁻³) in cases eventually achieving pathological complete response or pCR (Fig. 5c). This indicates that, despite a possible sampling bias, evaluation of cellularity in on-treatment biopsies could help the early identification of patients who will or will not respond to neoadjuvant treatments. Remarkably, this mimicked the predictive power of the pathologist’s evaluation in the same cohort, where 66.3% of patients with low/no cellularity had a pCR, versus only 35.3% of patients with mid/high cellularity in the on-treatment biopsy (p = 1.8 × 10⁻⁵).

A third example where evaluation of the BCPS could provide valuable information is in surgical samples obtained after neoadjuvant treatment. Cellularity is one of the factors determining the Residual Cancer Burden (RCB) score, to which it contributes positively30. High RCB is associated with worse prognosis. Here we evaluated the BCPS in post-treatment surgical samples from the neoadjuvant clinical trial NeoSPHERE22. In line with the prognostic role of pathologist’s cellularity as part of the RCB score, the BCPS significantly stratified long-term patient risk by quantifying the amount of residual tumour left. The association remained significant whether identifying 4 groups based on BCPS distribution quartiles (Supplementary Fig. 7B) or two groups using the median value (Fig. 5d).

Discussion

Multiple cell types constitute the cancer ecosystem and the prevalence of each cell type can be influenced by intrinsic and extrinsic factors. Prevalence of neoplastic cells, or tumour cellularity, is typically estimated by the pathologist on histological tissue sections. In four distinct breast cancer datasets we found significant associations between pathologist-estimated cellularity and clinico-pathological features. We observed the highest cellularity in Grade 3 and ER-negative breast cancers, and the lowest in low grade luminal A tumours. Additionally, cellularity was affected by the histological type, higher in Invasive Ductal Carcinoma (IDC) compared to Lobular but also variable between distinct Lobular subtypes. A weak negative correlation between cellularity and sTILs was observed in the NA-PHER2 trial. Overall, this confirmed that intrinsic factors have an impact on tumour cellularity in breast cancer, in line with what was previously reported in other cancer types3,4. However, VCA analysis indicated that more than half of the variance is not explained by the investigated factors, supporting the idea that extrinsic factors, including sampling bias, can affect specimen cellularity.

Bulk transcriptomic data analysis and interpretation is directly affected by the tumour content in the specimen7. Depending on the specific context and aims of the analysis, if extrinsic factors are expected to prevail, it is advantageous to consider tumour cellularity in data modelling.

Tumour cellularity evaluated by the pathologist is not always available and a low concordance has been reported between pathologist estimation and estimation based on DNA or RNA profiling data6. This could be partially explained by spatial variability between the sample evaluated by the pathologist and the sample or sections undergoing nucleic acid extraction and quantification; an aspect particularly relevant in small core biopsies. Moreover, the pathologist quantification is partially subjective and qualitative, with a significant interobserver variability being previously reported6,15.

Transcriptomics-based approaches to estimate tumour purity have been proposed. The most widely used is the ESTIMATE score by Yoshihara and colleagues18. It merges a stroma and an immune signature to estimate the relative abundance of neoplastic cells and TME. It was developed to have pan-cancer validity, but it is overly related to immune infiltration and does not include tumour-specific genes. Quon and colleagues developed ISOPure31, a gene expression method using normal and tumour profiles to correct the latter for normal contamination. ISOPure essentially assumes that the TME resembles the originating normal tissue, but it is well documented that this is not the case, with stroma, immune and endothelial cells dramatically altering their phenotypes as a result of the interplay with neoplastic cells1.

The BCPS was developed by interrogating large sample cohorts to identify the best reporter genes in a context-specific data-driven way. We covered all breast cancer subtypes, aiming at including tumour specific genes not primarily affected by the tumour subtype. We also included data obtained by different platforms; consequently, we believe that the selected consensus genes will perform well independently of the technology used.

In five independent datasets we could validate the performance of the BCPS as a reporter of tumour purity. The BCPS systematically outperformed the ESTIMATE score18. It had significantly higher correlation with pathologist’s cellularity than ESTIMATE, significantly better identified samples with low/no tumour cells and better controlled for the sampling bias in matched CBX and FNA samples. In multiple datasets, ESTIMATE showed a very strong negative correlation (ρ ≤ −0.87) with a signature estimating the immune infiltration. This could partially explain its lower performance, due to a limited ability to estimate low cellularity in cold tumours.

In the TCGA dataset, we did reproduce ESTIMATE correlation values reported by Aran et al.5, and the BCPS correlation with the pathologist cellularity was the highest when considering the other purity estimation metrics reported in the study (i.e. ABSOLUTE9 and LUMP5 based on genomics and methylation data, respectively). Consequently, when multiple omics data are available for the same sample, BCPS could be preferred to estimate tumour purity, for example informing a parallel genomic analysis. Correlation with the pathologist estimation was still moderate, but in line with previous reports for the reasons outlined above.

The BCPS is a simple and easy-to-compute score, proportional to the quantity of neoplastic cells in a clinical breast cancer specimen. As a reporter of tumour purity, the BCPS could be informative to study intrinsic factors affecting the tumour content, but it can also be used to control for tumour purity in the analysis and interpretation of bulk transcriptomic data. We showed two strategies to include the BCPS in the analyses. The first is to directly adjust genes or signatures for the BCPS-estimated effect of tumour purity on their expression levels. This strategy was effective in substantially reducing the bias between matched CBX and FNA samples, with FNA known to be enriched in tumour cells compared to CBX. Of note, after adjusting for tumour purity, the expression levels of tumour-specific genes or signatures are interpretable as the expression levels in the tumour compartment. However, TME marker genes and signatures, e.g. immune-related, are often interpreted as a proxy for the prevalence of a specific TME cell type. In this case, adjusting for tumour content could lead to capturing cell density rather than absolute quantity, possibly providing complementary information.

A second strategy is to include the BCPS in data modelling, for example in the fitting of survival models. In this context, the other included variables could be, to some extent, biologically related to tumour purity. This was the case when using ER-related and proliferation-related metagenes to predict survival in ER+/HER2− breast cancer. ER expression is associated with tumour purity, either estimated by the pathologist or using the BCPS. Similarly, the higher tumour purity in Grade 3 and triple-negative tumours, as well as the higher tumour purity in luminal B compared to luminal A tumours, suggest a biological link between proliferation and tumour content. In this case, including the BCPS as a covariate in the model was advantageous compared to a direct BCPS adjustment of the metagenes.

Finally, the BCPS has potential as a prognostic or predictive biomarker in specific contexts. In the NeoTRIP neoadjuvant trial, BCPS-estimated low tumour content in on-treatment biopsies was associated with the achievement of pCR at the end of treatment. We also found in the NeoSPHERE trial that the BCPS-estimated cellularity after neoadjuvant treatment of HER2+ breast cancer was associated with long-term outcome, mimicking the prognostic role of pathologist-estimated cellularity, which is one of the key components of the highly prognostic Residual Cancer Burden30.

A possible limitation of the BCPS is related to the fact that pathologist’s cellularity itself is not a perfect gold standard, since high inter-pathologist variability has been reported. To mitigate this, we based the BCPS development on the mining of multiple datasets, where the cellularity was evaluated by independent pathologists and we added an orthogonal strategy based on matched tumour-PDX dataset, eventually identifying the consensus genes associated with tumour content despite all sources of variability. Moreover, gene expression could be affected by external factors, e.g. treatment, potentially introducing unwanted variability to the BCPS. However, this is likely to affect only a subset of the genes in the BCPS, leading to a limited impact on the overall score.

In this study, we developed and validated a straightforward tool to estimate tumour content from bulk transcriptomic breast cancer data, useful to explore the role of tumour purity, aid data interpretation and improve prognostication. The framework presented here could be successfully applied to other cancer types.