Introduction

Tumor-Infiltrating Lymphocytes (TILs) are a type of immune cell that can infiltrate the tumor microenvironment. TILs have been recognized as an important biomarker in several types of cancer, including breast cancer, due to their favorable correlation with treatment response and patient survival1,2,3,4. TIL assessment typically consists of several steps, including but not limited to the identification of regions of interest (ROIs), such as tumor/stroma regions within the tissue section, and the identification of TIL nuclei within the ROIs5,6. Although best practices are in place for TIL assessment5,6, the manual assessment of TILs on large tissue sections is a laborious and time-consuming task prone to inter-observer variability and ambiguity6,7,8. Therefore, there is growing interest in using Deep Neural Networks (DNNs) for TIL analysis in whole-slide images (WSIs)9,10.

Existing DNNs for TIL analysis divide the WSI, typically of gigapixel magnitude, into relatively small patches several hundred pixels in width and height and rely on extensive patch-level human annotations9,11,12,13,14. This tedious and expensive manual task requires domain expertise and is typically performed by pathologists15,16. Moreover, dividing the WSI partially destroys the spatial arrangement and contextual information available in the full WSI that TIL analysis requires. In contrast, our goal is to minimize manual supervision by requiring only slide-level annotations while still utilizing the spatial and contextual information present in the WSIs for TIL analysis.

Some existing methods that rely solely on slide-level labels use a pre-trained feature extractor to compress the information within the WSIs. These methods extract a set of feature embeddings from the WSIs before learning a downstream task, such as predicting the presence of metastasis or cancer sub-typing15,17,18,19,20,21. There are two different strategies used in these methods. In both strategies, feature embeddings of the patches across the WSI are extracted15,16,17,18,19. The first strategy considers the spatial arrangement of patches: it keeps track of the original patch coordinate information and uses the information to learn the downstream task18,19. The second strategy does not consider the spatial arrangement: it treats the patches as a permutable set of patches and uses only the features extracted for each patch while learning the downstream task15,17. However, to the best of our knowledge, there has been no prior work on using DNNs exclusively based on slide-level TIL labels. Taking these into consideration, we make three contributions:

  1. We first adapt and apply two well-known methods that follow the above two strategies to our problem of TIL analysis.

  2. We then propose a modification to one of these methods and show the resulting performance improvement.

  3. Finally, we highlight the limitations of both strategies for TIL analysis and propose a strategy and a pipeline that address these limitations, which is our main contribution.

The first method we investigate is Neural Image Compression (NIC) and its associated convolutional neural network (CNN)18,19,22. NIC encodes each patch within a WSI using a pre-trained feature extractor and arranges the encodings into a 3D grid while retaining the original patch coordinates. The generated 3D grids, along with their corresponding slide-level labels, are then used to train a CNN for either classification or regression. Here, NIC only performs a compression of the gigapixel WSIs, while the CNN attempts to learn the downstream task using the compressed WSI representations. We designed a CNN architecture, named \({C}_{\theta }\), to handle the sizes of WSIs in our datasets; the combined pipeline of NIC and the CNN (\({C}_{\theta }\)) is referred to as NIC-CNN. NIC can manage gigapixel-sized WSIs while maintaining the spatial and contextual information and has shown promising results on several tasks, including predicting the presence of metastasis and tumor proliferation speed18,19,22. Although NIC accounts for the spatial arrangement of patches, NIC-CNN does not achieve similarly good performance in TIL-based classification. Manual TIL evaluation requires the identification of ROIs (i.e., tumor/stroma regions) before identifying TIL cells. We believe NIC-CNN struggles in TIL classification because it lacks the capacity to infer and locate ROIs using only the slide-level TIL labels of a small number of training examples.

The second method we investigate is clustering-constrained-attention multiple-instance learning (CLAM)17, which has an inherent disadvantage in that it ignores the spatial arrangement of patches and their coordinate information. CLAM is based on the multiple-instance learning (MIL) paradigm15,17, in which the patches of a WSI are treated as a permutable set. In the conventional MIL formulation, a WSI is labeled positive if at least one patch is positive and negative if all patches are negative17. CLAM extends MIL to multi-class classification using clustering and attention, where features extracted from patches serve as a prior for learning the attention scores. It has reported high performance in the subtyping of renal cell carcinoma and non-small-cell lung cancer as well as in the detection of lymph node metastasis17. Despite these strengths, the MIL-based approach in CLAM may not transfer efficiently to TIL analysis in WSIs due to three main limitations. Firstly, similar to NIC-CNN, CLAM does not identify the tumor/stroma regions before identifying TIL regions. Secondly, unlike NIC-CNN, CLAM does not account for the spatial arrangement of the patches. Thirdly, the feature extractor used in CLAM has been pre-trained on an unrelated natural image dataset (ImageNet23), which may limit its ability to extract histologically relevant information from patches24.

We address CLAM’s third limitation by enhancing its feature extractor through pre-training on histology images, which is our second contribution. This modification uses momentum-based contrastive learning (MoCo)25, a form of self-supervised learning (SSL), to train a feature extractor, \({E}_{\theta }\), on histology images instead of natural images. By using histology images for MoCo training, we aim to capture histologically relevant information more effectively24,26. We modify the original CLAM architecture to incorporate this change, resulting in the “CLAM-MoCo” architecture. However, the first and second limitations of CLAM remain.

We propose the annotation-efficient segmentation and attention-based classifier (ANSAC) framework illustrated in Fig. 1, which overcomes the limitations in the above methods as described below:

  1. Unlike NIC-CNN, CLAM, and CLAM-MoCo, ANSAC employs an auxiliary, pre-trained segmentation module13 (Fig. 1a-vi) to identify ROIs in the WSI, without requiring explicit manual marking or annotation of the ROIs.

  2. Unlike NIC-CNN, ANSAC uses an attention model to identify the specific areas of high importance within the ROIs for the classification task.

  3. Unlike CLAM and CLAM-MoCo, ANSAC accounts for the contiguity of patches mapped into the ROIs where TILs are scored.

  4. Unlike CLAM, ANSAC uses a histologically relevant MoCo-based feature extractor (Fig. 1a-ii) to extract features from WSI patches.

Fig. 1: Overview of the ANSAC pipeline.

a Pre-processing: the WSI \(w\) of height \(M\) and width \(N\) is divided into patches \({p}_{i,j}\) of height and width \(R\), which are mapped to low-dimensional feature embedding vectors \({p}_{{enc}(i,j)}\) of size \(1\times C\) using a pre-trained feature extractor \({E}_{\theta }\) (top branch) and to low-dimensional segmentation embedding vectors \({p}_{{seg}(i,j)}\) of size \(D\times Z\) using a pre-trained segmentation model \({S}_{\theta }\) (bottom branch). The extracted \({p}_{{enc}(i,j)}\) and \({p}_{{seg}(i,j)}\) vectors are placed into two 3D grids preserving their original coordinates, resulting in a feature-compressed representation \({w}_{{enc}}\) of size \(M/R\times N/R\times C\) and a segmentation-compressed representation \({w}_{{seg}}\) of size \(M/R\times N/R\times D\times Z\), respectively. b Model training: \({w}_{{seg}}\) is used to train an attention model \({A}_{\theta }\) to generate an attention map, \({w}_{{attn}}\). \({w}_{{attn}}\) is then combined with \({w}_{{enc}}\) from (a-iv) to obtain a weighted, compressed feature representation of the WSI, \({w}_{{weighted}}\), which is processed through a classification network \({C}_{\theta }\) to predict a binary class label. The default values used in this study are \(M=N=20480\), \(R=256\), \(C=128\), \(D=6\), and \(Z=576\), where \(M\) and \(N\) are the dimensions of the gigapixel image, \(R\) is the size of the square patches, \(C\) is the size of the feature embedding vectors, \(D\) is the number of regions identified by the segmentation model, and \(Z\) is the size of the segmentation embedding vectors.

Importantly, in comparison to existing computational methods that specifically focus on TIL analysis, ANSAC makes several unique contributions. Firstly, ANSAC represents a notable advancement in annotation efficiency. Unlike existing TIL methods that require manual annotations of TIL nuclei or regions at the patch level9,11,12,27, ANSAC streamlines the annotation process by requiring manual annotation only at the slide level. This distinction greatly reduces the need for patch-level annotations, making the training process more scalable for large datasets. Secondly, ANSAC is the first computational TIL algorithm to use attention-based mechanisms for distinguishing TIL-relevant ROIs. This approach allows the model to focus on the specific areas within the slide that are most indicative of TIL presence, enhancing the accuracy and interpretability of the analysis. Finally, in contrast to existing methods that ignore spatial information within the WSI9,11, ANSAC explicitly accounts for the spatial relationships between patches when predicting the final slide label. Overall, ANSAC stands out in the field of TIL analysis for its annotation efficiency, attention-based mechanisms, and integration of spatial information.

One of the key features of ANSAC is its segmentation module. This module makes ANSAC annotation-efficient by requiring manual input only at the slide level to learn the classification task. The segmentation module has been trained to identify \(D=6\) ROIs, namely tumor, stroma, lymphocyte infiltrates, necrosis, other (rare) regions, and background13. ANSAC uses an intermediate layer of the module to produce a segmentation prediction for an input image. This prediction consists of \(D\) maps, each corresponding to a specific ROI identified by the segmentation module. Unlike CLAM, which uses patch features for learning attention, ANSAC uses the output of the segmentation module as a prior for learning attention. As the reported results show, ANSAC’s use of segmentation-based priors is beneficial in tasks such as TIL-based classification, where specific regions (i.e., tumor/stroma regions) must be identified before determining the final prediction.

The pre-processing stage in ANSAC (Fig. 1a) first divides the WSI \(w\) into patches \({p}_{i,j}\) (Fig. 1a-i) and processes the patches through a pre-trained feature extractor \({E}_{\theta }\) (Fig. 1a-ii), resulting in low-dimensional feature embeddings denoted \({p}_{{enc}(i,j)}\) (Fig. 1a-iii). These embeddings are then arranged into a multi-dimensional array that preserves the original spatial layout of the patches, yielding a feature-compressed representation of the WSI denoted \({w}_{{enc}}\) (Fig. 1a-iv). To obtain histologically rich feature embeddings, we use the same feature extractor \({E}_{\theta }\) introduced for CLAM-MoCo as ANSAC’s feature extractor. In parallel with the feature compression, ANSAC also processes the patches through the pre-trained segmentation model13 \({S}_{\theta }\) (Fig. 1a-vi → a-vii → a-viii) to obtain a segmentation-compressed representation of the WSI denoted \({w}_{{seg}}\).

The model training stage in ANSAC (Fig. 1b) uses the feature- and segmentation-compressed representations of the WSI, i.e., \({w}_{{enc}}\) (Fig. 1b-i) and \({w}_{{seg}}\) (Fig. 1b-ii), to learn the classification task. \({w}_{{seg}}\) is split into \(D\) separate region-specific maps and input into the attention model \({A}_{\theta }\) (Fig. 1b-iii), which comprises \(D\) parallel branches, one for each ROI identified by the segmentation model. The weights from the \(D\) branches are summed and normalized for each patch, resulting in a final attention weight. The attention weight of each patch is then arranged into a 2-dimensional grid at the corresponding original patch coordinates to obtain the attention map denoted \({w}_{{attn}}\) (Fig. 1b-iv). This map is combined with \({w}_{{enc}}\), creating a weighted, compressed WSI representation denoted \({w}_{{weighted}}\) (Fig. 1b-v). Finally, \({w}_{{weighted}}\) is input into a CNN (Fig. 1b-vi), which predicts a binary class label. For a fair comparison, we use the same CNN architecture \({C}_{\theta }\) that was used in the NIC-CNN method. A minimal sketch of this data flow is shown below.
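
The following is a minimal PyTorch sketch of this stage under the default dimensions; the channels-first tensor layout and the module stand-ins (`attention_model` for \({A}_{\theta }\), `classifier` for \({C}_{\theta }\)) are illustrative assumptions, not the released implementation.

```python
# Hedged sketch of the ANSAC training-stage data flow (Fig. 1b), assuming a
# channels-first PyTorch layout; A_theta and C_theta are detailed in Methods.
import torch

G, C, D, Z = 80, 128, 6, 576          # grid size M/R = N/R = 80

w_enc = torch.randn(1, C, G, G)       # feature-compressed WSI (Fig. 1b-i)
w_seg = torch.randn(1, D, Z, G, G)    # segmentation-compressed WSI (Fig. 1b-ii)

def training_step(w_enc, w_seg, attention_model, classifier):
    w_attn = attention_model(w_seg)   # (1, 1, G, G) attention map in [0, 1]
    w_weighted = w_enc * w_attn       # element-wise weighting (Fig. 1b-v)
    return classifier(w_weighted)     # binary logit (Fig. 1b-vi)
```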

We evaluate the performance of ANSAC against NIC-CNN, CLAM, and CLAM-MoCo for the classification of high vs. low TIL scores in WSIs using four breast cancer WSI datasets: CLEOPATRA28 \(({D}_{{cleo}})\), FinHER29 \(({D}_{{fin}})\), The Cancer Genome Atlas Breast Invasive Carcinoma cohort (TCGA-BRCA) \(\left({D}_{{tcga}}\right)\), and TIGER (\({D}_{{tiger}}\)), where \({D}_{{cleo}}\) and \({D}_{{fin}}\) are in-house datasets while \({D}_{{tcga}}\) and \({D}_{{tiger}}\) are publicly available. We use the \({D}_{{cleo}}\), \({D}_{{fin}}\), and \({D}_{{tcga}}\) datasets for model training, validation, and testing, while \({D}_{{tiger}}\) is used as an independent test-only dataset. It is worth noting that the NIC-CNN pipeline consists of the steps in Fig. 1b-i → b-vi → b-vii. We observe that ANSAC outperforms NIC-CNN, CLAM, and CLAM-MoCo across most datasets and that CLAM-MoCo is the second best. Moreover, in cases where segmentation labels are unavailable, CLAM-MoCo may offer a viable alternative with performance comparable to ANSAC.

Results

Datasets

Tissue images are commonly preserved for analysis using either Formalin Fixation Paraffin Embedding (FFPE) or Fresh Freezing (FF). FFPE enables long-term storage and accurate interpretation of tissue morphology, while FF is prone to introducing misleading histological artefacts that can hinder accurate interpretation by pathologists30. However, FF slides are still popular for clinical decision making due to their fast acquisition time (~15 min) compared to FFPE’s (~36 h)30. Datasets \({D}_{{cleo}}\), \({D}_{{fin}}\), and \({D}_{{tiger}}\) comprise FFPE-preserved tissues, while \({D}_{{tcga}}\) contains FF tissues (refer to Supplementary Table 1 for more information). To evaluate model performance on mixtures of tissues, we also consider two combinations of the individual datasets: \({D}_{{mix\_FFPE}}\), which contains only FFPE tissues, and \({D}_{{mix\_all}}\), which contains both FFPE and FF tissues (refer to Table 1). Each individual and combined dataset consists of a training set for model training (\({D}_{i}^{{train}}\)), a validation set for model selection (\({D}_{i}^{{valid}}\)), and a testing set for evaluation (\({D}_{i}^{{test}}\)). We first divided the \({D}_{{cleo}}\), \({D}_{{fin}}\), and \({D}_{{tcga}}\) datasets into training, validation, and testing splits following a \(60-20-20\) proportion (refer to Supplementary Table 2 for more information). The training, validation, and testing splits of the two combined datasets were then formed by concatenating the corresponding splits of the individual datasets.

Table 1 Number of WSIs present in the individual (2nd–4th columns), combined (5th–6th columns), and independent test (7th column) datasets, where \({D}_{{cleo}}\), \({D}_{{fin}}\), \({D}_{{tcga}}\), and \({D}_{{tiger}}\) correspond to the CLEOPATRA28, FinHER29, The Cancer Genome Atlas Breast Invasive Carcinoma cohort (TCGA-BRCA), and TIGER datasets, while \({D}_{{mix\_FFPE}}\) and \({D}_{{mix\_all}}\) correspond to the combined datasets

We report classification performance as mean ± standard deviation across ten runs with different weight initializations, using the Area under the Receiver Operating Characteristic Curve (AUC) metric9,10; additional metrics such as Sensitivity, Specificity, and Accuracy are reported in Supplementary Tables 3–12. Lastly, we generate heatmaps from an intermediate layer of the trained models to visually compare their ability to identify TIL-relevant regions, such as tumor and stroma regions, during the classification task.

Comparison of ANSAC against other methods

The results of the four evaluated methods on the five individual and combined test datasets (2nd–6th columns in Table 1) are presented in Table 2. For each method, we trained one model on the corresponding training dataset \(({D}_{i}^{{train}})\), validated it on the corresponding validation dataset \(({D}_{i}^{{valid}})\), and evaluated it on the corresponding testing dataset \(({D}_{i}^{{test}})\).

Table 2 AUC scores reported for each model trained on \({D}_{i}^{{train}}\) and evaluated on \({D}_{i}^{{test}}\), where \(i={cleo},{fin},{tcga},{mix\_FFPE},{and}\) \({mix\_all}\)

We first compare ANSAC with NIC-CNN, where ANSAC can be seen as an extension of NIC-CNN that incorporates segmentation information. The effectiveness of ANSAC’s segmentation-map-enabled attention model is demonstrated in Table 2, where ANSAC outperformed NIC-CNN on all individual and combined datasets except \({D}_{{tcga}}^{{test}}\). ANSAC’s slightly lower performance on the \({D}_{{tcga}}^{{test}}\) dataset likely reflects a limitation of ANSAC’s segmentation module, which was trained exclusively on FFPE data, whereas \({D}_{{tcga}}^{{test}}\) contains only FF tissues (see Methods).

Next, we compare ANSAC with CLAM. ANSAC consistently outperformed CLAM, by \(8.33 \%\) on \({D}_{{cleo}}^{{test}}\), \(1.83 \%\) on \({D}_{{fin}}^{{test}}\), \(6.47 \%\) on \({D}_{{tcga}}^{{test}}\), \(1.26 \%\) on \({D}_{{mix\_FFPE}}^{{test}}\), and \(2.08 \%\) on \({D}_{{mix\_all}}^{{test}}\). This outcome supports our assumption that spatial and segmentation information is essential for the task at hand. We also observe that CLAM-MoCo outperformed CLAM across all test datasets, emphasizing the advantage of histology-based feature extractors in digital pathology. Further, CLAM-MoCo consistently outperformed NIC-CNN, demonstrating the importance of attention-based region identification in classification.

Finally, we compare ANSAC and CLAM-MoCo to validate the importance of adding segmentation information to the attention calculation. ANSAC showed clearly superior performance over CLAM-MoCo on the two mixed datasets (\({D}_{{mix\_FFPE}}^{{test}}\) and \({D}_{{mix\_all}}^{{test}}\)). Among the two FFPE datasets, ANSAC performed best on \({D}_{{cleo}}^{{test}}\) and achieved competitive, second-best performance on \({D}_{{fin}}^{{test}}\). However, ANSAC’s performance on \({D}_{{tcga}}^{{test}}\) was lower than CLAM-MoCo’s, consistent with our earlier observation that ANSAC’s segmentation module specializes in FFPE data. Nonetheless, the result ANSAC reports for the FF tissue sections in \({D}_{{tcga}}^{{test}}\) (~85% AUC) is still promising and could be improved with more appropriate training of the segmentation module.

Performance on an independent test dataset

The trained models from Table 2 are used to obtain classification predictions on an independent test dataset, \({D}_{{tiger}}\). Since models trained on larger and more diverse datasets have a better opportunity to generalize, we expect the models trained on \({D}_{{mix\_FFPE}}^{{train}}\) and \({D}_{{mix\_all}}^{{train}}\) to perform better on \({D}_{{tiger}}\). The results in Table 3 verify this expectation. ANSAC demonstrated superior performance compared to the other methods across both combined datasets, indicating its robustness in handling WSIs from new data sources without fine-tuning. The models trained on the three individual datasets generally underperformed on \({D}_{{tiger}}\), with only two exceptions (ANSAC trained on \({D}_{{cleo}}^{{train}}\) and CLAM-MoCo trained on \({D}_{{fin}}^{{train}}\)), likely due to their limited generalizability compared to the models trained on the combined datasets. Notably, the two exceptions achieve reasonable AUCs only because of skewed predictions, as shown in the metrics reported in Supplementary Tables 8 and 9.

Table 3 AUC scores reported for \({D}_{{tiger}}\) from models trained on \({D}_{i}^{{train}}\), where \(i={cleo},{fin},{tcga},{mix\_FFPE}\), and \({mix\_all}\). \({D}_{{cleo}}^{{train}}\), \({D}_{{fin}}^{{train}}\), and \({D}_{{tcga}}^{{train}}\) correspond to the CLEOPATRA28, FinHER29, and The Cancer Genome Atlas Breast Invasive Carcinoma cohort (TCGA-BRCA) train datasets, while \({D}_{{mix\_FFPE}}^{{train}}\) and \({D}_{{mix\_all}}^{{train}}\) correspond to the combined train datasets

ANSAC performance on FFPE vs FF tissues

While FFPE slides are more frequently used in reported studies17,18,19, the impact of including both FFPE and FF tissues in model training is yet to be explored for TIL-based classification. Therefore, in Table 4 we investigate the robustness of the most promising method, ANSAC, trained on \({D}_{{mix\_FFPE}}^{{train}}\) (FFPE data only), \({D}_{{tcga}}^{{train}}\) (FF data only), and \({D}_{{mix\_all}}^{{train}}\) (both FFPE and FF data), evaluated on the corresponding test datasets \({D}_{{mix\_FFPE}}^{{test}}\), \({D}_{{tcga}}^{{test}}\), and \({D}_{{mix\_all}}^{{test}}\) using AUC scores. Additionally, to quantitatively compare the overall performance of the three models across the three testing datasets, we use the rank and average rank metrics31. The rank summarizes the mean AUC a model reports on each test dataset: a model receives rank 1 if it has the highest AUC, rank 2 if it has the second highest, and so on. The average rank is the mean of the individual ranks across the three test datasets (the lower the average rank, the better the model).

Table 4 AUC metrics of ANSAC models on \({D}_{{mix\_FFPE}}^{{test}}\) (FFPE data only), \({D}_{{tcga}}^{{test}}\) (FF data only), and \({D}_{{mix\_all}}^{{test}}\) (both FFPE and FF data), where \({D}_{{mix\_all}}^{{test}}\) = \({D}_{{mix\_FFPE}}^{{test}}+{D}_{{tcga}}^{{test}}\). \({D}_{{tcga}}^{{test}}\) corresponds to The Cancer Genome Atlas Breast Invasive Carcinoma cohort (TCGA-BRCA) test dataset, while \({D}_{{mix\_FFPE}}^{{test}}\) and \({D}_{{mix\_all}}^{{test}}\) correspond to the combined test datasets

As expected, the model trained on \({D}_{{mix\_FFPE}}^{{train}}\) performs relatively better on \({D}_{{mix\_FFPE}}^{{test}}\) than on \({D}_{{tcga}}^{{test}}\), as its training specializes on FFPE slides. Similarly, the model trained on \({D}_{{tcga}}^{{train}}\) performs relatively better on \({D}_{{tcga}}^{{test}}\) than on \({D}_{{mix\_FFPE}}^{{test}}\), since it was trained only on FF slides. Both models exhibit lower performance on tissue types not encountered during their respective training. In contrast, the model trained on \({D}_{{mix\_all}}^{{train}}\) (the largest training dataset, including both FFPE and FF slides) consistently performed best on \({D}_{{mix\_FFPE}}^{{test}}\), \({D}_{{tcga}}^{{test}}\), and \({D}_{{mix\_all}}^{{test}}\). As shown in Table 4, the model trained on \({D}_{{mix\_all}}^{{train}}\) achieves the best average rank, while the models trained on \({D}_{{mix\_FFPE}}^{{train}}\) and \({D}_{{tcga}}^{{train}}\) achieve the second and third best average ranks, respectively. While it was anticipated that the model trained on \({D}_{{mix\_all}}^{{train}}\) would perform well on \({D}_{{mix\_all}}^{{test}}\), it is noteworthy that it also outperformed the models trained on \({D}_{{mix\_FFPE}}^{{train}}\) and \({D}_{{tcga}}^{{train}}\) on their respective test datasets. This finding emphasizes the effectiveness of incorporating both FFPE and FF tissues for more robust and generalized model training.

Interpretability and whole-slide attention visualization

The lack of interpretability and explainability of DNNs can lead to mistrust and hinder their adoption in real-world settings10,32,33. Therefore, recent work attempts to visualize and interpret the output of intermediate layers of DNNs to provide clear explanations of model predictions17,34. One such approach uses heatmaps or class activation maps, constructed by overlaying an intermediate output of the DNN on the original image17,34. These heatmaps highlight regions of the image that are highly activated and thus contribute more towards the class predictions.

Each of the three attention-based methods, CLAM, CLAM-MoCo, and ANSAC, includes an attention model in its pipeline. To visualize and verify the output of the attention models, we generated heatmaps of the attention values predicted by the three models trained on our largest training dataset, \({D}_{{mix\_all}}^{{train}}\), and compared them against manual ROI annotations17. Figure 2 shows the heatmaps of two WSIs with low TIL infiltration, while Fig. 3 shows the heatmaps of two WSIs with high TIL infiltration. In both Figs. 2 and 3, the top row shows the original WSI with manual ROI annotations provided by an expert breast pathologist, while the next three rows depict the heatmaps generated by CLAM, CLAM-MoCo, and ANSAC, respectively. The original WSI is annotated with tumor (yellow line) and stroma (blue line) regions, and the generated heatmaps are visualized under four threshold settings \(t\), where \(t=0, 0.4, 0.6,\) and \(0.8\). For each threshold \(t\), we filter out attention values below \(t\), retaining only values of at least \(t\) for heatmap generation. Due to limited space, the heatmaps generated for \(t=0.2\) are provided in Supplementary Figs. 1 and 2.

Fig. 2: Visualization and interpretability of model predictions of low TIL infiltrated WSIs.

Two example slide images with low TIL infiltration are presented in the left and right columns. Here we compare (a) the manual annotations of tumor (yellow line) and stroma (blue line) regions with the model predictions from (b) CLAM, (c) CLAM-MoCo, and (d) ANSAC for four different threshold settings \(t\), where \(t=\) (i) 0, (ii) 0.4, (iii) 0.6, and (iv) 0.8. In the heatmaps, red regions have the highest attention, green regions have medium attention, and blue regions have the lowest attention values (a color bar is provided at the bottom).

Fig. 3: Visualization and interpretability of model predictions of high TIL infiltrated WSIs.
figure 3

Two example slide images with high TIL infiltration are presented in the left and right columns. Here we compare (a) the manual annotations of tumor (yellow line) and stroma (blue line) regions with the model predictions from (b) CLAM, (c) CLAM-MoCo, and (d) ANSAC for four different threshold settings \(t\), where \(t=\) (i) 0, (ii) 0.4, (iii) 0.6, and (iv) 0.8. In the heatmaps, red regions have the highest attention, green regions have medium attention, and blue regions have the lowest attention values (a color bar is provided at the bottom).

The heatmaps generated at \(t=0\) (i.e., without any filtering) show that CLAM and CLAM-MoCo produce an inconsistent mixture of attention values without clear patterns or region separation, making it challenging to draw meaningful conclusions from them. In contrast, the heatmaps generated by ANSAC show a clear separation between regions with high and low attention values. Notably, upon closer examination, the regions ANSAC marks with high attention values (dark red) align closely with the manual ROI annotations. This observation is further validated as we vary \(t\) from \(0.4\) to \(0.8\): with increasing \(t\), the heatmaps gradually filter out low attention values, enhancing the visualization of regions with higher attention scores. Here, we observe that both CLAM and CLAM-MoCo predominantly assign very low attention values to most tissue regions. Moreover, the heatmaps generated at \(t=0.8\) show that the high-attention regions predicted by these models are distributed throughout the image without clear localization. In comparison, ANSAC consistently assigns higher attention values to regions corresponding to the manual ROI annotations, as evident in the heatmaps generated at \(t=0.8\). Overall, the visual similarity between the manual annotations and the heatmaps generated by ANSAC provides further evidence of ANSAC’s effectiveness in learning classification tasks in an annotation-efficient manner.

Discussion

In this paper, our primary aim is to advance computational methods for the highly challenging task of TIL-based WSI classification. We introduce ANSAC, a WSI classification pipeline that performs binary classification of high versus low TIL infiltration for a given tissue sample. Although the model is supervised using only a slide-level class label, it automatically learns to attend to the corresponding TIL-relevant regions. In clinical settings, this model can serve as a pre-selection tool to categorize patients by high or low TIL infiltration levels, which is particularly useful for clinical trials where TIL levels are of interest.

We compare ANSAC with NIC-CNN, CLAM, and CLAM-MoCo using both quantitative (AUC) and qualitative (heatmap visualization) measures. We observed that, compared to CLAM, CLAM-MoCo gains an advantage through its histologically relevant, MoCo-based feature extractor, and that, compared to NIC-CNN, CLAM, and CLAM-MoCo, ANSAC gains a notable advantage through its segmentation-based attention model. While CLAM-MoCo outperformed CLAM across all datasets, its generated heatmaps left room for improvement in accurately identifying the regions relevant to the classification task. In contrast, ANSAC’s segmentation-based attention model produced more informative heatmaps. In the absence of a pre-trained segmentation model, CLAM-MoCo can achieve good classification performance; when a pre-trained segmentation model is available, however, ANSAC achieves better classification performance together with superior heatmap visualizations.

There are several avenues of interest for future work. Firstly, ANSAC currently employs a straightforward attention mechanism comprising fully connected layers to calculate attention scores; exploring more advanced variations, such as the multi-head attention popularized by the vision transformer, may help enhance performance. Secondly, while ANSAC uses a two-step approach to compress and process WSIs, future research could streamline this into a single flow by leveraging advanced architectures such as hierarchical vision transformers35. Additionally, if a sufficiently representative dataset becomes available, ANSAC could be extended to predict multi-class or continuous TIL scores. Future research could also assess ANSAC’s performance in scoring TILs across individual breast cancer subtypes or other tumor types, provided datasets for those cancer types become accessible. Expanding the ANSAC pipeline to incorporate TIL scoring for survival prediction presents another avenue for broader clinical application, including assessing the method’s ability to enhance the prognostic value of digital TIL analysis. The segmentation model in ANSAC could also be refined to handle a mixture of tissue types, such as FFPE and FF, if more FF tissues become available. Moreover, the MoCo feature extractor used in ANSAC could be further optimized by training it on larger datasets, provided sufficient computational resources are available. Lastly, while incorporating spatial information into TIL analysis offers potential gains in prognostic value, it also introduces complexities due to sample biases; exploring avenues to standardize digital TIL scores for clinical usage could therefore be a valuable focus of future work. Overall, our promising results establish a foundation for future research, particularly in annotation-efficient processing of WSIs, and serve as a starting point for developing advanced WSI processing tools for tasks where extensive annotations are not readily available.

Methods

ANSAC

WSI compression using a pre-trained feature extractor

To efficiently process each large-scale WSI with its corresponding slide-level label while minimizing computational resources, we first compressed each gigapixel WSI into a smaller representation using the methodology proposed by Tellez et al.18,19. In this method, a WSI \(w\) with dimensions \(M\times N\times 3\) (where \(M\), \(N\), and 3 represent the image’s height, width, and RGB channel depth) is divided into non-overlapping patches \({p}_{i,j}\) with dimensions \(R\times R\times 3\), where \(i=1,\ldots ,\frac{M}{R}\), \(j=1,\ldots ,\frac{N}{R}\), and \(R\) is the height and width of the patch, as shown in Fig. 1a-i18. Next, each patch \({p}_{i,j}\) at coordinates \((i,j)\) within the WSI is encoded into a \(C\)-dimensional vector \({p}_{{enc}(i,j)}\) using a pre-trained feature extractor \({E}_{\theta }\) (Eq. (1)), as shown in Fig. 1a-ii, a-iii. These encoded vectors are then arranged in a 3-dimensional grid while preserving their original coordinates \((i,j)\), generating the feature-embedding-based compressed representation of the WSI, \({w}_{{enc}}\in {{\mathbb{R}}}^{\frac{M}{R}\times \frac{N}{R}\times C}\) (Eq. (2)), as shown in Fig. 1a-iv. For this study, we set \(M=N=20480\), \(R=256\), and \(C=128\).

$${p}_{i,j}\in {{\mathbb{R}}}^{R\times R\times 3}\xrightarrow{{E}_{\theta }}{p}_{{enc}(i,j)}\in {{\mathbb{R}}}^{1\times C}$$
(1)
$$w\in {{\mathbb{R}}}^{M\times N\times 3}\xrightarrow{{E}_{\theta }}{w}_{{enc}}\in {{\mathbb{R}}}^{\frac{M}{R}\times \frac{N}{R}\times C}$$
(2)
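
As a minimal sketch under the stated dimensions, the compression in Eqs. (1) and (2) can be written as follows; `E_theta` is assumed to be any pre-trained extractor returning a \(1\times C\) vector per patch, and the pixel normalization is an assumption rather than a detail from the paper.

```python
# Hedged sketch of Eqs. (1)-(2): encode every non-overlapping R x R patch
# with a pre-trained extractor E_theta and keep it at its (i, j) grid slot.
import numpy as np
import torch

M = N = 20480; R = 256; C = 128

def compress_wsi(wsi: np.ndarray, E_theta: torch.nn.Module) -> np.ndarray:
    w_enc = np.zeros((M // R, N // R, C), dtype=np.float32)
    E_theta.eval()
    with torch.no_grad():
        for i in range(M // R):
            for j in range(N // R):
                patch = wsi[i * R:(i + 1) * R, j * R:(j + 1) * R, :]  # R x R x 3
                x = torch.from_numpy(patch).permute(2, 0, 1)[None].float() / 255.0
                w_enc[i, j] = E_theta(x).squeeze().numpy()            # 1 x C -> C
    return w_enc
```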

To train \({E}_{\theta }\), we utilized self-supervised learning (SSL), which has proven advantageous in label- and data-scarce domains such as medical research18,25,26,36,37. Specifically, we employed MoCo25, an SSL-based approach, to train our encoder \({E}_{\theta }\). The MoCo algorithm uses contrastive learning and momentum-based weight updates to learn feature embeddings, giving rise to its name “Momentum Contrast”, shortened to “MoCo”. Contrastive learning is an SSL method that uses positive and negative pairs of examples to learn representations that maximize the similarity between positive pairs and minimize the similarity between negative pairs. Recent studies have demonstrated that MoCo is a popular pretext task for representation learning in medical image analysis and has outperformed other SSL-based approaches such as SimCLR in medical image classification tasks38. Therefore, we chose MoCo as the SSL approach to train \({E}_{\theta }\).

The MoCo-based encoder \({E}_{\theta }\) was trained using a subset of patches (\(n\approx 210{,}000\)) extracted from WSIs in the \({D}_{{cleo}}^{{train}}\) dataset and subsequently used as the feature extractor for all datasets. We utilized the ResNet50 architecture as the MoCo backbone, with the feature dimension, momentum, Softmax temperature, and queue size set to \(128\), \(0.999\), \(0.07\), and \(65536\), respectively25. The feature extractor was trained for \(100\) epochs, and the checkpoint at the epoch with the highest top-1 training accuracy was used to extract a 128-dimensional feature vector for each patch in the WSI.
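
A hedged sketch of what \({E}_{\theta }\) looks like after pre-training is shown below: a ResNet50 whose final layer projects to 128 dimensions, consistent with the feature dimension stated above. The MoCo-specific training machinery (momentum encoder, 65536-slot queue, temperature-0.07 contrastive loss) is omitted here, and `checkpoint_path` is a hypothetical argument for the selected checkpoint.

```python
# Sketch of the feature extractor E_theta: ResNet50 with a 128-d head, as
# used for the MoCo query encoder. Loading logic is illustrative only.
from typing import Optional
import torch
import torch.nn as nn
import torchvision.models as models

def build_E_theta(checkpoint_path: Optional[str] = None) -> nn.Module:
    encoder = models.resnet50(pretrained=False)
    encoder.fc = nn.Linear(encoder.fc.in_features, 128)  # 128-d embeddings
    if checkpoint_path is not None:
        state = torch.load(checkpoint_path, map_location="cpu")
        encoder.load_state_dict(state, strict=False)
    encoder.eval()
    return encoder
```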

WSI compression guided by pre-trained segmentation network

To automatically segment tissue regions, ANSAC uses an intermediate layer \({S}_{\theta }\) extracted from a convolutional segmentation network trained by Amgad et al.13. The segmentation network is built on a VGG-16-based Fully Convolutional Network (FCN-8) architecture and was trained using a dataset of \(151\) FFPE WSIs from the TCGA-BRCA project along with annotations of tissue regions. It was trained to identify four main tissue regions, namely “tumor”, “stroma”, “inflammatory infiltrates”, and “necrosis”, and to classify the remaining regions as either “other” or “background”13 (refer to Supplementary Fig. 3). To reduce computational complexity and resources, we use the output of the intermediate \({S}_{\theta }\) layer of the segmentation network as the encoding \({p}_{{seg}(i,j)}\in {{\mathbb{R}}}^{D\times 24\times 24}\) for a given input patch \({p}_{i,j}\) (Eq. (3)). We also conducted a visual inspection to ensure that this intermediate output contains the same information as the output of the final layer, albeit in a downsampled form. To simplify the dimensions further, we flatten \({p}_{{seg}(i,j)}\) to \(D\times Z\) dimensions, where \(Z=576\). We obtain such \({p}_{{seg}(i,j)}\) encodings for each patch within the WSI and arrange them following the Tellez et al.18 method, as in the feature-embedding-based compression performed above. This generates a segmentation-based compressed representation of the WSI, \({w}_{{seg}}\) (Eq. (4)) (see Fig. 1a-viii).

$${p}_{i,j}\in {{\mathbb{R}}}^{R\times R\times 3}\xrightarrow{{S}_{\theta }}{p}_{{seg}(i,j)}\in {{\mathbb{R}}}^{D\times Z}$$
(3)
$$w\in {{\mathbb{R}}}^{M\times N\times 3}\xrightarrow{{S}_{\theta }}{w}_{{seg}}\in {{\mathbb{R}}}^{\frac{M}{R}\times \frac{N}{R}\times D\times Z}$$
(4)
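
Analogously to the feature compression, a minimal sketch of Eqs. (3) and (4) is given below, assuming the intermediate output of \({S}_{\theta }\) for each patch is a \(D\times 24\times 24\) map that is flattened to \(D\times Z\) and placed at the patch’s original grid coordinates.

```python
# Minimal sketch of Eqs. (3)-(4): flatten each patch's D x 24 x 24
# intermediate segmentation output to D x Z (Z = 576) and place it at the
# patch's original (i, j) grid coordinates.
import numpy as np

G, D, Z = 80, 6, 576   # grid size M/R = N/R = 80

def compress_segmentation(patch_maps):
    """patch_maps: dict mapping (i, j) -> np.ndarray of shape (D, 24, 24)."""
    w_seg = np.zeros((G, G, D, Z), dtype=np.float32)
    for (i, j), seg in patch_maps.items():
        w_seg[i, j] = seg.reshape(D, Z)   # flatten 24 x 24 -> 576
    return w_seg
```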

ANSAC’s attention-guided learning

While \({w}_{{enc}}\) can be processed directly by a conventional DNN18, we drew inspiration from previous work on attention-based approaches for WSI processing17,39,40 and developed an attention module, \({A}_{\theta }\). This module is a shallow, wide CNN that outputs attention maps highlighting the high-importance regions of each WSI. The attention maps are computed from the region information in the segmentation map \({w}_{{seg}}\). The attention model consists of \(D\) parallel attention branches corresponding to the \(D\) regions identified by the segmentation model, as shown in Fig. 1b-iii. The individual branches of \({A}_{\theta }\) perform region-specific computations before their outputs are combined to produce a final attention score for every patch in the WSI.

To execute the attention computation described above, the segmentation map from Eq. (4) is split into \(D=6\) region-specific maps, each of shape \({{\mathbb{R}}}^{\frac{M}{R}\times \frac{N}{R}\times Z\times 1}\) (refer to Fig. 1b-iii, column 1). These maps are input into the corresponding region-specific branches, where each branch consists of several stacked fully connected layers, \({F}_{1}\in {{\mathbb{R}}}^{512\times 256}\), \({F}_{2}\in {{\mathbb{R}}}^{512\times 256}\), and \({F}_{3}\in {{\mathbb{R}}}^{256\times 1}\) (refer to Fig. 1b-iii, column 2). The attention for the \({k}^{{th}}\) patch belonging to the \({d}^{{th}}\) region, denoted \({h}_{d,k}\), is computed using the region-specific attention branch, as shown in Eq. (5). The outputs of all region-specific attention branches are summed to obtain the final attention \({a}_{k}\) for the \({k}^{{th}}\) patch, as shown in Eq. (6). The attention map for the entire WSI, obtained by computing the \({a}_{k}\) value for each patch, is referred to as \({w}_{{attn}}\) (refer to Fig. 1b-iii, column 3). \({w}_{{attn}}\) is normalized between 0 and 1 such that a patch containing a more important region receives a high value (closer to 1) and a patch containing a less important region receives a low value (closer to 0). It is then reshaped to match the height and width of the compressed WSI, as shown in Fig. 1b-iv. Finally, the slide-level, attention-weighted representation of the WSI, denoted \({w}_{{weighted}}\in {{\mathbb{R}}}^{\frac{M}{R}\times \frac{N}{R}\times C}\), is obtained through element-wise multiplication of \({w}_{{enc}}\) and \({w}_{{attn}}\) using Eq. (7) (Fig. 1b-v).

$${h}_{d,k}=\frac{{e}^{{F}_{3}({relu}({F}_{1}{z}_{k})\, \odot \, {sigm}({F}_{2}{z}_{k}))}}{{\sum }_{j=1}^{K}{e}^{{F}_{3}({relu}({F}_{1}{z}_{j})\, \odot \, {sigm}({F}_{2}{z}_{j}))}}$$
(5)
$${a}_{k}={\sum }_{d=1}^{D}{h}_{d,k}$$
(6)
$${w}_{{weighted}}={w}_{{enc}}\odot {w}_{{attn}}$$
(7)
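
A minimal PyTorch sketch of one region-specific branch of \({A}_{\theta }\) and the combination in Eqs. (5)–(7) follows; the input dimension of 576 (\(=Z\)) and the hidden width of 256 are assumptions reconciling the layer sizes quoted above with the flattened segmentation encoding, and the sketch is illustrative rather than the released code.

```python
# Hedged sketch of the gated attention in Eqs. (5)-(7); layer widths are
# assumptions (in_dim = Z = 576, hidden = 256).
import torch
import torch.nn as nn

class AttentionBranch(nn.Module):
    def __init__(self, in_dim: int = 576, hidden: int = 256):
        super().__init__()
        self.F1 = nn.Linear(in_dim, hidden)   # "value" path, relu-activated
        self.F2 = nn.Linear(in_dim, hidden)   # "gate" path, sigmoid-activated
        self.F3 = nn.Linear(hidden, 1)        # scalar score per patch

    def forward(self, z: torch.Tensor) -> torch.Tensor:  # z: (K, in_dim)
        s = self.F3(torch.relu(self.F1(z)) * torch.sigmoid(self.F2(z)))
        return torch.softmax(s, dim=0)        # h_{d,k} over K patches, Eq. (5)

def attention_map(branches: nn.ModuleList, z_per_region) -> torch.Tensor:
    a = sum(b(z) for b, z in zip(branches, z_per_region))   # Eq. (6)
    a = (a - a.min()) / (a.max() - a.min() + 1e-8)          # normalize to [0, 1]
    return a   # reshaped to (M/R, N/R) before the weighting in Eq. (7)

branches = nn.ModuleList(AttentionBranch() for _ in range(6))   # D = 6
```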

The weighted, compressed representation \({w}_{{weighted}}\) is then input into a CNN \(({C}_{\theta })\), as shown in Fig. 1b-vi. The CNN consists of four blocks, each composed of a 2D convolution layer, a Group Normalization layer, and a Rectified Linear Unit (ReLU) activation layer. The features from the convolution blocks are then pooled, flattened, and fed into a fully connected layer, which generates the slide-level class prediction, as depicted in Fig. 1b-vii. As mentioned above, the \({C}_{\theta }\) architecture was also used to process the NIC-compressed image in the NIC-CNN baseline. During training, the binary cross-entropy loss is calculated on the class prediction of high vs. low TIL infiltration and used to update the weights of both the attention and classification networks.
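
A hedged sketch of \({C}_{\theta }\) under this description is given below; the kernel size, channel widths, and number of normalization groups are assumptions, since the text specifies only the block composition.

```python
# Sketch of the classification CNN C_theta: four (Conv2d -> GroupNorm -> ReLU)
# blocks, global average pooling, and a fully connected layer producing one
# binary logit. Kernel size, widths, and group count are assumptions.
import torch.nn as nn

def conv_block(cin: int, cout: int, groups: int = 8) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, padding=1),
        nn.GroupNorm(groups, cout),
        nn.ReLU(inplace=True),
    )

C_theta = nn.Sequential(
    conv_block(128, 128), conv_block(128, 128),
    conv_block(128, 128), conv_block(128, 128),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(128, 1),   # slide-level binary logit (high vs. low TIL)
)
```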

Model implementation and training

The model parameters are updated via a Stochastic Gradient Descent (SGD) optimizer with a learning rate of \(1\times {10}^{-3}\) and a momentum of 0.9. The learning rate decays by a pre-defined factor when training reaches 120 and 180 epochs. To improve generalization and regularization, data augmentation was employed, consisting of a permutation of common augmentations such as \({90}^{\circ }\), \({180}^{\circ }\), and \({270}^{\circ }\) rotations and horizontal and vertical flips. A batch size of 16 was used. A balanced batch sampler was also utilized to address class imbalance, ensuring that each mini-batch had an equal distribution of the classes. The model was trained with ten pre-defined random seeds to ensure reproducibility and robustness. Training was terminated once the validation loss stopped improving for 25 epochs (early-stop count), and the model with the lowest validation loss was selected as the final checkpoint for the test set. During inference, ANSAC processes and predicts the class label for a WSI in an average of 17 milliseconds. A PyTorch implementation of the ANSAC pipeline, including the segmentation model, will be made publicly available.
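
The optimization schedule above can be sketched as follows; the decay factor (gamma) and the maximum epoch count are assumptions, since the text states only the milestones and the early-stop count, and `train_one_epoch`/`evaluate` are caller-supplied callbacks rather than part of the released pipeline.

```python
# Hedged sketch of the training schedule: SGD (lr = 1e-3, momentum = 0.9),
# step decay at epochs 120 and 180 (gamma assumed), early stopping after
# 25 epochs without validation improvement.
import torch

def fit(model, train_one_epoch, evaluate, max_epochs=300, early_stop=25):
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[120, 180], gamma=0.1)   # decay factor assumed
    best_val, best_state, stale = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, optimizer)
        scheduler.step()
        val_loss = evaluate(model)
        if val_loss < best_val:
            best_val, best_state, stale = val_loss, model.state_dict(), 0
        else:
            stale += 1
            if stale >= early_stop:    # 25-epoch early stop
                break
    return best_state   # checkpoint with the lowest validation loss
```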

All evaluated models were trained on four V100 GPUs (16 GB VRAM each) hosted on Intel(R) Xeon(R) machines. The pipeline was implemented in Python (version 3.7.4) with the PyTorch (version 1.9.0) deep-learning library. It also uses other Python libraries, including openslide-python (version 1.1.2), OpenCV (version 4.2.0.34), Pillow (version 7.2.0), matplotlib (version 3.2.1), NumPy (version 1.17.3), scikit-learn (version 0.23.2), seaborn (version 0.10.1), tensorboard (version 1.15.0), TensorFlow (version 1.15.2), torchsummary (version 1.5.1), torchvision (version 0.10.0), and umap-learn (version 0.4.6).

Comparative analysis using CLAM

CLAM uses concepts from MIL and clustering in its pipeline. It also employs an attention-based approach to assign weights to patches in a WSI before feeding them into the final classification layer. ANSAC is inspired by these ideas in CLAM, such as encoding patches extracted from the WSI as feature embeddings and using a multi-branched attention network for learning attention weights. However, ANSAC differs from CLAM in that it arranges the extracted feature embeddings at their original coordinates and thereby uses the patch location information during classifier training. Additionally, ANSAC’s attention branches correspond to the \(D\) regions obtained from the segmentation map, whereas CLAM adds \(N\) parallel attention branches to account for the \(N\) classes present in the classification task.

Training details

CLAM uses 10-fold cross-validation for model training and evaluation. However, to ensure a fair comparison with ANSAC, we made minor adjustments to the source code to run ten random weight initializations (seeds) with the same train, validation, and test divisions across all seeds. All other parameters and training details were kept as in the CLAM source code. CLAM was trained on a single V100 GPU (16 GB VRAM).

Extension on CLAM

We replace CLAM’s feature extractor, which was pre-trained on the ImageNet dataset23, with the MoCo-based feature extractor \({E}_{\theta }\) pre-trained on histology images. As mentioned above, \({E}_{\theta }\) converts each \(256\times 256\) image into a \(128\)-dimensional feature vector. The original CLAM model uses 1024, 512, and 256 neurons in its network layers to accommodate the feature embeddings extracted from the ImageNet pre-trained model. For our extension, these were changed to 128, 100, and 64, respectively, to accommodate the feature embeddings extracted from \({E}_{\theta }\).

Training details

Similar to CLAM, CLAM-MoCo was also run with ten random weight initializations (seeds) but with the same train, validation, and test divisions across all seeds to maintain fair comparisons. All other parameters and training details were kept as provided by the authors in the CLAM source code. CLAM-MoCo was trained on a single V100 GPU (16 GB VRAM).

WSI datasets

Supplementary Tables 1 and 2 include a detailed description of the composition of the individual and mixed datasets. Each WSI dataset contains breast cancer (BC) tissue sections annotated by a board-certified breast pathologist with the corresponding TIL score \((0-100 \% )\), following the recommendations of the international working group5. Accordingly, 747 HER2+ BC tissues from the CLEOPATRA clinical study28 \(({D}_{{cleo}})\), 651 ER+HER2−, HER2+, and TNBC (Triple Negative Breast Cancer) tissues from the FinHER clinical study29 (\({D}_{{fin}}\)), 520 TCGA-BRCA tissues from TCGA (\({D}_{{tcga}}\)), and 82 HER2+ and TNBC tissues from the TIGER challenge (\({D}_{{tiger}}\)), along with their TIL scores, were available for model training and evaluation. While \({D}_{{cleo}}\) and \({D}_{{fin}}\) are in-house datasets, the \({D}_{{tcga}}\) and \({D}_{{tiger}}\) WSI datasets are publicly available. For \({D}_{{tiger}}\), we only use the subset of the TIGER challenge dataset for which TIL scores are provided (\(n=82\)) (refer to https://tiger.grand-challenge.org/Data/ for more details).

While TIL scores are assigned by annotators as percentages between 0 and 100%, we convert the task to classification by using a threshold to define two classes, high vs. low TIL infiltration. For a given threshold \(t\), all scores below the threshold are labeled as low TIL infiltration, and all scores at or above the threshold are labeled as high TIL infiltration. This yields a binary classification problem instead of a regression problem, which would be even more challenging to solve using only slide-level labels as ground truth. To ensure that the binary classes are defined in a clinically meaningful manner, it is vital to select an appropriate threshold \(t\). In this work, we consider a clinically important factor, the tissue disease stage, to determine the threshold for each dataset. Accordingly, for the dataset containing metastatic tissues (\({D}_{{cleo}}\)), we set \(t\) at \(20 \%\), and for the remaining datasets, which all contain early-stage tissues, we set \(t\) at \(30 \%\). Additionally, while the datasets used in our experiments contained a mixture of BC subtypes, including ER+HER2−, HER2+, and TNBC, the analytical validity of the models was evaluated independently of the BC subtype across all experiments.
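
As a one-line illustration of this labeling rule, with the two dataset-specific thresholds stated above:

```python
# Binary TIL labeling rule: scores below the dataset threshold t become
# "low" (0); scores at or above t become "high" (1).
def til_label(score_percent: float, t: float) -> int:
    return int(score_percent >= t)

assert til_label(25, t=20) == 1   # metastatic threshold (D_cleo)
assert til_label(25, t=30) == 0   # early-stage threshold (other datasets)
```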

WSI pre-processing

Extracting tissue regions on WSIs

Some WSIs from the \({D}_{{tcga}}\) and \({D}_{{tiger}}\) datasets contained two or three tissue sections on a single slide. After verifying with our pathologists that a single tissue section is sufficient for model training, we divided such WSIs into sub-sections, each containing a single tissue image, and used one such section as the final WSI. While doing so manually is one option, we used a Python script to automate the process and reduce manual intervention as much as possible. The script split the majority of WSIs correctly; in a few instances, manual intervention was needed to refine the final output.

Tissue segmentation and patching

Given that digitized WSIs contain both foreground and background regions, we filtered out non-tissue regions. Each WSI was first resized to a common pixel size and divided into patches of \(256\times 256\) pixels. The resulting \(6400\) patches per WSI were arranged in an \(80\times 80\) (height \(\times\) width) 2D grid while preserving the original coordinates. A pre-processing script by the IBM CODAIT Center for Open-Source Data & AI Technologies (https://github.com/CODAIT/deep-histopath) was used to identify and retain the foreground patches in each image; patches detected as background were replaced with a black patch. After filtering, we saved the stack of patches in a NumPy file along with their coordinates.
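
A hedged sketch of this patching step is given below; the simple brightness-based tissue test and the 10% tissue-fraction cut-off stand in for the CODAIT deep-histopath filter and are assumptions.

```python
# Sketch of tissue patching: arrange 256 x 256 patches on an 80 x 80 grid and
# black out background patches. The brightness test and 10% threshold are
# stand-ins for the CODAIT deep-histopath filter.
import numpy as np

R, G = 256, 80

def patch_and_filter(wsi: np.ndarray, min_tissue_frac: float = 0.1) -> np.ndarray:
    patches = np.zeros((G, G, R, R, 3), dtype=np.uint8)
    for i in range(G):
        for j in range(G):
            p = wsi[i * R:(i + 1) * R, j * R:(j + 1) * R, :]
            tissue_frac = (p.mean(axis=-1) < 220).mean()   # non-white pixels
            if tissue_frac >= min_tissue_frac:
                patches[i, j] = p   # keep foreground patch
            # else: leave as zeros, i.e., a black background patch
    return patches   # saved to a NumPy file; coordinates implicit in (i, j)
```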

Heatmap generation

We adapt the approach of Lu et al.17 to visualize the output of the attention model (the attention map) as a heatmap. As in that work, the attention map is converted to percentiles, normalized, and then mapped to the original spatial coordinates to obtain the corresponding heatmap.
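
A minimal sketch of this post-processing, assuming SciPy’s `rankdata` for the percentile conversion:

```python
# Sketch of heatmap post-processing adapted from Lu et al.: convert attention
# values to percentile ranks in [0, 1], then zero out values below the
# visualization threshold t.
import numpy as np
from scipy.stats import rankdata

def to_heatmap(w_attn: np.ndarray, t: float = 0.0) -> np.ndarray:
    pct = rankdata(w_attn.ravel()) / w_attn.size   # percentile per patch
    heat = pct.reshape(w_attn.shape)
    heat[heat < t] = 0.0                           # retain values >= t
    return heat
```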

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.