A histomorphological atlas of resected mesothelioma discovered by self-supervised learning from 3446 whole-slide images

Seyedshahi, Farzaneh; Rakovic, Kai; Poulain, Nicolas; Claudio Quiros, Adalberto; Powley, Ian R.; Richards, Cathy; Uraiby, Hussein; Klebe, Sonja; Moore, David A.; Nakas, Apostolos; Wilson, Claire R.; Sereno, Marco; Officer-Jones, Leah; Ficken, Catherine; Teodosio, Ana; Ballantyne, Fiona; Murphy, Daniel; Yuan, Ke; Le Quesne, John

doi:10.1038/s41467-025-63846-9

Article
Open access
Published: 07 October 2025

Nature Communications volume 16, Article number: 8891 (2025)

Abstract

Mesothelioma is a highly lethal and poorly biologically understood disease which presents diagnostic challenges due to its morphological complexity. This study uses self-supervised AI (Artificial Intelligence) to map the histomorphological landscape of the disease. The resulting atlas consists of recurrent patterns identified from 3446 Hematoxylin and Eosin (H&E) stained images scanned from resected tumour slides. These patterns generate highly interpretable predictions, achieving state-of-the-art performance with 0.65 concordance index (c-index) for outcomes and 88% AUC in subtyping. Their clinical relevance is endorsed by comprehensive human pathological assessment. Furthermore, we characterise the molecular underpinnings of these diverse, meaningful, predictive patterns. Our approach both improves diagnosis and deepens our understanding of mesothelioma biology, highlighting the power of this self-learning method in clinical applications and scientific discovery.

Introduction

Mesothelioma is a highly lethal cancer almost always caused by asbestos exposure^1,2. Early detection, critical for effective treatment, remains challenging^3,4. Mesothelioma’s biological diversity complicates histopathological diagnosis, as early malignancy can be difficult to distinguish from reactive changes. Diagnosing mesothelioma from H&E (Hematoxylin and Eosin) images is a subjective and time-intensive process even for skilled subspecialty histopathologists. Definitive diagnosis often remains elusive^5,6, even with immunohistochemistry or FISH (fluorescence in situ hybridisation). These diagnostic challenges are at least in part due to the difficulties in devising robust, manually applicable systems of morphological characterisation, in addition to the well-known issues of inter-pathologist agreement. The emergence of AI methods provides an opportunity to comprehensively describe the morphological complexity of mesothelioma to generate a quantitative visual dictionary of the disease.

Recent AI methods in mesothelioma primarily focus on tile-based or cell-based approaches, using supervised or weakly-supervised learning. MesoNet⁷ used whole-slide image (WSI) tiles to predict patient survival through a risk score based on malignant morphologies but lacked insight into the diversity of the tumour microenvironment. Cell-based methods like MesoGraph⁸ and SpindleMesoNET⁹ quantified malignancy through tumour cell shapes, especially spindle cells, but required extensive annotations and were computationally intensive for slide-level applications. While these approaches provide valuable insights, their findings are restricted by the nature and quality of their human annotations. Recent self-supervised models developed for H&E WSI histopathological analysis include Hierarchical Image Pyramid Transformer (HIPT)¹⁰, CTransPath¹¹, HistoSSLscaling¹², UNI¹³, and Histomorphological Phenotype Learning (HPL)¹⁴, as well as other self-supervised models such as RNAPath¹⁵, which focus on healthy tissue analysis. Uniquely among these approaches, HPL focuses on identifying recurrent histomorphological patterns through clusters known as histomorphological phenotype clusters (HPCs). HPL leverages the Barlow Twins self-supervised framework¹⁶, using ResNet for feature extraction from 224 × 224-pixel WSI tiles, followed by clustering of tile feature vectors via the Leiden algorithm¹⁷. Each HPC represents a unique morphological pattern that can be associated with specific molecular landscapes or used to predict patient outcomes and mesothelioma subtypes by quantifying HPC frequencies (further details in the 'Online Methods' section). Previously, HPL has been applied to lung cancer, revealing significant underlying patterns and giving impressive prognostic performance¹⁴, but it has not been implemented in mesothelioma.
Self-supervised methods such as HPL depend upon accessing large volumes of training data, preferably from resected tumour material which offers large tissue areas with full morphological variance. This is especially challenging in mesothelioma, which is so often diagnosed from tiny biopsies and subsequently treated medically rather than surgically.

In this work, we curated 3446 whole slide images of 485 resected mesothelioma cases to generate a uniquely powerful training resource called Leicester Archival Thoracic Tumour Investigation Cohort-Mesothelioma (LATTICe-M), described in further detail in Fig. 1a. We then applied the HPL pipeline to our dataset to build a comprehensive atlas of mesothelioma H&E morphology (Fig. 1b).

**Fig. 1: LATTICe-M dataset overview and HPL pipeline for mesothelioma analysis.**

Results

Mapping the histomorphological phenotype landscape

We have identified 47 recurrent histomorphological phenotype clusters (HPCs) (Fig. 2a) based on morphological features encoded by self-learned neural networks. These HPCs are identified from 3,239,939 tiles extracted from 3446 images at 5x equivalent resolution. Forty-one of the 47 HPCs are present in more than 20% of cases, and none of them are case-specific. A threshold of >1% abundance was applied to call an HPC “present” in a case. HPCs were then binned by patient prevalence and coloured grey if rare (<20%) or frequent (>80%) and blue if intermediate (20−80%). Two complementary bar charts summarise these distributions: one showing the percentage of cases per HPC and another counting HPCs within 10%-wide patient-prevalence bins. Rare HPCs (<20% prevalence) represent normal tissues (open lung/muscle, which are minor tissue components in the tumour-rich blocks selected for scanning), reactive changes which are either unusual or not targeted for scanning (dense lymphocytes from tertiary lymphoid structures, pleural plaque), or a couple of the less common tumour phenotypes (cold, solid pattern epithelioid disease and plump disorganised spindle cells). The near-universal HPCs (>80%) represent features which are either very widespread in a surgical resection (e.g. talc pleurodesis, vessels, collagen) or broad, ubiquitous malignant morphologies (e.g. infiltrated fat, sparse epithelioid disease). Interestingly, these more common HPCs often display lower ‘purity’, reflecting a broader morphological composition.
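
The presence call and prevalence binning described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy compositions and the treatment of values exactly on a bin edge are assumptions; only the >1% presence threshold, the <20% / >80% groups and the 10%-wide bins come from the text.

```python
def hpc_prevalence(case_compositions, presence_threshold=0.01):
    """Fraction of cases in which each HPC is 'present' (abundance > threshold).

    case_compositions: one list per case, giving the fraction of that case's
    tiles assigned to each HPC (hypothetical input layout).
    """
    n_cases = len(case_compositions)
    n_hpcs = len(case_compositions[0])
    prevalence = []
    for k in range(n_hpcs):
        present = sum(1 for comp in case_compositions if comp[k] > presence_threshold)
        prevalence.append(present / n_cases)
    return prevalence


def prevalence_group(p):
    """Label an HPC by patient prevalence, using the <20% and >80% cut-offs."""
    if p < 0.20:
        return "rare"
    if p > 0.80:
        return "near-universal"
    return "intermediate"


def prevalence_bin_counts(prevalence, n_bins=10):
    """Count HPCs falling into equal-width prevalence bins (10%-wide by default)."""
    counts = [0] * n_bins
    for p in prevalence:
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        counts[index] += 1
    return counts


# Toy data: 6 cases x 3 HPCs (made-up numbers for illustration only).
example_cases = [
    [0.50, 0.495, 0.005],
    [0.70, 0.30, 0.00],
    [0.98, 0.00, 0.02],
    [0.60, 0.40, 0.00],
    [0.55, 0.45, 0.00],
    [0.995, 0.005, 0.00],
]
example_prevalence = hpc_prevalence(example_cases)
example_groups = [prevalence_group(p) for p in example_prevalence]
example_bins = prevalence_bin_counts(example_prevalence)
```

In this toy example the first HPC is present in every case, the second in four of six cases and the third in only one, so they fall into the near-universal, intermediate and rare groups respectively.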

A team of subspecialty expert pathologists from 3 centres, who had no access to the WSIs or labels (blinded assessment), examined every HPC to achieve consensus morphological annotations for each one, derived from their defining features: epithelioid vs spindled morphology, inflammation, necrosis, cellularity, desmoplasia, atypia, and cluster purity. Each HPC was also given a summary title. The pathologists evaluated inflammation levels in each HPC, categorising them as None-Sparse, Mild-Moderate, or Marked. Most HPCs were None-Sparse, but some displayed notable patterns. Assessments of inflammation and necrosis exhibited the highest levels of consensus, with at least 50% of the HPCs receiving unanimous agreement in these categories. For epithelioid growth patterns, we also observed a relatively high level of full agreement. However, for spindle architecture in our non-epithelioid clusters (orderly/less orderly/disorderly), agreement among the pathologists was lower, perhaps reflecting the subjectivity of this measure. Across 47 HPCs with 3 raters, Fleiss’ Kappa scores for individual histopathological components (reported in the last row of Fig. 2c, with variable category definitions per component) ranged from 0.2 to 0.6, indicating fair to moderate agreement based on the interpretation scale proposed by ref. ¹⁸. This degree of agreement is in line with kappa scores for several diagnostic tasks in mesothelioma¹⁹. These annotations reveal areas of the UMAP containing multiple HPCs with broad similarities, such as spindled/collagenous HPCs, epithelioid tumour growth patterns, and lymphocytic infiltration, as well as peripheral and projecting clouds of morphologically highly distinct lung tissue and chest wall muscle tiles (Fig. 2d). This grouping is supported by the general co-occurrence of HPCs within each slide in the Supplementary Figs. These HPCs enable the detailed automated spatial annotation of any mesothelioma whole slide image, as illustrated in Fig. 2e, highlighting two cases with highly divergent outcomes and morphologies. The first case, a sarcomatoid malignancy which resulted in death at 66 days, is highly morphologically diverse and contains abundant tiles in spindle cell-associated clusters, while the second is predominantly made up of a single epithelioid cluster.
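
For readers unfamiliar with the agreement statistic quoted above, Fleiss' kappa for a fixed number of raters can be computed as in the sketch below. The rating table is invented for illustration (5 hypothetical HPCs, 3 raters, three inflammation categories); it is not the study's data.

```python
def fleiss_kappa(rating_counts):
    """Fleiss' kappa for a table of rating counts.

    rating_counts[i][j] is the number of raters who placed subject i in
    category j; every row must sum to the same number of raters.
    """
    n_subjects = len(rating_counts)
    n_raters = sum(rating_counts[0])
    n_categories = len(rating_counts[0])

    # Overall proportion of all ratings that fall in each category.
    total_ratings = n_subjects * n_raters
    category_share = [
        sum(row[j] for row in rating_counts) / total_ratings
        for j in range(n_categories)
    ]

    # Observed agreement per subject, then averaged over subjects.
    per_subject = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in rating_counts
    ]
    observed = sum(per_subject) / n_subjects

    # Agreement expected by chance.
    expected = sum(p * p for p in category_share)
    return (observed - expected) / (1 - expected)


# Columns: None-Sparse, Mild-Moderate, Marked (made-up counts, 3 raters each).
toy_inflammation_ratings = [
    [3, 0, 0],
    [2, 1, 0],
    [0, 3, 0],
    [1, 1, 1],
    [0, 0, 3],
]
toy_kappa = fleiss_kappa(toy_inflammation_ratings)  # 73/148, roughly 0.49
```

A value near 0.49 sits inside the 0.2 to 0.6 range reported for the individual components.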

**Fig. 2: HPC analysis and pathologist validation of mesothelioma tissue patterns.**

HPCs predict mesothelioma subtypes

The crucial histopathological distinction in mesothelioma subtyping lies between epithelioid and non-epithelioid (i.e. sarcomatoid/biphasic) variants. To classify this, we generated a numerical vector representing the percentage or frequency of each HPC for every WSI. This vector was transformed using the centred log-ratio (clr) transformation to enhance stability and interpretability, then fed into a logistic regression model for classification as either non-epithelioid or epithelioid. Tumours labelled as biphasic and sarcomatoid were combined into a single group, creating a binary classification task. Eight HPCs are significantly associated with the epithelioid subtype (HPCs 14, 39, 24, 25, 27, 40, 8, and 18), containing epithelioid malignancy, predominantly characterised by tubular patterns and solid sheets of epithelioid tumour cell growth. Of the 9 HPCs linked to the non-epithelioid subtype, 3 HPCs (15, 16, and 22, mostly containing disorderly spindle cells) are unanimously classified as non-epithelioid malignancy by our pathologists as well. The other 6 HPCs (6, 7, 35, 37, 45, and 28, mostly solid epithelioid growth pattern) contain more diverse appearances, including epithelioid HPCs, pleural plaque and muscle. This might be due to the inclusion of biphasic cases and lethal epithelioid patterns in this group and also suggests a possible link between sarcomatoid growth and invasion into the chest wall (Fig. 3a). Our logistic regression classifier (Likelihood Ratio test statistic(40) = 1219.3, p = 1.195e-229) achieved an 88% 5-fold cross-validated AUC (Area Under the Curve) score on the LATTICe-M dataset and 80% on The Cancer Genome Atlas (TCGA) mesothelioma dataset, and was robust across varied clustering configurations (Fig. 3b). We visualised HPC compositions at the case level using a PCA plot, colour-coding cases by subtype at diagnosis. The transition from epithelioid cases to sarcomatoid cases through biphasic cases is clearly visible (Fig. 3c).
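
As an illustration of the centred log-ratio step, the sketch below transforms one HPC composition vector. The pseudocount used to avoid taking the logarithm of zero is an assumption (the text does not say how empty HPCs are handled), and the function name is hypothetical.

```python
import math


def clr(composition, pseudocount=1e-6):
    """Centred log-ratio transform of a composition (e.g. HPC tile fractions).

    The vector is first rescaled to sum to 1, a small pseudocount (an
    assumption, not taken from the paper) is added so that absent HPCs do not
    produce log(0), and the mean log value is subtracted from each log value.
    """
    total = sum(composition)
    shifted = [x / total + pseudocount for x in composition]
    logs = [math.log(p) for p in shifted]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Toy slide in which three HPCs account for 70%, 20% and 10% of tiles.
toy_clr = clr([0.7, 0.2, 0.1])
```

The transformed values sum to zero and do not depend on whether the input is given as fractions, percentages or raw tile counts, which is one reason the transform suits compositional features before they are passed to a linear classifier.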

**Fig. 3: HPC performance in mesothelioma subtype classification.**

HPCs as predictors of patient survival

We aggregated HPC frequencies across all samples per patient, summarising each case into a readily interpretable composition of morphologies. The 5-fold cross-validated c-index values for patient prognosis outcomes were 0.67 and 0.65 for the training and test LATTICe-M primary datasets, respectively, and 0.65 for the fully unseen TCGA cohort as an external dataset. The addition of clinical information, including mesothelioma subtype, TNM stage, and age, only modestly improved the ability of the algorithm to predict outcomes, yielding an increment in c-index of 0.01. We further verified that our approach is robust across other clustering configurations. Compared to similar research on mesothelioma outcome prediction using WSIs, such as MesoNet⁷, our model achieves at least a comparable c-index score for the same additional dataset (TCGA). MesoNet reported a score of 0.656 for TCGA; we matched this performance, with c-index scores ranging from 0.64 to 0.7 across different folds, while prioritising model interpretability through our morphology-based HPCs and using a fully self-supervised pipeline.
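
The c-index values above measure how often a model ranks pairs of patients in the correct order of survival. A minimal sketch of Harrell's concordance index follows; the simplified handling of tied survival times (such pairs are skipped) and the toy data are assumptions made for illustration.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's c-index for right-censored survival data.

    times: follow-up time per patient.
    events: 1 if death was observed, 0 if the patient was censored.
    risk_scores: higher score means higher predicted risk (shorter survival).
    Pairs with identical times are ignored here for simplicity.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier event of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (made-up): four patients, the third one censored at 15 months.
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
perfect = concordance_index(toy_times, toy_events, [4, 3, 2, 1])
reversed_ranking = concordance_index(toy_times, toy_events, [1, 2, 3, 4])
uninformative = concordance_index(toy_times, toy_events, [1, 1, 1, 1])
```

A perfect ranking gives 1.0, a fully reversed ranking gives 0.0 and a constant score gives 0.5, which puts the reported 0.65 in context.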

We identified HPCs 10 ("epithelioid nests in bland stroma"; Log Hazard Ratio = −0.089, p = 0.001, Confidence Interval = [−0.145, −0.034]) and 27 ("dense lymphocytes"; Log Hazard Ratio = −0.062, p = 0.008, Confidence Interval = [−0.109, −0.016]) as positive survival factors, while HPCs 15 ("disorderly spindle cells"; Log Hazard Ratio = 0.052, p = 0.026, Confidence Interval = [0.006, 0.098]) and 22 ("transitional mesothelioma"; Log Hazard Ratio = 0.042, p = 0.016, Confidence Interval = [0.008, 0.077]) emerge as strong predictors of poor outcome (Fig. 4a). A comparison of tile-level UMAP plots colour-coded by hazard ratio reveals a high degree of similarity, further underscoring the strong links between sarcomatoid transformation and poor patient outcomes. The map highlights red areas (higher hazard ratios) such as HPCs 15 and 16 (sarcomatoid HPCs) and blue areas (lower hazard ratios) such as HPCs 45, 27 and 5 (either non-tumourous tissue or lymphocyte HPCs) (Fig. 4b).

**Fig. 4: HPC performance in mesothelioma survival prediction and risk stratification.**

Next, we categorised patients into high- and low-risk groups based on their calculated hazard ratios for 60 months. Kaplan-Meier plots were generated for the LATTICe-M train (Log-rank test statistic(1) = 62.41, p = 2.79e-15) and test (Log-rank test statistic(1) = 15.14, p = 9.96e-05) datasets, as well as the additional TCGA-Meso cohort (Log-rank test statistic(1) = 10.24, p = 0.00138), as shown in Fig. 4c. The model achieves impressive separation, predicting outcomes with surprising power in this very poor-prognosis population who face all the complex hazards of radical surgery and impaired respiratory physiology alongside the biology of their tumour burden. Figure 4d shows a comparison between classical histological grading of epithelioid pleural mesothelioma (as suggested in ref. ²⁰) and our model, both applied only to the epithelioid cases in the TCGA mesothelioma cohort. In this dataset, our pipeline separates patient outcomes significantly (Log-rank test statistic(1) = 5.02, p = 0.025), whereas human grading does not (Log-rank test statistic(1) = 1.16, p = 0.282). Figure 4e shows the SHAP (SHapley Additive exPlanations)²¹ decision plots for our Cox model, comparing a high-risk sarcomatoid case (red) and a relatively low-risk epithelioid case (blue). The plot shows how the model assigns high or low-risk labels for these patients based on the abundance/scarcity of influential HPCs, such as the highly lethal HPC 15 ("disorderly spindle cells"), or the protective HPC 27 ("dense lymphocytes"), which both contribute to the calculated risk in these two cases.
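
The log-rank statistics quoted above compare observed and expected deaths between two risk groups and are referred to a chi-squared distribution with one degree of freedom. The sketch below is a plain two-group log-rank test written for illustration; the function names and toy data are assumptions, and how the authors chose the high/low-risk cut-off is not reproduced here.

```python
import math


def chi2_sf_1df(statistic):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (statistic, p_value)."""
    samples = [(t, e, 0) for t, e in zip(times_a, events_a)]
    samples += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in samples if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        at_risk_a = sum(1 for s, _, g in samples if s >= t and g == 0)
        at_risk_b = sum(1 for s, _, g in samples if s >= t and g == 1)
        deaths_a = sum(1 for s, e, g in samples if s == t and e and g == 0)
        deaths_b = sum(1 for s, e, g in samples if s == t and e and g == 1)
        n = at_risk_a + at_risk_b
        d = deaths_a + deaths_b
        observed_a += deaths_a
        expected_a += d * at_risk_a / n
        if n > 1:  # hypergeometric variance of deaths in group A at time t
            variance += d * (at_risk_a / n) * (at_risk_b / n) * (n - d) / (n - 1)

    if variance == 0:
        raise ValueError("variance is zero; groups cannot be compared")
    statistic = (observed_a - expected_a) ** 2 / variance
    return statistic, chi2_sf_1df(statistic)


# Toy groups (made-up months to death, all events observed).
high_risk_stat, high_risk_p = logrank_test([1, 2, 3], [1, 1, 1], [10, 11, 12], [1, 1, 1])
```

As a consistency check, `chi2_sf_1df(5.02)` is approximately 0.025 and `chi2_sf_1df(1.16)` is approximately 0.28, in line with the statistic and p-value pairs reported for the TCGA epithelioid comparison.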

We further demonstrate the ability of our model to predict patient outcomes within disease subtype groups (epithelioid vs non-epithelioid). HPC frequencies were calculated, and survival was predicted separately for each group, identifying HPCs which underscored subtype-specific traits (Fig. 5a). For epithelioid cases, HPC 10 ("epithelioid nests in bland stroma") and HPC 22 ("transitional mesothelioma") emerged as significant predictors of good and bad outcomes, respectively. HPC 10 is likely to represent relatively indolent well-differentiated classically epithelioid disease. Interestingly, HPC 22, which is very enriched for the appearances of transitional mesothelioma, is not uncommon in cases diagnostically subtyped as epithelioid and is strongly predictive of poor outcomes in this group. This supports the view that transitional appearances signal early stages of transition to sarcomatoid growth²², and its presence in this group highlights the difficulty in human identification of this pattern²³. For biphasic/sarcomatoid cases, HPC 23 ("bland spindle cells and collagen") predicts poor outcome, perhaps identifying areas of cytologically bland desmoplastic differentiation, while the good prognostic association of HPC 27 ("dense lymphocytes") points towards a particular role for the immune system in sarcomatoid disease. We also show example tiles for specific HPC groups of interest based on pathologist annotations, including inflamed clusters, classical desmoplastic appearances, and necrosis (Fig. 5b).

**Fig. 5: Subtype-specific HPC analysis and immunohistochemical marker correlations.**

To further assess the biological significance of the identified HPCs, we investigated their associations with quantitative Immunohistochemistry (IHC) markers reflecting tumour cell proliferation and aberrant mRNA translation activity. HPCs with significant associations to previously obtained quantitative IHC markers²⁴ are shown in Fig. 5c. Notably, the HPCs with upregulation of mRNA translation, proliferation, and oxidative phosphorylation are nearly all associated with poor patient outcome and are all either sarcomatoid or poorly differentiated epithelioid in morphology, further underlining the linkage of these processes to tumour virulence. eIF4A1, the ubiquitous pro-proliferation translation initiation factor, is particularly closely related to poor outcome HPCs, supporting possible therapeutic targeting of this molecule. Negative associations with markers of oxidative phosphorylation and pro-translation mTOR signalling are only seen in areas of low-grade disease, or crush/diathermy artefact likely to degrade IHC signal. Figure 5d represents chromogenically IHC-stained tissue cores for each marker. The top row shows examples with high expression of the corresponding marker, while the bottom row shows cores with low expression. For each case, both the IHC-stained image and the corresponding H&E scan are displayed side by side. Additionally, a representative tile from each core is shown to highlight the cellular-level resolution of the tissue.

Molecular underpinnings of HPCs

We next investigated the biological underpinnings of HPC morphology to further explain our model’s predictive capabilities in mesothelioma prognosis and subtyping. This was achieved by quantifying associations between gene expression signatures and HPC composition in the TCGA Mesothelioma RNASeq dataset²⁵.

We used the MCPcounter algorithm to estimate cell types, including fibroblasts, endothelial cells, T cells, B lineage cells, myeloid dendritic cells, NK cells, and CD8 T cells from RNASeq data (Fig. 6a). Expression of the proliferation marker Ki67 is also mapped, revealing especially high proliferation in spindle cell-enriched HPCs (HPCs 22, 16, 15, and 6). Critically, these HPCs are the most predictive of non-epithelioid subtype (Fig. 3a). In contrast, HPCs defined by well-differentiated epithelioid disease and normal tissue (e.g., HPCs 17, 3, 10, and 19) show low proliferation.

**Fig. 6: Correlation between WSI-level HPC compositions and transcriptomic signatures in TCGA.**

Fibroblast signatures are strongly pronounced in multiple clusters which either significantly determine sarcomatoid disease or contain fibroblastic/collagenous/stroma-rich morphology indicative of fibroblast-like mesenchymal dedifferentiation of mesothelioma cells. Fibroblast signatures are minimal in HPCs representing papillary or micropapillary epithelioid mesothelioma, solid pattern disease, lung tissue, and large-vessel-rich HPCs consistent with more specialised epithelioid/tissue-specific phenotypes.

Lymphocyte-rich HPC 27 shows strong correlations with T cells, B lineage cells, and myeloid dendritic cells. Similarly, inflamed HPCs (HPCs 29, 1, 24, 27), identified by Hover-Net, indicate active immune environments linked to better prognosis and likely improved immunotherapy response. HPCs 1 (inflamed fat) and 27 (dense lymphocytes) are high in both B- and T-cell signatures and show strong inter-correlation, suggesting dense inflammation and tertiary lymphoid structure formation in chest wall fatty tissues, supporting previous observations that tertiary lymphoid structures are related to good outcome.²⁶

KEGG pathway correlations across HPCs again show clear separation between non-epithelioid and epithelioid subtypes (Fig. 6b), with generally heightened mitogenic signalling pathway activity in a group of sarcomatoid and fibroblastic clusters associated with aggressive biology. In contrast, HPCs linked to epithelioid growth and normal tissues exhibit relative down-regulation.

An analysis of cancer hallmark pathways further identifies the most sarcomatoid HPCs as a group with strong positive links to multiple proliferation-associated pathways (Fig. 6c), in addition to mitogenic signalling and multiple EMT-related pathways. Notably, the same group exhibits downregulation of oxidative phosphorylation components, indicating a metabolic shift towards hypoxia.

Validation on tiny tissue fragments

To assess the generalisability of our self-supervised model trained on the LATTICe-M dataset, we benchmarked its performance on the St. George’s Hospital TMA dataset from the MesoGraph study⁸. This external evaluation is significant for two reasons. First, no additional training was applied to the new dataset, so the results represent the pre-trained model’s performance on a fully unseen, entirely external cohort. Second, although the model was trained on WSIs, it maintained strong performance on tissue microarray cores. These fragments are not only tiny (≈1 millimetre) but are selected to represent pure tumour tissue. In contrast, our model was trained on large diagnostic WSIs including background tissues, and used unsupervised clustering to filter artefacts.

As the image size in TMA cores is insufficient to support the previous frequency-based method, we employed multiple instance learning (MIL) to predict mesothelioma subtypes by summarising information across tiles from a core or a biopsy. We called this method HPL-MIL and benchmarked it against state-of-the-art methods, namely max-MIL and naive-MIL (patch-based MIL methods), PINS²⁷, CLAM²⁸ and MesoGraph, on the 235 cores from the St. George’s Hospital TMA cohort⁸ (Table 1). Each TMA core was treated as a bag of instances, where the instances are individual tile embeddings extracted from the core. Using an attention-based multiple instance learning approach, we obtained a core-level representation by computing a weighted average of tile embeddings. We then performed subtype classification of each core using the core-level labels available for the TMA dataset. HPL-MIL achieved higher AUC, average precision, sensitivity and specificity scores than all the other methods, without pre-training on the cohort tissues.
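
The attention-weighted average at the heart of this step can be sketched as below. This is a deliberately simplified stand-in: the attention score here is a plain dot product with a single vector, whereas a trained attention-MIL model would normally learn a small scoring network; the function name, the scoring form and the toy embeddings are all assumptions.

```python
import math


def attention_pool(tile_embeddings, attention_vector):
    """Pool a bag of tile embeddings into one core-level vector.

    Each tile gets a score (dot product with attention_vector), the scores are
    turned into weights with a softmax, and the core representation is the
    weighted average of the tile embeddings.
    Returns (pooled_embedding, weights).
    """
    scores = [
        sum(w * x for w, x in zip(attention_vector, embedding))
        for embedding in tile_embeddings
    ]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    normaliser = sum(exps)
    weights = [e / normaliser for e in exps]

    dim = len(tile_embeddings[0])
    pooled = [
        sum(weight * embedding[d] for weight, embedding in zip(weights, tile_embeddings))
        for d in range(dim)
    ]
    return pooled, weights


# Toy core with three 2-dimensional tile embeddings (made-up values).
toy_tiles = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform_pooled, uniform_weights = attention_pool(toy_tiles, [0.0, 0.0])
focused_pooled, focused_weights = attention_pool(toy_tiles, [10.0, 0.0])
```

With a zero attention vector every tile receives the same weight and the result is the plain mean; with a strongly directional vector the pooled embedding is dominated by the tiles that score highly, which is what lets the classifier focus on informative regions of a small core.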

Table 1 Performance inference metrics for HPL with Multiple Instance Learning (MIL) and other MIL-based approaches over TMA Core dataset

Discussion

In this study, we applied our self-supervised HPL pipeline to the LATTICe-M cohort, which we believe to be the largest image collection in terms of area of mesothelioma tissue yet employed for AI training. We achieved state-of-the-art accuracy in two key clinically important tasks: a C-index of 0.65 in survival prediction across a 5-fold cross-validation and 88% AUC in subtype classification (epithelioid vs sarcomatoid/biphasic). Furthermore, our method outperformed human grading in prognostication for epithelioid cases, and we identified survival-linked histomorphological patterns within each subtype, emphasising the interpretability of self-supervised methods and identifying recurrent morphologies worthy of future study. Quantitative visual maps of HPCs (Fig. 2e) and SHAP decision plots (Fig. 4e) offer clinical utility for understanding AI diagnoses and future selection of therapies.

Our approach eliminates the need for retraining to retain performance on external mesothelioma datasets, thus addressing a key computational challenge in self-supervised models, as proven by its efficacy across three independent cohorts. It effectively extracts relevant morphological patterns from small TMA cores (e.g. the St. George’s dataset) and WSIs of varied origins and quality (e.g. TCGA and LATTICe-M), enabling real-time clinical decision-making without extensive preprocessing. CLAM²⁸ was benchmarked against HPL on both the TCGA and LATTICe-M datasets (full results in Supplementary Data). HPL consistently outperformed CLAM in both subtype classification and survival prediction, while maintaining high interpretability and biological relevance. This suggests the possibility of a robust diagnostic tool for both resection material and mesothelioma biopsies, which remain a major diagnostic challenge.

Our model has essentially created a morphological atlas of mesothelioma, discovering ab initio the characteristic recurrent H&E morphologies which comprise the disease. The fact that these morphologies have clear biological and clinicopathological significance proves their meaning and value. For example, the discovered linkage of the tumour microenvironment to patient survival shows how crucial the morphology of immune system engagement is to tumour virulence and biology, and suggests biomarker potential in predicting responses to immunotherapy. Furthermore, RNASeq data annotation of histomorphological clusters further illustrates connections between tumour microenvironment signatures, molecular pathways, and survival, offering valuable molecular insights into the biology of the disease.

The molecular associations of HPCs help us to understand tumour virulence and suggest numerous hypotheses for mechanistic testing. For example, we see numerous mRNA cancer hallmark pathways linked to high-risk sarcomatoid HPCs, helping to explain links between morphology and outcome in molecular terms and highlighting possible areas of target discovery. Sarcomatoid clusters are directly linked to signatures of proliferation, hypoxia, and EMT in bulk sequence data, without any requirement for spatial methods or microdissection. This is in keeping with biological knowledge that sarcomatoid mesothelioma cells can proliferate rapidly under hypoxic conditions²⁹ and supportive of the idea that sarcomatoid dedifferentiation represents co-option of a physiological EMT pathway.

Additionally, subtle transitional morphologies in cases classified as being epithelioid overall appear to have significant prognostic value in our survival analysis. This highlights the continuous nature of epithelioid to sarcomatoid transition, and suggests the importance of accurate identification of transitional states, which is a challenging task by eye, and which is likely to benefit from our approach. Furthermore, HPL could also be used to target therapy by identifying such cases with subtle sarcomatoid changes, which are likely to be more responsive to immunotherapies³⁰.

This study also has several limitations that warrant acknowledgment. First, staging data were missing in 32.23% of Leicester cases (165 patients), probably reducing the power of T/N/M-related analysis (Fig. 1a). Second, smoking history was incomplete in 43.55% of cases (223 patients), limiting cohort-wide assessment of its impact. These data gaps highlight the need for consistent clinical documentation in retrospective studies and constrain the use of these variables in survival and subtype prediction models alongside AI-derived features (HPC frequencies).

Methods

Datasets

The primary dataset used in this study is the Leicester Archival Thoracic Tumour Investigation Cohort-Mesothelioma (LATTICe-M)³¹, comprising 512 patients diagnosed with pleural mesothelioma who underwent surgical resection. Study clinical data were collected and managed using REDCap electronic data capture tools^32,33 hosted on secure research servers at University Hospitals of Leicester NHS Trust. Cases are histologically subtyped into epithelioid (n = 372), sarcomatoid (n = 107), and biphasic (n = 33). The cohort includes 436 male and 76 female patients (85.2% and 14.8%, respectively), consistent with mesothelioma incidence at the collection site. Sex was self-reported at the time of intake. Patient age ranged from 36 to 85 years (64.3 ± 8.6). No information on race, ethnicity, or other socially relevant variables was collected. Participants were not financially compensated. Sex and gender were reported in the study; however, no sex-based analysis was performed with the aim of training a self-supervised model. Disaggregated sex counts are available in the source data files.

Ethical approval was obtained from the UK National Health Service Research Ethics Committee (ref. no. 14/EM/1159). No prospective recruitment, interventions, or international data transfers were involved. There were no risks to participants or researchers, as only archived histopathology material was used under standard governance. Pathology annotation support was provided by S.K. (Adelaide, Australia), whose contributions were formally recognised through authorship. Data ownership is held by the Greater Glasgow and Clyde Biorepository, under governance via an amendment granted by the Leicester South REC. All research procedures were conducted in compliance with relevant ethical regulations, and written informed consent was obtained from all participants.

Figure 1a presents additional clinical details. The WSIs were sectioned and stained with Hematoxylin and Eosin at Leicester University Hospital, and scanned at 10X, 20X, or 40X magnifications. After tiling and background removal, WSIs with fewer than 100 tiles were excluded, leaving 485 patients and 3446 WSIs for the pipeline and downstream analysis. To identify significant clinical factors in this cohort, we employed a Cox proportional hazards model and found that age, mesothelioma subtype, and TNM stage significantly contributed to survival prediction (Fig. 1a).

To validate our results, we used the publicly available Cancer Genome Atlas (TCGA)-mesothelioma cohort²⁵, an entirely differently-scanned dataset, still comprising WSIs but obtained from multiple centres. It includes 86 samples from 74 patients with both WSIs and RNAseq data available. This cohort was primarily used to discover links between HPCs and tumour microenvironment features, pathways, and hallmarks. All HPL pipeline steps were performed on the primary dataset (LATTICe-M), and evaluation scores were reported on the fully unseen additional TCGA dataset, without any further training.

Finally, we utilised the St. George’s Hospital dataset, consisting of H&E-stained TMAs from tumour biopsies collected at St. George’s Hospital, London. This dataset includes four TMA slides scanned at 20x magnification using a Hamamatsu Nanozoomer S360 scanner, comprising 235 cores labelled as epithelioid, biphasic, or sarcomatoid; these labels are the only available clinical information. The dataset was introduced in the MesoGraph study⁸, where it was used for training and testing. We employed it to demonstrate the robustness and generalisability of our trained WSI model by benchmarking its performance on TMA cores against the different methods reported in that study (Section 3).

Histomorphological phenotype learning (HPL)

HPL is a tool developed to detect and categorise histomorphological patterns within large collections of whole-slide images. HPL employs an automated, self-supervised deep learning approach, eliminating the need for expert pathologists to prelabel or manually define histomorphological patterns. Once these patterns are identified, new whole-slide images can be introduced to the trained model and classified according to the pre-established patterns. This feature allows pathologists to quantify specific patterns in new patient samples precisely. The clustering of each whole-slide image into meaningful histomorphological patterns follows several sequential steps, which are described below (Fig. 1b).

Whole-slide images pre-processing: In this first step, whole-slide images are segmented into non-overlapping 224 × 224-pixel tiles at 5X magnification, which corresponds to a pixel size of approximately 1.8 micrometres. Tiles that do not contain at least 60% tissue coverage are filtered out to maintain relevance. Consistent pixel size and magnification are ensured during the tile processing phase to guarantee uniformity in the resulting tiles. The tiling code used for this process is accessible on DeepPATH GitHub³⁴ for further details.
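The tile-selection rule described above can be sketched as follows. This is a minimal illustration, not the DeepPATH implementation: it assumes a precomputed boolean tissue mask at the working magnification, and the function name `tile_coordinates` and the toy mask are hypothetical.

```python
import numpy as np

TILE = 224          # tile edge in pixels, as stated in the text
MIN_TISSUE = 0.60   # minimum tissue fraction per tile, as stated in the text

def tile_coordinates(tissue_mask, tile=TILE, min_tissue=MIN_TISSUE):
    """Return (row, col) origins of non-overlapping tiles with enough tissue.

    tissue_mask: 2-D boolean array, True where a pixel is tissue.
    How the mask is produced is outside the scope of this sketch.
    """
    rows, cols = tissue_mask.shape
    kept = []
    for r in range(0, rows - tile + 1, tile):
        for c in range(0, cols - tile + 1, tile):
            fraction = tissue_mask[r:r + tile, c:c + tile].mean()
            if fraction >= min_tissue:
                kept.append((r, c))
    return kept

# Toy mask covering 2 x 2 tiles: left half is tissue, right half is background
mask = np.zeros((2 * TILE, 2 * TILE), dtype=bool)
mask[:, :TILE] = True
coords = tile_coordinates(mask)
```

In this toy case only the two left-hand tiles pass the 60% threshold; a slide-level filter could then compare `len(coords)` against a minimum tile count.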
Feature extraction: HPL employs a self-supervised learning technique known as Barlow Twins¹⁶, which matches or even exceeds the performance of other self-supervised methods. Barlow Twins delivers state-of-the-art results in standard pathology tasks compared to the DINO³⁵, MoCo³⁶, and SwAV³⁷ methods³⁸. We previously compared Barlow Twins with DINO within the HPL framework, and it showed improved performance with a cohort size similar to that of this study¹⁴. One key feature of HPL is its ability to maintain consistent image representations, even with slight colour or zoom level variations. This capability is intended to make the results robust to differences in image scanning or processing across datasets. The aim is to capture diverse visual patterns in tissue samples and represent them as feature vectors, capturing distinct characteristics like texture. Each 224 × 224-pixel tile is converted into a vector representation, denoted as {z ∈ R^D; D = 128}. During training, the model is optimised to produce consistent outputs for twin inputs, ensuring robustness in vector representation.
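To make the "consistent outputs for twin inputs" objective concrete, the sketch below computes the Barlow Twins redundancy-reduction loss for two batches of embeddings. It is an illustration of the published objective, not the training code of this study; the trade-off weight `lam` and the random toy embeddings are assumptions.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lam=5e-3, eps=1e-12):
    """Barlow Twins loss for two (N x D) batches of embeddings.

    z_a and z_b are embeddings of two augmented views of the same tiles.
    lam is an assumed trade-off weight, not a value reported in this study.
    """
    n = z_a.shape[0]
    # Standardise every embedding dimension across the batch
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    c = z_a.T @ z_b / n                      # D x D cross-correlation matrix
    diag = np.diag(c)
    on_diag = ((1.0 - diag) ** 2).sum()      # pull diagonal towards 1
    off_diag = (c ** 2).sum() - (diag ** 2).sum()  # push off-diagonal towards 0
    return on_diag + lam * off_diag

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 128))              # D = 128 as in the text
loss_same = barlow_twins_loss(z, z)
loss_diff = barlow_twins_loss(z, rng.normal(size=(256, 128)))
```

Identical views give a much lower loss than unrelated ones, which is the behaviour the encoder is trained towards.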
Clustering: After generating the vector representations, we employed the Leiden community detection algorithm¹⁷ (from the Python ScanPy library³⁹) to cluster tiles or vector representations with similar histomorphological features. Since neighbouring vector representations in high-dimensional space exhibit similarity, this method effectively groups the tiles based on shared morphological patterns captured by their feature vector representations.

We began with a subsample of 750,000 tiles and constructed a nearest-neighbour graph between the tiles. From this initial set of detected clusters, we assigned the remaining vector representations to these clusters (or graph nodes) based on their distance. The number of clusters identified depends on the chosen Leiden algorithm resolution. For our analysis, multiple resolutions were applied to capture varying levels of granularity. The resulting histomorphological phenotype clusters (HPCs) enabled the quantification of patients or whole-slide images (WSIs) based on these clusters, streamlining further analysis and simplifying the understanding of complex tissue patterns.
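The assignment of the remaining tiles to clusters found on the subsample can be sketched as a nearest-neighbour vote. The text states only that assignment is based on distance, so the majority vote over `k` neighbours, the value of `k`, and the function name are assumptions; the Leiden step itself is not reproduced here.

```python
import numpy as np

def assign_to_clusters(reference_vectors, reference_labels, new_vectors, k=3):
    """Label each new vector by majority vote among its k nearest references.

    reference_vectors: (M, D) embeddings that were clustered directly.
    reference_labels:  length-M cluster ids for those embeddings.
    new_vectors:       (N, D) embeddings to be assigned afterwards.
    """
    reference_vectors = np.asarray(reference_vectors, dtype=float)
    reference_labels = np.asarray(reference_labels)
    assigned = []
    for v in np.asarray(new_vectors, dtype=float):
        dist = np.linalg.norm(reference_vectors - v, axis=1)
        nearest = np.argsort(dist)[:k]
        labels, counts = np.unique(reference_labels[nearest], return_counts=True)
        assigned.append(int(labels[np.argmax(counts)]))
    return assigned

# Toy 2-D example with two well-separated clusters
ref = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
ref_labels = [0, 0, 0, 1, 1, 1]
new_labels = assign_to_clusters(ref, ref_labels, [[0.5, 0.2], [9.5, 10.1]])
```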
Preparing compositional vectors: At this stage, using the identified HPCs, we can characterise the entire tissue or patient by quantifying the frequency of each HPC (1). To achieve this, each whole-slide image (WSI) is transformed into a compositional vector A, where the dimensionality is equal to the total number of HPCs (c). Each element within the vector represents the percentage of the tissue area attributed to a specific HPC. This approach quantifies the contribution of each HPC to the overall tissue composition, allowing for a detailed analysis of the histomorphological landscape within a patient or a sample.
$$A=\{{a}_{0},{a}_{1},{a}_{2},\cdots \,,{a}_{c-1}\}{{\rm{s}}}.{{\rm{t}}}.\,\mathop{\sum }\limits_{i=0}^{c-1}{a}_{i}=1\,{{\rm{and}}}\,{a}_{i}\in \left[0,1\right]$$
(1)
For statistical compositional analysis and to prepare for the use of linear models, we apply the Centred Log-Ratio (clr) transformation⁴⁰ to our compositional vector A to minimise correlation between HPC frequencies. This transformation maps the vector composition from the c-part simplex into a c-dimensional Euclidean vector space. Additionally, to address zero elements in the dataset, we use multiplicative replacement⁴¹.
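A minimal sketch of Eq. (1) followed by zero replacement and the clr transformation is given below. The replacement value `delta` and the helper names are assumptions chosen for illustration; they are not taken from the study code.

```python
import math
from collections import Counter

def composition(tile_labels, n_clusters):
    """Fraction of tiles assigned to each HPC (Eq. 1); entries sum to 1."""
    counts = Counter(tile_labels)
    total = len(tile_labels)
    return [counts.get(i, 0) / total for i in range(n_clusters)]

def multiplicative_replacement(a, delta=1e-5):
    """Replace zeros by delta and rescale non-zero parts so the sum stays 1."""
    n_zero = sum(1 for x in a if x == 0)
    return [delta if x == 0 else x * (1 - n_zero * delta) for x in a]

def clr(a):
    """Centred log-ratio: log of each part minus the mean log of all parts."""
    logs = [math.log(x) for x in a]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

# Toy slide with four tiles and c = 4 HPCs; HPC 2 is absent
A = composition([0, 0, 1, 3], n_clusters=4)
A_nonzero = multiplicative_replacement(A)
A_clr = clr(A_nonzero)
```

The clr vector sums to zero by construction, and the absent HPC receives the most negative value.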
Subtype classification: For the diagnostic task, we employed the clr-transformed compositional vectors derived from whole-slide images (WSI) and fed them into a logistic regression model (Scikit-learn⁴² and Statsmodels⁴³ Python libraries). This approach is weakly-supervised, utilising patient-level labels assigned by pathologists, where each patient label is applied to both the patient and their corresponding slides. We combined sarcomatoid and biphasic mesothelioma into a single class (non-epithelioid class) and compared it against the majority class, primarily consisting of epithelioid samples. Also, to address the class imbalance in the primary dataset (1:3 ratio for non-epithelioid to epithelioid subtypes), we applied an undersampling strategy using the Edited Nearest Neighbour (ENN) technique⁴⁴ (Imbalanced-learn Python library⁴⁵). This method reduced the majority class by removing redundant and noisy samples.

Ultimately, the logistic regression model used the compositional vectors of WSIs to classify mesothelioma subtypes based on the contributions of HPCs. In this approach, individual HPCs serve as distinct features for our logistic regression classifier, enabling us to rank the importance of each HPC and its role in predicting specific tumour subtypes within each sample. The predicted class probability is given by:
$$Y=\frac{{e}^{{b}_{0}+{b}_{1}*clr(A)}}{1+{e}^{{b}_{0}+{b}_{1}*clr(A)}}$$
(2)
where b₀ is the bias or intercept term and b₁ is the coefficient vector applied to the clr-transformed compositional vector, clr(A).
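Equation (2) can be evaluated directly once the coefficients are known. The sketch below uses made-up coefficients and a made-up clr vector purely for illustration; it also shows one simple way to rank HPCs by absolute coefficient, which is an assumption about how importance might be read off and not a statement of the study's procedure.

```python
import math

def predict_probability(clr_vector, coefficients, intercept):
    """Logistic probability from Eq. (2): e^x / (1 + e^x) with x = b0 + b1 . clr(A)."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, clr_vector))
    return 1.0 / (1.0 + math.exp(-linear))

def rank_features(coefficients):
    """HPC indices ordered from largest to smallest absolute coefficient."""
    return sorted(range(len(coefficients)),
                  key=lambda i: abs(coefficients[i]), reverse=True)

# Hypothetical values for three HPCs (not fitted on any real data)
example_clr = [0.4, -0.9, 0.5]
example_coef = [0.8, -1.5, 0.1]
p = predict_probability(example_clr, example_coef, intercept=-0.2)
ranking = rank_features(example_coef)
```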
Survival analysis: In the clinical outcome aspect of our study, we created a clr-transformed compositional vector for each patient, reflecting the overall HPC composition. We then used the Cox proportional hazards regression model⁴⁶ to analyse patient survival in relation to the HPC composition vector. Finally, Kaplan-Meier plots⁴⁷ were employed to visually distinguish between high-risk and low-risk patient groups within each dataset. For this step, we used the Lifelines⁴⁸ and SciPy⁴⁹ Python libraries. For both subtype classification and survival prediction tasks, we employed five-fold cross-validation to ensure robust evaluation. The reported scores represent the average performance across all folds. Furthermore, we ensured no overlap of patients between the training and test sets, maintaining strict separation to prevent data leakage and guarantee unbiased assessments. However, for providing annotations and associations with the tumour microenvironment, we focused on a single fold for consistency and detailed analysis.
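Survival predictions of this kind are scored with the concordance index reported in the abstract. As a hedged illustration of what that number measures, the following is a plain-Python sketch of Harrell's C on toy data; it is not the Lifelines implementation used in the study, and the follow-up times, event flags and risk scores are invented.

```python
def concordance_index(times, events, risks):
    """Harrell's C: share of comparable patient pairs ranked correctly by risk.

    times:  follow-up time per patient.
    events: 1 if death was observed, 0 if censored.
    risks:  model risk score (higher means shorter expected survival).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i has an observed event
            # strictly before patient j's follow-up time
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
c_perfect = concordance_index(toy_times, toy_events, [0.9, 0.6, 0.4, 0.1])
c_partial = concordance_index(toy_times, toy_events, [0.9, 0.1, 0.4, 0.6])
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of comparable pairs.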

Expert pathologist annotation

We engaged three expert pathologists to independently and blindly annotate HPCs without access to patient clinical data or additional HPC details. Each HPC was classified into one of three categories: epithelioid tumour, spindle cells/extracellular matrix, or non-tumour. The pathologists assessed each HPC’s primary and secondary architectural features, HPC purity, inflammation, necrosis, nuclear atypia and biphasic components (in malignant groups). They also evaluated patterns such as desmoplasia and cellularity in spindle cell HPCs, as well as the tumour-stroma ratio and stromal cellularity in epithelioid HPCs.

To assess agreement in our multi-centre annotation process, we used majority voting among the three expert pathologists who annotated the HPCs. Instances of unanimous agreement, where all three pathologists selected the same category, are marked with an asterisk (*). In contrast, cases of complete disagreement, where each pathologist chose a different category, are highlighted in grey in Fig. 2c. Inter-rater reliability was also assessed using Fleiss’ Kappa. For each HPC i (N = 47), we counted the number of raters (n = 3) assigning it to each category j, and computed the marginal probability of category j as:

$${p}_{j}=\frac{1}{Nn}\mathop{\sum }\limits_{i=1}^{N}{n}_{ij},\qquad 1=\mathop{\sum }\limits_{j=1}^{k}{p}_{j}$$

(3)

We then calculated the average proportion of agreeing rater-pairs across clusters (observed agreement $\bar{P}$), estimated the agreement expected by chance and scaled the excess agreement relative to the maximum possible beyond chance:

$$\kappa=\frac{\bar{P}-\bar{{P}_{e}}}{1-\bar{{P}_{e}}}$$

(4)

A Kappa of 1 indicates perfect concordance, 0 reflects agreement no better than random, and values <0 denote worse-than-chance agreement. It is important to note that the number of categories available for annotation (denoted as k) varied across the different histomorphological components we selected, such as inflammation, necrosis, etc. As a result, the Fleiss’ Kappa values are inherently influenced by this variability and are not directly comparable across components. Specifically, components with more annotation categories introduce greater choice complexity, which tends to lower agreement scores. To prevent misinterpretation, we recommend referring to the majority voting results and the asterisk indicators of full agreement as complementary measures of reliability.
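Equations (3) and (4) can be worked through on a small count table. The sketch below is a generic Fleiss' Kappa computation; the two toy tables (four items, three raters, three categories) are invented and do not correspond to the HPC annotations.

```python
def fleiss_kappa(counts):
    """Fleiss' Kappa for counts[i][j] = raters assigning item i to category j.

    Every row must sum to the same number of raters n.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Marginal probability of each category (Eq. 3)
    p = [sum(row[j] for row in counts) / (n_items * n_raters)
         for j in range(n_categories)]
    # Proportion of agreeing rater pairs per item, then its mean (observed agreement)
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance
    p_e = sum(x * x for x in p)
    return (p_bar - p_e) / (1 - p_e)       # Eq. 4

unanimous = [[3, 0, 0], [0, 3, 0], [0, 0, 3], [3, 0, 0]]
mixed = [[3, 0, 0], [2, 1, 0], [1, 1, 1], [0, 3, 0]]
kappa_unanimous = fleiss_kappa(unanimous)
kappa_mixed = fleiss_kappa(mixed)
```

For the mixed table the observed agreement is 7/12 and the chance agreement is 62/144, giving a Kappa of 22/82 (about 0.27).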

HPC cell type enrichment analysis

We used the deep learning model HoVer-Net⁵⁰ to segment cells in each tile within the HPCs and calculate the abundance of only inflammatory cells in every HPC. While the tiles used in the HPL framework were at 5x magnification, the HoVer-Net model was trained on 20x tiles. To bridge this difference, we first applied HoVer-Net to 20x tiles, then combined 16 tiles (arranged in a 4 × 4 grid) to create 5x equivalents. This allowed us to map HoVer-Net’s segmentation results, particularly the inflammation annotations, to specific tiles. Each tile was subsequently assigned to a corresponding HPC by calculating the average number of inflamed cells detected across the related tiles. Approximately 900 whole-slide images (WSIs) at 20x magnification were annotated using the HoVer-Net model for this analysis.
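The 20x-to-5x mapping can be illustrated with a block aggregation over a grid of per-tile cell counts. This is a sketch under stated assumptions: counts are summed over each 4 × 4 block, the per-HPC value is the mean over 5x tiles with that label, and all names and toy numbers are hypothetical. It does not call HoVer-Net.

```python
import numpy as np

def aggregate_to_5x(counts_20x, factor=4):
    """Sum inflammatory-cell counts over factor x factor blocks of 20x tiles."""
    rows, cols = counts_20x.shape
    rows -= rows % factor                    # drop incomplete blocks at the edges
    cols -= cols % factor
    trimmed = counts_20x[:rows, :cols]
    blocks = trimmed.reshape(rows // factor, factor, cols // factor, factor)
    return blocks.sum(axis=(1, 3))

def mean_count_per_hpc(counts_5x, hpc_labels):
    """Average count of inflammatory cells over the 5x tiles of each HPC."""
    hpc_labels = np.asarray(hpc_labels)
    return {int(h): float(counts_5x[hpc_labels == h].mean())
            for h in np.unique(hpc_labels)}

# Toy grid: 8 x 8 tiles at 20x, i.e. 2 x 2 tiles at 5x
counts_20x = np.arange(64).reshape(8, 8)
counts_5x = aggregate_to_5x(counts_20x)
per_hpc = mean_count_per_hpc(counts_5x, [[0, 1], [0, 1]])
```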

HPC tumour microenvironment signature associations

We correlated HPCs with tumour microenvironment features, hallmark pathways, and relevant biomarkers (Section 3). All WSIs from each patient were used to calculate the clr-transformed HPC compositional vectors. We then applied the single-sample Gene Set Enrichment Analysis (ssGSEA) to quantify pathway expression in both the Kyoto Encyclopedia of Genes and Genomes (KEGG)⁵¹ and Molecular Signatures Database (MSigDB)⁵² hallmark datasets. We also estimated immune cell subpopulation abundance using MCPcounter⁵³. Ki67 RNA expression levels were also utilised across the entire sample set to assess cellular proliferation in each case, providing further insight into tumour growth activity within different morphological HPCs. Correlations between the clr-transformed HPC compositions and pathway expression levels were calculated, with only correlations having a p-value below 0.01 retained, ensuring statistical significance.

To further evaluate the biological relevance of the identified HPCs, we sought associations between HPCs and quantitative IHC measures of tumour cell proliferation and dysregulation of mRNA translation. We used data previously generated from a study of the LATTICe-M TMA cohort which revealed the importance of translational dysregulation to mesothelioma development²⁴. Data were available from 8 TMAs, comprising 711 cores after quality control. To link molecular phenotype with spatial composition, we calculated the proportional representation of each HPC within each TMA core and then assessed the association between HPC proportions and marker expression. Marker positivity scores were derived from automated quantification pipelines applied to scanned IHC images. The correlation was calculated using the two-sided Pearson test and multiple comparison correction was applied using the Benjamini-Hochberg false discovery rate (FDR) method with α = 0.05.
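The Benjamini-Hochberg step applied to the correlation p-values can be sketched in a few lines. The p-values below are invented for illustration, and the function is a generic implementation of the step-up procedure, not the study's code.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking hypotheses rejected at false discovery rate alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank r (1-based) with p_(r) <= r / m * alpha
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that position
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected

toy_p = [0.001, 0.045, 0.025, 0.2, 0.012]
significant = benjamini_hochberg(toy_p, alpha=0.05)
```

Here 0.045 would pass an uncorrected 0.05 threshold but is not retained after correction, because its rank-specific threshold is 0.04.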

Tissue microarray benchmarking

We also benchmarked and compared the HPL model (trained on WSIs) against other state-of-the-art AI methods using an independent small dataset of tissue microarray (TMA) cores to demonstrate its robustness. We used the St. George’s Hospital dataset, publicly released with the MesoGraph study⁸, comprising 235 cores with associated mesothelioma subtype labels. The study benchmarked methods such as max-MIL, naive-MIL (patch-based MIL approaches), PINS²⁷, CLAM²⁸, and MesoGraph on this dataset. Employing gated attention in MIL⁵⁴, we predicted the probability of each core belonging to a specific mesothelioma subtype, naming this approach HPL-MIL. In our weakly-supervised MIL setting, each TMA core is treated as a bag B = {h₁, h₂, . . . , h_K} of K tile embeddings (instances). Each tile embedding h_k is obtained from our HPL ResNet-128 encoder trained using the Barlow Twins framework on the LATTICe-M dataset. To derive a representation for the entire core, we use attention-based pooling:

$${{\bf{z}}}=\mathop{\sum }\limits_{k=1}^{K}{a}_{k}{{{\bf{h}}}}_{k}$$

(5)

where the attention weight a_k is computed as:

$${a}_{k}=\frac{\exp \left\{{{{\bf{w}}}}^{\top }\tanh \left({{\bf{V}}}{{{\bf{h}}}}_{k}^{\top }\right)\right\}}{\mathop{\sum }_{j=1}^{K}\exp \left\{{{{\bf{w}}}}^{\top }\tanh \left({{\bf{V}}}{{{\bf{h}}}}_{j}^{\top }\right)\right\}}$$

(6)

This allows the model to learn which tiles are more informative for core-level prediction. The resulting representation z is passed to a linear classifier for subtype classification. While MIL lacks full interpretability, it allowed us to benchmark mesothelioma subtype classification against existing MIL-based methods, demonstrating the generalisability and scalability of the HPL pipeline. Despite this success, the remainder of the study prioritises the more interpretable histomorphological clusters and their morphological insights to address the complexity of mesothelioma.
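Equations (5) and (6) amount to a softmax-weighted average of tile embeddings. The sketch below follows Eq. (6) as printed, with random stand-in parameters; the hidden size `L = 16`, the random weights and the function name are assumptions. The gated variant of Ilse et al.⁵⁴ additionally multiplies the tanh branch element-wise by a sigmoid branch, which is not shown here.

```python
import numpy as np

def attention_pool(h, V, w):
    """Attention pooling over a bag of tile embeddings.

    h: (K, D) tile embeddings of one core.
    V: (L, D) projection matrix, w: (L,) scoring vector (both learnable in practice).
    Returns the core representation z (D,) and attention weights a (K,).
    """
    scores = np.tanh(h @ V.T) @ w            # one scalar score per tile
    scores = scores - scores.max()           # stabilise the softmax numerically
    a = np.exp(scores) / np.exp(scores).sum()
    z = a @ h                                # weighted sum of embeddings (Eq. 5)
    return z, a

rng = np.random.default_rng(0)
h = rng.normal(size=(6, 128))                # K = 6 tiles, D = 128 as in the text
V = rng.normal(size=(16, 128))
w = rng.normal(size=16)
z, a = attention_pool(h, V, w)
# With a zero scoring vector every tile receives the same weight
z_uniform, a_uniform = attention_pool(h, V, np.zeros(16))
```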

We additionally benchmarked the CLAM (Clustering-constrained Attention Multiple Instance Learning)²⁸ framework using 128-dimensional tile embeddings extracted from our Barlow Twins-trained ResNet. Subtype classification was performed using a linear layer on top of CLAM outputs, while survival prediction was based on risk scores generated by the network and evaluated via a Cox proportional hazards model, enabling a fair comparison with HPL-based survival predictions. CLAM was trained for 50 epochs with early stopping, using the Adam optimiser with a binary loss and a learning rate of 10⁻⁴. The total loss combined slide and instance-level objectives with coefficients c₁ = 0.9 and c₂ = 0.3, as follows:

$$Los{s}_{total}={c}_{1}\times Los{s}_{slide}+{c}_{2}\times Los{s}_{tile}$$

(7)

The number of clusters was fixed at 8, consistent with the original CLAM configuration. WSIs were treated as bags, with subtype labels assigned at the bag level, and gated attention was used to compute instance-level attention. Five-fold cross-validation, aligned with the HPL evaluation, was applied throughout.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The LATTICe cohort (histology whole slide images and clinical data) used in this study is not publicly available due to its extremely large size, which makes public hosting technically impractical, and to ethical limitations according to the LATTICe agreement. However, we are delighted to make the data available for academic research purposes upon request. Interested researchers may contact the corresponding author via the email provided. Access will be granted for a limited period based on a clear research purpose and mutual agreement, with data use restricted to non-commercial research. We aim to respond to access requests as soon as possible. TCGA mesothelioma RNAseq data have been retrieved from UCSC Xena [https://xenabrowser.net/datapages/?cohort=GDC%20TCGA%20Mesothelioma%20(MESO)] and images from Genomic Data Commons (GDC) portal [https://www.cancer.gov/ccg/research/genome-sequencing/tcga/studied-cancers/mesothelioma-study]. The St. George’s Hospital TMA dataset is available on MesoGraph GitHub [https://github.com/measty/MesoGraph]. Source data are provided with this paper.

Code availability

The source code for this study is openly available under the MIT License on GitHub [https://github.com/FarzanehSeyedshahi/Histomorphological-Phenotype-Learning] and archived on Zenodo⁵⁵. Reproducible figures can be generated using the provided Jupyter notebooks. A step-by-step README on GitHub details installation and execution.

References

Wagner, J. C., Sleggs, C. A. & Marchand, P. Diffuse pleural mesothelioma and asbestos exposure in the north western cape province. Br. J. Ind. Med. 17, 260 (1960).
Molinari, L. Mesothelioma survival rates & factors that affect patients (2023). https://www.mesothelioma.com/mesothelioma/prognosis/survival-rate/.
Ricciardi, S. et al. Surgery for malignant pleural mesothelioma: an international guidelines review. J. Thorac. Dis. 10, S285 (2018).
Scherpereel, A. et al. ERS/ESTS/EACTS/ESTRO guidelines for the management of malignant pleural mesothelioma. Eur. Respir. J. 55 (2020).
Travis, W. D. et al. The 2015 World Health Organization classification of lung tumors: impact of genetic, clinical and radiologic advances since the 2004 classification. J. Thorac. Oncol. 10, 1243–1260 (2015).
Salle, F. G. et al. New insights on diagnostic reproducibility of biphasic mesotheliomas: a multi-institutional evaluation by the international mesothelioma panel from the mesopath reference center. J. Thorac. Oncol. 13, 1189–1203 (2018).
Courtiol, P. et al. Deep learning-based classification of mesothelioma improves prediction of patient outcome. Nat. Med. 25, 1519–1525 (2019).
Eastwood, M. et al. MesoGraph: Automatic profiling of mesothelioma subtypes from histological images. Cell Rep. Med. 4, 101226 (2023).
Naso, J. R. et al. Deep-learning based classification distinguishes sarcomatoid malignant mesotheliomas from benign spindle cell mesothelial proliferations. Mod. Pathol. 34, 2028–2035 (2021).
Chen, R. J. et al. Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16144–16155 (IEEE, 2022).
Wang, X. et al. Transformer-based unsupervised contrastive learning for histopathological image classification. Med. image Anal. 81, 102559 (2022).
Filiot, A. et al. Scaling self-supervised learning for histopathology with masked image modeling. medRxiv 2023–07 (2023).
Chen, R. J. et al. Towards a general-purpose foundation model for computational pathology. Nat. Med. 30, 850-862 (2024).
Claudio Quiros, A. et al. Mapping the landscape of histomorphological cancer phenotypes using self-supervised learning on unannotated pathology slides. Nat. Commun. 15, 4596 (2024).
Cisternino, F. et al. Self-supervised learning for characterising histomorphological diversity and spatial RNA expression prediction across 23 human tissue types. Nat. Commun. 15, 5906 (2024).
Zbontar, J., Jing, L., Misra, I., LeCun, Y. & Deny, S. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, 12310–12320 (PMLR, 2021).
Traag, V. A., Waltman, L. & Van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).
Landis, J. R. & Koch, G. G. The measurement of observer agreement for categorical data. Biometrics. 33, 159–174 (1977).
Mlika, M. & Mezni, F. Interobserver agreement in histopathological subtyping of malignant pleural mesotheliomas. Turkish J. Pathol. 37, 56 (2021).
Nicholson, A. G. et al. Euracan/IASLC proposals for updating the histologic classification of pleural mesothelioma: towards a more multidisciplinary approach. J. Thorac. Oncol. 15, 29–49 (2020).
Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 30 (2017).
Salle, F. G. et al. Comprehensive molecular and pathologic evaluation of transitional mesothelioma assisted by deep learning approach: a multi-institutional study of the international mesothelioma panel from the mesopath reference center. J. Thorac. Oncol. 15, 1037–1053 (2020).
Dacic, S. et al. Interobserver variation in the assessment of the sarcomatoid and transitional components in biphasic mesotheliomas. Mod. Pathol. 33, 255–262 (2020).
Grosso, S. et al. The pathogenesis of mesothelioma is driven by a dysregulated translatome. Nat. Commun. 12, 4920 (2021).
Hmeljak, J. et al. Integrative molecular characterization of malignant pleural mesothelioma. Cancer Discov. 8, 1548–1565 (2018).
Mannarino, L. et al. Epithelioid pleural mesothelioma is characterized by tertiary lymphoid structures in long survivors: results from the match study. Int. J. Mol. Sci. 23, 5786 (2022).
Eastwood, M. et al. Malignant mesothelioma subtyping of tissue images via sampling driven multiple instance prediction. In International Conference on Artificial Intelligence in Medicine, 263–272 (Springer, 2022).
Lu, M. Y. et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat. Biomed. Eng. 5, 555–570 (2021).
Karatkevich, D. et al. Chemotherapy increases CDA expression and sensitizes malignant pleural mesothelioma cells to capecitabine treatment. Sci. Rep. 14, 18206 (2024).
Hashimoto, K. et al. Sarcomatoid malignant pleural mesothelioma treated with nivolumab: a case series. Oncol. Lett. 24, 1–7 (2022).
Moore, D. A. et al. In situ growth in early lung adenocarcinoma may represent precursor growth or invasive clone outgrowth - a clinically relevant distinction. Mod. Pathol. 32, 1095–1105 (2019).
Harris, P. A. et al. Research electronic data capture (REDCap) - a metadata-driven methodology and workflow process for providing translational research informatics support. J. Biomed. Inform. 42, 377–381 (2009).
Harris, P. A. et al. The redcap consortium: building an international community of software platform partners. J. Biomed. Inform. 95, 103208 (2019).
Coudray, N. et al. Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning. Nat. Med. 24, 1559–1567 (2018).
Caron, M. et al. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, 9650–9660 (IEEE, 2021).
He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 9729–9738 (2020).
Caron, M. et al. Unsupervised learning of visual features by contrasting cluster assignments. Adv. Neural Inf. Process. Syst. 33, 9912–9924 (2020).
Kang, M., Song, H., Park, S., Yoo, D. & Pereira, S. Benchmarking self-supervised learning on diverse pathology datasets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3344–3354 (IEEE, 2023).
Wolf, F. A., Angerer, P. & Theis, F. J. Scanpy: large-scale single-cell gene expression data analysis. Genome Biol. 19, 1–5 (2018).
Egozcue, J. J., Pawlowsky-Glahn, V., Mateu-Figueras, G. & Barcelo-Vidal, C. Isometric logratio transformations for compositional data analysis. Math. Geol. 35, 279–300 (2003).
Martín-Fernández, J. A., Barceló-Vidal, C. & Pawlowsky-Glahn, V. Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Math. Geol. 35, 253–278 (2003).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Seabold, S. & Perktold, J. Statsmodels: econometric and statistical modeling with Python. SciPy 7, 92–96 (2010).
Wilson, D. L. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics 408–421 (IEEE, 1972).
Lemaître, G., Nogueira, F. & Aridas, C. K. Imbalanced-learn: a Python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 18, 1–5 (2017).
Lin, D. Y. & Wei, L.-J. The robust inference for the Cox proportional hazards model. J. Am. Stat. Assoc. 84, 1074–1078 (1989).
Efron, B. Logistic regression, survival analysis, and the Kaplan-Meier curve. J. Am. Stat. Assoc. 83, 414–425 (1988).
Davidson-Pilon, C. lifelines: survival analysis in python. J. Open Source Softw. 4, 1317 (2019).
Virtanen, P. et al. Scipy 1.0: fundamental algorithms for scientific computing in Python. Nat. methods 17, 261–272 (2020).
Graham, S. et al. Hover-net: Simultaneous segmentation and classification of nuclei in multi-tissue histology images. Medical Image Analysis 101563 (2019).
Kanehisa, M. & Goto, S. Kegg: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).
Liberzon, A. et al. Molecular signatures database (msigdb) 3.0. Bioinformatics 27, 1739–1740 (2011).
Becht, E. et al. Estimating the population abundance of tissue-infiltrating immune and stromal cell populations using gene expression. Genome Biol. 17, 1–20 (2016).
Ilse, M., Tomczak, J. & Welling, M. Attention-based deep multiple instance learning. In International conference on machine learning, 2127–2136 (PMLR, 2018).
Claudio Quiros, A. & Seyedshahi, F. A histomorphological atlas of resected mesothelioma discovered by self-supervised learning from 3446 whole-slide images. FarzanehSeyedshahi/Histomorphological-Phenotype-Learning (2025). https://doi.org/10.5281/zenodo.16947385.

Acknowledgements

The authors would like to express their gratitude to the CRUK Scotland Institute, the NHS Greater Glasgow and Clyde Biorepository, the University of Leicester, and the University of Glasgow for their invaluable support and contributions to this study. The manuscript was critically reviewed by Catherine Winchester (CRUK Scotland Institute). CRUK Early Detection Programme, IAMMED-Meso, EDDPGM-Nov21\100001 supported the research. J.L.Q. is supported by the Mazumdar-Shaw Molecular Pathology Chair endowment at the University of Glasgow. KY acknowledges support from Cancer Research UK (EDDPGM-Nov21\100001 and DRCMDP-Nov23\100010), BBSRC BB\V016067\1, Prostate Cancer UK MA-TIA22-001 and EU Horizon 2020 grant ID: 101016851.

Author information

These authors contributed equally: Kai Rakovic, Nicolas Poulain.

Authors and Affiliations

School of Cancer Sciences, University of Glasgow, Glasgow, Scotland, UK
Farzaneh Seyedshahi, Kai Rakovic, Adalberto Claudio Quiros, Daniel Murphy, Ke Yuan & John Le Quesne
Cancer Research UK Scotland Institute, Glasgow, Scotland, UK
Farzaneh Seyedshahi, Kai Rakovic, Nicolas Poulain, Ian R. Powley, Leah Officer-Jones, Catherine Ficken, Fiona Ballantyne, Daniel Murphy, Ke Yuan & John Le Quesne
Pathology Department, Queen Elizabeth University Hospital, NHS Greater Glasgow and Clyde, Glasgow, Scotland, UK
Kai Rakovic & John Le Quesne
School of Computing Science, University of Glasgow, Glasgow, Scotland, UK
Adalberto Claudio Quiros & Ke Yuan
University Hospitals of Leicester, Leicester, UK
Cathy Richards, Hussein Uraiby & Apostolos Nakas
Flinders Health and Medical Research Institute, Adelaide, Australia
Sonja Klebe
CRUK Lung Cancer Centre of Excellence, UCL Cancer Institute, London, UK
David A. Moore
Department of Cellular Pathology, University College Hospital, London, UK
David A. Moore
Leicester Medical School, University of Leicester, Leicester, UK
Claire R. Wilson
University of Leicester, Leicester, UK
Marco Sereno
Birmingham Tissue Analytics, University of Birmingham, Birmingham, UK
Ana Teodosio

Contributions

F.S. conceived the study, conducted the research, performed data analysis, and wrote the manuscript. K.R. provided pathology expertise and insights, and contributed to TCGA data annotations. N.P. performed R coding and TCGA sequencing data associations. A.C.Q. provided computational guidance and insights for HPL analysis. C.R., S.K., and J.L.Q. performed histopathological annotations and provided mesothelioma subspecialty biological expertise. K.Y., D.M., and J.L.Q. supervised the project, provided guidance throughout, and contributed insights for running the project. I.R.P., H.U., D.A.M., A.N., C.W., M.S., L.O., C.F., A.T., and F.B. contributed to LATTICe-M dataset curation. All authors reviewed and approved the final manuscript.

Corresponding authors

Correspondence to Ke Yuan or John Le Quesne.

Ethics declarations

Competing interests

D.A.M. has received speaker fees from AstraZeneca, Eli Lilly, BMS, Takeda and Boehringer Ingelheim; consultancy fees from AstraZeneca, ThermoFisher, Takeda, Amgen, Janssen, MIM software, Bristol-Myers Squibb and Eli Lilly; and educational support from Takeda and Amgen. All other authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Haining Yang and the other anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Reporting Summary

Source data

Source Data

Rights and permissions

Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Seyedshahi, F., Rakovic, K., Poulain, N. et al. A histomorphological atlas of resected mesothelioma discovered by self-supervised learning from 3446 whole-slide images. Nat Commun 16, 8891 (2025). https://doi.org/10.1038/s41467-025-63846-9

Received: 05 January 2025
Accepted: 29 August 2025
Published: 07 October 2025
Version of record: 07 October 2025
DOI: https://doi.org/10.1038/s41467-025-63846-9