Main

Artificial intelligence has revolutionized digital pathology by enabling biomarker prediction from cancer tissues using high-resolution whole-slide images (WSIs)1,2,3,4,5,6. Moreover, these algorithms can substantially enhance diagnostic accuracy, efficiency and consistency, reducing the subjectivity associated with human interpretation7,8. In particular, deep learning can perform tasks such as disease grading, cancer subclassification or prognostic prediction9,10,11.

Recently, foundation models, which are trained on large-scale datasets, have been introduced to digital pathology12,13. These models use self-supervised learning (SSL) techniques to learn meaningful representations of histology tissue, which are crucial for clinical pathology tasks. SSL techniques such as contrastive learning14,15 and masked image modelling16 have shown improved performance, robustness and transferability compared with fully supervised learning. Another advantage of SSL lies in its ability to learn from vast amounts of unlabelled data, thereby considerably reducing the need for manual annotation17. The practical application of foundation models involves tessellating WSIs into small, non-overlapping patches, from which image features are then extracted. These extracted features serve as inputs for training classification or regression models, such as vision transformers (ViTs)18, tailored to specific tasks including mutation prediction, survival analysis, disease grading or cancer classification19. The limited availability and variable quality of public pathology data can hinder the performance of these models when applied to real-world clinical scenarios20. Recent efforts have demonstrated the potential of large-scale foundation models in computational pathology. Unlike earlier models that relied heavily on datasets such as The Cancer Genome Atlas (TCGA), contemporary foundation models are now trained on much larger proprietary cohorts such as Mass-100K (100,000 WSIs)21, Providence (171,000 WSIs)22 and Memorial Sloan Kettering Cancer Center (1,488,000 WSIs)23.
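As a minimal illustration of this patch-based workflow (not the pipeline used in this study), the following sketch tessellates an image into non-overlapping tiles, embeds each tile with a frozen, generic backbone and attaches a small classification head; all names and the mean-pooling aggregation are illustrative assumptions.

```python
# Illustrative sketch only: tessellation, frozen feature extraction and a simple
# downstream head. Any real pipeline (e.g. STAMP) differs in the details.
import torch
import torch.nn as nn

def tessellate(wsi: torch.Tensor, patch: int = 224) -> torch.Tensor:
    """Split a (3, H, W) image tensor into non-overlapping (3, patch, patch) tiles."""
    c, h, w = wsi.shape
    wsi = wsi[:, : h - h % patch, : w - w % patch]               # drop ragged borders
    tiles = wsi.unfold(1, patch, patch).unfold(2, patch, patch)  # (3, nH, nW, p, p)
    return tiles.permute(1, 2, 0, 3, 4).reshape(-1, c, patch, patch)

class SlideClassifier(nn.Module):
    """Bag-level classifier on top of frozen tile embeddings (mean pooling as a placeholder)."""
    def __init__(self, backbone: nn.Module, embed_dim: int, n_classes: int):
        super().__init__()
        self.backbone = backbone.eval()          # frozen foundation model
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, tiles: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.backbone(tiles)         # (n_tiles, embed_dim)
        return self.head(feats.mean(dim=0, keepdim=True))   # slide-level logits
```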

Foundation models have enabled the rapid development of specialized, task-specific downstream models by providing a stable base architecture. These downstream models require substantially less data and computational resources since they build upon the pre-existing foundation model. While the success of foundation models is typically measured by downstream model performance, their evaluation has largely been limited to narrow benchmarks without proper external validation. This restricted testing approach risks data leakage and selective reporting of only the best-performing models. As a result, most foundation models lack systematic evaluation across a broad spectrum of clinically relevant tasks, leaving their true capabilities and limitations incompletely understood.

In this study, we put forth a comprehensive benchmarking effort for histopathology foundation models. By including multiple proprietary cohorts from multiple countries, which were never part of any foundation model training, we effectively mitigate the risk of data leakage from pretraining datasets. Our benchmarking includes 19 foundation models and 31 clinically relevant evaluation tasks, 19 of which involve the prediction of cancer biomarkers, using a total of 6,818 patients and 9,528 slides. This comprehensive evaluation bridges a notable gap in the digital pathology literature and will serve as an important reference point for the digital pathology community, helping to select the right foundation model for a specific digital pathology task.

Results

Benchmark of pathology foundation models

We benchmarked the performance of 19 foundation models and 14 ensembles derived from these models, trained as vision-language or vision-only, on 31 weakly supervised downstream prediction tasks related to morphology (n = 5), biomarkers (n = 19) and prognostication (n = 7) (Fig. 1).

Fig. 1: Experimental design of the study.

Benchmarking of 19 histopathology foundation models using 13 cohorts and 31 tasks. a, Number of slides used from each of the 13 cohorts including 4 cancer types. b, In total, 9,528 haematoxylin and eosin (H&E)-stained WSIs were preprocessed using the standardized STAMP19 pipeline. Feature extraction from the processed tiles was performed using the 19 foundation models analysed in this study. The TCGA features were utilized for fivefold cross-validation with downstream transformer models on 31 classification tasks using STAMP. All models were subsequently applied to external features from CPTAC, Bern, Kiel, DACHS and IEO. The transformer architecture schematic shows layer normalization (Norm) and multi-headed self-attention (MSHA), followed by an MLP. c, All experiments were analysed using AUROCs, supplemented by AUPRC, Pearson’s correlation coefficient, DeLong’s test, balanced accuracy and F1 score. CONCH achieves the highest average AUROC across all tasks, followed by Virchow2, Prov-GigaPath and DinoSSLPath. The star indicates that Panakeia was tested on all tasks despite being specifically designed for BRCA and CRC. Attention heatmaps were generated for some slides to interpret differences between foundation models.


For the 5 morphology-related tasks, CONCH yielded the highest mean area under the receiver operating characteristic curve (AUROC) of 0.77, followed by Virchow2 and DinoSSLPath with mean AUROCs of 0.76 (Fig. 2c). Across the 19 biomarker-related tasks, Virchow2 and CONCH achieved the highest mean AUROCs of 0.73, followed closely by Prov-GigaPath with a mean AUROC of 0.72 (Fig. 2d). Finally, in the 7 prognostic-related tasks, CONCH yielded the highest mean AUROC of 0.63, followed by Virchow2 and BiomedCLIP with mean AUROCs of 0.61 (Fig. 2e). Averaged across all 31 tasks, CONCH and Virchow2 had the highest AUROCs of 0.71, followed by Prov-GigaPath and DinoSSLPath with AUROCs of 0.69. Subsequent rankings included H-optimus-0, UNI and Panakeia (0.68), Virchow, Hibou-L and CTransPath (0.67), BiomedCLIP and Kaiko (0.66), Phikon (0.65) and PLIP (0.64). Moreover, CONCH achieved the highest average area under the precision-recall curve (AUPRC), balanced accuracy and F1 scores (Extended Data Fig. 1), with the highest average AUROC in each cancer type obtained by CONCH (stomach adenocarcinoma (STAD), non-small-cell lung cancer (NSCLC)), Virchow2 (colorectal cancer (CRC)) and BiomedCLIP (breast cancer (BRCA)) (Extended Data Fig. 2a). To further validate our findings, we compared the performance of transformer-based aggregation with the widely used attention-based multiple instance learning (ABMIL) approach24. Across all 31 tasks, ABMIL performed slightly worse than the transformer-based model, with an average AUROC difference of 0.01, leaving the overall model rankings largely unchanged (Extended Data Fig. 3).

Fig. 2: Performance of 19 pathology foundation models on 31 weakly supervised prediction tasks.

a, AUROC scores of the four best foundation models, task-wise normalization. b, AUROC scores of the two best foundation models compared with the average prediction of the four best models (Avg-Pred) and the concatenated vectors of CONCH and Prov-GigaPath (Concat). c–e, Average AUROC scores of the five folds of each foundation model on morphology (c), biomarker (d) and prognosis (e) tasks. Task-wise normalization for better comparison of the foundation models. Tasks are sorted by their mean AUROC across all models, while models are sorted by their mean AUROC across all tasks. f, Stacked pie charts showing the number of tasks where each model achieved an average AUROC of >0.7, 0.6–0.7 or <0.6, grouped by task type. g, Average AUROC scores of the five folds using encoded tile embeddings from slide encoders versus the original tile embeddings. The star indicates that Panakeia was tested on all tasks despite being specifically designed for BRCA and CRC.


For histopathology slide encoders, we retrieved the encoded tile-level embeddings to make them applicable to our MIL approach. The original tile embeddings consistently outperformed their slide-level counterparts, indicating that the performance of the encoded tile embeddings is driven by the quality of the original tile embeddings rather than by the slide encoder (Fig. 2g).

In statistical AUROC comparisons across 29 binary classification tasks, CONCH yielded higher AUROCs, which were significantly different from other models in a substantial number of tasks: PLIP (16), Phikon and BiomedCLIP (13), Kaiko (11) and 7 tasks each for Hibou-L, H-optimus-0, CTransPath, Virchow, Panakeia, UNI and DinoSSLPath, with 5 tasks each for Prov-GigaPath and Virchow2. Conversely, few models yielded higher AUROCs than CONCH: Virchow2 (6), Prov-GigaPath (3), Panakeia and Kaiko (2) and DinoSSLPath, UNI, Virchow and Hibou-L (1). Notably, PLIP, Phikon, BiomedCLIP, H-optimus-0 and CTransPath were not significantly better than CONCH in any of the tasks (P < 0.05; Extended Data Fig. 4b). Among the vision-only models, Virchow2 was significantly better than all other models in between 6 and 12 tasks (P < 0.05; Extended Data Fig. 4c).

Together, these data show that CONCH, a vision-language model trained on 1.17 million image-caption pairs (ICPs), performs on par with Virchow2, a vision-only model trained on 3.1 million WSIs; that these two models outperform all other pathology foundation models across the three highlighted domains of morphology, biomarker and prognostication prediction tasks; and that slide encoders are ineffective in an MIL set-up.

Performance of foundation models in scarce data settings

One of the predominant selling points of foundation models in computational pathology is the mitigation of the traditional requirement for extensive labelled datasets when analysing rare (molecular) events. Consequently, we analysed the performance of pathology foundation models across two dimensions: WSI count for foundation model training, and patient and positive case counts for downstream model training, with emphasis on low-prevalence scenarios that reflect real-world clinical applications.

From the foundation model perspective, positive correlations (r = 0.29–0.74) were observed between downstream performance and pretraining dataset size (WSIs, patients) or diversity (tissue sites) across morphology, biomarker and prognosis tasks, although most were not statistically significant. Significant correlations were found only for morphology with patient count (r = 0.73, P < 0.05) and tissue site diversity (r = 0.74, P < 0.05) (Fig. 3a). These findings suggest that these factors are important but not the sole determinants, with the distribution of anatomic tissue sites (Supplementary Table 1 and Supplementary Fig. 1), architecture and dataset quality also playing critical roles. This is especially evident in vision-language models, where CONCH outperformed BiomedCLIP despite seeing far fewer ICPs (1.1 million versus 15 million) (Fig. 3b). Similarly, tissue representation in pretraining datasets showed a moderate, but not significant, correlation with performance by cancer type (Fig. 3c). Interestingly, the Panakeia models showed reasonable performance on unrelated cancer types, with the BRCA model achieving average results in NSCLC and the CRC model performing similarly in STAD, despite no previous exposure to these tissues during training.
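As a small, hedged sketch of how such associations can be quantified, the following computes a two-sided Pearson correlation between pretraining scale and mean downstream AUROC; the numbers are placeholders, not values from this study.

```python
# Pearson correlation between pretraining dataset size and mean downstream AUROC.
# All values below are hypothetical placeholders.
from scipy.stats import pearsonr

n_wsis = [30_000, 100_000, 171_000, 1_488_000, 3_100_000]   # pretraining WSI counts (hypothetical)
mean_auroc = [0.66, 0.68, 0.69, 0.67, 0.71]                  # mean downstream AUROCs (hypothetical)

r, p_value = pearsonr(n_wsis, mean_auroc)                    # two-sided test, no multiplicity correction
print(f"Pearson r = {r:.2f}, P = {p_value:.3f}")
```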

Fig. 3: The impact of data diversity and volume on downstream weakly supervised classification performance.

a–c, The impact of foundation model data diversity on downstream classification. Pearson’s correlation (two-sided) was used to assess associations between pretraining dataset characteristics and downstream performance. No adjustments were made for multiple comparisons. Correlation between the number of WSIs, patients and anatomic tissue sites in the pretraining dataset and the average AUROC for each downstream task type for all vision-only foundation models for which this data is available. Here, k denotes thousands. (a). Correlation between the number of ICPs (in millions, M) in the pretraining dataset and the average AUROC for each downstream task type for all vision-language foundation models (b). Performance of the respective cancer types correlated with the proportion of the cancer type in the pretraining dataset (c). All information that was available is shown (Supplementary Tables 6–8). d,e, Experiments with reduced downstream training sizes. Average AUROC scores across 29 tasks, trained with 75, 150 or 300 patients (d). Distribution of AUROC scores across all tasks for each model separately. Violin plots show kernel density estimates of AUROC scores, truncated at the observed range. The inner box marks the median and interquartile range (25th–75th percentiles), with whiskers extending to the most extreme values within 1.5× the interquartile range (e). The star indicates that Panakeia was tested on all tasks despite being specifically designed for BRCA and CRC.


Downstream models were trained on randomly sampled cohorts of 300, 150 and 75 patients while keeping a similar ratio of positive samples, and were subsequently validated on the full-size external cohorts. In the largest sampled cohort (n = 300), Virchow2 demonstrated superior performance in 8 tasks, followed closely by PRISM with 7 tasks. With the medium-sized sampled cohort (n = 150), PRISM dominated by leading in 9 tasks, while Virchow2 followed with 6 tasks. The smallest sampled cohort size (n = 75) showed more balanced results, with CONCH leading in 5 tasks, while PRISM and Virchow2 each led in 4 tasks. Performance metrics remained relatively stable between the n = 75 and n = 150 cohorts (Fig. 3d,e and Extended Data Fig. 5).
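A minimal sketch of the subsampling step, assuming a clinical table `clini` with a binary `label` column (both names are illustrative): patients are drawn at random while approximately preserving the positive-class ratio, and the resulting cohort is used for training before validation on the full external cohorts.

```python
# Stratified subsampling of the training cohort to n patients, preserving the label ratio.
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_subsample(clini: pd.DataFrame, n_patients: int, seed: int = 0) -> pd.DataFrame:
    """Return a random subsample of n_patients with roughly the original positive ratio."""
    subsample, _ = train_test_split(
        clini, train_size=n_patients, stratify=clini["label"], random_state=seed
    )
    return subsample

# e.g. cohorts of 300, 150 and 75 patients
# small_cohorts = {n: stratified_subsample(clini, n) for n in (300, 150, 75)}
```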

To evaluate foundation models in real-world clinical scenarios, we focused on clinically relevant tasks with rare positive cases (≤15%) in the TCGA training cohort. Key low-prevalence biomarkers included BRAF mutation (10%), CpG island methylator phenotype (CIMP) status (13%) and microsatellite instability (MSI) status (14%) in CRC; Epstein–Barr virus (EBV) positivity (8%) and M-status (7%) in STAD; and EGFR mutation (11%) and STK11 mutation (15%) in lung adenocarcinoma (LUAD). To avoid cancer type imbalance, these targets were only evaluated in DACHS, Kiel and CPTAC-LUAD. The results show that Prov-GigaPath (mean AUROC of 0.74) yields the highest performance in the highlighted low-prevalence tasks, followed by Virchow (0.73) and CONCH (0.72) (Extended Data Fig. 2b).

Finally, tasks were stratified into high- and low-performance tasks by AUROC (Extended Data Fig. 6). In high-performance tasks (AUROC > 0.75), Virchow2 demonstrated superior performance, followed by Prov-GigaPath and CONCH. Conversely, in low-performance tasks (AUROC ≤ 0.75), CONCH yielded better results.

Together, these results indicate that patient count, tissue site diversity and their distribution are important for downstream performance, although other factors such as architecture and dataset quality also have critical roles. Moreover, the performance in downstream tasks with low-prevalence cases indicates the limitations of current foundation models for biomarkers that are nonetheless clinically relevant. Lastly, we show differential model efficacy based on task complexity, with Virchow2 excelling in standard classification tasks while CONCH dominates in more challenging predictive scenarios. All models show similar performance declines with reduced training sizes, underlining the weakness of current pathology foundation models in scarce data scenarios.

Foundation models learn different tissue morphologies

To quantitatively measure prediction similarity across models, we calculated Cohen’s kappa25. For each task, labels were assigned using a majority vote across the cross-validation folds. Cohen’s kappa scores were generally moderate and varied across models. Notably, some pairs such as Panakeia and DinoSSLPath (0.55), PLIP and BiomedCLIP (0.52) and top performers such as Prov-GigaPath, CONCH, Virchow2 and DinoSSLPath showed higher agreement, whereas lower-performing models such as Hibou and Kaiko exhibited the least consensus (0.28) (Fig. 4b). Within individual model folds, BiomedCLIP and CONCH achieved the highest average kappa (0.41), followed by Virchow2, Panakeia and Prov-GigaPath (0.37), with Hibou (0.26) and Kaiko (0.24) ranking lowest, consistent with their AUROC performance (Fig. 4c).
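The agreement analysis can be sketched as follows, assuming per-patient prediction scores for each fold are available as arrays (the shapes and the 0.5 cut-off are assumptions): each model's predictions are binarized by majority vote over its five folds, and Cohen's kappa is then computed for every model pair.

```python
# Majority-vote binarization across folds, followed by pairwise Cohen's kappa.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def majority_vote(fold_scores: np.ndarray, cutoff: float = 0.5) -> np.ndarray:
    """fold_scores: (n_folds, n_patients) scores -> one binary label per patient."""
    votes = (fold_scores >= cutoff).astype(int)
    return (votes.mean(axis=0) >= 0.5).astype(int)

def pairwise_kappa(model_scores: dict[str, np.ndarray]) -> dict[tuple[str, str], float]:
    """model_scores maps model name -> (n_folds, n_patients) array of prediction scores."""
    labels = {m: majority_vote(s) for m, s in model_scores.items()}
    names = list(labels)
    return {
        (a, b): cohen_kappa_score(labels[a], labels[b])
        for i, a in enumerate(names) for b in names[i + 1:]
    }
```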

Fig. 4: Divergence in tissue focus and predictive similarity among foundation models.

a, Attention heatmap analysis for MSI-H classification in four different DACHS samples selected for correct predictions across selected foundation models. Thumbnails of the original WSIs and heatmaps of selected foundation models. b, Objective measure of similarity of prediction scores using Cohen’s kappa and majority vote across the five folds to binarize the predictions. Kappa scores of all combinations of foundation models tested in this study. c, Cohen’s kappa between the five folds of each foundation model. The star indicates that Panakeia was tested on all tasks despite being specifically designed for BRCA and CRC.


To identify the reasons behind the observed performance differences among the downstream models trained on top of the different foundation models, we investigated whether the models focus on different morphological properties for their predictions. We utilized attention heatmaps to compare model behaviour when the models (1) consistently predicted the label correctly and (2) were in disagreement regarding the predicted label. In cases where all models were in agreement on the correct prediction, the validity of the classification would be supported by their focus on relevant tissue regions for diagnosis. For example, in the prediction of MSI status, models predominantly highlighted tumour regions, as expected. However, models such as UNI, Hibou, Virchow and Kaiko occasionally highlighted pen marks, which is an undesired behaviour that suggests that predictions are being made through some form of pattern association rather than understanding the underlying biology (Fig. 4a and Extended Data Fig. 7b). To assess the impact of pen marks, we quantified their occurrence in 50 randomly sampled slides per test cohort and found them present in 90% of slides from DACHS and 22% from Bern, but absent elsewhere. Despite their presence, pen marks did not skew classification, as they were equally distributed across different classes. Models such as CONCH and Virchow focused on multiple small tissue areas, whereas Prov-GigaPath appears less selective in its attention (Fig. 4a). In NSCLC subtyping, models generally performed well, focusing mainly on tumour regions and ignoring healthy lung parenchyma (Extended Data Fig. 8b). In ESR1 overexpression prediction, Prov-GigaPath and Kaiko highlighted the majority of the WSI area, whereas CONCH and Virchow focused on a few small tissue areas (Extended Data Fig. 8c). By contrast, when analysing slides where models made inconsistent predictions, we found instances of model disagreement that led to errors. For instance, in the task of DACHS CRC sidedness, Virchow erroneously focused on pen marks (Extended Data Fig. 7b). However, no consistent pattern of errors emerged across the models to fully explain these discrepancies.

Together, these data indicate that foundation models vary in their focus on tissue regions and the morphological features that they prioritize, which impacts their predictive performance. The differences in attention across models suggest that combining models with complementary strengths could enhance overall predictive accuracy in ensemble approaches.

Ensembles of pathology foundation models improve performance

Lastly, we tested the hypothesis that creating an ensemble of pathology foundation models improves prediction performance. We utilized two approaches for ensembling models: averaging the prediction scores of downstream models trained on different foundation model backbones, and concatenating feature vectors from different foundation model backbones to train a single downstream model.
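Both strategies can be sketched compactly, assuming that per-patient prediction scores and per-tile feature matrices have already been computed (array shapes are assumptions):

```python
# Two ensembling strategies: averaging prediction scores versus concatenating tile features.
import numpy as np

def score_average_ensemble(scores_per_model: list[np.ndarray]) -> np.ndarray:
    """Average per-patient prediction scores from downstream models built on different backbones."""
    return np.mean(np.stack(scores_per_model, axis=0), axis=0)

def concat_tile_features(feats_per_model: list[np.ndarray]) -> np.ndarray:
    """Concatenate per-tile feature matrices (n_tiles, M_i) along the feature axis,
    e.g. CONCH (512) + Virchow2 (1,280) -> 1,792 dimensions per tile."""
    return np.concatenate(feats_per_model, axis=1)
```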

Experiments show that ensembling by averaging the models’ prediction scores yielded a superior AUROC compared with any of the constituent models used in isolation. The combination of the four top-performing models led to the highest improvement, achieving a mean AUROC 1.2% higher than CONCH (Extended Data Fig. 9), the leading individual model (Fig. 1b). Across all 31 tasks, the ensemble reduced misclassifications compared with CONCH by an average of 6.2% across the five folds (cut-off 0.5) (Supplementary Table 2). Therefore, these data show that ensembling the prediction scores of multiple high-performing models enhances performance on certain tasks beyond the capabilities of the best individual model.

Combining the best-performing models, CONCH and Virchow2, yielded a 1,792-dimensional vector with the highest AUROC of 0.719. Similarly, combining Virchow2 and Prov-GigaPath, the top-performing vision-only models, resulted in a 2,816-dimensional vector with an AUROC of 0.716. Individually, the models achieved AUROCs of 0.711 for CONCH, 0.709 for Virchow2 and 0.692 for Prov-GigaPath (Fig. 1b and Extended Data Fig. 9). Interestingly, Cohen’s kappa between the individual models did not strongly correlate with ensemble quality, indicating that low agreement does not necessarily translate to beneficial diversity in predictions. Similarly, no clear pattern was observed between the similarity of ensembles with their single-model counterparts and factors such as model performance or embedding size (Extended Data Fig. 10). To quantify improvements, we conducted two-sided DeLong’s tests comparing AUROC scores of CONCH with ensembles and other single-model baselines. For each model, we averaged prediction scores across five folds, and across up to ten folds for ensembles. Bagging the five folds of the same foundation model increased AUROC scores, while integrating different models via stacking or concatenation yielded more pronounced improvements (Extended Data Fig. 4a). The CONCH and Virchow2 ensemble showed statistically significantly higher AUROCs than CONCH in 9 of 29 tasks (P < 0.05), whereas the Virchow2 and Prov-GigaPath ensemble showed significant improvements in 7 tasks (Extended Data Fig. 4b).

These results demonstrate that ensemble approaches for pathology foundation models, as well as their downstream models, lead to enhanced prediction performance. This suggests that merging multiple foundation models through ensemble techniques can be beneficial.

Discussion

Weakly supervised computational pathology approaches, in which a deep learning system predicts a label directly from a WSI, have been massively successful in cancer research. They have been used to diagnose tumours, to predict biomarker status and to predict clinical outcomes directly from image data. Over 100 such tools are now approved for clinical use in the United States and the European Union26,27. Since 2022, foundation models have become an integral part of weakly supervised computational pathology pipelines and have improved performance and generalizability4,28. However, the current internal evaluation strategy for foundation models in computational pathology for clinically relevant tasks is limited. When groups that publish pathology foundation models evaluate them on tasks of their own choosing, there is a high potential for bias. Moreover, concerns about data leakage arise when foundation models are tested on images from the same institutions where they were trained.

In this study, we conducted a comprehensive evaluation of pathology foundation models in weakly supervised computational pathology on truly external datasets with no overlap between training and validation data. Our results show that while many existing foundation models achieve high performance on clinically relevant prediction tasks, CLIP-based approaches are not inherently superior, as evidenced by the performance of BiomedCLIP and PLIP. Instead, high-quality pretraining data and effective data cleaning are crucial for achieving top-tier performance. The best-performing model, CONCH, was trained with multimodal data, suggesting that incorporating text during training enhances image-only embedding quality. Similarly, Virchow2’s strong performance stems from its unprecedented tissue type diversity (approximately 200 tissue types versus 20–30 in other models) and more balanced distribution, avoiding over-representation of specific cancer types. In addition, the variability in model performance can also be attributed to varying degrees of difficulty across tasks. For instance, while differentiating between lung carcinoma subtypes is generally straightforward, other tasks such as stomach cancer subtyping can be more demanding; here, even pathologists can show a considerable degree of interobserver disagreement29.

In terms of prediction interpretability, our approach highlights that different foundation models focus on different areas in the tissue while still having a high agreement on the predicted label. Our technical analysis revealed that slide encoders showed no advantage over tile encoders in MIL set-ups, except in low-data scenarios, and the transformer-based STAMP architecture generally outperformed ABMIL outside of data-limited settings. Interestingly, while CONCH dominated in tasks when trained on the full dataset, its advantages diminished in low-data and low-prevalence settings. This performance dichotomy suggests that multimodal training of a foundation model, despite its presumed benefits, does not confer special advantages in the data-constrained scenarios often encountered in clinical settings within the scope of our experiments. We demonstrate that ensembling foundation models is beneficial, particularly when combining top-performing models, although prediction diversity (measured by Cohen’s kappa) does not directly correlate with ensemble performance. Even modest ensemble improvements may have clinical relevance by combining several learned perspectives of tissue morphology, as exemplified by the higher biomarker classification performance. Future work should incorporate more sophisticated methods than feature vector concatenation, especially for larger models where combining large vectors might lead to overfitting.

A key insight of our study is that the performance of foundation models does not scale well with increasing numbers of images in the training set used for SSL. This means that bigger is not always better. Rather, the diversity of the training set appears to be a key factor, favouring varied data sources, patient races and cancer types. Our results will inform the future development of new foundation models. Specifically, using multimodal data to train models, even if the intention is just to apply them to unimodal data (that is, to images alone), should be encouraged. For healthcare institutions, this means that data that is available at scale, even without association with clinical endpoints, is a valuable resource for training such models. Moreover, our findings suggest that the selected computational pathology tasks may be solvable primarily through local morphological patterns rather than requiring global spatial context. The performance achieved by randomly sampling 512 tiles per patient at each epoch suggests that, for many tasks, the discriminative features exist at the local level. This observation is consistent with our comparison showing that tile-level encoders outperformed slide-level encoders despite the latter’s theoretical advantage in capturing global spatial relationships. Future research should explore in further depth whether the selected tasks and performance metrics adequately represent the spectrum of diagnostic challenges, particularly those requiring integration of long-range spatial dependencies across the entire slide.

Our study has limitations in that our evaluation tasks only cover certain tumour types. We focused on four cancer types, prioritizing truly external validation datasets over broader cancer type coverage. This differentiates our work from studies that train and test on the same cohort or on WSIs from the same hospital used for pretraining. Moreover, we were limited to pathology foundation models whose licences permit use in a research setting; for example, this excludes RudolphV and PLUTO from our analysis. While our datasets contained artefacts such as pen marks (present in 90% of DACHS and 22% of Bern samples), these had minimal impact on predictions owing to their even distribution across classes. Although we incorporated a broad range of foundation models applicable to histology data, exploring the potential of fine-tuning general-purpose models such as GPT-4o was outside our current scope. Our evaluation strategy is focused on a diverse set of biomarkers in cancer histopathology. Future work will expand the range of tumour types, biomarkers and patient cohorts to further evaluate the robustness of foundation models in pathology.

Methods

Ethics statement

This study was carried out in accordance with the Declaration of Helsinki. The Clinical Proteomic Tumor Analysis Consortium (CPTAC) and TCGA did not require formal ethics approval for a retrospective study of anonymized samples. The analysis of the testing cohort DACHS (an epidemiological study that is led by the German Cancer Research Center, DKFZ) was approved by the ethics committee of the Medical Faculty, University of Heidelberg, under 310/2001 (refs. 30,31,32).

Datasets

The study used datasets from TCGA, CPTAC and proprietary cohorts. Specifically, cohorts from LUAD, lung squamous cell carcinoma (LUSC), CRC, STAD and BRCA were included. TCGA datasets were used for training of the models, and CPTAC, DACHS, Kiel, Bern and IEO were used for evaluation. This ensured that all testing was done on data that had been seen during training of neither the foundation models nor the aggregator models. For our analyses, we only used the CPTAC-2 and CPTAC-3 prospective collections (from 2018/20), which exclusively contain patients with CPTAC IDs and have no overlap with TCGA patients.

For external validation, CPTAC datasets for LUAD, LUSC, colorectal adenocarcinoma and BRCA were used. No foundation models analysed in this study were trained on CPTAC, ensuring its suitability as an independent test cohort. In addition, for CRC, the DACHS cohort was utilized alongside CPTAC as another external test set. In STAD, proprietary datasets from Kiel and Bern served as external validation cohorts. For BRCA, the IEO dataset was used alongside CPTAC for external validation (Fig. 1a and Supplementary Fig. 2).

Experimental design

Digital pathology involves several task categories, including morphological, biomarker and prognostic tasks, and foundation models should be capable of performing well across all of them. In this study, we assembled and benchmarked 19 foundation models—the 12 pure vision models CTransPath28, DinoSSLPath33, Phikon16, UNI21, Virchow23, Kaiko (ViT-L/14)34, Prov-GigaPath22, Hibou-B, Hibou-L35, H-optimus-036, Virchow237 and Panakeia, the 3 vision-language models PLIP38, BiomedCLIP39 and CONCH40, and the 4 slide encoders GigaPath, MADELEINE41, PRISM42 and CHIEF43—across a comprehensive set of tasks from all three categories. Each category was assessed across all cancer types, apart from morphological features in BRCA and prognostic features in NSCLC owing to data unavailability. Biomarkers were selected based on clinical relevance, diversity and availability. Tasks were prioritized when they were associated with actionable therapeutic targets, as annotated by OncoKB44. To enable both training and independent testing, each task required ground truth data to be available in TCGA (for training) and at least one test cohort. For each cohort, only tasks with at least ten cases in each category were included (Supplementary Table 3). For visualization purposes, only 15 models (vision-only and vision-language models) are shown in most figures. The slide encoders were included selectively, such as in Fig. 2g for comparison with their tile embedding counterparts and in Fig. 3d,e and Extended Data Fig. 5 to highlight their potential benefits in scarce data settings. Extended Data Figs. 1 and 9 include all models to comprehensively show all experiments.

First, we investigated morphological classification tasks related to cancer subgroups with distinct phenotypic characteristics. The aim was to assess foundation models by evaluating their ability to discern established phenotypic distinctions. In CRC, the morphological task involved predicting whether the slide originated from the left or right side of the colon, excluding colon transversum samples owing to ambiguous classification. In STAD, the Lauren classification45 was chosen as the morphological task, classifying slides as ‘intestinal’, ‘diffuse’ or ‘mixed’, given the unavailability of ground truth for newer classification systems46,47. In lung cancer, the models were tasked with classifying samples into either adenocarcinoma or squamous cell carcinoma1.

Biomarker prediction tasks focused mainly on clinically relevant targets with some type of morphological correlation as demonstrated by previous computational pathology models. For CRC, these included BRAF, KRAS, MSI status, PIK3CA and CIMP status11. For STAD, EBV presence and MSI status were selected48. For LUAD, the targets were EGFR, STK11, KRAS and TP531. For BRCA, the targets were the expression of HER2, ER and PR receptors and PIK3CA mutations49,50. MSI status and CIMP status were binarized into MSI-high versus not MSI-high and CIMP-high versus not CIMP-high, respectively. HER2, ESR1 and PGR expression were binarized using the z-score of mRNA expression profiles, similar to a study by Wegscheider et al.51. This approach was preferred over immunohistochemistry labels owing to its objectivity and reduced variance error.
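As a hedged sketch of this binarization (the exact threshold and column names are assumptions), expression values can be z-scored and split at zero into high versus not-high:

```python
# z-score binarization of mRNA expression into high (1) versus not-high (0).
import pandas as pd

def binarize_expression(expr: pd.Series) -> pd.Series:
    """Standardize expression values and threshold the z-score at zero."""
    z = (expr - expr.mean()) / expr.std()
    return (z > 0).astype(int)

# e.g. labels = binarize_expression(clini["ESR1_mrna_expression"])  # column name is illustrative
```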

Prognostic tasks, which aim to predict clinical outcomes directly from WSIs, were selected based on their prognostic relevance. The tasks included N-status for CRC, STAD and BRCA, where all stages except N0 were classified as N+ (excluding Nx cases). M-status was analysed in CRC and STAD, performing binary classification of M0 versus M+.

By focusing on tasks with clear therapeutic actionability or prognostic relevance, we aimed to evaluate the practical utility of these models in a clinical setting. This comprehensive benchmarking study included 31 tasks across 8 external test cohorts, encompassing a wide range of clinically relevant classification tasks (Supplementary Table 4).

Image processing and deep learning techniques

The benchmarking was conducted using the STAMP pipeline version 1.1.1 (ref. 19) (Supplementary Table 5). Each classification task followed a two-step procedure (Fig. 1b). In the first step, feature vectors were extracted from WSIs utilizing the foundation models evaluated in this study. In the second step, these vectors were used to train a slide-level aggregator on the downstream tasks described above.

WSIs were segmented into N tiles, with an edge length of 224 pixels corresponding to 256 µm, resulting in an effective resolution of ~1.14 µm per pixel. All foundation models included in our benchmark, except for Prov-GigaPath22, tessellate the slide into tiles of 224 × 224 pixels. However, the Prov-GigaPath implementation transforms tiles by centre cropping from 256 × 256 to 224 × 224 before inputting them into the tile encoder. The slide encoder then processes the feature embeddings generated by the tile encoder, implicitly maintaining the 224 × 224 tile dimensionality throughout the pipeline. Therefore, our choice of tile dimensionality for slide tessellation is consistent with the foundation models selected for our analyses. Background tiles were excluded using Canny edge detection52. Stain normalization was not applied during preprocessing. Feature extraction was performed on each tile individually using the different foundation models. The embedding dimensions M varied across models, ranging from M = 384 for DinoSSLPath and Panakeia to M = 1,536 for Prov-GigaPath and H-optimus-0. Subsequently, each slide was transformed into a two-dimensional matrix with dimensions N × M. The extracted feature vectors were input into a transformer-based aggregator model4. It utilizes multi-head attention, Gaussian error linear unit activation functions53, layer normalization and a multilayer perceptron (MLP) head to produce an output corresponding to the k possible classes for each task. A 5-fold cross-validation approach was implemented, resulting in the creation of 2,945 models (19 foundation models × 31 tasks × 5 folds) trained exclusively on TCGA datasets. We implemented stratified k-fold cross-validation to ensure that each fold maintains representative proportions of all classes, preventing scenarios where rare categories have zero instances in training runs. This approach follows standard practice in computational pathology and provides robust performance estimates and better generalization assessment10. All experiments were run on individual 40 GB NVIDIA RTX A6000 and L40 GPU (graphics processing unit) nodes. In addition to the transformer-based aggregator described above, we evaluated ABMIL as an alternative aggregation method24. ABMIL introduces inductive bias by using attention mechanisms to assign weights to each tile in a slide, enabling the model to focus on the most informative regions.
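A minimal sketch of the background-rejection step, assuming RGB tiles as NumPy arrays; the Canny thresholds and edge-density cut-off below are illustrative, not the exact STAMP parameters:

```python
# Reject background tiles using Canny edge detection; tissue tiles contain many edges,
# whereas empty glass is mostly flat.
import cv2
import numpy as np

MICRONS_PER_TILE = 256   # physical edge length of a tile
TILE_PX = 224            # pixel edge length -> ~1.14 um per pixel (256 / 224)

def is_tissue(tile_rgb: np.ndarray, min_edge_fraction: float = 0.02) -> bool:
    """Keep a tile only if a sufficient fraction of its pixels lie on Canny edges."""
    gray = cv2.cvtColor(tile_rgb, cv2.COLOR_RGB2GRAY)
    edges = cv2.Canny(gray, threshold1=40, threshold2=100)   # thresholds are assumptions
    return (edges > 0).mean() >= min_edge_fraction

# Tiles passing this filter are embedded individually, yielding an N x M matrix per slide.
```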

To integrate slide encoders into the MIL pipeline, we extracted the encoded tile-level embeddings for Prov-GigaPath, MADELEINE, CHIEF and the 512 latents for PRISM. These encoded tile embeddings were subsequently treated as regular tile embeddings in all analyses. Unless explicitly stated otherwise, results presented throughout the study refer to the regular tile embeddings. Prov-GigaPath provides both a slide-level and a tile-level encoder, and we evaluated both approaches22. In the case of Virchow and Virchow2, Vorontsov et al. proposed concatenating the class token with the average pool of patch tokens for each tile embedding. To maintain consistency with other models that only use class tokens, two configurations were tested: one including and one excluding the averaged patch tokens. As the differences are very small, the version only using class tokens is shown in the main results for consistency with other models. For CONCH, we used the output of the attentional pooler that corresponds to image-text alignment, with an embedding dimension of 512. Although the Panakeia models are specifically designed for BRCA and CRC, respectively, we also evaluate the CRC model on STAD and the BRCA model on NSCLC. This is because their performance remains competitive in these contexts, and including these results provides the basis for comparison in subsequent analyses. For experiments involving combined feature vectors, vectors were concatenated, maintaining a single vector per tile. For instance, combining CONCH and Virchow2 resulted in a combined embedding dimension M of M = 1,792 (M = 512 for CONCH + M = 1,280 for Virchow2).
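The two Virchow/Virchow2 embedding configurations compared above can be expressed as a small helper; tensor shapes are assumptions based on the description in the text:

```python
# Class-token-only embedding versus class token concatenated with the mean patch token.
import torch

def virchow_tile_embedding(cls_token: torch.Tensor,
                           patch_tokens: torch.Tensor,
                           include_patch_mean: bool = False) -> torch.Tensor:
    """cls_token: (D,); patch_tokens: (n_patches, D). Returns a (D,) or (2D,) embedding."""
    if include_patch_mean:
        return torch.cat([cls_token, patch_tokens.mean(dim=0)], dim=0)
    return cls_token
```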

Explainability

To better interpret the output of the models, we generated whole-slide prediction heatmaps for selected tasks. These heatmaps illustrate the models’ focus on specific tissue areas by weighting the scores assigned to individual tiles using gradient-weighted class activation mapping (Grad-CAM)54. It is important to note that a high number of positively contributing tiles does not automatically result in a high final score, owing to the nonlinear aggregation process in neural networks55. The benchmarking effort involved 2,945 models and 9,528 slides, leading to a vast number of model-slide combinations. Thus, it was necessary to select a few informative examples methodically. Slides were selected by including cases where models showed strong disagreements and cases where all models performed well. The heatmaps were visually analysed and compared with the underlying WSI. To further analyse the similarity between different models, Cohen’s kappa25 was measured between each pair of foundation models.
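As a simplified, gradient-based approximation in the spirit of Grad-CAM (not the exact implementation used here), per-tile relevance can be estimated from the gradient of the class logit with respect to the tile embeddings; the aggregator interface below is an assumption:

```python
# Gradient x input relevance per tile for a bag-level aggregator.
import torch

def tile_relevance(aggregator: torch.nn.Module, feats: torch.Tensor, target_class: int) -> torch.Tensor:
    """feats: (n_tiles, M) tile embeddings; returns one non-negative relevance score per tile."""
    feats = feats.clone().requires_grad_(True)
    logits = aggregator(feats.unsqueeze(0))        # assumes input shape (1, n_tiles, M)
    logits[0, target_class].backward()
    relevance = (feats * feats.grad).sum(dim=1)    # gradient-weighted contribution per tile
    return torch.relu(relevance)                   # keep positively contributing tiles
```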

Statistical analysis

The performance of the models was evaluated using the AUROC, with fivefold cross-validation and deployment on external cohorts. Mean AUROC scores from the five cross-validation models deployed on external data were used for statistical and graphical evaluations. Predictions were made per patient, and all feature matrices belonging to one patient were concatenated for use in the model. In addition to AUROC, for completeness in the supplementary material, we also calculated the AUPRC, balanced accuracy and F1 scores. The two-sided DeLong’s test was used to test for statistically significant differences in AUROC scores. As DeLong’s test is only applicable when a single prediction score is available for each model and sample, the average prediction score across all five folds was used. Owing to their multi-class nature, we excluded the Lauren classification tasks from this analysis. This differs from the main metrics, where the AUROC/AUPRC/F1/balanced accuracy scores represent the mean across the five folds.
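A short sketch of this evaluation step, assuming fold-wise prediction scores on the external cohort are stored as an array (DeLong's test itself is not part of scipy or scikit-learn and requires a dedicated implementation):

```python
# Average per-patient scores over the five folds, then compute the AUROC.
import numpy as np
from sklearn.metrics import roc_auc_score

def fold_averaged_auroc(fold_scores: np.ndarray, y_true: np.ndarray) -> float:
    """fold_scores: (n_folds, n_patients) external-cohort scores; y_true: binary labels."""
    avg_scores = fold_scores.mean(axis=0)   # single score per patient, as used for DeLong's test
    return float(roc_auc_score(y_true, avg_scores))
```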

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.