Background & Summary

The efficient utilization of digital pathology and computational resources has driven the rapid rise of AI-based computational pathology1,2. In recent years, general foundation models for pathology, pre-trained on large-scale data, have garnered significant attention3,4,5,6,7. These models have demonstrated strong feature extraction capabilities for pathological images, as evidenced by evaluations across a series of whole-slide image-level downstream tasks8,9,10. For example, CTranspath6 uses a Semantically-Relevant Contrastive Learning (SRCL) framework to pre-train a CNN-Transformer hybrid feature extractor on 150 million patches, with its effectiveness validated across five downstream tasks. UNI4 employed the self-supervised DINO-v211 method to train a robust general pathology visual encoder on one billion patches from approximately 100,000 whole slide images (WSIs). Gigapath5 utilized 1.3 billion patches to pre-train a visual encoder based on the ViT-Giant12 architecture and adopted LongNet13 to scale it into a slide-level foundation model for slide-level representation learning. Virchow14 is a pathology foundation model based on the ViT-Huge architecture, trained using the DINOv2 approach on a dataset constructed from 1,488,550 whole slide images (WSIs), enabling clinical-grade diagnosis and rare disease identification. Pathorchestra15 trained a ViT-Large encoder on 300,000 WSIs and conducted extensive evaluation across 112 downstream tasks, achieving over 95% accuracy on 47 of them. These pathology-pre-trained models have demonstrated superior performance in downstream tasks including tumor classification, survival analysis, and lesion segmentation. PLIP3 is a multimodal pathology foundation model developed with contrastive learning16 and pre-trained on approximately 200,000 pathology image-text pairs collected from medical Twitter, enabling both image and text comprehension. CONCH7 employs CoCa17 for self-supervised pre-training on 1.17 million image-caption pairs and has been extensively evaluated across 14 downstream benchmarks, demonstrating its outstanding performance. In addition to patch-level encoders, some studies have focused on developing pretrained slide-level encoders, which are built upon patch encoders. For example, CHIEF18 constructs a slide encoder with an ABMIL19 architecture through vision-language joint training based on CTranspath6, using 60,530 WSIs. Prism20 is a Transformer-based slide encoder trained on 587,196 WSIs, built upon patch embeddings from Virchow. Titan21 is a slide-level encoder trained via slide-level vision-language contrastive learning, based on CONCH-V1.521, an upgraded version of the CONCH model. Slide-level encoders eliminate the need to retrain aggregators by directly generating WSI-level representations through inference, enabling downstream slide-level tasks such as classification, survival analysis, and report generation.

Acquiring finely annotated large-scale pathology image datasets remains challenging due to the extremely high resolution of pathology images and the specialized expertise required for annotations. Nonetheless, the continued development of foundational models and downstream tasks in computational pathology makes high-quality pathology image datasets increasingly essential.

The Camelyon series22,23 (http://gigadb.org/dataset/100439), a publicly available pathology dataset focused on detecting breast cancer lymph node metastasis, is widely used for evaluating multiple instance learning (MIL) methods. However, as shown in Fig. 1, some images in the Camelyon series are of poor quality, exhibit treatment-related artifacts, or carry erroneous slide-level labels. The Camelyon-1622 dataset includes only tumor and negative labels, making its labels incompatible with those of Camelyon-1723. Many pixel-level annotations are inaccurate, and some slides lack pixel-level annotations entirely. These issues hinder the accurate evaluation of deep learning methods on downstream pathology tasks.

Fig. 1
figure 1

Examples of issues in the Camelyon-16 and Camelyon-17 datasets. (a) The WSI shows a therapeutic response characterized by tissue fibrosis. (b) The WSI exhibits a blurred histiocyte-cancer boundary (left) and poor staining quality (right). (c) The cancerous region is missed in the annotation. (d) The WSI shows a therapeutic response with tissue necrosis.

In this paper, we removed slides from the Camelyon dataset that were blurred, poorly stained, exhibited treatment-related artifacts, or were ambiguous in terms of positivity. Furthermore, we expanded the binary classification labels in Camelyon-1622 to a four-class system to facilitate merging the Camelyon-16 and Camelyon-1723 datasets. Finally, we corrected the pixel-level annotations in the Camelyon dataset and added pixel-level annotations to positive slides that previously lacked them. Using the corrected dataset, we re-evaluated 12 mainstream MIL methods, including ABMIL19, TransMIL24, and CLAM25, with two feature encoders pre-trained on natural images, ResNet-5026 and ViT-S12, as well as ten pathology-specific pre-trained feature encoders: PLIP3, CONCH7, UNI4, Gigapath5, CONCH-V1.521, TITAN21, Virchow14, Prism20, Ctranspath6, and Chief18.

Technical Validation

Dataset Overview

The official Camelyon-1622 dataset contains 399 WSIs, split into 270 for training and 129 for testing. The training set includes 111 tumor slides and 159 negative slides, while the test set includes 49 tumor slides and 80 negative slides. The official Camelyon-1723 dataset consists of 1,000 WSIs, evenly divided into 500 for training and 500 for testing. The training set consists of 318 negative slides, 59 micro-metastasis slides, 87 macro-metastasis slides, and 36 isolated tumor cells (ITC) slides. The test set labels are not publicly available. After data cleaning by professional pathologists, the Camelyon-16 dataset consists of 386 WSIs: 238 negative, 71 micro-metastasis, 69 macro-metastasis, and 8 ITC WSIs. The Camelyon-17 dataset consists of 964 WSIs: 633 negative, 103 micro-metastasis, 182 macro-metastasis, and 46 ITC WSIs. We combined the updated Camelyon-16 and Camelyon-17 datasets to form the Camelyon+ dataset. Figure 2 shows an overview of the dataset. It consists of 1,350 WSIs: 871 negative, 174 micro-metastasis, 251 macro-metastasis, and 54 ITC WSIs.

Fig. 2
figure 2

Data characteristics and metastasis categories in Camelyon datasets. (a) Distribution of WSIs across different metastasis categories (Negative, Micro, Macro, and ITC) in three datasets: Camelyon-16-Refine22, Camelyon-17-Refine23, and Camelyon+27. (b) Representative histopathological examples for each category: Negative, Micro, Macro, and ITC.

Exclusion Criteria

We excluded certain WSIs based on the following criteria: focal blurriness, poor staining quality, difficulty distinguishing positive foci, and the presence of treatment-related artifacts. Of the 49 slides we removed, 26 show therapeutic response, 3 have staining issues, 12 exhibit focal blurring, 4 are of poor quality, and 4 contain suspicious cancerous regions. The verification of WSI labels and the annotation work were performed in the ASAP pathology annotation software (https://computationalpathologygroup.github.io/ASAP) by a mid-level pathologist, in accordance with the 8th edition of the American Joint Committee on Cancer (AJCC) staging system, and consistency checks were conducted by a senior pathologist. The presence of treatment response may interfere with model construction. In pathology, tumor treatment response refers to the histological changes in tumors following treatments such as surgery, chemotherapy, radiotherapy, targeted therapy, or immunotherapy. Pathological analysis can assess histological indicators such as tumor cell necrosis, proliferation, and apoptosis, thereby evaluating treatment efficacy. Two typical treatment responses are tissue necrosis and fibrosis. Necrosis refers to areas of dead tissue formed after tumor cells die following treatment. Fibrosis refers to scar tissue formed as the damaged tumor tissue repairs itself. Necrotic and fibrotic areas can distort the feature representation of tumor regions in computational pathology, thereby degrading performance on downstream tasks.

Data Records

The Camelyon+27 dataset is available via ScienceDB (https://doi.org/10.57760/sciencedb.16442). The original WSI data can be downloaded from the official dataset repository (http://gigadb.org/dataset/100439) and have therefore not been re-uploaded to the database. The Camelyon+ dataset is structured into several directories, each serving a specific function in supporting downstream computational pathology tasks. The directory structure includes the following main components: slide-labels/, name-convert/, pixel-annotations/, feature-files/, and h5py-files/.

  • slide-labels/: This directory stores slide-level classification annotations in Excel format. We provide two XLSX files: camelyon+(2-classes).xlsx and camelyon+(4-classes).xlsx, which correspond to binary classification (negative vs. tumor) and four-class classification (negative, micro, macro, ITC), respectively. Each file contains two columns: slide (the WSI ID) and label (the assigned class). These labels are derived from corrected and unified versions of the Camelyon-16 and Camelyon-17 datasets, supporting various supervised learning scenarios.

  • name-convert/: To eliminate annotation bias, all original WSI file names from the Camelyon-16 training set that contain diagnostic hints such as “tumor” or “normal” have been renamed. The name-convert.xlsx file in this directory provides a mapping between the original and new file names through two columns: Origin Name and New Name. This enables accurate cross-referencing during label alignment or post-hoc analysis.

  • pixel-annotations/: For positive WSIs, pixel-level tumor region annotations are provided in XML format. Polygonal coordinates of positive regions are stored in the XML files and can be visualized on whole-slide images using ASAP. These files include detailed boundary information and can be used for tasks such as semantic segmentation or weakly supervised learning.

  • feature-files/: To facilitate fair and reproducible benchmarking across various visual encoders, this directory contains patch-level features extracted at 20 × magnification using a diverse set of backbone models, including ResNet-5026, VIT-S12, PLIP3, CONCH7, CONCH-V1.521, Ctranspath6, UNI4, GigaPath5, Virchow14, Chief18, Prism20, and Titan21. All features are stored in .pt format, which is natively compatible with the PyTorch library and supports efficient loading during training and inference.

  • h5py-files/: This optional directory offers an alternative representation of extracted features in .h5 format, enabling high-speed access and batch-wise loading for large-scale training workflows.
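As an illustration of how the released files can be consumed, the sketch below loads the slide-level labels and the pre-extracted patch features for one WSI. The per-encoder subdirectory layout, the example slide ID, and the HDF5 dataset key 'features' are assumptions for illustration and should be checked against the actual directory contents.

```python
import h5py
import pandas as pd
import torch

# Slide-level labels: columns 'slide' (WSI ID) and 'label' (assigned class).
labels = pd.read_excel("slide-labels/camelyon+(4-classes).xlsx")

# Patch-level features for one WSI saved as a PyTorch tensor of shape [num_patches, feat_dim].
# The encoder subdirectory and slide ID below are hypothetical placeholders.
features = torch.load("feature-files/UNI/patient_020_node_4.pt")
print(features.shape)

# Equivalent access through the optional HDF5 representation; the dataset key
# 'features' is an assumption and should be verified with h5py before use.
with h5py.File("h5py-files/UNI/patient_020_node_4.h5", "r") as f:
    feats_h5 = torch.from_numpy(f["features"][:])
```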

This modular organization of Camelyon+27 supports a broad spectrum of tasks, including classification, segmentation, and representation learning, and provides a standardized testbed for developing and evaluating pathology foundation models.
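For segmentation-oriented use, the pixel-level annotations can be parsed as in the minimal sketch below. The element and attribute names ('Annotation', 'Coordinate' with 'X'/'Y') follow the ASAP polygon annotation schema and should be verified against the released XML files; the example file name is hypothetical.

```python
import xml.etree.ElementTree as ET

def load_polygons(xml_path):
    """Return a list of polygons, each a list of (x, y) vertices, from one annotation file."""
    root = ET.parse(xml_path).getroot()
    polygons = []
    for annotation in root.iter("Annotation"):
        coords = [(float(c.get("X")), float(c.get("Y")))
                  for c in annotation.iter("Coordinate")]
        if coords:
            polygons.append(coords)
    return polygons

# Example (hypothetical file name):
# polygons = load_polygons("pixel-annotations/patient_020_node_4.xml")
```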

Methods

Methodology

The objective of our benchmark is to predict the metastasis type of a WSI using only slide-level labels for supervision. The commonly used approach is a deep learning strategy based on MIL, which has been recognized in recent studies for its strong capability to represent slide-level features28,29,30,31,32. MIL is a weakly supervised approach in which a single WSI is treated as a bag and each patch within the WSI is considered an instance. If any instance is cancerous, the entire WSI is labeled as cancerous, while a WSI is classified as normal only if all instances are normal.
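Formally, for a bag $X = \{x_1, \dots, x_N\}$ with unobserved instance labels $y_i \in \{0, 1\}$, only the bag label

$$Y = \begin{cases} 0, & \text{if } \sum_{i=1}^{N} y_i = 0,\\ 1, & \text{otherwise,} \end{cases}$$

is observed during training.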

With the advancement of deep neural networks, embedding-based MIL has become the dominant approach for WSI analysis. In embedding-based MIL, a pre-trained feature extractor first extracts features from the WSI, followed by an aggregator that pools the features for downstream classification tasks. Mean-MIL and Max-MIL aggregate features using mean pooling and max pooling, respectively, though the pooling mechanism inevitably results in information loss. ABMIL19 introduces the attention mechanism into MIL, dynamically assigning weights to each instance based on attention scores. CLAM25 further enhances this by incorporating instance-level clustering mechanisms to introduce domain knowledge, providing additional supervision alongside attention-based weight assignments. TransMIL24 leverages self-attention within the MIL aggregator to capture relationships between different instances, thereby improving global modeling capabilities. AMD-MIL33 introduces an agent mechanism into the MIL aggregator and employs threshold filtering for feature selection, improving MIL performance. DSMIL34 models instance relationships directly using a dual-stream architecture and a trainable distance measurement module. DTFD35 addresses the issue of limited WSIs by creating pseudo-bags. WiKG36 treats WSIs as knowledge graphs, dynamically constructing neighboring nodes and directed edges based on relationships between instances, and then updates the head node using knowledge-aware attention. FR-MIL37 introduces a distribution re-calibration approach that adjusts the feature distribution of a WSI bag (instances) based on the statistics of the max-instance (key) feature.
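To make the embedding-based MIL pipeline concrete, the sketch below implements a gated attention aggregator in the spirit of ABMIL19, operating on a bag of pre-extracted patch features. The feature dimension, hidden size, and number of classes are illustrative and do not reproduce any specific published configuration.

```python
import torch
import torch.nn as nn

class GatedAttentionMIL(nn.Module):
    """Minimal gated-attention MIL aggregator (ABMIL-style sketch)."""
    def __init__(self, feat_dim=1024, hidden_dim=256, num_classes=4):
        super().__init__()
        self.attn_v = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Tanh())
        self.attn_u = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Sigmoid())
        self.attn_w = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):                                    # x: [num_patches, feat_dim]
        a = self.attn_w(self.attn_v(x) * self.attn_u(x))     # [num_patches, 1]
        a = torch.softmax(a, dim=0)                          # attention over instances
        slide_embedding = (a * x).sum(dim=0)                 # [feat_dim]
        return self.classifier(slide_embedding), a

# Example: classify one bag of pre-extracted patch features (random placeholder data).
bag = torch.randn(5000, 1024)
logits, attn = GatedAttentionMIL()(bag)
```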

Data Preprocessing

For all datasets, we cropped non-overlapping 256 × 256 patches at 20 × magnification. We then used twelve feature extractors, ResNet-5026, ViT-S12, PLIP3, CONCH7, UNI4, Gigapath5, CONCH-V1.521, TITAN21, Virchow14, Prism20, Ctranspath6, and Chief18, to extract features from the WSIs. Subsequently, we conducted two sets of experiments. The first set is a comparative experiment on the Camelyon-1723 dataset before and after label correction. The Camelyon-17-Origin dataset follows the official split, with 500 WSIs for training and 500 WSIs for testing. The Camelyon-17-Refine dataset also maintains the official split but excludes slides that meet the exclusion criteria. The Camelyon-17-Refine training set contains 492 slides, while the test set includes 472 slides. This comparative experiment evaluates the impact of dataset quality on MIL models. Since the original version of Camelyon-1622 does not have four-class labels, we did not perform similar experiments on it. The second set comprises the benchmark experiments on Camelyon+27. We evaluated the models using five-fold cross-validation, with each fold employing stratified sampling to maintain a fixed proportion of each class. Because each patient in the Camelyon-17 dataset contributes multiple slides, we prevented data leakage by ensuring that slides from the same patient with the same label never appear in both the training and validation sets.
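A minimal sketch of a stratified, patient-aware five-fold split is shown below using scikit-learn's StratifiedGroupKFold. Grouping strictly by patient is slightly stricter than the per-label constraint described above, and the 'patient_XXX_node_Y' naming pattern is an assumption about how the Camelyon-17 slide IDs encode patient identity.

```python
import pandas as pd
from sklearn.model_selection import StratifiedGroupKFold

labels = pd.read_excel("slide-labels/camelyon+(4-classes).xlsx")   # columns: slide, label

# Derive a patient ID from the slide name; slides without a patient prefix
# (e.g., Camelyon-16 slides) fall back to their own ID and form singleton groups.
labels["patient"] = labels["slide"].str.extract(r"(patient_\d+)", expand=False)
labels["patient"] = labels["patient"].fillna(labels["slide"])

sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=2024)
for fold, (train_idx, val_idx) in enumerate(
        sgkf.split(labels["slide"], labels["label"], groups=labels["patient"])):
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val slides")
```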

Camelyon-17 Comparative Experiment

In the comparative experiments before and after correction on the Camelyon-1723 dataset, we primarily evaluated three pathology pre-trained feature extractors: PLIP3, UNI4, and Gigapath5. The learning rate was set to 2e-4, using the Adam optimizer with a weight decay of 1e-5. We repeated the experiments with random seeds of 2023, 2024, and 2025, and reported the mean and standard deviation of the evaluation metrics as shown in Table 1 and Table 2. All experiments were conducted on a workstation equipped with 4 NVIDIA RTX 3090 GPUs. Due to the significant class imbalance in the Camelyon-17 four-class dataset, our analysis concentrated on two key evaluation metrics: AUC and F1-score. Figure 3 presents a visualization of these metrics for a single MIL model across different feature extractors, using bar charts for both the Camelyon-17-Origin and Camelyon-17-Refine datasets. This visualization effectively illustrates how these metrics vary as the dataset undergoes refinement. Our results indicate that both AUC and F1-score exhibited notable changes following the dataset’s adjustment. Figure 4 further visualizes the F1-score, AUC, and their combined values, highlighting the top three models to assess the impact of dataset refinement on the performance ranking of the models. While the overall model rankings shifted to some extent after the dataset refinement, the CLAM-MB25 model consistently maintained its top-ranked position, indicating its robustness. In summary, dataset refinement enhanced the accuracy of model evaluation metrics and improved the fairness of model rankings, establishing a more solid foundation for future research.
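For reproducibility, the training configuration described above amounts to the following sketch; the model is a stand-in placeholder and the training loop itself is omitted.

```python
import random
import numpy as np
import torch

def set_seed(seed):
    """Fix the random number generators that affect a single training run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

for seed in (2023, 2024, 2025):
    set_seed(seed)
    model = torch.nn.Linear(1024, 4)   # placeholder for the MIL aggregator under test
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=1e-5)
    # ... train on the training split, evaluate on the test split, and record AUC/F1
# Report the mean and standard deviation over the three seeds, as in Tables 1 and 2.
```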

Table 1 Performance metrics of different methods on the Camelyon-17-Origin dataset.
Table 2 Performance metrics of different methods on the Camelyon-17-Refine dataset.
Fig. 3
figure 3

Performance comparison between Camelyon-17-Origin23 and Camelyon-17-Refine across different MIL models and feature encoders. (a–c) AUC comparison using three feature encoders: (a) PLIP3, (b) UNI4, and (c) Gigapath5. (d–f) F1-score comparison using three feature encoders: (d) PLIP, (e) UNI, and (f) Gigapath.

Fig. 4
figure 4

Radar chart analysis of MIL models across different encoders and dataset versions. Each subplot shows the performance of multiple MIL models on the Camelyon-17-Origin23 and Camelyon-17-Refine datasets using three feature encoders: (a,e) PLIP3, (b,f) UNI4, and (c,g) Gigapath5. The outermost yellow line represents the average of AUC and F1-score, green represents AUC, and blue represents F1-score. (d,h) summarize the total ranking across all encoders for each dataset. Top-3 ranked models are highlighted with dashed boxes.

Camelyon+ Benchmark Experiment

In the benchmark experiment on Camelyon+27, we maintained the same hyperparameter settings as in the comparative experiments on Camelyon-1723. On the merged Camelyon+ dataset, we evaluated the MIL approaches using feature extractors from two pre-trained natural image models, ResNet-5026 and ViT-S12, and ten pre-trained pathology image models: Ctranspath6, PLIP3, CONCH7, CONCH-V1.521, UNI4, Gigapath5, Virchow14, Chief18, Prism20, and Titan21. We classify these feature extractors into four main categories: ResNet-50 and ViT-S fall under the domain of natural image pre-training; PLIP, CONCH, and CONCH-V1.5 fall under the domain of image-text contrastive learning pre-training; Ctranspath, UNI, GigaPath, and Virchow belong to the category of pathology-specific visual pre-training; and Chief, Prism, and Titan are slide-level encoders that represent an entire slide as a single embedding. We report the mean and standard deviation of model performance in Tables 3-12, which can serve as baselines and references for future work based on the Camelyon+ dataset. As shown in Fig. 5, we present a heatmap of the distribution of AUC and F1-score across different MIL models under various feature extractors. It can be observed that pathology-pretrained feature extractors significantly enhance the performance of MIL. Notably, the CONCH model, which uses a ViT-Base architecture with image-text contrastive learning, achieves performance comparable to the UNI and Gigapath models, which utilize ViT-Large and ViT-Giant architectures, respectively. Moreover, both UNI and Gigapath leverage larger training datasets. This suggests that image-text contrastive pretraining may hold greater potential than pure visual pretraining in the pathology domain. While the PLIP model is also pretrained using image-text contrastive learning, its performance does not match that of CONCH, likely due to its smaller dataset and the lower quality of data sourced from Twitter. Slide-level encoders have recently emerged as a popular direction in computational pathology pretraining, aiming to produce holistic representations of entire WSIs. However, due to the extremely large resolution of WSIs, the current generation of slide-level encoders still exhibits limited representational capacity. As shown in Table 12, we report the linear probing performance of three representative slide encoders (Chief, Prism, and Titan) and compare them against the patch-level encoders (Ctranspath, Virchow, and CONCH-V1.5) used to precompute features for three state-of-the-art MIL frameworks. Among the slide-level encoders, Chief achieved the highest performance, yet it still fell significantly short of MIL models trained on patch-level features extracted by Ctranspath. This highlights the current limitations of slide-level encoders in capturing the complex pathological patterns required for downstream clinical tasks.
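As a point of reference, the linear probing protocol behind Table 12 can be sketched as fitting a simple linear classifier on frozen slide-level embeddings. The file names, tensor shapes, and classifier settings below are hypothetical assumptions; the released feature-files/ directory provides the actual embeddings.

```python
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Frozen slide-level embeddings produced by a slide encoder (e.g., Chief, Prism, Titan).
# File names and shapes are illustrative placeholders.
train_emb = torch.load("slide-embeddings/chief_train.pt")    # [n_train, dim]
train_y   = torch.load("slide-embeddings/train_labels.pt")   # [n_train]
test_emb  = torch.load("slide-embeddings/chief_test.pt")
test_y    = torch.load("slide-embeddings/test_labels.pt")

clf = LogisticRegression(max_iter=5000)                       # the linear probe
clf.fit(train_emb.numpy(), train_y.numpy())
pred = clf.predict(test_emb.numpy())
print("macro F1:", f1_score(test_y.numpy(), pred, average="macro"))
```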

Table 3 Results on the Camelyon+ dataset with ResNet-50-extracted features.
Table 4 Results on the Camelyon+ dataset with VIT-S-extracted features.
Table 5 Results on the Camelyon+ dataset with PLIP-extracted features.
Table 6 Results on the Camelyon+ dataset with CONCH-extracted features.
Table 7 Results on the Camelyon+ dataset with CONCH-V1.5-extracted features.
Table 8 Results on the Camelyon+ dataset with Ctranspath-extracted features.
Table 9 Results on the Camelyon+ dataset with UNI-extracted features.
Table 10 Results on the Camelyon+ dataset with Gigapath-extracted features.
Table 11 Results on the Camelyon+ dataset with Virchow-extracted features.
Fig. 5
figure 5

Benchmark comparison of MIL models with various feature encoders. (a) Mean AUC scores across 12 MIL models and 9 feature encoders, including ResNet-5026, ViT-S12, Ctranspath6, PLIP3, CONCH7, CONCH-V1.521, UNI4, Gigapath5, and Virchow14. (b) Mean F1-score under the same evaluation setup. Each cell represents the averaged performance across datasets. Warmer colors denote higher values. CLAM-MB25, TransMIL24, and AMD-MIL33 exhibit consistently strong performance across multiple encoders, while newer foundation models such as UNI, Gigapath, and Virchow lead to higher AUC and F1-scores than conventional encoders.

Table 12 Performance comparison between the slide encoder and its corresponding patch encoder.

Figure 6 further presents few-shot learning results using slide encoders under various N-way, K-shot settings, ranging from 1 to 32 shots. The observed performance consistently improves with increasing shot number, suggesting that expanding annotated datasets for tasks such as lymph node metastasis detection could meaningfully enhance the clinical utility of slide-level foundation models.
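In sketch form, each N-way, K-shot episode samples K labeled support slides per class, fits a simple classifier on their frozen slide-encoder embeddings, and evaluates on the remaining slides of those classes. The classifier choice and episode count below are illustrative assumptions rather than the exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def few_shot_episode(embeddings, labels, classes, k_shot, rng):
    """Run one N-way, K-shot episode on frozen slide embeddings (illustrative sketch)."""
    support_idx, query_idx = [], []
    for c in classes:
        idx = rng.permutation(np.where(labels == c)[0])
        support_idx.extend(idx[:k_shot])
        query_idx.extend(idx[k_shot:])
    clf = LogisticRegression(max_iter=2000)
    clf.fit(embeddings[support_idx], labels[support_idx])
    return clf.score(embeddings[query_idx], labels[query_idx])

rng = np.random.default_rng(2024)
# embeddings: [n_slides, dim] array of slide-encoder outputs; labels: [n_slides] class indices.
# acc = np.mean([few_shot_episode(embeddings, labels, classes=[0, 1, 2, 3],
#                                 k_shot=8, rng=rng) for _ in range(20)])
```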

Fig. 6
figure 6

Few-shot classification performance on the Camelyon+27 benchmark. Boxplots show the performance of three models (Titan21, Prism20, Chief18) under 2-way, 3-way, and 4-way classification settings, with varying numbers of shots (1, 2, 4, 8, 16, 32). (a–c) Accuracy, (d–f) Recall, and (g–i) F1-score across increasing few-shot levels. Each box represents the model's performance distribution over multiple test episodes. Titan and Chief generally outperform Prism in low-shot regimes, and the performance gap narrows with increasing shot numbers.

In the Camelyon+27 dataset, noisy samples were initially removed to construct a clean benchmark. To assess the impact of noisy data, we conducted comparative experiments by reintroducing the noisy subset into either the training set or the test set of Camelyon+. As shown in Fig. 7, we evaluated the performance under 9 patch-level encoders and 3 state-of-the-art MIL methods. We observed that, in most cases, adding noisy data to the training set while keeping the test set clean resulted in performance comparable to the original benchmark. In contrast, when the test set was augmented with noisy data while keeping the training set unchanged, the performance on the test set dropped significantly compared to the benchmark. These results suggest that while incorporating noisy data during training may improve robustness without harming performance, the presence of noise in the evaluation set substantially undermines the reliability of performance estimation. This highlights the importance of clean and well-curated test data when benchmarking and deploying pathology AI systems.

Fig. 7
figure 7

Impact of noisy data on model performance across different feature encoders. Bar plots show the AUC and F1-score of three MIL models, (a) CLAM-MB25, (b) AMD-MIL33, and (c) FR-MIL37, when noisy data is added either to the training set or to the test set. Each group of bars compares the performance across multiple feature encoders, including ResNet-5026, ViT-S12, PLIP3, CONCH7, CONCH-V1.521, Ctranspath6, Gigapath5, UNI4, and Virchow14. Models exhibit more pronounced performance degradation when noise is added to the test set, while training set noise leads to more varied effects depending on the encoder.

In the benchmark results, while the models demonstrate relatively strong performance in terms of accuracy and AUC, the F1-score, recall, and precision are notably low. As illustrated in Fig. 8, we visualized the confusion matrices for the CLAM-MB25 and FR-MIL37 models. The results show that the models perform relatively well in classifying the negative, micro, and macro categories, but perform poorly on the ITC category. We used macro-averaging to calculate the F1-score, recall, and precision, and the models' poor performance on the ITC category significantly lowers these overall metrics. To investigate this issue further, we analyzed why the models struggle to identify ITC cases. One major factor is the severe class imbalance in the Camelyon+27 dataset. As shown in Fig. 2, the head class, negative, contains 871 slides, while the tail class, ITC, contains only 54 slides, resulting in an imbalance ratio of approximately 16:1. This imbalance gives the dataset a moderately long-tailed distribution. Such imbalance highlights a key challenge in pathology image analysis: how to achieve balanced model performance on long-tailed datasets like Camelyon+, particularly since real-world pathological data naturally follow a long-tailed distribution. Furthermore, we identified a fundamental difference between the four-class classification task in Camelyon+ and typical cancer subtyping tasks. The ITC, micro, and macro categories in Camelyon+ are primarily distinguished by the size of metastatic regions, whereas MIL is generally better suited to binary classification tasks, such as detecting the presence or absence of cancer. This explains why models achieve high performance on binary tasks like those in Camelyon-1622 or Camelyon-1723. Consequently, Camelyon+ raises an important question: is MIL the most suitable paradigm for clinical classification tasks in which categories are defined by the size of metastatic regions rather than by distinct cancer subtypes?
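Because macro-averaging weights each class equally, the small ITC class counts as much as the much larger negative class. The snippet below illustrates this effect; the class sizes are taken from Fig. 2, but the per-class F1 values are purely hypothetical.

```python
import numpy as np

# Hypothetical per-class F1-scores for: negative, micro, macro, ITC.
per_class_f1 = np.array([0.95, 0.80, 0.85, 0.20])
support = np.array([871, 174, 251, 54])                   # Camelyon+ class sizes

macro_f1 = per_class_f1.mean()                            # each class counts equally -> 0.70
weighted_f1 = np.average(per_class_f1, weights=support)   # dominated by head classes -> ~0.88

print(f"macro F1 = {macro_f1:.2f}, support-weighted F1 = {weighted_f1:.2f}")
```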

Fig. 8
figure 8

Confusion matrices of MIL models on the four-class metastasis classification task. Each subfigure presents the confusion matrix of a model-encoder pair on the Camelyon+27 benchmark. Rows indicate the ground truth labels and columns indicate the predicted labels. (a–d) show results for AMD-MIL33 combined with PLIP3, CONCH7, UNI4, and Gigapath5 encoders, respectively. (e–h) show results for FR-MIL37, and (i–l) for CLAM-MB25 under the same set of encoders.

Evaluation metrics

In the comparative experiments on the Camelyon-1723 dataset and in the benchmark evaluations, we used accuracy, AUC, F1-score, recall, precision, and Cohen’s kappa coefficient to assess classification performance. The kappa coefficient is a statistical metric used to evaluate the level of agreement between predicted and true labels. It is particularly useful for assessing the performance of classification models in multi-class settings, as it accounts for agreement that may occur by chance.
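For reference, all of these metrics can be computed with scikit-learn. The snippet below is a minimal sketch for the four-class setting with toy predictions; the macro averaging shown here matches how we report F1-score, recall, and precision, while the one-vs-rest macro setting used for the multi-class AUC is an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score,
                             recall_score, precision_score, cohen_kappa_score)

# y_true: ground-truth class indices; y_prob: predicted class probabilities [n, 4].
y_true = np.array([0, 1, 2, 3, 0, 2])
y_prob = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.2, 0.6, 0.1, 0.1],
                   [0.1, 0.1, 0.7, 0.1],
                   [0.1, 0.2, 0.2, 0.5],
                   [0.8, 0.1, 0.05, 0.05],
                   [0.2, 0.2, 0.5, 0.1]])
y_pred = y_prob.argmax(axis=1)

print("accuracy :", accuracy_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"))
print("F1       :", f1_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("kappa    :", cohen_kappa_score(y_true, y_pred))
```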

Usage Notes

The Camelyon+27 dataset is publicly available under the Creative Commons Zero (CC0) license. However, please note that this dataset is not intended for developing diagnosis-focused algorithms or models, and should not be used as the sole basis for clinical evaluations in classification tasks.