Background & Summary

The efficient utilization of digital pathology and computational resources has driven the rapid rise of AI-based computational pathology1,2. In recent years, general foundation models for pathology, pre-trained on large-scale data, have garnered significant attention3,4,5,6,7. These models have demonstrated strong feature extraction capabilities for pathological images, as evidenced by evaluations across a series of whole-slide image-level downstream tasks8,9,10. For example, CTranspath6 uses a Semantically-Relevant Contrastive Learning (SRCL) framework to pre-train a CNN-Transformer hybrid feature extractor on 150 million patches, with its effectiveness validated across five downstream tasks. UNI4 employed the self-supervised DINO-v211 method to train a robust general pathology visual encoder on one billion patches from approximately 100,000 whole slide images (WSIs). Gigapath5 utilized 1.3 billion patches to pre-train a visual encoder based on the ViT-Giant12 architecture and adopted LongNet13 to scale it into a slide-level foundation model for slide-level representation learning. Virchow14 is a pathology foundation model based on the ViT-Huge architecture, trained using the DINOv2 approach on a dataset constructed from 1,488,550 whole slide images (WSIs), enabling clinical-grade diagnosis and rare disease identification. Pathorchestra15 trained a ViT-Large encoder on 300,000 WSIs and conducted extensive evaluation across 112 downstream tasks, achieving over 95% accuracy on 47 of them. These pathology-pre-trained models have demonstrated superior performance in downstream tasks including tumor classification, survival analysis, and lesion segmentation. PLIP3 is a multimodal pathology foundation model developed with contrastive learning16 and pre-trained on approximately 200,000 pathology image-text pairs collected from medical Twitter, enabling both image and text comprehension. CONCH7 employs CoCa17 for self-supervised pre-training on 1.17 million image-caption pairs and has been extensively evaluated across 14 downstream benchmarks, demonstrating its outstanding performance. In addition to patch-level encoders, some studies have focused on developing pretrained slide-level encoders, which are built upon patch encoders. For example, CHIEF18 constructs a slide encoder with an ABMIL19 architecture through vision-language joint training based on CTranspath6, using 60,530 WSIs. Prism20 is a Transformer-based slide encoder trained on 587,196 WSIs, built upon patch embeddings from Virchow. Titan21 is a slide-level encoder trained via slide-level vision-language contrastive learning, based on CONCH-V1.521, an upgraded version of the CONCH model. Slide-level encoders eliminate the need to retrain aggregators by directly generating WSI-level representations through inference, enabling downstream slide-level tasks such as classification, survival analysis, and report generation.

Acquiring finely annotated large-scale pathology image datasets remains challenging due to the extremely high resolution of pathology images and the specialized expertise required for annotations. Nonetheless, the continued development of foundational models and downstream tasks in computational pathology makes high-quality pathology image datasets increasingly essential.

The Camelyon series22,23 (http://gigadb.org/dataset/100439), a publicly available pathology dataset focused on detecting breast cancer lymph node metastasis, is widely used for evaluating multiple instance learning (MIL) methods. However, as shown in Fig. 1, some images in the Camelyon series are of poor quality, exhibit treatment-related artifacts, or carry erroneous slide-level labels. The Camelyon-1622 dataset includes only tumor and negative labels, making its labels incompatible with those of Camelyon-1723. Many pixel-level annotations are inaccurate, and some slides lack pixel-level annotations entirely. These issues hinder the accurate evaluation of deep learning methods on downstream pathology tasks.

Fig. 1
figure 1

Examples of issues in the Camelyon-16 and Camelyon-17 datasets. (a) The WSI shows a therapeutic response characterized by tissue fibrosis. (b) The WSI exhibits a blurred histiocyte-cancer boundary (left) and poor staining quality (right). (c) The cancerous region is missed in the annotation. (d) The WSI shows a therapeutic response with tissue necrosis.

In this paper, we removed slides from the Camelyon dataset that were blurred, poorly stained, exhibited treatment-related artifacts, or were ambiguous in terms of positivity. Furthermore, we expanded the binary classification labels in Camelyon-1622 to a four-class system to facilitate merging the Camelyon-16 and Camelyon-1723 datasets. Finally, we corrected the pixel-level annotations in the Camelyon dataset and added pixel-level annotations to positive slides that previously lacked them. Using the corrected dataset, we re-evaluated 12 mainstream MIL methods, including ABMIL19, TransMIL24, and CLAM25, with two feature encoders pre-trained on natural images, ResNet-5026 and ViT-S12, as well as ten pathology-specific pre-trained feature encoders: PLIP3, CONCH7, UNI4, Gigapath5, CONCH-V1.521, TITAN21, Virchow14, Prism20, Ctranspath6, and Chief18.

Technical Validation

Dataset Overview

The official Camelyon-1622 dataset contains 399 WSIs, split into 270 for training and 129 for testing. The training set includes 111 tumor slides and 159 negative slides, while the test set includes 49 tumor slides and 80 negative slides. The official Camelyon-1723 dataset consists of 1,000 WSIs, evenly divided into 500 for training and 500 for testing. The training set consists of 318 negative slides, 59 micro-metastasis slides, 87 macro-metastasis slides, and 36 isolated tumor cells (ITC) slides. The test set labels are not publicly available. After data cleaning by professional pathologists, the Camelyon-16 dataset consists of 386 WSIs: 238 negative, 71 micro-metastasis, 69 macro-metastasis, and 8 ITC WSIs. The Camelyon-17 dataset consists of 964 WSIs: 633 negative, 103 micro-metastasis, 182 macro-metastasis, and 46 ITC WSIs. We combined the updated Camelyon-16 and Camelyon-17 datasets to form the Camelyon+ dataset. Figure 2 shows an overview of the dataset. It consists of 1,350 WSIs: 871 negative, 174 micro-metastasis, 251 macro-metastasis, and 54 ITC WSIs.

Fig. 2
figure 2

Data characteristics and metastasis categories in Camelyon datasets. (a) Distribution of WSIs across different metastasis categories (Negative, Micro, Macro, and ITC) in three datasets: Camelyon-16-Refine22, Camelyon-17-Refine23, and Camelyon+27. (b) Representative histopathological examples for each category: Negative, Micro, Macro, and ITC.

Exclusion Criteria

We excluded certain WSIs based on the following criteria: focal blurriness, poor staining quality, difficulty distinguishing positive foci, and the presence of treatment-related artifacts. Of the 49 slides we removed, 26 show therapeutic response, 3 have staining issues, 12 exhibit focal blurring, 4 are of poor quality, and 4 contain suspicious cancerous regions. The verification of WSI labels and the annotation work were performed in the ASAP pathology annotation software (https://computationalpathologygroup.github.io/ASAP) by a mid-level pathologist, in accordance with the 8th edition of the American Joint Committee on Cancer (AJCC) staging system, and consistency checks were conducted by a senior pathologist. The presence of treatment response may interfere with model construction. In pathology, tumor treatment response refers to the histological changes in tumors following treatments such as surgery, chemotherapy, radiotherapy, targeted therapy, or immunotherapy. Pathological analysis can assess histological indicators such as tumor cell necrosis, proliferation, and apoptosis, thereby evaluating treatment efficacy. Two typical treatment responses are tissue necrosis and fibrosis. Necrosis refers to areas of dead tissue formed after tumor cells die following treatment. Fibrosis refers to scar tissue formed as the damaged tumor tissue repairs itself. Necrotic and fibrotic areas can distort the feature representation of tumor regions in computational pathology, thereby degrading performance on downstream tasks.

Data Records

The Camelyon+27 dataset is available via ScienceDB (https://doi.org/10.57760/sciencedb.16442). The original WSI data can be downloaded from the official dataset repository (http://gigadb.org/dataset/100439) and have therefore not been re-uploaded to the database. The Camelyon+ dataset is structured into several directories, each serving a specific function in supporting downstream computational pathology tasks. The directory structure includes the following main components: slide-labels/, name-convert/, pixel-annotations/, feature-files/, and h5py-files/.

  • slide-labels/: This directory stores slide-level classification annotations in Excel format. We provide two XLSX files: camelyon+(2-classes).xlsx and camelyon+(4-classes).xlsx, which correspond to binary classification (negative vs. tumor) and four-class classification (negative, micro, macro, ITC), respectively. Each file contains two columns: slide (the WSI ID) and label (the assigned class). These labels are derived from corrected and unified versions of the Camelyon-16 and Camelyon-17 datasets, supporting various supervised learning scenarios.

  • name-convert/: To eliminate annotation bias, all original WSI file names from the Camelyon-16 training set that contain diagnostic hints such as “tumor” or “normal” have been renamed. The name-convert.xlsx file in this directory provides a mapping between the original and new file names through two columns: Origin Name and New Name. This enables accurate cross-referencing during label alignment or post-hoc analysis.

  • pixel-annotations/: For positive WSIs, pixel-level tumor region annotations are provided in XML format. Polygonal coordinates of positive regions are stored in the XML files and can be visualized on whole-slide images using ASAP. These files include detailed boundary information and can be used for tasks such as semantic segmentation or weakly supervised learning.

  • feature-files/: To facilitate fair and reproducible benchmarking across various visual encoders, this directory contains patch-level features extracted at 20 × magnification using a diverse set of backbone models, including ResNet-5026, VIT-S12, PLIP3, CONCH7, CONCH-V1.521, Ctranspath6, UNI4, GigaPath5, Virchow14, Chief18, Prism20, and Titan21. All features are stored in .pt format, which is natively compatible with the PyTorch library and supports efficient loading during training and inference.

  • h5py-files/: This optional directory offers an alternative representation of extracted features in .h5 format, enabling high-speed access and batch-wise loading for large-scale training workflows.
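As an illustration of how the released files can be consumed, the sketch below loads the slide-level labels and the pre-extracted patch features for one WSI. The per-encoder subdirectory layout, the example slide ID, and the HDF5 dataset key 'features' are assumptions for illustration and should be checked against the actual directory contents.

```python
import h5py
import pandas as pd
import torch

# Slide-level labels: columns 'slide' (WSI ID) and 'label' (assigned class).
labels = pd.read_excel("slide-labels/camelyon+(4-classes).xlsx")

# Patch-level features for one WSI saved as a PyTorch tensor of shape [num_patches, feat_dim].
# The encoder subdirectory and slide ID below are hypothetical placeholders.
features = torch.load("feature-files/UNI/patient_020_node_4.pt")
print(features.shape)

# Equivalent access through the optional HDF5 representation; the dataset key
# 'features' is an assumption and should be verified with h5py before use.
with h5py.File("h5py-files/UNI/patient_020_node_4.h5", "r") as f:
    feats_h5 = torch.from_numpy(f["features"][:])
```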

This modular organization of Camelyon+27 supports a broad spectrum of tasks, including classification, segmentation, and representation learning, and provides a standardized testbed for developing and evaluating pathology foundation models.
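For segmentation-oriented use, the pixel-level annotations can be parsed as in the minimal sketch below. The element and attribute names ('Annotation', 'Coordinate' with 'X'/'Y') follow the ASAP polygon annotation schema and should be verified against the released XML files; the example file name is hypothetical.

```python
import xml.etree.ElementTree as ET

def load_polygons(xml_path):
    """Return a list of polygons, each a list of (x, y) vertices, from one annotation file."""
    root = ET.parse(xml_path).getroot()
    polygons = []
    for annotation in root.iter("Annotation"):
        coords = [(float(c.get("X")), float(c.get("Y")))
                  for c in annotation.iter("Coordinate")]
        if coords:
            polygons.append(coords)
    return polygons

# Example (hypothetical file name):
# polygons = load_polygons("pixel-annotations/patient_020_node_4.xml")
```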

Methods

Methodology

The objective of our benchmark is to predict the metastasis type of a WSI using only slide-level labels for supervision. The commonly used approach is a deep learning strategy based on MIL, which has been recognized in recent studies for its strong capability to represent slide-level features28,29,30,31,32. MIL is a weakly supervised approach in which a single WSI is treated as a bag and each patch within the WSI is considered an instance. If any instance is cancerous, the entire WSI is labeled as cancerous, while a WSI is classified as normal only if all instances are normal.
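Formally, for a bag $X = \{x_1, \dots, x_N\}$ with unobserved instance labels $y_i \in \{0, 1\}$, only the bag label

$$Y = \begin{cases} 0, & \text{if } \sum_{i=1}^{N} y_i = 0,\\ 1, & \text{otherwise,} \end{cases}$$

is observed during training.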

With the advancement of deep neural networks, embedding-based MIL has become the dominant approach for WSI analysis. In embedding-based MIL, a pre-trained feature extractor first extracts features from the WSI, followed by an aggregator that pools the features for downstream classification tasks. Mean-MIL and Max-MIL aggregate features using mean pooling and max pooling, respectively, though the pooling mechanism inevitably results in information loss. ABMIL19 introduces the attention mechanism into MIL, dynamically assigning weights to each instance based on attention scores. CLAM25 further enhances this by incorporating instance-level clustering mechanisms to introduce domain knowledge, providing additional supervision alongside attention-based weight assignments. TransMIL24 leverages self-attention within the MIL aggregator to capture relationships between different instances, thereby improving global modeling capabilities. AMD-MIL33 introduces an agent mechanism into the MIL aggregator and employs threshold filtering for feature selection, improving MIL performance. DSMIL34 models instance relationships directly using a dual-stream architecture and a trainable distance measurement module. DTFD35 addresses the issue of limited WSIs by creating pseudo-bags. WiKG36 treats WSIs as knowledge graphs, dynamically constructing neighboring nodes and directed edges based on relationships between instances, and then updates the head node using knowledge-aware attention. FR-MIL37 introduces a distribution re-calibration approach that adjusts the feature distribution of a WSI bag (instances) based on the statistics of the max-instance (key) feature.
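To make the embedding-based MIL pipeline concrete, the sketch below implements a gated attention aggregator in the spirit of ABMIL19, operating on a bag of pre-extracted patch features. The feature dimension, hidden size, and number of classes are illustrative and do not reproduce any specific published configuration.

```python
import torch
import torch.nn as nn

class GatedAttentionMIL(nn.Module):
    """Minimal gated-attention MIL aggregator (ABMIL-style sketch)."""
    def __init__(self, feat_dim=1024, hidden_dim=256, num_classes=4):
        super().__init__()
        self.attn_v = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Tanh())
        self.attn_u = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Sigmoid())
        self.attn_w = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):                                    # x: [num_patches, feat_dim]
        a = self.attn_w(self.attn_v(x) * self.attn_u(x))     # [num_patches, 1]
        a = torch.softmax(a, dim=0)                          # attention over instances
        slide_embedding = (a * x).sum(dim=0)                 # [feat_dim]
        return self.classifier(slide_embedding), a

# Example: classify one bag of pre-extracted patch features (random placeholder data).
bag = torch.randn(5000, 1024)
logits, attn = GatedAttentionMIL()(bag)
```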

Data Preprocessing

For all datasets, we cropped non-overlapping 256 × 256 patches at 20 × magnification. We then used twelve feature extractors, ResNet-5026, ViT-S12, PLIP3, CONCH7, UNI4, Gigapath5, CONCH-V1.521, TITAN21, Virchow14, Prism20, Ctranspath6, and Chief18, to extract features from the WSIs. Subsequently, we conducted two sets of experiments. The first set is a comparative experiment on the Camelyon-1723 dataset before and after label correction. The Camelyon-17-Origin dataset follows the official split, with 500 WSIs for training and 500 WSIs for testing. The Camelyon-17-Refine dataset also maintains the official split but excludes slides that meet the exclusion criteria. The Camelyon-17-Refine training set contains 492 slides, while the test set includes 472 slides. This comparative experiment evaluates the impact of dataset quality on MIL models. Since the original version of Camelyon-1622 does not have four-class labels, we did not perform similar experiments on it. The second set comprises the benchmark experiments on Camelyon+27. We evaluated the models using five-fold cross-validation, with each fold employing stratified sampling to maintain a fixed proportion of each class. Because each patient in the Camelyon-17 dataset contributes multiple slides, we prevented data leakage by ensuring that slides from the same patient with the same label never appear in both the training and validation sets.
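A minimal sketch of a stratified, patient-aware five-fold split is shown below using scikit-learn's StratifiedGroupKFold. Grouping strictly by patient is slightly stricter than the per-label constraint described above, and the 'patient_XXX_node_Y' naming pattern is an assumption about how the Camelyon-17 slide IDs encode patient identity.

```python
import pandas as pd
from sklearn.model_selection import StratifiedGroupKFold

labels = pd.read_excel("slide-labels/camelyon+(4-classes).xlsx")   # columns: slide, label

# Derive a patient ID from the slide name; slides without a patient prefix
# (e.g., Camelyon-16 slides) fall back to their own ID and form singleton groups.
labels["patient"] = labels["slide"].str.extract(r"(patient_\d+)", expand=False)
labels["patient"] = labels["patient"].fillna(labels["slide"])

sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=2024)
for fold, (train_idx, val_idx) in enumerate(
        sgkf.split(labels["slide"], labels["label"], groups=labels["patient"])):
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val slides")
```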

Camelyon-17 Comparative Experiment

In the comparative experiments before and after correction on the Camelyon-1723 dataset, we primarily evaluated three pathology pre-trained feature extractors: PLIP3, UNI4, and Gigapath5. The learning rate was set to 2e-4, using the Adam optimizer with a weight decay of 1e-5. We repeated the experiments with random seeds of 2023, 2024, and 2025, and reported the mean and standard deviation of the evaluation metrics as shown in Table 1 and Table 2. All experiments were conducted on a workstation equipped with 4 NVIDIA RTX 3090 GPUs. Due to the significant class imbalance in the Camelyon-17 four-class dataset, our analysis concentrated on two key evaluation metrics: AUC and F1-score. Figure 3 presents a visualization of these metrics for a single MIL model across different feature extractors, using bar charts for both the Camelyon-17-Origin and Camelyon-17-Refine datasets. This visualization effectively illustrates how these metrics vary as the dataset undergoes refinement. Our results indicate that both AUC and F1-score exhibited notable changes following the dataset’s adjustment. Figure 4 further visualizes the F1-score, AUC, and their combined values, highlighting the top three models to assess the impact of dataset refinement on the performance ranking of the models. While the overall model rankings shifted to some extent after the dataset refinement, the CLAM-MB25 model consistently maintained its top-ranked position, indicating its robustness. In summary, dataset refinement enhanced the accuracy of model evaluation metrics and improved the fairness of model rankings, establishing a more solid foundation for future research.
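For reproducibility, the training configuration described above amounts to the following sketch; the model is a stand-in placeholder and the training loop itself is omitted.

```python
import random
import numpy as np
import torch

def set_seed(seed):
    """Fix the random number generators that affect a single training run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

for seed in (2023, 2024, 2025):
    set_seed(seed)
    model = torch.nn.Linear(1024, 4)   # placeholder for the MIL aggregator under test
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=1e-5)
    # ... train on the training split, evaluate on the test split, and record AUC/F1
# Report the mean and standard deviation over the three seeds, as in Tables 1 and 2.
```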

Table 1 Performance metrics of different methods on the Camelyon-17-Origin dataset.
Table 2 Performance metrics of different methods on the Camelyon-17-Refine dataset.
Fig. 3
figure 3

Performance comparison between Camelyon-17-Origin23 and Camelyon-17-Refine across different MIL models and feature encoders. (a–c) AUC comparison using three feature encoders: (a) PLIP3, (b) UNI4, and (c) Gigapath5. (d–f) F1-score comparison using three feature encoders: (d) PLIP, (e) UNI, and (f) Gigapath.

Fig. 4
figure 4

Radar chart analysis of MIL models across different encoders and dataset versions. Each subplot shows the performance of multiple MIL models on the Camelyon-17-Origin23 and Camelyon-17-Refine datasets using three feature encoders: (a,e) PLIP3, (b,f) UNI4, and (c,g) Gigapath5. The outermost yellow line represents the average of AUC and F1-score, green represents AUC, and blue represents F1-score. (d,h) summarize the total ranking across all encoders for each dataset. Top-3 ranked models are highlighted with dashed boxes.

Camelyon+ Benchmark Experiment

In the benchmark experiment on Camelyon+27, we maintained the same hyperparameter settings as in the comparative experiments on Camelyon-1723. On the merged Camelyon+ dataset, we evaluated the MIL approaches using feature extractors from two pre-trained natural image models, ResNet-5026 and ViT-S12, and ten pre-trained pathology image models: Ctranspath6, PLIP3, CONCH7, CONCH-V1.521, UNI4, Gigapath5, Virchow14, Chief18, Prism20, and Titan21. We classify these feature extractors into four main categories: ResNet-50 and ViT-S fall under the domain of natural image pre-training; PLIP, CONCH, and CONCH-V1.5 fall under the domain of image-text contrastive learning pre-training; Ctranspath, UNI, GigaPath, and Virchow belong to the category of pathology-specific visual pre-training; and Chief, Prism, and Titan are slide-level encoders that represent an entire slide as a single embedding. We report the mean and standard deviation of model performance in Tables 3-12, which can serve as baselines and references for future work based on the Camelyon+ dataset. As shown in Fig. 5, we present a heatmap of the distribution of AUC and F1-score across different MIL models under various feature extractors. It can be observed that pathology-pretrained feature extractors significantly enhance the performance of MIL. Notably, the CONCH model, which uses a ViT-Base architecture with image-text contrastive learning, achieves performance comparable to the UNI and Gigapath models, which utilize ViT-Large and ViT-Giant architectures, respectively. Moreover, both UNI and Gigapath leverage larger training datasets. This suggests that image-text contrastive pretraining may hold greater potential than pure visual pretraining in the pathology domain. While the PLIP model is also pretrained using image-text contrastive learning, its performance does not match that of CONCH, likely due to its smaller dataset and the lower quality of data sourced from Twitter. Slide-level encoders have recently emerged as a popular direction in computational pathology pretraining, aiming to produce holistic representations of entire WSIs. However, due to the extremely large resolution of WSIs, the current generation of slide-level encoders still exhibits limited representational capacity. As shown in Table 12, we report the linear probing performance of three representative slide encoders (Chief, Prism, and Titan) and compare them against the patch-level encoders (Ctranspath, Virchow, and CONCH-V1.5) used to precompute features for three state-of-the-art MIL frameworks. Among the slide-level encoders, Chief achieved the highest performance, yet it still fell significantly short of MIL models trained on patch-level features extracted by Ctranspath. This highlights the current limitations of slide-level encoders in capturing the complex pathological patterns required for downstream clinical tasks.
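As a point of reference, the linear probing protocol behind Table 12 can be sketched as fitting a simple linear classifier on frozen slide-level embeddings. The file names, tensor shapes, and classifier settings below are hypothetical assumptions; the released feature-files/ directory provides the actual embeddings.

```python
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Frozen slide-level embeddings produced by a slide encoder (e.g., Chief, Prism, Titan).
# File names and shapes are illustrative placeholders.
train_emb = torch.load("slide-embeddings/chief_train.pt")    # [n_train, dim]
train_y   = torch.load("slide-embeddings/train_labels.pt")   # [n_train]
test_emb  = torch.load("slide-embeddings/chief_test.pt")
test_y    = torch.load("slide-embeddings/test_labels.pt")

clf = LogisticRegression(max_iter=5000)                       # the linear probe
clf.fit(train_emb.numpy(), train_y.numpy())
pred = clf.predict(test_emb.numpy())
print("macro F1:", f1_score(test_y.numpy(), pred, average="macro"))
```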

Table 3 Results on the Camelyon+ dataset with ResNet-50-extracted features.
Table 4 Results on the Camelyon+ dataset with VIT-S-extracted features.
Table 5 Results on the Camelyon+ dataset with PLIP-extracted features.
Table 6 Results on the Camelyon+ dataset with CONCH-extracted features.
Table 7 Results on the Camelyon+ dataset with CONCH-V1.5-extracted features.
Table 8 Results on the Camelyon+ dataset with Ctranspath-extracted features.
Table 9 Results on the Camelyon+ dataset with UNI-extracted features.
Table 10 Results on the Camelyon+ dataset with Gigapath-extracted features.
Table 11 Results on the Camelyon+ dataset with Virchow-extracted features.
Fig. 5
figure 5

Benchmark comparison of MIL models with various feature encoders. (a) Mean AUC scores across 12 MIL models and 9 feature encoders, including ResNet-5026, ViT-S12, Ctranspath6, PLIP3, CONCH7, CONCH-V1.521, UNI4, Gigapath5, and Virchow14. (b) Mean F1-score under the same evaluation setup. Each cell represents the averaged performance across datasets. Warmer colors denote higher values. CLAM-MB25, TransMIL24, and AMD-MIL33 exhibit consistently strong performance across multiple encoders, while newer foundation models such as UNI, Gigapath, and Virchow lead to higher AUC and F1-scores than conventional encoders.

Table 12 Performance comparison between the slide encoder and its corresponding patch encoder.

Figure 6 further presents few-shot learning results using slide encoders under various N-way, K-shot settings, ranging from 1 to 32 shots. The observed performance consistently improves with increasing shot number, suggesting that expanding annotated datasets for tasks such as lymph node metastasis detection could meaningfully enhance the clinical utility of slide-level foundation models.
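In sketch form, each N-way, K-shot episode samples K labeled support slides per class, fits a simple classifier on their frozen slide-encoder embeddings, and evaluates on the remaining slides of those classes. The classifier choice and episode count below are illustrative assumptions rather than the exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def few_shot_episode(embeddings, labels, classes, k_shot, rng):
    """Run one N-way, K-shot episode on frozen slide embeddings (illustrative sketch)."""
    support_idx, query_idx = [], []
    for c in classes:
        idx = rng.permutation(np.where(labels == c)[0])
        support_idx.extend(idx[:k_shot])
        query_idx.extend(idx[k_shot:])
    clf = LogisticRegression(max_iter=2000)
    clf.fit(embeddings[support_idx], labels[support_idx])
    return clf.score(embeddings[query_idx], labels[query_idx])

rng = np.random.default_rng(2024)
# embeddings: [n_slides, dim] array of slide-encoder outputs; labels: [n_slides] class indices.
# acc = np.mean([few_shot_episode(embeddings, labels, classes=[0, 1, 2, 3],
#                                 k_shot=8, rng=rng) for _ in range(20)])
```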

Fig. 6
figure 6

Few-shot classification performance on the Camelyon+27 benchmark. Boxplots show the performance of three models (Titan21, Prism20, Chief18) under 2-way, 3-way, and 4-way classification settings, with varying numbers of shots (1, 2, 4, 8, 16, 32). (a–c) Accuracy, (d–f) Recall, and (g–i) F1-score across increasing few-shot levels. Each box represents the model's performance distribution over multiple test episodes. Titan and Chief generally outperform Prism in low-shot regimes, and the performance gap narrows with increasing shot numbers.

In the Camelyon+27 dataset, noisy samples were initially removed to construct a clean benchmark. To assess the impact of noisy data, we conducted comparative experiments by reintroducing the noisy subset into either the training set or the test set of Camelyon+. As shown in Fig. 7, we evaluated the performance under 9 patch-level encoders and 3 state-of-the-art MIL methods. We observed that, in most cases, adding noisy data to the training set while keeping the test set clean resulted in performance comparable to the original benchmark. In contrast, when the test set was augmented with noisy data while keeping the training set unchanged, the performance on the test set dropped significantly compared to the benchmark. These results suggest that while incorporating noisy data during training may improve robustness without harming performance, the presence of noise in the evaluation set substantially undermines the reliability of performance estimation. This highlights the importance of clean and well-curated test data when benchmarking and deploying pathology AI systems.

Fig. 7
figure 7

Impact of noisy data on model performance across different feature encoders. Bar plots show the AUC and F1-score of three MIL models, (a) CLAM-MB25, (b) AMD-MIL33, and (c) FR-MIL37, when noisy data is added either to the training set or to the test set. Each group of bars compares the performance across multiple feature encoders, including ResNet-5026, ViT-S12, PLIP3, CONCH7, CONCH-V1.521, Ctranspath6, Gigapath5, UNI4, and Virchow14. Models exhibit more pronounced performance degradation when noise is added to the test set, while training set noise leads to more varied effects depending on the encoder.

In the benchmark results, while the models demonstrate relatively strong performance in terms of accuracy and AUC, the F1-score, recall, and precision are notably low. As illustrated in Fig. 8, we visualized the confusion matrices for the CLAM-MB25 and FR-MIL37 models. The results show that the models perform relatively well in classifying the negative, micro, and macro categories, but perform poorly on the ITC category. We used macro-averaging to calculate the F1-score, recall, and precision, and the models' poor performance on the ITC category significantly lowers these overall metrics. To investigate this issue further, we analyzed why the models struggle to identify ITC cases. One major factor is the severe class imbalance in the Camelyon+27 dataset. As shown in Fig. 2, the head class, negative, contains 871 slides, while the tail class, ITC, contains only 54 slides, resulting in an imbalance ratio of approximately 16:1. This imbalance gives the dataset a moderately long-tailed distribution. Such imbalance highlights a key challenge in pathology image analysis: how to achieve balanced model performance on long-tailed datasets like Camelyon+, particularly since real-world pathological data naturally follow a long-tailed distribution. Furthermore, we identified a fundamental difference between the four-class classification task in Camelyon+ and typical cancer subtyping tasks. The ITC, micro, and macro categories in Camelyon+ are primarily distinguished by the size of metastatic regions, whereas MIL is generally better suited to binary classification tasks, such as detecting the presence or absence of cancer. This explains why models achieve high performance on binary tasks like those in Camelyon-1622 or Camelyon-1723. Consequently, Camelyon+ raises an important question: is MIL the most suitable paradigm for clinical classification tasks in which categories are defined by the size of metastatic regions rather than by distinct cancer subtypes?
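Because macro-averaging weights each class equally, the small ITC class counts as much as the much larger negative class. The snippet below illustrates this effect; the class sizes are taken from Fig. 2, but the per-class F1 values are purely hypothetical.

```python
import numpy as np

# Hypothetical per-class F1-scores for: negative, micro, macro, ITC.
per_class_f1 = np.array([0.95, 0.80, 0.85, 0.20])
support = np.array([871, 174, 251, 54])                   # Camelyon+ class sizes

macro_f1 = per_class_f1.mean()                            # each class counts equally -> 0.70
weighted_f1 = np.average(per_class_f1, weights=support)   # dominated by head classes -> ~0.88

print(f"macro F1 = {macro_f1:.2f}, support-weighted F1 = {weighted_f1:.2f}")
```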

Fig. 8
figure 8

Confusion matrices of MIL models on the four-class metastasis classification task. Each subfigure presents the confusion matrix of a model-encoder pair on the Camelyon+27 benchmark. Rows indicate the ground truth labels and columns indicate the predicted labels. (a–d) show results for AMD-MIL33 combined with PLIP3, CONCH7, UNI4, and Gigapath5 encoders, respectively. (e–h) show results for FR-MIL37, and (i–l) for CLAM-MB25 under the same set of encoders.

Evaluation metrics

In the comparative experiments on the Camelyon-1723 dataset and in the benchmark evaluations, we used accuracy, AUC, F1-score, recall, precision, and Cohen’s kappa coefficient to assess classification performance. The kappa coefficient is a statistical metric used to evaluate the level of agreement between predicted and true labels. It is particularly useful for assessing the performance of classification models in multi-class settings, as it accounts for agreement that may occur by chance.
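For reference, all of these metrics can be computed with scikit-learn. The snippet below is a minimal sketch for the four-class setting with toy predictions; the macro averaging shown here matches how we report F1-score, recall, and precision, while the one-vs-rest macro setting used for the multi-class AUC is an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score,
                             recall_score, precision_score, cohen_kappa_score)

# y_true: ground-truth class indices; y_prob: predicted class probabilities [n, 4].
y_true = np.array([0, 1, 2, 3, 0, 2])
y_prob = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.2, 0.6, 0.1, 0.1],
                   [0.1, 0.1, 0.7, 0.1],
                   [0.1, 0.2, 0.2, 0.5],
                   [0.8, 0.1, 0.05, 0.05],
                   [0.2, 0.2, 0.5, 0.1]])
y_pred = y_prob.argmax(axis=1)

print("accuracy :", accuracy_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro"))
print("F1       :", f1_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("kappa    :", cohen_kappa_score(y_true, y_pred))
```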

Usage Notes

The Camelyon+27 dataset is publicly available under the Creative Commons Zero (CC0) license. However, please note that this dataset is not intended for developing diagnosis-focused algorithms or models, and should not be used as the sole basis for clinical evaluations in classification tasks.