Background & Summary

Colorectal cancer (CRC) ranks as the third most prevalent malignancy globally, with an estimated 1.93 million new diagnoses and 940,000 associated deaths annually1. Despite continuous improvements in detection and therapy, the intrinsic heterogeneity of CRC remains a formidable barrier to effective treatment and prognosis2,3. Therefore, precise disease stratification and outcome prediction are critical clinical needs. In recent years, deep learning (DL) has emerged as a transformative tool for pathological image interpretation, showing promise in tasks such as CRC subtype classification, biomarker prediction, treatment response forecasting, and survival estimation4,5,6. With the increasing integration of artificial intelligence and DL into clinical workflows for the detection and management of CRC, there is a growing potential to enhance patient care and therapeutic precision, yielding better quality of life and survival7,8,9.

However, the development and deployment of supervised DL models hinge on access to expansive, expertly annotated datasets10,11,12. The diversity, quality, and scale of training images directly influence the robustness of segmentation algorithms and their downstream clinical utility13,14,15. In medical image analysis, building large-scale, high-resolution datasets with detailed annotations has become a focal point of research16,17. Yet, to the best of our knowledge, large-scale, fully annotated CRC image datasets that meet the FAIR (Findable, Accessible, Interoperable, and Reusable) standards remain scarce.

While public datasets such as The Cancer Genome Atlas (TCGA) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC) provide whole-slide images (WSIs) alongside clinical annotations, they often lack precise histological labels and standardized image preprocessing, restricting their usability in DL model training18,19. Similarly, challenge-specific datasets from initiatives like MICCAI typically target narrow tasks such as colorectal gland segmentation, offering limited annotation coverage, small cohorts, and sparse clinical metadata20,21,22. For example, Pataki et al. used DL to classify and annotate WSIs from 200 CRC patients to support diagnostic workflows. Yet their dataset has a restricted sample size, lacks region-level segmentation, and includes no prognostic data, which limits the generalizability of their model and its ability to reliably predict patient clinical outcomes23. The Kather dataset is widely utilized in CRC tissue classification studies24. While it incorporates multi-class annotations, it too is constrained by limited cohort diversity and lacks comprehensive clinical profiles25,26.

In summary, existing CRC histopathology resources fall short in several key areas: dataset size, annotation granularity, and integration with clinical outcomes. These factors render these datasets insufficient as tools for the development of DL models that effectively mimic real-world clinical scenarios. Most publicly available datasets also originate from European or North American populations, with minimal representation of Asian cohorts, further limiting the generalizability of AI models developed using these datasets.

To overcome these limitations, we have developed HMU-CRC-Hist550K, a large, balanced, and richly annotated CRC histological dataset collected from Harbin Medical University Cancer Hospital. It includes 500 surgically excised specimens representing all tumor stages (I–IV) and yields a total of 550,000 high-resolution image tiles. Using the TCGA and Kather datasets as points of comparison, we constructed a high-quality CRC image dataset. Because pathologists rely on a priori knowledge, annotations made by a single pathologist may carry personal biases or errors, especially at complex or ambiguous boundaries. To avoid annotation inconsistencies caused by differences in interpretation, this study employed a three-level cross-validation process to enhance annotation consistency. Compared with the standard single-pathologist annotation method, this process reduces annotation bias and improves the reliability and reproducibility of the dataset.

Annotation was conducted using a rigorous three-level cross-validation process by expert pathologists27, resulting in precise pixel-level classification of eight distinct tissue types within the tumor microenvironment (TME) (Fig. 1A): adipose tissue (ADI), cellular debris (DEB), lymphocytes (LYM), mucus (MUC), smooth muscle (MUS), normal colon mucosa (NORM), cancer-associated stroma (STR), and colorectal adenocarcinoma epithelium (TUM) (Figure S1). This approach preserves the inherent spatial complexity of the TME and supports the development of generalizable DL models. In addition to morphological annotations, the dataset includes detailed patient metadata covering ten clinical parameters, including sex, age, TNM staging, treatment history, and survival outcomes. This dataset enables multi-modal analyses that bridge tissue architecture and patient prognosis.

Fig. 1

Data preparation strategy and model framework. (A) Overview of the overall research framework. (B) Meta annotation pipeline. (C) The ViT model pipeline, in which images were divided into patches, flattened, and linearly projected. A transformer layer with multi-head attention was utilized for feature extraction, and a multi-layer perceptron (MLP) was used to generate final predictions for various tissue types. (D) An overview of the model pipeline in which an initial patch input is followed by feature extraction performed with a ResNet-based residual convolution structure and EfficientNet with MBConv blocks. Global average pooling, a flattening layer, a dropout layer, and a fully connected (FC) layer were used when processing extracted features for predictive tissue classification. The residual learning block (shown to the right) adds a skip connection (identity mapping) through weight layers and ReLU activations to enhance learning. Final predictions include various classes of tissues, including ADI, DEB, LYM, MUC, MUS, NORM, STR, and TUM.

With its release, the HMU-CRC-Hist550K dataset offers a robust platform for TME characterization, DL-based diagnostic innovation, molecular stratification, and tailored therapeutic planning in CRC. Furthermore, it lays the groundwork for developing pre-trained feature extractors suitable for transfer learning across other cancer types. All patient data have been ethically reviewed and de-identified in accordance with regulatory standards. The dataset’s structured format significantly lowers technical barriers to algorithm development and fosters reproducibility and translational research in computational pathology.

Methods

Study approval

The protocol for this study was reviewed and approved by the Ethics Committee of Harbin Medical University (Approval No: KY2024-16). The Harbin Medical University ethical committee approved a waiver of consent for the study and data publication because the data were anonymized and the study posed no direct risk to participants. All procedures involving human participants were conducted in compliance with the principles outlined in the Declaration of Helsinki.

Preprocessing and slide digitization

Histopathological tissue slides were collected from CRC patients treated at Harbin Medical University Cancer Hospital between 2013 and 2015. Specimens were processed using formalin fixation and paraffin embedding (FFPE), with slide selection based on confirmed postoperative pathological diagnoses. A total of 500 hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) were acquired. These WSIs were scanned at 20× magnification using an Aperio AT2 digital slide scanner (Leica Biosystems, Nussloch, Germany) and stored in the ScanScope Virtual Slide (SVS) format for further analysis.

Annotation process

All WSIs were subjected to a multi-tiered annotation and validation procedure conducted by a team of three experienced pathologists. Two primary pathologists, Huiying Li and Yang Jiang, each with over five years of diagnostic experience, initially reviewed and annotated the slides independently. A third senior pathologist, Hongxue Meng, who has more than a decade of clinical experience, performed a final review to ensure consistency and accuracy. To standardize and ensure the quality of annotations, the following structured protocol was implemented: (1) Initial Annotation: The two primary pathologists independently annotated randomly assigned WSIs, labeling tissue types and boundaries based on predefined classification criteria; (2) Peer Review: Each annotated image was subsequently reviewed by another primary pathologist, with particular attention to inter-rater consistency and the biological plausibility of annotations; (3) Final Quality Check: The senior pathologist conducted a comprehensive review of all annotated slides. Discrepancies were resolved through consensus, and any inconsistencies were addressed via re-annotation or exclusion of the affected samples.

Each WSI was carefully annotated to identify eight distinct histological components of the TME in CRC: ADI, DEB, LYM, MUC, MUS, NORM, STR, and TUM (Fig. 1A). A meta-annotation framework27 was adopted to enhance both the efficiency and quality of the annotation process. The entire annotation workflow and dataset construction pipeline are illustrated in Fig. 1B. In practice, pathologists outlined representative regions within WSIs using rectangular bounding boxes, thereby reducing labeling complexity while capturing key morphological features.

For large, homogeneous tissue areas such as tumor or normal tissues, pathologists annotated only selected internal regions across multiple spatial zones to ensure maximal diversity and minimize redundancy. For tissue types typically present in smaller quantities (e.g., LYM or DEB), bounding boxes were drawn to encompass as much of the relevant region as possible. Following annotation, non-overlapping image tiles measuring 224 × 224 pixels were automatically extracted from the bounding boxes and saved in .png format. Each tile was assigned the tissue label corresponding to its source region.
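The tiling step described above can be illustrated with a minimal sketch. It only computes the top-left corners of non-overlapping 224 × 224 tiles inside one rectangular bounding box; it does not read slide pixels. The function name `tile_origins` and the choice to discard partial tiles at the box edges are assumptions made for illustration, since the text does not state how edge remainders were handled.

```python
from typing import Iterator, Tuple

TILE = 224  # tile edge length in pixels, as stated in the text


def tile_origins(x0: int, y0: int, width: int, height: int,
                 tile: int = TILE) -> Iterator[Tuple[int, int]]:
    """Yield top-left corners of non-overlapping tiles lying fully inside a box.

    (x0, y0) is the top-left corner of the annotated bounding box and
    (width, height) its size in pixels. Partial tiles at the right and
    bottom edges are dropped (an assumption, not taken from the paper).
    """
    for y in range(y0, y0 + height - tile + 1, tile):
        for x in range(x0, x0 + width - tile + 1, tile):
            yield (x, y)


if __name__ == "__main__":
    # A 500 x 450 pixel box holds a 2 x 2 grid of complete tiles.
    print(list(tile_origins(0, 0, 500, 450)))
```

Each returned corner could then be passed to whatever slide reader is in use to crop the tile, which inherits the tissue label of its bounding box.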

To mitigate the effects of class imbalance stemming from the unequal spatial distribution of tissue types, a two-stage sampling strategy was implemented. For frequently occurring tissues such as TUM and MUS, up to 150 tiles were randomly selected per WSI. In contrast, for less frequent categories, all available annotated tiles were extracted in full. After tile extraction, a resampling step was applied to enhance class balance across the dataset. This process yielded a meta-annotated dataset containing nearly 550,000 image tiles sourced from 500 WSIs.
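A minimal sketch of the first sampling stage follows. It caps frequent tissue classes at 150 randomly chosen tiles per WSI and keeps every tile of the other classes. The function name `sample_tiles`, the `(wsi_id, label, tile_path)` record layout, the set of classes treated as frequent (only TUM and MUS are named in the text), and the reading of the cap as applying per WSI and per class are all assumptions for illustration. The subsequent resampling step is not specified in the text and is therefore not sketched.

```python
import random
from collections import defaultdict

FREQUENT = {"TUM", "MUS"}  # examples named in the text; the full list is an assumption
CAP = 150                  # maximum tiles kept per WSI for a frequent class


def sample_tiles(tiles, cap=CAP, frequent=FREQUENT, seed=0):
    """Apply a per-WSI cap to frequent classes; keep all tiles of other classes.

    `tiles` is an iterable of (wsi_id, label, tile_path) records
    (a hypothetical layout chosen for this sketch).
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for wsi_id, label, path in tiles:
        groups[(wsi_id, label)].append(path)

    kept = []
    for (wsi_id, label), paths in sorted(groups.items()):
        if label in frequent and len(paths) > cap:
            paths = rng.sample(paths, cap)  # random subset without replacement
        kept.extend((wsi_id, label, p) for p in paths)
    return kept
```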

Clinical data acquisition

Patient clinical and pathological data were retrieved from the institutional information management system at Harbin Medical University Cancer Hospital. This database contains detailed records of demographic and clinical attributes, including sex, age, and tumor TNM staging. Post-discharge, all patients were followed by dedicated staff, and follow-up information was continuously updated in the same system. The dataset used in this study has been de-identified, and all personal identifiers were removed to ensure patient confidentiality. A comprehensive summary of patient demographics, clinical profiles, and pathological features is presented in Table 1.

Table 1 General clinical factors associated with the histological slide image dataset.

Data Records

The complete dataset, designated HMU-CRC-Hist550K, is publicly accessible via the Figshare platform28,29,30,31,32,33,34,35,36. Compiled by Harbin Medical University Cancer Hospital, it comprises 550,000 high-resolution image patches derived from 500 H&E-stained WSIs, alongside corresponding clinical and pathological metadata collected between 2013 and 2015. The dataset is organized into two primary components: (1) Annotated Image Patches: These represent tissue patches classified into eight categories based on the components of the TME: ADI, DEB, LYM, MUC, MUS, NORM, STR, and TUM; (2) Clinical Data File: Structured as a spreadsheet named “HMU CRC Clinical.csv” (Table 1), this file includes patient-level clinical and pathological information. The overall structure of the dataset consists of a root directory containing subfolders named after each TME component (e.g., ADI, DEB, LYM) together with the clinical data file (HMU CRC Clinical.csv) (Fig. 2). This dataset supports a broad range of downstream applications, including tissue classification, prognostic modeling, biomarker discovery, and prediction of clinical outcomes based on histological features of the tumor microenvironment.

Fig. 2

Dataset organization summary.
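As an illustration of how the directory layout described in this section might be consumed, the sketch below indexes the tiles by class folder and reads the clinical spreadsheet. The folder names, the .png extension, and the file name "HMU CRC Clinical.csv" come from the text; the function name `index_dataset` and the file encoding are assumptions. No column names are assumed, so rows are returned as plain dictionaries keyed by whatever header the file contains.

```python
import csv
from pathlib import Path

CLASSES = ["ADI", "DEB", "LYM", "MUC", "MUS", "NORM", "STR", "TUM"]
CLINICAL_FILE = "HMU CRC Clinical.csv"  # file name as given in the text


def index_dataset(root):
    """Return (tiles, clinical_rows) for a downloaded copy of the dataset.

    tiles is a list of (path, label) pairs, one per .png file found in a
    class subfolder; clinical_rows is a list of dicts from the CSV header.
    """
    root = Path(root)
    tiles = []
    for label in CLASSES:
        class_dir = root / label
        if class_dir.is_dir():
            tiles.extend((p, label) for p in sorted(class_dir.glob("*.png")))

    rows = []
    clinical_path = root / CLINICAL_FILE
    if clinical_path.exists():
        # utf-8-sig tolerates an optional byte-order mark (encoding is an assumption)
        with clinical_path.open(newline="", encoding="utf-8-sig") as fh:
            rows = list(csv.DictReader(fh))
    return tiles, rows
```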

Technical Validation

Annotation validation

As previously noted, all pathological slides underwent a comprehensive, multi-stage quality control process to ensure annotation reliability. Initially, the slides were evaluated for diagnostic purposes prior to digitization, which confirmed the integrity and suitability of the tissue sections. Following digital scanning, annotations were performed by pathology residents. During this phase, slides that exhibited blurriness or poor image quality were flagged and returned for re-scanning. Once high-resolution WSIs were obtained, a senior pathologist conducted a final review of all annotations, making corrections as needed. Each image and its corresponding regions were meticulously inspected to confirm accurate classification and labeling.

Tissue segmentation validation

To further validate the robustness of the dataset, we employed DL models to perform tissue segmentation tasks targeting the eight defined TME components (Fig. 1C,D). Three distinct neural network architectures (ResNet, EfficientNet, and ViT) were trained to classify these tissue types. To avoid class imbalance and ensure fair evaluation, we applied a stratified sampling strategy across both the training and testing datasets. Additionally, we enforced a patient-level split to prevent data leakage, ensuring that images from the same patient were not shared between training and validation.
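The patient-level split mentioned above can be sketched as follows. The idea is to shuffle patients rather than tiles, so that every tile from a given patient lands on the same side of the split. The function name `patient_level_split` and the one-ID-per-tile input are assumptions for illustration; the default fraction of 0.7 mirrors the 7:3 ratio reported for training and validation. Stratification by tissue class is omitted to keep the sketch short.

```python
import random


def patient_level_split(tile_patients, train_frac=0.7, seed=0):
    """Split tile indices so that no patient appears in both subsets.

    `tile_patients` holds one patient ID per tile (hypothetical input).
    Returns (train_idx, val_idx), two lists of tile indices.
    """
    patients = sorted(set(tile_patients))
    rng = random.Random(seed)
    rng.shuffle(patients)

    n_train = round(train_frac * len(patients))
    train_patients = set(patients[:n_train])

    train_idx = [i for i, p in enumerate(tile_patients) if p in train_patients]
    val_idx = [i for i, p in enumerate(tile_patients) if p not in train_patients]
    return train_idx, val_idx
```

Because the split is made over patients, the tile-level ratio will only approximate 7:3 when patients contribute different numbers of tiles.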

The ResNet and EfficientNet models were trained using the AdamW optimizer with an initial learning rate of 0.01 and a weight decay factor of 1e-4 for five epochs. The ViT model was also optimized using AdamW, but with a learning rate of 3e-4 and a weight decay of 0.05. All networks were initialized with pre-trained weights sourced from ImageNet via Hugging Face (https://huggingface.co), enhancing their baseline performance.

During training, the dataset was divided into training and validation subsets in a 7:3 ratio. To enhance model reliability, we implemented a 10-fold cross-validation strategy to assess performance across folds. Post-training, model generalization was tested using two independent validation cohorts. On these validation datasets, the ResNet, ViT, and EfficientNet models achieved area under the curve (AUC) scores of 0.96, 0.99, and 0.99, respectively (Fig. 3C, F and I).
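For readers who wish to reproduce AUC figures of this kind, the sketch below computes a one-vs-rest AUC per class from predicted probabilities and averages the results. The text does not state how the multi-class AUC was aggregated, so the macro average used here is an assumption, as are the function names `binary_auc` and `macro_ovr_auc`. The binary AUC is obtained from the rank-sum (Mann-Whitney) identity, with tied scores receiving their average rank.

```python
def binary_auc(scores, labels):
    """AUC for binary labels (1 = positive) via the rank-sum identity."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied scores
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs both positive and negative samples")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def macro_ovr_auc(prob_rows, true_labels, classes):
    """Unweighted mean of one-vs-rest AUCs; prob_rows[n][c] is P(class c)."""
    aucs = []
    for c_idx, name in enumerate(classes):
        scores = [row[c_idx] for row in prob_rows]
        labels = [1 if t == name else 0 for t in true_labels]
        aucs.append(binary_auc(scores, labels))
    return sum(aucs) / len(aucs)
```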

Fig. 3

ROC curves for the ViT, ResNet, and EfficientNet models when used in the validation set. (A–I) ROC curves for the CRC-VAL-HE-7K, CRC-VAL-HE-100K, and independent validation sets when evaluated using the ViT (A–C), ResNet (D–F), and EfficientNet (G–I) models.

To assess the domain transferability of the models, we evaluated them on the CRC-VAL-HE-7K and CRC-VAL-HE-100K test sets. The ResNet model achieved AUC values of 0.96 and 0.99 on CRC-VAL-HE-7K and CRC-VAL-HE-100K, respectively (Fig. 4A,B). The ViT model yielded perfect AUC scores of 1.00 on both validation sets (Fig. 4D,E), while EfficientNet reached scores of 0.99 and 1.00 (Fig. 4G,H). To provide a more granular evaluation of classification performance, confusion matrices were generated for each model (Figs. 5–7).
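Per-class precision, recall, and F1 scores follow directly from a confusion matrix, as the minimal sketch below shows. The function name `per_class_metrics` and the convention that rows are true classes and columns are predicted classes are assumptions for illustration; the orientation used in the figures should be checked before applying it.

```python
def per_class_metrics(cm, classes):
    """Compute precision, recall, and F1 per class from a confusion matrix.

    cm[i][j] is the number of tiles of true class i predicted as class j
    (row = true, column = predicted; an assumed convention).
    """
    n = len(classes)
    out = {}
    for k, name in enumerate(classes):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted class differs
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out[name] = {"precision": precision, "recall": recall, "f1": f1}
    return out
```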

Fig. 4

Confusion matrices for the ViT, ResNet, and EfficientNet models when used in the validation set. (A–I) Confusion matrices for the CRC-VAL-HE-7K, CRC-VAL-HE-100K, and independent validation sets when evaluated using the ViT (A–C), ResNet (D–F), and EfficientNet (G–I) models.

Fig. 5

Precision, recall, and F1 scores for the EfficientNet, ViT, and ResNet models when used to analyze different classes of tissues in the CRC-VAL-HE-7K validation set.

Fig. 6

Precision, recall, and F1 scores for the EfficientNet, ViT, and ResNet models when used to analyze different classes of tissues in the CRC-VAL-HE-100K validation set.

Fig. 7

Precision, recall, and F1 scores for the EfficientNet, ViT, and ResNet models when used to analyze different classes of tissues in an independent validation set.

Furthermore, to better contextualize the dataset’s performance ceiling, we included additional comparisons with state-of-the-art segmentation frameworks, specifically TransUNet and Swin-Unet. In the CRC-VAL-HE-7K, CRC-VAL-HE-100K, and independent validation cohorts, both TransUNet and Swin-Unet achieved AUC values as high as 0.99 (Figure S2A–F). We also generated confusion matrices for both models (Figure S3A–F). These comparisons help situate our dataset’s performance within the broader landscape of advanced segmentation methods. Together, these results show that all models achieved strong predictive accuracy across TME categories, supporting the validity of the dataset and the consistency of the annotations. Finally, we used Gradient-weighted Class Activation Mapping (Grad-CAM) to highlight the regions corresponding to the most prominent features in the pathological sections, thereby enhancing the interpretability and clinical translatability of the dataset (Figure S4).

In summary, the high performance of all three models across internal and external validation sets supports the integrity and utility of the dataset. The ability to train and validate segmentation models with such high accuracy underscores the dataset’s reliability for downstream applications in computational pathology. One limitation is that our dataset captures static histology and does not cover longitudinal changes in the TME. In studies of tumor evolution and treatment response, analyzing TME changes at different time points contributes to a deeper exploration of the biological behavior and pathogenesis of CRC. In the future, constructing a multimodal model that integrates TME data with imaging data from different treatment time points may further elucidate the prognostic heterogeneity of CRC.