Large-Scale Histological Image Dataset with Metadata for Colorectal Cancer Microenvironment

Wang, Hao; Li, Huiying; Xue, Jingmin; Jiang, Yang; Ma, Keru; Du, Fenqi; Mo, Genshen; Li, Hao; Huang, Yuze; Xie, Haonan; Meng, Hongxue; Han, Peng; Lou, Shenghan

doi:10.1038/s41597-026-06675-9

Download PDF

Data Descriptor
Open access
Published: 12 February 2026

Large-Scale Histological Image Dataset with Metadata for Colorectal Cancer Microenvironment

Hao Wang¹^na1,
Huiying Li²^na1,
Jingmin Xue³^na1,
Yang Jiang²,
Keru Ma¹,
Fenqi Du³,
Genshen Mo³,
Hao Li³,
Yuze Huang³,
Haonan Xie³,
Hongxue Meng²,
Peng Han^3,4,5 &
…
Shenghan Lou³

Scientific Data volume 13, Article number: 431 (2026) Cite this article

2330 Accesses
Metrics details

Subjects

Abstract

The pronounced heterogeneity of the tumor microenvironment (TME) in colorectal cancer (CRC) presents major obstacles in accurately predicting patient outcomes and tailoring treatment responses. Deciphering this intricate microenvironment based on histological images and classifying it into well-defined tissue components is critical for optimizing clinical interventions. Although deep learning (DL) has advanced substantially in medical imaging analysis, its application in CRC remains limited due to a shortage of comprehensively annotated datasets and large-scale, high-quality histological images. To address this gap, we present HMU-CRC-Hist550K, a curated dataset comprising 550,000 annotated image tiles derived from 500 whole-slide images, fully labeled into eight distinct TME tissue classes. The dataset represents a broad collection of publicly available CRC histology samples. Additionally, we demonstrate the utility of this resource by benchmarking three DL models on tissue segmentation tasks. HMU-CRC-Hist550K offers a valuable foundation for TME profiling, AI-assisted diagnosis, molecular subtype inference, and individualized therapy planning, while also enabling new research directions in modeling the spatial-temporal evolution of the TME.

Histopathological classification of colorectal cancer based on domain-specific transfer learning and multi-model feature fusion

Article Open access 08 October 2025

Fusion of classical and deep learning features with incremental learning for improved classification of lung and colon cancer

Article Open access 19 November 2025

A fusion model to predict the survival of colorectal cancer based on histopathological image and gene mutation

Article Open access 20 March 2025

Background & Summary

Colorectal cancer (CRC) ranks as the third most prevalent malignancy globally, with an estimated 1.93 million new diagnoses and 940,000 associated deaths annually¹. Despite continuous improvements in detection and therapy, the intrinsic heterogeneity of CRC remains a formidable barrier to effective treatment and prognosis^2,3. Therefore, precise disease stratification and outcome prediction are critical clinical needs. In recent years, deep learning (DL) has emerged as a transformative tool for pathological image interpretation, showing promise in tasks such as CRC subtype classification, biomarker prediction, treatment response forecasting, and survival estimation^4,5,6. With the increasing integration of artificial intelligence and DL into clinical workflows for the detection and management of CRC, there is a growing potential to enhance patient care and therapeutic precision, yielding better quality of life and survival^7,8,9.

However, the development and deployment of supervised DL models hinge on access to expansive, expertly annotated datasets^10,11,12. The diversity, quality, and scale of training images directly influence the robustness of segmentation algorithms and their downstream clinical utility^13,14,15. In medical image analysis, building large-scale, high-resolution datasets with detailed annotations has become a focal point of research^16,17. However, to the best of our knowledge, large-scale, fully annotated CRC image datasets that meet the FAIR (Findable, Accessible, Interoperable, and Reusable) standards remain relatively limited.

While public datasets such as The Cancer Genome Atlas (TCGA) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC) provide WSIs alongside clinical annotations, they often lack precise histological labels and standardized image preprocessing, restricting their usability in DL model training^18,19. Similarly, challenge-specific datasets from initiatives like MICCAI typically target narrow tasks such as colorectal gland segmentation, offering limited annotation coverage, small cohorts, and sparse clinical metadata^20,21,22. For example, Pataki et al. used DL to classify and annotate WSIs from 200 CRC patients to support diagnostic workflows. Yet, their dataset suffers from restricted sample size, absence of region-level segmentation, and no inclusion of prognostic data, ultimately hindering the generalizability of their model or its ability to reliably predict patient clinical outcome²³. The Kather dataset is widely utilized in CRC classification studies, particularly for cervical lesion detection²⁴. While it incorporates multi-class annotations, it too is constrained by limited cohort diversity and lacks comprehensive clinical profiles^25,26.

In summary, existing CRC histopathology resources fall short in several key areas: dataset size, annotation granularity, and integration with clinical outcomes. These factors render these datasets insufficient as tools for the development of DL models that effectively mimic real-world clinical scenarios. Most publicly available datasets also originate from European or North American populations, with minimal representation of Asian cohorts, further limiting the generalizability of AI models developed using these datasets.

To overcome these limitations, we have developed HMU-CRC-Hist550K, a large, balanced, and richly annotated CRC histological dataset collected from Harbin Medical University Cancer Hospital. It includes 500 surgically excised specimens representing all tumor stages (I–IV) and yields a total of 550,000 high-resolution image tiles. By comparing with the TCGA or Kather datasets, we constructed a high-quality image dataset of CRC. Since pathologists rely on a priori knowledge, the annotations made by a single pathologist may lead to personal biases or errors, especially at more complex or ambiguous boundaries. To avoid inconsistencies in annotations caused by differences in interpretation. This study employed a three-level cross-validation process, significantly enhancing annotation consistency. Particularly when compared to the standard single-pathologist annotation method, it effectively reduces annotation bias and improves the reliability and reproducibility of the dataset.

Annotation was conducted using a rigorous three-level cross-validation process by expert pathologists²⁷, resulting in precise pixel-level classification of eight distinct tissue types within the tumor microenvironment (TME) (Fig. 1A): adipose tissue (ADI), cellular debris (DEB), lymphocytes (LYM), mucus (MUC), smooth muscle (MUS), normal colon mucosa (NORM), cancer-associated stroma (STR), and colorectal adenocarcinoma epithelium (TUM) (Figure S1). This approach preserves the inherent spatial complexity of the TME and supports the development of generalizable DL models. In addition to morphological annotations, the dataset includes detailed patient metadata covering ten clinical parameters, including sex, age, TNM staging, treatment history, and survival outcomes. This dataset enables multi-modal analyses that bridge tissue architecture and patient prognosis.

With its release, the HMU-CRC-Hist550K dataset offers a robust platform for TME characterization, DL-based diagnostic innovation, molecular stratification, and tailored therapeutic planning in CRC. Furthermore, it lays the groundwork for developing pre-trained feature extractors suitable for transfer learning across other cancer types. All patient data have been ethically reviewed and de-identified in accordance with regulatory standards. The dataset’s structured format significantly lowers technical barriers to algorithm development and fosters reproducibility and translational research in computational pathology.

Methods

Study approval

The protocol for this study was reviewed and approved by the Ethics Committee of Harbin Medical University (Approval No: KY2024-16). The Harbin Medical University ethical committee approved a waiver of consent for the study and data publication because the data were anonymized and the study posed no direct risk to participants. All procedures involving human participants were conducted in compliance with the principles outlined in the Declaration of Helsinki.

Preprocessing and slide digitization

Histopathological tissue slides were collected from CRC patients treated at Harbin Medical University Cancer Hospital between 2013 and 2015. Specimens were processed using formalin fixation and paraffin embedding (FFPE), with slide selection based on confirmed postoperative pathological diagnoses. A total of 500 hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) were acquired. These WSIs were scanned at 20× magnification using an Aperio AT2 digital slide scanner (Leica Biosystems, Nussloch, Germany) and stored in the ScanScope Virtual Slide (SVS) format for further analysis.

Annotation process

All WSIs were subjected to a multi-tiered annotation and validation procedure conducted by a team of three experienced pathologists. Two primary pathologists, Huiying Li and Yang Jiang, each with over five years of diagnostic experience, initially reviewed and annotated the slides independently. A third senior pathologist, Hongxue Meng, who has more than a decade of clinical experience, performed a final review to ensure consistency and accuracy. To standardize and ensure the quality of annotations, the following structured protocol was implemented: (1) Initial Annotation: The two primary pathologists independently annotated randomly assigned WSIs, labeling tissue types and boundaries based on predefined classification criteria; (2) Peer Review: Each annotated image was subsequently reviewed by another primary pathologist, with particular attention to inter-rater consistency and the biological plausibility of annotations; (3) Final Quality Check: The senior pathologist conducted a comprehensive review of all annotated slides. Discrepancies were resolved through consensus, and any inconsistencies were addressed via re-annotation or exclusion of the affected samples.

Each WSI was carefully annotated to identify eight distinct histological components of the TME in CRC, including ADI, DEB, LYM, MUC, smooth MUS, NORM, STR, and TUM (Fig. 1A). A meta-annotation framework²⁷ to enhance both the efficiency and quality of the annotation process. The entire annotation workflow and dataset construction pipeline are illustrated in Fig. 1B. In practice, pathologists outlined representative regions within WSIs using rectangular bounding boxes, thereby reducing labeling complexity while capturing key morphological features.

For large, homogeneous tissue areas such as tumor or normal tissues, pathologists annotated only selected internal regions across multiple spatial zones to ensure maximal diversity and minimize redundancy. For tissue types typically present in smaller quantities (e.g., LYM or DEB), bounding boxes were drawn to encompass as much of the relevant region as possible. Following annotation, non-overlapping image tiles measuring 224 × 224 pixels were automatically extracted from the bounding boxes and saved in.png format. Each tile was assigned the tissue label corresponding to its source region.

To mitigate the effects of class imbalance stemming from the unequal spatial distribution of tissue types, a two-stage sampling strategy was implemented. For frequently occurring tissues such as TUM and MUS, up to 150 tiles were randomly selected per WSI. In contrast, for less frequent categories, all available annotated tiles were extracted in full. After tile extraction, a resampling step was applied to enhance class balance across the dataset. This process yielded a meta-annotated dataset containing nearly 550,000 image tiles sourced from 500 WSIs.

Clinical data acquisition

Patient clinical and pathological data were retrieved from the institutional information management system at Harbin Medical University Cancer Hospital. This database contains detailed records of demographic and clinical attributes, including sex, age, and tumor TNM staging. Post-discharge, all patients were followed by dedicated staff, and follow-up information was continuously updated in the same system. The dataset used in this study has been de-identified, and all personal identifiers were removed to ensure patient confidentiality. A comprehensive summary of patient demographics, clinical profiles, and pathological features is presented in Table 1.

Table 1 General clinical factors associated with the histological slide image dataset.

Full size table

Data Records

The complete dataset, designated HMU-CRC-Hist550K, is publicly accessible via the Figshare platform^{28,29,30,31,32,33,34,35,36}. Compiled by Harbin Medical University Cancer Hospital, it comprises 550,000 high-resolution image patches derived from 500 H&E-stained WSIs, alongside corresponding clinical and pathological metadata collected between 2013 and 2015. The dataset is organized into two primary components: (1) Annotated Image Patches: These represent tissue patches classified into eight categories based on the components of the TME: ADI, DEB, LYM, MUC, MUS, NORM, STR, and TUM; (2) Clinical Data File: Structured as a spreadsheet named “HMU CRC Clinical.csv” (Table 1), this file includes patient-level clinical and pathological information. The overall structure of the dataset consists of a root directory containing subfolders named after each TME component (e.g., ADI, DEB, LYM) together with the clinical data file (HMU CRC Clinical.csv) (Fig. 2). This dataset supports a broad range of downstream applications, including tissue classification, prognostic modeling, biomarker discovery, and prediction of clinical outcomes based on histological features of the tumor microenvironment.

Technical Validation

Annotation validation

As previously noted, all pathological slides underwent a comprehensive, multi-stage quality control process to ensure annotation reliability. Initially, the slides were evaluated for diagnostic purposes prior to digitization, which confirmed the integrity and suitability of the tissue sections. Following digital scanning, annotations were performed by pathology residents. During this phase, slides that exhibited blurriness or poor image quality were flagged and returned for re-scanning. Once high-resolution WSIs were obtained, a senior pathologist conducted a final review of all annotations, making corrections as needed. Each image and its corresponding regions were meticulously inspected to confirm accurate classification and labeling.

Tissue segmentation validation

To further validate the robustness of the dataset, we employed DL models to perform tissue segmentation tasks targeting the eight defined TME components (Fig. 1C,D). Three distinct neural network architectures (ResNet, EfficientNet, and ViT) were trained to classify these tissue types. To avoid class imbalance and ensure fair evaluation, we applied a stratified sampling strategy across both the training and testing datasets. Additionally, we enforced a patient-level split to prevent data leakage, ensuring that images from the same patient were not shared between training and validation.

The ResNet and EfficientNet models were trained using the AdamW optimizer with an initial learning rate of 0.01 and a weight decay factor of 1e-4 for five epochs. The ViT model was also optimized using AdamW, but with a learning rate of 3e-4 and a weight decay of 0.05. All networks were initialized with pre-trained weights sourced from ImageNet via Hugging Face (https://huggingface.co), enhancing their baseline performance.

During training, the dataset was divided into training and validation subsets in a 7:3 ratio. To enhance model reliability, we implemented a 10-fold cross-validation strategy to assess performance across folds. Post-training, model generalization was tested using two independent validation cohorts. On these validation datasets, the ResNet, ViT, and EfficientNet models achieved area under the curve (AUC) scores of 0.96, 0.99, and 0.99, respectively (Fig. 3C, F and I).

To assess the domain transferability of the models, we evaluated them on the CRC-VAL-HE-7K and CRC-VAL-HE-100K test sets. The ResNet model achieved AUC values of 0.96 and 0.99 on CRC-VAL-HE-7K and CRC-VAL-HE-100K, respectively (Fig. 4A,B). The ViT model yielded perfect AUC scores of 1.00 on both validation sets (Fig. 4D,E), while EfficientNet reached scores of 0.99 and 1.00 (Fig. 4G,H). To provide a more granular evaluation of classification performance, confusion matrices were generated for each model (Figs. 5–7).

Furthermore, to better contextualize the dataset’s performance ceiling, we have included additional comparisons with state-of-the-art segmentation frameworks, specifically TransUNet and Swin-Unet. In the CRC-VAL-HE-7K, CRC-VAL-HE-100K, and independent validation cohorts, both TransUNet and Swin-Unet achieved an AUC as high as 0.99 (Figure S2A-F). we also generated confusion matrices for both models (Figure S3A-F). These comparisons help situate our dataset’s performance within the broader landscape of advanced segmentation methods. These results confirm that all models demonstrated strong predictive accuracy across TME categories, reinforcing the validity of the dataset and the consistency of the annotations. Finally, we used Gradient-weighted Class Activation Mapping (Grad-CAM) to focus on the regions corresponding to the most prominent features in the pathological sections, thereby enhancing the interpretability and clinical translatability of the dataset (Figure S4).

In summary, the high performance of all three models across internal and external validation sets supports the integrity and utility of the dataset. The ability to successfully train and validate segmentation models with such high accuracy underscores the dataset’s reliability for downstream applications in computational pathology. In addition, our dataset focuses on static histology but does not involve longitudinal changes in TME. In studies on tumor evolution and treatment response, analyzing TME changes at different time points contributes to a deeper exploration of the biological behavior and pathogenesis of CRC. In the future, constructing a multimodal model by integrating TME data with imaging data from different treatment time points may further elucidate the prognostic heterogeneity of CRC.

Data availability

The complete dataset, designated HMU-CRC-Hist550K, is publicly accessible via the Figshare platform^{28,29,30,31,32,33,34,35,36}.

Code availability

The code used in this study was written in Python 3 and is available on GitHub (https://github.com/NakingLeo/HMUCRCHistosetValidationCode). The code is based on PyTorch (version 1.13.1) and OpenSlide (version 1.2).

References

Bray, F. et al. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: a cancer journal for clinicians 74, 229-263, https://doi.org/10.3322/caac.21834 (2024).
Mo, S. et al. Patient-Derived Organoids from Colorectal Cancer with Paired Liver Metastasis Reveal Tumor Heterogeneity and Predict Response to Chemotherapy. Advanced science (Weinheim, Baden-Wurttemberg, Germany) 9, e2204097, https://doi.org/10.1002/advs.202204097 (2022).
Article CAS PubMed PubMed Central Google Scholar
Lei, J. X. et al. Deciphering tertiary lymphoid structure heterogeneity reveals prognostic signature and therapeutic potentials for colorectal cancer: a multicenter retrospective cohort study. International journal of surgery (London, England) 110, 5627–5640, https://doi.org/10.1097/js9.0000000000001684 (2024).
Article PubMed PubMed Central Google Scholar
Wang, Z. et al. Label Cleaning Multiple Instance Learning: Refining Coarse Annotations on Single Whole-Slide Images. IEEE transactions on medical imaging 41, 3952–3968, https://doi.org/10.1109/tmi.2022.3202759 (2022).
Article ADS PubMed PubMed Central Google Scholar
Lim, Y. et al. Artificial intelligence-powered spatial analysis of tumor-infiltrating lymphocytes for prediction of prognosis in resected colon cancer. NPJ precision oncology 7, 124, https://doi.org/10.1038/s41698-023-00470-0 (2023).
Article CAS PubMed PubMed Central Google Scholar
Foersch, S. et al. Multistain deep learning for prediction of prognosis and therapy response in colorectal cancer. Nature medicine 29, 430–439, https://doi.org/10.1038/s41591-022-02134-1 (2023).
Article CAS PubMed Google Scholar
Odisio, B. C. et al. Software-based versus visual assessment of the minimal ablative margin in patients with liver tumours undergoing percutaneous thermal ablation (COVER-ALL): a randomised phase 2 trial. The lancet. Gastroenterology & hepatology, https://doi.org/10.1016/s2468-1253(25)00024-x (2025).
Wallace, M. B. et al. Impact of Artificial Intelligence on Miss Rate of Colorectal Neoplasia. Gastroenterology 163, 295–304.e295, https://doi.org/10.1053/j.gastro.2022.03.007 (2022).
Article PubMed Google Scholar
Kanth, P. & Inadomi, J. M. Screening and prevention of colorectal cancer. BMJ (Clinical research ed.) 374, n1855, https://doi.org/10.1136/bmj.n1855 (2021).
Article PubMed Google Scholar
Zhang, P. et al. Systematic inference of super-resolution cell spatial profiles from histology images. Nature communications 16, 1838, https://doi.org/10.1038/s41467-025-57072-6 (2025).
Article ADS CAS PubMed PubMed Central Google Scholar
Niehues, J. M. et al. Generalizable biomarker prediction from cancer pathology slides with self-supervised deep learning: A retrospective multi-centric study. Cell reports. Medicine 4, 100980, https://doi.org/10.1016/j.xcrm.2023.100980 (2023).
Article CAS PubMed PubMed Central Google Scholar
Yu, G. et al. Accurate recognition of colorectal cancer with semi-supervised deep learning on pathological images. Nature communications 12, 6311, https://doi.org/10.1038/s41467-021-26643-8 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Maška, M. et al. The Cell Tracking Challenge: 10 years of objective benchmarking. Nature methods 20, 1010–1020, https://doi.org/10.1038/s41592-023-01879-y (2023).
Article CAS PubMed PubMed Central Google Scholar
Sirinukunwattana, K. et al. Image-based consensus molecular subtype (imCMS) classification of colorectal cancer using deep learning. Gut 70, 544–554, https://doi.org/10.1136/gutjnl-2019-319866 (2021).
Article CAS PubMed Google Scholar
Wagner, S. J. et al. Transformer-based biomarker prediction from colorectal cancer histology: A large-scale multicentric study. Cancer cell 41, 1650–1661.e1654, https://doi.org/10.1016/j.ccell.2023.08.002 (2023).
Article CAS PubMed PubMed Central Google Scholar
Li, X. et al. Transformer-Based Visual Segmentation: A Survey. IEEE transactions on pattern analysis and machine intelligence 46, 10138–10163, https://doi.org/10.1109/tpami.2024.3434373 (2024).
Article ADS PubMed Google Scholar
Dong, D. et al. Deep learning radiomic nomogram can predict the number of lymph node metastasis in locally advanced gastric cancer: an international multicenter study. Annals of oncology: official journal of the European Society for Medical Oncology 31, 912–920, https://doi.org/10.1016/j.annonc.2020.04.003 (2020).
Article CAS PubMed Google Scholar
Zhou, C. et al. Histopathology classification and localization of colorectal cancer using global labels by weakly supervised deep learning. Computerized medical imaging and graphics: the official journal of the Computerized Medical Imaging Society 88, 101861, https://doi.org/10.1016/j.compmedimag.2021.101861 (2021).
Article PubMed Google Scholar
Nguyen, H. G. et al. Image-based assessment of extracellular mucin-to-tumor area predicts consensus molecular subtypes (CMS) in colorectal cancer. Modern pathology: an official journal of the United States and Canadian Academy of Pathology, Inc 35, 240–248, https://doi.org/10.1038/s41379-021-00894-8 (2022).
Article CAS PubMed Google Scholar
Liu, M., Jiang, J. & Wang, Z. Colonic Polyp Detection in Endoscopic Videos With Single Shot Detection Based Deep Convolutional Neural Network. IEEE access: practical innovations, open solutions 7, 75058–75066, https://doi.org/10.1109/access.2019.2921027 (2019).
Article PubMed PubMed Central Google Scholar
Sirinukunwattana, K. et al. Gland segmentation in colon histology images: The glas challenge contest. Medical image analysis 35, 489–502, https://doi.org/10.1016/j.media.2016.08.008 (2017).
Article PubMed Google Scholar
Sun, M., Wang, J., Gong, Q. & Huang, W. Enhancing gland segmentation in colon histology images using an instance-aware diffusion model. Computers in biology and medicine 166, 107527, https://doi.org/10.1016/j.compbiomed.2023.107527 (2023).
Article PubMed Google Scholar
Pataki, B. et al. HunCRC: annotated pathological slides to enhance deep learning applications in colorectal cancer screening. Scientific data 9, 370, https://doi.org/10.1038/s41597-022-01450-y (2022).
Article PubMed PubMed Central Google Scholar
Kather, J. N. et al. Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study. PLoS medicine 16, e1002730, https://doi.org/10.1371/journal.pmed.1002730 (2019).
Article CAS PubMed PubMed Central Google Scholar
Kather, J. N. et al. Multi-class texture analysis in colorectal cancer histology. Scientific reports 6, 27988, https://doi.org/10.1038/srep27988 (2016).
Article ADS CAS PubMed PubMed Central Google Scholar
Albashish, D. Ensemble of adapted convolutional neural networks (CNN) methods for classifying colon histopathological images. PeerJ. Computer science 8, e1031, https://doi.org/10.7717/peerj-cs.1031 (2022).
Article PubMed PubMed Central Google Scholar
Liang, J. et al. Deep learning supported discovery of biomarkers for clinical prognosis of liver cancer. Nature Machine Intelligence 5, 408–420, https://doi.org/10.1038/s42256-023-00635-3 (2023).
Article Google Scholar
Lou, S. Colorectal Cancer Histopathological Tissue Image Dataset: Adipose Tissue (ADI), https://doi.org/10.6084/m9.figshare.28931402.v1 (2025).
Lou, S. Colorectal Cancer Histopathological Tissue Image Dataset: Lymphocytes (LYM), https://doi.org/10.6084/m9.figshare.28936250.v1 (2025).
Lou, S. Colorectal Cancer Histopathological Tissue Image Dataset: Cellular Debris (DEB), https://doi.org/10.6084/m9.figshare.28939016.v1 (2025).
Lou, S. Colorectal Cancer Histopathological Tissue Image Dataset: Mucus (MUC), https://doi.org/10.6084/m9.figshare.28939100.v1 (2025).
Lou, S. Colorectal Cancer Histopathological Tissue Image Dataset: Smooth Muscle (MUS), https://doi.org/10.6084/m9.figshare.28939115.v1 (2025).
Lou, S. Colorectal Cancer Histopathological Tissue Image Dataset: Normal Colon Mucosa (NORM), https://doi.org/10.6084/m9.figshare.28939151.v1 (2025).
Lou, S. Colorectal Cancer Histopathological Tissue Image Dataset: Cancer-Associated Stroma (STR), https://doi.org/10.6084/m9.figshare.28939169.v1 (2025).
Lou, S. Colorectal Cancer Histopathological Tissue Image Dataset: Colorectal Adenocarcinoma Epithelium (TUM), https://doi.org/10.6084/m9.figshare.28939460.v1 (2025).
Lou, S. Colorectal Cancer Histopathological Tissue Image Dataset: HMU-CRC-Clinical, https://doi.org/10.6084/m9.figshare.28940609.v1 (2025).

Download references

Acknowledgements

This study was supported by Heilongjiang Provincial Higher Education Institutions Collaborative Innovation Cultivation Project (LJGXCG2023-087), Harbin Medical University Cancer Hospital Ascend Leading Disciplines Plan (PDYS-2024-14), Heilongjiang Provincial Natural Science Foundation of China (LH2023H096), the Postdoctoral Research Project in Heilongjiang Province (LBH-Z22210), and the China Postdoctoral Science Foundation (2023MD744213).

Author information

These authors contributed equally: Hao Wang, Huiying Li, Jingmin Xue.

Authors and Affiliations

Department of Breast Surgery, Harbin Medical University Cancer Hospital, No.150 Haping Road, Harbin, Heilongjiang, 150081, China
Hao Wang & Keru Ma
Department of Pathology, Harbin Medical University Cancer Hospital, No.150 Haping Road, Harbin, Heilongjiang, 150081, China
Huiying Li, Yang Jiang & Hongxue Meng
Department of Oncology Surgery, Harbin Medical University Cancer Hospital, No.150 Haping Road, Harbin, Heilongjiang, 150081, China
Jingmin Xue, Fenqi Du, Genshen Mo, Hao Li, Yuze Huang, Haonan Xie, Peng Han & Shenghan Lou
Heilongjiang Province Key Laboratory of Molecular Oncology, No.150 Haping Road, Harbin, Heilongjiang, 150081, China
Peng Han
Heilongjiang Cancer Institute, No.150 Haping Road, Harbin, Heilongjiang, 150081, China
Peng Han

Authors

Hao Wang
View author publications
Search author on:PubMed Google Scholar
Huiying Li
View author publications
Search author on:PubMed Google Scholar
Jingmin Xue
View author publications
Search author on:PubMed Google Scholar
Yang Jiang
View author publications
Search author on:PubMed Google Scholar
Keru Ma
View author publications
Search author on:PubMed Google Scholar
Fenqi Du
View author publications
Search author on:PubMed Google Scholar
Genshen Mo
View author publications
Search author on:PubMed Google Scholar
Hao Li
View author publications
Search author on:PubMed Google Scholar
Yuze Huang
View author publications
Search author on:PubMed Google Scholar
Haonan Xie
View author publications
Search author on:PubMed Google Scholar
Hongxue Meng
View author publications
Search author on:PubMed Google Scholar
Peng Han
View author publications
Search author on:PubMed Google Scholar
Shenghan Lou
View author publications
Search author on:PubMed Google Scholar

Contributions

H.W., H.L., J.X. and S.L. convinced of the project. H.W., H.L., J.X. and S.L. designed the model and developed the algorithms. H.L., Y.J., and H.M. provided and annotated the slide images. J.X., F.D., K.M., G.M., H.L., Y.H., H.X. and P.H. participated in data collection and processing. H.W., and S.L., wrote the manuscript. K.M., Y.J., F.D., G.M. and Y.H. participated in workstation environment development. All the authors participated in the final revision of the paper.

Corresponding authors

Correspondence to Hongxue Meng, Peng Han or Shenghan Lou.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Large-Scale Histological Image Dataset with Metadata for Colorectal Cancer Microenvironment (download DOCX )

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Wang, H., Li, H., Xue, J. et al. Large-Scale Histological Image Dataset with Metadata for Colorectal Cancer Microenvironment. Sci Data 13, 431 (2026). https://doi.org/10.1038/s41597-026-06675-9

Download citation

Received: 09 May 2025
Accepted: 22 January 2026
Published: 12 February 2026
Version of record: 20 March 2026
DOI: https://doi.org/10.1038/s41597-026-06675-9