Abstract
Effective diagnosis and treatment of lung adenocarcinoma depend on accurate typing, subtyping, and grading. Herein, we present the CLWD dataset, a valuable resource for the lung cancer pathology community, comprising 408 whole-slide images (WSIs) from 210 patients specifically curated for the study of lung adenocarcinoma subtypes. Scanned at 80× magnification, it is one of the largest datasets in Asia, with a particular emphasis on Chinese patient demographics. Notably, the dataset includes comprehensive clinical information, such as age, sex, and diagnosis, providing a robust foundation for diverse research needs. Publicly accessible, it supports a range of applications, including machine learning model development and validation. An initial evaluation of lung adenocarcinoma subtype classification using a multi-instance learning framework demonstrated that this dataset can substantially advance global research and improve the accuracy of subtype diagnosis.
Data availability
The dataset is publicly available via Figshare [24] and can also be accessed directly through our Pathology Image Repository (https://leelab.kmmu.edu.cn/PathologyRepository). Alternatively, a JPG version of the dataset is available in the Hugging Face repository (https://huggingface.co/datasets/kmmuleelab/Lung_Pathology_Image_JPG).
Code availability
The code for preprocessing and deep learning models is publicly available on GitHub: https://github.com/DrNeilChen/CLWD.
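For readers unfamiliar with the multi-instance learning setting mentioned in the abstract, the following is a minimal sketch of attention-based pooling, one common way to turn a bag of patch features from a single WSI into one slide-level vector. It is not taken from the repository above and is not claimed to be the framework used in this study. The function names, the tiny dimensions, and the randomly initialized weights (`w_hidden`, `w_score`) are all illustrative assumptions; in practice these weights would be learned and the patch features would come from a pretrained encoder.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_mil_pool(patch_features, w_hidden, w_score):
    """Pool a bag of patch features into one slide-level vector.

    patch_features: list of N feature vectors (length D), one per tile.
    w_hidden: H x D projection matrix (hypothetical; normally learned).
    w_score: length-H vector mapping hidden activations to a scalar score.
    Returns (slide_vector, attention_weights).
    """
    scores = []
    for feat in patch_features:
        # Project each patch, squash with tanh, then reduce to one score.
        hidden = [
            math.tanh(sum(w * x for w, x in zip(row, feat)))
            for row in w_hidden
        ]
        scores.append(sum(v * h for v, h in zip(w_score, hidden)))

    # Attention weights are non-negative and sum to 1 across the bag.
    weights = softmax(scores)

    # The slide vector is the attention-weighted average of patch features.
    dim = len(patch_features[0])
    slide_vector = [
        sum(a * feat[d] for a, feat in zip(weights, patch_features))
        for d in range(dim)
    ]
    return slide_vector, weights


if __name__ == "__main__":
    rng = random.Random(0)
    n_patches, dim, hidden_dim = 5, 4, 3  # toy sizes, chosen arbitrarily
    bag = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_patches)]
    w_hidden = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(hidden_dim)]
    w_score = [rng.gauss(0, 1) for _ in range(hidden_dim)]
    slide_vec, attn = attention_mil_pool(bag, w_hidden, w_score)
    print("slide vector length:", len(slide_vec))
    print("attention weights:", [round(a, 3) for a in attn])
```

In a full pipeline, the pooled slide vector would be passed to a classifier head that predicts the subtype label, and the attention weights could be inspected to see which tiles contributed most to the prediction.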
References
Lortet-Tieulent, J. et al. International trends in lung cancer incidence by histological subtype: adenocarcinoma stabilizing in men but still increasing in women. Lung Cancer 84, 13–22 (2014).
Sung, H. et al. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin 71, 209–249 (2021).
Travis, W. D. et al. The 2015 World Health Organization Classification of Lung Tumors: Impact of Genetic, Clinical and Radiologic Advances Since the 2004 Classification. J Thorac Oncol 10, 1243–1260 (2015).
Travis, W. D. et al. International Association for the Study of Lung Cancer/American Thoracic Society/European Respiratory Society international multidisciplinary classification of lung adenocarcinoma. J Thorac Oncol 6, 244–285 (2011).
Xiang, C. et al. Distinct mutational features across preinvasive and invasive subtypes identified through comprehensive profiling of surgically resected lung adenocarcinoma. Mod Pathol 35, 1181–1192 (2022).
Caso, R. et al. The Underlying Tumor Genomics of Predominant Histologic Subtypes in Lung Adenocarcinoma. J Thorac Oncol 15, 1844–1856 (2020).
Zhang, Y. et al. Excellent Prognosis of Patients With Invasive Lung Adenocarcinomas During Surgery Misdiagnosed as Atypical Adenomatous Hyperplasia, Adenocarcinoma In Situ, or Minimally Invasive Adenocarcinoma by Frozen Section. Chest 159, 1265–1272 (2021).
Zhai, W. et al. Prognostic Nomograms Based on Ground Glass Opacity and Subtype of Lung Adenocarcinoma for Patients with Pathological Stage IA Lung Adenocarcinoma. Front Cell Dev Biol 9, 769881 (2021).
Cancer Genome Atlas Research Network. Comprehensive genomic characterization of squamous cell lung cancers. Nature 489, 519–525 (2012).
Chen, F. et al. Moving pan-cancer studies from basic research toward the clinic. Nat Cancer 2, 879–890 (2021).
Shmatko, A., Ghaffari Laleh, N., Gerstung, M. & Kather, J. N. Artificial intelligence in histopathology: enhancing cancer research and clinical oncology. Nat Cancer 3, 1026–1038 (2022).
Janowczyk, A. & Madabhushi, A. Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases. J Pathol Inform 7, 29 (2016).
Ozkan, T. A. et al. Interobserver variability in Gleason histological grading of prostate cancer. Scand J Urol 50, 420–424 (2016).
Elmore, J. G. et al. Diagnostic concordance among pathologists interpreting breast biopsy specimens. JAMA 313, 1122–1132 (2015).
Campanella, G. et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat Med 25, 1301–1309 (2019).
Gehrung, M. et al. Triage-driven diagnosis of Barrett’s esophagus for early detection of esophageal adenocarcinoma using deep learning. Nat Med 27, 833–841 (2021).
Lu, M. Y. et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat Biomed Eng 5, 555–570 (2021).
Yang, H. et al. Deep learning-based six-type classifier for lung cancer and mimics from histopathological whole slide images: a retrospective study. BMC Med 19, 80 (2021).
Gertych, A. et al. Convolutional neural networks can accurately distinguish four histologic growth patterns of lung adenocarcinoma in digital slides. Sci Rep 9, 1483 (2019).
Lu, M. Y. et al. AI-based pathology predicts origins for cancers of unknown primary. Nature 594, 106–110 (2021).
Wei, J. W. et al. Pathologist-level classification of histologic patterns on resected lung adenocarcinoma slides with deep neural networks. Sci Rep 9, 3358 (2019).
Shao, Z. et al. TransMIL: Transformer based Correlated Multiple Instance Learning for Whole Slide Image Classification. in Neural Information Processing Systems (2021).
Zheng, Y. et al. A Graph-Transformer for Whole Slide Image Classification. IEEE Trans Med Imaging 41, 3003–3015 (2022).
Chen, Y. CLWD: a Chinese histopathology dataset for lung adenocarcinoma subtype classification. figshare https://doi.org/10.6084/m9.figshare.29035847 (2025).
Chen, Y. et al. Lung_Pathology_Image_JPG (Revision 312c831). Hugging Face https://doi.org/10.57967/hf/7794 (2026).
Acknowledgements
This study was supported by the National Natural Science Foundation of China (No. 82560572, No. 82404091, and No. 62302429), the Yunnan Province Applied Basic Research Program Kunming Medical University Joint Project (202401AY070001-120), the Health Commission Foundation of Yunnan Province (2023-KHRCBZ-B15), the Kunming University of Science and Technology Joint Medical Project (KUST-KH2023013Y), the Yunnan Fundamental Research Projects (202501CF070023), the Kunming University of Science and Technology Joint Medical Project (KUST-KH2022018Y), the Major Science and Technology Projects of Yunnan Province (202402AA310016), the Basic Research Science and Technology Foundation of Yunnan Province (202201AS070009) and the Xing Dian Foundation of Yunnan Province (XDYC-MY-2022-0029).
Author information
Contributions
Conceptualization: J.P., J.L., D.P.T. and J.N.; Methodology and formal analysis: Y.C., H.Y.Z. and J.L.; Investigation: Y.C., H.Y.Z. and L.W.; Data curation: L.W., L.L., R.S.L., Y.H.J., P.R.T. and Y.L.; Writing-Original Draft: Y.C. and H.Y.Z.; Writing-Review & Editing: L.L., J.P., J.L., D.P.T. and J.N.; Supervision: J.P., J.L., D.P.T. and J.N. All authors have read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Chen, Y., Zhao, H., Wang, L. et al. CLWD: a Chinese histopathology dataset for lung adenocarcinoma subtype classification. Sci Data (2026). https://doi.org/10.1038/s41597-026-06906-z
DOI: https://doi.org/10.1038/s41597-026-06906-z


