Abstract
The development of artificial intelligence systems for colonoscopy analysis often necessitates expert-annotated image datasets. However, limitations in dataset size and diversity impede model performance and generalization. Image–text colonoscopy records from routine clinical practice, comprising millions of images and text reports, serve as a valuable data source, although annotating them is labour intensive. Here we leverage recent advancements in large language and vision models and propose EndoKED, a data mining paradigm for deep knowledge extraction and distillation. EndoKED automates the transformation of raw colonoscopy records into image datasets with pixel-level annotation. We apply EndoKED to multicentre datasets of raw colonoscopy records (~1 million images), showing its superior performance in detecting polyps at the report and image levels, as well as annotating polyps at the pixel level. Pretraining with EndoKED gives polyp segmentation models state-of-the-art performance and generalization ability. Furthermore, the EndoKED vision backbone enables data-efficient learning for optical biopsy, achieving expert-level performance in internal, external and prospective validation datasets.
Data availability
The main data supporting the findings of this study are available within the article and Supplementary Information. The public datasets used for segmentation can be accessed through the following web links: CVC-ClinicDB (https://polyp.grand-challenge.org/CVCClinicDB/), Kvasir-SEG (https://datasets.simula.no/kvasir-seg/), ETIS (https://polyp.grand-challenge.org/ETISLarib/), CVC-ColonDB (http://vi.cvc.uab.es/colon-qa/cvccolondb/), CVC-300 (https://pages.cvc.uab.es/CVC-Colon/) and PolypGen (https://www.synapse.org/#!Synapse:syn26376615). The large-scale clinical pretraining datasets are not publicly available owing to patient privacy constraints. Access to these restricted colonoscopy reports for non-commercial use may be granted upon reasonable request to corresponding author S.W. All requests will be reviewed and are subject to a formal data sharing agreement, consent from each participating medical centre and ethics approval. Source data are provided with this paper.
Code availability
Code (ref. 51) and trained models of this study are publicly available via GitHub at https://github.com/shuowang26/EndoKED.
References
Arnold, M. et al. Global patterns and trends in colorectal cancer incidence and mortality. Gut 66, 683–691 (2017).
Bray, F. et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 68, 394–424 (2018).
Atkin, W. et al. Long term effects of once-only flexible sigmoidoscopy screening after 17 years of follow-up: the UK Flexible Sigmoidoscopy Screening randomised controlled trial. Lancet 389, 1299–1311 (2017).
Le Berre, C. et al. Application of artificial intelligence to gastroenterology and hepatology. Gastroenterology 158, 76–94.e2 (2020).
Ali, S. Where do we stand in AI for endoscopic image analysis? Deciphering gaps and future directions. NPJ Digit. Med. 5, 184 (2022).
Messmann, H. et al. Expected value of artificial intelligence in gastrointestinal endoscopy: European Society of Gastrointestinal Endoscopy (ESGE) Position Statement. Endoscopy 54, 1211–1231 (2022).
Ali, S. et al. A multi-centre polyp detection and segmentation dataset for generalisability assessment. Sci. Data 10, 75 (2023).
Ali, S. et al. Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge. Sci. Rep. 14, 2032 (2024).
Ali, S. et al. Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy. Med. Image Anal. 70, 102002 (2021).
Ahmad, O. F. et al. Artificial intelligence and computer-aided diagnosis in colonoscopy: current evidence and future directions. Lancet Gastroenterol. Hepatol. 4, 71–80 (2019).
Gupta, M. & Mishra, A. A systematic review of deep learning based image segmentation to detect polyp. Artif. Intell. Rev. 57, 7 (2024).
Lieberman, D. et al. Standardized colonoscopy reporting and data system: report of the Quality Assurance Task Group of the National Colorectal Cancer Roundtable. Gastrointest. Endosc. 65, 757–766 (2007).
Bernal, J. et al. Comparative validation of polyp detection methods in video colonoscopy: results from the MICCAI 2015 Endoscopic Vision Challenge. IEEE Trans. Med. Imaging 36, 1231–1249 (2017).
Hassan, C., Balsamo, G., Lorenzetti, R., Zullo, A. & Antonelli, G. Artificial intelligence allows leaving-in-situ colorectal polyps. Clin. Gastroenterol. Hepatol. 20, 2505–2513.e4 (2022).
Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning (eds Meila, M. et al.) 8748–8763 (PMLR, 2021).
Tiu, E. et al. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat. Biomed. Eng. 6, 1399–1406 (2022).
Zhang, X., Wu, C., Zhang, Y., Xie, W. & Wang, Y. Knowledge-enhanced visual-language pre-training on chest radiology images. Nat. Commun. 14, 4542 (2023).
Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023).
Adams, L. C. et al. Leveraging GPT-4 for post hoc transformation of free-text radiology reports into structured reporting: a multilingual feasibility study. Radiology 307, e230725 (2023).
Kirillov, A. et al. Segment anything. In Proc. IEEE/CVF International Conference on Computer Vision (eds Agapito, L. et al.) 4015–4026 (IEEE, 2023).
Wu, J. et al. Medical SAM adapter: adapting Segment Anything Model for medical image segmentation. Med. Image Anal. 102, 103547 (2025).
Brown, T. et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems (eds Larochelle, H. et al.) 1877–1901 (Curran Associates, 2020).
Bai, Y. et al. Constitutional AI: harmlessness from AI feedback. Preprint at https://arxiv.org/abs/2212.08073 (2022).
Wang, W. et al. ERNIE 3.0: large-scale knowledge enhanced pre-training for language understanding and generation. Preprint at https://arxiv.org/abs/2107.02137 (2021).
Pogorelov, K. et al. KVASIR: a multi-class image dataset for computer aided gastrointestinal disease detection. In Proc. 8th ACM on Multimedia Systems Conference (eds Chen, K. T. et al.) 164–169 (ACM, 2017).
Bernal, J. et al. WM-DOVA maps for accurate polyp highlighting in colonoscopy: validation vs. saliency maps from physicians. Comput. Med. Imaging Graph. 43, 99–111 (2015).
Bernal, J., Sánchez, J. & Vilariño, F. Towards automatic polyp detection with a polyp appearance model. Pattern Recognit. 45, 3166–3182 (2012).
Vázquez, D. et al. A benchmark for endoluminal scene segmentation of colonoscopy images. J. Healthc. Eng. 2017, 4037190 (2017).
Silva, J., Histace, A., Romain, O., Dray, X. & Granado, B. Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer. Int. J. Comput. Assist. Radiol. Surg. 9, 283–293 (2014).
Falk, T. et al. U-Net: deep learning for cell counting, detection, and morphometry. Nat. Methods 16, 67–70 (2019).
Zhou, Z. et al. UNet++: a nested U-Net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support (eds Stoyanov, D. et al.) 3–11 (Springer, 2018).
Lei, J. et al. C2FNet: a coarse-to-fine network for multi-view 3D point cloud generation. IEEE Trans. Image Process. 31, 6707–6718 (2022).
Yin, Z. Duplex contextual relation network for polyp segmentation. In Proc. IEEE International Symposium on Biomedical Imaging (eds Awate, S. P. et al.) 1–5 (IEEE, 2022).
Zhang, R. et al. Lesion-aware dynamic kernel for polyp segmentation. In Proc. Medical Image Computing and Computer Assisted Intervention (eds Wang, L. et al.) 99–109 (Springer, 2022).
Dong, B. et al. Polyp-PVT: polyp segmentation with pyramid vision transformers. CAAI Artif. Intell. Res. 2, 9150015 (2023).
Misawa, M. et al. Development of a computer-aided detection system for colonoscopy and a publicly accessible large colonoscopy video database (with video). Gastrointest. Endosc. 93, 960–967.e3 (2021).
Ma, Y., Chen, X., Cheng, K., Li, Y. & Sun, B. LDPolypVideo benchmark: a large-scale colonoscopy video dataset of diverse polyps. In Proc. Medical Image Computing and Computer Assisted Intervention (eds de Bruijne, M. et al.) 387–396 (Springer, 2021).
Wang, H. et al. EndoBoost: a plug-and-play module for false positive suppression during computer-aided polyp detection in real-world colonoscopy (with dataset). Preprint at https://arxiv.org/abs/2212.12204 (2022).
Jha, D. et al. Kvasir-seg: a segmented polyp dataset. In International Conference on Multimedia Modeling (eds Ro, Y. M. et al.) 451–462 (Springer, 2020).
Ren, G. et al. Towards automated polyp segmentation using weakly- and semi-supervised learning and deformable transformers. In Proc. IEEE/CVF International Conference on Computer Vision (eds Agapito, L. et al.) 4355–4364 (IEEE, 2023).
Cheng, J. et al. SAM-Med2D. Preprint at https://arxiv.org/abs/2308.16184 (2023).
Mazurowski, M. A. et al. Segment Anything Model for medical image analysis: an experimental study. Med. Image Anal. 89, 102918 (2023).
Zhang, Y. et al. SamDSK: combining Segment Anything Model with domain-specific knowledge for semi-supervised learning in medical image segmentation. In Chinese Conference on Pattern Recognition and Computer Vision (eds Lin, Z. et al.) 343–357 (Springer, 2024).
Ahmad, O. F. et al. Establishing key research questions for the implementation of artificial intelligence in colonoscopy: a modified Delphi method. Endoscopy 53, 893–901 (2021).
Chen, P.-J. et al. Accurate classification of diminutive colorectal polyps using computer-aided analysis. Gastroenterology 154, 568–575 (2018).
Byrne, M. F. et al. Real-time differentiation of adenomatous and hyperplastic diminutive colorectal polyps during analysis of unaltered videos of standard colonoscopy using a deep learning model. Gut 68, 94–100 (2019).
Genta, R. M. & Sonnenberg, A. Big data in gastroenterology research. Nat. Rev. Gastroenterol. Hepatol. 11, 386–390 (2014).
Ilse, M., Tomczak, J. & Welling, M. Attention-based deep multiple instance learning. In Proc. 35th International Conference on Machine Learning (eds Dy, J. et al.) 2127–2136 (PMLR, 2018).
Li, B., Li, Y. & Eliceiri, K. W. Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (eds Forsyth, D. et al.) 14318–14328 (IEEE, 2021).
Ru, L., Zheng, H., Zhan, Y. & Du, B. Token contrast for weakly-supervised semantic segmentation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (eds Geiger, A. et al.) 3093–3102 (IEEE, 2023).
Yang, Z. & Wang, S. shuowang26/EndoKED: zendo (Version 20250724). Zenodo https://doi.org/10.5281/zenodo.16370348/ (2025).
Acknowledgements
S.W. was supported in part by the National Key Research and Development Program of China (number 2024YFF1207500), the Shanghai Municipal Education Commission Project for Promoting Research Paradigm Reform and Empowering Disciplinary Advancement through Artificial Intelligence (24RGZNC01) and the International Science and Technology Cooperation Program of the Shanghai Action Plan for Science (23410710400). Y. Zhu acknowledges support from the National Natural Science Foundation of China (82203193). Q.L. was supported by the National Natural Science Foundation of China (82170555) and the Shanghai Academic/Technology Research Leader Program (22XD1422400).
Author information
Contributions
S.W., Y. Zhu, Q.L., P.Z. and Y.G. conceptualized the study. S.W. and Y. Zhu led the study design, data analysis and paper drafting. Z.Y., X.L., P.F., H.W. and Y. Zhang also performed data analysis and interpretation, while Z.Y. and X.L. assisted with drafting the paper. M.W., Z.S., Q.L., P.Z. and Y.G. supervised the project and critically revised the paper.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethics statement
The study was approved by the ethics committees of Zhongshan Hospital, Xiamen Branch of Zhongshan Hospital, Zhengzhou Central Hospital and No. 988 Hospital of Joint Logistics Support Force.
Peer review
Peer review information
Nature Biomedical Engineering thanks Sharib Ali, Thomas de Lange and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Examples of polyp segmentation on challenging samples.
Qualitative comparison of nine state-of-the-art segmentation models (columns, model names at bottom) on various colonoscopy images (top row). The second row displays the ground-truth segmentations (expert annotation). The third row shows the output of the EndoKED-SEG model for reference. The fourth row shows the baseline performance of each model (without EndoKED pre-training). The fifth row shows the performance of the same models after pre-training with the proposed distilled annotations (with EndoKED pre-training). The pre-trained models consistently produce more accurate and robust segmentation masks that are closer to the expert annotations, with fewer false positives and artifacts.
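The caption does not name a metric for how close a predicted mask is to the expert annotation. As a minimal sketch, assuming the Dice coefficient (a common overlap measure in segmentation work) and a simple count of false-positive pixels, such a comparison could be computed as follows. The function names and the toy masks `expert`, `baseline` and `pretrained` are hypothetical illustrations, not code or data from the study.

```python
from typing import Sequence

# A binary mask is represented as rows of 0/1 values (assumption for this sketch).
Mask = Sequence[Sequence[int]]


def dice_score(pred: Mask, truth: Mask) -> float:
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two same-shaped binary masks."""
    if len(pred) != len(truth) or any(len(p) != len(t) for p, t in zip(pred, truth)):
        raise ValueError("masks must have the same shape")
    overlap = pred_area = truth_area = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            p, t = bool(p), bool(t)
            overlap += p and t
            pred_area += p
            truth_area += t
    if pred_area + truth_area == 0:
        return 1.0  # both masks empty: treated as perfect agreement by convention
    return 2.0 * overlap / (pred_area + truth_area)


def false_positive_pixels(pred: Mask, truth: Mask) -> int:
    """Count pixels predicted as polyp that the reference mask marks as background."""
    return sum(
        bool(p) and not bool(t)
        for pred_row, truth_row in zip(pred, truth)
        for p, t in zip(pred_row, truth_row)
    )


# Hypothetical 3x4 toy masks, invented purely to exercise the functions.
expert = [[0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
baseline = [[0, 1, 0, 0], [0, 1, 1, 1], [0, 0, 0, 1]]
pretrained = [[0, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]

for name, mask in (("baseline", baseline), ("pretrained", pretrained)):
    print(name, round(dice_score(mask, expert), 3), false_positive_pixels(mask, expert))
```

A higher Dice score with fewer false-positive pixels corresponds to a mask that sits closer to the expert annotation; the toy numbers here say nothing about the models compared in the figure.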
Supplementary information
Supplementary Information
Supplementary Notes 1–5, Fig. 1 and Tables 1–9.
Source data
Source Data Fig. 4
Precision–recall (PR) curve data for Fig. 4c,d.
Source Data Fig. 5
ROC curve data for Fig. 5a and data-efficiency curve data for Fig. 5b.
Source Data Fig. 6
Embedding UMAP data for Fig. 6.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, S., Zhu, Y., Yang, Z. et al. Leveraging large language and vision models for knowledge extraction from large-scale image–text colonoscopy records. Nat. Biomed. Eng (2025). https://doi.org/10.1038/s41551-025-01500-x
DOI: https://doi.org/10.1038/s41551-025-01500-x


