Leveraging large language and vision models for knowledge extraction from large-scale image–text colonoscopy records

Wang, Shuo; Zhu, Yan; Yang, Zhiwei; Luo, Xiaoyuan; Zhang, Yizhe; Fu, Peiyao; Wang, Haoran; Wang, Manning; Song, Zhijian; Li, Quanlin; Zhou, Pinghong; Guo, Yike

doi:10.1038/s41551-025-01500-x

Article
Published: 16 September 2025

Shuo Wang^1,2,3,4^ (ORCID: orcid.org/0000-0002-2947-8783),
Yan Zhu^4,5^,
Zhiwei Yang^1,2,6^,
Xiaoyuan Luo^1,2^,
Yizhe Zhang^7^,
Peiyao Fu^4,5^,
Haoran Wang^1,2^,
Manning Wang^1,2^,
Zhijian Song^1,2^,
Quanlin Li^4,5^ (ORCID: orcid.org/0000-0002-9108-8786),
Pinghong Zhou^4,5^ (ORCID: orcid.org/0000-0002-5434-0540) &
Yike Guo^8^

Nature Biomedical Engineering (2025)

Abstract

The development of artificial intelligence systems for colonoscopy analysis often necessitates expert-annotated image datasets. However, limitations in dataset size and diversity impede model performance and generalization. Image–text colonoscopy records from routine clinical practice, comprising millions of images and text reports, serve as a valuable data source, although annotating them is labour intensive. Here we leverage recent advancements in large language and vision models and propose EndoKED, a data mining paradigm for deep knowledge extraction and distillation. EndoKED automates the transformation of raw colonoscopy records into image datasets with pixel-level annotation. We apply EndoKED to multicentre datasets of raw colonoscopy records (~1 million images), showing its superior performance in detecting polyps at the report and image levels and in annotating polyps at the pixel level. Pretraining with EndoKED enables polyp segmentation models to achieve state-of-the-art performance and generalization. Furthermore, the EndoKED vision backbone enables data-efficient learning for optical biopsy, achieving expert-level performance on internal, external and prospective validation datasets.
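
The abstract describes detecting polyps at the report level from free-text records with the help of a large language model. The sketch below shows one way such report abstraction could be wired up, assuming the model is asked to reply in JSON. The prompt wording, the `polyp_present`/`polyp_count` fields and the `abstract_report` helper are hypothetical and are not the EndoKED prompt or code; the LLM call is passed in as a plain function so that any client, or a stub as in the usage lines, can be plugged in.

```python
import json
from typing import Callable

# Hypothetical prompt: the wording actually used by EndoKED is not given here.
PROMPT_TEMPLATE = (
    "You are reading a colonoscopy report. Reply only with JSON of the form "
    '{{"polyp_present": true or false, "polyp_count": <non-negative integer>}}.\n'
    "Report:\n{report}"
)


def build_prompt(report: str) -> str:
    """Insert the free-text report into the hypothetical extraction prompt."""
    return PROMPT_TEMPLATE.format(report=report.strip())


def parse_reply(reply: str) -> dict:
    """Parse the model's JSON reply into a validated report-level label."""
    data = json.loads(reply)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    if "polyp_present" not in data or "polyp_count" not in data:
        raise ValueError("reply is missing a required field")
    present, count = data["polyp_present"], data["polyp_count"]
    if not isinstance(present, bool):
        raise ValueError("polyp_present must be a boolean")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(count, int) or isinstance(count, bool) or count < 0:
        raise ValueError("polyp_count must be a non-negative integer")
    if present != (count > 0):
        raise ValueError("polyp_present and polyp_count disagree")
    return {"polyp_present": present, "polyp_count": count}


def abstract_report(report: str, llm: Callable[[str], str]) -> dict:
    """Hypothetical helper: prompt an LLM with one report and validate its reply."""
    return parse_reply(llm(build_prompt(report)))


if __name__ == "__main__":
    # Stub in place of a real LLM client, returning a fixed reply.
    stub = lambda prompt: '{"polyp_present": true, "polyp_count": 2}'
    print(abstract_report("Two sessile polyps in the sigmoid colon.", stub))
```

Validating the reply strictly, and rejecting inconsistent answers instead of guessing, keeps malformed model outputs from silently becoming training labels.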

**Fig. 1: Overview of the EndoKED design and applications to polyp diagnosis.**

**Fig. 2: Network design for cross-scale knowledge distillation.**

**Fig. 3: Datasets for training and validating the EndoKED framework.**

**Fig. 4: Performance of the EndoKED framework in report abstraction, image-level detection, pixel-level segmentation and model pretraining.**

**Fig. 5: Accuracy and data efficiency of the EndoKED-PATH optical biopsy model.**

**Fig. 6: Visualization of the distribution of polyp images in the feature space using Uniform Manifold Approximation and Projection (UMAP).**
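
Fig. 2 concerns cross-scale knowledge distillation, that is, moving from labels attached to a whole record to labels for individual images. The reference list cites attention-based multiple instance learning (Ilse et al.), and a minimal pure-Python version of that pooling is sketched below, assuming each record is treated as a bag of per-image feature vectors that carries one report-level label. The weights `V`, `w`, `c` and `b` are hand-set toy values standing in for learned parameters; this illustrates the general technique and is not the EndoKED network.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    return [dot(row, v) for row in m]


def softmax(xs: Sequence[float]) -> Vector:
    shift = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_mil(
    bag: Sequence[Sequence[float]], V: Matrix, w: Sequence[float],
    c: Sequence[float], b: float,
) -> Tuple[float, Vector]:
    """Attention-based MIL pooling in the style of Ilse et al. (2018).

    bag: one feature vector h_k per image in a colonoscopy record.
    Returns the bag-level polyp probability and the attention weight a_k of each image.
    """
    # Unnormalized attention score per image: w^T tanh(V h_k).
    scores = [dot(w, [math.tanh(x) for x in matvec(V, h)]) for h in bag]
    weights = softmax(scores)
    # Bag embedding z = sum_k a_k h_k.
    z = [sum(a * h[i] for a, h in zip(weights, bag)) for i in range(len(bag[0]))]
    # Logistic classifier on z gives the report-level prediction.
    prob = 1.0 / (1.0 + math.exp(-(dot(c, z) + b)))
    return prob, weights


if __name__ == "__main__":
    # Toy record with three 2-D image features; the second image stands out.
    bag = [[0.1, 0.0], [2.0, 1.5], [0.0, 0.2]]
    V = [[1.0, 0.0], [0.0, 1.0]]  # hand-set here, normally learned
    prob, weights = attention_mil(bag, V, w=[1.0, 1.0], c=[1.0, 1.0], b=-1.0)
    print(round(prob, 3), [round(a, 3) for a in weights])
```

In training, the bag probability would be fitted to the report-level label (for example with binary cross-entropy), and the attention weights would then rank which images in the record most likely show a polyp.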

Data availability

The main data supporting the findings of this study are available within the article and Supplementary Information. The public datasets used for segmentation can be accessed through the following web links: CVC-ClinicDB (https://polyp.grand-challenge.org/CVCClinicDB/), Kvasir-SEG (https://datasets.simula.no/kvasir-seg/), ETIS (https://polyp.grand-challenge.org/ETISLarib/), CVC-ColonDB (http://vi.cvc.uab.es/colon-qa/cvccolondb/), CVC-300 (https://pages.cvc.uab.es/CVC-Colon/) and PolypGen (https://www.synapse.org/#!Synapse:syn26376615). The large-scale clinical pretraining datasets are not publicly available owing to patient privacy constraints. Access to these restricted colonoscopy reports for non-commercial use may be granted upon reasonable request to the corresponding author, S.W. All requests will be reviewed and are subject to a formal data sharing agreement, consent from each participating medical centre and ethics approval. Source data are provided with this paper.

Code availability

Code⁵¹ and trained models of this study are publicly available via GitHub at https://github.com/shuowang26/EndoKED.

References

1. Arnold, M. et al. Global patterns and trends in colorectal cancer incidence and mortality. Gut 66, 683–691 (2017).
2. Bray, F. et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 68, 394–424 (2018).
3. Atkin, W. et al. Long term effects of once-only flexible sigmoidoscopy screening after 17 years of follow-up: the UK Flexible Sigmoidoscopy Screening randomised controlled trial. Lancet 389, 1299–1311 (2017).
4. Le Berre, C. et al. Application of artificial intelligence to gastroenterology and hepatology. Gastroenterology 158, 76–94.e2 (2020).
5. Ali, S. Where do we stand in AI for endoscopic image analysis? Deciphering gaps and future directions. NPJ Digit. Med. 5, 184 (2022).
6. Messmann, H. et al. Expected value of artificial intelligence in gastrointestinal endoscopy: European Society of Gastrointestinal Endoscopy (ESGE) Position Statement. Endoscopy 54, 1211–1231 (2022).
7. Ali, S. et al. A multi-centre polyp detection and segmentation dataset for generalisability assessment. Sci. Data 10, 75 (2023).
8. Ali, S. et al. Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge. Sci. Rep. 14, 2032 (2024).
9. Ali, S. et al. Deep learning for detection and segmentation of artefact and disease instances in gastrointestinal endoscopy. Med. Image Anal. 70, 102002 (2021).
10. Ahmad, O. F. et al. Artificial intelligence and computer-aided diagnosis in colonoscopy: current evidence and future directions. Lancet Gastroenterol. Hepatol. 4, 71–80 (2019).
11. Gupta, M. & Mishra, A. A systematic review of deep learning based image segmentation to detect polyp. Artif. Intell. Rev. 57, 7 (2024).
12. Lieberman, D. et al. Standardized colonoscopy reporting and data system: report of the Quality Assurance Task Group of the National Colorectal Cancer Roundtable. Gastrointest. Endosc. 65, 757–766 (2007).
13. Bernal, J. et al. Comparative validation of polyp detection methods in video colonoscopy: results from the MICCAI 2015 Endoscopic Vision Challenge. IEEE Trans. Med. Imaging 36, 1231–1249 (2017).
14. Hassan, C., Balsamo, G., Lorenzetti, R., Zullo, A. & Antonelli, G. Artificial intelligence allows leaving-in-situ colorectal polyps. Clin. Gastroenterol. Hepatol. 20, 2505–2513.e4 (2022).
15. Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning (eds Meila, M. et al.) 8748–8763 (PMLR, 2021).
16. Tiu, E. et al. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nat. Biomed. Eng. 6, 1399–1406 (2022).
17. Zhang, X., Wu, C., Zhang, Y., Xie, W. & Wang, Y. Knowledge-enhanced visual-language pre-training on chest radiology images. Nat. Commun. 14, 4542 (2023).
18. Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023).
19. Adams, L. C. et al. Leveraging GPT-4 for post hoc transformation of free-text radiology reports into structured reporting: a multilingual feasibility study. Radiology 307, e230725 (2023).
20. Kirillov, A. et al. Segment anything. In Proc. IEEE/CVF International Conference on Computer Vision (eds Agapito, L. et al.) 4015–4026 (IEEE, 2023).
21. Wu, J. et al. Medical SAM adapter: adapting Segment Anything Model for medical image segmentation. Med. Image Anal. 102, 103547 (2025).
22. Brown, T. et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems (eds Larochelle, H. et al.) 1877–1901 (Curran Associates, 2020).
23. Bai, Y. et al. Constitutional AI: harmlessness from AI feedback. Preprint at https://arxiv.org/abs/2212.08073 (2022).
24. Wang, W. et al. ERNIE 3.0: large-scale knowledge enhanced pre-training for language understanding and generation. Preprint at https://arxiv.org/abs/2107.02137 (2021).
25. Pogorelov, K. et al. KVASIR: a multi-class image dataset for computer aided gastrointestinal disease detection. In Proc. 8th ACM on Multimedia Systems Conference (eds Chen, K. T. et al.) 164–169 (ACM, 2017).
26. Bernal, J. et al. WM-DOVA maps for accurate polyp highlighting in colonoscopy: validation vs. saliency maps from physicians. Comput. Med. Imaging Graph. 43, 99–111 (2015).
27. Bernal, J., Sánchez, J. & Vilariño, F. Towards automatic polyp detection with a polyp appearance model. Pattern Recognit. 45, 3166–3182 (2012).
28. Vázquez, D. et al. A benchmark for endoluminal scene segmentation of colonoscopy images. J. Healthc. Eng. 2017, 4037190 (2017).
29. Silva, J., Histace, A., Romain, O., Dray, X. & Granado, B. Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer. Int. J. Comput. Assist. Radiol. Surg. 9, 283–293 (2014).
30. Falk, T. et al. U-Net: deep learning for cell counting, detection, and morphometry. Nat. Methods 16, 67–70 (2019).
31. Zhou, Z. et al. UNet++: a nested U-Net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support (eds Stoyanov, D. et al.) 3–11 (Springer, 2018).
32. Lei, J. et al. C2FNet: a coarse-to-fine network for multi-view 3D point cloud generation. IEEE Trans. Image Process. 31, 6707–6718 (2022).
33. Yin, Z. Duplex contextual relation network for polyp segmentation. In Proc. IEEE International Symposium on Biomedical Imaging (eds Awate, S. P. et al.) 1–5 (IEEE, 2022).
34. Zhang, R. et al. Lesion-aware dynamic kernel for polyp segmentation. In Proc. Medical Image Computing and Computer Assisted Intervention (eds Wang, L. et al.) 99–109 (Springer, 2022).
35. Dong, B. et al. Polyp-PVT: polyp segmentation with pyramid vision transformers. CAAI Artif. Intell. Res. 2, 9150015 (2023).
36. Misawa, M. et al. Development of a computer-aided detection system for colonoscopy and a publicly accessible large colonoscopy video database (with video). Gastrointest. Endosc. 93, 960–967.e3 (2021).
37. Ma, Y., Chen, X., Cheng, K., Li, Y. & Sun, B. LDPolypVideo benchmark: a large-scale colonoscopy video dataset of diverse polyps. In Proc. Medical Image Computing and Computer Assisted Intervention (eds de Bruijne, M. et al.) 387–396 (Springer, 2021).
38. Wang, H. et al. EndoBoost: a plug-and-play module for false positive suppression during computer-aided polyp detection in real-world colonoscopy (with dataset). Preprint at https://arxiv.org/abs/2212.12204 (2022).
39. Jha, D. et al. Kvasir-seg: a segmented polyp dataset. In International Conference on Multimedia Modeling (eds Ro, Y. M. et al.) 451–462 (Springer, 2020).
40. Ren, G. et al. Towards automated polyp segmentation using weakly- and semi-supervised learning and deformable transformers. In Proc. IEEE/CVF International Conference on Computer Vision (eds Agapito, L. et al.) 4355–4364 (IEEE, 2023).
41. Cheng, J. et al. SAM-Med2D. Preprint at https://arxiv.org/abs/2308.16184 (2023).
42. Mazurowski, M. A. et al. Segment Anything Model for medical image analysis: an experimental study. Med. Image Anal. 89, 102918 (2023).
43. Zhang, Y. et al. SamDSK: combining Segment Anything Model with domain-specific knowledge for semi-supervised learning in medical image segmentation. In Chinese Conference on Pattern Recognition and Computer Vision (eds Lin, Z. et al.) 343–357 (Springer, 2024).
44. Ahmad, O. F. et al. Establishing key research questions for the implementation of artificial intelligence in colonoscopy: a modified Delphi method. Endoscopy 53, 893–901 (2021).
45. Chen, P.-J. et al. Accurate classification of diminutive colorectal polyps using computer-aided analysis. Gastroenterology 154, 568–575 (2018).
46. Byrne, M. F. et al. Real-time differentiation of adenomatous and hyperplastic diminutive colorectal polyps during analysis of unaltered videos of standard colonoscopy using a deep learning model. Gut 68, 94–100 (2019).
47. Genta, R. M. & Sonnenberg, A. Big data in gastroenterology research. Nat. Rev. Gastroenterol. Hepatol. 11, 386–390 (2014).
48. Ilse, M., Tomczak, J. & Welling, M. Attention-based deep multiple instance learning. In Proc. 35th International Conference on Machine Learning (eds Dy, J. et al.) 2127–2136 (PMLR, 2018).
49. Li, B., Li, Y. & Eliceiri, K. W. Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (eds Forsyth, D. et al.) 14318–14328 (IEEE, 2021).
50. Ru, L., Zheng, H., Zhan, Y. & Du, B. Token contrast for weakly-supervised semantic segmentation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (eds Geiger, A. et al.) 3093–3102 (IEEE, 2023).
51. Yang, Z. & Wang, S. shuowang26/EndoKED: zendo (Version 20250724). Zenodo https://doi.org/10.5281/zenodo.16370348/ (2025).

Acknowledgements

S.W. was supported in part by the National Key Research and Development Program of China (number 2024YFF1207500), the Shanghai Municipal Education Commission Project for Promoting Research Paradigm Reform and Empowering Disciplinary Advancement through Artificial Intelligence (24RGZNC01) and the International Science and Technology Cooperation Program of the Shanghai Action Plan for Science (23410710400). Y. Zhu acknowledges support from the National Natural Science Foundation of China (82203193). Q.L. was supported by the National Natural Science Foundation of China (82170555) and the Shanghai Academic/Technology Research Leader Program (22XD1422400).

Author information

These authors contributed equally: Shuo Wang, Yan Zhu, Zhiwei Yang, Xiaoyuan Luo.

Authors and Affiliations

1. Digital Medical Research Centre, School of Basic Medical Sciences, Fudan University, Shanghai, China
Shuo Wang, Zhiwei Yang, Xiaoyuan Luo, Haoran Wang, Manning Wang & Zhijian Song
2. Shanghai Key Laboratory of MICCAI, Shanghai, China
Shuo Wang, Zhiwei Yang, Xiaoyuan Luo, Haoran Wang, Manning Wang & Zhijian Song
3. Data Science Institute, Imperial College London, London, UK
Shuo Wang
4. Shanghai Collaborative Innovation Centre of Endoscopy, Shanghai, China
Shuo Wang, Yan Zhu, Peiyao Fu, Quanlin Li & Pinghong Zhou
5. Endoscopy Centre and Endoscopy Research Institute, Zhongshan Hospital, Fudan University, Shanghai, China
Yan Zhu, Peiyao Fu, Quanlin Li & Pinghong Zhou
6. Academy for Engineering and Technology, Fudan University, Shanghai, China
Zhiwei Yang
7. School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
Yizhe Zhang
8. Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China
Yike Guo

Contributions

S.W., Y. Zhu, Q.L., P.Z. and Y.G. conceptualized the study. S.W. and Y. Zhu led the study design, data analysis and paper drafting. Z.Y., X.L., P.F., H.W. and Y. Zhang also performed data analysis and interpretation, while Z.Y. and X.L. assisted with drafting the paper. M.W., Z.S., Q.L., P.Z. and Y.G. supervised the project and critically revised the paper.

Corresponding authors

Correspondence to Shuo Wang, Quanlin Li or Pinghong Zhou.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethics statement

The study was approved by the ethics committees of Zhongshan Hospital, Xiamen Branch of Zhongshan Hospital, Zhengzhou Central Hospital and No. 988 Hospital of Joint Logistics Support Force.

Peer review

Peer review information

Nature Biomedical Engineering thanks Sharib Ali, Thomas de Lange and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Examples of polyp segmentation on challenging samples.

Qualitative comparison of nine state-of-the-art segmentation models (columns; model names at bottom) on challenging colonoscopy images (top row). The second row shows the ground-truth segmentations (expert annotation), and the third row shows the output of the EndoKED-SEG model for reference. The fourth row shows the baseline performance of each model without EndoKED pre-training, and the fifth row shows the same models after pre-training with the distilled EndoKED annotations. The pre-trained models consistently produce more accurate and robust segmentation masks that are closer to the expert annotations, with fewer false positives and artifacts.
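
The caption compares predicted masks with expert annotations. It does not name the overlap metrics used, but Dice and intersection over union (IoU) are the usual choices for polyp segmentation; the pure-Python sketch below computes both for binary masks given as nested lists. Treating two empty masks as a perfect match (score 1) is a convention assumed here.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask: 1 = polyp pixel, 0 = background


def overlap_counts(pred: Mask, gt: Mask) -> Tuple[int, int, int]:
    """Count true-positive, false-positive and false-negative pixels."""
    if len(pred) != len(gt) or any(len(p) != len(g) for p, g in zip(pred, gt)):
        raise ValueError("masks must have the same shape")
    tp = fp = fn = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            if p and g:
                tp += 1
            elif p:
                fp += 1
            elif g:
                fn += 1
    return tp, fp, fn


def dice(pred: Mask, gt: Mask) -> float:
    """Dice = 2|P and G| / (|P| + |G|)."""
    tp, fp, fn = overlap_counts(pred, gt)
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else 2 * tp / denom  # both empty: perfect match


def iou(pred: Mask, gt: Mask) -> float:
    """IoU = |P and G| / |P or G|."""
    tp, fp, fn = overlap_counts(pred, gt)
    denom = tp + fp + fn
    return 1.0 if denom == 0 else tp / denom


if __name__ == "__main__":
    gt = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]
    pred = [[0, 1, 1], [0, 1, 0], [0, 1, 0]]
    print(dice(pred, gt), iou(pred, gt))  # prints 0.75 0.6
```

Dice weights the overlap twice and is therefore always at least as large as IoU for the same pair of masks, so the two numbers should not be compared directly across papers that report different metrics.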

Supplementary information

Supplementary Information (PDF)

Supplementary Notes 1–5, Fig. 1 and Tables 1–9.

Reporting Summary (PDF)

Peer Review File (PDF)

Source data

Source Data Fig. 4 (XLSX)

Precision–recall (PR) curve data for Fig. 4c,d.

Source Data Fig. 5 (XLSX)

ROC curve data for Fig. 5a and data-efficiency curve data for Fig. 5b.

Source Data Fig. 6 (XLSX)

Embedding UMAP data for Fig. 6.
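
The source data for Figs. 4 and 5 include precision–recall and ROC curve data. The column layout of the XLSX files is not described on this page, so the sketch below assumes per-sample binary labels and predicted scores and computes the two usual summary numbers: ROC AUC in its rank (Mann–Whitney) formulation and non-interpolated average precision. Both run in plain Python; the pairwise AUC loop is quadratic and meant only for small samples.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as P(score of a random positive > score of a random negative); ties count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Non-interpolated AP: mean of the precision at the rank of each positive.

    Tied scores keep their input order (sorted() is stable), so AP can vary with ties.
    """
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        raise ValueError("need at least one positive sample")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this cut-off
    return total / n_pos


if __name__ == "__main__":
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.1]
    print(roc_auc(labels, scores), round(average_precision(labels, scores), 4))  # prints 0.75 0.8333
```

AP is usually the more informative of the two when positives are rare, as polyp-containing images are in routine colonoscopy streams, because it ignores the large pool of easy true negatives.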

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, S., Zhu, Y., Yang, Z. et al. Leveraging large language and vision models for knowledge extraction from large-scale image–text colonoscopy records. Nat. Biomed. Eng. (2025). https://doi.org/10.1038/s41551-025-01500-x

Received: 11 October 2023
Accepted: 06 August 2025
Published: 16 September 2025
Version of record: 16 September 2025
DOI: https://doi.org/10.1038/s41551-025-01500-x