Abstract
Foundation models in artificial intelligence are revolutionizing healthcare by utilizing large-scale unlabelled data for pretraining. However, their intraoperative applications remain underexplored owing to limited surgical data and the challenges of real-time deployment. Here we report the ophthalmic video foundation model (OVFM), designed for recognition and navigation in microscopic ophthalmic surgery. Built on a self-supervised video transformer architecture and trained on an ophthalmic video dataset comprising 1.1 million clips across 144 surgical types, OVFM learns the spatiotemporal motion features of ophthalmic procedures. We demonstrate OVFM’s superior performance across seven downstream tasks. To enable real-time use, we applied knowledge distillation, reducing the model’s size while retaining its accuracy and allowing deployment on surgical microscope units. In cataract surgeries performed by ten surgeons on wet-lab porcine eyes, the OVFM-powered system enhanced surgical performance and reduced skill gaps, demonstrating notable potential for real-time, intraoperative applications across various surgical fields.
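The compression step mentioned in the abstract relies on knowledge distillation, in which a small student model learns to mimic a large pretrained teacher. The sketch below is a minimal, generic illustration of logit-based distillation in PyTorch; the model classes, clip shape, temperature and loss weighting are illustrative assumptions, not the authors’ actual OVFM configuration.

```python
# Minimal sketch of logit-based knowledge distillation, the general technique used to
# compress a large video model for real-time deployment. All names, sizes and
# hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVideoEncoder(nn.Module):
    """Stand-in for a video model; a real backbone would be a video transformer."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 32 * 32, dim), nn.GELU())
        self.head = nn.Linear(dim, num_classes)

    def forward(self, clips):
        return self.head(self.backbone(clips))

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets from the teacher (temperature-scaled KL) plus hard-label cross-entropy.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

teacher = TinyVideoEncoder(dim=256, num_classes=4).eval()   # stands in for the frozen, pretrained teacher
student = TinyVideoEncoder(dim=64, num_classes=4)           # smaller model intended for deployment
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

clips = torch.randn(2, 3, 8, 32, 32)     # (batch, channels, frames, H, W) toy clips
labels = torch.randint(0, 4, (2,))
with torch.no_grad():
    t_logits = teacher(clips)
loss = distillation_loss(student(clips), t_logits, labels)
loss.backward()
optimizer.step()
```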
Data availability
The Cataract-1K dataset for pretraining is available at https://www.synapse.org/Synapse:syn53404507. The Cataract-1K surgical step recognition dataset is available at https://www.synapse.org/Synapse:syn53395146. The CATARACTS dataset, used for tool presence recognition, is available at https://ieee-dataport.org/open-access/cataracts. The Cataract-1K complications detection dataset is available at https://www.synapse.org/Synapse:syn53395402, and the Cataract-1K surgical scene segmentation dataset is available at https://www.synapse.org/Synapse:syn53395479. All in-house datasets are available upon request from the corresponding authors, subject to a signed Data Use Agreement and Non-Disclosure Agreement with the respective hospital. Source data are provided with this paper.
Code availability
The code is publicly available at https://github.com/puxuntu/OVFM.
Acknowledgements
This work was supported by grants from the National Natural Science Foundation of China (82330063 to X.C., 823B2045 to P.T. and 82571270 to C.Z.), the Foundation of Science and Technology Commission of Shanghai Municipality (24490710300 to X.C.), the Explorers Program of Shanghai (Basic Research Funding, number 24TS1413000 to X.C.), the Shanghai Leading Talent Program of Eastern Talent Plan (BJKJ2024003 to X.C.), the Hospital Funded Clinical Research, Xinhua Hospital Affiliated to Shanghai Jiao Tong University School of Medicine (21XJMR02 to C.Z.), and the China Postdoctoral Science Foundation (2025M772924 to P.T.).
Author information
Authors and Affiliations
Contributions
P.T., C.Z., P.Z., M.Z. and X.C. conceptualized the project. P.T., C.Z. and X.C. acquired the funding. P.T., Y. Wei, C.W. and X.C. developed the algorithms and the system. P.T., X.X., J.L., M.X., J.G., Y. Wang, C.W. and X.C. designed and performed the experiments. P.T., C.Z., Y. Wei, H.C. and X.C. analysed the data. S.Y., K.Q., J.C., W.M., X.Z., D.J., K.P., W.Z. and F.T. constructed the dataset. P.T., C.Z. and X.C. wrote and edited the paper.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Biomedical Engineering thanks the anonymous reviewers for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Geographic distribution of datasets used for pretraining.
Seven in-house datasets were collected from different medical centers across China, while one publicly available dataset was sourced from Austria.
Extended Data Fig. 2 Characteristics of the pretraining dataset.
a, Distribution of video durations in the dataset. The left histogram shows the number of videos for each duration interval (in minutes). A magnified view of the long-tail distribution is provided for video durations exceeding 200 minutes. b, Age distribution of patients associated with the surgical videos. The histogram shows the number of videos across different patient age groups, excluding cases where age was not recorded. c, Sex distribution of patients featured in the dataset.
Extended Data Fig. 3 Frequency of videos per calendar year.
Panels a–g present the Xinhua-pretrain, ST-pretrain, Aier-SH-pretrain, Aier-HZ-pretrain, Aier-JF-pretrain, GZ-pretrain, and Kandze-pretrain datasets, respectively.
Extended Data Fig. 4 Schematics of the seven downstream tasks used to evaluate OVFM.
Spatiotemporal-level tasks include surgical step recognition, tool presence recognition, complication detection, and surgical skill assessment, where OVFM is connected to a linear layer for classification. Spatial-level tasks include surgical scene segmentation, limbus boundary segmentation, and nucleus localization, where OVFM is connected to a decoder.
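As a rough illustration of the head configurations described in this caption, the sketch below attaches a linear classification head and a lightweight segmentation decoder to a shared video encoder. The encoder, layer sizes and pooling scheme are placeholder assumptions, not the actual OVFM architecture.

```python
# Illustrative sketch of a shared video encoder feeding two kinds of task heads:
# a linear classifier for clip-level tasks and a small decoder for pixel-level tasks.
# All modules and dimensions are assumptions for demonstration only.
import torch
import torch.nn as nn

class DummyEncoder(nn.Module):
    """Placeholder backbone returning a clip embedding and a spatial feature map."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Conv3d(3, dim, kernel_size=3, padding=1)

    def forward(self, clips):                      # clips: (B, 3, T, H, W)
        feats = self.conv(clips)                   # (B, D, T, H, W)
        spatial = feats.mean(dim=2)                # pool over time -> (B, D, H, W)
        pooled = spatial.mean(dim=(2, 3))          # global pool   -> (B, D)
        return pooled, spatial

class LinearHead(nn.Module):                       # e.g. surgical step recognition
    def __init__(self, dim=256, num_classes=4):
        super().__init__()
        self.fc = nn.Linear(dim, num_classes)
    def forward(self, pooled):
        return self.fc(pooled)

class SegDecoder(nn.Module):                       # e.g. scene or limbus segmentation
    def __init__(self, dim=256, num_classes=2):
        super().__init__()
        # A real decoder would also upsample back to the input resolution.
        self.proj = nn.Conv2d(dim, num_classes, kernel_size=1)
    def forward(self, spatial):
        return self.proj(spatial)

encoder = DummyEncoder().eval()                    # the backbone can be frozen or fine-tuned
clips = torch.randn(1, 3, 8, 64, 64)
pooled, spatial = encoder(clips)
step_logits = LinearHead()(pooled)                 # (1, 4) clip-level class logits
seg_logits = SegDecoder()(spatial)                 # (1, 2, 64, 64) per-pixel logits
```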
Extended Data Fig. 5 Extended performance evaluation of OVFM.
ROC curves for surgical step recognition on the Xinhua-Cata dataset (n = 23 videos) (a) and Aier-Cata dataset (n = 26 videos) (b), comparing OVFM with four models across four steps: incision, capsulorhexis, lens implant, and others. Solid lines show empirical ROC curves from the full test set; shaded regions denote 95% bootstrap CI.
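The ROC curves in this figure are drawn with bootstrap confidence bands. The sketch below shows one common way to obtain such a band: compute the empirical ROC curve with scikit-learn and interpolate bootstrap resamples onto a shared false-positive-rate grid. The data are synthetic and the resampling unit here is the individual sample rather than the video, so this is not the paper’s exact procedure.

```python
# Sketch of an empirical ROC curve with a bootstrapped 95% confidence band.
# Labels and scores are toy data; a real analysis would use model outputs.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)                       # toy binary labels
scores = labels * 0.4 + rng.normal(0.5, 0.3, size=200)      # toy model scores

fpr, tpr, _ = roc_curve(labels, scores)                     # empirical ROC on the full set

grid = np.linspace(0, 1, 101)                               # common FPR grid
tprs = []
for _ in range(1000):
    idx = rng.integers(0, len(labels), size=len(labels))    # bootstrap resample
    if len(np.unique(labels[idx])) < 2:
        continue                                            # need both classes present
    f, t, _ = roc_curve(labels[idx], scores[idx])
    tprs.append(np.interp(grid, f, t))
lo, hi = np.percentile(np.array(tprs), [2.5, 97.5], axis=0) # 95% band over the grid
print(f"TPR at FPR = 0.5: {np.interp(0.5, fpr, tpr):.2f} (95% CI {lo[50]:.2f}-{hi[50]:.2f})")
```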
Extended Data Fig. 6 Effects of surgical data diversity and fine-tuning strategies on OVFM performance.
a, Comparison of fine-tuning strategies for OVFM across multiple tasks. Two strategies, freezing the backbone versus fine-tuning the backbone, are evaluated on surgical step recognition (n = 28 videos), complication detection (n = 49 videos), surgical skill assessment (n = 82 videos), surgical scene segmentation (n = 10 videos), limbus segmentation (n = 28 videos) and nucleus block localization (n = 20 videos). b, Fine-tuning strategy comparison for tool presence recognition. The same two strategies are evaluated for tool presence recognition (n = 25 videos), with detailed performance reported per tool type. c–f, Impact of surgical type diversity on OVFM performance in surgical step recognition (n = 28 videos). OVFM pretrained on only anterior segment video data, only posterior segment video data, or a combination of both is compared. Box plots summarize the distribution of performance metrics across clustered bootstrap iterations: the center line denotes the median, the box spans the interquartile range (25th–75th percentiles), and whiskers extend to the minimum and maximum values within 1.5 × the interquartile range. Statistical comparisons were performed using a two-sided paired clustered bootstrap hypothesis test.
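For readers unfamiliar with the test named in this caption, the sketch below shows a minimal two-sided paired clustered bootstrap comparison: whole videos (the clusters) are resampled with replacement and a paired per-video metric difference is re-evaluated at each iteration. The per-video scores and the null-centering approximation are illustrative assumptions, not the authors’ exact implementation.

```python
# Minimal sketch of a two-sided paired clustered bootstrap test, with videos as clusters.
# Scores are toy data; the metric, iteration count and seed are assumptions.
import numpy as np

def paired_clustered_bootstrap(metric_a, metric_b, n_iter=10_000, seed=0):
    """metric_a, metric_b: per-video scores for two models, in the same video order."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(metric_a) - np.asarray(metric_b)
    n = len(diffs)
    observed = diffs.mean()
    boot = np.empty(n_iter)
    for i in range(n_iter):
        idx = rng.integers(0, n, size=n)       # resample videos (clusters) with replacement
        boot[i] = diffs[idx].mean()
    # Two-sided p-value: shift the bootstrap distribution to the null (zero mean
    # difference) and count how often it is at least as extreme as the observed mean.
    shifted = boot - observed
    p = (np.abs(shifted) >= abs(observed)).mean()
    ci = np.percentile(boot, [2.5, 97.5])
    return observed, p, ci

a = np.array([0.91, 0.88, 0.93, 0.90, 0.87])   # toy per-video scores, model A
b = np.array([0.89, 0.86, 0.92, 0.88, 0.86])   # toy per-video scores, model B
print(paired_clustered_bootstrap(a, b, n_iter=2000))
```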
Extended Data Fig. 7 Performance analysis of the distilled OVFM model on retrospective clinical video cases.
This figure presents the surgical step recognition and limbus boundary segmentation performance of the distilled OVFM model on two clinical cases (a and b). For each case, the left panel displays the confusion matrix for surgical step recognition, which illustrates the relationship between the model’s predictions and the annotated labels for each surgical step. Each cell reports both the actual count (outside the parentheses) and the proportion relative to the total number of samples for that class (inside the parentheses, ranging from 0 to 1). The right panel presents qualitative results for surgical step recognition and limbus boundary segmentation. The color bar illustrates the alignment between predicted surgical steps and ground truth annotations. Limbus boundary segmentation overlays predicted boundaries (red) with ground truth (green) on the original video frames.
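The confusion-matrix cell format described here (a raw count plus a row-normalized proportion) can be reproduced with a few lines of NumPy. The sketch below uses toy step labels and predictions and only demonstrates the counting and normalization, not the paper’s evaluation code.

```python
# Sketch of a confusion matrix annotated with raw counts and per-class (row-normalized)
# proportions, matching the cell format described in the caption. Data are toy examples.
import numpy as np

def annotated_confusion(y_true, y_pred, classes):
    k = len(classes)
    counts = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    row_totals = counts.sum(axis=1, keepdims=True).clip(min=1)
    proportions = counts / row_totals              # proportion within each true class
    for i, name in enumerate(classes):
        cells = [f"{counts[i, j]} ({proportions[i, j]:.2f})" for j in range(k)]
        print(f"{name:>14}: " + "  ".join(cells))

steps = ["incision", "capsulorhexis", "lens implant", "others"]
y_true = [0, 0, 1, 1, 2, 3, 3, 3]
y_pred = [0, 1, 1, 1, 2, 3, 3, 0]
annotated_confusion(y_true, y_pred, steps)
```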
Extended Data Fig. 8 Intraoperative comparisons of incision and capsulorhexis steps with and without navigation.
a, Incision step comparison. Representative surgical scenes for two surgeons, comparing performance with and without navigation. The intraoperative incision targets and post-surgery incision lines are shown, with zoomed-in views highlighting discrepancies. b, Capsulorhexis step comparison. Surgical scenes from the capsulorhexis step, demonstrating the differences in performance with and without navigation. Target capsulorhexis ranges and actual outcomes are compared post-surgery.
Supplementary information
Supplementary Information (PDF)
Supplementary Notes 1–16, Tables 1–11 and Figs. 1 and 2.
Supplementary Video 1 (MP4)
The diversity of ophthalmic surgical types in the pretraining dataset.
Supplementary Video 2 (MP4)
OVFM-guided navigation scenes in porcine eye surgeries.
Source data
Source Data Fig. 1 (XLSX)
Statistical source data.
Source Data Fig. 2 (XLSX)
Statistical source data.
Source Data Fig. 3 (XLSX)
Statistical source data.
Source Data Fig. 4 (XLSX)
Statistical source data.
Source Data Fig. 5 (XLSX)
Statistical source data.
Source Data Extended Data Fig. 1 (XLSX)
Statistical source data.
Source Data Extended Data Fig. 2 (XLSX)
Statistical source data.
Source Data Extended Data Fig. 3 (XLSX)
Statistical source data.
Source Data Extended Data Fig. 5 (XLSX)
Statistical source data.
Source Data Extended Data Fig. 6 (XLSX)
Statistical source data.
Source Data Extended Data Fig. 7 (XLSX)
Statistical source data.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Tu, P., Zheng, C., Xie, X. et al. An ophthalmic video foundation model for surgical recognition and navigation with wet-lab porcine eye validation. Nat. Biomed. Eng. (2026). https://doi.org/10.1038/s41551-026-01622-w
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41551-026-01622-w


