An ophthalmic video foundation model for surgical recognition and navigation with wet-lab porcine eye validation

Abstract

Foundation models in artificial intelligence are revolutionizing healthcare by leveraging large-scale unlabelled data for pretraining. However, their intraoperative applications remain underexplored owing to limited surgical data and the challenges of real-time deployment. Here we report the development of the ophthalmic video foundation model (OVFM), designed for microscopic ophthalmic surgical recognition and navigation. Built on a self-supervised video transformer architecture and pretrained on an ophthalmic video dataset comprising 1.1 million clips across 144 surgical types, OVFM learns the spatiotemporal motion features of ophthalmic procedures. We demonstrate OVFM's superior performance across seven downstream tasks. To enable real-time use, we applied knowledge distillation, reducing the model's size while retaining its accuracy and allowing deployment on surgical microscope units. In cataract surgeries performed by ten surgeons on wet-lab porcine eyes, the OVFM-powered system enhanced surgical performance and narrowed skill gaps, demonstrating notable potential for real-time intraoperative applications across various surgical fields.
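
The real-time deployment described above relies on knowledge distillation, in which a compact student network is trained to mimic a large pretrained teacher. The following is a minimal sketch of standard logit-based distillation only; the temperature, loss weighting and toy flattened-clip modules are illustrative assumptions, not the authors' implementation.

    # Minimal, hypothetical sketch of logit-based knowledge distillation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
        """Soft-target KL term (scaled by T^2) blended with hard-label cross-entropy."""
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

    # Toy stand-ins: a frozen teacher and a trainable student over flattened clips.
    teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 56 * 56, 4)).eval()
    student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 56 * 56, 4))

    clips = torch.randn(2, 3, 8, 56, 56)   # (batch, channels, frames, height, width)
    labels = torch.tensor([0, 2])
    with torch.no_grad():
        teacher_logits = teacher(clips)    # teacher supervises without gradients
    loss = distillation_loss(student(clips), teacher_logits, labels)
    loss.backward()

In the paper's setting, the teacher would correspond to the full OVFM and the student to the smaller model deployed on the surgical microscope unit.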

Fig. 1: Overview of this study.
Fig. 2: Performance evaluation of OVFM across downstream tasks.
Fig. 3: Performance evaluation of OVFM distillation.
Fig. 4: Integration of OVFM into the surgical microscope.
Fig. 5: User study on wet-lab porcine eyes using the OVFM-powered surgical microscope.

Data availability

The Cataract-1K dataset for pretraining is available at https://www.synapse.org/Synapse:syn53404507. The Cataract-1K surgical step recognition dataset is available at https://www.synapse.org/Synapse:syn53395146. The CATARACTS dataset, used for tool presence recognition, is available at https://ieee-dataport.org/open-access/cataracts. The Cataract-1K complications detection dataset is available at https://www.synapse.org/Synapse:syn53395402, and the Cataract-1K surgical scene segmentation dataset is available at https://www.synapse.org/Synapse:syn53395479. All in-house datasets are available upon request from the corresponding authors, subject to a signed Data Use Agreement and Non-Disclosure Agreement with the respective hospital. Source data are provided with this paper.

Code availability

The code is publicly available at https://github.com/puxuntu/OVFM.

Acknowledgements

This work was supported by grants from the National Natural Science Foundation of China (82330063 to X.C., 823B2045 to P.T. and 82571270 to C.Z.), the Foundation of Science and Technology Commission of Shanghai Municipality (24490710300 to X.C.), the Explorers Program of Shanghai (Basic Research Funding, number 24TS1413000 to X.C.), the Shanghai Leading Talent Program of Eastern Talent Plan (BJKJ2024003 to X.C.), the Hospital Funded Clinical Research, Xinhua Hospital Affiliated to Shanghai Jiao Tong University School of Medicine (21XJMR02 to C.Z.), and the China Postdoctoral Science Foundation (2025M772924 to P.T.).

Author information

Contributions

P.T., C.Z., P.Z., M.Z. and X.C. conceptualized the project. P.T., C.Z. and X.C. acquired the funding. P.T., Y. Wei, C.W. and X.C. developed the algorithms and the system. P.T., X.X., J.L., M.X., J.G., Y. Wang, C.W. and X.C. designed and performed the experiments. P.T., C.Z., Y. Wei, H.C. and X.C. analysed the data. S.Y., K.Q., J.C., W.M., X.Z., D.J., K.P., W.Z. and F.T. constructed the dataset. P.T., C.Z. and X.C. wrote and edited the paper.

Corresponding authors

Correspondence to Ce Zheng, Peiquan Zhao, Mingzhi Zhang or Xiaojun Chen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Biomedical Engineering thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Geographic distribution of datasets used for pretraining.

Seven in-house datasets were collected from different medical centres across China, while one publicly available dataset was sourced from Austria.

Source data

Extended Data Fig. 2 Characteristics of the pretraining dataset.

a, Distribution of video durations in the dataset. The left histogram shows the number of videos for each duration interval (in minutes). A magnified view of the long-tail distribution is provided for video durations exceeding 200 minutes. b, Age distribution of patients associated with the surgical videos. The histogram shows the number of videos across different patient age groups, excluding cases where age was not recorded. c, Sex distribution of patients featured in the dataset.

Source data

Extended Data Fig. 3 Frequency of videos per calendar year.

Panels a–g present the Xinhua-pretrain, ST-pretrain, Aier-SH-pretrain, Aier-HZ-pretrain, Aier-JF-pretrain, GZ-pretrain and Kandze-pretrain datasets, respectively.

Source data

Extended Data Fig. 4 Schematics of the seven downstream tasks used to evaluate OVFM.

Spatiotemporal-level tasks include surgical step recognition, tool presence recognition, complication detection, and surgical skill assessment, where OVFM is connected to a linear layer for classification. Spatial-level tasks include surgical scene segmentation, limbus boundary segmentation, and nucleus localization, where OVFM is connected to a decoder.
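
As this caption describes, the same pretrained backbone is reused with task-specific heads: a linear layer for clip-level recognition and a decoder for pixel-level outputs. Below is a minimal sketch of that head-swapping pattern; the feature dimension, token-grid size and two-layer decoder are placeholder assumptions, not the OVFM architecture.

    # Hypothetical sketch: one shared backbone feature, two task-specific heads.
    import torch
    import torch.nn as nn

    class LinearHead(nn.Module):
        """Clip-level classifier, as used for the spatiotemporal-level tasks."""
        def __init__(self, dim, num_classes):
            super().__init__()
            self.fc = nn.Linear(dim, num_classes)

        def forward(self, pooled):            # pooled: (B, dim) clip feature
            return self.fc(pooled)

    class SegDecoder(nn.Module):
        """Small upsampling decoder, as used for the spatial-level tasks."""
        def __init__(self, dim, num_classes):
            super().__init__()
            self.up = nn.Sequential(
                nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2),
                nn.GELU(),
                nn.ConvTranspose2d(dim // 2, num_classes, kernel_size=2, stride=2),
            )

        def forward(self, fmap):              # fmap: (B, dim, h, w) token grid
            return self.up(fmap)

    pooled = torch.randn(2, 768)              # placeholder pooled backbone feature
    fmap = torch.randn(2, 768, 14, 14)        # placeholder spatial token grid
    step_logits = LinearHead(768, 4)(pooled)  # (2, 4) step predictions
    seg_logits = SegDecoder(768, 2)(fmap)     # (2, 2, 56, 56) segmentation map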

Extended Data Fig. 5 Extended performance evaluation of OVFM.

ROC curves for surgical step recognition on the Xinhua-Cata dataset (n = 23 videos) (a) and the Aier-Cata dataset (n = 26 videos) (b), comparing OVFM with four models across four steps: incision, capsulorhexis, lens implant and others. Solid lines show empirical ROC curves from the full test set; shaded regions denote 95% bootstrap confidence intervals.

Source data
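
The shaded bands are percentile-bootstrap confidence intervals. As a minimal sketch of the general procedure for a single AUC value, shown here with case-level resampling on toy data (the resampling unit and iteration count are assumptions; elsewhere the paper describes a clustered bootstrap):

    # Hypothetical percentile-bootstrap confidence interval for AUC.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def bootstrap_auc_ci(y_true, y_score, n_boot=2000, alpha=0.05, seed=0):
        """Resample cases with replacement; return the (1 - alpha) percentile CI."""
        rng = np.random.default_rng(seed)
        n, aucs = len(y_true), []
        for _ in range(n_boot):
            idx = rng.integers(0, n, size=n)
            if np.unique(y_true[idx]).size < 2:   # an ROC needs both classes
                continue
            aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
        return np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])

    y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
    y_score = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.9, 0.6, 0.2])
    low, high = bootstrap_auc_ci(y_true, y_score)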

Extended Data Fig. 6 Effects of surgical data diversity and fine-tuning strategies on OVFM performance.

a, Comparison of fine-tuning strategies for OVFM across multiple tasks. Two strategies, freezing the backbone versus fine-tuning the backbone, are evaluated on surgical step recognition (n = 28 videos), complication detection (n = 49 videos), surgical skill assessment (n = 82 videos), surgical scene segmentation (n = 10 videos), limbus segmentation (n = 28 videos) and nucleus block localization (n = 20 videos). b, Fine-tuning strategy comparison for tool presence recognition. The same two strategies are evaluated for tool presence recognition (n = 25 videos), with detailed performance reported per tool type. c–f, Impact of surgical type diversity on OVFM performance in surgical step recognition (n = 28 videos). Performance is compared for OVFM pretrained on only anterior segment video data, only posterior segment video data, or a combination of both. Box plots summarize the distribution of performance metrics across clustered bootstrap iterations: the centre line denotes the median, the box spans the interquartile range (25th–75th percentiles), and whiskers extend to the minimum and maximum values within 1.5× the interquartile range. Statistical comparisons were performed using a two-sided paired clustered bootstrap hypothesis test.

Source data
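
The two regimes compared in panel a differ only in whether gradients flow through the pretrained encoder. A minimal sketch of that toggle follows, using placeholder modules and illustrative learning rates rather than the OVFM training code:

    # Hypothetical sketch: frozen backbone (linear probing) vs full fine-tuning.
    import torch.nn as nn
    from torch.optim import AdamW

    backbone = nn.Linear(768, 768)   # stand-in for the pretrained video encoder
    head = nn.Linear(768, 4)         # stand-in task head, e.g. four surgical steps

    def configure(freeze_backbone):
        for p in backbone.parameters():
            p.requires_grad = not freeze_backbone
        trainable = [p for m in (backbone, head)
                     for p in m.parameters() if p.requires_grad]
        return AdamW(trainable, lr=1e-3 if freeze_backbone else 1e-5)

    probe_opt = configure(freeze_backbone=True)    # train the head only
    full_opt = configure(freeze_backbone=False)    # train encoder and head end to end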

Extended Data Fig. 7 Performance analysis of the distilled OVFM model on retrospective clinical video cases.

This figure presents the surgical step recognition and limbus boundary segmentation performance of the distilled OVFM model on two clinical cases (a and b). For each case, the left panel displays the confusion matrix for surgical step recognition, illustrating the relationship between the model's predictions and the annotated labels for each surgical step. Each cell reports both the raw count (outside the parentheses) and the proportion relative to the total number of samples for that class (inside the parentheses, ranging from 0 to 1). The right panel presents qualitative results for surgical step recognition and limbus boundary segmentation. The colour bar illustrates the alignment between predicted surgical steps and ground-truth annotations. For limbus boundary segmentation, predicted boundaries (red) are overlaid with ground truth (green) on the original video frames.

Source data
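
The "count (proportion)" cells described above correspond to a confusion matrix normalized within each true class. A minimal sketch on toy labels (the class set and values are invented for illustration):

    # Hypothetical sketch of the per-cell "count (proportion)" layout.
    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])   # annotated steps
    y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 3, 3, 3])   # model predictions

    cm = confusion_matrix(y_true, y_pred)                # rows: true classes
    props = cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)  # per true class
    for counts, fracs in zip(cm, props):
        print("  ".join(f"{c} ({p:.2f})" for c, p in zip(counts, fracs)))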

Extended Data Fig. 8 Intraoperative comparisons of incision and capsulorhexis steps with and without navigation.

a, Incision step comparison. Representative surgical scenes for two surgeons, comparing performance with and without navigation. The intraoperative incision targets and post-surgery incision lines are shown, with zoomed-in views highlighting discrepancies. b, Capsulorhexis step comparison. Surgical scenes from the capsulorhexis step, demonstrating the differences in performance with and without navigation. Target capsulorhexis ranges and actual outcomes are compared post-surgery.

Supplementary information

Supplementary Information (PDF)

Supplementary Notes 1–16, Tables 1–11 and Figs. 1 and 2.

Reporting Summary (PDF)

Supplementary Video 1 (MP4)

The diversity of ophthalmic surgical types in the pretraining dataset.

Supplementary Video 2 (MP4)

OVFM-guided navigation scenes in porcine eye surgeries.

Source data

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tu, P., Zheng, C., Xie, X. et al. An ophthalmic video foundation model for surgical recognition and navigation with wet-lab porcine eye validation. Nat. Biomed. Eng. (2026). https://doi.org/10.1038/s41551-026-01622-w

