Large-scale self-supervised video foundation model for intelligent surgery

Yang, Shu; Zhou, Fengtao; Mayer, Leon; Huang, Fuxiang; Chen, Yiliang; Wang, Yihui; He, Sunan; Nie, Yuxiang; Wang, Xi; Jin, Yueming; Sun, Huihui; Xu, Shuchang; Liu, Alex Qinyang; Li, Zheng; Qin, Jing; Teoh, Jeremy YuenChun; Maier-Hein, Lena; Chen, Hao

doi:10.1038/s41746-026-02403-0

Article
Open access
Published: 04 February 2026

npj Digital Medicine (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Abstract

Computer-assisted intervention has the potential to revolutionize modern surgery, with surgical scene understanding serving as a critical component in supporting decision-making and improving procedural efficacy. While existing AI-driven approaches alleviate annotation burdens via self-supervised spatial representation learning, their lack of explicit temporal modeling during pre-training fundamentally restricts the capture of dynamic surgical contexts, resulting in incomplete spatiotemporal understanding. In this work, we introduce the first video-level surgical pre-training framework that enables joint spatiotemporal representation learning from large-scale surgical video data. To achieve this, we constructed a large-scale surgical video dataset comprising 3650 videos and 3.55 million frames, spanning more than 20 surgical procedures and over 10 anatomical structures. Building upon this dataset, we propose SurgVISTA (Surgical Video-level Spatial-Temporal Architecture), a reconstruction-based pre-training method that jointly captures intricate spatial structures and temporal dynamics. Additionally, SurgVISTA incorporates image-level knowledge distillation guided by a surgery-specific expert model to enhance the learning of fine-grained anatomical and semantic features. To validate its effectiveness, we established a comprehensive benchmark comprising 13 video-level datasets spanning six surgical procedures across four tasks. Extensive experiments show that SurgVISTA consistently outperforms both natural- and surgical-domain pre-trained models, indicating strong potential to advance intelligent surgical systems in clinically meaningful scenarios.
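
As a rough illustration of the kind of objective the abstract describes (reconstruction-based video pre-training combined with image-level knowledge distillation), the sketch below computes a combined loss on toy data. It is not the SurgVISTA implementation: the tube-masking scheme, the mean-squared-error terms, the `distill_weight` parameter, and the function names `tube_mask` and `pretraining_loss` are all assumptions made for illustration.

```python
import random


def tube_mask(num_frames, num_patches, mask_ratio, seed=0):
    """Return a [num_frames][num_patches] boolean mask (True = hidden).

    Assumption: the same spatial patches are hidden in every frame
    ("tube" masking); the abstract does not specify the masking scheme.
    """
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    hidden = set(rng.sample(range(num_patches), num_masked))
    return [[p in hidden for p in range(num_patches)] for _ in range(num_frames)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pretraining_loss(pred_pixels, true_pixels, student_feats, teacher_feats,
                     mask, distill_weight=1.0):
    """Combine masked reconstruction with per-frame feature distillation.

    pred_pixels / true_pixels:     [T][P][D] patch pixel vectors
    student_feats / teacher_feats: [T][P][C] patch features; the teacher
        stands in for a frozen image-level expert applied frame by frame.
    Assumption: reconstruction is scored on hidden patches only, while
    distillation is scored on all patches.
    """
    recon_terms, distill_terms = [], []
    for t, frame_mask in enumerate(mask):
        for p, is_hidden in enumerate(frame_mask):
            if is_hidden:
                recon_terms.append(mse(pred_pixels[t][p], true_pixels[t][p]))
            distill_terms.append(mse(student_feats[t][p], teacher_feats[t][p]))
    recon = sum(recon_terms) / len(recon_terms) if recon_terms else 0.0
    distill = sum(distill_terms) / len(distill_terms) if distill_terms else 0.0
    return recon + distill_weight * distill, recon, distill


if __name__ == "__main__":
    T, P = 2, 4  # toy clip: 2 frames, 4 patches per frame
    mask = tube_mask(T, P, mask_ratio=0.5)
    true = [[[0.0, 0.0] for _ in range(P)] for _ in range(T)]
    pred = [[[1.0, 1.0] for _ in range(P)] for _ in range(T)]
    feats = [[[0.5] for _ in range(P)] for _ in range(T)]
    print(pretraining_loss(pred, true, feats, feats, mask))
```

In this toy setup the reconstruction term drives the spatiotemporal learning and the distillation term pulls per-frame features toward the teacher; how the two are actually balanced in SurgVISTA is not stated in the abstract.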

Data availability

Publicly available datasets used to construct the pre-training corpus and evaluation benchmarks are summarized in Supplementary Table 37. The remaining clinical data cannot be shared publicly due to institutional and patient privacy restrictions.

Code availability

The implementation of the SurgVISTA framework will be released on GitHub: https://github.com/isyangshu/SurgVISTA. The pre-trained natural-domain parameters used in this study are listed in Supplementary Table 34, and the pre-trained surgical-domain parameters are listed in Supplementary Table 35. Other publicly available code used in this study is listed in Supplementary Table 36.

Acknowledgements

The work described in this paper was supported by the Germany/Hong Kong Joint Research Scheme, sponsored by the Research Grants Council of Hong Kong and the German Academic Exchange Service (Reference No. G-HKUST605/24); by the Hong Kong Innovation and Technology Commission (Project No. GHP/006/22GD and ITCPD/17-9); and by the National Natural Science Foundation of China (Grant No. 62402458).

Author information

Authors and Affiliations

Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China
Shu Yang, Fengtao Zhou, Fuxiang Huang, Yihui Wang, Sunan He, Yuxiang Nie, Xi Wang & Hao Chen
Division of Intelligent Medical Systems, German Cancer Research Center (DKFZ) Heidelberg, Heidelberg, Germany
Leon Mayer & Lena Maier-Hein
Faculty of Medicine, Heidelberg University Hospital, Heidelberg, Germany
Leon Mayer & Lena Maier-Hein
School of Nursing, The Hong Kong Polytechnic University, Hong Kong SAR, China
Yiliang Chen & Jing Qin
Department of Biomedical Engineering, National University of Singapore, Singapore, Singapore
Yueming Jin
Department of Electrical and Computer Engineering, National University of Singapore, Singapore, Singapore
Yueming Jin
Department of Gastroenterology of Tongji Hospital, School of Medicine, Tongji University, Shanghai, China
Huihui Sun & Shuchang Xu
Department of Surgery, The Chinese University of Hong Kong, Hong Kong SAR, China
Alex Qinyang Liu, Zheng Li & Jeremy YuenChun Teoh
HI Helmholtz Imaging, German Cancer Research Center (DKFZ) Heidelberg, Heidelberg, Germany
Lena Maier-Hein
Faculty of Mathematics and Computer Science, Heidelberg University, Heidelberg, Germany
Lena Maier-Hein
National Center for Tumor Diseases (NCT), NCT Heidelberg, Heidelberg, Germany
Lena Maier-Hein
Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China
Hao Chen
Division of Life Science, The Hong Kong University of Science and Technology, Hong Kong SAR, China
Hao Chen
State Key Laboratory of Nervous System Disorders, The Hong Kong University of Science and Technology, Hong Kong SAR, China
Hao Chen
Shenzhen-Hong Kong Collaborative Innovation Research Institute, The Hong Kong University of Science and Technology, Shenzhen, China
Hao Chen

Contributions

S.Y., L.M.-H., and H.C. conceived and designed the work. S.Y. contributed to the technical implementation and conducted experiments. F.Z. participated in discussions regarding the design of the self-supervised learning framework and was responsible for reproducing the natural-domain models. L.M. participated in discussions regarding the design of the self-supervised learning framework and contributed to part of the experimental evaluations. F.H., Y.W., S.H., Y.N., and Y.C. collected the data for self-supervised learning and downstream task evaluation. Xi.W., Y.J., and J.Q. offered suggestions on the experimental design and helped direct the research trajectory. H.S., S.X., A.Q.L., Z.L., and J.Y.T. provided clinical expertise and facilitated access to proprietary datasets. All authors contributed to the drafting and revising of the manuscript. L.M.-H. and H.C. supervised the research.

Corresponding authors

Correspondence to Lena Maier-Hein or Hao Chen.

Ethics declarations

Competing interests

S.Y. and H.C. are inventors on a patent application related to this work that is currently being prepared for filing via the Patent Cooperation Treaty (PCT) route, with The Hong Kong University of Science and Technology as the applicant. The application will cover the pre-training framework, model architecture, and pre-trained parameters presented in this manuscript. All other authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Yang, S., Zhou, F., Mayer, L. et al. Large-scale self-supervised video foundation model for intelligent surgery. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-026-02403-0

Received: 21 August 2025
Accepted: 22 January 2026
Published: 04 February 2026
DOI: https://doi.org/10.1038/s41746-026-02403-0