Abstract
Computer-Assisted Intervention has the potential to revolutionize modern surgery, with surgical scene understanding serving as a critical component in supporting decision-making and improving procedural efficacy. While existing AI-driven approaches alleviate annotation burdens via self-supervised spatial representation learning, their lack of explicit temporal modeling during pre-training fundamentally restricts the capture of dynamic surgical contexts, resulting in incomplete spatiotemporal understanding. In this work, we introduce the first video-level surgical pre-training framework that enables joint spatiotemporal representation learning from large-scale surgical video data. To achieve this, we constructed a large-scale surgical video dataset comprising 3650 videos and 3.55 million frames, spanning more than 20 surgical procedures and over 10 anatomical structures. Building upon this dataset, we propose SurgVISTA (Surgical Video-level Spatial-Temporal Architecture), a reconstruction-based pre-training method that jointly captures intricate spatial structures and temporal dynamics. Additionally, SurgVISTA incorporates image-level knowledge distillation guided by a surgery-specific expert model to enhance the learning of fine-grained anatomical and semantic features. To validate its effectiveness, we established a comprehensive benchmark comprising 13 video-level datasets spanning six surgical procedures across four tasks. Extensive experiments show that SurgVISTA consistently outperforms both natural- and surgical-domain pre-trained models, underscoring its potential to advance intelligent surgical systems in clinically meaningful scenarios.
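As a high-level illustration of the pre-training objective described above, the following minimal PyTorch sketch combines tube-masked video reconstruction with an image-level distillation term supervised by a frozen expert model. It is a simplified sketch under assumed placeholder names and hyperparameters (ToyMaskedVideoPretrainer, the tubelet size, mask ratio, and feature dimensions are all illustrative), not the released SurgVISTA implementation; positional embeddings and other components are omitted for brevity.

# Illustrative sketch only: placeholder modules and hyperparameters, not SurgVISTA code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMaskedVideoPretrainer(nn.Module):
    """Masked video reconstruction plus image-level distillation (illustrative)."""

    def __init__(self, dim=256, patch=16, tubelet=2, mask_ratio=0.9):
        super().__init__()
        self.patch, self.tubelet, self.mask_ratio = patch, tubelet, mask_ratio
        # Tubelet embedding: each token covers `tubelet` frames x patch x patch pixels.
        self.embed = nn.Conv3d(3, dim, kernel_size=(tubelet, patch, patch),
                               stride=(tubelet, patch, patch))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.rec_head = nn.Linear(dim, 3 * tubelet * patch * patch)   # pixel reconstruction
        self.kd_head = nn.Linear(dim, dim)                            # map to teacher space

    def forward(self, video, teacher_tokens):
        # video: (B, 3, T, H, W); teacher_tokens: (B, N, dim) features from a frozen
        # image-level expert model, one feature per spatiotemporal token (placeholder).
        tokens = self.embed(video).flatten(2).transpose(1, 2)         # (B, N, dim)
        B, N, D = tokens.shape
        n_vis = max(1, int(N * (1 - self.mask_ratio)))
        ids = torch.rand(B, N, device=video.device).argsort(dim=1)    # random token order
        vis_ids, mask_ids = ids[:, :n_vis], ids[:, n_vis:]
        gather = lambda x, i: x.gather(1, i.unsqueeze(-1).expand(-1, -1, x.size(-1)))

        # Encode only the small visible subset of tokens.
        latent = self.encoder(gather(tokens, vis_ids))

        # Decoder input: encoded latents at visible positions, a shared learned mask
        # token elsewhere (positional embeddings omitted for brevity).
        full = self.mask_token.expand(B, N, D).clone()
        full.scatter_(1, vis_ids.unsqueeze(-1).expand(-1, -1, D), latent)
        recon = self.rec_head(self.decoder(full))                     # (B, N, 3*t*p*p)

        # 1) Reconstruction loss on masked tokens: encourages spatiotemporal modeling.
        target = self._pixel_targets(video)
        loss_rec = F.mse_loss(gather(recon, mask_ids), gather(target, mask_ids))

        # 2) Image-level distillation: align visible student tokens with the expert.
        loss_kd = F.mse_loss(self.kd_head(latent), gather(teacher_tokens, vis_ids))
        return loss_rec + loss_kd

    def _pixel_targets(self, video):
        # Rearrange raw pixels into per-token targets matching the tubelet layout.
        B, C, T, H, W = video.shape
        t, p = self.tubelet, self.patch
        x = video.reshape(B, C, T // t, t, H // p, p, W // p, p)
        x = x.permute(0, 2, 4, 6, 3, 5, 7, 1)                         # (B, T', H', W', t, p, p, C)
        return x.reshape(B, -1, t * p * p * C)


# Usage: one pre-training step on a dummy 8-frame clip with fake teacher features.
model = ToyMaskedVideoPretrainer()
clip = torch.randn(2, 3, 8, 64, 64)
teacher = torch.randn(2, (8 // 2) * (64 // 16) ** 2, 256)
loss = model(clip, teacher)
loss.backward()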
Data availability
Publicly available datasets used to construct the pre-training corpus and evaluation benchmarks are summarized in Supplementary Table 37. The remaining clinical data cannot be shared publicly due to institutional and patient privacy restrictions.
Code availability
The implementation of the SurgVISTA framework will be released on GitHub: https://github.com/isyangshu/SurgVISTA. The pre-trained natural-domain parameters used in this study are listed in Supplementary Table 34, while the pre-trained surgical-domain parameters are listed in Supplementary Table 35. The other public code used in this study is listed in Supplementary Table 36.
Acknowledgements
The work described in this paper was supported by the Germany/Hong Kong Joint Research Scheme, sponsored by the Research Grants Council of Hong Kong and the German Academic Exchange Service (Reference No. G-HKUST605/24); by the Hong Kong Innovation and Technology Commission (Project Nos. GHP/006/22GD and ITCPD/17-9); and by the National Natural Science Foundation of China (Grant No. 62402458).
Author information
Authors and Affiliations
Contributions
S.Y., L.M.-H., and H.C. conceived and designed the work. S.Y. contributed to the technical implementation and conducted experiments. F.Z. participated in discussions regarding the design of the self-supervised learning framework and was responsible for reproducing the natural-domain models. L.M. participated in discussions regarding the design of the self-supervised learning framework and contributed to part of the experimental evaluations. F.H., Y.W., S.H., Y.N., and Y.C. collected the data for self-supervised learning and downstream task evaluation. Xi.W., Y.J., and J.Q. offered insightful suggestions on the experimental design and helped direct the research trajectory. H.S., S.X., A.Q.L., Z.L., and J.Y.T. provided clinical expertise and facilitated access to proprietary datasets. All authors contributed to the drafting and revising of the manuscript. L.M.-H. and H.C. supervised the research.
Corresponding authors
Ethics declarations
Competing interests
S.Y. and H.C. are inventors on a patent application related to this work that is currently being prepared for filing via the Patent Cooperation Treaty (PCT) route, with The Hong Kong University of Science and Technology as the applicant. The application will cover the pre-training framework, model architecture, and pre-trained parameters presented in this manuscript. All other authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Yang, S., Zhou, F., Mayer, L. et al. Large-scale self-supervised video foundation model for intelligent surgery. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-026-02403-0
DOI: https://doi.org/10.1038/s41746-026-02403-0