Abstract
Artificial Intelligence (AI) is reshaping medical education, particularly in the domain of competency-based assessment, where current methods remain subjective and resource-intensive. We introduce a multimodal AI framework that integrates video, audio, and patient monitor data to provide objective and interpretable competency assessments. Using data from 90 anesthesia residents, we established “ideal” performance benchmarks and trained an anomaly detection model (MEMTO) to quantify deviations from these benchmarks. Competency scores derived from these deviations showed strong alignment with expert ratings (Spearman’s ρ = 0.78; ICC = 0.75) and high ranking precision (relative L2-distance = 0.12). SHAP analysis revealed that communication and visual attention to the patient monitor are key drivers of variability. By linking AI-assisted anomaly detection with interpretable feedback, our framework addresses critical challenges of fairness, reliability, and transparency in simulation-based education. This work provides actionable evidence for integrating AI into medical training and advancing scalable, equitable evaluation of competence.
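As a rough illustration of the agreement statistics reported above, the minimal Python sketch below computes Spearman’s ρ, an ICC(2,1) estimate, and a relative L2-distance between paired AI-derived and expert scores. It is not the study’s code: the score vectors are synthetic placeholders, and the exact ICC variant and relative L2 definition used in the paper may differ.

```python
# Minimal sketch (not the study's code): agreement between AI-derived
# competency scores and expert ratings, assuming ICC(2,1) and one plausible
# definition of relative L2-distance.
import numpy as np
from scipy.stats import spearmanr


def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n_subjects, n_raters) matrix, e.g. column 0 = AI score,
    column 1 = expert rating.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)

    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between subjects
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between raters
    sse = (np.sum((ratings - grand) ** 2)
           - k * np.sum((row_means - grand) ** 2)
           - n * np.sum((col_means - grand) ** 2))
    mse = sse / ((n - 1) * (k - 1))                         # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical paired scores for 90 residents, for illustration only.
rng = np.random.default_rng(0)
expert = rng.uniform(1, 5, size=90)
ai = expert + rng.normal(0, 0.4, size=90)

rho, p = spearmanr(ai, expert)
icc = icc2_1(np.column_stack([ai, expert]))
rel_l2 = np.linalg.norm(ai - expert) / np.linalg.norm(expert)

print(f"Spearman rho = {rho:.2f} (p = {p:.1e}), "
      f"ICC(2,1) = {icc:.2f}, relative L2 = {rel_l2:.2f}")
```

In the study itself, these metrics were computed between the model’s competency scores and expert ratings; the synthetic vectors here only demonstrate the calculation.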
Data availability
All requests for raw and analyzed data, as well as related materials, will be reviewed by our legal department (Technion–Israel Institute of Technology) to verify whether the request is subject to any intellectual property or confidentiality constraints. Any data and materials that can be shared will be released via a material transfer agreement for noncommercial research purposes. Requests should be addressed to S.L.
Code availability
All requests for programming code should be addressed to S.L. Any materials that can be shared will be released via a material transfer agreement for noncommercial research purposes. All analyses were conducted on a computing cluster with two NVIDIA RTX A6000 (48 GB) GPUs. The software environment consisted of Ubuntu 20.04 LTS, Python (v3.11), and PyTorch (v2.0).
Acknowledgements
We thank the Technion Autonomous Systems Program for support.
Author information
Contributions
S.L. was responsible for the conceptualization and overall supervision of the study, as well as securing the funding. S.G. led the formal analysis, methodology development, and visualization, and drafted the original manuscript. S.G. also performed validation and final editing under the guidance of S.L. and A.R., who provided critical feedback and contributed to the review and editing process. F.M. conducted data curation and was responsible for data collection and project administration. All authors reviewed and approved the final version of the manuscript.
Corresponding author
Correspondence and requests for materials should be addressed to S.L.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Gershov, S., Mahameed, F., Raz, A. et al. Towards accurate and interpretable competency-based assessment: enhancing clinical competency assessment through multimodal AI and anomaly detection. npj Digit. Med. (2026). https://doi.org/10.1038/s41746-025-02299-2