Large language models (LLMs) have demonstrated strong performance in general natural language processing (NLP) tasks across various domains1. In healthcare, studies have shown promising results, particularly in medical diagnostic support and clinical decision-making2,3. For example, GPT-4 achieved a diagnostic accuracy of 61.1% in its top six differential diagnoses for previously unpublished challenging clinical cases, surpassing the 49.1% accuracy of physicians2. Similarly, a Llama-based model achieved 52% accuracy in assigning diagnosis-related groups for hospitalized patients using clinical notes3. These advancements highlight the potential of LLMs in specialized healthcare applications, including skill assessment and training of healthcare professionals. While substantial work has focused on assessing and training surgeons’ technical skills4,5, there is less emphasis on non-technical skills (NTS)6. Yet these NTS are critical cognitive, social, and interpersonal abilities that are directly linked to adverse events and patient safety outcomes7.

Assessing NTS presents unique challenges due to their primary expression through verbal communication8, which can be dynamic, nuanced, context-dependent, and susceptible to interpretation bias. This variability introduces subjectivity into assessments, as observers may interpret the same behaviors differently. Traditional observational assessments are prone to these inconsistencies, which can compromise subsequent training efforts. They are also resource-intensive, requiring significant time and effort both to train raters and for those raters to assess long surgeries9,10. Dyadic NTS coaching approaches, while beneficial, are highly time-intensive and require specialized non-surgeon coaching staff who can concentrate on non-technical rather than technical skills11,12. Given these challenges, there is a clear need for more objective, efficient, and scalable methods to assess NTS, ensuring consistent, high-quality training that enhances patient outcomes and strengthens team dynamics.

NLP offers a promising solution to these challenges since NTS are mostly demonstrated through verbal communication. In this study, we explore the feasibility of LLMs for surgeons’ NTS assessment and coaching. Specifically, we hypothesize that LLMs can differentiate between exemplar and non-exemplar NTS behaviors and provide actionable performance coaching feedback to surgeons, supporting scalable and effective surgical education.

Model performance

Table 1 presents the performance of LLMs and benchmarking machine learning (ML) models in categorizing NTS behaviors. Overall, LLMs demonstrated higher performance than classical ML models. Llama 3.1 achieved a macro-averaged F1 score of 0.62 and a prediction accuracy of 74%, while Mistral showed comparable performance with a macro-averaged F1 score of 0.61 and a prediction accuracy of 75%. In comparison, the classical ML models, logistic regression (LR) and support vector machine (SVM), demonstrated lower overall performance. Although their prediction accuracy (LR = 74%, SVM = 71%) was comparable to that of the LLMs, their macro-averaged F1 scores were substantially lower (LR = 0.52, SVM = 0.51), highlighting their limitations in balancing precision and recall across NTS behavior classes.

Table 1 Overall and class performance on test data for each model category

For exemplar NTS behaviors, both LLMs and classical ML models performed strongly, with class-level F1 scores of 0.84 and 0.85, respectively, indicating that both model types were effective in identifying well-defined, positive behaviors exhibited by surgeons. For non-exemplar behaviors, which are more challenging to identify, Llama 3.1 achieved an F1 score of 0.41, outperforming LR and SVM, both of which had F1 scores below 0.20. Mistral also outperformed the classical models with an F1 score of 0.37. These results show that while classical ML models matched the LLMs on exemplar behaviors, the LLMs identified non-exemplar behaviors more accurately.
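This pattern reflects how accuracy and macro-averaged F1 respond differently to class imbalance. The short sketch below (Python with scikit-learn, using synthetic labels rather than the study data) illustrates how a classifier that misses most minority-class events can still report high accuracy while its macro-averaged F1 falls.

```python
# Illustrative only: synthetic labels, not the study data.
from sklearn.metrics import accuracy_score, f1_score

# 80 exemplar ("E") and 20 non-exemplar ("N") events; the classifier catches
# only 4 of the 20 non-exemplar events and calls everything else exemplar.
y_true = ["E"] * 80 + ["N"] * 20
y_pred = ["E"] * 80 + ["N"] * 4 + ["E"] * 16

print(accuracy_score(y_true, y_pred))             # 0.84 -> looks strong
print(f1_score(y_true, y_pred, average="macro"))  # ~0.62 -> exposes the weak minority class
```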

Figure 1 compares the traditional manual NTS assessment pipeline with the proposed LLM-based automation approach using an example of a missed communication opportunity in the operating room due to the lack of a vocative address. In the manual approach12, multiple human expert raters, including a coach, independently assess the surgeon’s NTS behaviors using the Non-Technical Skills for Surgeons (NOTSS) tool. After completing the individual assessments, they engage in group discussions to reach a consensus on the assessment. The coach then provides dyadic coaching feedback to the surgeon, specifically targeting improvements to address the identified communication breakdown. This multi-step process is resource-intensive, requiring substantial time for individual assessments, consensus-building, and personalized coaching.

Fig. 1: Comparative workflow for current gold standard versus LLM-based approach for NTS assessment and coaching for a surgical event.

Classical ML models are excluded, as they do not support natural language generation.

In contrast, the LLM automation pipeline directly processes surgical transcripts through prompt-engineered LLMs to assess NTS behaviors and generate coaching-style feedback. Both Llama 3.1 and Mistral identified the same non-exemplar behavior in the example scenario: Llama 3.1 highlighted potential confusion due to the lack of a vocative address, while Mistral offered specific suggestions for clearer communication. Additional examples of exemplar and non-exemplar classifications can be found in Supplementary Table S1. These results demonstrate that LLMs can automate NTS coaching, providing consistent, context-specific feedback that aligns with human coaches while offering a more efficient and scalable alternative to manual methods.

Implications

This study demonstrated the feasibility of LLMs for NTS assessment and coaching in surgeries, addressing key limitations of traditional methods. Our results showed that LLMs outperformed ML models despite being trained on significantly smaller datasets. Differences in data handling strategies should be considered when interpreting classification performance comparisons. Although balanced training data were used in both cases to mitigate class imbalance effects, the underlying training mechanisms differed.

LLM performance was likely constrained by limited context windows, which restricted the amount of contextual information that could be processed during training13. In particular, the shorter context window of Mistral limited the size of the training set. While this enabled consistent evaluation across LLMs and a focus on developing highly targeted prompts, it may reduce generalizability. This limitation should be revisited as LLMs advance. Additionally, the underrepresentation of non-exemplar behaviors in the dataset likely contributed to lower performance in this category. In this study, fewer than 17% of observed behaviors were non-exemplar, aligning with prior research indicating that while these behaviors are less frequent, they are not uncommon14. Ongoing research is focused on enhancing model adaptability through optimized fine-tuning strategies and dataset expansion, particularly by increasing non-exemplar cases to improve class balance.

Despite not being explicitly trained on human coaching feedback, LLMs demonstrated the ability to generate performance-based feedback for surgeons, offering a scalable alternative to traditional dyadic human coaching, which is often limited by human availability and variability. This scalability could democratize access to high-quality NTS training, benefiting a wide range of surgical teams, including those in resource-limited settings. The ability to better identify non-exemplar behaviors, where coaching is most needed, combined with the generation of structured, context-specific feedback, makes LLMs preferable in this setting.

This study focused on transcript-based, offline analysis that lays critical groundwork for future efforts by (1) establishing a reproducible pipeline for building large-scale datasets from real intraoperative conversations, (2) demonstrating the advantages of LLMs over traditional ML models in capturing the nuance of verbal behaviors during robotic-assisted surgery (RAS), (3) refining prompt engineering strategies to handle domain-specific language and achieve high performance on NTS performance classification and coaching feedback tasks, and (4) producing a trained LLM-based model that can be further expanded, refined, validated, and adapted. Early-stage, large-scale human evaluation of LLM-generated feedback will be a formal part of the system design, with expert review shaping model refinement; initial efforts will focus on acceptability, perceived usefulness, trust, and integration into training workflows, and a full usability study will be necessary before deployment to evaluate these factors systematically in clinical environments. Practical implementation will require addressing barriers such as generalizability across specialties and institutions, alignment with clinical education standards, and fidelity to the dynamic context of live procedures. It will also be essential to address concerns around accountability, reliability, privacy, and the secure handling of clinical data to prevent misuse of sensitive information. Additionally, we recognize that LLMs operate as black-box systems, which presents challenges for interpretability and adoption. A human-in-the-loop approach will therefore remain essential throughout design, development, deployment, and post-deployment monitoring: while automation reduces the overall burden compared to traditional dyadic coaching, expert oversight by surgical educators or system stakeholders is still needed to ensure accountability. Together, these efforts aim to enable scalable, targeted, and high-fidelity NTS coaching while establishing a clearer pathway toward integrating LLM-based feedback mechanisms into surgical education programs.

Methods

Participants

This study was approved by Indiana University’s Institutional Review Board (protocol 1702481748; approved: March 2023). No race, ethnicity, or patient-identifiable data were collected. Between 2023 and 2024, nine surgeons (4 women [44%]) from six surgical subspecialties across three hospitals in the Indiana University Health system, all performing at least one robotic-assisted surgery (RAS) per month, were recruited. They provided informed consent, and twenty RAS procedures were observed and recorded in the operating room, with each surgeon contributing 1–3 cases.

Data

Human expert raters watched the RAS recordings and used the validated NOTSS15 tool to identify operating room interactions that challenged the surgeons’ NTS. Surgeons’ performance was classified as exemplar (rating = 4) or non-exemplar (rating < 4). This classification approach aligns with prior research distinguishing behaviors meeting the highest standard from those requiring improvement9. The raters identified 709 NTS events, with 590 categorized as exemplar behaviors and 119 as non-exemplar. Transcripts of these interactions constitute the dataset.
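As a small illustration of this labeling rule (a sketch only; the function and its argument name are hypothetical), the binary classes follow directly from the NOTSS ratings:

```python
def label_event(rating: int) -> str:
    """Map a NOTSS rating to the binary class used in this study."""
    return "exemplar" if rating == 4 else "non-exemplar"

print(label_event(4))  # exemplar
print(label_event(3))  # non-exemplar
```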

Large language models

This study used publicly released state-of-the-art models, including Meta’s Llama-3.1-8B-Instruct16 (open-weight, released under Meta’s custom Llama 3.1 Community License) and Mistral AI’s Mistral-7B-Instruct-v0.317 (open-weight and open-source under Apache 2.0), both accessed via Hugging Face. All experiments were conducted using Google Colab Pro+, which provided cloud-based access to a single A100 GPU. The only cost incurred was the Colab Pro+ subscription, which offered sufficient compute capacity for prompt engineering, model inference, and performance evaluation. These models were chosen for their strong performance, computational efficiency, and accessibility, which enable execution within controlled research environments without reliance on proprietary API services. This supports reproducibility, transparency, and privacy, which are key considerations for clinical research.
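A minimal loading sketch under this setup is shown below; the Hugging Face repository identifiers and the bfloat16 precision are illustrative assumptions rather than details reported in the study.

```python
# Sketch: load either instruct model from Hugging Face onto the single A100.
# Repository names and dtype are assumptions, not study-reported settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_IDS = {
    "llama": "meta-llama/Llama-3.1-8B-Instruct",
    "mistral": "mistralai/Mistral-7B-Instruct-v0.3",
}

def load_model(name: str):
    """Return (tokenizer, model) for the requested LLM."""
    model_id = MODEL_IDS[name]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # an 8B model in bf16 fits on a 40 GB A100
        device_map="auto",
    )
    return tokenizer, model

tokenizer, model = load_model("mistral")
```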

While Llama 3.1 supports a context window of up to 128K tokens16, our experiments were constrained by Mistral’s shorter context window of approximately 32K tokens17. This limitation necessitated a focus on developing highly targeted prompts rather than increasing the size of the training dataset. A stratified sampling strategy was used to construct the training dataset: one exemplar and one non-exemplar data point were randomly sampled from each of the four NOTSS categories, resulting in eight data points (~1% of the data) that covered all combinations of NTS performance level and NOTSS category. The same training data (n = 8; four exemplar and four non-exemplar) was used for both LLMs.
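The sampling step can be sketched as follows (hypothetical column names notss_category, label, and transcript; the toy table merely stands in for the 709 labeled events):

```python
import pandas as pd

# Toy stand-in for the labeled event table (hypothetical column names).
events = pd.DataFrame({
    "notss_category": ["Situation Awareness", "Decision Making",
                       "Communication & Teamwork", "Leadership"] * 4,
    "label": ["exemplar"] * 8 + ["non-exemplar"] * 8,
    "transcript": [f"utterance {i}" for i in range(16)],
})

# One exemplar and one non-exemplar per NOTSS category -> 8 few-shot examples.
few_shot = events.groupby(["notss_category", "label"]).sample(n=1, random_state=0)
test_set = events.drop(few_shot.index)  # all remaining events are held out for testing
```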

Prompt design targeted both classification of NTS performance and generation of coaching feedback. For both models, the generation parameters were as follows: max_new_tokens = 512 (prevents overly long coaching feedback), temperature = 0.005 (makes the output consistent across runs), top_p = 0.001 (restricts choices to only the most likely words), top_k = 20 (adds a cutoff to avoid unusual or irrelevant words), and do_sample = True (to retain some natural language variability). Importantly, the models were not trained on any human coaching feedback. Instead, they were prompted to generate feedback solely from the transcripts of surgeon interactions in the operating room, ensuring that feedback generation was independent of expert coaching examples.
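A minimal sketch of this decoding setup is shown below; the parameter values are those listed above, while the pipeline call, model identifier, and prompt placeholder are illustrative (the model loaded in the earlier sketch could be reused instead).

```python
# Sketch of the decoding configuration; parameter values as reported above.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.3",  # the same settings were used for Llama 3.1
    device_map="auto",
)

gen_kwargs = dict(
    max_new_tokens=512,  # caps the length of the coaching feedback
    do_sample=True,      # sampling kept on for slight natural variability
    temperature=0.005,   # near-greedy: outputs are effectively consistent across runs
    top_p=0.001,         # nucleus sampling restricted to the most likely tokens
    top_k=20,            # hard cutoff that excludes unusual or irrelevant tokens
)

output = generator("<assessment prompt goes here>", **gen_kwargs)[0]["generated_text"]
```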

To evaluate model performance with minimal instruction, an initial prompt was executed and outputs were manually reviewed to identify failure patterns. Prompts were subsequently refined iteratively using evidence-based strategies, including role prompting (e.g., “You are an experienced non-technical skills coach”), clear section delimitation, and zero-shot chain-of-thought prompting (e.g., “Think step-by-step”)18,19. Refinement continued until both models achieved perfect (100%) classification and feedback generation performance on the training data. The final prompts can be found in the Supplementary Information. Model performance was evaluated on the full test dataset (n = 701; 115 non-exemplar) using classification metrics including accuracy, precision, recall, and macro-averaged F1 scores.
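The sketch below condenses these prompting strategies into a single illustrative template and shows how the resulting predictions are scored; it is not the study prompt (the full prompts are in the Supplementary Information), and the labels shown are placeholders.

```python
from sklearn.metrics import classification_report

# Role prompting, delimited sections, few-shot examples, and a zero-shot
# chain-of-thought cue, condensed into one illustrative template.
PROMPT_TEMPLATE = """You are an experienced non-technical skills coach.

### Task
Classify the surgeon's behavior in the transcript below as "exemplar" or
"non-exemplar" using the NOTSS framework, then provide brief coaching feedback.

### Examples
{few_shot_examples}

### Transcript
{transcript}

Think step-by-step before giving your final answer."""

prompt = PROMPT_TEMPLATE.format(few_shot_examples="...", transcript="...")

# Predicted labels parsed from the model output are scored against expert
# ratings with the metrics reported above (placeholder values shown here).
y_true = ["exemplar", "non-exemplar", "exemplar", "non-exemplar"]
y_pred = ["exemplar", "exemplar", "exemplar", "non-exemplar"]
print(classification_report(y_true, y_pred, zero_division=0))
```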

Classical machine learning models

Classical ML models, logistic regression (LR) and support vector machine (SVM), were used as benchmarks for evaluating LLM performance. To address class imbalance, the majority class (exemplar) was under-sampled to 80% of the minority class20, creating a balanced training dataset (n = 190; 95 non-exemplar). The models were trained on Term Frequency-Inverse Document Frequency (TF-IDF) vectorized unigrams of the balanced training data. Model performance was then evaluated on a held-out test dataset (n = 519; 24 non-exemplar) using the same metrics as the LLMs.
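A minimal benchmark sketch under this setup follows; the placeholder transcripts and variable names are illustrative, standing in for the balanced training split (n = 190) and the held-out test split (n = 519).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder data standing in for the balanced training and held-out test transcripts.
train_texts = ["could you pass the stapler please", "just do it now"]
train_labels = ["exemplar", "non-exemplar"]
test_texts = ["nicely done team, thank you"]
test_labels = ["exemplar"]

# TF-IDF unigrams feeding each classical benchmark model.
benchmarks = {
    "LR": make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), LinearSVC()),
}

for name, clf in benchmarks.items():
    clf.fit(train_texts, train_labels)
    preds = clf.predict(test_texts)
    print(name, f1_score(test_labels, preds, average="macro", zero_division=0))
```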