Large language models (LLMs) have demonstrated strong performance in general natural language processing (NLP) tasks across various domains1. In healthcare, studies have shown promising results, particularly in medical diagnostic support and clinical decision-making2,3. For example, GPT-4 achieved a diagnostic accuracy of 61.1% in its top six differential diagnoses for previously unpublished challenging clinical cases, surpassing the 49.1% accuracy of physicians2. Similarly, a Llama-based model achieved 52% accuracy in assigning diagnosis-related groups for hospitalized patients using clinical notes3. These advancements highlight the potential of LLMs in specialized healthcare applications, including skill assessment and training of healthcare professionals. While substantial work has focused on assessing and training surgeons’ technical skills4,5, there is less emphasis on non-technical skills (NTS)6. Yet these NTS are critical cognitive, social, and interpersonal abilities that are directly linked to adverse events and patient safety outcomes7.

Assessing NTS presents unique challenges due to their primary expression through verbal communication8, which can be dynamic, nuanced, context-dependent, and susceptible to interpretation bias. This variability introduces subjectivity into assessments, as observers may interpret the same behaviors differently. Traditional observational assessments are prone to these inconsistencies, which can compromise subsequent training efforts. They are also resource-intensive, requiring significant time and effort both to train raters and for those raters to assess long surgeries9,10. Dyadic NTS coaching approaches, while beneficial, are highly time-intensive and require specialized non-surgeon coaching staff who can concentrate on non-technical rather than technical skills11,12. Given these challenges, there is a clear need for more objective, efficient, and scalable methods to assess NTS, ensuring consistent, high-quality training that enhances patient outcomes and strengthens team dynamics.

NLP offers a promising solution to these challenges since NTS are mostly demonstrated through verbal communication. In this study, we explore the feasibility of LLMs for surgeons’ NTS assessment and coaching. Specifically, we hypothesize that LLMs can differentiate between exemplar and non-exemplar NTS behaviors and provide actionable performance coaching feedback to surgeons, supporting scalable and effective surgical education.

Model performance

Table 1 presents the performance of LLMs and benchmarking machine learning (ML) models in categorizing NTS behaviors. Overall, LLMs demonstrated higher performance than classical ML models. Llama 3.1 achieved a macro-averaged F1 score of 0.62 and a prediction accuracy of 74%, while Mistral showed comparable performance with a macro-averaged F1 score of 0.61 and a prediction accuracy of 75%. In comparison, the classical ML models, logistic regression (LR) and support vector machine (SVM), demonstrated lower overall performance. Although their prediction accuracy (LR = 74%, SVM = 71%) was comparable to that of the LLMs, their macro-averaged F1 scores were substantially lower (LR = 0.52, SVM = 0.51), highlighting their limitations in balancing precision and recall across NTS behavior classes.

Table 1 Overall and class performance on test data for each model category

For exemplar NTS behaviors, both LLMs and classical ML models performed strongly, with class-level F1 scores of 0.84 and 0.85, respectively, indicating that both model types were effective in identifying well-defined, positive behaviors exhibited by surgeons. For non-exemplar behaviors, which are more challenging to identify, Llama 3.1 achieved an F1 score of 0.41, outperforming LR and SVM, both of which had F1 scores below 0.20. Mistral also outperformed the classical models with an F1 score of 0.37. These results show that while classical ML models matched the LLMs on exemplar behaviors, the LLMs identified non-exemplar behaviors more accurately.
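This pattern reflects how accuracy and macro-averaged F1 respond differently to class imbalance. The short sketch below (Python with scikit-learn, using synthetic labels rather than the study data) illustrates how a classifier that misses most minority-class events can still report high accuracy while its macro-averaged F1 falls.

```python
# Illustrative only: synthetic labels, not the study data.
from sklearn.metrics import accuracy_score, f1_score

# 80 exemplar ("E") and 20 non-exemplar ("N") events; the classifier catches
# only 4 of the 20 non-exemplar events and calls everything else exemplar.
y_true = ["E"] * 80 + ["N"] * 20
y_pred = ["E"] * 80 + ["N"] * 4 + ["E"] * 16

print(accuracy_score(y_true, y_pred))             # 0.84 -> looks strong
print(f1_score(y_true, y_pred, average="macro"))  # ~0.62 -> exposes the weak minority class
```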

Figure 1 compares the traditional manual NTS assessment pipeline with the proposed LLM-based automation approach using an example of a missed communication opportunity in the operating room due to the lack of a vocative address. In the manual approach12, multiple human expert raters, including a coach, independently assess the surgeon’s NTS behaviors using the Non-Technical Skills for Surgeons (NOTSS) tool. After completing the individual assessments, they engage in group discussions to reach a consensus on the assessment. The coach then provides dyadic coaching feedback to the surgeon, specifically targeting improvements to address the identified communication breakdown. This multi-step process is resource-intensive, requiring substantial time for individual assessments, consensus-building, and personalized coaching.

Fig. 1: Comparative workflow for current gold standard versus LLM-based approach for NTS assessment and coaching for a surgical event.

Classical ML models are excluded, as they do not support natural language generation.

In contrast, the LLM automation pipeline directly processes surgical transcripts through prompt-engineered LLMs to assess NTS behaviors and generate coaching-style feedback. Both Llama 3.1 and Mistral identified the same non-exemplar behavior in the example scenario: Llama 3.1 highlighted potential confusion due to the lack of a vocative address, while Mistral offered specific suggestions for clearer communication. Additional examples of exemplar and non-exemplar classifications can be found in Supplementary Table S1. These results demonstrate that LLMs can automate NTS coaching, providing consistent, context-specific feedback that aligns with human coaches while offering a more efficient and scalable alternative to manual methods.

Implications

This study demonstrated the feasibility of LLMs for NTS assessment and coaching in surgeries, addressing key limitations of traditional methods. Our results showed that LLMs outperformed ML models despite being trained on significantly smaller datasets. Differences in data handling strategies should be considered when interpreting classification performance comparisons. Although balanced training data were used in both cases to mitigate class imbalance effects, the underlying training mechanisms differed.

LLM performance was likely constrained by limited context windows, which restricted the amount of contextual information that could be processed during training13. In particular, the shorter context window of Mistral limited the size of the training set. While this enabled consistent evaluation across LLMs and a focus on developing highly targeted prompts, it may reduce generalizability. This limitation should be revisited as LLMs advance. Additionally, the underrepresentation of non-exemplar behaviors in the dataset likely contributed to lower performance in this category. In this study, fewer than 17% of observed behaviors were non-exemplar, aligning with prior research indicating that while these behaviors are less frequent, they are not uncommon14. Ongoing research is focused on enhancing model adaptability through optimized fine-tuning strategies and dataset expansion, particularly by increasing non-exemplar cases to improve class balance.

Despite not being explicitly trained on human coaching feedback, LLMs demonstrated the ability to generate performance-based feedback for surgeons, offering a scalable alternative to traditional dyadic human coaching, which is often limited by human availability and variability. This scalability could democratize access to high-quality NTS training, benefiting a wide range of surgical teams, including those in resource-limited settings. The ability to better identify non-exemplar behaviors, where coaching is most needed, combined with the generation of structured, context-specific feedback, makes LLMs preferable in this setting.

This study focused on transcript-based, offline analysis that lays critical groundwork for future efforts by (1) establishing a reproducible pipeline for building large-scale datasets from real intraoperative conversations, (2) demonstrating the advantages of LLMs over traditional ML models in capturing the nuance of verbal behaviors during robotic-assisted surgery (RAS), (3) refining prompt engineering strategies to handle domain-specific language and achieve high performance on NTS performance classification and coaching feedback tasks, and (4) producing a trained LLM-based model that can be further expanded, refined, validated, and adapted. Early-stage, large-scale human evaluation of LLM-generated feedback will be a formal part of the system design, with expert review shaping model refinement; initial efforts will focus on acceptability, perceived usefulness, trust, and integration into training workflows, and a full usability study will be necessary before deployment to evaluate these factors systematically in clinical environments. Practical implementation will require addressing barriers such as generalizability across specialties and institutions, alignment with clinical education standards, and fidelity to the dynamic context of live procedures. It will also be essential to address concerns around accountability, reliability, privacy, and the secure handling of clinical data to prevent misuse of sensitive information. Additionally, we recognize that LLMs operate as black-box systems, which presents challenges for interpretability and adoption. A human-in-the-loop approach will therefore remain essential throughout design, development, deployment, and post-deployment monitoring: while automation reduces the overall burden compared to traditional dyadic coaching, expert oversight by surgical educators or system stakeholders is still needed to ensure accountability. Together, these efforts aim to enable scalable, targeted, and high-fidelity NTS coaching while establishing a clearer pathway toward integrating LLM-based feedback mechanisms into surgical education programs.

Methods

Participants

This study was approved by Indiana University’s Institutional Review Board (protocol 1702481748; approved: March 2023). No race, ethnicity, or patient-identifiable data were collected. Between 2023 and 2024, nine surgeons (4 women [44%]) from six surgical subspecialties across three hospitals in the Indiana University Health system, all performing at least one robotic-assisted surgery (RAS) per month, were recruited. They provided informed consent, and twenty RAS procedures were observed and recorded in the operating room, with each surgeon contributing 1–3 cases.

Data

Human expert raters watched the RAS recordings and used the validated NOTSS15 tool to identify operating room interactions that challenged the surgeons’ NTS. Surgeons’ performance was classified as exemplar (rating = 4) or non-exemplar (rating < 4). This classification approach aligns with prior research distinguishing behaviors meeting the highest standard from those requiring improvement9. The raters identified 709 NTS events, with 590 categorized as exemplar behaviors and 119 as non-exemplar. Transcripts of these interactions constitute the dataset.
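As a small illustration of this labeling rule (a sketch only; the function and its argument name are hypothetical), the binary classes follow directly from the NOTSS ratings:

```python
def label_event(rating: int) -> str:
    """Map a NOTSS rating to the binary class used in this study."""
    return "exemplar" if rating == 4 else "non-exemplar"

print(label_event(4))  # exemplar
print(label_event(3))  # non-exemplar
```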

Large language models

This study used publicly released state-of-the-art models, including Meta’s Llama-3.1-8B-Instruct16 (open-weight, released under Meta’s custom Llama 3.1 Community License) and Mistral AI’s Mistral-7B-Instruct-v0.317 (open-weight and open-source under Apache 2.0), both accessed via Hugging Face. All experiments were conducted using Google Colab Pro+, which provided cloud-based access to a single A100 GPU. The only cost incurred was the Colab Pro+ subscription, which offered sufficient compute capacity for prompt engineering, model inference, and performance evaluation. These models were chosen for their strong performance, computational efficiency, and accessibility, which enable execution within controlled research environments without reliance on proprietary API services. This supports reproducibility, transparency, and privacy, which are key considerations for clinical research.
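A minimal loading sketch under this setup is shown below; the Hugging Face repository identifiers and the bfloat16 precision are illustrative assumptions rather than details reported in the study.

```python
# Sketch: load either instruct model from Hugging Face onto the single A100.
# Repository names and dtype are assumptions, not study-reported settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_IDS = {
    "llama": "meta-llama/Llama-3.1-8B-Instruct",
    "mistral": "mistralai/Mistral-7B-Instruct-v0.3",
}

def load_model(name: str):
    """Return (tokenizer, model) for the requested LLM."""
    model_id = MODEL_IDS[name]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # an 8B model in bf16 fits on a 40 GB A100
        device_map="auto",
    )
    return tokenizer, model

tokenizer, model = load_model("mistral")
```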

While Llama 3.1 supports a context window of up to 128K tokens16, our experiments were constrained by Mistral’s shorter context window of approximately 32K tokens17. This limitation necessitated a focus on developing highly targeted prompts rather than increasing the size of the training dataset. A stratified sampling strategy was used to construct the training dataset: one exemplar and one non-exemplar data point were randomly sampled from each of the four NOTSS categories, resulting in eight data points (~1% of the data) that covered all combinations of NTS performance level and NOTSS category. The same training data (n = 8; four exemplar and four non-exemplar) was used for both LLMs.
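The sampling step can be sketched as follows (hypothetical column names notss_category, label, and transcript; the toy table merely stands in for the 709 labeled events):

```python
import pandas as pd

# Toy stand-in for the labeled event table (hypothetical column names).
events = pd.DataFrame({
    "notss_category": ["Situation Awareness", "Decision Making",
                       "Communication & Teamwork", "Leadership"] * 4,
    "label": ["exemplar"] * 8 + ["non-exemplar"] * 8,
    "transcript": [f"utterance {i}" for i in range(16)],
})

# One exemplar and one non-exemplar per NOTSS category -> 8 few-shot examples.
few_shot = events.groupby(["notss_category", "label"]).sample(n=1, random_state=0)
test_set = events.drop(few_shot.index)  # all remaining events are held out for testing
```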

Prompt design targeted both classification of NTS performance and generation of coaching feedback. For both models, the generation parameters were as follows: max_new_tokens = 512 (prevents overly long coaching feedback), temperature = 0.005 (makes the output consistent across runs), top_p = 0.001 (restricts choices to only the most likely words), top_k = 20 (adds a cutoff to avoid unusual or irrelevant words), and do_sample = True (to retain some natural language variability). Importantly, the models were not trained on any human coaching feedback. Instead, they were prompted to generate feedback solely from the transcripts of surgeon interactions in the operating room, ensuring that feedback generation was independent of expert coaching examples.
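A minimal sketch of this decoding setup is shown below; the parameter values are those listed above, while the pipeline call, model identifier, and prompt placeholder are illustrative (the model loaded in the earlier sketch could be reused instead).

```python
# Sketch of the decoding configuration; parameter values as reported above.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.3",  # the same settings were used for Llama 3.1
    device_map="auto",
)

gen_kwargs = dict(
    max_new_tokens=512,  # caps the length of the coaching feedback
    do_sample=True,      # sampling kept on for slight natural variability
    temperature=0.005,   # near-greedy: outputs are effectively consistent across runs
    top_p=0.001,         # nucleus sampling restricted to the most likely tokens
    top_k=20,            # hard cutoff that excludes unusual or irrelevant tokens
)

output = generator("<assessment prompt goes here>", **gen_kwargs)[0]["generated_text"]
```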

To evaluate model performance with minimal instruction, an initial prompt was executed and outputs were manually reviewed to identify failure patterns. Prompts were subsequently refined iteratively using evidence-based strategies, including role prompting (e.g., “You are an experienced non-technical skills coach”), clear section delimitation, and zero-shot chain-of-thought prompting (e.g., “Think step-by-step”)18,19. Refinement continued until both models achieved perfect (100%) classification and feedback generation performance on the training data. The final prompts can be found in the Supplementary Information. Model performance was evaluated on the full test dataset (n = 701; 115 non-exemplar) using classification metrics including accuracy, precision, recall, and macro-averaged F1 scores.
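The sketch below condenses these prompting strategies into a single illustrative template and shows how the resulting predictions are scored; it is not the study prompt (the full prompts are in the Supplementary Information), and the labels shown are placeholders.

```python
from sklearn.metrics import classification_report

# Role prompting, delimited sections, few-shot examples, and a zero-shot
# chain-of-thought cue, condensed into one illustrative template.
PROMPT_TEMPLATE = """You are an experienced non-technical skills coach.

### Task
Classify the surgeon's behavior in the transcript below as "exemplar" or
"non-exemplar" using the NOTSS framework, then provide brief coaching feedback.

### Examples
{few_shot_examples}

### Transcript
{transcript}

Think step-by-step before giving your final answer."""

prompt = PROMPT_TEMPLATE.format(few_shot_examples="...", transcript="...")

# Predicted labels parsed from the model output are scored against expert
# ratings with the metrics reported above (placeholder values shown here).
y_true = ["exemplar", "non-exemplar", "exemplar", "non-exemplar"]
y_pred = ["exemplar", "exemplar", "exemplar", "non-exemplar"]
print(classification_report(y_true, y_pred, zero_division=0))
```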

Classical machine learning models

Classical ML models, logistic regression (LR) and support vector machine (SVM), were used as benchmarks for evaluating LLM performance. To address class imbalance, the majority class (exemplar) was under-sampled to 80% of the minority class20, creating a balanced training dataset (n = 190; 95 non-exemplar). The models were trained on Term Frequency-Inverse Document Frequency (TF-IDF) vectorized unigrams of the balanced training data. Model performance was then evaluated on a held-out test dataset (n = 519; 24 non-exemplar) using the same metrics as the LLMs.
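A minimal benchmark sketch under this setup follows; the placeholder transcripts and variable names are illustrative, standing in for the balanced training split (n = 190) and the held-out test split (n = 519).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder data standing in for the balanced training and held-out test transcripts.
train_texts = ["could you pass the stapler please", "just do it now"]
train_labels = ["exemplar", "non-exemplar"]
test_texts = ["nicely done team, thank you"]
test_labels = ["exemplar"]

# TF-IDF unigrams feeding each classical benchmark model.
benchmarks = {
    "LR": make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), LinearSVC()),
}

for name, clf in benchmarks.items():
    clf.fit(train_texts, train_labels)
    preds = clf.predict(test_texts)
    print(name, f1_score(test_labels, preds, average="macro", zero_division=0))
```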