Table 1 Overall and class performance on test data for each model category
From: Feasibility of large language models for assessing and coaching surgeons’ non-technical skills
Overall performance on test data for each model category

| Model category | Model name | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| Classical ML | LR | 0.74 | 0.55 | 0.71 | 0.52 |
| Classical ML | SVM | 0.71 | 0.55 | 0.73 | 0.51 |
| LLM | Llama 3.1 | 0.74 | 0.61 | 0.66 | 0.62 |
| LLM | Mistral | 0.75 | 0.60 | 0.63 | 0.61 |
Class performance on test data for each model category

| Model category | Model name | NTS performance rating | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| Classical ML | LR | Exemplar | 0.98 | 0.75 | 0.85 |
| Classical ML | SVM | Exemplar | 0.98 | 0.71 | 0.83 |
| LLM | Llama 3.1 | Exemplar | 0.90 | 0.78 | 0.84 |
| LLM | Mistral | Exemplar | 0.88 | 0.81 | 0.84 |
| Classical ML | LR | Non-exemplar | 0.11 | 0.67 | 0.19 |
| Classical ML | SVM | Non-exemplar | 0.11 | 0.75 | 0.19 |
| LLM | Llama 3.1 | Non-exemplar | 0.33 | 0.54 | 0.41 |
| LLM | Mistral | Non-exemplar | 0.32 | 0.45 | 0.37 |
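For reference, the overall rows are consistent with macro averages of the per-class values (e.g., for LR, precision (0.98 + 0.11)/2 = 0.55, recall (0.75 + 0.67)/2 = 0.71, F1 (0.85 + 0.19)/2 = 0.52). The sketch below shows how such metrics could be reproduced with scikit-learn; the labels are illustrative placeholders and the averaging choice is an assumption, as the paper does not publish its evaluation code.

```python
# Minimal sketch of computing the Table 1 metrics with scikit-learn.
# Assumes binary Exemplar / Non-exemplar labels; y_true and y_pred
# below are placeholder data, not the study's test set.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["Exemplar", "Exemplar", "Non-exemplar", "Exemplar", "Non-exemplar"]
y_pred = ["Exemplar", "Non-exemplar", "Non-exemplar", "Exemplar", "Exemplar"]

# Overall metrics: macro averaging, which matches the reported
# relationship between the overall and per-class rows.
accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Accuracy {accuracy:.2f}  Precision {precision:.2f}  "
      f"Recall {recall:.2f}  F1 {f1:.2f}")

# Per-class metrics, matching the "Class performance" rows.
labels = ["Exemplar", "Non-exemplar"]
p, r, f, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)
for label, pi, ri, fi in zip(labels, p, r, f):
    print(f"{label}: Precision {pi:.2f}  Recall {ri:.2f}  F1 {fi:.2f}")
```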