Table 1 Overall and class performance on test data for each model category

From: Feasibility of large language models for assessing and coaching surgeons’ non-technical skills

Overall performance on test data for each model category

| Model category | Model name | Accuracy | Precision | Recall | F1 score |
|----------------|------------|----------|-----------|--------|----------|
| Classical ML   | LR         | 0.74     | 0.55      | 0.71   | 0.52     |
| Classical ML   | SVM        | 0.71     | 0.55      | 0.73   | 0.51     |
| LLM            | Llama 3.1  | 0.74     | 0.61      | 0.66   | 0.62     |
| LLM            | Mistral    | 0.75     | 0.60      | 0.63   | 0.61     |

Class performance on test data for each model category

| Model category | Model name | NTS performance rating | Precision | Recall | F1 score |
|----------------|------------|------------------------|-----------|--------|----------|
| Classical ML   | LR         | Exemplar               | 0.98      | 0.75   | 0.85     |
| Classical ML   | SVM        | Exemplar               | 0.98      | 0.71   | 0.83     |
| LLM            | Llama 3.1  | Exemplar               | 0.90      | 0.78   | 0.84     |
| LLM            | Mistral    | Exemplar               | 0.88      | 0.81   | 0.84     |
| Classical ML   | LR         | Non-exemplar           | 0.11      | 0.67   | 0.19     |
| Classical ML   | SVM        | Non-exemplar           | 0.11      | 0.75   | 0.19     |
| LLM            | Llama 3.1  | Non-exemplar           | 0.33      | 0.54   | 0.41     |
| LLM            | Mistral    | Non-exemplar           | 0.32      | 0.45   | 0.37     |

  1. The F1 score (the harmonic mean of precision and recall) is regarded as the most important metric.
  2. LLM large language model, ML machine learning, LR logistic regression, SVM support vector machine, NTS non-technical skills.
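Footnote 1 defines the F1 score as the harmonic mean of precision and recall, and the overall F1 values appear to be unweighted (macro) averages of the two per-class F1 scores. As an illustrative check only (a sketch, not the authors' code), the LR row can be reproduced from the tabulated class metrics:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Per-class F1 for LR, from the class-performance table
exemplar_f1 = f1(0.98, 0.75)      # -> 0.85 (rounded)
non_exemplar_f1 = f1(0.11, 0.67)  # -> 0.19 (rounded)

# Unweighted (macro) average over the two classes
macro_f1 = (exemplar_f1 + non_exemplar_f1) / 2
print(round(macro_f1, 2))  # 0.52, matching LR's overall F1
```

The same macro-averaging reproduces the overall precision, recall, and F1 for the other three models as well.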