Table 1 LLM accuracy, discrimination, calibration, and confidence scores, sorted from best calibration (lowest Brier score) to worst for each model family

From: Across generations, sizes, and types, large language models poorly report self-confidence in gastroenterology clinical reasoning tasks

| Model family | Model name and parameters (quantization) | Date accessed | Brier score (calibration) | ECE (calibration) | AUROC (discrimination) | Accuracy (%) | Self-reported confidence, mean (95% CI) |
|---|---|---|---|---|---|---|---|
| Llama | Llama 3.3 70B | December 2024 | 0.260 | 0.199 | 0.563 | 65.66 | 8.46 (8.36–8.56) |
| | Llama 3.1 405B | August 2024 | 0.273 | 0.211 | 0.592 | 64.00 | 8.47 (8.38–8.57) |
| | Llama 3.2 90B | December 2024 | 0.302 | 0.269 | 0.600 | 60.00 | 8.49 (8.34–8.62) |
| | Llama 3.1 70B | August 2024 | 0.313 | 0.283 | 0.538 | 58.19 | 8.51 (8.39–8.62) |
| | Llama 3 70B | May 2024 | 0.334 | 0.301 | 0.572 | 54.66 | 8.38 (8.28–8.48) |
| | Llama 3.2 11B | December 2024 | 0.400 | 0.390 | 0.519 | 48.65 | 8.59 (8.46–8.69) |
| | Llama 3 8B | May 2024 | 0.422 | 0.450 | 0.478 | 43.33 | 8.54 (8.41–8.68) |
| | Llama 3.1 8B | August 2024 | 0.433 | 0.441 | 0.512 | 43.14 | 8.67 (8.54–8.80) |
| | Llama 3.2 3B | December 2024 | 0.465 | 0.487 | 0.534 | 35.66 | 8.32 (8.18–8.45) |
| | Llama 2 70B | April 2024 | 0.481 | 0.493 | 0.529 | 37.71 | 8.70 (8.58–8.81) |
| | Llama 3.2 1B | December 2024 | 0.500 | 0.511 | 0.455 | 30.61 | 8.13 (7.96–8.31) |
| | Llama 2 13B (Q5) | April 2024 | 0.525 | 0.546 | 0.500 | 35.16 | 8.98 (8.92–9.04) |
| | Llama 3 8B (Q8) | April 2024 | 0.527 | 0.613 | 0.472 | 30.35 | 8.65 (8.28–9.02) |
| | Llama 2 7B | April 2024 | 0.528 | 0.587 | 0.470 | 30.87 | 8.66 (8.47–8.84) |
| | Llama 2 13B | April 2024 | 0.531 | 0.558 | 0.520 | 33.11 | 8.89 (8.82–8.95) |
| | Llama 2 7B (Q8) | April 2024 | 0.559 | 0.582 | 0.458 | 32.45 | 9.07 (8.98–9.15) |
| Qwen | Qwen 2.5 72B | September 2024 | 0.326 | 0.304 | 0.549 | 61.48 | 8.39 (8.15–8.63) |
| | Qwen 2 72B | September 2024 | 0.364 | 0.360 | 0.583 | 57.00 | 9.10 (8.98–9.20) |
| Phi | Phi-3 Medium 14B (Q6) | April 2024 | 0.389 | 0.377 | 0.588 | 48.66 | 8.57 (8.48–8.67) |
| | Phi-3 3B (FP16) | April 2024 | 0.458 | 0.464 | 0.486 | 43.79 | 8.96 (8.84–9.07) |
| | Phi-3.5 4B | December 2024 | 0.558 | 0.578 | 0.465 | 31.86 | 8.96 (8.90–9.02) |
| Google | Gemini Advanced Web | March–April 2024 | 0.297 | 0.247 | 0.561 | 58.49 | 8.20 (8.07–8.33) |
| | Gemma 2 27B | July 2024 | 0.374 | 0.352 | 0.557 | 50.00 | 8.52 (8.41–8.63) |
| | Gemma 2 9B (Q8) | July 2024 | 0.397 | 0.392 | 0.543 | 45.33 | 8.40 (8.30–8.50) |
| | Gemma 2 9B | July 2024 | 0.398 | 0.390 | 0.592 | 44.59 | 8.33 (8.20–8.45) |
| | Gemini Web | March 2024 | 0.421 | 0.420 | 0.563 | 44.44 | 8.61 (8.53–8.70) |
| Mistral | Mistral Large | April 2024 | 0.282 | 0.224 | 0.602 | 60.53 | 8.13 (7.98–8.28) |
| | Mixtral 8x7B | April 2024 | 0.359 | 0.336 | 0.547 | 54.33 | 8.79 (8.72–8.87) |
| | Mistral v2 (Q8) | April 2024 | 0.506 | 0.527 | 0.554 | 39.06 | 9.11 (8.90–9.32) |
| | Mistral 7B | April 2024 | 0.547 | 0.551 | 0.519 | 40.66 | — |
| Claude | Claude 3.5 Sonnet | July 2024 | 0.207 | 0.122 | 0.600 | 74.00 | 8.60 (8.54–8.67) |
| | Claude 3 Opus | March–April 2024 | 0.229 | 0.150 | 0.575 | 70.35 | 8.54 (8.44–8.63) |
| | Claude 3 Opus Web | March–April 2024 | 0.246 | 0.154 | 0.578 | 65.66 | 7.99 (7.89–8.09) |
| | Claude 3 Sonnet Web | March–April 2024 | 0.326 | 0.284 | 0.551 | 55.33 | 8.37 (8.29–8.45) |
| | Claude 3 Sonnet | March–April 2024 | 0.361 | 0.336 | 0.559 | 51.17 | 8.48 (8.39–8.58) |
| | Claude 3 Haiku | March–April 2024 | 0.373 | 0.357 | 0.522 | 53.76 | 8.88 (8.80–8.96) |
| | Claude 3 Haiku Web | March–April 2024 | 0.398 | 0.385 | 0.523 | 50.00 | 8.85 (8.80–8.90) |
| GPT | o1 Preview | September 2024 | 0.157 | 0.100 | 0.576 | 81.57 | 9.15 (9.10–9.20) |
| | GPT-4o | May 2024 | 0.208 | 0.148 | 0.604 | 74.00 | 8.86 (8.80–8.92) |
| | GPT-4 Web | March 2024 | 0.267 | 0.221 | 0.588 | 66.22 | 8.79 (8.70–8.87) |
| | GPT-4 | March 2024 | 0.278 | 0.237 | 0.605 | 66.53 | 9.02 (8.92–9.13) |
| | o1 Mini | September 2024 | 0.278 | 0.257 | 0.626 | 66.33 | 9.20 (9.12–9.27) |
| | GPT-4o Mini | July 2024 | 0.342 | 0.309 | 0.572 | 56.61 | 8.75 (8.67–8.83) |
| | GPT-3.5 Web | March 2024 | 0.394 | 0.375 | 0.546 | 47.66 | 8.56 (8.48–8.63) |
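For readers unfamiliar with the three metric families in the table, the following is a minimal sketch of how Brier score, ECE, and AUROC are typically computed from per-question confidences and correctness labels. It assumes confidences rescaled to [0, 1] (e.g., a 0–10 self-report divided by 10) and binary correctness; it is illustrative only, not the authors' implementation.

```python
def brier_score(conf, correct):
    # Mean squared gap between confidence and outcome; lower = better calibrated.
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)

def ece(conf, correct, n_bins=10):
    # Expected Calibration Error: accuracy-vs-confidence gap per confidence bin,
    # weighted by how many answers fall in each bin; lower = better calibrated.
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, y))
    err = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            err += (len(b) / len(conf)) * abs(avg_conf - acc)
    return err

def auroc(conf, correct):
    # Probability that a correct answer receives higher confidence than an
    # incorrect one (ties count half); 0.5 = no discrimination at all.
    pos = [c for c, y in zip(conf, correct) if y == 1]
    neg = [c for c, y in zip(conf, correct) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pattern the table documents, such as high mean confidence alongside modest accuracy, shows up in these metrics as a large ECE (confidence exceeding accuracy in most bins) even when AUROC stays near 0.5.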