Table 1. LLM accuracy, discrimination, calibration, and self-reported confidence scores, sorted from best calibration (lowest Brier score) to worst within each model family.
Model family / model name and parameter (quantization) | Date accessed | Calibration: Brier score | Calibration: ECE | Discrimination: AUROC | Accuracy (%) | Self-reported confidence score, mean (95% CI) | |
|---|---|---|---|---|---|---|---|
Llama | |||||||
Llama-3.3-70b | December 2024 | 0.260 | 0.199 | 0.563 | 65.66 | 8.46 (8.36–8.56) | |
Llama 3.1 405B | August 2024 | 0.273 | 0.211 | 0.592 | 64 | 8.47 (8.38–8.57) | |
Llama3.2-90B | December 2024 | 0.302 | 0.269 | 0.600 | 60.00 | 8.49 (8.34–8.62) | |
Llama 3.1 70B | August 2024 | 0.313 | 0.283 | 0.538 | 58.19 | 8.51 (8.39–8.62) | |
Llama 3 70B | May 2024 | 0.334 | 0.301 | 0.572 | 54.66 | 8.38 (8.28–8.48) | |
Llama-3.2-11b | December 2024 | 0.400 | 0.390 | 0.519 | 48.65 | 8.59 (8.46–8.69) | |
Llama 3 8B | May 2024 | 0.422 | 0.450 | 0.478 | 43.33 | 8.54 (8.41–8.68) | |
Llama 3.1 8B | August 2024 | 0.433 | 0.441 | 0.512 | 43.14 | 8.67 (8.54–8.80) | |
Llama-3.2-3b | December 2024 | 0.465 | 0.487 | 0.534 | 35.66 | 8.32 (8.18–8.45) | |
Llama 2 70B | April 2024 | 0.481 | 0.493 | 0.529 | 37.71 | 8.70 (8.58–8.81) | |
Llama-3.2-1b | December 2024 | 0.500 | 0.511 | 0.455 | 30.61 | 8.13 (7.96–8.31) | |
Llama 2 13B (Q5) | April 2024 | 0.525 | 0.546 | 0.5 | 35.16 | 8.98 (8.92–9.04) | |
Llama 3 8B (Q8) | April 2024 | 0.527 | 0.613 | 0.472 | 30.35 | 8.65 (8.28–9.02) | |
Llama 2 7B | April 2024 | 0.528 | 0.587 | 0.47 | 30.87 | 8.66 (8.47–8.84) | |
Llama 2 13B | April 2024 | 0.531 | 0.558 | 0.52 | 33.11 | 8.89 (8.82–8.95) | |
Llama 2 7B (Q8) | April 2024 | 0.559 | 0.582 | 0.458 | 32.45 | 9.07 (8.98–9.15) | |
Qwen | |||||||
Qwen-2.5-72b | September 2024 | 0.326 | 0.304 | 0.549 | 61.48 | 8.39 (8.15–8.63) | |
Qwen-2-72B | September 2024 | 0.364 | 0.360 | 0.583 | 57.00 | 9.10 (8.98–9.20) | |
Phi | |||||||
Phi-3 Medium 14B (Q6) | April 2024 | 0.389 | 0.377 | 0.588 | 48.66 | 8.57 (8.48–8.67) | |
Phi-3 3B FP16 | April 2024 | 0.458 | 0.464 | 0.486 | 43.79 | 8.96 (8.84–9.07) | |
Phi-3.5-4b | December 2024 | 0.558 | 0.578 | 0.465 | 31.86 | 8.96 (8.90–9.02) | |
Gemini and Gemma | |||||||
Gemini Advanced Web | March–April 2024 | 0.297 | 0.247 | 0.561 | 58.49 | 8.20 (8.07–8.33) | |
Gemma 2 27B | July 2024 | 0.374 | 0.352 | 0.557 | 50 | 8.52 (8.41–8.63) | |
Gemma 2 9B (Q8) | July 2024 | 0.397 | 0.392 | 0.543 | 45.33 | 8.40 (8.30–8.50) | |
Gemma 2 9B | July 2024 | 0.398 | 0.390 | 0.592 | 44.59 | 8.33 (8.20–8.45) | |
Gemini Web | March 2024 | 0.421 | 0.420 | 0.563 | 44.44 | 8.61 (8.53–8.70) | |
Mistral | |||||||
Mistral Large | April 2024 | 0.282 | 0.224 | 0.602 | 60.53 | 8.13 (7.98–8.28) | |
Mixtral 8x7B | April 2024 | 0.359 | 0.336 | 0.547 | 54.33 | 8.79 (8.72–8.87) | |
Mistral v2 Q8 | April 2024 | 0.506 | 0.527 | 0.554 | 39.06 | 9.11 (8.90–9.32) | |
Mistral 7B | April 2024 | 0.547 | 0.551 | 0.519 | 40.66 | ||
Claude | |||||||
Claude 3.5 Sonnet | July 2024 | 0.207 | 0.122 | 0.6 | 74 | 8.60 (8.54–8.67) | |
Claude 3 Opus | March–April 2024 | 0.229 | 0.150 | 0.575 | 70.35 | 8.54 (8.44–8.63) | |
Claude 3 Opus Web | March–April 2024 | 0.246 | 0.154 | 0.578 | 65.66 | 7.99 (7.89–8.09) | |
Claude 3 Sonnet Web | March–April 2024 | 0.326 | 0.284 | 0.551 | 55.33 | 8.37 (8.29–8.45) | |
Claude 3 Sonnet | March–April 2024 | 0.361 | 0.336 | 0.559 | 51.17 | 8.48 (8.39–8.58) | |
Claude 3 Haiku | March–April 2024 | 0.373 | 0.357 | 0.522 | 53.76 | 8.88 (8.80–8.96) | |
Claude 3 Haiku Web | March–April 2024 | 0.398 | 0.385 | 0.523 | 50 | 8.85 (8.80–8.90) | |
GPT | |||||||
o1 preview | September 2024 | 0.157 | 0.100 | 0.576 | 81.57 | 9.15 (9.10–9.20) | |
GPT-4o | May 2024 | 0.208 | 0.148 | 0.604 | 74 | 8.86 (8.80–8.92) | |
GPT-4 Web | March 2024 | 0.267 | 0.221 | 0.588 | 66.22 | 8.79 (8.70–8.87) | |
GPT-4 | March 2024 | 0.278 | 0.237 | 0.605 | 66.53 | 9.02 (8.92–9.13) | |
o1 Mini | September 2024 | 0.278 | 0.257 | 0.626 | 66.33 | 9.20 (9.12–9.27) | |
GPT-4o Mini | July 2024 | 0.342 | 0.309 | 0.572 | 56.61 | 8.75 (8.67–8.83) | |
GPT-3.5 Web | March 2024 | 0.394 | 0.375 | 0.546 | 47.66 | 8.56 (8.48–8.63) | |
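
The three metrics reported in Table 1 can be illustrated with a minimal Python sketch. It is not the authors' code: it assumes that each self-reported confidence score on the 0–10 scale is mapped to a probability by dividing by 10, that ECE uses 10 equal-width bins, and that AUROC treats a correct answer as the positive class. The function names and the toy data are hypothetical.

```python
from typing import List, Sequence


def brier_score(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared difference between stated probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for p, y in zip(conf, correct)) / len(conf)


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    n = len(conf)
    bins: List[list] = [[] for _ in range(n_bins)]
    for p, y in zip(conf, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


def auroc(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Chance that a correct answer has higher confidence than an incorrect one.

    Ties count as 0.5. This pairwise form is O(n^2), which is fine for a sketch.
    """
    pos = [p for p, y in zip(conf, correct) if y == 1]
    neg = [p for p, y in zip(conf, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: six answers with 0-10 confidence scores and correctness flags.
scores_0_to_10 = [9, 8, 9, 7, 10, 8]
correct = [1, 0, 1, 0, 1, 1]
conf = [s / 10 for s in scores_0_to_10]  # assumption: confidence / 10 = probability

brier = brier_score(conf, correct)
ece = expected_calibration_error(conf, correct)
auc = auroc(conf, correct)
print(f"Brier={brier:.3f}  ECE={ece:.3f}  AUROC={auc:.3f}")
```

Under the divide-by-10 assumption, the table's own figures are at least consistent with this reading: for o1 preview, mean confidence 9.15 and accuracy 81.57% give an overall gap of 0.915 − 0.816 ≈ 0.10, in line with the listed ECE of 0.100. Lower Brier score and ECE indicate better calibration, while an AUROC near 0.5 means the confidence scores barely separate correct from incorrect answers.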