Table 1 Accuracy and RMS calibration error of different models on HLE, demonstrating low accuracy and high calibration error across all models
From: A benchmark of expert-level academic questions to assess AI capabilities
| Model | Accuracy (%) ↑ | RMS calibration error (%) ↓ |
|---|---|---|
| GPT-4o | 2.7 ± 0.6 | 89 |
| Claude 3.5 Sonnet | 4.1 ± 0.8 | 84 |
| Gemini 1.5 Pro | 4.6 ± 0.8 | 88 |
| o1 | 8.0 ± 1.1 | 83 |
| DeepSeek R1 | 8.5 ± 1.2 | 73 |
| Post-release models | | |
| Claude 4 Sonnet | 7.8 ± 1.1 | 75 |
| Gemini 2.5 Pro | 21.6 ± 1.6 | 72 |
| GPT-5 | 25.3 ± 1.7 | 50 |
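Calibration error here compares a model's stated confidence in each answer against its actual accuracy; a perfectly calibrated model would score near 0%. Below is a minimal Python sketch of a binned RMS calibration-error estimator of the kind typically used for such tables. The function name, the choice of ten equal-width confidence bins, and the toy data are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def rms_calibration_error(confidences, correct, n_bins=10):
    """Binned RMS calibration error, in percent.

    confidences: model-reported confidence per question, in [0, 1]
    correct: 1/0 (or bool) per question, whether the answer was right
    n_bins: number of equal-width confidence bins (10 is an assumption)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n_total = confidences.size
    mean_sq_gap = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # Last bin is closed on the right so confidence == 1.0 is counted.
        if i == n_bins - 1:
            in_bin = (confidences >= lo) & (confidences <= hi)
        else:
            in_bin = (confidences >= lo) & (confidences < hi)
        if not in_bin.any():
            continue
        # Gap between average stated confidence and actual accuracy in the bin,
        # weighted by the fraction of questions that fall in the bin.
        gap = confidences[in_bin].mean() - correct[in_bin].mean()
        mean_sq_gap += (in_bin.sum() / n_total) * gap**2
    return 100.0 * np.sqrt(mean_sq_gap)

# Toy usage: an overconfident model that states ~90% confidence
# but answers correctly only half the time.
conf = np.array([0.9, 0.95, 0.85, 0.9])
hit = np.array([1, 0, 0, 1])
print(f"RMS calibration error: {rms_calibration_error(conf, hit):.0f}%")
```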