Table 1 Accuracy and RMS calibration error of different models on HLE, demonstrating low accuracy and high calibration error across all models

From: A benchmark of expert-level academic questions to assess AI capabilities

| Model | Accuracy (%) ↑ | Calibration error (%) ↓ |
|---|---|---|
| GPT-4o | 2.7 ± 0.6 | 89 |
| Claude 3.5 Sonnet | 4.1 ± 0.8 | 84 |
| Gemini 1.5 Pro | 4.6 ± 0.8 | 88 |
| o1 | 8.0 ± 1.1 | 83 |
| DeepSeek R1ᵃ | 8.5 ± 1.2 | 73 |
| Post-release models | | |
| Claude 4 Sonnet | 7.8 ± 1.1 | 75 |
| Gemini 2.5 Pro | 21.6 ± 1.6 | 72 |
| GPT-5 | 25.3 ± 1.7 | 50 |

The most up-to-date evaluations are hosted at https://lastexam.ai. Post-release models were released after HLE was open-sourced; we list them separately because model builders have access to the HLE dataset. We report a breakdown of the text-only subset and other categories in Extended Data Tables 1 and 2.

ᵃModel is not multimodal; it was evaluated on the text-only subset.
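For context, the RMS calibration error reported in Table 1 measures how far a model's stated confidence deviates from its realized accuracy: a high value means the model reports confidences well above the fraction of answers it actually gets right. Below is a minimal sketch of a generic binned RMS calibration-error estimator; the function name, bin count, and equal-width binning scheme are illustrative assumptions and not necessarily the exact implementation used for HLE.

```python
import numpy as np

def rms_calibration_error(confidences, correct, n_bins=10):
    """Generic binned RMS calibration error (illustrative sketch).

    confidences: array of self-reported confidences in [0, 1]
    correct: boolean array indicating whether each answer was right
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Include the right edge only in the final bin.
        if hi < 1.0:
            mask = (confidences >= lo) & (confidences < hi)
        else:
            mask = (confidences >= lo) & (confidences <= hi)
        if mask.any():
            # Squared gap between mean confidence and empirical accuracy,
            # weighted by the fraction of samples falling in this bin.
            gap = confidences[mask].mean() - correct[mask].mean()
            total += (mask.sum() / n) * gap ** 2
    return np.sqrt(total)
```

Under this kind of estimator, a model that answers a quarter of questions correctly while consistently reporting confidences near certainty would show a calibration error on the order of the values in the table above.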