Table 1 Accuracy and RMS calibration error of different models on HLE, demonstrating low accuracy and high calibration error across all models

From: A benchmark of expert-level academic questions to assess AI capabilities

| Model | Accuracy (%) ↑ | Calibration error (%) ↓ |
|---|---|---|
| GPT-4o | 2.7 ± 0.6 | 89 |
| Claude 3.5 Sonnet | 4.1 ± 0.8 | 84 |
| Gemini 1.5 Pro | 4.6 ± 0.8 | 88 |
| o1 | 8.0 ± 1.1 | 83 |
| DeepSeek R1ᵃ | 8.5 ± 1.2 | 73 |
| Post-release models | | |
| Claude 4 Sonnet | 7.8 ± 1.1 | 75 |
| Gemini 2.5 Pro | 21.6 ± 1.6 | 72 |
| GPT-5 | 25.3 ± 1.7 | 50 |

The most up-to-date evaluations are hosted at https://lastexam.ai. Post-release models were released after HLE was open-sourced; we list them separately because model builders have access to the HLE dataset. We report a breakdown of the text-only subset and other categories in Extended Data Tables 1 and 2.

ᵃModel is not multimodal; it was evaluated on the text-only subset.
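For context, the RMS calibration error reported in Table 1 measures how far a model's stated confidence deviates from its realized accuracy: a high value means the model reports confidences well above the fraction of answers it actually gets right. Below is a minimal sketch of a generic binned RMS calibration-error estimator; the function name, bin count, and equal-width binning scheme are illustrative assumptions and not necessarily the exact implementation used for HLE.

```python
import numpy as np

def rms_calibration_error(confidences, correct, n_bins=10):
    """Generic binned RMS calibration error (illustrative sketch).

    confidences: array of self-reported confidences in [0, 1]
    correct: boolean array indicating whether each answer was right
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Include the right edge only in the final bin.
        if hi < 1.0:
            mask = (confidences >= lo) & (confidences < hi)
        else:
            mask = (confidences >= lo) & (confidences <= hi)
        if mask.any():
            # Squared gap between mean confidence and empirical accuracy,
            # weighted by the fraction of samples falling in this bin.
            gap = confidences[mask].mean() - correct[mask].mean()
            total += (mask.sum() / n) * gap ** 2
    return np.sqrt(total)
```

Under this kind of estimator, a model that answers a quarter of questions correctly while consistently reporting confidences near certainty would show a calibration error on the order of the values in the table above.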