Fig. 1: Performance of frontier LLMs on popular benchmarks and HLE.
From: A benchmark of expert-level academic questions to assess AI capabilities

Whereas other popular capability benchmarks are saturated, accuracy on HLE remains low across several frontier models, demonstrating its effectiveness as a measure of advanced, closed-ended academic capabilities.