Abstract
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve more than 90% accuracy on popular benchmarks such as Measuring Massive Multitask Language Understanding1, limiting informed measurement of state-of-the-art LLM capabilities. Here, in response, we introduce Humanity’s Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be an expert-level closed-ended academic benchmark with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable but cannot be quickly answered by internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a marked gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
Main
The capabilities of large language models (LLMs) have advanced markedly, exceeding human performance across a diverse array of tasks. To systematically measure these capabilities, LLMs are evaluated on benchmarks: collections of questions that assess model performance on tasks such as math, programming or biology. However, state-of-the-art LLMs2,3,4,5,6 now achieve more than 90% accuracy on popular benchmarks such as Measuring Massive Multitask Language Understanding (MMLU)1, which were once challenging frontiers for LLMs. The saturation of existing benchmarks, as shown in Fig. 1, limits our ability to precisely measure artificial intelligence (AI) capabilities and calls for more challenging evaluations that can meaningfully assess the rapid improvements in LLM capabilities at the frontiers of human knowledge.
To address this gap, we introduce HLE (Humanity’s Last Exam; we use the abbreviation HLE throughout this paper), a benchmark of 2,500 challenging questions from dozens of subject areas, designed to assess LLM capabilities at an expert level across broad academic subjects. HLE is developed by academics and domain experts, providing a precise measure of capabilities as LLMs continue to improve (see section ‘Collection’). HLE is multi-modal, featuring questions that are either text-only or accompanied by an image reference, and includes both multiple-choice and exact-match questions for automated answer verification. Questions are original, precise, unambiguous and resistant to simple internet lookup or database retrieval. Among its diverse questions, HLE emphasizes world-class mathematics problems aimed at testing deep reasoning skills that apply broadly across academic areas.
We use a multi-stage review process to ensure question difficulty and quality (see section ‘Review’). Before submission, each question is tested against state-of-the-art LLMs to verify its difficulty; questions that LLMs can answer correctly are rejected. Submitted questions are then processed through a two-stage review: (1) an initial feedback round with multiple graduate-level reviewers and (2) approval by organizers and expert reviewers, ensuring quality and adherence to our submission criteria. Following release, we conducted a public review period, welcoming community feedback to correct any points of concern in the dataset.
Frontier LLMs consistently demonstrate low accuracy on HLE, highlighting a marked gap between current capabilities and expert-level academic performance (see section ‘Evaluation’). Models also provide incorrect answers with high confidence rather than acknowledging uncertainty on these challenging questions, with most models exhibiting root mean square (RMS) calibration errors above 70%.
As AI systems approach human expert performance in many domains, precise measurement of their capabilities and limitations is essential for informing research, governance and the broader public. High performance on HLE would suggest expert-level capabilities on closed-ended academic questions. To establish a common reference point for assessing these capabilities, we publicly release the 2,500 questions of HLE to enable this precise measurement, while maintaining a private test set to assess potential model overfitting.
Dataset
Collection
HLE consists of 2,500 challenging questions across over a hundred subjects. A high-level summary is provided in Fig. 2. HLE is a global collaborative effort, with questions from nearly 1,000 subject expert contributors affiliated with more than 500 institutions across 50 countries, most of whom are professors, researchers and graduate degree holders. Examples of the diverse and challenging questions submitted to HLE are shown in Fig. 3.
Question style
HLE contains two question formats: exact-match questions (models provide an exact string as output) and multiple-choice questions (the model selects one of five or more answer choices). HLE is a multi-modal benchmark, with around 14% of questions requiring comprehension of both text and an image; 24% of questions are multiple-choice, with the remainder being exact match.
Each question submission includes several required components: the question text itself, answer specifications (either an exact-match answer or multiple-choice options with the correct answer marked), a detailed rationale explaining the solution, the academic subject, and the contributor’s name and institutional affiliation to maintain accountability and accuracy.
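For illustration, a minimal Python sketch of how such a submission record could be represented is shown below; the field names and type labels are our own assumptions for exposition, not the published dataset schema.

from dataclasses import dataclass
from typing import Optional

# Hypothetical submission record; field names are illustrative, not the official HLE schema.
@dataclass
class HLESubmission:
    question: str                        # full question text (may include LaTeX)
    answer: str                          # exact-match answer, or the correct choice
    answer_type: str                     # for example, 'exactMatch' or 'multipleChoice' (assumed labels)
    choices: Optional[list[str]] = None  # answer options, for multiple-choice questions only
    rationale: str = ""                  # detailed solution explaining the answer
    subject: str = ""                    # academic subject area
    contributor: str = ""                # contributor name, for accountability
    affiliation: str = ""                # institutional affiliation
    image: Optional[str] = None          # optional image reference for multi-modal questions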
Submission format
To ensure question quality and integrity, we enforce strict submission criteria. Questions should be precise, unambiguous, solvable and non-searchable, ensuring models cannot rely on memorization or simple retrieval methods. All submissions must be original work or non-trivial syntheses of published information, although contributions from unpublished research are acceptable. Questions typically require graduate-level expertise or test knowledge of highly specific topics (for example, precise historical details, trivia and local customs) and have specific, unambiguous answers accepted by domain experts. When LLMs provide correct answers with faulty reasoning, authors are encouraged to modify question parameters, such as the number of answer choices, to discourage false positives. We require clear English with precise technical terminology, supporting LaTeX notation wherever necessary. Answers are kept short and easily verifiable for exact-match questions to support automatic grading. We prohibit open-ended questions, subjective interpretations, and content related to weapons of mass destruction. Finally, every question is accompanied by a detailed solution to verify accuracy. More details about guidelines for contributors can be found in Supplementary Information section 1.
Prize pool
To attract high-quality submissions, we establish a US$500,000 prize pool, with prizes of US$5,000 for each of the top 50 questions and US$500 for each of the next 500 questions, as determined by organizers. This incentive structure, combined with the opportunity for paper co-authorship for anyone with an accepted question in HLE, draws participation from qualified experts, particularly those with advanced degrees or notable technical experience in their fields.
Review
LLM difficulty check
To ensure question difficulty, each question is first validated against several frontier LLMs before submission (Methods). If the LLMs cannot solve the question (or, in the case of multiple-choice questions, if the models on average do worse than random guessing), the question proceeds to the next stage: human expert review. In total, we logged more than 70,000 attempts, yielding approximately 13,000 questions that stumped LLMs and were forwarded to expert human review.
Expert review
Our human reviewers possess a graduate degree (for example, master’s, PhD or JD) in their fields. Reviewers select submissions in their domain, grading them against standardized rubrics and offering feedback when applicable. There are two rounds of reviews. The first round focuses on iteratively refining submissions, with each question receiving between one and three reviews. The primary goal is to help the question contributors (who are primarily academics and researchers from a wide range of disciplines) better design questions that are closed-ended, robust and of high quality for AI evaluation. In the second round, good and outstanding questions from the first round are identified and approved by organizers and reviewers for inclusion in the final HLE dataset. Details, instructions and rubrics for both rounds can be found in Supplementary Information section 2. Figure 4 shows our full process.
We accept questions that make frontier LLMs fail, then iteratively refine them with the help of expert peer reviewers. Each question is then manually approved by organizers or expert reviewers trained by organizers. A private held-out set is kept apart from the public set to assess model overfitting and gaming on the public benchmark.
Evaluation
We evaluate the performance of state-of-the-art LLMs on HLE and analyse their capabilities across different question types and domains. We describe our evaluation setup (see section ‘Setup’) and present several quantitative results on metrics that track model performance (see section ‘Quantitative results’).
Setup
After data collection and review, we evaluated additional frontier multi-modal LLMs on our final HLE dataset. We use a standardized system prompt that structures model responses into explicit reasoning followed by a final answer. As the answers are precise and closed-ended, we use o3-mini as a judge to verify answer correctness against model predictions while accounting for equivalent formats (for example, decimals compared with fractions, or estimations). Evaluation prompts are detailed in the Methods.
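As a rough illustration of this setup, the sketch below queries an evaluated model with the standardized system prompt. It assumes the OpenAI Python client; the model identifier is a placeholder, and the released inference script noted under ‘Code availability’ is the authoritative implementation.

from openai import OpenAI  # assuming the OpenAI Python client; other providers are analogous

client = OpenAI()

SYSTEM_PROMPT = (
    "Your response should be in the following format:\n"
    "Explanation: {your explanation for your answer choice}\n"
    "Answer: {your chosen answer}\n"
    "Confidence: {your confidence score between 0% and 100% for your answer}"
)

def predict(question_text: str, model: str = "gpt-4o") -> str:
    """Query an evaluated model with the standardized system prompt."""
    resp = client.chat.completions.create(
        model=model,  # placeholder model identifier
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question_text},
        ],
    )
    return resp.choices[0].message.content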
Quantitative results
Accuracy
All frontier models achieve low accuracy on HLE (Table 1), highlighting substantial room for improvement in narrowing the gap between current LLMs and expert-level academic capabilities on closed-ended questions. These low scores are partially by design: the dataset collection process attempts to filter out questions that existing models can answer correctly. Nevertheless, we notice on evaluation that models exhibit non-zero accuracy. This is due to inherent noise in model inference: models can inconsistently guess the right answer, or guess worse than random chance on multiple-choice questions. We notice elevated accuracy on multiple-choice questions compared with exact-match questions in Extended Data Table 3. We choose to leave these questions in the dataset as a natural component instead of applying stronger adversarial filtering. However, we stress that the true capability floor of frontier models on the dataset remains an open question, and small fluctuations close to zero accuracy are not strongly indicative of progress.
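To illustrate why a non-zero floor is expected, the sketch below gives a back-of-the-envelope estimate of the accuracy a uniform random guesser would obtain; the choice counts are hypothetical placeholders (HLE multiple-choice questions have five or more options), not measured dataset statistics.

# Expected accuracy of uniform random guessing: 1/(number of choices) per multiple-choice
# question, and approximately zero for exact-match questions.
num_choices_per_mc_question = [5, 5, 6, 8, 10]  # hypothetical examples; real questions have >= 5 choices
frac_multiple_choice = 0.24                     # share of multiple-choice questions in HLE

expected_mc_accuracy = sum(1 / n for n in num_choices_per_mc_question) / len(num_choices_per_mc_question)
expected_overall_floor = frac_multiple_choice * expected_mc_accuracy
print(f"Random-guess floor: roughly {100 * expected_overall_floor:.1f}% overall accuracy")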
Calibration error
Given low performance on HLE, models should be calibrated, recognizing their uncertainty rather than confidently providing incorrect answers. To measure calibration, we prompt models to provide both an answer and their confidence from 0% to 100% (Methods), using the setup from ref. 7. The implementation of our RMS calibration error is from ref. 8. The stated confidence of a well-calibrated model should match its actual accuracy, for example, achieving 50% accuracy on questions for which it claims 50% confidence. Table 1 shows poor calibration across all models, reflected in high RMS calibration error scores. Models frequently provide incorrect answers with high confidence on HLE, failing to recognize when questions exceed their capabilities.
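For concreteness, a minimal sketch of a binned RMS calibration error is given below; the bin count and equal-width binning are simplifying assumptions, and the exact released implementation follows ref. 8.

import numpy as np

def rms_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Binned RMS calibration error.

    confidences: stated confidences in [0, 1]; correct: 1 if the answer was judged correct, else 0.
    Questions are grouped into equal-width confidence bins, and the squared gap between mean
    confidence and accuracy in each bin is averaged, weighted by bin size.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.clip((confidences * n_bins).astype(int), 0, n_bins - 1)
    total, sq_err = len(confidences), 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = confidences[mask].mean() - correct[mask].mean()
            sq_err += mask.sum() / total * gap ** 2
    return float(np.sqrt(sq_err))

# Example: a model that answers 20% of questions correctly while claiming 90% confidence
# has an RMS calibration error of about 0.7.
print(rms_calibration_error(confidences=[0.9] * 10, correct=[1, 0, 0, 0, 0, 1, 0, 0, 0, 0]))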
Inference time computation
Reasoning models are designed to spend extra compute thinking before answering: they generate intermediate reasoning tokens and then produce the final response, which means substantially more tokens must be decoded at inference time5,6. To shed light on this in our evaluation, we analyse how accuracy scales with the number of output tokens (including reasoning tokens) across several state-of-the-art reasoning models in Fig. 5. By binning output lengths on a log2 scale, we observe log-linear scaling of accuracy with more reasoning tokens; however, this trend reverses beyond 2^14 tokens, highlighting that a larger reasoning budget is not always optimal. The observation that accuracy benefits diminish beyond a certain threshold suggests that future models should improve not only their raw accuracy on HLE but also their computational efficiency.
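A sketch of the binning underlying this analysis is shown below; the token counts here are synthetic, whereas the real analysis uses the reasoning-inclusive output token counts reported by each provider’s API.

import math
from collections import defaultdict

def accuracy_by_log2_token_bin(token_counts, correct):
    """Group responses into log2-spaced bins of output-token count and report accuracy per bin."""
    bins = defaultdict(list)
    for n_tokens, is_correct in zip(token_counts, correct):
        bins[int(math.log2(max(n_tokens, 1)))].append(is_correct)
    return {f"[2^{b}, 2^{b + 1})": sum(v) / len(v) for b, v in sorted(bins.items())}

# Synthetic illustration: accuracy tends to rise with longer reasoning traces,
# but the trend reported in the paper reverses beyond roughly 2^14 output tokens.
token_counts = [900, 1800, 4000, 9000, 20000, 40000]
correct = [0, 0, 1, 1, 1, 0]
print(accuracy_by_log2_token_bin(token_counts, correct))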
Discussion
Limitations
Although present-day LLMs achieve very low accuracy on HLE, recent history shows benchmarks are quickly saturated—with models markedly progressing from near-zero to near-perfect performance in a short timeframe9,10. High accuracy on HLE would demonstrate expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge, but it would not alone suggest autonomous research capabilities or artificial general intelligence11. HLE tests structured academic problems rather than open-ended research or creative problem-solving abilities, making it a focused measure of technical knowledge and reasoning across a diverse range of subjects, albeit with a stronger representation in math and STEM (science, technology, engineering and mathematics) disciplines, as shown in Fig. 2. By pushing the limits of established closed-ended benchmarks, HLE is intended to hasten the transition towards a new class of benchmarks focused on more dynamic and open-ended AI capabilities.
Impact
By providing a clear measure of AI progress, HLE creates a common reference point for scientists and policymakers to assess AI capabilities. This enables more informed discussions about development trajectories, potential risks and necessary governance measures.
Methods
Related works
LLM benchmarks
Benchmarks are important tools for tracking the rapid advancement of LLM capabilities, including general and scientific knowledge1,10,12,13,14,15, mathematical reasoning16,17,18,19,20,21, code generation22,23,24,25,26,27,28 and general-purpose human assistance7,29,30,31,32,33,34,35. Owing to their objectivity and ease of automated scoring at scale, evaluations commonly include multiple-choice and short-answer questions31,36,37,38,39, with benchmarks such as MMLU1 also spanning a broad range of academic disciplines and levels of complexity.
Saturation and frontier benchmark design
However, state-of-the-art models now achieve nearly perfect scores on many existing evaluations, obscuring the full extent of current and future frontier AI capabilities40,41,42,43. This has motivated the development of more challenging benchmarks that test for multi-modal capabilities17,22,24,44,45,46,47,48,49,50, strengthen existing benchmarks32,44,45,51,52, filter questions over multiple stages of review9,12,19,42,53,54 and use experts to write tests for advanced academic knowledge9,12,19,54,55,56. HLE combines these approaches: the questions are developed by subject-matter experts and undergo multiple rounds of review, while preserving the broad subject-matter coverage of MMLU. As a result, HLE provides a clear measurement of the gap between current AI capabilities and human expertise on closed-ended academic tasks, complementing other assessments of advanced capabilities in open-ended domains57,58.
Dataset
Submission process
To ensure question difficulty, we automatically check the accuracy of frontier LLMs on each question before submission. Our testing process uses multi-modal LLMs for text-and-image questions (GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet and o1) and adds two non-multi-modal models (o1-mini and o1-preview) for text-only questions. We use different submission criteria by question type: exact-match questions must stump all models, whereas multiple-choice questions must stump all but one model to account for potential lucky guesses. Users are instructed to submit only questions that meet these criteria. We note that, owing to non-determinism in models and the non-zero random-guess floor on multiple-choice questions, further evaluation on the dataset exhibits low but non-zero accuracy.
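The acceptance rule of this difficulty check can be summarized by the sketch below; the function and argument names are placeholders for exposition, with the model roster as described above.

def passes_difficulty_check(question, model_answers, is_multiple_choice: bool, grade) -> bool:
    """Apply the pre-submission difficulty criterion.

    model_answers: mapping from model name to that model's answer for this question.
    grade: callable returning True if an answer is correct for this question.
    Exact-match questions must stump every model; multiple-choice questions may be
    answered correctly by at most one model, to allow for lucky guesses.
    """
    num_correct = sum(grade(question, answer) for answer in model_answers.values())
    allowed_correct = 1 if is_multiple_choice else 0
    return num_correct <= allowed_correct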
Post-release
Late contributions
In response to research community interest, we opened the platform for late contributors after the initial release, resulting in thousands of submissions. Each submission was manually reviewed by organizers. The new questions are of similar difficulty and quality to our initial dataset and form a second held-out private set, which will be used in future evaluations.
Refinement
Community feedback: owing to the advanced, specialized nature of many submissions, reviewers were not expected to verify the full accuracy of each provided solution rationale, focusing instead on whether the question aligned with the guidelines. Given this limitation in the review process, we launched a community feedback bug bounty program following the initial release of the dataset to identify and eliminate the main errors in the dataset, namely label errors and errors in the statement of the question. Each error report was manually verified by the organizers, with feedback from the original author of the question when appropriate.
Searchable questions: a question is potentially searchable if a model with search tools answered it correctly but answered incorrectly without search. Each potentially searchable question was then manually audited, and any that were easily found using web search were removed. We used GPT-4o mini/GPT-4o search and Perplexity Sonar models in this procedure. We observe that frontier model performance on HLE is similar before and after applying this procedure.
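The flagging criterion for this audit can be summarized by the sketch below; the answer-generation and grading callables are placeholders, and every flagged question was still reviewed manually.

def audit_searchability(questions, answer_with_search, answer_without_search, grade):
    """Return the questions to audit manually: those a search-enabled model answers
    correctly while the same model without search tools does not."""
    flagged = []
    for question in questions:
        correct_with_search = grade(question, answer_with_search(question))
        correct_without_search = grade(question, answer_without_search(question))
        if correct_with_search and not correct_without_search:
            flagged.append(question)
    return flagged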
Expert disagreement rate
Before release, we conducted two main rounds of auditing, each on a sample of 200 questions. We recruited students from top universities in the United States to fully solve a sample of questions from HLE. Flagged errors were routed between organizers, original question authors and auditors until consensus was reached. We used data from these audits to further refine our dataset. The first round aimed to identify common categories of imprecise questions, such as open-ended formats, reliance on rounded numerical values or submissions from authors with low acceptance rates. Based on these signals, we manually removed or revised questions with similar potential issues before conducting a second audit on a new sample of 200 questions. This iterative process yielded a final estimated expert disagreement rate of 15.4% for the public set. This level of expert disagreement is in line with what is observed in other well-known machine learning benchmarks59,60,61,62.
Disagreement rates are often higher in domains such as health and medicine. A targeted peer review on a biology, chemistry and health subset, proposed in ref. 63, found an expert disagreement rate of approximately 18%. This is also observed in other similarly expert-grade work; for example, ref. 64 notes that disagreement among expert physicians is frequent on complex health topics. To aid future community efforts in identifying other potential dataset errors, we outline below several key factors that contribute to the complexity of these audits:
- The need for multiple experts: our multi-reviewer process highlighted the complexity of these questions. In several cases, a reviewer identified an important piece of information, such as a decades-old paper or a foundational concept not immediately apparent to others, that was essential to confirming the validity of an answer. To illustrate, if we were to adopt a single-reviewer methodology in which a question is flagged based on just one dissenting expert, the disagreement rate on the aforementioned health-focused subset jumps from 18% to 25%, which is close to the approximate numbers and method from ref. 63. This discrepancy highlights the importance of a standard peer-review process, complete with multiple reviewers and author rebuttal, for HLE questions.
- Questions from research experience: HLE is intentionally designed to include questions based on insights from the direct, hands-on experiments of its contributors. This design captures knowledge gained from direct research experiences, which is often difficult to verify through standard literature searches or by external reviewers. This was done to test model knowledge beyond what is readily indexed on the internet.
- Understanding question design: designing challenging closed-ended research questions is difficult. Consequently, the objective for some HLE multiple-choice questions is to identify the most plausible answer among the provided options. Some external reviewers, unfamiliar with these design principles, sought to find external sources to support an open-ended answer rather than evaluating the best choice among the given options.
HLE-Rolling
Inspired by these valuable community discussions and researcher interest across disciplines in contributing to the dataset, and as part of our commitment to continual improvement, we will introduce a dynamic fork of the dataset post-release: HLE-Rolling. This version will be regularly updated to address community feedback and integrate new questions. Information about the updates will be made publicly available at https://lastexam.ai. Our goal is to provide a seamless migration path for researchers once frontier models begin to hit the noise ceiling on the original HLE dataset.
Prompts
We use the following system prompt for evaluating LLMs on HLE questions. For models that do not support a system prompt, we add it as a separate user prompt.
Your response should be in the following format:
Explanation: {your explanation for your answer choice}
Answer: {your chosen answer}
Confidence: {your confidence score between 0% and 100% for your answer}
We use the following system prompt to judge the model answers against the correct answers for our evaluations in Table 1. We used o3-mini-2025-01-31 with structured decoding enabled to extract the extracted_final_answer, reasoning, correct and confidence fields for each output. An example of a structured response using an LLM judge is shown in Extended Data Fig. 1.
Judge whether the following [response] to [question] is correct or not based on the precise and unambiguous [correct_answer] below.
[question]: {question}
[response]: {response}
Your judgement must be in the format and criteria specified below:
extracted_final_answer: The final exact answer extracted from the [response]. Put the extracted answer as 'None' if there is no exact, final answer to extract from the response.
[correct_answer]: {correct_answer}
reasoning: Explain why the extracted_final_answer is correct or incorrect based on [correct_answer], focusing only on if there are meaningful differences between [correct_answer] and the extracted_final_answer. Do not comment on any background to the problem, do not attempt to solve the problem, do not argue for any answer different than [correct_answer], focus only on whether the answers match.
correct: Answer 'yes' if extracted_final_answer matches the [correct_answer] given above, or is within a small margin of error for numerical problems. Answer 'no' otherwise, i.e. if there is any inconsistency, ambiguity, non-equivalency, or if the extracted answer is incorrect.
confidence: The extracted confidence score between 0% and 100% from [response]. Put 100 if there is no confidence score available.
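A minimal sketch of requesting such a structured judgement is shown below. It assumes the OpenAI Python client's structured-output parsing with a Pydantic schema mirroring the fields in the prompt above; the exact released implementation may differ.

from openai import OpenAI
from pydantic import BaseModel

class JudgeVerdict(BaseModel):
    extracted_final_answer: str
    reasoning: str
    correct: str      # 'yes' or 'no'
    confidence: int   # 0-100

# Abbreviated here; the full judge prompt is shown above, with {question}, {response}
# and {correct_answer} placeholders.
JUDGE_PROMPT = (
    "Judge whether the following [response] to [question] is correct or not based on the "
    "precise and unambiguous [correct_answer] below.\n\n"
    "[question]: {question}\n\n[response]: {response}\n\n[correct_answer]: {correct_answer}\n\n..."
)

client = OpenAI()

def judge(question: str, response: str, correct_answer: str) -> JudgeVerdict:
    """Grade one model response with a judge model constrained to the schema above."""
    completion = client.beta.chat.completions.parse(  # SDK entry point for structured outputs
        model="o3-mini-2025-01-31",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, response=response, correct_answer=correct_answer)}],
        response_format=JudgeVerdict,
    )
    return completion.choices[0].message.parsed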
Data availability
The HLE dataset is open-source and available at https://huggingface.co/datasets/cais/hle. Important updates to the project and dataset will be announced at https://lastexam.ai.
Code availability
The inference script for benchmarking AI systems on HLE is available at GitHub (https://github.com/centerforaisafety/hle).
References
Hendrycks, D. et al. Measuring massive multitask language understanding. In Proc. International Conference on Learning Representations (ICLR) https://openreview.net/forum?id=d7KBjmI3GmQ (ICLR, 2021).
Gemini Team Google. Gemini 1.5: unlocking multimodal understanding across millions of tokens of context. Preprint at https://arxiv.org/abs/2403.05530 (2024).
OpenAI et al. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2024).
The Claude 3 Model Family: Opus, Sonnet, Haiku (Anthropic, 2024).
OpenAI o1 System Card (OpenAI, 2024).
Guo, D. et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 633–638 (2025).
Wei, J. et al. Measuring short-form factuality in large language models. Preprint at https://arxiv.org/abs/2411.04368 (2024).
Hendrycks, D. et al. PixMix: Dreamlike pictures comprehensively improve safety measures. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16783–16792 (IEEE/CVF, 2022).
Rein, D. et al. GPQA: A graduate-level Google-proof Q&A benchmark. In Proc. First Conference on Language Modeling (COLM) https://openreview.net/forum?id=Ti67584b98 (COLM, 2024).
Chollet, F., Knoop, M., Kamradt, G. & Landers, B. ARC prize 2024: technical report. Preprint at https://arxiv.org/abs/2412.04604 (2024).
Hendrycks, D. et al. A definition of AGI. Preprint at https://arxiv.org/abs/2510.18212 (2025).
Li, N. et al. The WMDP benchmark: measuring and reducing malicious use with unlearning. In Proc. 41st International Conference on Machine Learning (ICML), 28713–28738 (PMLR, 2024).
Laurent, J. M. et al. LAB-bench: measuring capabilities of language models for biology research. Preprint at https://arxiv.org/abs/2407.10362 (2024).
Srivastava, A. et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Trans. Mach. Learn. Res. https://openreview.net/forum?id=uyTL5Bvosj (2023).
Zhong, W. et al. AGIEval: a human-centric benchmark for evaluating foundation models. Preprint at https://arxiv.org/abs/2304.06364 (2023).
Hendrycks, D. et al. Measuring mathematical problem solving with the MATH dataset. In Proc. 35th Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track https://openreview.net/forum?id=7Bywt2mQsCe (NeurIPS, 2021).
Lu, P. et al. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In Proc. International Conference on Learning Representations (ICLR) https://openreview.net/forum?id=KUNzEQMWU7 (ICLR, 2024).
Cobbe, K. et al. Training verifiers to solve math word problems. Preprint at https://arxiv.org/abs/2110.14168 (2021).
Glazer, E. et al. FrontierMath: a benchmark for evaluating advanced mathematical reasoning in AI. Preprint at https://arxiv.org/abs/2411.04872 (2024).
He, C. et al. OlympiadBench: A challenging benchmark for promoting AGI with Olympiad-level bilingual multimodal scientific problems. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 3828–3850 (ACL, 2024).
Gao, B. et al. Omni-MATH: A universal Olympiad level mathematic benchmark for large language models. In Proc. International Conference on Learning Representations (ICLR) https://openreview.net/forum?id=yaqPf0KAlN (ICLR, 2025).
Chan, J. S. et al. MLE-bench: Evaluating machine learning agents on machine learning engineering. In Proc. International Conference on Learning Representations (ICLR) https://openreview.net/forum?id=6s5uXNWGIh (ICLR, 2025).
Zhang, A. K. et al. Cybench: A framework for evaluating cybersecurity capabilities and risks of language models. In Proc. International Conference on Learning Representations (ICLR) https://openreview.net/forum?id=tc90LV0yRL (ICLR, 2025).
Jimenez, C. E. et al. SWE-bench: Can language models resolve real-world Github issues? In Proc. International Conference on Learning Representations (ICLR) https://openreview.net/forum?id=VTF8yNQM66 (ICLR, 2024).
Chen, M. et al. Evaluating large language models trained on code. Preprint at https://arxiv.org/abs/2107.03374 (2021).
Hendrycks, D. et al. Measuring coding challenge competence with APPS. In Proc. 35th Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track https://openreview.net/forum?id=sD93GOzH3i5 (NeurIPS, 2021).
Bhatt, M. et al. Purple Llama CyberSecEval: a secure coding benchmark for language models. Preprint at https://arxiv.org/abs/2312.04724 (2023).
Austin, J. et al. Program synthesis with large language models. Preprint at https://arxiv.org/abs/2108.07732 (2021).
Bai, Y. et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. Preprint at https://arxiv.org/abs/2204.05862 (2022).
Perez, E. et al. Discovering language model behaviors with model-written evaluations. In Findings of the Association for Computational Linguistics: ACL 2023, 13387–13434 (ACL, 2023).
Rajpurkar, P. et al. SQuAD: 100,000+ questions for machine comprehension of text. In Proc. 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2383–2392 (EMNLP, 2016).
Rajpurkar, P. et al. Know what you don’t know: Unanswerable questions for SQuAD. In Proc. 56th Annual Meeting of the Association for Computational Linguistics (ACL), 784–789 (ACL, 2018).
Bajaj, P. et al. MS MARCO: a human generated machine reading comprehension dataset. Preprint at https://arxiv.org/abs/1611.09268 (2018).
Hendrycks, D. et al. What would Jiminy Cricket do? Towards agents that behave morally. In Proc. 35th Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track https://openreview.net/forum?id=G1muTb5zuO7 (NeurIPS, 2021).
Phan, L., Mazeika, M., Zou, A. & Hendrycks, D. TextQuests: how good are LLMs at text-based video games? Preprint at https://arxiv.org/abs/2507.23701 (2025).
Wang, A. et al. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proc. International Conference on Learning Representations (ICLR) https://openreview.net/forum?id=rJ4km2R5t7 (ICLR, 2019).
Wang, A. et al. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Proc. Advances in Neural Information Processing Systems (NeurIPS), Vol. 32, 3261–3275 (NeurIPS, 2019).
Yang, Z. et al. HotpotQA: A dataset for diverse, explainable multihop question answering. In Proc. 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2369–2380 (EMNLP, 2018).
Dua, D. et al. DROP: a reading comprehension benchmark requiring discrete reasoning over paragraphs. Preprint at https://arxiv.org/abs/1903.00161 (2019).
Ott, S., Barbosa-Silva, A., Blagec, K., Brauner, J. & Samwald, M. Mapping global dynamics of benchmark creation and saturation in artificial intelligence. Nat. Commun. 13, 6793 (2022).
Owen, D. How predictable is language model benchmark performance? Preprint at https://arxiv.org/abs/2401.04757 (2024).
Kiela, D. et al. Dynabench: Rethinking benchmarking in NLP. In Proc. 2021 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 4110–4124 (NAACL, 2021).
McIntosh, T. R. et al. Inadequacies of large language model benchmarks in the era of generative artificial intelligence. IEEE Trans. Artif. Intell. https://doi.org/10.1109/TAI.2025.3569516 (2025).
Wang, Y. et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. In Proc. Advances in Neural Information Processing Systems (NeurIPS), article no. 3018 (NeurIPS, 2024).
Taghanaki, S. A., Khani, A. & Khasahmadi, A. MMLU-Pro+: evaluating higher-order reasoning and shortcut learning in LLMs. Preprint at https://arxiv.org/abs/2409.02257 (2024).
Yao, S. et al. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. In Proc. International Conference on Learning Representations (ICLR) https://openreview.net/forum?id=roNSXZpUDN (ICLR, 2025).
Andriushchenko, M. et al. AgentHarm: A benchmark for measuring harmfulness of LLM agents. In Proc. International Conference on Learning Representations (ICLR) https://openreview.net/forum?id=AC5n7xHuR1 (ICLR, 2025).
Kumar, P. et al. Refusal-trained LLMs are easily jailbroken as browser agents. Preprint at https://arxiv.org/abs/2410.13886 (2024).
Yan, F. et al. Berkeley Function Calling Leaderboard https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html (2024).
Srinivasan, V. K. et al. NexusRaven: a commercially-permissive language model for function calling. In NeurIPS 2023 Foundation Models for Decision Making Workshop (NeurIPS, 2023).
Hosseini, A., Sordoni, A., Toyama, D., Courville, A. & Agarwal, R. Not all LLM reasoners are created equal. Preprint at https://arxiv.org/abs/2410.01748 (2024).
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
Nie, Y. et al. Adversarial NLI: A new benchmark for natural language understanding. In Proc. 58th Annual Meeting of the Association for Computational Linguistics (ACL), 4885–4901 (ACL, 2020).
Götting, J. et al. Virology capabilities test (VCT): a multimodal virology Q&A benchmark. Preprint at https://arxiv.org/abs/2504.16137 (2025).
Phuong, M. et al. Evaluating frontier models for dangerous capabilities. Preprint at https://arxiv.org/abs/2403.13793 (2024).
Anthropic’s Responsible Scaling Policy Updates https://www.anthropic.com/rsp-updates (Anthropic, 2024).
Mazeika, M. et al. Remote labor index: measuring AI automation of remote work. Preprint at https://arxiv.org/abs/2510.26787 (2025).
Patwardhan, T. et al. GDPval: evaluating AI model performance on real-world economically valuable tasks. Preprint at https://arxiv.org/abs/2510.04374 (2025).
Kwiatkowski, T. et al. Natural questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguist. 7, 452–466 (2019).
Antol, S. et al. VQA: Visual question answering. In Proc. IEEE International Conference on Computer Vision (ICCV), 2425–2433 (IEEE, 2015).
Reddy, S., Chen, D. & Manning, C. D. CoQA: a conversational question answering challenge. Trans. Assoc. Comput. Linguist. 7, 249–266 (2019).
Bowman, S. R., Angeli, G., Potts, C. & Manning, C. D. A large annotated corpus for learning natural language inference. In Proc. 2015 Conference on Empirical Methods in Natural Language Processing (eds Màrquez, L. et al.), 632–642 (ACL, 2015).
Skarlinski, M., Laurent, J., Bou, A. & White, A. About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong. FutureHouse https://www.futurehouse.org/research-announcements/hle-exam (2025).
Arora, R. K. et al. HealthBench: evaluating large language models towards improved human health. Preprint at https://arxiv.org/abs/2505.08775 (2025).
Acknowledgements
This research is supported by the Center for AI Safety and Scale AI.
Author information
Contributions
All authors have contributed to the dataset creation process. The Center for AI Safety and Scale AI consortia jointly designed the dataset premise and pipeline; operated the data collection platform (https://lastexam.ai); and provided funding, inference infrastructure for LLMs and review/auditing resources. The authors in the HLE Contributors Consortium contributed to the dataset in various ways, including submitting at least one accepted question to one of the dataset versions, contributing to dataset refinement or assisting with evaluations. In the Center for AI Safety and Scale AI, Long Phan, Alice Gatti, Ziwen Han and Nathaniel Li led the project, and Summer Yue, Alexandr Wang and Dan Hendrycks provided senior supervision.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
Extended Data Fig. 1 Example of a structured response using an LLM judge.
Exact-match answers in HLE sometimes require several reasoning steps to compare the AI’s final answer with the correct answer; therefore, a capable LLM judge with reasoning capabilities is necessary.
Supplementary information
Supplementary Information containing the contributor guidelines and human review instructions.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Center for AI Safety., Scale AI. & HLE Contributors Consortium. A benchmark of expert-level academic questions to assess AI capabilities. Nature 649, 1139–1146 (2026). https://doi.org/10.1038/s41586-025-09962-4