Main

The capabilities of large language models (LLMs) have advanced markedly, exceeding human performance across a diverse array of tasks. To systematically measure these capabilities, LLMs are evaluated on benchmarks: collections of questions that assess model performance on tasks such as math, programming or biology. However, state-of-the-art LLMs2,3,4,5,6 now achieve more than 90% accuracy on popular benchmarks such as Measuring Massive Multitask Language Understanding (MMLU)1, which were once challenging frontiers for LLMs. The saturation of existing benchmarks, as shown in Fig. 1, limits our ability to precisely measure artificial intelligence (AI) capabilities and calls for more challenging evaluations that can meaningfully assess the rapid improvements in LLM capabilities at the frontiers of human knowledge.

Fig. 1: Performance of frontier LLMs on popular benchmarks and HLE.
figure 1

Compared with the saturation of other popular capability benchmarks, HLE accuracy remains low across several frontier models, demonstrating its effectiveness for measuring advanced, closed-ended, academic capabilities.

To address this gap, we introduce HLE (Humanity’s Last Exam; we use the abbreviation HLE throughout this paper), a benchmark of 2,500 challenging questions from dozens of subject areas, designed to assess LLM capabilities at an expert level in broad academic subjects. HLE is developed by academics and domain experts, providing a precise measure of capabilities as LLMs continue to improve (see section ‘Collection’). HLE is multi-modal, featuring questions that are either text-only or accompanied by an image reference, and it includes both multiple-choice and exact-match questions for automated answer verification. Questions are original, precise, unambiguous and resistant to simple internet lookup or database retrieval. Among the diversity of questions in the benchmark, HLE emphasizes world-class mathematics problems aimed at testing deep reasoning skills broadly applicable across multiple academic areas.

We use a multi-stage review process to thoroughly ensure question difficulty and quality (see section ‘Review’). Before submission, each question is tested against state-of-the-art LLMs to verify its difficulty—questions are rejected if LLMs can answer them correctly. Submitted questions are then processed through a two-stage reviewing process: (1) an initial feedback round with multiple graduate-level reviewers and (2) approval by organizers and expert reviewers, ensuring quality and adherence to our submission criteria. Following the release, we conducted a public review period, welcoming community feedback to correct any points of concern in the dataset.

Frontier LLMs consistently demonstrate low accuracy on HLE, highlighting a marked gap between current capabilities and expert-level academic performance (see section ‘Evaluation’). Models also provide incorrect answers with high confidence rather than acknowledging uncertainty on these challenging questions, with most models exhibiting root mean square (RMS) calibration errors above 70%.

As AI systems approach human expert performance in many domains, precise measurement of their capabilities and limitations is essential for informing research, governance and the broader public. High performance on HLE would suggest expert-level capabilities on closed-ended academic questions. To establish a common reference point for assessing these capabilities, we publicly release the 2,500 questions of HLE to enable this precise measurement, while maintaining a private test set to assess potential model overfitting.

Dataset

Collection

HLE consists of 2,500 challenging questions across over a hundred subjects. A high-level summary is provided in Fig. 2. HLE is a global collaborative effort, with questions from nearly 1,000 subject expert contributors affiliated with more than 500 institutions across 50 countries—most of whom are professors, researchers and graduate degree holders. Examples of the diverse and challenging questions submitted to HLE are shown in Fig. 3.

Fig. 2: Distribution of HLE questions across categories.
figure 2

HLE consists of 2,500 exam questions in over a hundred subjects, grouped into eight high-level categories.

Fig. 3: Example questions from HLE.
figure 3

Samples of the diverse and challenging questions submitted to HLE.

Question style

HLE contains two question formats: exact-match questions (models provide an exact string as output) and multiple-choice questions (the model selects one of five or more answer choices). HLE is a multi-modal benchmark, with around 14% of questions requiring comprehension of both text and an image; 24% of questions are multiple-choice, with the remainder being exact match.

Each question submission includes several required components: the question text itself, answer specifications (either an exact-match answer or multiple-choice options with the correct answer marked), a detailed rationale explaining the solution, the academic subject, and the contributor’s name and institutional affiliation to maintain accountability and accuracy.
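
For concreteness, a question record might be represented as in the Python sketch below; the field names and type labels are illustrative assumptions for exposition and do not necessarily match the released dataset schema.

from dataclasses import dataclass, field

# Illustrative sketch of a question record; field names are assumptions
# for exposition and do not necessarily match the released dataset schema.
@dataclass
class HLEQuestion:
    question: str                        # full question text (may include LaTeX)
    answer_type: str                     # "exact_match" or "multiple_choice" (illustrative labels)
    answer: str                          # gold answer: exact string or the correct choice
    choices: list[str] = field(default_factory=list)  # five or more options for multiple choice
    rationale: str = ""                  # detailed solution explaining the answer
    subject: str = ""                    # academic subject / high-level category
    contributor: str = ""                # author name, for accountability
    affiliation: str = ""                # contributor's institution
    image: str | None = None             # optional image reference for multi-modal questions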

Submission format

To ensure question quality and integrity, we enforce strict submission criteria. Questions should be precise, unambiguous, solvable and non-searchable, ensuring models cannot rely on memorization or simple retrieval methods. All submissions must be original work or non-trivial syntheses of published information, although contributions from unpublished research are acceptable. Questions typically require graduate-level expertise or test knowledge of highly specific topics (for example, precise historical details, trivia and local customs) and have specific, unambiguous answers accepted by domain experts. When LLMs provide correct answers with faulty reasoning, authors are encouraged to modify question parameters, such as the number of answer choices, to discourage false positives. We require clear English with precise technical terminology, supporting LaTeX notation wherever necessary. Answers are kept short and easily verifiable for exact-match questions to support automatic grading. We prohibit open-ended questions, subjective interpretations, and content related to weapons of mass destruction. Finally, every question is accompanied by a detailed solution to verify accuracy. More details about guidelines for contributors can be found in Supplementary Information section 1.

Prize pool

To attract high-quality submissions, we establish a US$500,000 prize pool, with prizes of US$5,000 for each of the top 50 questions and US$500 for each of the next 500 questions, as determined by organizers. This incentive structure, combined with the opportunity for paper co-authorship for anyone with an accepted question in HLE, draws participation from qualified experts, particularly those with advanced degrees or notable technical experience in their fields.

Review

LLM difficulty check

To ensure question difficulty, each question is first validated against several frontier LLMs before submission (Methods). If the LLMs cannot solve the question (or, in the case of multiple-choice questions, if the models on average do worse than random guessing), the question proceeds to the next stage: human expert review. In total, we logged more than 70,000 attempts, resulting in approximately 13,000 questions that stumped LLMs and were forwarded to expert human review.

Expert review

Our human reviewers possess a graduate degree (for example, master’s, PhD or JD) in their fields. Reviewers select submissions in their domain, grading them against standardized rubrics and offering feedback when applicable. There are two rounds of reviews. The first round focuses on iteratively refining submissions, with each question receiving between one and three reviews. The primary goal is to help the question contributors (who are primarily academics and researchers from a wide range of disciplines) better design questions that are closed-ended, robust and of high quality for AI evaluation. In the second round, good and outstanding questions from the first round are identified and approved by organizers and reviewers for inclusion in the final HLE dataset. Details, instructions and rubrics for both rounds can be found in Supplementary Information section 2. Figure 4 shows our full process.

Fig. 4: HLE dataset creation pipeline.
figure 4

We accept questions that make frontier LLMs fail, then iteratively refine them with the help of expert peer reviewers. Each question is then manually approved by organizers or expert reviewers trained by organizers. A private held-out set is kept apart from the public set to assess model overfitting and gaming on the public benchmark.

Evaluation

We evaluate the performance of state-of-the-art LLMs on HLE and analyse their capabilities across different question types and domains. We describe our evaluation setup (see section ‘Setup’) and present several quantitative results on metrics that track model performance (see section ‘Quantitative results’).

Setup

After data collection and review, we evaluated our final HLE dataset on additional frontier multi-modal LLMs. We use a standardized system prompt that structures model responses into explicit reasoning followed by a final answer. As the answers are precise and closed-ended, we use o3-mini as a judge to verify answer correctness against model predictions while accounting for equivalent formats (for example, decimals compared with fractions or estimations). Evaluation prompts are detailed in the Methods.
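
A minimal sketch of this judging step is shown below, assuming the official OpenAI Python client; for brevity it asks the judge for a plain yes/no verdict rather than the structured output described in the Methods (see section ‘Prompts’).

from openai import OpenAI  # assumes the official OpenAI Python client is installed and configured

client = OpenAI()

JUDGE_TEMPLATE = """Judge whether the following [response] to [question] is correct or not
based on the precise and unambiguous [correct_answer] below.

[question]: {question}
[response]: {response}
[correct_answer]: {correct_answer}

Reply with 'correct: yes' or 'correct: no'."""

def judge_answer(question: str, response: str, correct_answer: str) -> bool:
    """Ask an LLM judge whether a model response matches the gold answer (simplified sketch)."""
    completion = client.chat.completions.create(
        model="o3-mini",  # judge model named in the paper; plain-text verdict here instead of structured decoding
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            question=question, response=response, correct_answer=correct_answer)}],
    )
    verdict = (completion.choices[0].message.content or "").lower()
    return "correct: yes" in verdict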

Quantitative results

Accuracy

All frontier models achieve low accuracy on HLE (Table 1), highlighting substantial room for improvement in narrowing the gap between current LLMs and expert-level academic capabilities on closed-ended questions. These low scores are partially by design: the dataset collection process attempts to filter out questions that existing models can answer correctly. Nevertheless, we notice on evaluation that models exhibit non-zero accuracy. This is due to inherent noise in model inference—models can inconsistently guess the right answer or guess worse than random chance for multiple-choice questions. We notice an elevated accuracy on multiple-choice questions compared with exact-answer questions in Extended Data Table 3. We choose to leave these questions in the dataset as a natural component instead of strongly adversarially filtering. However, we stress that the true capability floor of frontier models on the dataset remains an open question, and small inflections close to zero accuracy are not strongly indicative of progress.

Table 1 Accuracy and RMS calibration error of different models on HLE, demonstrating low accuracy and high calibration error across all models

Calibration error

Given low performance on HLE, models should be calibrated, recognizing their uncertainty rather than confidently providing incorrect answers. To measure calibration, we prompt models to provide both an answer and their confidence from 0% to 100% (Methods), using the setup from ref. 7. The implementation of our RMS calibration error is from ref. 8. The stated confidence of a well-calibrated model should match its actual accuracy, for example, achieving 50% accuracy on questions on which it claims 50% confidence. Table 1 shows poor calibration across all models, reflected in high RMS calibration error scores. Models frequently provide incorrect answers with high confidence on HLE, failing to recognize when questions exceed their capabilities.
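
For illustration, a simplified version of this metric can be computed by binning predictions by stated confidence and taking the sample-weighted RMS gap between mean confidence and accuracy in each bin; the sketch below uses equal-width bins and may differ in detail from the implementation in ref. 8.

import numpy as np

def rms_calibration_error(confidences, correct, n_bins=10):
    """Sample-weighted RMS gap between mean confidence and accuracy within
    equal-width confidence bins (simplified sketch of the metric in ref. 8)."""
    confidences = np.asarray(confidences, dtype=float)  # stated confidence in [0, 1]
    correct = np.asarray(correct, dtype=float)          # 1.0 if judged correct, else 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total, sq_err = len(confidences), 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        in_bin = (confidences >= lo) & ((confidences < hi) if i < n_bins - 1 else (confidences <= hi))
        if not in_bin.any():
            continue
        gap = confidences[in_bin].mean() - correct[in_bin].mean()
        sq_err += (in_bin.sum() / total) * gap ** 2
    return float(np.sqrt(sq_err))

Under this definition, a model that always states 90% confidence but answers only 10% of questions correctly would have an RMS calibration error of 80%.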

Inference time computation

Reasoning models are designed to spend extra compute thinking before answering: they generate intermediate reasoning tokens and then produce the final response, which means substantially more tokens must be decoded at inference time5,6. To shed light on this in our evaluation, we analyse the compute-intensive scaling of output tokens (including reasoning tokens) across several state-of-the-art reasoning models in Fig. 5. By binning output lengths on a log2 scale, we observe a log-linear scaling of accuracy with more reasoning tokens; however, this trend reverses after 2^14 tokens, highlighting that a larger reasoning budget is not always optimal. The observation that accuracy benefits diminish beyond a certain threshold suggests that future models should improve not only their raw accuracy on HLE but also their computational efficiency.
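
The binning underlying this analysis can be sketched as follows; the helper below is an illustrative reconstruction, not the exact analysis code, and assumes per-response output-token counts that include reasoning tokens.

import math
from collections import defaultdict

def accuracy_by_token_budget(output_token_counts, judged_correct):
    """Bin responses by floor(log2(total output tokens, including reasoning tokens))
    and report the mean accuracy per bin (sketch of the analysis behind Fig. 5)."""
    bins = defaultdict(list)
    for n_tokens, is_correct in zip(output_token_counts, judged_correct):
        bins[int(math.log2(max(n_tokens, 1)))].append(float(is_correct))
    # keyed by the lower edge of each bin, e.g. 2**14 for responses of 16,384-32,767 tokens
    return {2 ** b: sum(v) / len(v) for b, v in sorted(bins.items())}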

Fig. 5: Accuracy compared with reasoning token budget.
figure 5

Accuracy binned by the total number of generated output tokens, showing a log-linear increase in accuracy peaking around 2^14 tokens before reversing.

Discussion

Limitations

Although present-day LLMs achieve very low accuracy on HLE, recent history shows benchmarks are quickly saturated—with models markedly progressing from near-zero to near-perfect performance in a short timeframe9,10. High accuracy on HLE would demonstrate expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge, but it would not alone suggest autonomous research capabilities or artificial general intelligence11. HLE tests structured academic problems rather than open-ended research or creative problem-solving abilities, making it a focused measure of technical knowledge and reasoning across a diverse range of subjects, albeit with a stronger representation in math and STEM (science, technology, engineering and mathematics) disciplines, as shown in Fig. 2. By pushing the limits of established closed-ended benchmarks, HLE is intended to hasten the transition towards a new class of benchmarks focused on more dynamic and open-ended AI capabilities.

Impact

By providing a clear measure of AI progress, HLE creates a common reference point for scientists and policymakers to assess AI capabilities. This enables more informed discussions about development trajectories, potential risks and necessary governance measures.

Methods

Related works

LLM benchmarks

Benchmarks are important tools for tracking the rapid advancement of LLM capabilities, including general and scientific knowledge1,10,12,13,14,15, mathematical reasoning16,17,18,19,20,21, code generation22,23,24,25,26,27,28 and general-purpose human assistance7,29,30,31,32,33,34,35. Owing to their objectivity and ease of automated scoring at scale, evaluations commonly include multiple-choice and short-answer questions31,36,37,38,39, with benchmarks such as MMLU1 also spanning a broad range of academic disciplines and levels of complexity.

Saturation and frontier benchmark design

However, state-of-the-art models now achieve nearly perfect scores on many existing evaluations, obscuring the full extent of current and future frontier AI capabilities40,41,42,43. This has motivated the development of more challenging benchmarks that test for multi-modal capabilities17,22,24,44,45,46,47,48,49,50, strengthen existing benchmarks32,44,45,51,52, filter questions over multiple stages of review9,12,19,42,53,54 and use experts to write tests for advanced academic knowledge9,12,19,54,55,56. HLE combines these approaches: the questions are developed by subject-matter experts and undergo multiple rounds of review, while preserving the broad subject-matter coverage of MMLU. As a result, HLE provides a clear measurement of the gap between current AI capabilities and human expertise on closed-ended academic tasks, complementing other assessments of advanced capabilities in open-ended domains57,58.

Dataset

Submission process

To ensure question difficulty, we automatically check the accuracy of frontier LLMs on each question before submission. Our testing process uses multi-modal LLMs for text-and-image questions (GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet and o1) and adds two non-multi-modal models (o1-mini and o1-preview) for text-only questions. We use different submission criteria by question type: exact-match questions must stump all models, whereas multiple-choice questions must stump all but one model to account for potential lucky guesses. Users are instructed to submit only questions that meet these criteria. We note that due to non-determinism in models and a non-zero floor in multiple-choice questions, further evaluation on the dataset exhibits some low but non-zero accuracy.
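
A sketch of this filter is shown below; the question-type labels are illustrative, and the booleans are assumed to come from one grading run per tested model.

def passes_difficulty_check(model_correct, question_type):
    """Pre-submission difficulty filter (sketch): exact-match questions must stump
    every tested frontier model, whereas multiple-choice questions may be answered
    correctly by at most one model, to allow for lucky guesses."""
    n_correct = sum(model_correct)  # one boolean per tested frontier model
    if question_type == "exact_match":
        return n_correct == 0
    if question_type == "multiple_choice":
        return n_correct <= 1
    raise ValueError(f"unknown question type: {question_type}")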

Post-release

Late contributions

In response to research community interest, we opened the platform for late contributors after the initial release, resulting in thousands of submissions. Each submission was manually reviewed by organizers. The new questions are of similar difficulty and quality to our initial dataset, resulting in a second held-out private set, which will be used in future evaluations.

Refinement

Community feedback: owing to the advanced, specialized nature of many submissions, reviewers were not expected to verify the full accuracy of each provided solution rationale, instead focusing on whether the question aligns with guidelines. Given this limitation in the review process, we launched a community feedback bug bounty program following the initial release of the dataset to identify and eliminate the main errors in the dataset, namely, label errors and other errors in the statement of the question. Each error report was manually verified by the organizers with feedback from the original author of the question when appropriate.

Searchable questions: a question is potentially searchable if a model with search tools answered correctly, but answered incorrectly without search. Each of these potentially searchable questions was then manually audited, removing any that were easily found using web search. We used GPT-4o mini/GPT-4o search and Perplexity Sonar models in this procedure. We observe that current frontier model performance on HLE after applying this procedure is similar to the performance on HLE before applying this procedure.
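
The audit filter can be sketched as follows; answered_correctly is a hypothetical helper that runs a given model with or without search tools and grades its answer, and flagged questions were still audited manually rather than removed automatically.

def flag_potentially_searchable(questions, answered_correctly):
    """Flag questions that a search-augmented model answers correctly but the same
    model answers incorrectly without search tools (sketch of the audit filter).
    answered_correctly(question, use_search) is a hypothetical grading helper."""
    return [
        q for q in questions
        if answered_correctly(q, use_search=True) and not answered_correctly(q, use_search=False)
    ]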

Expert disagreement rate

Before release, we conducted two main rounds of auditing, each on a sample of 200 questions. We recruited students from top universities in the United States to fully solve a sample of questions from HLE. Errors flagged were routed between organizers, original question authors and auditors until consensus was reached. We used data from these audits to further refine our dataset. The first round aimed to identify common categories of imprecise questions, such as open-ended formats, reliance on rounded numerical values or submissions from authors with low acceptance rates. Based on these signals, we manually removed or revised questions with similar potential issues before conducting a second audit on a new sample of 200 questions. This iterative process yielded a final estimated expert disagreement rate of 15.4% for the public set. This level of expert disagreement is in line with what is observed in other well-known machine learning benchmarks59,60,61,62.

Disagreement rates are often higher in domains such as health and medicine. A targeted peer review on a biology, chemistry and health subset, proposed in ref. 63, found an expert disagreement rate of approximately 18%. This is also observed in other similarly expert-grade work; for example, ref. 64 notes that disagreement among expert physicians is frequent on complex health topics. To aid future community efforts in identifying other potential dataset errors, we outline several key factors that contribute to the complexity of these audits below:

  • The need for multiple experts: our multi-reviewer process highlighted the complexity of these questions. In several cases, a reviewer identified an important piece of information, such as a decades-old paper or a foundational concept not immediately apparent to others, that was essential to confirming the validity of an answer. To illustrate, if we were to adopt a single-reviewer methodology in which a question is flagged based on just one dissenting expert, the disagreement rate on the aforementioned health-focused subset jumps from 18% to 25%, which is close to the approximate numbers and method from ref. 63. This discrepancy highlights the importance of a standard peer-review process, complete with multiple reviewers and author rebuttal, for HLE questions.

  • Questions from research experience: HLE is intentionally designed to include questions based on insights from the direct, hands-on experiments of its contributors. This design captures knowledge gained from direct research experiences, which is often difficult to verify through standard literature searches or by external reviewers. This was done to test model knowledge beyond what is readily indexed on the internet.

  • Understanding question design: designing challenging closed-ended research questions is difficult. Consequently, the objective for some HLE multiple-choice questions is to identify the most plausible answer among the provided options. Some external reviewers, unfamiliar with these design principles, sought to find external sources to support an open-ended answer rather than evaluating the best choice among the given options.

HLE-Rolling

Inspired by these valuable community discussions and researcher interest across disciplines in contributing to the dataset, and as part of our commitment to continual improvement, we will introduce a dynamic fork of the dataset post-release: HLE-Rolling. This version will be regularly updated to address community feedback and integrate new questions. Information about the updates will be made publicly available at https://lastexam.ai. Our goal is to provide a seamless migration path for researchers once frontier models begin to reach the noise-ceiling performance of the original HLE dataset.

Prompts

We use the following system prompt for evaluating LLMs on HLE questions. For models that do not support a system prompt, we add it as a separate user prompt.

Your response should be in the following format:

Explanation: {your explanation for your answer choice}

Answer: {your chosen answer}

Confidence: {your confidence score between 0% and 100% for your answer}

We use the following system prompt to judge the model answers against the correct answers for our evaluations in Table 1. We used o3-mini-2025-01-31 with structured decoding enabled to extract the extracted_final_answer, reasoning, correct and confidence fields from each output. An example of a structured response using an LLM judge is shown in Extended Data Fig. 1.

Judge whether the following [response] to [question] is correct or not based on the precise and unambiguous [correct_answer] below.

[question]: {question}

[response]: {response}

Your judgement must be in the format and criteria specified below:

extracted_final_answer: The final exact answer extracted from the [response]. Put the extracted answer as 'None' if there is no exact, final answer to extract from the response.

[correct_answer]: {correct_answer}

reasoning: Explain why the extracted_final_answer is correct or incorrect based on [correct_answer], focusing only on if there are meaningful differences between [correct_answer] and the extracted_final_answer. Do not comment on any background to the problem, do not attempt to solve the problem, do not argue for any answer different than [correct_answer], focus only on whether the answers match.

correct: Answer 'yes' if extracted_final_answer matches the [correct_answer] given above, or is within a small margin of error for numerical problems. Answer 'no' otherwise, i.e. if there is any inconsistency, ambiguity, non-equivalency, or if the extracted answer is incorrect.

confidence: The extracted confidence score between 0% and 100% from [response]. Put 100 if there is no confidence score available.