Humanity's Last Exam - The Ultimate Test of AI's Reasoning

January 8, 2026

In the ever-evolving landscape of artificial intelligence, researchers are constantly raising the bar to gauge true machine intelligence. Lately, one evocative phrase has been making waves in the AI community: "Humanity's Last Exam." This term refers to a rigorous new benchmark designed to push the limits of AI's reasoning abilities. Understanding what Humanity's Last Exam (HLE) entails and why it was created offers a revealing glimpse into the current state of AI and where it still falls short of human expertise.

If you're new to AI and want to quickly test your knowledge on basic terms and understanding, take our quick two-minute test here.

What Is Humanity's Last Exam?

As large language models (LLMs) have advanced, the need for robust evaluations has intensified. Benchmarks are standardized sets of questions or tasks that help researchers compare models and track progress. Humanity's Last Exam is a next-generation benchmark created to test an AI model’s reasoning prowess, not just its capacity to memorize or pattern-match answers. In essence, HLE serves as a comprehensive exam comprising thousands of expert-level questions across a broad range of academic fields. It aims to measure how well a model can solve problems that even human specialists find challenging, from advanced mathematics and computer science to literature analysis and world history.
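
For readers who want to look at the questions themselves, the public portion of the benchmark is distributed as a dataset. The sketch below is a minimal example of inspecting it, assuming the public questions are hosted on Hugging Face under the cais/hle identifier and that the split and field names are as guessed here; adjust them to the actual schema.

```python
# Minimal sketch of inspecting HLE's public questions.
# Assumptions: the dataset lives on Hugging Face as "cais/hle", the public
# split is named "test", and fields like "question" and "category" exist.
from datasets import load_dataset

hle = load_dataset("cais/hle", split="test")
print(f"Loaded {len(hle)} public questions")

sample = hle[0]
print(sample.get("category"), "-", str(sample.get("question", ""))[:200])
```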

Why Do We Need a Last Exam?

With many AI benchmarks already in existence, one might ask: why introduce another? The answer lies in the saturation of earlier tests. Popular benchmarks that once challenged AI, such as the Massive Multitask Language Understanding (MMLU) exam, have been effectively conquered by frontier models, which now often score above 90% on them. When state-of-the-art models nearly ace a test, that benchmark loses its utility for distinguishing the truly intelligent from the merely good. Humanity's Last Exam was conceived to raise the difficulty bar. Its questions demand multi-step reasoning and deep understanding, moving beyond superficial recall. By forcing AI systems to work through complex, novel problems, HLE can better reveal meaningful differences in reasoning ability that easier tests can no longer capture.

How Was HLE Developed?

The origins of Humanity's Last Exam trace back to late 2024, when the non-profit Center for AI Safety teamed up with the data platform Scale AI to build a more challenging benchmark. Under the leadership of AI researcher Dan Hendrycks, the project took an innovative approach: crowdsourcing thousands of graduate-level questions from experts across various disciplines. To attract high-quality contributions, the organizers offered substantial prizes, rewarding the top 50 question contributors with $5,000 each, and the next 500 contributors with $500 apiece.

The result of this global call was an enormous pool of expert-level questions spanning subjects as diverse as advanced mathematics, theoretical physics, computer science, literature, music theory, and history. Crucially, each submitted question was intended to stump even the most advanced AI models of the day.

What's Included in Humanity's Last Exam?

The creators of HLE describe it as “the final closed-ended benchmark for broad academic skills.” In practice, this means the exam is designed to cover a little bit of everything, unified by a requirement for reasoning. The questions are intentionally complex and often require multiple steps of logic or calculation, thereby preventing AI models from succeeding through simple guessing or memorization.

In its full form, Humanity's Last Exam consists of approximately 3,000 questions divided into two parts: 2,500 questions are publicly available, and around 500 more are kept in a private holdout set (used for official evaluations so that models cannot simply train on every known question). Each question is original (avoiding duplicates of existing test problems) and has a single correct answer. Most questions are open-ended with an exact-match answer expected (such as a specific number or phrase), while roughly one-quarter are multiple-choice. Notably, about 14% of the questions are multimodal, meaning they include both text and images that the model must interpret together.
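
To make that composition concrete, the short sketch below tallies the answer formats and the multimodal share in a loaded copy of the public set. The column names used here ("answer_type", "image") are assumptions about the dataset schema, not confirmed fields.

```python
# Sketch: tally question formats in the public HLE set.
# The column names "answer_type" and "image" are assumed, not confirmed.
from collections import Counter
from datasets import load_dataset

hle = load_dataset("cais/hle", split="test")
formats = Counter(row.get("answer_type", "unknown") for row in hle)
with_images = sum(1 for row in hle if row.get("image"))

print("Answer formats:", dict(formats))
print(f"Multimodal share: {with_images / len(hle):.1%}")
```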

Building such a benchmark required a rigorous vetting process. First, every candidate question had to demonstrate its difficulty by defeating a state-of-the-art language model during testing (around 70,000 questions cleared this bar). Next, expert peer reviewers examined and refined this collection, whittling it down to roughly 13,000 high-quality questions. From there, the organizers and domain experts manually selected the best 6,000 questions. Finally, this elite set was split into the 2,500 public questions and the approximately 500 private questions that make up Humanity’s Last Exam today.
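
The first of those stages, keeping only questions that defeat a frontier model, is simple to express in code. The sketch below is purely illustrative: ask_model is a hypothetical helper standing in for whatever API the organizers actually used, and grading is reduced to exact string matching, which is a simplification.

```python
# Illustrative sketch of HLE's first vetting stage: keep only candidate
# questions a frontier model answers incorrectly. "ask_model" is a
# hypothetical helper; exact-match grading is a deliberate simplification.
def stumps_the_model(question: str, reference_answer: str, ask_model) -> bool:
    prediction = ask_model(question)
    return prediction.strip().lower() != reference_answer.strip().lower()

def first_pass(candidates, ask_model):
    """Keep the (question, answer) pairs the model gets wrong."""
    return [(q, a) for q, a in candidates if stumps_the_model(q, a, ask_model)]
```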

Criticisms of HLE

No ambitious project is without its critics, and Humanity's Last Exam has faced some scrutiny since its release. Early results on HLE showed that even cutting-edge models struggled, often scoring quite poorly while paradoxically expressing high confidence in their (wrong) answers. This disconnect between confidence and competence highlighted the tendency of AI models to hallucinate, meaning they produce answers that sound convincing but are factually incorrect or unjustified.
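
That gap between confidence and competence is usually quantified as calibration error: the difference between how confident a model says it is and how often it is actually right. Below is a minimal sketch of a binned calibration-error calculation; it assumes you already have a confidence score and a correctness flag for each question.

```python
# Minimal expected-calibration-error sketch: compare stated confidence with
# empirical accuracy inside equal-width confidence bins. Assumes you already
# have per-question confidences in [0, 1] and correctness flags.
def expected_calibration_error(confidences, correct, n_bins=10):
    total, ece = len(confidences), 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        in_bin = [i for i, c in enumerate(confidences)
                  if lo < c <= hi or (b == 0 and c <= lo)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        accuracy = sum(correct[i] for i in in_bin) / len(in_bin)
        ece += (len(in_bin) / total) * abs(avg_conf - accuracy)
    return ece

# A model that claims ~90% confidence but answers mostly wrong is badly calibrated.
print(expected_calibration_error([0.90, 0.92, 0.88, 0.95], [1, 0, 0, 0]))
```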

Beyond these initial observations, independent research groups have analyzed the content of HLE itself. One notable critique came from Future House, a non-profit AI research lab, which published an analysis suggesting that roughly 30% of the answers for HLE’s chemistry and biology questions were likely incorrect or dubious. The root of the problem, they argued, was HLE’s question review process: the benchmark relied on question submitters to provide the answer key, and peer reviewers had only about five minutes to verify each answer’s accuracy. Under such time pressure, it is plausible that some overly complex or ambiguous questions slipped through with wrong answers or reasoning that contradicts established scientific knowledge.

The maintainers of HLE responded to these concerns by taking steps to improve the benchmark’s quality. As of September 2025, they convened panels of experts to re-evaluate the flagged questions and announced plans for a rolling review process to continuously audit and correct the question set. This ongoing refinement is meant to ensure that Humanity's Last Exam remains a trustworthy measure of AI reasoning, without being undermined by flawed or unclear questions.

The AI Benchmark Landscape

Humanity's Last Exam sits within a broader ecosystem of AI benchmarks, each tailored to evaluate different facets of intelligence. To put HLE in context, it’s helpful to survey some of the other major benchmarks driving AI research:

Knowledge and Reasoning Benchmarks

A core category of evaluations tests models on academic knowledge and reasoning in text form. For instance, the Massive Multitask Language Understanding (MMLU) benchmark challenges AI across 57 different subjects in a zero-shot setting. Top models have now achieved human-level scores on MMLU, prompting extensions like MMLU-Pro (and an advanced variant called MMLU-Pro+) that increase question complexity and emphasize higher-order reasoning skills. Another example is GPQA (Graduate-Level Google-Proof Q&A), a graduate-level STEM exam specifically designed to be "Google-proof," meaning its 448 multiple-choice questions cannot be answered by a quick internet search. Humanity's Last Exam belongs in this knowledge-and-reasoning family, but it distinguishes itself by using hand-curated, expert-level problems that prioritize reasoning over rote recall.
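
To make "zero-shot" concrete: the model sees a question and its answer options with no worked examples, and scoring is simply whether it picks the correct letter. The sketch below shows that MMLU/GPQA-style scoring loop; ask_model is again a hypothetical stand-in for whichever model client you use.

```python
# Sketch of a zero-shot multiple-choice scoring loop in the MMLU/GPQA style.
# "ask_model" is a hypothetical stand-in for an actual model client.
LETTERS = "ABCD"

def format_prompt(question: str, choices: list[str]) -> str:
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(choices))
    return f"{question}\n{options}\nAnswer with a single letter (A, B, C, or D)."

def score(items, ask_model) -> float:
    """items: list of (question, choices, correct_letter) tuples."""
    hits = 0
    for question, choices, correct_letter in items:
        reply = ask_model(format_prompt(question, choices))
        predicted = next((ch for ch in reply.upper() if ch in LETTERS), None)
        hits += predicted == correct_letter
    return hits / len(items)
```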

Multimodal Understanding Benchmarks

While HLE includes some multimodal questions, there are entire benchmarks devoted to testing an AI’s ability to reason across text and images together. The Massive Multi-Discipline Multimodal Understanding (MMMU) benchmark, for example, presents roughly 11,500 questions sourced from real exams, quizzes, and textbooks. Each question requires the model to interpret both written text and an accompanying image. An enhanced version, MMMU-Pro, ups the ante by removing any problems solvable by text alone, providing more challenging answer options, and even introducing a special mode where the entire prompt is encoded within an image (to ensure the model truly processes visual information). Such benchmarks probe the cutting edge of AI multimodal reasoning, pushing models to demonstrate understanding that spans different types of data.
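
The "prompt encoded within an image" idea is easier to grasp with a tiny illustration: the question text is rasterized into a screenshot-like picture, so a model that only reads the text channel sees nothing useful. The snippet below uses Pillow to sketch that idea; it is not MMMU-Pro's actual rendering pipeline.

```python
# Illustration of a "vision-only" prompt in the spirit of MMMU-Pro: the
# question is rendered into an image so the model must read it visually.
# This is not the benchmark's actual rendering pipeline, just the idea.
from PIL import Image, ImageDraw

def render_prompt_as_image(prompt: str, size=(800, 400)) -> Image.Image:
    canvas = Image.new("RGB", size, "white")
    ImageDraw.Draw(canvas).multiline_text((20, 20), prompt, fill="black")
    return canvas

render_prompt_as_image(
    "Q: Which element does the circuit symbol in the figure represent?\n"
    "A. Resistor   B. Capacitor   C. Inductor   D. Diode"
).save("vision_only_prompt.png")
```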

Software Engineering and Tool-Use Benchmarks

Other benchmarks focus on practical problem-solving skills, such as writing code or using external tools. SWE-Bench (Software Engineering Benchmark) is built from real-world GitHub issues drawn from a dozen popular Python repositories. In a typical SWE-Bench task, a model is given a description of a software bug or feature request along with the relevant codebase, and it must propose a code patch to resolve the issue. OpenAI introduced an improved variant called SWE-Bench Verified that fixes overly specific tests, clarifies vague issue descriptions, and ensures a stable evaluation environment. There is even a SWE-Bench Live edition, featuring over 1,300 tasks across 90+ repositories, which continuously updates with new challenges so that evaluation keeps pace with the evolving software landscape. These benchmarks test an AI’s ability to integrate understanding, reasoning, and action in the context of programming, skills quite different from the academic Q&A style of HLE.
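
Conceptually, the SWE-Bench evaluation loop is: check out the repository at the buggy commit, apply the model's proposed patch, and re-run the tests that reproduced the issue. The sketch below conveys that flow with plain git and pytest subprocess calls; the real harness is more involved and runs inside per-repository containerized environments.

```python
# Simplified sketch of a SWE-Bench-style check: apply a model-generated patch
# and re-run the tests that reproduce the issue. The real harness uses
# containerized, per-repository environments; this shows only the core idea.
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, failing_tests: list[str]) -> bool:
    # Apply the model's proposed patch to the checked-out repository.
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return False  # the patch does not even apply cleanly
    # Re-run the "fail-to-pass" tests that originally reproduced the issue.
    tests = subprocess.run(["python", "-m", "pytest", *failing_tests], cwd=repo_dir)
    return tests.returncode == 0
```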

Holistic Evaluation Frameworks

Some organizations have developed broad evaluation suites to examine multiple aspects of AI performance in tandem. Notably, the Center for Research on Foundation Models (CRFM) at Stanford created the Holistic Evaluation of Language Models (HELM) framework to support more responsible and comprehensive AI assessment. HELM defines a variety of standardized scenarios (such as open-ended question answering, summarization tasks, responding to sensitive or ethically charged prompts, and more) and evaluates models on each. Crucially, models are scored on multiple dimensions in HELM: not only accuracy, but also calibration (how well the model knows what it doesn’t know), robustness to perturbations, output quality, and even measures of bias or toxicity. Over time, HELM has expanded into a family of specialized tracks: for example, HELM Capabilities serves as a general-purpose leaderboard for language models; HELM Audio evaluates speech and audio-related tasks; HELM Finance focuses on financial question-answering; and MedHELM addresses medical-domain reasoning and safety. This holistic approach ensures that progress in AI is measured not just by intelligence, but also by reliability and safety, an ethos that aligns well with the motivations behind HLE.
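
The multi-dimensional scoring idea boils down to giving each model a small report card per scenario instead of a single accuracy number. The sketch below shows one hypothetical way to aggregate such a report card; the metric names mirror HELM's dimensions, but the weights are invented purely for illustration.

```python
# Hypothetical report-card aggregation in the spirit of HELM's multi-metric
# scoring. Metric names mirror HELM's dimensions; the weights are invented.
METRIC_WEIGHTS = {
    "accuracy": 0.4,
    "calibration": 0.2,        # does the model know what it doesn't know?
    "robustness": 0.2,         # stability under perturbed inputs
    "fairness": 0.1,
    "toxicity_avoidance": 0.1,
}

def report_card(scores_per_scenario: dict[str, dict[str, float]]) -> float:
    """scores_per_scenario: {scenario: {metric: score in [0, 1]}}."""
    per_scenario = [
        sum(METRIC_WEIGHTS[m] * metrics.get(m, 0.0) for m in METRIC_WEIGHTS)
        for metrics in scores_per_scenario.values()
    ]
    return sum(per_scenario) / len(per_scenario)
```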

Safety and Risk Evaluations

Apart from measuring sheer intellectual performance, the AI community also uses benchmarks to gauge potential risks and dangerous capabilities of advanced models. For instance, the nonprofit Model Evaluation and Threat Research (METR) initiative explicitly tests whether frontier AI models could pose threats, such as attempting cyberattacks, evading human oversight, or even autonomously improving themselves. The goal of METR and similar efforts is to catch warning signs of catastrophic risk early, before such capabilities grow uncontrolled. Similarly, Google DeepMind’s Frontier Safety Framework defines a set of critical capability levels (CCLs) that delineate when an AI system might become dangerous (for example, if it can self-replicate or strategize against humans). DeepMind's framework monitors cutting-edge models to see if they approach these thresholds, and it lays out mitigation plans to intervene if they do. These safety-focused evaluations complement benchmarks like HLE by making sure that as we push AI to become smarter, we also keep an eye on whether it’s becoming too powerful or unpredictable in hazardous ways.

Where Do Current AI Models Stand on Key Benchmarks?

Given this landscape of benchmarks, how are today’s top AI models performing? In recent years, a number of public leaderboards have sprung up to track the ever-shifting state of the art across various evaluation tests. Platforms such as the Scale AI leaderboard hub and LLM Stats aggregate results on everything from reasoning and coding challenges to image generation and speech recognition. Meanwhile, the Vellum AI leaderboard specifically highlights performance on complex reasoning and programming tasks, and other sites like ArtificialAnalysis.AI rank models on composite metrics like overall intelligence, speed, and cost. Even Hugging Face hosts an open leaderboard where the global community can submit results for different models on standardized tasks.

As of late 2025, a few models have staked out leading positions on these benchmarks. On Humanity’s Last Exam itself, the current champion is Gemini 3 Pro, Google’s flagship multimodal AI model, which outperforms all other evaluated systems on this expert-level test of reasoning. The same Gemini 3 Pro also leads most multimodal reasoning leaderboards, thanks to its ability to integrate text and vision into a unified understanding.

In the domain of coding and long-horizon planning tasks, the front-runner is Claude Sonnet 4.5, an advanced large language model developed by Anthropic with a special emphasis on careful reasoning. Claude Sonnet 4.5 not only produces excellent code, but also excels at so-called agentic evaluations that require multi-step decision-making; it currently ranks first on tests that simulate complex, extended scenarios for AI “agents.”

On the safety front, dedicated evaluations have their own leaders. For example, OpenAI has reported that its o3 reasoning model achieved the highest score on PropensityBench (a test that measures whether an AI will choose safe, benign actions over potentially harmful ones when given a dilemma). In another safety evaluation focused on national security risks, a model named gpt-oss-120b was noted as the top performer. Interestingly, Anthropic’s Claude Sonnet 4.5 also earned recognition in safety trials by showing exceptional resistance to adversarial prompts that attempted to induce the AI to lie or produce disinformation. These results underscore that no single AI model leads across every category, and Humanity’s Last Exam remains one of the most formidable challenges for gauging advanced reasoning.

How Is Humanity’s Last Exam Used?

So far, we have explored what HLE is, why it was developed, and how it compares to other evaluations. But how is this benchmark actually used in practice? Humanity’s Last Exam has quickly become a crucial tool for two groups in particular: AI research teams and policymakers.

For AI Research Teams

HLE offers developers and scientists a standardized way to probe the strengths and weaknesses of their latest models. By testing a model on a wide array of expert-level questions, a research team can pinpoint where the AI excels (for example, perhaps it’s very good at math) and where it struggles (maybe it falters on literary analysis or physics problems). The benchmark provides a clear measure of how far a model still lags behind human experts in various domains. These insights are not just academic; knowing a model’s weak spots helps guide the next stages of development, fine-tuning, or post-training. Engineers can focus their efforts on improving reasoning strategies or feeding the model additional domain-specific knowledge where needed. In short, HLE acts as a diagnostic exam for advanced AI, illuminating the path toward better performance.
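
In practice, that diagnostic use mostly means slicing results by subject. The sketch below assumes you have per-question results tagged with a subject label and a correctness flag (the field names here are illustrative), and prints the kind of breakdown a research team would act on.

```python
# Sketch: per-subject accuracy breakdown from HLE-style evaluation results.
# Assumes each result is a dict with "category" and "correct" keys (names are illustrative).
from collections import defaultdict

def accuracy_by_subject(results: list[dict]) -> dict[str, float]:
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        hits[r["category"]] += bool(r["correct"])
    return {subject: hits[subject] / totals[subject] for subject in totals}

breakdown = accuracy_by_subject([
    {"category": "Math", "correct": True},
    {"category": "Math", "correct": False},
    {"category": "Humanities", "correct": False},
])
for subject, acc in sorted(breakdown.items(), key=lambda kv: kv[1]):
    print(f"{subject:<12} {acc:.0%}")  # weakest subjects first
```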

For Policymakers

For governments, regulators, and others shaping AI policy, Humanity’s Last Exam serves as a valuable barometer of progress. It offers a publicly accessible, objective metric of how sophisticated AI’s reasoning has become. This is especially important as discussions around AI governance and safety intensify worldwide. Instead of relying on hype or vague assurances from AI labs, policymakers can look at HLE results as concrete evidence of what current models can and cannot do. If an AI system starts approaching human-level scores on HLE, that might signal that it’s time to enact certain safety regulations or oversight mechanisms. Conversely, if progress on HLE is slow, it suggests that truly human-like reasoning remains a distant goal — a reality that could inform how resources are allocated in research and education. In essence, HLE gives policymakers a common reference point (a kind of "scorecard" for AI reasoning) that can ground their decisions in real-world data rather than speculation.

Conclusion

As artificial intelligence continues its rapid advance, AI benchmarks have become the yardsticks by which we measure each new breakthrough. Early tests fulfilled their purpose but eventually became too easy, giving the false impression that AI was nearing human parity. The emergence of Humanity's Last Exam has helped reset that perspective. By crowdsourcing thousands of extraordinarily challenging questions from experts around the globe, HLE reintroduces a much-needed level of difficulty and rigor into AI evaluation. It shifts the focus from regurgitating facts to truly reasoning through problems.

While HLE is not the final word on machine intelligence, it provides an invaluable reality check on where we stand. The fact that no AI system has yet aced this ultimate exam underscores that there are still significant gaps between machine reasoning and human expert reasoning. At the same time, each incremental improvement that AI models make on HLE’s gauntlet of problems will illuminate how close we are to bridging that gap. In the coming years, Humanity’s Last Exam and benchmarks like it will remain critical for tracking genuine progress in AI — ensuring that as these systems grow more powerful, we also keep them honest.

Humanity's Last Exam FAQs

What is Humanity's Last Exam (HLE)?

Humanity's Last Exam (HLE) is a comprehensive benchmark test designed to evaluate an AI model’s ability to reason through expert-level academic questions across a wide range of subjects.

Why was HLE created?

Researchers developed HLE because earlier benchmarks had become too easy for the most advanced AI models. HLE raises the difficulty and complexity of the questions so that improvements in true reasoning ability can be measured more reliably.

What types of questions are included in HLE?

HLE contains thousands of graduate-level questions from diverse disciplines: mathematics, computer science, history, literature, biology, music theory, and more. Many questions require multi-step reasoning, and a portion of them include both text and images.

Who developed Humanity's Last Exam?

HLE was spearheaded by Dan Hendrycks of the Center for AI Safety, in collaboration with the team at Scale AI. They coordinated a broad effort in late 2024 to collect and curate challenging questions for the benchmark.

How do AI researchers use HLE?

AI researchers use HLE to benchmark and compare the performance of different models. By analyzing which questions a model gets right or wrong, they can identify the model’s strengths and weaknesses, track its progress over time, and focus on areas that need improvement relative to human-level expertise.