AI Pulse
AI

Acing This New AI Exam — Which Its Creators Say Is the Toughest in the World — Might Point to the First Signs of AGI

Live Science AI • February 27, 2026

Why It Matters

Approaching human‑level accuracy on such a rigorous exam would signal a major leap in the breadth of AI reasoning, with implications for investment, regulation, and competitive dynamics in the AI race. However, the gap between high test scores and true AGI remains a critical caution for policymakers and industry leaders.

Key Takeaways

  • Humanity’s Last Exam contains 2,500 PhD‑level questions
  • Gemini 3 achieved 48.4%, the highest score to date
  • Human experts score ~90% on the same exam
  • Researchers predict >50% is possible by end‑2025
  • High HLE scores don’t equal AGI

Pulse Analysis

The AI community has long wrestled with reliable metrics that capture a model’s true reasoning ability. Traditional suites like MMLU or ARC‑AGI focus on narrow domains or rely heavily on memorization, leaving a blind spot for broad, interdisciplinary expertise. Humanity’s Last Exam was engineered to close that gap, curating 2,500 questions vetted by over a thousand subject‑matter experts from 500 institutions. By insisting on unambiguous, non‑searchable prompts, the benchmark forces models to generate answers from internalized knowledge rather than web retrieval, offering a more stringent test of generalization.

Early results underscore both progress and limits. Google’s Gemini 3 Deep Think posted a 48.4% accuracy, a dramatic jump from the 8.3% achieved by OpenAI’s o1, yet still far below the roughly 90% human baseline. The steep performance gradient highlights how current large language models excel in pattern recognition but falter on deep, cross‑disciplinary problem solving. Researchers behind the exam argue that a 50% crossing point could be realistic by late 2025, given the accelerating pace of model scaling and training techniques. Such a milestone would likely reshape market expectations, prompting enterprises to reconsider AI‑driven decision‑making in fields like scientific research, finance, and legal analysis.
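To make the performance gradient concrete, the reported scores can be normalized against the roughly 90% human‑expert baseline. This is a minimal back‑of‑envelope sketch, not part of the benchmark's methodology; the model names and percentages are taken from the article, while the normalization is our own illustration:

```python
# Reported Humanity's Last Exam (HLE) accuracies, per the article,
# expressed as a fraction of the ~90% human-expert baseline.
human_baseline = 90.0  # approximate human-expert accuracy on HLE, in percent
scores = {"OpenAI o1": 8.3, "Gemini 3 Deep Think": 48.4}  # figures from the article

for model, acc in scores.items():
    frac = acc / human_baseline
    print(f"{model}: {acc:.1f}% absolute, {frac:.0%} of the human baseline")
```

Even after the jump to 48.4%, Gemini 3 Deep Think sits at only a little over half the human‑expert level, which is why the authors frame a 50% crossing as a milestone rather than an endpoint.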

Nevertheless, the authors caution that high scores on Humanity’s Last Exam are a necessary but insufficient condition for artificial general intelligence. True AGI would require autonomous hypothesis generation, experimental design, and self‑directed learning—capabilities that remain absent from even the most capable systems today. For investors, regulators, and technologists, the exam serves as a valuable barometer of incremental advances while reminding the industry that the journey from expert‑level performance to genuine, self‑aware intelligence is still a long and uncertain road.

Read Original Article