
The article examines why gauging AI progress is becoming more difficult, focusing on METR’s task‑length benchmark and its recent Claude Opus 4.6 results. While the headline chart suggests accelerating capabilities, the confidence interval around that result (5‑66 hours) spans more than an order of magnitude, revealing substantial measurement noise. The piece also traces the lifecycle of classic benchmarks like MMLU, which have saturated, prompting the creation of harder tests such as Humanity’s Last Exam. Finally, it highlights practical and conceptual obstacles to extending benchmarks to multi‑day tasks, including steep evaluation costs and diminishing relevance to real‑world performance.
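One way to see why such an interval can be so wide: a time‑horizon estimate comes from fitting a success curve over a finite task suite, so both which tasks are in the suite and the outcome of each run inject noise. The sketch below is illustrative only and is not METR’s actual pipeline; the task count (170), the 18‑hour “true” horizon, the logistic slope, and the single‑run‑per‑task setup are all invented for demonstration.

```python
# Illustrative sketch, not METR's methodology: bootstrap the fitted
# 50%-success time horizon from a small synthetic task suite to see
# how wide the resulting confidence interval can be. All numbers here
# (170 tasks, 18 h "true" horizon, slope, one run per task) are assumed.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical suite: task lengths log-spaced from 1 minute to 100 hours.
task_hours = np.logspace(np.log2(1 / 60), np.log2(100), 170, base=2)

# Assumed ground truth: success probability decays logistically in
# log2(task length), crossing 50% at an 18-hour horizon.
TRUE_HORIZON, SLOPE = 18.0, 0.6
p_true = 1 / (1 + np.exp(SLOPE * (np.log2(task_hours) - np.log2(TRUE_HORIZON))))
outcomes = rng.random(task_hours.size) < p_true  # one binary run per task

def fit_horizon(hours, success, slope=SLOPE):
    """Grid-search maximum-likelihood fit of the 50%-success horizon."""
    x = np.log2(hours)
    grid = np.logspace(np.log2(0.5), np.log2(200), 400, base=2)
    best_h, best_ll = grid[0], -np.inf
    for h in grid:
        p = np.clip(1 / (1 + np.exp(slope * (x - np.log2(h)))), 1e-9, 1 - 1e-9)
        ll = np.where(success, np.log(p), np.log(1 - p)).sum()
        if ll > best_ll:
            best_h, best_ll = h, ll
    return best_h

# Bootstrap over tasks: resample the suite and refit 500 times.
n = task_hours.size
boots = [
    fit_horizon(task_hours[idx], outcomes[idx])
    for idx in (rng.integers(0, n, n) for _ in range(500))
]
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"point estimate ~ {fit_horizon(task_hours, outcomes):.1f} h, "
      f"bootstrap CI ~ [{lo:.1f}, {hi:.1f}] h")
```

Even in this toy setup the resampled estimates scatter across a severalfold range, and the spread widens sharply when long‑duration tasks are scarce or individual runs are noisy, which is the basic dynamic behind an interval as wide as 5‑66 hours.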

Last fall, analysts warned of an AI bubble as firms like OpenAI and Anthropic projected revenue doubling or tripling within a year. Contrary to those fears, Anthropic’s annualized revenue surged to $19 billion, far exceeding its 2026 target and the industry’s...