New Math Benchmark Reveals AI Models Confidently Solve Problems that Have No Solution

THE DECODER
May 17, 2026

Why It Matters

SOOHAK exposes AI’s inability to recognize unsolvable math, a critical safety and reliability shortfall as models are deployed for advanced scientific work. The benchmark also clarifies where scaling and compute fail to improve model caution, guiding future research priorities.

Key Takeaways

  • Gemini 3 Pro tops challenge set at 30% accuracy
  • No model exceeds 50% on unsolvable‑task detection
  • Open‑weight models lag behind closed‑weight on research‑level math
  • Olympiad‑trained humans outperform PhDs on timed benchmark
  • Dataset stays private until 2026, restricting public testing

Pulse Analysis

The AI community has long relied on competition‑derived datasets that favor well‑trodden problem types. SOOHAK breaks that mold by commissioning 439 original questions, split between a graduate‑level "Challenge" suite and a "Refusal" suite of intentionally contradictory tasks. By prohibiting AI assistance during authoring, the creators ensure a clean test bed that probes both raw problem‑solving and the more subtle skill of recognizing when a question is ill‑posed. This dual focus mirrors real‑world research where a model must know its limits before proposing results.

Early results are sobering. Even the most advanced closed‑weight systems, like Gemini 3 Pro, solve fewer than a third of the research‑level problems, and open‑weight models such as Qwen‑3 and Kimi‑2.5 linger below 15% accuracy. Scaling up model size or extending reasoning time improves solution rates but does little for refusal performance, where the best model, GLM‑5, reaches only 48%. The disparity underscores a fundamental gap: more compute fuels answer generation, yet it does not teach models to flag ambiguous or contradictory premises, a capability essential for trustworthy scientific assistance.

Human benchmarks further illuminate the challenge. In a timed test of 79 mixed tasks, Olympiad‑trained participants outperformed PhD researchers, and only Gemini 3 Pro eclipsed the combined human score. This suggests that competitive problem‑solving heuristics, honed under pressure, translate better to SOOHAK than deep research expertise. With the full dataset locked until 2026, the AI field must work with limited access in the meantime, prompting calls for richer evaluation formats: proof‑assistant integration, multi‑step reasoning, and expert review. As AI edges closer to research‑level contributions, benchmarks like SOOHAK will be pivotal in steering development toward both capability and caution.
