
AI Scores a ‘C-’ on Its Hardest Math Test Yet
Why It Matters
The benchmark shows AI’s emerging utility for complex proofs, but it also highlights high computational costs and unresolved citation ethics, both of which could shape research funding and AI development priorities.
Key Takeaways
- ChatGPT‑5.5 Pro solved four to five of the ten benchmark problems.
- ETH‑Aarhus IMProofBench, which uses a council of LLMs, solved six to seven.
- Multi‑model scaffolding can cost nearly $1,000 in token fees per problem set.
- AI solutions often lack proper citations, raising plagiarism concerns.
- Public benchmarking pushes transparency but limits participation to a few models.
Pulse Analysis
The First Proof initiative marks a turning point in how the research community evaluates artificial intelligence’s capacity for high‑level mathematics. By assembling a panel of top mathematicians and publishing a transparent test set, the project moves beyond proprietary demos and offers a reproducible yardstick for large language models. The inclusion of only publicly accessible models—OpenAI’s ChatGPT‑5.5 Pro and three university‑backed systems—ensures that the results reflect tools that any researcher can actually use, making the findings immediately relevant to academia and industry alike.
Technical analysis of the results shows a nuanced picture. ChatGPT‑5.5 Pro correctly answered four to five of the ten problems, while the ETH‑Aarhus IMProofBench framework, which consults a "council" of LLMs such as Claude and Gemini, achieved the highest score of six to seven. This "scaffolding" approach dramatically improves accuracy but comes at a steep price: some runs accumulated nearly $1,000 in token‑usage fees for a single problem set. The models also displayed classic AI shortcomings—hallucinated citations, incomplete proofs, and occasional outright refusals—underscoring that raw computational stamina does not yet replace rigorous mathematical reasoning.
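To make the "council" idea and its cost trade-off concrete, here is a minimal Python sketch. It is not the IMProofBench implementation, whose aggregation method is not described here. It assumes each council member is a hypothetical callable that returns an answer and a token count, that the council keeps the most common answer, and that tokens are billed at a single flat rate (`usd_per_1k_tokens`, an illustrative parameter). Real proofs cannot be compared by simple string equality, so the voting step is a deliberate simplification.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical reply type: (answer text, tokens consumed by the call).
Reply = Tuple[str, int]


@dataclass
class CouncilResult:
    answer: str        # answer given by the most council members
    votes: int         # how many members gave that answer
    total_tokens: int  # tokens summed over all member calls
    cost_usd: float    # total_tokens priced at the assumed flat rate


def ask_council(
    problem: str,
    members: List[Callable[[str], Reply]],
    usd_per_1k_tokens: float,
) -> CouncilResult:
    """Query every member once and keep the most common answer."""
    if not members:
        raise ValueError("the council needs at least one member")
    replies = [member(problem) for member in members]
    tally = Counter(answer for answer, _ in replies)
    answer, votes = tally.most_common(1)[0]
    total_tokens = sum(tokens for _, tokens in replies)
    cost = total_tokens / 1000 * usd_per_1k_tokens
    return CouncilResult(answer, votes, total_tokens, cost)


# Stand-in members (assumptions); a real run would call model APIs here.
def member_a(problem: str) -> Reply:
    return ("proof sketch A", 1200)


def member_b(problem: str) -> Reply:
    return ("proof sketch A", 1500)


def member_c(problem: str) -> Reply:
    return ("proof sketch B", 900)


result = ask_council(
    "toy problem",
    [member_a, member_b, member_c],
    usd_per_1k_tokens=0.01,  # illustrative rate, not a real price
)
print(result)
```

Every added member multiplies the number of calls per problem, which is why the token bill for a council grows much faster than for a single model.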
The broader implications are twofold. First, funding agencies may need to allocate substantial budgets for AI‑driven research, treating token costs as a line item comparable to laboratory consumables. Second, the ethical dimension cannot be ignored; missing citations and potential plagiarism raise questions about academic integrity when AI contributes to scholarly work. As the First Proof team plans additional rounds and opens the challenge to a wider array of models, the community will gain clearer insight into how AI can augment, rather than replace, human ingenuity in mathematics.