New Benchmark Shows Claude Mythos and GPT-5.5 Can Develop Real Browser Exploits Autonomously

THE DECODER, May 16, 2026

Why It Matters

The results show that cutting‑edge LLMs can act as competent browser security researchers, raising both the promise of automated vulnerability discovery and the risk of AI‑driven exploit development.

Key Takeaways

  • Claude Mythos scores 9.90/16, reaches top tier on 21 of 41 bugs
  • GPT‑5.5 scores 5.51/16, hits top tier on only two vulnerabilities
  • Mythos test run cost $36,428; GPT‑5.5 run cost $3,075, roughly 12× less
  • Mythos autonomously reproduces CVE‑2024‑0519 after a year of failed human attempts
  • Benchmark measures five exploit tiers up to arbitrary code execution on V8

Pulse Analysis

The launch of ExploitBench marks a watershed moment for AI‑driven security research. By moving beyond simple bug‑trigger checks to a tiered scoring system that culminates in arbitrary code execution, the benchmark mirrors the real‑world workflow of a browser exploit researcher. V8, the JavaScript engine behind Chrome, Edge, Node.js and Cloudflare Workers, is a high‑value target; demonstrating that language models can navigate its complexities signals a new frontier where AI assists or even supersedes human analysts in vulnerability assessment.

Anthropic’s Claude Mythos Preview outperformed OpenAI’s GPT‑5.5 by a wide margin, scoring 9.90 out of 16 against 5.51 and reaching the top exploit tier on 21 of the 41 tested vulnerabilities, compared with two for GPT‑5.5. However, the performance gap comes with a steep price tag: the Mythos run cost roughly $36,428, while the GPT‑5.5 run cost about $3,075. This cost differential suggests that OpenAI could narrow the gap by allocating more resources, but it also raises questions about the economic viability of scaling such high‑cost AI security tools for routine use. The benchmark’s autonomous mode further underscores Mythos’s consistency: its score dropped only marginally, whereas GPT‑5.5’s capability deteriorated sharply when human nudges were removed.
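To make the cost comparison concrete, the short Python sketch below recomputes the cost ratio from the reported figures. It also derives a "cost per score point" value. That metric is an illustrative assumption, not something the benchmark itself reports, and the names `RESULTS`, `cost_ratio`, and `cost_per_point` are hypothetical.

```python
# Scores and costs as reported in the article.
# "Cost per score point" is an illustrative metric assumed here,
# not one defined by the benchmark.
RESULTS = {
    "Claude Mythos Preview": {"score": 9.90, "cost_usd": 36_428},
    "GPT-5.5": {"score": 5.51, "cost_usd": 3_075},
}


def cost_ratio(expensive: str, cheap: str) -> float:
    """How many times more the first model's run cost than the second's."""
    return RESULTS[expensive]["cost_usd"] / RESULTS[cheap]["cost_usd"]


def cost_per_point(model: str) -> float:
    """Total run cost divided by the model's score (out of 16)."""
    entry = RESULTS[model]
    return entry["cost_usd"] / entry["score"]


if __name__ == "__main__":
    ratio = cost_ratio("Claude Mythos Preview", "GPT-5.5")
    print(f"Cost ratio: {ratio:.2f}x")
    for name in RESULTS:
        print(f"{name}: ${cost_per_point(name):,.2f} per score point")
```

On these numbers the Mythos run cost just under 12 times as much in total. Measured per score point, the gap is smaller (roughly 6.6 times), because Mythos also scored higher.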

Beyond raw performance, the findings have profound implications for AI safety and policy. The ability of LLMs to autonomously craft functional exploits challenges existing threat models and compels regulators, developers, and security teams to rethink defensive strategies. The current dataset includes known bugs, which may give models an advantage, but the inclusion of undisclosed vulnerabilities hints at future benchmarks that could evaluate truly novel exploit generation. As AI continues to blur the line between researcher and adversary, transparent benchmarking and responsible disclosure will be essential to harness the technology’s benefits without amplifying cyber‑risk.
