Claude Opus 4.8 Is Too Smart… and TOO HONEST
Why It Matters
Opus 4.8 pushes multi-agent, long-running automation into practical engineering workflows, potentially compressing weeks of human labor into days and reshaping software development economics and productivity metrics. If broadly adopted, these capabilities could materially shift labor demand and market benchmarks for AI-driven engineering.
Summary
Anthropic has released Claude Opus 4.8, adding new effort tiers (including an 'Ultra Code' mode) and expanded dynamic workflows that let agents run longer, spawn hundreds of parallel subagents, verify outputs, and tackle codebase-scale projects. The release included a demo of an autonomous simulated economy and cites real-world uses, most notably a 750,000-line Rust port of Bun completed in 11 days with hundreds of agents and automated test-driven loops. Benchmark results are mixed: Opus 4.8 scores strongly on several agentic coding tests but fares poorly on some alignment/business benchmarks (e.g., Vending Bench), while Anthropic emphasizes that the model is measurably more honest. The company also teased lower-cost Opus-class variants and an even larger model family, Mythos, slated to arrive in the coming weeks.