
The result demonstrates AI’s growing capacity for sustained reasoning on extended tasks, opening possibilities for complex, multi-step applications. Yet the benchmark’s small task sample cautions against over-interpreting it.
The METR time-horizon benchmark shines a spotlight on a new dimension of AI performance: sustained reasoning over long intervals. By measuring the length of task a model can complete at a 50 percent success rate, the metric moves beyond traditional token-limit or accuracy scores. Claude Opus 4.5’s 4-hour-49-minute horizon eclipses those of prior models, suggesting that the architecture can maintain context and logical coherence far longer than before. This achievement aligns with Anthropic’s reported 196-day doubling time, a pace at which measured horizons double roughly every six and a half months, underscoring rapid iteration cycles in large-scale model development.
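To make the metric concrete, here is a minimal sketch of how a 50 percent time horizon can be estimated: fit a logistic curve to pass/fail outcomes against log task length, then solve for the length at which predicted success crosses 50 percent. The task lengths, outcomes, and fitting choices below are illustrative assumptions, not METR’s actual data or code.

```python
# Minimal sketch: estimate a 50%-success time horizon from pass/fail results.
# The task lengths and outcomes below are invented for illustration; METR's
# real task suite and fitting procedure differ in detail.
import numpy as np
from sklearn.linear_model import LogisticRegression

task_minutes = np.array([2, 5, 15, 30, 60, 120, 240, 480, 960])
succeeded    = np.array([1, 1,  1,  1,  1,   1,   0,   1,   0])

# Fit P(success) as a logistic function of log task length.
X = np.log(task_minutes).reshape(-1, 1)
model = LogisticRegression(C=1e6, max_iter=1000).fit(X, succeeded)

# The 50% point is where the linear predictor crosses zero:
#   coef * log(h50) + intercept = 0  =>  h50 = exp(-intercept / coef)
h50 = np.exp(-model.intercept_[0] / model.coef_[0, 0])
print(f"Estimated 50% time horizon: {h50:.0f} minutes")
```

On this reading, a 196-day doubling time simply says the estimated horizon doubles roughly twice a year.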
However, the headline numbers must be tempered by the benchmark’s methodological constraints. The METR evaluation relied on just 14 tasks, a sample too small to guarantee statistical robustness. Critics such as Shashwat Goel have highlighted potential gaming strategies that could inflate the apparent horizon. Moreover, the theoretical ceiling of over 20 hours likely reflects noise rather than genuine capability, since extrapolating such extreme performance from so few data points is unreliable. These factors expose a broader challenge: the AI community needs richer, more diverse evaluation suites to gauge long-form reasoning accurately.
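A rough bootstrap illustrates the sample-size concern: resampling a hypothetical 14-task suite and refitting the horizon each time yields a very wide interval, exactly the kind of instability that makes a 20-plus-hour ceiling look like noise. Every number below is invented for illustration.

```python
# Illustrative bootstrap over a synthetic 14-task sample, showing how unstable
# a 50% horizon estimate is at this sample size. All numbers are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 90, 120, 180, 240, 360, 480, 960])
passed       = np.array([1, 1, 1, 1,  1,  1,  1,  0,   1,   1,   0,   1,   0,   0])

def fit_h50(minutes, outcomes):
    """50% horizon from a near-unregularized logistic fit on log task length."""
    m = LogisticRegression(C=1e6, max_iter=1000)
    m.fit(np.log(minutes).reshape(-1, 1), outcomes)
    return float(np.exp(-m.intercept_[0] / m.coef_[0, 0]))

estimates = []
for _ in range(2000):
    idx = rng.integers(0, len(task_minutes), size=len(task_minutes))
    if passed[idx].min() == passed[idx].max():
        continue  # skip degenerate all-pass or all-fail resamples
    estimates.append(fit_h50(task_minutes[idx], passed[idx]))

lo, hi = np.percentile(estimates, [2.5, 97.5])
print(f"95% bootstrap interval for the horizon: {lo:.0f} to {hi:.0f} minutes")
```

With only 14 observations per resample, the interval spans orders of magnitude, which is why point estimates far beyond the observed task lengths deserve skepticism.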
If the results hold up under stricter scrutiny, the implications for enterprise and research are significant. Extended time horizons enable AI to tackle multi‑step workflows—such as legal document analysis, long‑form content creation, and complex simulation planning—without frequent context resets. Companies could integrate such models into pipelines that demand sustained attention, reducing hand‑off friction between AI and human operators. For Anthropic, the milestone reinforces its competitive stance against rivals like OpenAI and Google, while also prompting a push for more transparent, large‑scale benchmarks that can validate true long‑duration intelligence.