Microsoft Researchers Find AI Models and Agents Can't Handle Long-Running Tasks

The Register (Networks), May 11, 2026

Why It Matters

The findings expose a critical reliability gap for AI‑driven automation, warning enterprises that unchecked delegation can damage essential data and erode trust in AI agents. Companies must temper automation ambitions with rigorous long‑horizon testing and human oversight.

Key Takeaways

  • Frontier LLMs lose ~25% of document content after 20 delegated steps
  • Only Python programming met the 98% readiness threshold across 52 domains
  • Agentic tool use added ~6% more degradation versus plain LLMs
  • 80% of model/domain combos suffered catastrophic document corruption
  • Gemini 3.1 Pro succeeded in just 11 of 52 domains

Pulse Analysis

The Microsoft Research team introduced DELEGATE‑52, a benchmark that simulates complex, multi‑step tasks across 52 professional fields, from coding to crystallography. By forcing models to iteratively edit and merge documents, the study captures a long‑horizon failure mode that short‑term tests miss. Results show that frontier models such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT‑5.4 lose roughly a quarter of document content after twenty interactions, while degradation across all tested models averages 50 percent. Only the Python programming domain cleared the 98 percent readiness bar, which marks out a narrow niche where current AI agents are dependable.
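To put the headline figure in per-step terms, the short sketch below works through the arithmetic. It assumes, purely for illustration, that content is lost at a constant rate on every interaction; the study does not claim this, and the function names are hypothetical.

```python
# Illustrative arithmetic only: assumes a constant (geometric) per-step loss,
# which is a simplifying assumption and not a finding of the study.

def per_step_retention(total_retention: float, steps: int) -> float:
    """Return r such that r ** steps == total_retention."""
    if not 0.0 < total_retention <= 1.0:
        raise ValueError("total_retention must be in (0, 1]")
    if steps <= 0:
        raise ValueError("steps must be positive")
    return total_retention ** (1.0 / steps)


def retention_after(per_step: float, steps: int) -> float:
    """Fraction of content left after `steps` interactions."""
    return per_step ** steps


# Roughly a quarter lost after 20 interactions means 75% retained overall.
r = per_step_retention(0.75, 20)
print(f"per-step retention: {r:.4f}")
print(f"retained after 50 steps: {retention_after(r, 50):.2%}")
```

Under that assumption, a loss of only about 1.4 percent per interaction is enough to reach the 25 percent figure, which may help explain why short tests miss the problem.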

For businesses betting on AI‑powered workflow automation, the implications are stark. The research demonstrates that granting LLMs file‑system access or code‑execution tools does not mitigate errors; in fact, agentic setups added an extra six percent degradation. In real‑world settings, a single corrupted document could trigger compliance breaches, financial loss, or reputational damage, especially when AI agents operate with minimal human supervision. Enterprises allocating up to 36 percent of digital budgets to AI automation must therefore embed rigorous validation layers, continuous monitoring, and fallback mechanisms to safeguard critical assets.
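One simple form such a validation layer could take is a deterministic check that rejects an agent's edit when too much of the original document has disappeared. The sketch below is a minimal illustration using Python's standard `difflib`; the function names and the 90 percent threshold are assumptions made for this example, not part of the study or of any product.

```python
from difflib import SequenceMatcher


def retained_fraction(original: str, edited: str) -> float:
    """Share of the original's lines that survive, in order, in the edit."""
    orig_lines = original.splitlines()
    if not orig_lines:
        return 1.0
    matcher = SequenceMatcher(a=orig_lines, b=edited.splitlines(), autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / len(orig_lines)


def guarded_apply(original: str, edited: str, min_retained: float = 0.9):
    """Return (document, accepted); fall back to the original on heavy loss.

    min_retained is a hypothetical threshold and would need tuning per task.
    """
    if retained_fraction(original, edited) >= min_retained:
        return edited, True
    return original, False


doc = "alpha\nbeta\ngamma\ndelta"
good_doc, good_ok = guarded_apply(doc, doc + "\nepsilon")  # content preserved
bad_doc, bad_ok = guarded_apply(doc, "alpha\nsomething else")  # most lines lost
print(good_ok, bad_ok)
```

A line-level check like this cannot tell a legitimate rewrite from corruption, so in practice a rejected edit would more plausibly be routed to a human reviewer than silently discarded.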

Despite the sobering results, the trajectory of model improvement remains promising. OpenAI’s GPT series has lifted benchmark scores from under 15 percent to over 70 percent within 16 months, suggesting that robustness can be engineered with better training data, alignment techniques, and longer context windows. Future research should focus on hybrid approaches that combine deterministic rule‑based checks with probabilistic LLM reasoning, as well as domain‑specific fine‑tuning to reduce hallucinations. Until such safeguards mature, organizations should limit AI delegation to well‑understood tasks like code generation, while keeping human oversight at the core of any broader automation strategy.
