AI

Even Frontier LLMs From GPT-5 Onward Lose Up to 33% Accuracy When You Chat Too Long

THE DECODER • February 28, 2026

Why It Matters

The finding exposes a fundamental context‑management weakness that can undermine enterprise applications relying on extended, multi‑turn interactions, signaling a need for better conversation handling strategies.

Key Takeaways

  • Multi-turn prompts cut LLM accuracy by up to 33%.
  • GPT‑5 improves on its predecessors but still degrades versus single‑prompt use.
  • Python tasks suffer the smallest performance drop (10–20%).
  • Temperature adjustments fail to recover the lost accuracy.
  • Starting a fresh conversation seeded with a summary mitigates the degradation.

Pulse Analysis

The study by Laban et al. shines a light on a lingering blind spot in even the most advanced large language models: maintaining coherence across extended dialogues. By testing code generation, database queries, action planning, data‑to‑text, mathematics, and summarization, the researchers demonstrated that splitting a task into several messages consistently erodes performance. Although GPT‑5 and its peers trim the average drop from 39% to roughly 33%, the degradation remains significant enough to affect real‑world deployments where users naturally iterate and refine requests.

For businesses that embed LLMs into customer‑support bots, development assistants, or data‑analysis pipelines, this limitation translates into higher error rates, longer resolution times, and potentially costly rework. The impact is especially pronounced in non‑coding domains, where the models lose up to a third of their accuracy. Companies must therefore reconsider interaction designs, perhaps limiting the number of turns or employing guardrails that detect when the model’s confidence wanes. The research also suggests that simple parameter tweaks, such as lowering temperature, are insufficient, underscoring the need for more sophisticated context‑management techniques.

Practitioners can mitigate the issue by adopting a “summarize‑and‑restart” workflow: before a conversation derails, the model generates a concise summary of all prior inputs, which then seeds a fresh session. This approach preserves the accumulated intent while resetting the model’s internal state, reducing drift. Looking ahead, developers and researchers are likely to explore hierarchical prompting, external memory stores, and fine‑tuned adapters designed specifically for multi‑turn fidelity. Addressing this challenge will be crucial for unlocking the full productivity promise of next‑generation LLMs in enterprise settings.
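The summarize‑and‑restart workflow described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' implementation: the `chat` function is a stand‑in for any LLM chat‑completion call (here it is stubbed out), and the `MAX_TURNS` threshold is an assumed tuning parameter, not a value from the study.

```python
MAX_TURNS = 6  # assumed threshold before multi-turn drift becomes a risk


def chat(messages):
    """Placeholder for a real LLM API call returning the assistant's reply.

    A production version would send `messages` (a list of role/content
    dicts) to an actual chat-completion endpoint.
    """
    return "stub reply"


def summarize_and_restart(history):
    """Condense the conversation so far and seed a fresh session with it."""
    summary_request = history + [{
        "role": "user",
        "content": "Summarize all requirements and constraints stated "
                   "so far in one concise paragraph.",
    }]
    summary = chat(summary_request)
    # Fresh session: only the distilled intent carries over, resetting
    # whatever accumulated context was causing the model to drift.
    return [{"role": "user", "content": f"Context summary: {summary}"}]


def send(history, user_message):
    """Append a user turn, restarting with a summary if the chat is long."""
    if len(history) >= MAX_TURNS:
        history = summarize_and_restart(history)
    history.append({"role": "user", "content": user_message})
    history.append({"role": "assistant", "content": chat(history)})
    return history
```

The key design choice is that the summary is generated by the model itself before the restart, so the fresh session inherits the accumulated intent without the long message history that drives the accuracy loss.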
