Lost in Stories: Consistency Bugs in Long Story Generation by LLMs

AI Paper of the Day
Mar 10, 2026

Key Takeaways

  • New benchmark contains 2,000 diverse storytelling prompts
  • CONSTORY‑CHECKER automates detection of narrative contradictions
  • Errors cluster in factual details and timeline logic
  • Consistency errors grow linearly with story length
  • Open‑source models approach proprietary performance on consistency

Summary

The paper introduces CONSTORY‑CHECKER, an automated pipeline, and ConStory‑Bench, a 2,000‑prompt benchmark, to evaluate narrative consistency in long‑form story generation by LLMs. The four‑stage system extracts suspect spans, pairs conflicting statements, generates evidence chains, and produces anchored reports. Evaluation across proprietary and open‑source models shows GPT‑5‑Reasoning, Gemini‑2.5‑Pro, and Claude‑Sonnet‑4.5 lead in consistency, yet most models still produce many errors, especially factual and timeline contradictions that increase linearly with story length.

Pulse Analysis

The latest generation of large language models now supports context windows that stretch into tens of thousands of tokens, enabling the creation of full‑length novels, game scripts, and marketing copy without manual stitching. Yet as the narrative horizon expands, models frequently lose track of earlier facts, character arcs, and world‑building rules, producing contradictions that erode reader trust. Traditional evaluation metrics—perplexity, BLEU, or plot coherence scores—focus on fluency and overall story arc, leaving a blind spot for fine‑grained consistency. Addressing this gap is critical for commercial deployments that demand reliable, immersive storytelling.

To fill that blind spot, the authors present ConStory‑Bench, a benchmark of 2,000 prompts covering story‑from‑scratch, continuation, outline‑expansion, and bounded‑completion tasks. The accompanying CONSTORY‑CHECKER pipeline follows a four‑stage process: it scans text with category‑specific guidelines, pairs suspicious spans, constructs an evidence chain where a judge model explains the conflict with exact quotes, and outputs a standardized report anchored to precise locations. A taxonomy of five dimensions—timeline logic, characterization, world‑building, factual consistency, and narrative style—breaks errors into 19 subtypes, giving researchers granular insight into where models falter.
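The four-stage flow can be sketched in miniature. The sketch below is an illustrative assumption, not the authors' implementation: the keyword-based span extraction and the string-comparison "judge" are toy stand-ins for the guideline-driven scanning and judge-model reasoning the paper describes, and all names (`Span`, `Report`, `build_reports`) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Span:
    start: int      # character offset of the suspect statement
    text: str       # the statement itself
    category: str   # e.g. "timeline", "factual"

@dataclass
class Report:
    category: str
    quotes: tuple       # exact quotes anchoring the contradiction
    anchors: tuple      # character offsets for each quote
    explanation: str

def extract_spans(story, guidelines):
    """Stage 1: scan sentences against category-specific guidelines
    (here, a toy keyword match per category)."""
    spans, offset = [], 0
    for sentence in story.split(". "):
        for category, keywords in guidelines.items():
            if any(k in sentence.lower() for k in keywords):
                spans.append(Span(offset, sentence, category))
        offset += len(sentence) + 2  # account for the ". " delimiter
    return spans

def pair_spans(spans):
    """Stage 2: pair suspicious spans within the same category
    as contradiction candidates."""
    return [(a, b) for i, a in enumerate(spans)
            for b in spans[i + 1:] if a.category == b.category]

def judge(pair):
    """Stage 3: stand-in for the judge model that explains a conflict
    with exact quotes. A real system would prompt an LLM; this toy
    version just flags pairs whose statements differ."""
    a, b = pair
    return f"'{a.text}' conflicts with '{b.text}'" if a.text != b.text else None

def build_reports(story, guidelines):
    """Stage 4: emit standardized reports anchored to precise locations."""
    reports = []
    for a, b in pair_spans(extract_spans(story, guidelines)):
        explanation = judge((a, b))
        if explanation:
            reports.append(Report(a.category, (a.text, b.text),
                                  (a.start, b.start), explanation))
    return reports

guidelines = {"timeline": ["monday", "tuesday", "years later"]}
story = "Mira left on Monday. Two years later she recalled leaving on Tuesday."
for r in build_reports(story, guidelines):
    print(r.category, r.anchors)
```

Anchoring each report to character offsets, as in the paper's "standardized report anchored to precise locations", is what lets downstream tooling highlight both halves of a contradiction in context rather than merely scoring the story.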

Results reveal that while top‑tier models such as GPT‑5‑Reasoning, Gemini‑2.5‑Pro, and Claude‑Sonnet‑4.5 achieve the highest consistency scores, the majority of systems still generate a substantial number of contradictions. Factual details and timeline reasoning emerge as the most error‑prone areas, and the frequency of inconsistencies scales linearly with story length. Open‑source contenders like GLM‑4.6 and Qwen3‑32B close the gap, suggesting that robust consistency evaluation can accelerate parity across the ecosystem. The benchmark and automated checker provide a reproducible foundation for future research aimed at tightening narrative coherence, a prerequisite for trustworthy AI‑driven content creation.
