AI News and Headlines

AI · Legal

AIs Can Generate Near-Verbatim Copies of Novels From Training Data

Ars Technica AI • February 23, 2026

Why It Matters

The ability to extract copyrighted content demonstrates heightened legal risk for AI firms, potentially reshaping training practices and prompting stricter regulatory oversight.

Key Takeaways

  • Top LLMs reproduce up to 77% of a novel's text
  • Studies expose memorization beyond prior industry claims
  • Potential copyright liability threatens AI business models
  • Safeguards may not prevent determined data extraction
  • Privacy risks extend to healthcare and education data

Pulse Analysis

The latest academic investigations into large language models have moved the debate from theoretical speculation to concrete evidence. By prompting models to complete passages from well-known books, researchers coaxed Gemini 2.5 into reproducing 76.8% of *Harry Potter and the Philosopher's Stone*, and Claude 3.7 into outputting almost an entire novel. These results suggest that the models retain sizable verbatim fragments of their training corpora, a behavior that persists despite the guardrails many providers tout as safeguards against content leakage.

Legal scholars are now grappling with the ramifications for copyright law. The industry's long-standing defense, that AI merely learns statistical patterns and does not store copyrighted works, faces a direct challenge when courts see tangible, near-verbatim reproductions. Recent rulings, from the U.S. court deeming Anthropic's use of copyrighted material transformative to the German decision against OpenAI over song-lyric memorization, illustrate a growing willingness to treat such outputs as infringement. Companies could confront billions in settlements, as seen in Anthropic's $1.5 billion payout, and may need to reassess the viability of the fair-use argument.

Beyond litigation, the findings ripple through sectors that rely on sensitive data. In healthcare and education, inadvertent leakage of patient records or student information could trigger privacy violations under regulations like HIPAA or GDPR. Consequently, AI developers are pressured to refine data curation, explore synthetic alternatives, and implement more robust extraction‑resistant architectures. Policymakers, too, are likely to tighten disclosure requirements and enforce stricter auditing of training datasets, reshaping the economics and engineering of next‑generation generative AI.
