Researchers Show that Training on “Junk Data” Can Lead to LLM “Brain Rot”

Ars Technica AI, Oct 23, 2025


Why It Matters

The authors warn that unchecked reliance on low‑quality internet content risks cumulative harms and potential “model collapse,” urging tighter data curation and re‑examination of continual pre‑training practices as AI‑generated web content proliferates.

Summary

Researchers from Texas A&M, UT, and Purdue quantified an "LLM brain rot" effect, showing that continual pre‑training on high‑engagement, short, or sensationalist "junk" tweets degrades large language model performance on reasoning and long‑context memory benchmarks. Using two junk‑data definitions, one drawn from engagement metrics and one from GPT‑4o semantic filtering, they pre‑trained models on varying junk/control mixes and found statistically significant declines on key benchmarks, though some mixes (e.g., 50/50 for Llama 8B) produced mixed or even improved results on certain ethical and personality measures.
