
Researchers Show that Training on “Junk Data” Can Lead to LLM “Brain Rot”
Why It Matters
The authors warn that unchecked reliance on low‑quality internet content risks cumulative harms and potential “model collapse,” urging tighter data curation and re‑examination of continual pre‑training practices as AI‑generated web content proliferates.
Summary
Researchers from Texas A&M, UT and Purdue quantified an “LLM brain rot” effect, showing that continual pre‑training on high‑engagement, short, or sensationalist “junk” tweets degrades large language model performance on reasoning and long‑context memory benchmarks. Using two junk‑data definitions, one based on engagement metrics and one on GPT‑4o semantic filtering, they pre‑trained models on varying junk/control mixtures and found statistically significant declines on key benchmarks, though some mixes (e.g., 50/50 for Llama 8B) produced mixed or even improved results on certain ethics and personality measures.