
Researchers Show that Training on “Junk Data” Can Lead to LLM “Brain Rot”
Why It Matters
The authors warn that unchecked reliance on low‑quality internet content risks cumulative harms and potential “model collapse,” urging tighter data curation and re‑examination of continual pre‑training practices as AI‑generated web content proliferates.
Summary
Researchers from Texas A&M, UT and Purdue quantified an “LLM brain rot” effect, showing that continual pre‑training on high‑engagement, short, or sensationalist “junk” tweets degrades large language model performance on reasoning and long‑context memory benchmarks. Using two junk‑data definitions, one based on engagement metrics and one on GPT‑4o semantic filtering, they pre‑trained models on varying junk/control mixtures and found statistically significant declines on key benchmarks, though some mixes (e.g., 50/50 for Llama 8B) produced mixed or even improved results on certain ethics and personality measures.