Convergent Abstraction Hypothesis

Convergent Abstraction Hypothesis

LessWrong
LessWrongMay 15, 2026

Key Takeaways

  • Convergent abstractions arise from shared data and selection pressures
  • Compression drives abstraction convergence across brains, AI, and biology
  • Architectural constraints can break convergence, causing fragile representations
  • Human-aligned vision models improve robustness via multi-resolution training
  • Alignment hinges on whether moral abstractions are natural or convergent

Pulse Analysis

The convergent abstraction hypothesis borrows from biology’s convergent evolution to explain why disparate learners often settle on similar concepts. When agents—whether mammals, deep‑learning networks, or imagined extraterrestrials—operate under the same physical laws, they face identical pressures to compress massive streams of information into low‑dimensional summaries. Information theory predicts that the most economical representations, such as objects, causal relations, and arithmetic, will repeatedly emerge because they minimize the bits required for prediction and decision‑making. This view treats abstraction as a solution to a universal compression problem rather than a fixed property of the world.

A concrete illustration comes from recent work on adversarial robustness in image classification. Standard classifiers learn human‑like categories but diverge under tiny perturbations, revealing that their internal abstractions are contingent on architectural choices. By training models on multi‑resolution inputs and enforcing scale‑invariant consistency, researchers observed a dramatic alignment with human visual concepts and far‑greater resistance to adversarial attacks. The study highlights that while high‑level abstractions can converge, they remain fragile unless the model’s architecture explicitly supports the necessary invariances, underscoring the importance of design constraints in shaping reliable AI perception.

The strategic stakes for AI alignment hinge on whether moral and value‑related abstractions are natural—deeply rooted in the structure of the universe—or merely convergent products of shared training environments. If the former holds, pointing sufficiently capable learners at human data may reliably yield ethical representations, bolstering confidence in hand‑over scenarios and iterative amplification. Conversely, if these abstractions are only convergent within narrow basins, they could fracture under stronger optimization pressures, making alignment brittle and increasing the risk of misaligned behavior. Recognizing this distinction guides research priorities, from probing the basins of convergence to engineering architectures that preserve desirable abstractions under diverse objectives.

Convergent Abstraction Hypothesis

Comments

Want to join the conversation?