Stanford CS25: Transformers United V6 I Distinct Modes of Generalization From Parameters and Context
Why It Matters
The gap between fine‑tuning and in‑context generalization limits LLM reliability on reasoning tasks, motivating new architectural and prompting strategies to unlock more human‑like inference.
Key Takeaways
- Contextual prompting yields near‑perfect reversal accuracy; fine‑tuning does not
- In‑context learning generalizes better than fine‑tuning on syllogistic and codebook tasks
- Models trained from scratch still fail to infer unseen reversals, so limited fine‑tuning data is not the cause
- Reversal curse stems from causal next‑token architecture, not data scarcity
- Bridging the gap requires architectural tweaks or test‑time compute strategies
Summary
The talk by Andrew Lampinen explores how large language models (LLMs) generalize knowledge differently when it is stored in model parameters versus when it is supplied in the prompt context. By replicating the "reversal curse"—where fine‑tuned models struggle to answer inverse relational queries—he shows that simply feeding the same facts as context enables 99% accuracy, highlighting a stark contrast between parameter‑based and context‑based learning.
Across several experiments, Lampinen compares fine‑tuning against in‑context learning on tasks such as relational reversals, syllogistic reasoning, and codebook translation. While fine‑tuned models hover near chance, models given the same information in context consistently achieve high performance, even on novel logical implications. Training a small model from scratch confirms that the limitation is not merely insufficient fine‑tuning data: the model still cannot infer unseen reversals despite abundant exposure.
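To make the two conditions concrete, here is a minimal sketch of how a reversal test set might be constructed. The entity names, the relation wording, and every function name below are hypothetical illustrations, not data or code from the talk. The point is only that the fine‑tuning corpus contains each fact in the forward direction, while the in‑context condition places the same forward facts in the prompt ahead of the reversed question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str   # entity named first in the training sentence
    relation: str  # forward relation, e.g. "is the parent of"
    inverse: str   # inverse relation, e.g. "is the child of"
    obj: str       # entity named second in the training sentence


# Hypothetical, made-up facts for illustration; not data from the talk.
FACTS = [
    Fact("Zorvan", "is the parent of", "is the child of", "Mirelle"),
    Fact("Quillon", "is the teacher of", "is the student of", "Tavish"),
]


def forward_statement(fact):
    """Render the fact in the only direction the model is trained on."""
    return f"{fact.subject} {fact.relation} {fact.obj}."


def reversed_query(fact):
    """Return (question, expected_answer) asking for the fact in reverse."""
    return f"{fact.obj} {fact.inverse} whom?", fact.subject


def finetuning_corpus(facts):
    """Parameter condition: forward statements become training documents."""
    return [forward_statement(f) for f in facts]


def in_context_prompt(facts, target):
    """Context condition: the same forward statements precede the question."""
    context = "\n".join(forward_statement(f) for f in facts)
    question, _ = reversed_query(target)
    return f"{context}\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    print(finetuning_corpus(FACTS))
    print(in_context_prompt(FACTS, FACTS[0]))
```

Under this setup, both conditions expose the model to identical information. Only the route differs: weights updated on `finetuning_corpus(...)` versus a frozen model reading `in_context_prompt(...)`.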
Key observations include that LLMs implicitly learn to manipulate relational structures present in natural text, yet they do not internalize the latent inference rules during parameter updates. Architectural factors—specifically causal next‑token prediction—exacerbate the reversal curse, whereas bidirectional transformers or modified objectives can mitigate it. The research suggests that test‑time compute or hybrid approaches may bridge the generalization gap.
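The architectural contrast can be illustrated with attention masks. The following is a simplified sketch, assuming word‑level tokens and a hypothetical example sentence. It shows only which positions each token may attend to under a causal versus a bidirectional mask; it does not reproduce any experiment from the talk.

```python
def causal_mask(n):
    """Position i may attend only to positions j <= i (next-token setup)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def visible(mask, tokens, position):
    """List the tokens that the token at `position` is allowed to attend to."""
    return [tok for tok, allowed in zip(tokens, mask[position]) if allowed]


# Hypothetical sentence, split into word-level tokens for readability.
TOKENS = ["Zorvan", "is", "the", "parent", "of", "Mirelle"]

if __name__ == "__main__":
    n = len(TOKENS)
    # Under the causal mask the subject never sees the object that follows it.
    print(visible(causal_mask(n), TOKENS, 0))
    # Under the bidirectional mask the subject sees the whole sentence.
    print(visible(bidirectional_mask(n), TOKENS, 0))
```

With the causal mask, the representation at the subject position is built without access to the object, whereas the object position sees the subject. That asymmetry is one plausible reading of why a purely left‑to‑right objective stores the relation in one direction only.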
These findings imply that practitioners should leverage in‑context prompting for tasks requiring flexible relational reasoning, and that future model designs may need to incorporate mechanisms beyond standard fine‑tuning to capture latent structures. Understanding the divergence between parameter and context generalization also offers a window into parallels between artificial and natural intelligence.