Stanford CS336 Language Modeling From Scratch | Spring 2026 | Lecture 4: Attention Alternatives
Why It Matters
Linear attention and MoE keep compute costs manageable for commercial deployment, making longer context windows affordable and enabling more capable AI agents.
Key Takeaways
- Linear attention reduces quadratic cost to linear via associative matrix multiplication.
- Flash attention offers constant‑factor speedups but doesn't change asymptotic complexity.
- Hybrid models combine linear attention with occasional full softmax layers for performance.
- Gated state‑space models like Mamba 2 and Gated‑DeltaNet enable efficient inference.
- Mixture‑of‑Experts scales parameter count while keeping compute affordable.
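The associativity trick from the first takeaway can be seen directly in code. The sketch below (a minimal NumPy illustration, with hypothetical sizes and without the softmax or causal masking a real attention layer would use) computes the same product in both orders: forming Q·Kᵀ first materializes an N×N matrix and costs O(N²·D), while grouping Kᵀ·V first only ever builds a D×D matrix and costs O(N·D²).

```python
import numpy as np

# Hypothetical sizes: sequence length N, head dimension D.
N, D = 1024, 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((N, D))
K = rng.standard_normal((N, D))
V = rng.standard_normal((N, D))

# Standard order: (Q @ K.T) @ V materializes an N x N matrix -> O(N^2 * D).
out_quadratic = (Q @ K.T) @ V

# Reordered: Q @ (K.T @ V) only forms a D x D matrix -> O(N * D^2).
out_linear = Q @ (K.T @ V)

# Without the softmax nonlinearity between Q·K^T and V, matrix
# multiplication is associative, so the two orders agree.
assert np.allclose(out_quadratic, out_linear)
```

The catch, as the lecture notes, is that softmax sits between Q·Kᵀ and V in standard attention and breaks this associativity; linear-attention variants replace the softmax with kernel feature maps precisely so the reordering becomes legal.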
Summary
The lecture covered advanced transformer architectures, focusing on attention alternatives that achieve linear‑time complexity and on mixture‑of‑experts (MoE) as a route to parameter efficiency. Professor Kumar explained why quadratic attention costs dominate as context length grows and introduced techniques for curbing them: exploiting the associativity of matrix multiplication, flash attention, and hybrid local‑global schemes.

Key insights included the re‑ordering of (Q·Kᵀ)·V into Q·(Kᵀ·V), which shifts the dominant cost from O(N²·D) to O(N·D²), and the observation that flash attention delivers dramatic constant‑factor gains without altering asymptotic behavior. Hybrid models like Minimax M1 interleave several linear‑attention layers with a single full softmax layer, achieving performance competitive with large models such as GPT‑3.

The professor highlighted state‑space approaches, Mamba 2 and Gated‑DeltaNet, that add input‑dependent gates (γₜ) to the linear recurrence, preserving parallel training while enabling fast recurrent inference. Open‑source frontier models (Neon 3, Gated‑DeltaNet‑based systems) demonstrate that these gated recurrences deliver high throughput at long context windows. Overall, the shift toward linear‑time attention and MoE architectures promises scalable, cost‑effective language models capable of handling tens of millions of tokens, opening new possibilities for complex AI agents and enterprise applications.
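The gated recurrence mentioned above can be sketched concretely. The function below is a minimal, assumption-laden illustration of the Mamba 2 / Gated‑DeltaNet family in its recurrent (inference-time) form: a running state Sₜ = γₜ·Sₜ₋₁ + kₜvₜᵀ is decayed by an input‑dependent scalar gate γₜ (hypothetical here; real models use richer, learned gates) and read out as oₜ = Sₜᵀqₜ. The names and the per‑step scalar gate are simplifications, not the models' actual parameterizations.

```python
import numpy as np

def gated_linear_recurrence(Q, K, V, gamma):
    """Recurrent sketch of gated linear attention:
    S_t = gamma_t * S_{t-1} + k_t v_t^T,  o_t = S_t^T q_t.
    gamma is a hypothetical per-step gate in (0, 1)."""
    N, D = Q.shape
    S = np.zeros((D, V.shape[1]))      # running state: D x D_v, constant in N
    out = np.empty((N, V.shape[1]))
    for t in range(N):
        # Input-dependent decay of the old state, then rank-1 update.
        S = gamma[t] * S + np.outer(K[t], V[t])
        out[t] = S.T @ Q[t]
    return out
```

The point of this form is the memory profile: only the D×D state is carried between tokens, so per‑token inference cost is independent of context length. During training the same recurrence unrolls into a parallelizable chunked scan, which is what lets these models keep transformer‑style training throughput.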