Stanford CS336 Language Modeling From Scratch | Spring 2026 | Lecture 4: Attention Alternatives

Stanford Online
Apr 15, 2026

Why It Matters

Longer context windows become affordable, enabling more capable AI agents, while linear attention and mixture-of-experts architectures keep compute costs manageable for commercial deployment.

Key Takeaways

  • Linear attention reduces the quadratic cost of attention to linear in sequence length by exploiting the associativity of matrix multiplication.
  • Flash attention offers constant‑factor speedups but doesn't change asymptotic complexity.
  • Hybrid models combine linear attention with occasional full softmax layers for performance.
  • Gated state‑space models like Mamba 2 and Gated‑DeltaNet enable efficient inference.
  • Mixture‑of‑Experts scales parameter count while keeping compute affordable.
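The associativity trick behind the first takeaway can be sketched in a few lines of NumPy; the shapes and the omission of the softmax (linear attention drops or replaces it with a feature map) are simplifying assumptions:

```python
import numpy as np

# Sketch of the associativity trick behind linear attention.
# Shapes are illustrative; real models apply a feature map to Q and K.
rng = np.random.default_rng(0)
N, D = 512, 64           # sequence length, head dimension
Q = rng.standard_normal((N, D))
K = rng.standard_normal((N, D))
V = rng.standard_normal((N, D))

# Quadratic order: (Q @ K.T) materializes an N x N matrix -> O(N^2 * D) work.
out_quadratic = (Q @ K.T) @ V

# Linear order: (K.T @ V) is only a D x D matrix -> O(N * D^2) work.
out_linear = Q @ (K.T @ V)

# Matrix multiplication is associative, so both orders agree.
assert np.allclose(out_quadratic, out_linear)
```

With N much larger than D, the second ordering is where the linear-in-sequence-length cost comes from.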

Summary

The lecture covered advanced transformer architectures, focusing on attention alternatives that achieve linear-time complexity and on mixture-of-experts (MoE) layers that boost parameter efficiency. The lecturer explained why quadratic attention costs dominate as context length grows and introduced techniques to curb them: exploiting the associativity of matrix multiplication, flash attention, and hybrid local-global schemes. A key insight is the re-ordering of (Q·Kᵀ)·V into Q·(Kᵀ·V), which shifts the dominant cost from O(N²·D) to O(N·D²) for sequence length N and head dimension D. Flash attention, by contrast, provides dramatic constant-factor gains without altering asymptotic behavior.

Hybrid models like MiniMax M1 interleave several linear-attention layers with a single full softmax layer, achieving competitive performance against large models such as GPT-3. The lecture also highlighted state-space approaches, Mamba 2 and Gated DeltaNet, which add input-dependent gates (γₜ) to the linear recurrence, preserving parallel training while enabling fast recurrent inference. Open-source frontier models (Neon 3 and Gated-DeltaNet-based systems) demonstrate that these gated recurrences deliver high throughput at long context windows.

Overall, the shift toward linear-time attention and MoE architectures promises scalable, cost-effective language models capable of handling tens of millions of tokens, opening new possibilities for complex AI agents and enterprise applications.
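The gated linear recurrence described above can be illustrated with a toy NumPy sketch. The shapes, the scalar gate γₜ, and the absence of any feature map are simplifying assumptions, not the exact formulation of Mamba 2 or Gated DeltaNet:

```python
import numpy as np

# Toy gated linear recurrence (Mamba 2 / Gated DeltaNet flavor); all
# names, shapes, and the scalar gate are illustrative assumptions.
rng = np.random.default_rng(1)
N, D = 16, 8
q = rng.standard_normal((N, D))
k = rng.standard_normal((N, D))
v = rng.standard_normal((N, D))
gamma = rng.uniform(0.8, 1.0, size=N)  # input-dependent decay gates

# Recurrent form (constant-size state per step -> fast inference):
#   S_t = gamma_t * S_{t-1} + k_t v_t^T,   y_t = q_t S_t
S = np.zeros((D, D))
y_rec = np.zeros((N, D))
for t in range(N):
    S = gamma[t] * S + np.outer(k[t], v[t])
    y_rec[t] = q[t] @ S

# Equivalent parallel form (used for training): a decay-weighted causal
# "attention" matrix built from cumulative products of the gates.
c = np.cumprod(gamma)                     # c[t] = gamma_0 * ... * gamma_t
decay = np.tril(c[:, None] / c[None, :])  # decay[t, s] = prod_{r=s+1..t} gamma_r
y_par = (decay * (q @ k.T)) @ v

# Both computations produce the same outputs.
assert np.allclose(y_rec, y_par)
```

The equivalence of the two forms is exactly what lets these models train in parallel like a transformer yet decode recurrently with constant memory.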
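The core MoE idea, activating only a few experts per token so that parameter count grows much faster than per-token compute, can likewise be sketched; all sizes and the plain top-k softmax router below are illustrative assumptions:

```python
import numpy as np

# Toy top-k mixture-of-experts routing for a single token; real MoE
# layers add load balancing, capacity limits, and batched dispatch.
rng = np.random.default_rng(2)
n_experts, top_k, d = 8, 2, 16
x = rng.standard_normal(d)                     # one token's hidden state
W_gate = rng.standard_normal((d, n_experts))   # router weights
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

logits = x @ W_gate
chosen = np.argsort(logits)[-top_k:]           # indices of the top-k experts
weights = np.exp(logits[chosen])
weights /= weights.sum()                       # softmax over chosen experts only

# Only top_k of n_experts run, so compute is ~top_k/n_experts of the
# dense equivalent even though all expert parameters exist.
y = sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))
```

Here the layer holds eight experts' worth of parameters but each token pays for only two, which is the sense in which MoE "scales parameter count while keeping compute affordable."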

Original Description

For more information about Stanford's online Artificial Intelligence programs, visit: https://stanford.io/ai
To learn more about enrolling in this course, visit: https://online.stanford.edu/courses/cs336-language-modeling-scratch
To follow along with the course schedule and syllabus, visit: https://cs336.stanford.edu/
Percy Liang
Professor of Computer Science (and courtesy in Statistics)
Tatsunori Hashimoto
Assistant Professor of Computer Science
