Stanford CS25: Transformers United V6 I On the Tradeoffs of State Space Models and Transformers

Stanford Online
Apr 27, 2026

Why It Matters

By lowering compute and memory demands, SSM‑based hybrids make trillion‑parameter models more affordable, accelerating deployment of powerful generative AI across cloud and edge environments.

Key Takeaways

  • Linear SSMs replace quadratic attention with a recurrence whose cost per generated token is constant.
  • A large state with input‑dependent (selective) updates enables richer context compression.
  • Efficient training uses associative scan or chunked matrix multiplication.
  • Hybrid models (e.g., Jamba, Qwen) blend SSMs with attention layers.
  • Mamba 2 and Gated DeltaNet are singled out as offering the best balance of speed and accuracy in production.
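
The chunked matrix multiplication idea from the takeaways can be illustrated on a toy scalar recurrence h_t = a_t * h_(t-1) + b_t. The sketch below is an assumption-laden simplification, not the kernel used by any named model: the function names, the scalar state, and the chunk size are all hypothetical. Inside each chunk the outputs come from a lower-triangular decay matrix applied to the inputs (the part that maps onto matrix multiplication hardware), and only one state value is carried between chunks.

```python
from typing import List


def sequential(a: List[float], b: List[float], h0: float = 0.0) -> List[float]:
    """Reference: step through h_t = a_t * h_{t-1} + b_t one token at a time."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def chunked(a: List[float], b: List[float], chunk: int = 4, h0: float = 0.0) -> List[float]:
    """Same recurrence, computed chunk by chunk (hypothetical toy version)."""
    out: List[float] = []
    carry = h0  # state at the end of the previous chunk
    for start in range(0, len(a), chunk):
        a_c = a[start:start + chunk]
        b_c = b[start:start + chunk]
        n = len(a_c)
        # decay[t][s] = a_c[s+1] * ... * a_c[t] for s <= t, with 1.0 on the diagonal.
        decay = [[0.0] * n for _ in range(n)]
        for t in range(n):
            decay[t][t] = 1.0
            for s in range(t - 1, -1, -1):
                decay[t][s] = decay[t][s + 1] * a_c[s + 1]
        for t in range(n):
            # Intra-chunk term: one row of a triangular matrix-vector product.
            intra = sum(decay[t][s] * b_c[s] for s in range(t + 1))
            # Inter-chunk term: the carried state decayed through a_c[0..t].
            inter = decay[t][0] * a_c[0] * carry
            out.append(intra + inter)
        carry = out[-1]
    return out


if __name__ == "__main__":
    a = [0.9, 0.5, 0.8, 1.0, 0.3, 0.7, 0.95, 0.6, 0.4, 0.85]
    b = [1.0, 2.0, -1.0, 0.5, 4.0, 0.0, -2.0, 1.5, 3.0, -0.5]
    print(max(abs(x - y) for x, y in zip(sequential(a, b), chunked(a, b))))
```

The chunk size trades quadratic work inside a chunk against the number of sequential hand-offs between chunks; real implementations operate on matrix-valued states rather than scalars.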

Summary

Albert Gu’s Stanford CS25 talk examined the trade‑offs between traditional transformer architectures and the emerging family of state‑space models (SSMs), highlighting how these linear‑complexity models reshape sequence‑modeling.

Over the past three years, models such as Mamba, Mamba 2/3, xLSTM, DeltaNet, and Gated DeltaNet have gained traction, often appearing in hybrid systems like AI21’s Jamba, Microsoft’s Samba, Tencent’s Hunyuan, and NVIDIA’s Nemotron 3. These architectures achieve sub‑quadratic or linear inference cost while scaling to hundreds of billions of parameters.

Gu identified three pillars of successful SSMs: a much larger hidden state than classic RNNs, input‑dependent (selective) parameterization of the recurrence, and algorithmic tricks (associative scan or chunked matrix multiplication) that let the inherently sequential recurrence be computed in parallel across the sequence while keeping total work linear in the sequence length N. He noted that Mamba 2 and Gated DeltaNet currently offer the best balance of speed and accuracy in production.
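
To make the associative scan idea concrete, here is a minimal sketch on a scalar input-dependent recurrence h_t = a_t * h_(t-1) + b_t. It assumes a scalar state and uses hypothetical helper names (`combine`, `associative_scan`, `scan_recurrence`); it is not the implementation from the talk or from any library. The key observation is that each step is an affine map h -> a*h + b, and composing affine maps is associative, so prefixes can be combined in a tree rather than strictly left to right.

```python
from typing import List, Tuple

Pair = Tuple[float, float]  # (a, b) stands for the affine map h -> a * h + b


def combine(left: Pair, right: Pair) -> Pair:
    """Compose two affine maps: apply `left` first, then `right`."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def associative_scan(pairs: List[Pair]) -> List[Pair]:
    """Inclusive prefix scan using pairwise reduction.

    The combines within each level are independent of one another, so a
    parallel machine could run them at once; here they run in a plain loop.
    """
    n = len(pairs)
    if n <= 1:
        return list(pairs)
    # Level step: merge neighbours (0,1), (2,3), ... independently.
    reduced = [combine(pairs[i], pairs[i + 1]) for i in range(0, n - 1, 2)]
    scanned = associative_scan(reduced)  # scanned[k] covers indices 0..2k+1
    out = [pairs[0]]
    for i in range(1, n):
        if i % 2 == 1:
            out.append(scanned[i // 2])
        else:
            out.append(combine(scanned[i // 2 - 1], pairs[i]))
    return out


def scan_recurrence(a: List[float], b: List[float], h0: float = 0.0) -> List[float]:
    """Evaluate h_t = a_t * h_{t-1} + b_t for all t via the scan."""
    prefixes = associative_scan(list(zip(a, b)))
    return [A * h0 + B for A, B in prefixes]


def sequential_recurrence(a: List[float], b: List[float], h0: float = 0.0) -> List[float]:
    """Reference implementation: one step at a time."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out
```

Both functions return the same states up to floating-point rounding; the scan version has logarithmic depth in the sequence length, which is what makes training parallelizable.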

The shift toward linear‑time sequence models reduces memory pressure from KV caches, cuts inference latency, and opens new cost‑effective scaling pathways for large language models, prompting enterprises to reconsider architecture choices for next‑generation AI services.
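
A back-of-the-envelope calculation shows where the memory savings come from. The configuration numbers below are hypothetical, chosen only for illustration, and do not describe any model named in this summary. The attention formula counts one key and one value vector per layer, per KV head, per token; the SSM formula assumes a fixed state of d_inner × d_state values per layer and ignores smaller buffers.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Attention cache: keys and values for every past token (grows with seq_len)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


def ssm_state_bytes(layers: int, d_inner: int, d_state: int,
                    bytes_per_value: int = 2) -> int:
    """Recurrent state: fixed size, independent of how many tokens were seen."""
    return layers * d_inner * d_state * bytes_per_value


if __name__ == "__main__":
    MIB = 1024 ** 2
    # Hypothetical 32-layer model stored in 16-bit precision.
    for seq_len in (4_096, 131_072):
        kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=seq_len)
        print(f"KV cache at {seq_len} tokens: {kv / MIB:.0f} MiB")
    ssm = ssm_state_bytes(layers=32, d_inner=4096, d_state=128)
    print(f"SSM state at any length: {ssm / MIB:.0f} MiB")
```

Under these assumed sizes the cache grows from 512 MiB at 4,096 tokens to 16 GiB at 131,072 tokens per sequence, while the recurrent state stays at 32 MiB.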

Original Description

For more information about Stanford’s graduate programs, visit: https://online.stanford.edu/graduate-education
April 16, 2026
This seminar covers:
• A high-level overview of a recently popular subquadratic alternative to the Transformer, the state space model (SSM)
• The core characteristics and design choices of SSMs and other related modern linear models
Follow along with the seminar schedule. Visit: https://web.stanford.edu/class/cs25/
Guest Speaker: Albert Gu (CMU, Cartesia AI)
Instructors:
• Steven Feng, Stanford Computer Science PhD student and NSERC PGS-D scholar
• Karan P. Singh, Electrical Engineering PhD student and NSF Graduate Research Fellow in the Stanford Translational AI Lab
• Michael C. Frank, Benjamin Scott Crocker Professor of Human Biology; Director, Symbolic Systems Program
• Christopher Manning, Thomas M. Siebel Professor in Machine Learning, Professor of Linguistics and of Computer Science, Co-Founder and Senior Fellow of the Stanford Institute for Human-Centered Artificial Intelligence (HAI)
