Stanford CS336 Language Modeling From Scratch | Spring 2026 | Lecture 3: Architectures
Why It Matters
Understanding these architectural norms lets engineers build larger, more reliable language models faster, directly impacting AI product performance and development costs.
Key Takeaways
- Pre‑norm layer placement is now standard for stable deep LLMs
- RMS‑norm replaces full layer‑norm for efficiency
- Positional embeddings evolved from sinusoidal encodings to RoPE and its variants
- Hyper‑parameters such as the FFN multiplier and vocabulary size follow clear conventions
- Stability tricks (extra norms, activation swaps) mitigate training spikes
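The pre‑norm placement in the first takeaway is easiest to see next to the original post‑norm layout. A minimal sketch in plain Python, where `sublayer` and `norm` are hypothetical stand‑ins for an attention/FFN block and a normalization layer (vectors are plain lists of floats):

```python
def post_norm_block(x, sublayer, norm):
    # Original Vaswani (2017) placement: normalize AFTER the residual add.
    # The residual stream itself passes through the norm, which can
    # attenuate gradients in deep stacks and forces LR warm-up.
    return norm([xi + yi for xi, yi in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer, norm):
    # Modern placement: normalize only the sublayer's input, leaving the
    # residual stream as a clean identity path for gradients.
    return [xi + yi for xi, yi in zip(x, sublayer(norm(x)))]
```

With an identity norm the two coincide; the difference is purely where the norm sits relative to the residual addition, which is what the gradient‑flow argument below turns on.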
Summary
The lecture surveys modern transformer architectures, emphasizing how design choices have crystallized around stability and scalability. Starting from the original Vaswani transformer, the instructor traces the shift from post‑norm residual placement to pre‑norm, noting that moving layer‑norm out of the residual stream preserves gradient flow and eliminates the need for aggressive learning‑rate warm‑up. He also highlights the widespread adoption of RMS‑norm over full layer‑norm, and the transition from sinusoidal positional encodings to rotary embeddings (RoPE) and other schemes.

Key data points include the near‑universal use of pre‑norm across recent LLMs (OPT‑350M being a notable exception) and the prevalence of hyper‑parameter conventions such as a feed‑forward multiplier of four and vocabularies sized for token efficiency. The instructor cites empirical studies (e.g., Nguyen & Salazar's "Transformers without Tears") showing reduced gradient spikes and smoother convergence when norms are placed before computations. Notable examples include Llama‑2's minor variations, Gemma‑2's additional post‑norm placed outside the residual stream, and the practice of sprinkling extra norms to rescue unstable training runs.

The lecture also references the rapid pace of dense model releases, such as Qwen‑3, GLM‑4, and InternLM2, illustrating the iteration cycle that fuels architectural experimentation. For practitioners, the takeaway is clear: prefer pre‑norm (or post‑norm placed outside the residual path), adopt RMS‑norm for speed, and leverage advances in positional embeddings. These choices collectively enable deeper, longer‑context models while keeping training stable, a prerequisite for competitive language‑model research and product deployment.
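RMS‑norm, mentioned above as the cheaper replacement for layer‑norm, drops the mean‑centering and bias terms and rescales by the root‑mean‑square alone. A minimal sketch in plain Python (the gain vector `g` is the only learned parameter; `eps` guards against division by zero):

```python
import math

def rms_norm(x, g, eps=1e-6):
    # RMSNorm (Zhang & Sennrich, 2019): divide by RMS(x), with no mean
    # subtraction and no additive bias -- fewer ops than full layer-norm.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [gi * v / rms for gi, v in zip(g, x)]
```

Because the output's RMS is (up to `eps`) exactly 1 before the gain is applied, the residual stream's scale stays controlled without the extra statistics layer‑norm computes.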
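RoPE, the rotary scheme contrasted above with sinusoidal encodings, rotates each consecutive pair of query/key dimensions by a position‑dependent angle, so that attention dot products depend only on relative position. A minimal sketch, assuming the standard base of 10000 from the RoFormer paper:

```python
import math

def rope(x, pos, base=10000.0):
    # Rotate each (even, odd) dimension pair of vector x by an angle
    # proportional to the token position; pair starting at index i uses
    # frequency base^(-i/d), mirroring the sinusoidal frequency schedule.
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s,
                x[i] * s + x[i + 1] * c]
    return out
```

The key property: `dot(rope(q, m), rope(k, n))` depends only on the offset `m - n`, which is why RoPE generalizes more gracefully to longer contexts than absolute position embeddings.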