Stanford CS336 Language Modeling From Scratch | Spring 2026 | Lecture 3: Architectures

Stanford Online
Apr 15, 2026

Why It Matters

Understanding these architectural norms lets engineers build larger, more reliable language models faster, directly impacting AI product performance and development costs.

Key Takeaways

  • Pre‑norm layer placement now standard for stable deep LLMs
  • RMS‑norm replaces traditional layer‑norm for efficiency
  • Positional embeddings evolve: from sinusoidal to RoPE and variants
  • Hyper‑parameters like FFN multiplier and vocab size follow clear trends
  • Stability tricks (extra norms, activation swaps) mitigate training spikes

Summary

The lecture surveys modern transformer architectures, emphasizing how design choices have crystallized around stability and scalability. Starting from the original Vaswani et al. transformer, the instructor traces the shift from post-norm to pre-norm residual placement, noting that moving layer-norm off the residual stream preserves gradient flow and removes the need for aggressive learning-rate warm-up. He also highlights the widespread adoption of RMS-norm over full layer-norm, and the transition from sinusoidal positional encodings to rotary position embeddings (RoPE) and related schemes.

Key data points include the near-universal use of pre-norm across recent LLMs (OPT-350M being a notable exception) and the prevalence of hyper-parameter conventions such as a feed-forward multiplier of four and vocabularies sized for token efficiency. The instructor cites empirical studies (e.g., Nguyen & Salazar) showing reduced gradient spikes and smoother convergence when norms are placed before the sublayer computation. Notable examples include Llama-2's minor variations, Gemma-2's post-norm placed outside the residual stream, and the practice of sprinkling additional norms into a network to rescue unstable training runs. The lecture also references the rapid pace of dense model releases (Qwen-3, GLM-4, InternLM2), a churn that fuels architectural experimentation.

For practitioners, the takeaway is clear: prefer pre-norm (or post-norm kept outside the residual path), adopt RMS-norm for speed, and take advantage of positional-embedding advances. Together these choices enable deeper, longer-context models with stable training, a prerequisite for competitive language-model research and product deployment.
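To make the norm-placement distinction concrete, here is a minimal numpy sketch (not code from the lecture) contrasting RMS-norm with the two residual placements. The function names and the toy `sublayer` argument are illustrative assumptions; real models apply this per attention and feed-forward sublayer.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # RMSNorm rescales by the root-mean-square of the features only:
    # no mean subtraction and no bias, so it is cheaper than LayerNorm.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def pre_norm_block(x, sublayer, gain):
    # Pre-norm: normalize the *input* to the sublayer. The residual
    # stream (the bare `x +` path) is never rescaled, which is the
    # property credited with preserving gradient flow in deep stacks.
    return x + sublayer(rms_norm(x, gain))

def post_norm_block(x, sublayer, gain):
    # Original post-norm placement: the norm sits on the residual
    # stream after the addition, which is harder to train when deep
    # without careful learning-rate warm-up.
    return rms_norm(x + sublayer(x), gain)
```

With an identity-zero sublayer, the pre-norm block passes `x` through unchanged, which illustrates why the residual path stays "clean" under pre-norm but not under post-norm.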
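The shift from sinusoidal encodings to RoPE can likewise be sketched in a few lines. This is a simplified illustration (function name and pair-wise channel layout are assumptions, and production implementations usually interleave differently and cache the angles): each consecutive pair of query/key channels is rotated by a position-dependent angle, so attention scores depend only on relative position.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    # x: (seq_len, d) with even d. Rotate each consecutive channel
    # pair by angle position * base**(-2i/d), the RoPE frequency ladder.
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)       # (d/2,)
    angles = positions[:, None] * inv_freq[None, :]    # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin               # 2-D rotation
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair undergoes a pure rotation, vector norms are preserved, and the dot product between a rotated query at position m and a rotated key at position n depends only on m - n, which is what lets RoPE extrapolate and be modified for longer contexts.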

Original Description

For more information about Stanford's online Artificial Intelligence programs, visit: https://stanford.io/ai
To learn more about enrolling in this course, visit: https://online.stanford.edu/courses/cs336-language-modeling-scratch
To follow along with the course schedule and syllabus, visit: https://cs336.stanford.edu/
Percy Liang
Professor of Computer Science (and courtesy in Statistics)
Tatsunori Hashimoto
Assistant Professor of Computer Science
