Stanford CS336 Language Modeling From Scratch | Spring 2026 | Lecture 3: Architectures
Why It Matters
Understanding these architectural norms lets engineers build larger, more reliable language models faster, directly impacting AI product performance and development costs.
Key Takeaways
- Pre‑norm layer placement is now standard for stable deep LLMs
- RMS‑norm replaces full layer‑norm for efficiency
- Positional embeddings evolved from sinusoidal encodings to RoPE and its variants
- Hyper‑parameters such as the FFN multiplier and vocabulary size follow clear conventions
- Stability tricks (extra norms, activation swaps) mitigate training spikes
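The pre‑norm placement in the first takeaway is easiest to see next to the original post‑norm layout. A minimal sketch in plain Python, where `sublayer` and `norm` are hypothetical stand‑ins for an attention/FFN block and a normalization layer (vectors are plain lists of floats):

```python
def post_norm_block(x, sublayer, norm):
    # Original Vaswani (2017) placement: normalize AFTER the residual add.
    # The residual stream itself passes through the norm, which can
    # attenuate gradients in deep stacks and forces LR warm-up.
    return norm([xi + yi for xi, yi in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer, norm):
    # Modern placement: normalize only the sublayer's input, leaving the
    # residual stream as a clean identity path for gradients.
    return [xi + yi for xi, yi in zip(x, sublayer(norm(x)))]
```

With an identity norm the two coincide; the difference is purely where the norm sits relative to the residual addition, which is what the gradient‑flow argument below turns on.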
Summary
The lecture surveys modern transformer architectures, emphasizing how design choices have crystallized around stability and scalability. Starting from the original Vaswani transformer, the instructor traces the shift from post‑norm residual placement to pre‑norm, noting that moving layer‑norm out of the residual stream preserves gradient flow and eliminates the need for aggressive learning‑rate warm‑up. He also highlights the widespread adoption of RMS‑norm over full layer‑norm, and the transition from sinusoidal positional encodings to rotary embeddings (RoPE) and other schemes.

Key data points include the near‑universal use of pre‑norm across recent LLMs (OPT‑350M being a notable exception) and the prevalence of hyper‑parameter conventions such as a feed‑forward multiplier of four and vocabularies sized for token efficiency. The instructor cites empirical studies (e.g., Nguyen & Salazar's "Transformers without Tears") showing reduced gradient spikes and smoother convergence when norms are placed before computations. Notable examples include Llama‑2's minor variations, Gemma‑2's additional post‑norm placed outside the residual stream, and the practice of sprinkling extra norms to rescue unstable training runs.

The lecture also references the rapid pace of dense model releases, such as Qwen‑3, GLM‑4, and InternLM2, illustrating the iteration cycle that fuels architectural experimentation. For practitioners, the takeaway is clear: prefer pre‑norm (or post‑norm placed outside the residual path), adopt RMS‑norm for speed, and leverage advances in positional embeddings. These choices collectively enable deeper, longer‑context models while keeping training stable, a prerequisite for competitive language‑model research and product deployment.
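RMS‑norm, mentioned above as the cheaper replacement for layer‑norm, drops the mean‑centering and bias terms and rescales by the root‑mean‑square alone. A minimal sketch in plain Python (the gain vector `g` is the only learned parameter; `eps` guards against division by zero):

```python
import math

def rms_norm(x, g, eps=1e-6):
    # RMSNorm (Zhang & Sennrich, 2019): divide by RMS(x), with no mean
    # subtraction and no additive bias -- fewer ops than full layer-norm.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [gi * v / rms for gi, v in zip(g, x)]
```

Because the output's RMS is (up to `eps`) exactly 1 before the gain is applied, the residual stream's scale stays controlled without the extra statistics layer‑norm computes.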
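RoPE, the rotary scheme contrasted above with sinusoidal encodings, rotates each consecutive pair of query/key dimensions by a position‑dependent angle, so that attention dot products depend only on relative position. A minimal sketch, assuming the standard base of 10000 from the RoFormer paper:

```python
import math

def rope(x, pos, base=10000.0):
    # Rotate each (even, odd) dimension pair of vector x by an angle
    # proportional to the token position; pair starting at index i uses
    # frequency base^(-i/d), mirroring the sinusoidal frequency schedule.
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s,
                x[i] * s + x[i + 1] * c]
    return out
```

The key property: `dot(rope(q, m), rope(k, n))` depends only on the offset `m - n`, which is why RoPE generalizes more gracefully to longer contexts than absolute position embeddings.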