Stanford CS336 Language Modeling From Scratch | Spring 2026 | Lecture 9: Scaling Laws

Stanford Online
Apr 30, 2026

Why It Matters

Scaling laws turn expensive, trial‑and‑error model training into a data‑driven optimization problem, enabling faster, cheaper breakthroughs for AI research and commercial deployment.

Key Takeaways

  • Use small experiments to predict large‑scale model performance.
  • Power‑law trends (exponents cited as roughly three to four) dominate data and compute scaling.
  • Scaling ideas date back to early‑1990s work on empirical sample complexity.
  • Accurate scaling reduces costly trial‑and‑error on big runs.
  • Emerging sigmoid patterns link compute to downstream benchmark gains.
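
The first two takeaways can be illustrated with a minimal sketch: fit a power law to a handful of cheap runs, then extrapolate to a much larger budget. Everything in it is an assumption made for illustration and does not come from the lecture: the functional form loss = a * C^(-b) with no irreducible-loss term, the helper names `fit_power_law` and `predict`, and the synthetic data points and constants.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b) in log-log space. Returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx                      # slope of the line in log-log space
    intercept = mean_y - slope * mean_x    # log(a)
    return math.exp(intercept), -slope

def predict(x, a, b):
    """Evaluate the fitted power law at resource level x."""
    return a * x ** (-b)

# Synthetic "small runs": made-up constants, generated from an exact power law.
TRUE_A, TRUE_B = 10.0, 0.1
small_compute = [1e15, 1e16, 1e17, 1e18]   # hypothetical training FLOPs
small_loss = [TRUE_A * c ** (-TRUE_B) for c in small_compute]

a_hat, b_hat = fit_power_law(small_compute, small_loss)
extrapolated = predict(1e21, a_hat, b_hat)  # three orders of magnitude beyond the data
print(f"a={a_hat:.3f}, b={b_hat:.3f}, predicted loss at 1e21 FLOPs: {extrapolated:.4f}")
```

Because a power law is a straight line in log-log space, ordinary linear regression on the logged values is enough here. Real loss curves are noisy and typically flatten toward a floor, so a practical fit would likely need an extra constant term and some check on how far the extrapolation can be trusted.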

Summary

The lecture introduced scaling laws as a framework for predicting language‑model performance when moving from modest to massive training regimes. The lecturer emphasized that, rather than spending millions on large‑scale trial runs, researchers can run inexpensive small‑scale experiments and extrapolate the results using simple functional relationships. A key insight was the prevalence of power‑law relationships (often with exponents around three or four) between resources (data, compute, parameters) and test loss.

The lecture traced the idea back to early‑1990s work on empirical sample complexity and highlighted the 2017 paper by Hestness et al., which systematically documented polynomial trends across speech, translation, and language tasks. The discussion also covered how downstream benchmark performance often follows a sigmoid curve as compute increases. Notable examples cited were the classic Bell Labs studies on data scaling, the Banko and Brill analysis of data‑size effects in NLP, and recent observations that the upper envelope of capability curves remains linear in log‑log space. The lecturer stressed that while power laws are common, they are not universal; theory and physics can suggest alternative forms.

For practitioners, scaling laws provide a cost‑effective roadmap: they guide hyper‑parameter selection, model‑size decisions, and compute budgeting, reducing wasteful large‑scale runs. By treating scaling as a predictive science, organizations can accelerate development of state‑of‑the‑art models while managing financial risk.
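
The sigmoid relationship between compute and downstream benchmark performance mentioned in the summary can be sketched as follows. This is only an illustration of the shape: the function name `benchmark_accuracy` and every parameter value (a 0.25 floor, as for chance on a four‑way multiple‑choice task, a 0.90 ceiling, the midpoint, and the steepness) are hypothetical and not taken from the lecture.

```python
import math

def benchmark_accuracy(compute, floor=0.25, ceiling=0.90, midpoint=21.0, steepness=2.0):
    """Sigmoid in log10(compute); all default parameters are hypothetical."""
    z = steepness * (math.log10(compute) - midpoint)
    return floor + (ceiling - floor) / (1.0 + math.exp(-z))

# Accuracy stays near the floor, rises quickly around the midpoint, then saturates.
for c in [1e18, 1e20, 1e21, 1e22, 1e24]:
    print(f"{c:.0e} FLOPs -> accuracy {benchmark_accuracy(c):.3f}")
```

The contrast with test loss is the point of the sketch: a smooth power‑law decrease in loss can coexist with benchmark accuracy that looks flat at small scale and then climbs sharply, which is one reason small runs can be poor predictors of downstream scores when read naively.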

Original Description

For more information about Stanford's online Artificial Intelligence programs, visit: https://stanford.io/ai
To learn more about enrolling in this course, visit: https://online.stanford.edu/courses/cs336-language-modeling-scratch
To follow along with the course schedule and syllabus, visit: https://cs336.stanford.edu/
Percy Liang
Professor of Computer Science (and courtesy in Statistics)
Tatsunori Hashimoto
Assistant Professor of Computer Science
