Stanford CS221 | Autumn 2025 | Lecture 17: Language Models

Stanford Online
Mar 9, 2026

Why It Matters

The sheer compute and financial investment required to train large language models reshapes competitive dynamics, making access to advanced AI a strategic differentiator for firms that can afford the infrastructure.

Key Takeaways

  • Training LLMs requires trillions of tokens and on the order of 10^25 floating‑point operations.
  • Pre‑training costs run into tens of millions of dollars per model.
  • Autoregressive models predict next token using probability distributions.
  • Scaling up models improves performance but demands massive hardware infrastructure.
  • Language models now power everyday tools from autocomplete to code generation.

Summary

The Stanford CS221 lecture 17 provides a sweeping overview of modern language models, emphasizing their ubiquity—from chat assistants and phone keyboards to code‑completion tools—and the massive scale at which they are built. The lecturer walks students through concrete examples such as Meta's Llama 3 and Alibaba's Qwen 3, highlighting that training these systems consumes tens of trillions of tokens, roughly 144 TB of raw text, and requires on the order of 10^25 floating‑point operations.

Key data points illustrate the staggering resource demands: a single H100 GPU would need about 880,800 days to finish pre‑training, translating to roughly $42 million at current cloud rates. The lecture also quantifies the human effort—hundreds of engineers and researchers collaborate on each model—and notes industry moves toward exotic compute solutions, including space‑based GPU farms. These numbers are framed with vivid analogies, such as stacking 90 billion pages of paper to a height of 9,000 km, far exceeding the International Space Station’s orbit.
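The GPU‑day and dollar figures above follow from straightforward arithmetic. The sketch below reproduces them approximately; the sustained H100 throughput and the hourly cloud rate are assumptions chosen to be roughly consistent with the quoted numbers, not values stated in the lecture.

```python
# Back-of-envelope pre-training cost estimate.
# ASSUMPTIONS (not from the lecture): sustained throughput of an H100
# (~13% of peak) and a cloud rate of ~$2 per GPU-hour.
TOTAL_FLOPS = 1e25            # training compute quoted in the lecture
H100_FLOPS_PER_SEC = 1.3e14   # assumed sustained throughput
CLOUD_RATE_PER_HOUR = 2.0     # assumed $/GPU-hour

seconds = TOTAL_FLOPS / H100_FLOPS_PER_SEC
days = seconds / 86_400                      # single-GPU days
cost = (seconds / 3600) * CLOUD_RATE_PER_HOUR

print(f"{days:,.0f} GPU-days, ${cost / 1e6:.0f}M")
```

With these assumptions the estimate lands near the lecture's ~880,000 GPU‑days and ~$42 million; in practice training is parallelized across tens of thousands of GPUs, which shrinks wall‑clock time but not total cost.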

Notable remarks include the professor’s tongue‑in‑cheek prediction that GPUs may be operating in orbit before Stanford’s next football game, underscoring how quickly the hardware race is accelerating. He also stresses that despite the astronomical costs, the output is essentially a massive matrix of numbers that can generate coherent text, code, or even video captions, illustrating the transformative power of scaling language models.

For students and practitioners, the lecture underscores three takeaways: understand the probabilistic foundations of next‑token prediction, recognize the economic and environmental implications of training at scale, and appreciate that mastering these fundamentals is essential as the technology becomes a core competitive asset across industries.
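The probabilistic foundation of next‑token prediction can be illustrated with a toy bigram model. This is a minimal sketch, not how LLMs are implemented: real models use neural networks over vocabularies of ~100k tokens, but the autoregressive interface — a probability distribution over the next token given the context, sampled repeatedly — is the same.

```python
import random
from collections import Counter, defaultdict

# Toy bigram "language model": estimates p(next | prev) by counting
# adjacent word pairs in a tiny corpus (illustrative data only).
corpus = "the cat sat on the mat the cat ate".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_dist(prev):
    """Return p(token | prev) as a dict; empty if prev was never seen."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()} if total else {}

def generate(start, n, seed=0):
    """Sample tokens autoregressively: each step conditions on the last token."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        dist = next_token_dist(out[-1])
        if not dist:          # dead end: token has no observed successor
            break
        toks, probs = zip(*dist.items())
        out.append(rng.choices(toks, weights=probs)[0])
    return " ".join(out)

print(next_token_dist("the"))  # {'cat': 0.666..., 'mat': 0.333...}
print(generate("the", 5))
```

The same loop — compute a distribution, sample, append, repeat — is what an LLM does at inference time, just with a transformer in place of the count table.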

Original Description

For more information about Stanford’s Artificial Intelligence professional and graduate programs, visit: https://stanford.io/ai
Please follow along with the course schedule: https://stanford-cs221.github.io/autumn2025/
Teaching Team
Percy Liang, Associate Professor of Computer Science (and courtesy in Statistics)
