
Stanford CS336 Language Modeling From Scratch | Spring 2026 | Lecture 10: Inference

Stanford Online • May 11, 2026

Why It Matters

Inference efficiency directly impacts the profitability and scalability of AI products, making speed‑optimizing techniques essential for any organization deploying large language models.

Key Takeaways

•Inference cost dominates post‑training operational expenses for large language models across industries
•Time-to-first-token and per-token latency are crucial for interactive applications and user experience
•Throughput matters for batch processing and large‑scale token generation
•KV‑cache reduction, quantization, pruning, and speculative decoding boost speed
•Open-source runtimes such as vLLM, SGLang, TensorRT, and llama.cpp enable flexible deployment
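
The three speed metrics in the takeaways can be made concrete with a small sketch. It assumes a client has recorded when each request was sent and when each generated token arrived. The `RequestTrace` class and the function names are hypothetical and are not taken from the lecture or from any runtime's API.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps (in seconds) for one request; a hypothetical record format."""
    start: float              # when the request was sent
    token_times: list[float]  # arrival time of each generated token


def time_to_first_token(trace: RequestTrace) -> float:
    """Delay between sending the request and receiving the first token."""
    return trace.token_times[0] - trace.start


def mean_inter_token_latency(trace: RequestTrace) -> float:
    """Average gap between consecutive tokens after the first one."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


def throughput(traces: list[RequestTrace]) -> float:
    """Total tokens generated per second of wall-clock time across all requests."""
    total_tokens = sum(len(t.token_times) for t in traces)
    begin = min(t.start for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return total_tokens / (end - begin)
```

An interactive service would mostly watch the first two numbers per request, while a batch job would mostly watch the third across all concurrent requests.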

Summary

The lecture covers inference for large language models. Its starting point is that once a model is trained, the recurring cost of generating responses dominates operational budgets. It contrasts the one-time training expense with the continuous, token-by-token computation required for chatbots, agents, and batch processing, and argues that inference now accounts for the majority of compute spend.

Three metrics define "fast" inference: time-to-first-token (TTFT), per-token latency, and overall throughput. Because generation is autoregressive, the decoding phase cannot be parallelized across the sequence dimension, which leads to low arithmetic intensity and memory-bound workloads, especially at small batch sizes. KV-cache compression, quantization, pruning, and speculative decoding are presented as ways to raise efficiency.

To illustrate scale, the speaker cites OpenAI generating 8.6 trillion tokens daily, a rate that in under four days exceeds the 32 trillion tokens cited for training GPT-4. He also surveys the ecosystem of inference runtimes, from cloud APIs to open-source stacks such as vLLM, SGLang, NVIDIA TensorRT, and llama.cpp, showing how developers can trade off speed, hardware compatibility, and ease of deployment.

For businesses, even modest improvements in inference speed or cost translate into substantial savings at scale. Optimizing latency for interactive use cases and maximizing throughput for batch jobs are both levers for keeping AI services competitive.
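
The claim that small-batch decoding is memory-bound can be checked with a back-of-the-envelope estimate. The sketch below is not from the lecture. It assumes a single dense layer applied to a batch of token vectors, counts one read of the inputs and weights and one write of the outputs, and uses 2 bytes per element (16-bit values). The hardware numbers in the usage note are illustrative placeholders, not the specification of any real accelerator.

```python
def matmul_intensity(batch: int, d_in: int, d_out: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for multiplying a (batch x d_in) input by a (d_in x d_out) weight."""
    flops = 2 * batch * d_in * d_out  # one multiply and one add per weight per row
    bytes_moved = bytes_per_elem * (
        batch * d_in      # read activations
        + d_in * d_out    # read weights
        + batch * d_out   # write outputs
    )
    return flops / bytes_moved


def is_memory_bound(batch: int, d_in: int, d_out: int,
                    peak_flops: float, mem_bandwidth: float) -> bool:
    """True when the layer's intensity is below the hardware's FLOP-per-byte ratio."""
    return matmul_intensity(batch, d_in, d_out) < peak_flops / mem_bandwidth
```

With a 4096 x 4096 layer and a batch of one token, the intensity is just under 1 FLOP per byte, because nearly all of the traffic is weight reads. On a hypothetical accelerator with 1e15 FLOP/s of compute and 3e12 bytes/s of memory bandwidth (a ratio of about 333), that workload is memory-bound, while a batch of 1024 reaches an intensity of roughly 683 and is not.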

Original Description

For more information about Stanford's online Artificial Intelligence programs, visit: https://stanford.io/ai

To learn more about enrolling in this course, visit: https://online.stanford.edu/courses/cs336-language-modeling-scratch

To follow along with the course schedule and syllabus, visit: https://cs336.stanford.edu/

Percy Liang

Professor of Computer Science (and courtesy in Statistics)

Tatsunori Hashimoto

Assistant Professor of Computer Science

View the course playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
