Stanford CS25: Transformers United V6 I Serving Transformers: Lessons From the Trenches

Stanford Online
Jun 4, 2026

Why It Matters

Efficient inference directly determines the profitability and scalability of AI products, turning expensive model training into sustainable revenue streams.

Key Takeaways

  • Inference drives revenue, while training remains a cost center for businesses.
  • Efficient inference spans hardware, software, and latency optimization.
  • Different LLM application archetypes dictate workload and SLA requirements.
  • Metrics like QPS, token lengths, and cache reuse guide system design.
  • Observability and performance tuning are essential for scalable deployment.

Summary

The lecture moves beyond model training to the practical challenges of serving large language models in production. Charles Frye explains that while training creates the intellectual asset, inference is the revenue engine that turns model weights into usable products, and it requires careful engineering across the entire stack. Key insights include the distinction between training as a cost center and inference as a profit driver, the ubiquity of inference workloads compared with the small number of pre-training providers, and the need to align hardware choices, latency budgets, and cost constraints with specific application archetypes. He outlines three primary use cases (chatbot plus, background agents, and data processors), each with its own SLA expectations and performance metrics such as QPS, token counts, and cache-reuse potential.

Notable moments include a tweet from Sean Wang noting that AI infrastructure teams are “finally getting filthy rich,” underscoring the commercial pull of efficient inference. Charles also shares practical tools such as the LLM Engineer's Almanac workload-definition template and real-world examples of caching strategies that trade latency for cost savings. The broader implication is that engineers who master inference optimization can unlock scalable, profitable AI services, attract venture funding, and meet the growing demand for responsive, cost-effective language-model applications.
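
To make the workload metrics above concrete, here is a minimal Python sketch of how QPS, token counts, and cache reuse might be combined into token-throughput estimates. It is not the template from the lecture: the `Workload` class, its field names, and the example numbers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical workload description; field names are illustrative only."""
    name: str
    qps: float           # requests per second
    input_tokens: int    # average prompt tokens per request
    output_tokens: int   # average generated tokens per request
    cache_reuse: float   # assumed fraction of prompt tokens served from a cache (0 to 1)

    def prefill_tokens_per_second(self) -> float:
        # Prompt tokens that still have to be computed after cache hits.
        return self.qps * self.input_tokens * (1.0 - self.cache_reuse)

    def decode_tokens_per_second(self) -> float:
        # Generated tokens are never cached, so every one of them counts.
        return self.qps * self.output_tokens


# Made-up numbers for a chat-style workload, not figures from the lecture.
chat = Workload(name="chat", qps=10, input_tokens=2000,
                output_tokens=300, cache_reuse=0.5)
print(chat.prefill_tokens_per_second())
print(chat.decode_tokens_per_second())
```

Under these assumed numbers, half of the prompt tokens are served from cache, so the prompt-processing load is halved while the generation load is unchanged. This is one simple way to see why cache-reuse potential belongs in a workload definition alongside QPS and token lengths.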

Original Description

For more information about Stanford’s graduate programs, visit: https://online.stanford.edu/graduate-education
May 28, 2026
Serving Transformers: Lessons from the Trenches of Production Inference
This seminar covers insights, lessons, and gnarly scars from serving transformer model inferences at the scale of thousands of GPUs.
Follow along with the seminar schedule. Visit: https://web.stanford.edu/class/cs25/
Guest Speaker: Charles Frye (Modal)
Instructors:
• Steven Feng, Stanford Computer Science PhD student and NSERC PGS-D scholar
• Karan P. Singh, Electrical Engineering PhD student and NSF Graduate Research Fellow in the Stanford Translational AI Lab
• Michael C. Frank, Benjamin Scott Crocker Professor of Human Biology and Director, Symbolic Systems Program
• Christopher Manning, Thomas M. Siebel Professor in Machine Learning, Professor of Linguistics and of Computer Science, Co-Founder and Senior Fellow of the Stanford Institute for Human-Centered Artificial Intelligence (HAI)
