Stanford CS336 Language Modeling From Scratch | Spring 2026 | Lecture 2: PyTorch (Einops)
Why It Matters
Understanding compute and precision trade‑offs lets AI teams train larger models faster and cheaper, directly impacting research productivity and commercial deployment costs.
Key Takeaways
- Training FLOPs ≈ 6 × parameters × tokens gives a quick training-cost estimate
- At ~50% MFU (model FLOPs utilization), H100 GPUs can train a 70B model in roughly 143 days
- BF16 is the practical sweet spot between memory use and numerical stability
- Mixed-precision training keeps optimizer states in FP32 while using BF16 for activations and gradients
- einops names tensor dimensions explicitly, reducing manual-indexing errors
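The FLOPs-based estimate above can be sketched as a few lines of arithmetic. In this sketch the token count, GPU count, and per-GPU peak throughput are illustrative assumptions (they are not stated in this summary), chosen to show how the 6 × N × T rule turns into a wall-clock estimate:

```python
# Back-of-envelope training time from the 6 * N * T FLOPs rule.
# GPU count, token count, and peak throughput below are assumptions for illustration.
def training_days(params: float, tokens: float, n_gpus: int,
                  peak_flops_per_sec: float, mfu: float) -> float:
    total_flops = 6 * params * tokens                 # forward + backward estimate
    effective_rate = n_gpus * peak_flops_per_sec * mfu
    return total_flops / effective_rate / 86_400      # 86,400 seconds per day

# Example: 70B parameters, 15T tokens (assumed), 1024 H100s (assumed)
# at ~989 TFLOP/s BF16 peak, 50% MFU -> on the order of 140+ days.
days = training_days(70e9, 15e12, 1024, 989e12, 0.5)
```

Plugging in different cluster sizes or MFU values shows how sensitive the schedule is to hardware utilization.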
Summary
The lecture focused on resource accounting for large language‑model training, covering how to estimate compute, memory needs, and precision choices using PyTorch and the einops library. Professor Wang introduced a simple formula—FLOPs ≈ 6 × parameters × tokens—to gauge training cost, then applied it to a 70‑billion‑parameter model on H100 GPUs, arriving at roughly 143 days of compute. He explained tensor memory basics, comparing FP32, FP16, BF16, and newer formats like FP8 and FP4, emphasizing BF16 as the practical sweet spot for most workloads, and described mixed‑precision training, which keeps optimizer states in FP32 while using BF16 for activations and gradients. The session also demonstrated einops for named‑dimension tensor operations, showing how it avoids the pitfalls of manual index manipulation. By mastering these calculations and tools, students can design training pipelines that maximize hardware efficiency, control costs, and scale models responsibly.
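The mixed-precision memory accounting can be made concrete with a short sketch. The byte counts below follow a common recipe (BF16 weights and gradients, plus FP32 master weights and two Adam moments); this is an assumed layout for illustration, and exact allocations vary by framework and optimizer:

```python
# Hedged sketch of per-parameter memory in mixed-precision training with Adam.
# Layout is an assumption (bf16 weights/grads, fp32 master copy + two Adam moments).
BYTES_PER_PARAM = {
    "bf16_weight": 2,
    "bf16_grad": 2,
    "fp32_master_weight": 4,
    "fp32_adam_m": 4,   # first-moment estimate
    "fp32_adam_v": 4,   # second-moment estimate
}

def training_memory_gb(n_params: float) -> float:
    """Approximate memory for parameters + optimizer state, ignoring activations."""
    bytes_per_param = sum(BYTES_PER_PARAM.values())   # 16 bytes per parameter
    return n_params * bytes_per_param / 1e9

# Example: a 70B-parameter model needs ~1120 GB for weights and optimizer state alone.
mem_gb = training_memory_gb(70e9)
```

This is why optimizer states, not the BF16 weights themselves, dominate training memory, and why they must be sharded across many GPUs.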
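The einops point can be illustrated with a minimal sketch. The shapes and axis names here are assumptions for demonstration (NumPy is used for portability; einops works the same way on PyTorch tensors):

```python
# Minimal sketch of einops' named-dimension style; shapes are illustrative.
import numpy as np
from einops import rearrange, reduce

x = np.zeros((2, 3, 4))                      # (batch, seq, hidden)

# Flatten seq and hidden into one axis -- no fragile x.reshape(2, -1) guesswork.
flat = rearrange(x, "b s d -> b (s d)")      # shape (2, 12)

# Mean-pool over the sequence axis, named explicitly in the pattern.
pooled = reduce(x, "b s d -> b d", "mean")   # shape (2, 4)
```

Because the pattern string names every dimension, a shape mismatch fails loudly at the call site instead of silently producing a wrongly-strided tensor.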