Stanford CS336 Language Modeling From Scratch | Spring 2026 | Lecture 7: Parallelism
Why It Matters
Efficient multi‑GPU parallelism unlocks training of trillion‑parameter models while controlling time‑to‑solution and infrastructure expenses, a critical capability for competitive AI development.
Key Takeaways
- Multi‑GPU training scales models beyond single‑GPU memory limits
- Effective parallelism depends on keeping inter‑GPU communication overhead low
- Collective operations like all‑reduce, all‑gather, and reduce‑scatter enable synchronization
- Data, tensor, and pipeline parallelism partition the work differently: across the batch, within layers, and across layers
- Balancing workload across GPUs is crucial to avoid bottlenecks
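To make the three collectives named above concrete, here is a minimal sketch that simulates them in plain Python. It assumes each "rank" is just an entry in a list, so no GPUs or communication library are involved. The function names mirror the usual collective names but are illustrative helpers, not a real library API.

```python
# Simulated collectives: inputs[r] is the data held by rank r.
# Illustrative sketch only; nothing here moves data between devices.

def all_gather(shards):
    """Every rank ends up with the concatenation of all ranks' shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]

def reduce_scatter(inputs):
    """Sum elementwise across ranks; rank r keeps only chunk r of the sum."""
    world_size = len(inputs)
    n = len(inputs[0])
    assert n % world_size == 0, "length must divide evenly across ranks"
    chunk = n // world_size
    summed = [sum(vals) for vals in zip(*inputs)]
    return [summed[r * chunk:(r + 1) * chunk] for r in range(world_size)]

def all_reduce(inputs):
    """Sum elementwise across ranks; every rank receives the full sum."""
    summed = [sum(vals) for vals in zip(*inputs)]
    return [list(summed) for _ in inputs]

# Four ranks, each holding a vector of four values.
inputs = [
    [0, 1, 2, 3],
    [10, 11, 12, 13],
    [20, 21, 22, 23],
    [30, 31, 32, 33],
]

scattered = reduce_scatter(inputs)   # each rank holds one element of the sum
reduced = all_reduce(inputs)         # each rank holds the whole sum
two_step = all_gather(scattered)     # reduce-scatter followed by all-gather

print(scattered)
print(reduced[0])
print(two_step == reduced)
```

The last line checks a useful identity: an all‑reduce produces the same result as a reduce‑scatter followed by an all‑gather.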
Summary
The lecture introduces multi‑GPU parallelism, extending the single‑GPU concepts covered previously to clusters of dozens or thousands of devices. It outlines the hardware hierarchy, from a single GPU’s HBM and caches, to multi‑GPU nodes linked by NVLink/NVSwitch, to multi‑node clusters connected via InfiniBand or Ethernet, and emphasizes that data must travel farther as scale grows. There are two primary motivations for multi‑GPU training: fitting trillion‑parameter models that exceed a single GPU’s memory, and accelerating training of smaller models by distributing the work. The lecturer stresses that adding GPUs is easy, but achieving real speedups requires carefully reducing communication overhead through replication, sharding, and the use of collective communication primitives.
The session walks through the collective operations (broadcast, scatter, gather, reduce, all‑gather, reduce‑scatter, all‑reduce, and all‑to‑all), illustrating each with rank‑based examples. All‑reduce, for instance, sums gradients across ranks and replicates the result on every rank, forming the backbone of data‑parallel training, while all‑to‑all supports dynamic routing in mixture‑of‑experts models.
Understanding these primitives and the trade‑offs among data, tensor, and pipeline parallelism is essential for building efficient large‑scale language models. Balancing the workload and minimizing inter‑GPU traffic translate directly into faster training and lower cloud costs, which lets researchers and enterprises push model sizes further.
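The role of all‑reduce in data‑parallel training can be sketched with a toy example. This is a simulation under stated assumptions, not the lecture's code: the one‑parameter model `y = w * x`, the mean‑squared‑error loss, the learning rate, and the helper names `local_grad` and `all_reduce_sum` are all hypothetical choices made for illustration, and the ranks are simulated as list entries in a single process.

```python
import math

def local_grad(w, xs, ys):
    """Gradient of mean squared error for y = w * x on one rank's shard."""
    n = len(xs)
    return 2.0 / n * sum((w * x - y) * x for x, y in zip(xs, ys))

def all_reduce_sum(values):
    """Simulated all-reduce: every rank receives the sum over all ranks."""
    total = sum(values)
    return [total for _ in values]

# Toy dataset generated from a true weight of 3.0 (an assumption).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.0 * x for x in xs]

w = 0.0          # the weight is replicated on every rank
world_size = 4   # number of simulated ranks
per_rank = len(xs) // world_size

# Shard the batch: each rank sees a different, equally sized slice.
shards = [
    (xs[r * per_rank:(r + 1) * per_rank], ys[r * per_rank:(r + 1) * per_rank])
    for r in range(world_size)
]

# Each rank computes a gradient on its own shard only.
grads = [local_grad(w, sx, sy) for sx, sy in shards]

# All-reduce sums the gradients; dividing by world_size gives the average.
synced = [g / world_size for g in all_reduce_sum(grads)]

# With equal shard sizes, the average matches the full-batch gradient.
full = local_grad(w, xs, ys)
assert all(math.isclose(g, full) for g in synced)

# Every rank applies the same update, so the replicas stay identical.
lr = 0.01
new_weights = [w - lr * g for g in synced]
print(grads, synced[0], new_weights)
```

Because every rank ends the step with the same averaged gradient, the replicated weights never drift apart, and the only communication per step is the single all‑reduce.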