Stanford CS336 Language Modeling From Scratch | Spring 2026 | Lecture 7: Parallelism

April 28, 2026

Stanford Online

Why It Matters

Efficient multi‑GPU parallelism unlocks training of trillion‑parameter models while controlling time‑to‑solution and infrastructure expenses, a critical capability for competitive AI development.

Key Takeaways

•Multi‑GPU training scales models beyond single‑GPU memory limits
•Effective parallelism requires minimizing inter‑GPU communication overhead
•Collective operations like all‑reduce, all‑gather, and reduce‑scatter enable synchronization
•Data, tensor, and pipeline parallelism divide the work differently: across the batch, within layers, and across layers
•Balancing workload across GPUs is crucial to avoid bottlenecks
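
To make the collective operations named above concrete, here is a minimal single‑process sketch that simulates ranks as entries in a Python list. The function names mirror the usual collective names, but this is an illustration written for this summary and not the lecture's code or any library's API. It assumes every rank holds a vector of the same length, divisible by the number of ranks.

```python
from typing import List

Vector = List[float]


def all_gather(shards: List[Vector]) -> List[Vector]:
    """Each rank contributes one shard; every rank ends with all shards concatenated."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def reduce_scatter(inputs: List[Vector]) -> List[Vector]:
    """Sum elementwise across ranks, then leave rank r holding only chunk r of the sum."""
    world_size = len(inputs)
    n = len(inputs[0])
    assert n % world_size == 0, "vector length must divide evenly across ranks"
    chunk = n // world_size
    summed = [sum(vec[i] for vec in inputs) for i in range(n)]
    return [summed[r * chunk:(r + 1) * chunk] for r in range(world_size)]


def all_reduce(inputs: List[Vector]) -> List[Vector]:
    """Every rank ends with the full elementwise sum.

    Built here as reduce-scatter followed by all-gather, a standard decomposition.
    """
    return all_gather(reduce_scatter(inputs))


# Two simulated ranks, each holding its own local gradient vector (made-up values).
grads = [
    [1.0, 2.0, 3.0, 4.0],      # rank 0
    [10.0, 20.0, 30.0, 40.0],  # rank 1
]

print(reduce_scatter(grads))  # [[11.0, 22.0], [33.0, 44.0]]
print(all_reduce(grads))      # [[11.0, 22.0, 33.0, 44.0], [11.0, 22.0, 33.0, 44.0]]
```

On real hardware the same semantics are carried out over NVLink or the network, so the amount of data each rank sends and receives is what determines the cost.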

Summary

The lecture introduces multi‑GPU parallelism, extending the single‑GPU concepts covered previously to clusters of dozens or thousands of devices. It outlines the hardware hierarchy, from a single GPU’s HBM and caches, to multi‑GPU nodes linked by NVLink and NVSwitch, to multi‑node clusters connected via InfiniBand or Ethernet, and emphasizes that data must travel farther as scale grows.

The lecture gives two primary motivations for multi‑GPU training: fitting trillion‑parameter models that exceed a single GPU’s memory, and accelerating training of smaller models by distributing the work. The professor stresses that adding GPUs is easy, but achieving real speedups requires carefully reducing communication overhead through replication, sharding, and the use of collective communication primitives.

The session walks through the collective operations (broadcast, scatter, gather, reduce, all‑gather, reduce‑scatter, all‑reduce, and all‑to‑all), illustrating each with rank‑based examples. All‑reduce, for instance, sums gradients across ranks and replicates the result on every rank, forming the backbone of data‑parallel training, while all‑to‑all supports dynamic routing in mixture‑of‑experts models.

Understanding these primitives and the trade‑offs among data, tensor, and pipeline parallelism is essential for building efficient large‑scale language models. Balancing the workload and minimizing inter‑GPU traffic translate directly into faster training and lower cloud costs, enabling researchers and enterprises to push model sizes further.
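
As an illustration of why all‑reduce underpins data‑parallel training, the sketch below splits a small batch evenly across two simulated ranks, has each rank compute a gradient on its own shard, and then averages the results. The one‑parameter model, the data values, and the helper name `local_grad` are assumptions made up for this example; the all‑reduce is simulated with a plain sum inside one process.

```python
from typing import List


def local_grad(w: float, xs: List[float], ys: List[float]) -> float:
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2.0 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy batch and starting weight.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

world_size = 2
per_rank = len(xs) // world_size  # equal-sized shards, one per rank

# Each rank sees only its own slice of the batch.
shards = [
    (xs[r * per_rank:(r + 1) * per_rank], ys[r * per_rank:(r + 1) * per_rank])
    for r in range(world_size)
]
local = [local_grad(w, sx, sy) for sx, sy in shards]

# Simulated all-reduce (sum): every rank receives the same total,
# then divides by the number of ranks to get the average gradient.
total = sum(local)
synced = [total / world_size for _ in range(world_size)]

# Reference: gradient computed on the full batch by a single device.
full = local_grad(w, xs, ys)

print(local)   # per-rank gradients differ
print(synced)  # identical on every rank after the all-reduce
print(full)    # matches the synced value
```

Because the shards are the same size, the averaged gradient equals the full‑batch gradient, so every rank applies the same update and the model replicas stay in sync.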

Original Description

For more information about Stanford's online Artificial Intelligence programs, visit: https://stanford.io/ai

To learn more about enrolling in this course, visit: https://online.stanford.edu/courses/cs336-language-modeling-scratch

To follow along with the course schedule and syllabus, visit: https://cs336.stanford.edu/

Percy Liang

Professor of Computer Science (and courtesy in Statistics)

Tatsunori Hashimoto

Assistant Professor of Computer Science

View the course playlist: https://www.youtube.com/playlist?list=PLoROMvodv4rMqXOcazWaTUHhq-yembLCV
