Stanford CS336 Language Modeling From Scratch | Spring 2026 | Lecture 6: Kernels, Triton, XLA
Why It Matters
Understanding GPU kernel mechanics directly translates to faster, more cost‑effective AI model training, giving firms a competitive edge in large‑scale language modeling.
Key Takeaways
- GPU memory hierarchy: registers, L1, L2, HBM, ordered from fastest and smallest to slowest and largest.
- Thread blocks let threads communicate through shared memory, avoiding costly HBM accesses.
- Warps execute in lockstep; control-flow divergence and shared-memory bank conflicts degrade performance.
- Occupancy is limited by registers per thread, block size, and the number of warps an SM can keep resident; these must be balanced.
- Coalesced memory access and bank-conflict avoidance are key to efficient kernels.
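The last two takeaways can be made concrete with a toy model. The sketch below is not GPU code: it only counts which shared-memory bank or global-memory segment each of a warp's 32 threads would touch. The 32 banks of 4-byte words, the 128-byte transaction size, and the 32x32 float32 tile are assumptions chosen for illustration, and the helper names are hypothetical.

```python
# Toy model of two access-pattern effects; hardware constants are assumptions.
WARP_SIZE = 32     # threads per warp (from the lecture summary)
NUM_BANKS = 32     # assumed: 32 shared-memory banks, each one 4-byte word wide
WORD_BYTES = 4     # assumed: float32 elements
LINE_BYTES = 128   # assumed: size of one global-memory transaction


def max_bank_conflict(word_indices):
    """Largest number of distinct words that fall into the same bank."""
    per_bank = {}
    for w in set(word_indices):  # identical words are counted once
        bank = w % NUM_BANKS
        per_bank[bank] = per_bank.get(bank, 0) + 1
    return max(per_bank.values())


def lines_touched(element_indices):
    """Number of distinct LINE_BYTES-sized segments a warp's loads touch."""
    return len({(i * WORD_BYTES) // LINE_BYTES for i in element_indices})


COLS = 32  # hypothetical 32x32 float32 tile stored row-major

# Shared memory: thread t reads element (0, t) of a row vs. (t, 0) of a column.
row_access = [0 * COLS + t for t in range(WARP_SIZE)]
col_access = [t * COLS + 0 for t in range(WARP_SIZE)]
row_conflict = max_bank_conflict(row_access)
col_conflict = max_bank_conflict(col_access)

# Global memory: thread t reads element t vs. element 32 * t.
coalesced_lines = lines_touched([t for t in range(WARP_SIZE)])
strided_lines = lines_touched([t * 32 for t in range(WARP_SIZE)])

print(row_conflict, col_conflict)        # prints: 1 32
print(coalesced_lines, strided_lines)    # prints: 1 32
```

In this model the row read spreads over all 32 banks, while the column read lands entirely in one bank and would be serialized 32 ways. Likewise, 32 adjacent float32 loads fit in a single 128-byte segment, whereas a stride of 32 elements touches 32 separate segments for the same amount of useful data.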
Summary
The lecture builds on a prior overview of GPU architecture and focuses on practical kernel development with Triton and on performance profiling. It revisits the memory hierarchy (registers, L1 and L2 caches, and high-bandwidth memory, HBM) and emphasizes that the fastest levels are also the smallest and are local to each streaming multiprocessor (SM). The programming model is explained in terms of threads, thread blocks (CTAs), and grids, highlighting how thread blocks map to SMs and allow shared-memory communication for operations like softmax or matrix multiplication.
Key insights include the role of warps: groups of 32 threads that execute the same instruction in lockstep, which makes control-flow divergence costly. The lecture details how the warp scheduler hides latency by switching between resident warps, and how register usage and block size determine warp occupancy. One worked example computes an occupancy of about 18% for a block using 160 registers per thread. Another describes shared-memory bank conflicts, which serialize accesses when multiple threads target the same bank; when every thread in a warp reads the same column of a matrix, the result is a 32-way conflict.
The instructor also reviews memory coalescing, in which the HBM accesses of a warp's 32 threads are combined into a single cache-line transaction when they fall in the same cache line, and stresses that poor access patterns waste bandwidth.
The implication is that high performance on modern GPUs requires hardware awareness: choosing thread-block dimensions, minimizing divergence, managing register use, and aligning memory accesses. With these concepts in hand, developers can write Triton kernels that make full use of tensor cores and HBM bandwidth, which translates into faster language-model training and inference.
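The occupancy arithmetic mentioned in the summary can be sketched as follows. This is a simplified model, not the lecture's own calculation: the limits of 65,536 registers and 64 resident warps per SM, and the block size of 128 threads, are assumptions, and the function ignores shared-memory and block-count limits that also constrain real kernels.

```python
def occupancy(regs_per_thread, threads_per_block,
              regs_per_sm=65536, max_warps_per_sm=64, warp_size=32):
    """Fraction of an SM's warp slots that can be resident at once.

    regs_per_sm and max_warps_per_sm are assumed values; real limits vary by
    GPU generation. Only register and warp-slot limits are modeled here.
    """
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    regs_per_block = regs_per_thread * warp_size * warps_per_block
    blocks_by_regs = regs_per_sm // regs_per_block         # register limit
    blocks_by_warps = max_warps_per_sm // warps_per_block  # warp-slot limit
    resident_blocks = min(blocks_by_regs, blocks_by_warps)
    return resident_blocks * warps_per_block / max_warps_per_sm


# Hypothetical kernels: same block size, different register pressure.
heavy = occupancy(regs_per_thread=160, threads_per_block=128)
light = occupancy(regs_per_thread=32, threads_per_block=128)

print(heavy)  # prints: 0.1875
print(light)  # prints: 1.0
```

Under these assumptions, 160 registers per thread leave room for only three 128-thread blocks (12 warps out of 64), which is in the neighborhood of the roughly 18% figure quoted above, though the lecture's exact inputs are not given in this summary. Fewer resident warps give the scheduler fewer candidates to switch to while others wait on memory.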