CUDA Programming for NVIDIA H100s – Comprehensive Course

freeCodeCamp · Apr 9, 2026

Why It Matters

Equipping engineers with free, expert‑level H100 CUDA skills accelerates AI development, cuts training costs, and strengthens a company’s competitive edge in high‑performance computing.

Key Takeaways

  • Free 24‑hour course teaches high‑performance CUDA on NVIDIA H100.
  • Covers WGMMA pipelines, CUTLASS optimizations, and asynchronous execution.
  • Includes multi‑GPU scaling and NCCL primitives for trillion‑parameter models.
  • Requires C++ basics, prior CUDA knowledge, and linear‑algebra fundamentals.
  • Emphasizes mental models, AI assistance, and step‑by‑step kernel design.

Summary

The video introduces a free, intensive 24‑hour curriculum that bridges basic coding and high‑performance CUDA programming on NVIDIA Hopper H100 GPUs. Aimed at engineers and senior technical leaders, the course offers hands‑on instruction in building efficient WGMMA pipelines, leveraging CUTLASS optimizations, and mastering the asynchronous execution model that powers modern AI workloads. The material spans the full hardware‑software stack: from the H100 architecture and Tensor Memory Accelerator (TMA), through low‑level PTX instructions such as cp.async.bulk, to multi‑GPU scaling with NCCL and the parallelism strategies used to train trillion‑parameter models.

Prerequisites include solid C++ fundamentals, prior CUDA kernel experience, and a working knowledge of matrix multiplication and transformer concepts. The instructor stresses that no comparable free resource exists, positioning the course as a democratizing effort, and emphasizes mental models, AI assistants, and a step‑by‑step learning staircase, warning against trying to master every optimization at once. Worked examples include dissecting CUTLASS source code and applying WGMMA sparse tensor operations.

For businesses, the training equips developers to extract near‑peak performance from H100 hardware, accelerating model training and reducing reliance on costly consulting or proprietary tutorials. Companies adopting the curriculum can expect faster time‑to‑market for AI products and a deeper internal talent pool on cutting‑edge GPU platforms.
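
The asynchronous copy mechanics the summary alludes to can be previewed with the portable libcu++ API before dropping to raw cp.async.bulk PTX. Below is a minimal sketch of that pattern; the kernel name stage_tile, the 256‑float tile, and the toy epilogue are illustrative assumptions, not taken from the course. It assumes a launch with 256 threads per block and an input length that is a multiple of 256.

    #include <cooperative_groups.h>
    #include <cuda/barrier>

    namespace cg = cooperative_groups;

    // Illustrative kernel (not from the course): stage one 256-float
    // tile through shared memory with an asynchronous copy.
    __global__ void stage_tile(const float* in, float* out) {
        __shared__ float tile[256];

        // Block-scoped arrive/wait barrier from libcu++.
        #pragma nv_diag_suppress static_var_with_dynamic_init
        __shared__ cuda::barrier<cuda::thread_scope_block> bar;
        auto block = cg::this_thread_block();
        if (block.thread_rank() == 0) init(&bar, block.size());
        block.sync();

        // Issue the copy; completion is signaled through the barrier,
        // so independent work can overlap with the transfer.
        int base = blockIdx.x * 256;
        cuda::memcpy_async(block, tile, in + base, sizeof(float) * 256, bar);

        bar.arrive_and_wait();           // transfer done; tile is readable

        int i = block.thread_rank();
        out[base + i] = tile[i] * 2.0f;  // toy epilogue
    }

On Hopper, the course shows how the same producer/consumer idea is expressed directly with TMA bulk copies and mbarriers for far higher throughput.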

Original Description

Learn CUDA programming for NVIDIA Hopper GPUs. You will build efficient WGMMA pipelines and leverage CUTLASS optimizations to perform the massive matrix multiplications that power modern AI. Beyond single-chip performance, the curriculum covers multi-GPU scaling and the NCCL primitives necessary for training trillion-parameter models. To get the most out of these lessons, you should have a foundational grasp of C++ syntax and linear algebra, particularly how matrices are tiled and multiplied.
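
To make that tiling prerequisite concrete, here is a minimal plain-CUDA sketch of a shared-memory tiled matrix multiply; the kernel name, the 16×16 tile, and the assumption that n is a multiple of the tile width are illustrative, not taken from the course. WGMMA and CUTLASS ultimately replace the inner product loop below with tensor-core instructions operating on much larger tiles.

    #define TILE 16

    // Illustrative kernel: C = A * B for square n x n row-major
    // matrices, with n assumed to be a multiple of TILE.
    __global__ void tiled_matmul(const float* A, const float* B,
                                 float* C, int n) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        // March both tiles along the shared dimension.
        for (int k0 = 0; k0 < n; k0 += TILE) {
            As[threadIdx.y][threadIdx.x] = A[row * n + (k0 + threadIdx.x)];
            Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * n + col];
            __syncthreads();         // tiles fully staged in shared memory

            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();         // done reading before the next stage
        }
        C[row * n + col] = acc;
    }
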
✏️ Developed by @Prateek_Shukla
❤️ Support for this channel comes from our friends at Scrimba – the coding platform that's reinvented interactive learning: https://scrimba.com/freecodecamp
0:00:00 Course Introduction
0:07:27 Table of Contents & Course Overview
0:23:30 LESSON 1 — H100 Hopper GPU Architecture
0:25:47 H100 Specifications: HBM3, Bandwidth & Power
0:26:22 Tensor Cores Overview
0:27:18 Tensor Memory Accelerator (TMA)
0:34:44 Transformer Engine
0:34:58 L2 Cache Architecture
0:35:21 GPCs, TPCs & SM Layout
0:37:00 Thread Block Clusters
0:46:22 Distributed Shared Memory
0:52:44 SM Sub-Partitions (SMSPs)
0:54:01 Warp Schedulers & Dispatch Units
1:02:37 Shared Memory & Data Movement
1:12:20 Occupancy
1:32:49 LESSON 2 — Clusters, Data Types, Inline PTX & Pointers
1:32:57 Thread Block Clusters Programming
1:42:11 Configuring Cluster Dimensions
1:48:08 Inline PTX Assembly
1:59:31 State Spaces
2:06:01 Data Types in PTX
2:07:16 Generic Pointers
2:09:59 Address Space Conversion
2:15:14 LESSON 3 — Asynchronicity & Barriers
2:15:22 Introduction to Async Operations
2:28:06 Proxies
2:28:56 Fences & Memory Ordering
2:36:17 Fence Ordering & Visibility
2:38:58 Fence Scopes
2:40:30 Acquire & Release Fences
2:45:18 Expected Count & Thread Arrival
2:46:01 M-Barrier Arrive Operations
2:55:37 M-Barrier PTX Instructions
3:07:21 Barrier Wait Operations
3:10:03 Phase & Parity
3:59:42 Commit Operations
4:10:06 LESSON 4 — CuTensorMap Descriptors
4:16:18 Tensor Shape, Stride & Data Type
4:22:52 Element Stride & Dimensions
4:24:13 Box Dimensions (Tile Size)
4:30:30 Bank Conflicts
4:31:05 Swizzling
4:33:02 Swizzle Formula Deep Dive
4:52:48 Interleave Layouts
5:04:22 Out-of-Bounds Fill (OOB)
5:06:01 LESSON 5 — cp.async.bulk (Async Bulk Copies via TMA)
5:08:04 Bulk Tensor Operations (1D–5D)
5:27:31 Multicast Operations
5:47:44 Prefetch
5:53:41 LESSON 6 — WGMMA Part 1 (Warp Group Matrix Multiply Accumulate)
5:59:18 Warp Groups & Matrix Multiplication
6:03:21 WGMMA Descriptors
6:07:26 Accumulators & Register Reuse
6:30:34 Scale Factors (Scale D, Scale A, Scale B)
6:47:05 Core Matrices & 16×16 Tiles
7:44:03 LESSON 7 — WGMMA Part 2
7:46:02 Commit Groups & Wait Groups
8:04:31 WGMMA with FP8 Data Types
8:48:46 LESSON 8 — Kernel Design
8:50:58 Compute-Bound vs. Memory-Bound Kernels
8:54:10 Warp Specialization
9:08:56 Cooperative vs. Ping-Pong Pipelines
9:09:47 Pipelining Fundamentals
9:12:47 Circular Buffering
9:36:38 Ping-Pong Pipeline Deep Dive
9:37:34 Epilogue Handling in Pipelines
9:43:52 Persistent Scheduling
10:48:13 Split-K & Stream-K Strategies
10:57:20 Data-Parallel Tile Scheduling
11:35:23 Epilogue Fusion (Bias, Activation, Scaling)
11:41:35 Epilogue Operations Overview
12:05:32 CUTLASS SOURCE CODE WALKTHROUGH
13:04:17 Main Loop & Scheduling Policies
13:51:03 Dispatch Policy
15:18:49 SM90 Tile Scheduler
17:58:46 SM90 Epilogue (TMA Warp Specialized)
19:22:42 SM90 Builder
19:44:58 Collective Builder
19:49:56 FAST.CU KERNEL WALKTHROUGH
19:55:19 Main Loop Implementation
20:06:51 Producer Warp Group (Dependence Wall)
20:12:08 Consumer Warp Group
21:29:30 Prologue
21:47:09 MULTI-GPU PROGRAMMING — Part 1
21:56:19 NVSwitch
22:03:23 Topology & System Architecture
22:17:03 NVSwitch, BlueField DPUs & Storage Fabrics
22:37:19 CUDA Peer-to-Peer Communication
22:37:57 MPI (Message Passing Interface)
22:46:29 P2P Limitations & Trade-offs
22:49:29 MULTI-GPU PROGRAMMING — Part 2
22:52:20 SLURM Resource Allocation
22:52:52 PMIx Process Management
23:05:27 NCCL (NVIDIA Collective Communications Library)
23:15:20 NCCL Internals & Ring Algorithm
23:17:43 AllReduce Operations
23:34:18 NCCL Collectives: Broadcast, AllGather, ReduceScatter
23:36:16 Parallelism Strategies: Data, Tensor, Pipeline & Expert Parallelism
24:37:56 Course Conclusion & Next Steps
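
As a taste of the multi-GPU chapters above, the following single-process sketch drives NCCL's AllReduce, the collective at the heart of data-parallel training, across every visible GPU. The buffer size, the eight-device cap, and the in-place reduction are illustrative assumptions, not taken from the course.

    #include <nccl.h>
    #include <cuda_runtime.h>

    int main(void) {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        if (ndev > 8) ndev = 8;          // sketch assumes at most 8 GPUs

        ncclComm_t comms[8];
        float* buf[8];
        cudaStream_t streams[8];
        const size_t count = 1 << 20;    // ~1M floats per GPU

        // One communicator per device (single process, single thread).
        int devs[8];
        for (int i = 0; i < ndev; ++i) devs[i] = i;
        ncclCommInitAll(comms, ndev, devs);

        for (int i = 0; i < ndev; ++i) {
            cudaSetDevice(i);
            cudaMalloc((void**)&buf[i], count * sizeof(float));
            cudaMemset(buf[i], 0, count * sizeof(float));
            cudaStreamCreate(&streams[i]);
        }

        // Sum-reduce in place across all GPUs; the group calls batch
        // the per-device launches into one collective operation.
        ncclGroupStart();
        for (int i = 0; i < ndev; ++i)
            ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                          comms[i], streams[i]);
        ncclGroupEnd();

        for (int i = 0; i < ndev; ++i) {
            cudaSetDevice(i);
            cudaStreamSynchronize(streams[i]);
            ncclCommDestroy(comms[i]);
            cudaFree(buf[i]);
            cudaStreamDestroy(streams[i]);
        }
        return 0;
    }

The course builds from this API surface down to NCCL's ring algorithm and the NVSwitch topology that carries the traffic.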
