Custom GPU kernels have become the linchpin of modern AI models, enabling developers to tailor low‑level operations for specific workloads. On AMD hardware, the ROCm stack offers powerful compute capabilities, yet the traditional build process involves intricate CMake configurations, ABI mismatches, and environment drift. These hurdles often deter teams from exploiting ROCm’s full potential, especially when targeting emerging data types like FP8 that demand precise scaling and quantization logic. By addressing these pain points, the ecosystem can unlock faster training times and lower energy consumption across a broader range of devices.
Hugging Face’s kernel‑builder and kernels libraries introduce a declarative, Nix‑driven workflow that abstracts away the complexity of ROCm compilation. Developers define project metadata in a concise build.toml, specify target architectures such as gfx942, and let the toolchain resolve dependencies, generate reproducible environments via flake.nix, and produce ready‑to‑install Python wheels. This automation not only guarantees that a kernel built on one machine will run identically on another, but also integrates seamlessly with PyTorch through autogenerated bindings. The result is a faster iteration cycle: write HIP code, run a single build command, and obtain a pip‑installable package.
The broader impact extends beyond individual projects. Publishing kernels to the Hugging Face kernels‑community repository creates a shared marketplace where high‑performance implementations, like the RadeonFlow FP8 GEMM kernel that earned the 2025 AMD Developer Challenge Grand Prize, become instantly accessible. Users can benchmark, adopt, or extend these kernels without reinventing the wheel, fostering collaborative innovation across academia and industry. As more teams contribute optimized ROCm kernels, the collective performance baseline rises, encouraging wider AMD GPU adoption in AI research and production environments.
Published November 17, 2025
Abdennacer Badaoui
Daniel Huang
ColorsWind
Zesen Liu
Custom kernels are the backbone of high‑performance deep learning, enabling GPU operations tailored precisely to your workload—whether that’s image processing, tensor transformations, or other compute‑heavy tasks. Compiling these kernels for the right architectures, wiring all the build flags, and integrating them cleanly into PyTorch extensions can quickly become a mess of CMake/Nix, compiler errors, and ABI issues, which is not fun. Hugging Face’s kernel‑builder and kernels libraries make it easy to share these kernels with the kernels‑community, with support for multiple GPU and accelerator backends, including CUDA, ROCm, Metal, and XPU. This ensures your kernels are fast, portable, and seamlessly integrated with PyTorch.
In this guide we focus exclusively on ROCm‑compatible kernels and show how to build, test, and share them using kernel‑builder. You’ll learn how to create kernels that run efficiently on AMD GPUs, along with best practices for reproducibility, packaging, and deployment.
This ROCm‑specific walkthrough is a streamlined version of the original kernel‑builder guide. For the broader CUDA‑focused version, see “A Guide to Building and Scaling Production‑Ready CUDA Kernels”.
We will use the GEMM kernel from RadeonFlow_Kernels as an example.
This section was written by the RadeonFlow GEMM kernel authors to introduce the kernel.
Authors: ColorsWind, Zesen Liu, Andy
The RadeonFlow GEMM kernel is a high‑performance, FP8 block‑wise matrix multiplication implementation optimized for the AMD Instinct MI300X GPU. GEMM (General Matrix Multiplication) is the core building block behind most deep‑learning workloads: given two matrices A and B, you compute their product C = A × B. Here it’s implemented in FP8, a low‑precision floating‑point format that trades a bit of accuracy for much higher throughput and lower memory bandwidth. This kernel was developed for the AMD Developer Challenge 2025 and was awarded the 🏆 Grand Prize in June 2025, recognizing its excellence in performance and innovation on AMD hardware.
The kernel operates on quantized inputs using the e4m3fnuz floating‑point format and applies per‑block scaling to preserve accuracy during low‑precision computation. The e4m3fnuz format is an FP8 variant with 4 exponent bits and 3 mantissa bits, designed to be efficient for neural‑network workloads. Because FP8 has a much smaller dynamic range than FP16/FP32, we apply per‑block scaling factors (a_scale and b_scale) so that each block of values is rescaled into a numerically “comfortable” range before and after computation, which helps preserve accuracy despite the low precision.
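To make the block-wise scaling concrete, here is a minimal PyTorch sketch of a quantize/dequantize round trip with one scale per 128-row block along K. It assumes a recent PyTorch with the torch.float8_e4m3fnuz dtype; the helper names (quantize_blockwise, dequantize_blockwise) and the 240 clipping bound (the largest finite e4m3fnuz value) are illustrative and not part of the kernel's API.

import torch

BLOCK = 128  # block size along K, matching the kernel's scale layout

def quantize_blockwise(x):
    # x: (K, M) high-precision input. Each 128-row block of every column gets
    # its own scale so the block fits into the narrow FP8 range.
    K, M = x.shape
    blocks = x.view(K // BLOCK, BLOCK, M)
    scale = blocks.abs().amax(dim=1).clamp(min=1e-12) / 240.0   # (K // BLOCK, M)
    x_fp8 = (blocks / scale.unsqueeze(1)).reshape(K, M).to(torch.float8_e4m3fnuz)
    return x_fp8, scale  # scale has the same layout as a_scale

def dequantize_blockwise(x_fp8, scale):
    K, M = x_fp8.shape
    blocks = x_fp8.float().view(K // BLOCK, BLOCK, M)
    return (blocks * scale.unsqueeze(1)).reshape(K, M)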
It takes the following arguments:
(a, b, a_scale, b_scale, c)
where a and b are the input matrices, a_scale and b_scale are the scaling factors for a and b respectively, and c is the output matrix:
a is K × M in e4m3fnuz
b is K × N in e4m3fnuz
a_scale is (K // 128) × M in fp32
b_scale is (K // 128) × (N // 128) in fp32
c is M × N in bf16
The kernel is precompiled for specific matrix shapes and assumes a transposed memory layout (as required by the competition). To support additional shapes or alternative memory layouts, you must modify the kernel launcher.
So now that we have a high‑performance ROCm kernel, the natural question is: how do we integrate it into a real PyTorch workflow and share it with others? That’s exactly what we’ll cover next, using kernel-builder and kernels to structure, build, and publish the ROCm kernel.
This is a fairly technical guide, but you can still follow it step by step without understanding every detail and everything will work fine. If you’re curious, you can always come back later to dig deeper into the concepts.
The Hugging Face Kernel Builder expects your files to be organized like this:
gemm/
├── build.toml
├── gemm
│ └── gemm_kernel.h
├── flake.nix
└── torch-ext
├── torch_binding.cpp
├── torch_binding.h
└── gemm
└── __init__.py
build.toml – the project manifest; it’s the brain of the build process.
gemm/ – your raw HIP source code where the GPU magic happens.
flake.nix – the key to a perfectly reproducible build environment.
torch‑ext/gemm/ – the Python wrapper for the raw PyTorch operators.
Sometimes your project might depend on other files, like tests or helper scripts, and you can add them without any issues.
In our case the full structure is:
gemm/
├── build.toml
├── gemm
│ ├── gemm_kernel.h
│ ├── gemm_kernel_legacy.h
│ ├── transpose_kernel.h
│ └── gemm_launcher.hip
├── include
│ ├── clangd_workaround.h
│ ├── gpu_libs.h
│ ├── gpu_types.h
│ └── timer.h
├── src/utils
│ ├── arithmetic.h
│ └── timer.hip
├── tests/checker
│ ├── checker.cpp
│ ├── metrics.h
│ └── checker.h
├── flake.nix
└── torch-ext
├── torch_binding.cpp
├── torch_binding.h
└── gemm
└── __init__.py
The original RadeonFlow kernel files are HIP source files with .cpp extensions. As a first step, rename these extensions to either .h or .hip depending on their content and usage:
Use .h for header files containing kernel declarations, inline functions, or template code that will be included in other files.
Use .hip for implementation files containing HIP/GPU code that needs to be compiled separately (e.g., kernel launchers, device functions with complex implementations).
In our example, gemm_kernel.h, gemm_kernel_legacy.h, and transpose_kernel.h are header files, while gemm_launcher.hip is a HIP implementation file. This naming convention helps kernel-builder correctly identify and compile each file type.
build.toml Manifest
[general]
name = "gemm"
universal = false
[torch]
src = [
"torch-ext/torch_binding.cpp",
"torch-ext/torch_binding.h",
]
[kernel.gemm]
backend = "rocm"
rocm-archs = ["gfx942"]
depends = ["torch"]
src = [
"include/clangd_workaround.h",
"include/gpu_libs.h",
"include/gpu_types.h",
"include/timer.h",
"gemm/gemm_kernel.h",
"gemm/gemm_kernel_legacy.h",
"gemm/gemm_launcher.hip",
"gemm/transpose_kernel.h",
"src/utils/arithmetic.h",
"src/utils/timer.hip",
"tests/checker/metrics.h",
]
include = ["include"]
general – basic project settings.
name (required): name of the project, used for the Python package.
universal (optional): set to true for pure‑Python kernels (no compiled files). Default is false.
torch – describes the Torch extension configuration (the Python bindings).
kernel.gemm – specification of a kernel named “gemm”.
backend (required): compute backend, here "rocm".
rocm-archs (required for ROCm): list of target ROCm architectures, e.g., "gfx942" for MI300 series GPUs.
depends (required): list of dependencies; we depend on "torch" for PyTorch tensor operations.
include (optional): additional include directories relative to the project root.
flake.nix Reproducibility File
{
description = "Flake for GEMM kernel";
inputs = {
kernel-builder.url = "github:huggingface/kernel-builder";
};
outputs = { self, kernel-builder, ... }:
kernel-builder.lib.genFlakeOutputs {
inherit self;
path = ./.;
};
}
Together with the flake.lock it generates, this file pins the exact version of kernel-builder and its dependencies, ensuring anyone can build the kernel identically on any machine.
Inside gemm/gemm_launcher.hip we define how the GEMM kernel is launched. Depending on the configuration, we either call the new optimized gemm/gemm_kernel or fall back to the legacy implementation (gemm/gemm_kernel_legacy).
// gemm_launcher.hip – simplified excerpt
#include <hip/hip_runtime.h>
#include "gemm_kernel.h"
#include "gemm_kernel_legacy.h"
#include "transpose_kernel.h"

// Host-side launcher: a and b hold FP8 (e4m3fnuz) data, the scales are fp32,
// and c is bf16. The grid/block setup below is only illustrative; the real
// code derives it from compile-time tile sizes for each supported shape.
void launch_gemm(const __hip_fp8_e4m3_fnuz* a, const __hip_fp8_e4m3_fnuz* b,
                 const float* a_scale, const float* b_scale,
                 __hip_bfloat16* c,
                 int K, int M, int N, hipStream_t stream) {
    dim3 block(256);
    dim3 grid((M + 127) / 128, (N + 127) / 128);
    // Choose implementation based on a compile-time flag
#if USE_LEGACY
    gemm_kernel_legacy<<<grid, block, 0, stream>>>(a, b, a_scale, b_scale, c, K, M, N);
#else
    gemm_kernel<<<grid, block, 0, stream>>>(a, b, a_scale, b_scale, c, K, M, N);
#endif
}
(Full source code is available in the repository.)
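On the Python side, the launcher is exposed through torch_binding.cpp and re-exported by torch-ext/gemm/__init__.py. The following is a rough sketch of what that wrapper usually looks like, assuming the standard kernel-builder layout in which the compiled ops are surfaced through a generated _ops module; the actual file in the repository may differ.

import torch

from ._ops import ops  # generated by kernel-builder; exposes the compiled extension

def gemm(a: torch.Tensor, b: torch.Tensor,
         a_scale: torch.Tensor, b_scale: torch.Tensor,
         c: torch.Tensor) -> torch.Tensor:
    """Block-wise FP8 GEMM: writes the scaled product of a and b into c (bf16)."""
    ops.gemm(a, b, a_scale, b_scale, c)
    return c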
From the project root run:
nix develop # enters a reproducible dev environment defined by flake.nix
nix build . -L # build the kernel
kernel-builder reads build.toml, resolves dependencies, and compiles the HIP sources for the specified ROCm architecture. The resulting Python package can be installed with pip install . inside the development environment.
Create a simple test script test_gemm.py:
import torch
from gemm import gemm  # the Python wrapper generated by kernel-builder

def reference_gemm(a, b, a_scale, b_scale):
    # FP32 reference: undo the per-block quantization, then multiply.
    # Scales cover blocks of 128 along K (and along N for b_scale), so we
    # expand them back to element resolution before applying them.
    a_fp32 = a.float() * a_scale.repeat_interleave(128, dim=0)  # (K, M)
    b_fp32 = b.float() * b_scale.repeat_interleave(128, dim=0).repeat_interleave(128, dim=1)  # (K, N)
    c = a_fp32.T @ b_fp32  # (M, N)
    return c.to(torch.bfloat16)

# Random inputs in the transposed layout the kernel expects.
# Use a (K, M, N) combination the kernel was precompiled for; these values are illustrative.
# Requires a PyTorch build with FP8 support (torch.float8_e4m3fnuz); on ROCm builds
# of PyTorch, AMD GPUs are addressed via the "cuda" device.
K, M, N = 1024, 512, 256
a = torch.randn(K, M, device="cuda").to(torch.float8_e4m3fnuz)
b = torch.randn(K, N, device="cuda").to(torch.float8_e4m3fnuz)
a_scale = torch.rand(K // 128, M, dtype=torch.float32, device="cuda") + 0.5
b_scale = torch.rand(K // 128, N // 128, dtype=torch.float32, device="cuda") + 0.5
c = torch.empty(M, N, dtype=torch.bfloat16, device="cuda")

gemm(a, b, a_scale, b_scale, c)  # call the compiled kernel; c is the output
c_ref = reference_gemm(a, b, a_scale, b_scale)
print("Max absolute error:", (c.float() - c_ref.float()).abs().max().item())
Running the script should report an error that is small relative to the magnitude of the outputs; some loss is expected from FP8 quantization and bf16 rounding, but a result close to the FP32 reference confirms that the kernel works as expected.
Create a repository under your Hugging Face organization (e.g., hf.co/kernels-community/gemm).
Push the source tree (including build.toml, flake.nix, and all source files).
Add a README.md describing usage, licensing, and performance numbers.
Tag a release (e.g., v0.1.0) so users can pin a specific version, and push the compiled build outputs for the ROCm variants you support so that prebuilt binaries are available alongside the sources.
Once published, users can load the kernel directly from the Hub with the kernels library (see the sketch below) or, if they have cloned the repository, install it locally from the source tree:
pip install .
Either way, the kernel is then usable from any PyTorch codebase.
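For the Hub route, the kernels library fetches and loads the prebuilt binaries at runtime. A minimal sketch, assuming the repository name above and that compiled ROCm builds have been pushed to it:

import torch
from kernels import get_kernel

# Downloads the prebuilt ROCm binaries from the Hub and loads them locally
gemm_module = get_kernel("kernels-community/gemm")

# Same calling convention as the locally built extension:
#   gemm_module.gemm(a, b, a_scale, b_scale, c)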
Version Pinning – Keep the flake.nix inputs pinned to exact commit hashes to guarantee reproducibility.
Testing Across Architectures – If you target multiple ROCm GPUs (e.g., gfx90a for MI200‑series and gfx942 for MI300‑series), list them all in rocm-archs.
Documentation – Provide clear examples (like the test script above) and explain the scaling factors, especially for users unfamiliar with FP8.
Performance Benchmarking – Include a benchmark table comparing FP8 GEMM vs. FP16/FP32 on the same hardware; this helps users decide when to adopt the kernel.
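As a starting point for such a table, a minimal timing sketch is shown below. It compares the FP8 kernel against a plain bf16 torch.matmul on the same device; the shapes, warm-up counts, and the local gemm wrapper are illustrative assumptions carried over from the test script above.

import torch
from gemm import gemm  # local wrapper from the test above

def time_ms(fn, iters=50, warmup=10):
    # Average milliseconds per call, measured with HIP/CUDA events.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

K, M, N = 1024, 512, 256  # illustrative; use shapes the kernel was precompiled for
a = torch.randn(K, M, device="cuda").to(torch.float8_e4m3fnuz)
b = torch.randn(K, N, device="cuda").to(torch.float8_e4m3fnuz)
a_scale = torch.rand(K // 128, M, device="cuda") + 0.5
b_scale = torch.rand(K // 128, N // 128, device="cuda") + 0.5
c = torch.empty(M, N, dtype=torch.bfloat16, device="cuda")

a16 = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
b16 = torch.randn(K, N, device="cuda", dtype=torch.bfloat16)

print(f"FP8 GEMM kernel  : {time_ms(lambda: gemm(a, b, a_scale, b_scale, c)):.3f} ms")
print(f"bf16 torch.matmul: {time_ms(lambda: torch.matmul(a16, b16)):.3f} ms")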
By following the steps above you can turn a high‑performance ROCm kernel into a shareable, reproducible Python package using Hugging Face’s kernel-builder and kernels ecosystem. The workflow handles everything from project layout and reproducible builds (via flake.nix) to automatic wheel generation and publishing. With this infrastructure, researchers and engineers can collaborate on cutting‑edge GPU kernels without the usual “it works on my machine” headaches. Happy kernel building!