
RightNow AI Releases AutoKernel: An Open-Source Framework that Applies an Autonomous Agent Loop to GPU Kernel Optimization for Arbitrary PyTorch Models
Why It Matters
AutoKernel democratizes high‑performance GPU tuning, turning weeks of expert work into an overnight, automated process that can accelerate large‑scale models without specialized engineers. Its model‑wide profiling ensures that speedups translate into real end‑to‑end gains, reshaping how ML teams deploy transformer workloads.
Key Takeaways
- 300-400 kernel experiments run overnight on a single GPU
- Five-stage harness guarantees numerical correctness before measuring speedups
- Memory-bound kernels see >2× gains over torch.compile
- Amdahl-driven profiling focuses effort on runtime-dominant kernels
Pulse Analysis
AutoKernel’s core innovation lies in mechanizing the expert kernel‑engineer workflow—write, benchmark, keep or revert—through an LLM‑controlled loop. By treating each code edit as a git commit and logging results in a simple TSV file, the framework creates a reproducible, auditable optimization pipeline. This approach eliminates the steep learning curve of CUDA and Triton, allowing ML engineers to hand over kernel tuning to an autonomous agent that can explore hundreds of configurations in a single night.
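The write-benchmark-keep-or-revert loop described above can be sketched in a few lines. This is a hedged illustration, not AutoKernel's actual API: `agent_loop`, `fake_benchmark`, and the block-size knob are hypothetical stand-ins for the LLM-proposed code edits, and the TSV logging mirrors the article's description of the results file.

```python
import csv
import io

def agent_loop(candidates, benchmark, baseline_ms, log):
    """Keep-or-revert loop: benchmark each candidate variant, keep it only
    if it beats the current best latency, and log every trial as TSV."""
    writer = csv.writer(log, delimiter="\t")
    writer.writerow(["candidate", "latency_ms", "kept"])
    best, best_ms = None, baseline_ms
    for cand in candidates:
        ms = benchmark(cand)           # run and time this variant
        kept = ms < best_ms            # keep only strict improvements
        if kept:
            best, best_ms = cand, ms   # analogous to committing the edit
        writer.writerow([cand, f"{ms:.3f}", kept])
    return best, best_ms

# Toy stand-in for a real kernel benchmark: latency depends on a
# hypothetical block-size knob, with its optimum at 128.
def fake_benchmark(block_size):
    return abs(block_size - 128) / 32 + 1.0

best, ms = agent_loop([64, 96, 128, 256], fake_benchmark,
                      baseline_ms=3.0, log=io.StringIO())
```

In the real system each kept variant would be a git commit, so a losing edit is reverted simply by checking out the previous commit.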
Beyond raw speed, AutoKernel integrates model‑level profiling to prioritize kernels that dominate GPU time. Using torch.profiler, it quantifies each kernel’s share of total runtime and applies Amdahl’s law to estimate end‑to‑end impact. The orchestrator then caps effort on diminishing‑return kernels, ensuring that the agent’s compute budget is spent where it matters most. This strategic allocation yields substantial overall model acceleration, as demonstrated on H100 hardware, where RMSNorm, softmax, and cross‑entropy kernels achieve up to 5.3× speedups over PyTorch eager and up to 3.4× over torch.compile.
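The Amdahl's-law estimate is simple to state: if a kernel accounts for fraction f of total runtime and is sped up by a factor s, the end-to-end speedup is 1 / ((1 - f) + f / s). A minimal sketch (the 20% fraction below is an illustrative number, not one from the article):

```python
def end_to_end_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: overall speedup when a kernel taking `kernel_fraction`
    of total GPU time is accelerated by `kernel_speedup`x."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)

# A kernel at 20% of runtime, accelerated 5.3x, yields only ~1.19x overall.
print(end_to_end_speedup(0.2, 5.3))

# Even an infinite speedup of that kernel is capped at 1 / (1 - f) = 1.25x,
# which is why the orchestrator stops investing in diminishing-return kernels.
print(end_to_end_speedup(0.2, 1e9))
```

This upper bound of 1 / (1 - f) is exactly what lets the orchestrator cap per-kernel effort: once a kernel's remaining headroom is small, further experiments cannot move the end-to-end number.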
The dual‑backend design—supporting both Triton’s rapid JIT compilation and CUDA C++’s low‑level control—gives AutoKernel flexibility across hardware generations, from NVIDIA’s Hopper and Ampere GPUs to AMD’s MI300 series. Its five‑stage correctness suite, covering smoke tests, shape sweeps, adversarial stability, determinism, and edge‑case dimensions, safeguards against the subtle bugs that often plague aggressive kernel optimizations. By delivering open‑source, reproducible, and correctness‑first GPU acceleration, AutoKernel positions itself as a catalyst for faster AI research and production deployments, lowering the barrier for organizations to extract maximum performance from their existing GPU fleets.
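The five-stage correctness suite can be illustrated with a small harness that compares a candidate kernel against a trusted reference across the stages named above. This is a hedged sketch, not AutoKernel's harness: the function names are hypothetical, and a pure-Python softmax stands in for real GPU kernels so the structure stays visible.

```python
import math
import random

def ref_softmax(xs):
    """Numerically stable reference softmax over a 1-D list."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def run_correctness_suite(candidate, ref, atol=1e-6):
    """Sketch of a five-stage harness: every stage must pass before any
    speedup measurement for `candidate` is trusted."""
    rng = random.Random(0)

    def check(xs):
        return all(abs(a - b) <= atol
                   for a, b in zip(candidate(xs), ref(xs)))

    stages = {}
    # 1. Smoke test: one small random input.
    stages["smoke"] = check([rng.uniform(-1, 1) for _ in range(8)])
    # 2. Shape sweep: a range of input lengths.
    stages["shapes"] = all(
        check([rng.uniform(-1, 1) for _ in range(n)])
        for n in (1, 7, 32, 1000))
    # 3. Adversarial stability: overflow-prone magnitudes.
    stages["adversarial"] = check([1e30, -1e30, 0.0, 700.0])
    # 4. Determinism: identical input must give identical output.
    xs = [rng.uniform(-1, 1) for _ in range(64)]
    stages["determinism"] = candidate(xs) == candidate(xs)
    # 5. Edge-case dimensions: a single-element input.
    stages["edge"] = check([5.0])
    return stages

# Demo: the reference trivially passes its own suite.
results = run_correctness_suite(ref_softmax, ref_softmax)
```

Gating speedup claims behind all five stages is what protects the loop from "fast but wrong" variants, e.g. a softmax that skips the max-subtraction trick and overflows on the adversarial inputs.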