GPU Autoscaling on Kubernetes with KEDA: Building an External Scaler

CNCF Blog · May 27, 2026

Why It Matters

By aligning autoscaling decisions with actual GPU usage, enterprises can cut cloud costs, reduce carbon emissions, and improve LLM inference latency. The approach demonstrates how CNCF projects can evolve to support emerging AI infrastructure without redesigning core components.

Key Takeaways

  • DaemonSet runs on each GPU node, exposing metrics via gRPC
  • Profiles map common AI workloads to sensible GPU thresholds
  • ExternalScaler lets KEDA drive HPA based on GPU utilization
  • Mock collector mode enables testing without physical GPUs
  • Open‑source implementation simplifies adoption for Kubernetes clusters
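
The profile idea in the takeaways can be pictured as a small lookup table with user overrides. The sketch below is a minimal illustration in Python, not the project's actual configuration: the profile names echo the workloads the article mentions, but every threshold value, every field name, and the `resolve_profile` helper are assumptions made for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GpuProfile:
    """Hypothetical per-workload scaling thresholds (all values are illustrative)."""
    target_utilization: float      # GPU utilization (%) to aim for per replica
    activation_utilization: float  # utilization (%) above which the workload counts as active
    aggregation: str               # how to combine GPUs on one node: "avg" or "max"


# Assumed defaults; the real project may use different names and numbers.
PROFILES = {
    "vllm-inference": GpuProfile(70.0, 5.0, "max"),
    "triton-serving": GpuProfile(65.0, 5.0, "avg"),
    "training": GpuProfile(90.0, 10.0, "avg"),
    "batch": GpuProfile(85.0, 1.0, "avg"),
}


def resolve_profile(name: str, **overrides) -> GpuProfile:
    """Look up a profile and apply user overrides on top of its defaults."""
    if name not in PROFILES:
        raise KeyError(f"unknown profile: {name!r}")
    return replace(PROFILES[name], **overrides)


if __name__ == "__main__":
    print(resolve_profile("vllm-inference"))
    print(resolve_profile("training", target_utilization=80.0))
```

Keeping the defaults immutable and returning a modified copy means one workload's override cannot leak into another workload that shares the same profile.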

Pulse Analysis

Enterprises deploying large language models or inference stacks on Kubernetes face a mismatch: by default the Horizontal Pod Autoscaler (HPA) reacts to CPU and memory, while the real bottleneck, GPU capacity, remains invisible to it. This mismatch inflates cloud spend and drives unnecessary power consumption, a growing concern as organizations pledge greener operations. KEDA's traditional scalers are built without CGO support, so they cannot query the NVIDIA Management Library (NVML) directly, which leaves a gap for AI-centric workloads that demand precise hardware awareness.
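
To see why CPU-driven scaling misfires on a GPU-bound service, it helps to plug numbers into the replica formula from the Kubernetes HPA documentation, `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`. The Python sketch below applies that formula to made-up readings. The utilization figures and targets are assumptions, and the HPA's tolerance band, replica bounds, and stabilization windows are deliberately left out.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float) -> int:
    """Core HPA formula, ignoring tolerance, min/max bounds, and stabilization."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    return math.ceil(current_replicas * current_value / target_value)


if __name__ == "__main__":
    replicas = 4
    # Hypothetical inference pods: CPUs mostly idle, GPUs nearly saturated.
    cpu_util, cpu_target = 20.0, 50.0
    gpu_util, gpu_target = 95.0, 70.0

    print("CPU-based decision:", desired_replicas(replicas, cpu_util, cpu_target))
    print("GPU-based decision:", desired_replicas(replicas, gpu_util, gpu_target))
```

With these invented numbers the CPU signal asks to shrink from 4 replicas to 2, while the GPU signal asks to grow to 6. The two metrics pull in opposite directions on the same workload.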

The keda‑gpu‑scaler bridges that gap with a per‑node DaemonSet that leverages the go‑nvml library to pull real‑time GPU metrics—utilization, memory usage, temperature and power draw. Each pod serves these stats over gRPC using KEDA’s ExternalScaler interface, allowing the KEDA operator to make HPA decisions as if the metrics were native. Pre‑configured profiles for vLLM inference, Triton serving, training jobs, and batch processing provide out‑of‑the‑box thresholds, while users can customize aggregation across multi‑GPU nodes. Helm charts streamline deployment, and a mock collector mode lets teams validate scaling logic without physical GPUs.
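
The pieces described above (a collector, per-node aggregation, and the scaler calls KEDA makes) can be outlined without any GPU or gRPC dependency. The Python sketch below is an illustration under stated assumptions, not the project's code: the method names mirror the `IsActive`, `GetMetricSpec`, and `GetMetrics` RPCs of KEDA's external scaler contract, while `StreamIsActive`, the gRPC transport, and the NVML-backed collector are omitted. The class names, the `gpu_utilization` metric name, and all readings are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Protocol

METRIC_NAME = "gpu_utilization"  # assumed metric name, for illustration only


@dataclass(frozen=True)
class GpuSample:
    utilization: float    # percent, 0-100
    memory_used_mib: int
    temperature_c: int
    power_w: float


class Collector(Protocol):
    def collect(self) -> List[GpuSample]: ...


class MockCollector:
    """Stand-in for an NVML-backed collector; returns fixed, made-up readings."""

    def __init__(self, utilizations: List[float]):
        self._utilizations = list(utilizations)

    def collect(self) -> List[GpuSample]:
        return [GpuSample(u, 1024, 50, 100.0) for u in self._utilizations]


class GpuScaler:
    """Plain-Python outline of the external scaler calls (no gRPC wiring)."""

    def __init__(self, collector: Collector, target: float = 70.0,
                 activation: float = 5.0, aggregation: str = "avg"):
        self.collector = collector
        self.target = target
        self.activation = activation
        self.aggregation = aggregation

    def _aggregate(self) -> float:
        """Combine per-GPU utilization on this node into one number."""
        values = [s.utilization for s in self.collector.collect()]
        if not values:
            return 0.0
        return max(values) if self.aggregation == "max" else mean(values)

    def is_active(self) -> bool:
        """Mirrors IsActive: False tells KEDA the workload may scale to zero."""
        return self._aggregate() > self.activation

    def get_metric_spec(self) -> dict:
        """Mirrors GetMetricSpec: the target the HPA should aim for."""
        return {"metricName": METRIC_NAME, "targetSize": round(self.target)}

    def get_metrics(self) -> dict:
        """Mirrors GetMetrics: the current aggregated reading."""
        return {"metricName": METRIC_NAME, "metricValue": round(self._aggregate())}


if __name__ == "__main__":
    scaler = GpuScaler(MockCollector([90.0, 30.0]), aggregation="max")
    print(scaler.is_active(), scaler.get_metric_spec(), scaler.get_metrics())
```

Because `GpuScaler` depends only on the `Collector` protocol, the mock can be swapped for a hardware-backed collector without touching the scaling logic, which is the property that makes testing without physical GPUs possible.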

For the broader cloud‑native AI ecosystem, this pattern illustrates how CNCF projects can stay flexible amid rapid hardware evolution. By exposing GPU‑aware scaling, organizations can achieve true scale‑to‑zero, lower operational expenditures, and shrink Scope 3 emissions associated with idle accelerators. The open‑source reference encourages community contributions, paving the way for future extensions such as custom power‑budget policies or integration with multi‑cloud GPU marketplaces, ultimately making Kubernetes a more sustainable platform for AI workloads.
