The Inference Bottleneck: Architecting Kubernetes Autoscaling for Production LLMs

Container Journal, May 15, 2026

Why It Matters

Accurate autoscaling prevents wasted cloud spend and latency spikes, making GenAI services viable at scale.

Key Takeaways

  • HPA memory metrics cause false scaling for LLM inference
  • Token‑centric metrics like TTFT and TPOT drive accurate autoscaling
  • KEDA can query Prometheus for TTFT percentiles to trigger scaling
  • Custom controllers enable sub‑second scaling by listening to request queues

Pulse Analysis

The core of the scaling problem lies in how LLM inference engines manage GPU memory. When a model starts, it loads its weights into VRAM and reserves most of the remaining space for a key‑value (KV) cache, which holds the attention state for every token in each active request's context. Engines such as vLLM preallocate this cache up front and claim almost the entire GPU, so memory utilization sits above 90 percent even when the service is idle. HPA rules that watch memory or CPU therefore misread this healthy steady state as a scaling signal, causing premature pod expansion and unnecessary expense.
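
The arithmetic behind this false signal can be sketched in a few lines of Python. The replica formula is the standard HPA calculation, ceil(currentReplicas × currentMetric / targetMetric). Everything else is an illustrative assumption rather than a figure from the article: an 80 GiB GPU, 16 GiB of weights, an engine that reserves 90 percent of the remainder for the KV cache, and a 70 percent utilization target.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_value: float, target_value: float) -> int:
    """Standard HPA calculation: ceil(currentReplicas * current / target)."""
    return math.ceil(current_replicas * current_value / target_value)


def idle_gpu_memory_utilization(total_gib: float, weights_gib: float, kv_reserved_fraction: float) -> float:
    """Fraction of GPU memory held right after startup, before any request arrives.

    Assumes the engine loads the weights, then preallocates a fixed fraction
    of the remaining memory for the KV cache (hypothetical numbers below).
    """
    kv_cache_gib = (total_gib - weights_gib) * kv_reserved_fraction
    return (weights_gib + kv_cache_gib) / total_gib


# Hypothetical deployment: 80 GiB GPU, 16 GiB of weights, 90% of the rest reserved.
idle_util = idle_gpu_memory_utilization(total_gib=80, weights_gib=16, kv_reserved_fraction=0.9)

# A memory-based rule with a 70% target sees the idle pods as overloaded.
replicas = 2
for step in range(3):
    desired = hpa_desired_replicas(replicas, idle_util, target_value=0.70)
    print(f"step {step}: utilization={idle_util:.2f} replicas {replicas} -> {desired}")
    replicas = desired  # every new pod reports the same utilization, so growth never stops
```

Under these assumptions the idle pods report about 92 percent utilization, and the rule asks for more replicas at every evaluation. Because each new pod preallocates memory in the same way, the metric never falls and the deployment climbs toward its configured maximum with no traffic at all.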

Switching to token‑centric observability resolves the mismatch. Metrics such as Time to First Token (TTFT), Time per Output Token (TPOT), and the depth of the engine's internal batching queue directly reflect user‑perceived latency and system load. Once these metrics are exposed on a Prometheus endpoint, Kubernetes Event‑Driven Autoscaling (KEDA) can evaluate a PromQL expression, for example the 95th‑percentile TTFT over the last minute, and trigger scaling only when service‑level objectives are breached. For ultra‑low‑latency environments, a bespoke controller written with controller‑runtime can ingest request events in real time and adjust replica counts at sub‑second intervals, bypassing the scrape cycle altogether.
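
As a rough illustration of the KEDA approach, the Python sketch below assembles a ScaledObject manifest whose Prometheus trigger fires on 95th‑percentile TTFT. Several details are assumptions, not taken from the article: the histogram name vllm:time_to_first_token_seconds (the name vLLM uses at the time of writing; other engines will differ), the Prometheus address, the Deployment name, the 0.5‑second objective, and the replica bounds. The helper functions are hypothetical. The output is JSON, which kubectl accepts in place of YAML.

```python
import json


def ttft_p95_query(window: str = "1m") -> str:
    """PromQL for 95th-percentile TTFT across all pods over the given window.

    The metric name is an assumption; check what your engine actually exports.
    """
    return (
        "histogram_quantile(0.95, "
        f"sum(rate(vllm:time_to_first_token_seconds_bucket[{window}])) by (le))"
    )


def build_scaled_object(name: str, deployment: str, prometheus_url: str,
                        ttft_slo_seconds: float, min_replicas: int = 1,
                        max_replicas: int = 8) -> dict:
    """Build a KEDA ScaledObject that scales `deployment` on p95 TTFT."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": min_replicas,
            "maxReplicaCount": max_replicas,
            "triggers": [{
                "type": "prometheus",
                # Compare the raw latency to the threshold instead of
                # averaging it across replicas (the default AverageValue).
                "metricType": "Value",
                "metadata": {
                    "serverAddress": prometheus_url,
                    "query": ttft_p95_query(),
                    "threshold": str(ttft_slo_seconds),
                },
            }],
        },
    }


# Hypothetical names and address, for illustration only.
manifest = build_scaled_object(
    name="llm-ttft-scaler",
    deployment="llm-inference",
    prometheus_url="http://prometheus.monitoring.svc:9090",
    ttft_slo_seconds=0.5,
)
print(json.dumps(manifest, indent=2))
```

The Value metric type is a deliberate choice here: latency is a property of the service as a whole, so dividing it by the replica count would understate the breach as the deployment grows.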

Pod‑level scaling is only half the equation. Without available GPU nodes, new replicas remain Pending; with statically over‑provisioned node pools, cost balloons as GPUs sit idle. Integrating the autoscaler with a just‑in‑time provisioner such as Karpenter lets each scale‑up provision the required A100 or H100 instances on demand and de‑provision them when demand subsides. Embedding this workflow in a GitOps pipeline via Argo CD keeps the configuration reproducible and auditable. The shift from a generic DevOps role to an AI Platform Engineer who orchestrates metrics, custom controllers, and dynamic node pools is now a practical necessity for any enterprise deploying production‑grade GenAI.
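
The node‑provisioning half can be sketched the same way. The Python snippet below builds a Karpenter NodePool restricted to GPU instance types, capped by a GPU limit, and set to remove nodes once they are empty. It assumes Karpenter's v1 API on AWS with an existing EC2NodeClass. The pool name, node class name, GPU cap, five‑minute consolidation delay, and the helper function are hypothetical. The instance types p4d.24xlarge (A100) and p5.48xlarge (H100) are examples, not a recommendation from the article.

```python
import json


def build_gpu_node_pool(name: str, node_class: str, instance_types: list[str],
                        max_gpus: int, consolidate_after: str = "5m") -> dict:
    """Build a Karpenter NodePool (karpenter.sh/v1) for on-demand GPU nodes."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    # Assumes an EC2NodeClass with this name already exists.
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": node_class,
                    },
                    "requirements": [
                        {"key": "node.kubernetes.io/instance-type",
                         "operator": "In", "values": list(instance_types)},
                        {"key": "karpenter.sh/capacity-type",
                         "operator": "In", "values": ["on-demand"]},
                    ],
                    # Keep non-GPU workloads off these expensive nodes.
                    "taints": [{"key": "nvidia.com/gpu", "value": "true",
                                "effect": "NoSchedule"}],
                }
            },
            # Hard ceiling on how many GPUs this pool may provision.
            "limits": {"nvidia.com/gpu": str(max_gpus)},
            # Remove nodes once no inference pods are left on them.
            "disruption": {
                "consolidationPolicy": "WhenEmpty",
                "consolidateAfter": consolidate_after,
            },
        },
    }


# Hypothetical names and limits, for illustration only.
node_pool = build_gpu_node_pool(
    name="llm-gpu",
    node_class="gpu-default",
    instance_types=["p4d.24xlarge", "p5.48xlarge"],
    max_gpus=32,
)
print(json.dumps(node_pool, indent=2))
```

With a pool like this in place, a replica that the pod autoscaler adds and that cannot be scheduled becomes the signal for Karpenter to launch a node. Inference pods would need a matching toleration for the taint, and the rendered manifest is the kind of file an Argo CD application would track in Git.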
