Building Inference-as-a-Service on Kubernetes

The DevOps Toolkit (Viktor Farcic) · Mar 23, 2026

Why It Matters

Self‑hosting inference on Kubernetes gives regulated firms data sovereignty and performance control, turning AI from a cloud expense into a strategic, compliant asset.

Key Takeaways

  • Self‑hosted inference on Kubernetes keeps prompts and data inside the corporate network, removing third‑party exposure.
  • GPU‑enabled clusters require careful provisioning to avoid costly waste.
  • Crossplane custom resources simplify cluster and model deployment for teams.
  • Open‑weight Chinese models dominate but raise censorship and safety concerns.
  • Compatibility with OpenAI API enables seamless migration to internal endpoints.
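The last takeaway can be sketched concretely: because the self-hosted endpoint speaks the OpenAI-compatible API, existing client code only needs its base URL changed. A minimal sketch in Python — the internal hostname is hypothetical, and the request body follows the standard chat-completions shape:

```python
import json

# Hypothetical internal endpoint; only the base URL differs from the public API.
BASE_URL = "http://inference.internal.example.com/v1"

def chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible /chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

body = chat_request("qwen", "Summarize our data-retention policy.")
# The same payload works against api.openai.com or the internal service,
# so migration is a one-line base-URL change in client configuration.
print(json.dumps(body))
```

Because vLLM (and most serving runtimes) implement this same request schema, existing SDKs and tools can point at the internal service without code changes.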

Summary

The video walks through building a self‑contained inference‑as‑a‑service platform on Kubernetes, from provisioning GPU‑enabled clusters to deploying the first model. It targets organizations in regulated sectors—healthcare, finance, government—where data must never leave the corporate network, and it demonstrates how a single custom resource can launch a model without requiring GPU expertise from end users.

Key points include the steep cost of GPUs and the risk of mis‑configuration, the dominance of Chinese open‑weight models (e.g., Alibaba’s Qwen, DeepSeek) and their built‑in censorship and safety issues, and the trade‑offs between public API usage and on‑premise hosting. The presenter argues that while public APIs are cheaper for most workloads, self‑hosting becomes compelling when compliance, latency, IP protection, or scale make external services untenable.

A notable example shows a Crossplane composition that abstracts the entire stack—EKS cluster creation, GPU node groups, NVIDIA GPU Operator installation, the vLLM runtime, and ingress—behind a custom API. Deploying the Qwen model requires only a few fields, and the resulting endpoint speaks the OpenAI‑compatible API, allowing existing tools to point at the internal service without code changes.
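A claim against such a composition might look roughly like the following sketch — the API group, kind, and field names here are illustrative assumptions, not the actual schema from the video:

```yaml
# Hypothetical claim; group/kind/field names are illustrative only.
apiVersion: ai.example.org/v1alpha1
kind: InferenceService
metadata:
  name: qwen
spec:
  model: Qwen/Qwen2.5-7B-Instruct   # model ID served by the vLLM runtime
  gpu:
    count: 1                        # size of the GPU node group
  expose:
    host: qwen.inference.internal   # ingress host for the OpenAI-compatible endpoint
```

The point of the abstraction is that end users fill in only these few fields, while the composition handles EKS, node groups, the GPU Operator, and ingress behind the scenes.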

The implication is clear: enterprises can regain control over AI workloads, enforce strict data governance, and avoid vendor lock‑in, but they must invest in robust Kubernetes and GPU orchestration to prevent costly waste. Future videos will explore advanced patterns like multi‑cluster scaling and KV‑cache routing, underscoring the complexity of production‑grade inference.

Original Description

This video walks you through building a fully self-hosted AI inference platform on Kubernetes, giving your organization the ability to run large language models on infrastructure you control. If you're in healthcare, finance, government, or any field where data privacy and regulatory compliance matter, sending prompts through third-party APIs may not be an option — and this guide shows you the alternative. The video covers why inference (as opposed to training or fine-tuning) is the critical piece for most teams, examines the current landscape of open-weight models including the rapid rise of Chinese models like Qwen and DeepSeek, and honestly addresses the trade-offs of self-hosting versus using commercial APIs.
From there, the video moves into a hands-on build using Crossplane and Kubernetes with GPU nodes on AWS. You'll see how to define simple custom resources that let any team in your company provision a GPU-enabled cluster and deploy a model — without needing to understand the underlying complexity of EKS node groups, NVIDIA GPU Operators, or vLLM configuration. By the end, you have a working Inference-as-a-Service platform serving an OpenAI-compatible API endpoint, fully contained within your own network. The video also lays out the architecture and sets the stage for future topics like disaggregated inference, KV-cache routing, autoscaling, and multi-cluster patterns.
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
Sponsor: Kilo Code
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
#SelfHostedAI #KubernetesInference #InferenceAsAService
Consider joining the channel: https://www.youtube.com/c/devopstoolkit/join
▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬
🔗 Crossplane: https://crossplane.io
🎬 Why Self-Hosting AI Models Is a Bad Idea: https://youtu.be/pWtDTkfNaUU
▬▬▬▬▬▬ 💰 Sponsorships 💰 ▬▬▬▬▬▬
If you are interested in sponsoring this channel, please visit https://devopstoolkit.live/sponsor for more information. Alternatively, feel free to contact me over Twitter or LinkedIn (see below).
▬▬▬▬▬▬ ⏱ Timecodes ⏱ ▬▬▬▬▬▬
00:00 AI Inference (Self-Managed)
01:24 Kilo Code (sponsor)
02:53 Self-Hosted AI Inference Explained
13:43 GPU Kubernetes Cluster Setup
15:35 Deploy and Serve LLMs
18:48 Inference Platform Architecture
