DevOps Videos

All News Deals Social Blogs Videos Podcasts Digests

HOW to RUN AI MODELS Locally Using Llama Cpp and Docker

•May 27, 2026

That DevOps Guy (Marcel Dempers)

That DevOps Guy (Marcel Dempers)•May 27, 2026

Why It Matters

The guide makes it straightforward for developers and platform engineers to host large-language models privately and cost-effectively on local hardware, reducing reliance on paid cloud APIs and improving control over latency and data privacy. That capability can lower operational costs and enable on-premise AI workflows for DevOps, SRE, and internal tooling.

Summary

The video demonstrates how to run AI models locally using llama.cpp and Docker, outlining three official Docker image variants—light (CLI), full (everything), and server (server binary)—and advising selection based on hardware (CPU, NVIDIA CUDA, AMD ROCm, Intel options). The presenter shows two practical examples: a CPU-only setup and a GPU-enabled setup that exposes GPU devices, persists model files via mounted directories, and serves models on port 8080. It highlights Hugging Face as the primary model source (e.g., Gemma 4), explains launching the llama server with the -M flag, and demonstrates connecting a client (Open Code) to localhost:8080 and configuring the provider to send prompts. The walkthrough culminates with a live example of the model processing a prompt, performing a web fetch, and returning a response.

Original Description

Comments

Want to join the conversation?

Loading comments...