Google's Free AI Model Running in Docker

That DevOps Guy (Marcel Dempers)
That DevOps Guy (Marcel Dempers)May 26, 2026

Why It Matters

This shows teams and developers can run a powerful, free Google LLM on-premises for lower latency, reduced cloud costs and greater data privacy, while retaining easy integration into local developer workflows. It lowers the barrier to experimenting with and deploying large models without relying on cloud APIs.

Summary

The video demonstrates how to run Google’s Gemma 4 models locally using the lightweight Llama.cpp runtime inside a Docker container. It walks through downloading Gemma 4 variants from Hugging Face (2B, 4B, 26B, 31B), choosing the appropriate Docker image for CPU, NVIDIA CUDA, ROCm (AMD) or Intel graphics, and launching a llama server that exposes the model on port 8080. The host mounts the model files, configures GPU offloading and temperature settings, and then accesses the model via a local UI or developer tools like OpenCode. The presenter emphasizes that even modest hardware can run smaller Gemma variants while larger models require more powerful GPUs.

Original Description

Comments

Want to join the conversation?

Loading comments...