Bill Kennedy at FOSDEM '26: Directly Integrating LLMs Into Go Applications
Why It Matters
By removing external model servers, Go teams can cut infrastructure spend, lower latency, and bring AI capabilities to edge environments, accelerating product development.
Key Takeaways
- Go can directly embed llama.cpp without C or Python dependencies.
- Ron Evans created pure Go FFI bindings for the llama.cpp APIs.
- The new “Kron” API mimics OpenAI endpoints within a single binary.
- Eliminates model servers, enabling lightweight deployment on Cloud Run.
- Local inference reduces cost and latency compared to cloud models.
Summary
At FOSDEM '26, Bill Kennedy unveiled a new approach for integrating large‑language‑model inference directly into Go applications, bypassing traditional model‑server architectures.
He explained how licensing costs and the need to run separate C or Python services have hampered Go developers. Ron Evans' pure‑Go FFI bindings to llama.cpp eliminate the cumbersome cgo layer, and the resulting “Kron” library offers an idiomatic Go API that mirrors OpenAI's HTTP schema while running entirely in‑process.
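To make the in‑process idea concrete, here is a minimal sketch of what an API mirroring OpenAI's chat‑completions schema could look like inside a single Go binary. The type names and the `chat` function are hypothetical stand‑ins, not the actual Kron API; the point is that the request/response shapes stay OpenAI‑compatible while the call is a plain function, not an HTTP round trip to a model server.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Message and ChatRequest mirror the familiar OpenAI chat-completions
// request shape (hypothetical names, not the real Kron types).
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type ChatRequest struct {
	Model    string    `json:"model"`
	Messages []Message `json:"messages"`
}

type ChatResponse struct {
	Content string `json:"content"`
}

// chat is an in-process stand-in for inference: no network hop, no
// external server. A real integration would invoke llama.cpp here.
func chat(req ChatRequest) ChatResponse {
	last := req.Messages[len(req.Messages)-1]
	return ChatResponse{Content: "echo: " + last.Content}
}

func main() {
	req := ChatRequest{
		Model:    "local-model.gguf",
		Messages: []Message{{Role: "user", Content: "hello"}},
	}
	resp := chat(req)
	out, _ := json.Marshal(resp)
	fmt.Println(string(out))
}
```

Because the schema matches what OpenAI clients already expect, callers can keep their request‑building code and swap the transport for a direct function call.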
Kennedy demonstrated the EMA example: loading a model from disk, tokenising input, and streaming output through Go channels. He highlighted features such as automatic llama.cpp updates, concurrent request handling, and the ability to embed models up to ~2 GB, effectively turning a single binary into a full‑featured LLM server.
The breakthrough enables developers to ship lightweight, cost‑effective binaries to Cloud Run, edge devices, or even embedded hardware, shifting the AI stack away from expensive cloud endpoints toward on‑device inference and faster iteration cycles.