Bill Kennedy at FOSDEM '26: Directly Integrating LLMs Into Go Applications
Why It Matters
By removing external model servers, Go teams can cut infrastructure spend, lower latency, and bring AI capabilities to edge environments, accelerating product development.
Key Takeaways
- Go can directly embed llama.cpp without C or Python dependencies.
- Ron Evans created pure Go FFI bindings for the llama.cpp APIs.
- The new “Kron” API mimics OpenAI endpoints within a single binary.
- Eliminates model servers, enabling lightweight deployment on Cloud Run.
- Local inference reduces cost and latency compared to cloud models.
Summary
At FOSDEM '26, Bill Kennedy unveiled a new approach for integrating large‑language‑model inference directly into Go applications, bypassing traditional model‑server architectures.
He explained how licensing costs and the need to run separate C or Python services have hampered Go developers. Ron Evans' pure‑Go FFI bindings to llama.cpp eliminate the cumbersome cgo layer, and the resulting “Kron” library offers an idiomatic Go API that mirrors OpenAI's HTTP schema while running entirely in‑process.
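To make the in‑process idea concrete, here is a minimal sketch of what an API mirroring OpenAI's chat‑completions schema could look like inside a single Go binary. The type names and the `chat` function are hypothetical stand‑ins, not the actual Kron API; the point is that the request/response shapes stay OpenAI‑compatible while the call is a plain function, not an HTTP round trip to a model server.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Message and ChatRequest mirror the familiar OpenAI chat-completions
// request shape (hypothetical names, not the real Kron types).
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type ChatRequest struct {
	Model    string    `json:"model"`
	Messages []Message `json:"messages"`
}

type ChatResponse struct {
	Content string `json:"content"`
}

// chat is an in-process stand-in for inference: no network hop, no
// external server. A real integration would invoke llama.cpp here.
func chat(req ChatRequest) ChatResponse {
	last := req.Messages[len(req.Messages)-1]
	return ChatResponse{Content: "echo: " + last.Content}
}

func main() {
	req := ChatRequest{
		Model:    "local-model.gguf",
		Messages: []Message{{Role: "user", Content: "hello"}},
	}
	resp := chat(req)
	out, _ := json.Marshal(resp)
	fmt.Println(string(out))
}
```

Because the schema matches what OpenAI clients already expect, callers can keep their request‑building code and swap the transport for a direct function call.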
Kennedy demonstrated the EMA example: loading a model from disk, tokenising input, and streaming output through Go channels. He highlighted features such as automatic llama.cpp updates, concurrent request handling, and the ability to embed models up to ~2 GB, effectively turning a single binary into a full‑featured LLM server.
The breakthrough enables developers to ship lightweight, cost‑effective binaries to Cloud Run, edge devices, or even embedded hardware, shifting the AI stack away from expensive cloud endpoints toward on‑device inference and faster iteration cycles.