AI Pulse
New in llama.cpp: Model Management

AI • Hugging Face • December 11, 2025

Companies Mentioned

OpenAI, GitHub, Ollama

Why It Matters

Router mode eliminates downtime and simplifies multi‑model deployments, giving developers faster iteration and more reliable production environments.

Key Takeaways

  • Router mode loads multiple models without server restart
  • Auto‑discovers GGUF files from cache or custom directory
  • LRU eviction caps loaded models, default max four
  • Each model runs in separate process, isolating crashes
  • Supports API endpoints for load, unload, and list

Pulse Analysis

Llama.cpp has become a cornerstone for developers seeking lightweight, OpenAI‑compatible inference on local hardware. By introducing router mode, the project bridges a gap that previously required custom orchestration or heavyweight platforms. The multi‑process architecture mirrors enterprise‑grade service designs, where each model runs in its own sandbox, preventing a single crash from cascading across the entire deployment. The approach echoes the model‑management experience popularized by tools like Ollama while retaining llama.cpp’s low‑overhead footprint.

The new capabilities revolve around three technical pillars: auto‑discovery, on‑demand loading, and LRU eviction. When the server starts without a specific model, it scans the LLAMA_CACHE or a user‑defined folder for GGUF files, instantly making them addressable via the "model" field in API calls. Models load the first time they are requested, then stay resident until the configured limit (default four) is exceeded, at which point the least‑recently‑used model is evicted to free VRAM. Advanced users can fine‑tune each instance through preset INI files, adjusting context size, temperature, or GPU offload without touching the global server flags.

From a business perspective, router mode accelerates experimentation and reduces operational friction. Teams can A/B test model versions, spin up tenant‑specific instances, or switch between specialized models during development without incurring restart latency. The isolated process model also improves reliability, a critical factor for SaaS providers and internal AI platforms. As the open‑source community adopts these features, llama.cpp is poised to become a go‑to solution for scalable, cost‑effective LLM serving on commodity hardware.

New in llama.cpp: Model Management

Authors: Xuan‑Son Nguyen, Victor Mustar

llama.cpp server now ships with router mode, which lets you dynamically load, unload, and switch between multiple models without restarting.

Reminder: llama.cpp server is a lightweight, OpenAI‑compatible HTTP server for running LLMs locally.

This feature was a popular request to bring Ollama‑style model management to llama.cpp. It uses a multi‑process architecture where each model runs in its own process, so if one model crashes, others remain unaffected.


Quick Start

Start the server in router mode by not specifying a model:


llama-server

The server auto‑discovers models from your llama.cpp cache (LLAMA_CACHE or ~/.cache/llama.cpp). If you have previously downloaded models via llama-server -hf user/model, they will be available automatically.
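
For example, a model fetched once with the -hf flag lands in that cache and is then visible to router mode (this reuses the Gemma build referenced later in this post; substitute whatever repository you actually use):


llama-server -hf ggml-org/gemma-3-4b-it-GGUF:Q4_K_M

Stopping that server leaves the GGUF file in LLAMA_CACHE, so a subsequent plain llama-server start will pick it up automatically.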

You can also point to a local directory of GGUF files:


llama-server --models-dir ./my-models


Features

  1. Auto‑discovery – Scans your llama.cpp cache (default) or a custom --models-dir folder for GGUF files.

  2. On‑demand loading – Models load automatically when first requested.

  3. LRU eviction – When you hit --models-max (default: 4), the least‑recently‑used model unloads.

  4. Request routing – The model field in your request determines which model handles it (see the sketch after this list).
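
As a rough sketch of how routing and eviction interact, assume three hypothetical GGUF files named model-a.gguf, model-b.gguf, and model-c.gguf sitting in ./my-models (placeholder names, not files that ship with llama.cpp):


# In one terminal: cap the router at two resident models.
llama-server --models-dir ./my-models --models-max 2

# In another terminal: model-a and model-b load on demand when each is first requested.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "model-a.gguf", "messages": [{"role": "user", "content": "Hi"}]}'

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "model-b.gguf", "messages": [{"role": "user", "content": "Hi"}]}'

# Asking for model-c exceeds --models-max, so the least-recently-used
# model (model-a here) is evicted before model-c is loaded.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "model-c.gguf", "messages": [{"role": "user", "content": "Hi"}]}'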


Examples

Chat with a specific model


curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

On the first request, the server automatically loads the model into memory (loading time depends on model size). Subsequent requests to the same model are instant since it’s already loaded.
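
A quick, unscientific way to see the difference is to time the same request twice; the first call pays the load cost, the second hits the already-resident model:


time curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M", "messages": [{"role": "user", "content": "Hello!"}]}' \
  > /dev/null

# Run the exact same command again: no load step this time, only inference latency.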

List available models


curl http://localhost:8080/models

Returns all discovered models with their status (loaded, loading, or unloaded).

Manually load a model


curl -X POST http://localhost:8080/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model.gguf"}'

Unload a model to free VRAM


curl -X POST http://localhost:8080/models/unload \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model.gguf"}'


Key Options

| Flag                 | Description                                                |
|----------------------|------------------------------------------------------------|
| --models-dir PATH    | Directory containing your GGUF files                       |
| --models-max N       | Max models loaded simultaneously (default: 4)              |
| --no-models-autoload | Disable auto‑loading; require explicit /models/load calls |
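
If you prefer explicit control over what occupies VRAM, one possible workflow (a sketch, reusing the my-model.gguf placeholder from the examples above) is to disable auto-loading and load models by hand:


# Discover models, but never load one implicitly.
llama-server --models-dir ./my-models --no-models-autoload

# In another terminal: load explicitly; chat requests naming
# "my-model.gguf" are then routed to this loaded instance.
curl -X POST http://localhost:8080/models/load \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model.gguf"}'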

All model instances inherit settings from the router:


llama-server --models-dir ./models -c 8192 -ngl 99

All loaded models will use an 8192‑token context and full GPU offload. You can also define per‑model settings using presets:


llama-server --models-preset config.ini


[my-model]
model = /path/to/model.gguf
ctx-size = 65536
temp = 0.7
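
The section name is presumably what you then pass in the model field of a request (an assumption based on the load/unload examples above, where model is simply a name the router recognizes), for example:


curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello!"}]}'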


Also available in the Web UI

The built‑in web UI also supports model switching. Just select a model from the dropdown and it loads automatically.


Join the Conversation

We hope this feature makes it easier to A/B test different model versions, run multi‑tenant deployments, or simply switch models during development without restarting the server.

Have questions or feedback? Drop a comment below or open an issue on GitHub.
