MemKV Provides Distributed Shared Context Memory with MinIO

Tech Field Day, Jun 17, 2026

Why It Matters

MemKV moves the KV cache out of scarce GPU memory and into a shared, scalable tier, dramatically cutting inference latency and enabling enterprises to serve far more AI agents without adding GPUs.

Key Takeaways

  • Distributed KV cache relieves the GPU memory bottleneck for inference
  • MemKV enables up to 43× more concurrent requests per GPU
  • Two MemKV servers deliver ~97 GB/s over an 800 Gbps network
  • A simulated 1,000‑GPU super‑pod sustains 16k concurrent requests
  • Exact‑match KV reuse avoids costly recomputation of context

Summary

The video introduces MemKV, a distributed shared context memory layer from MinIO, designed to relieve GPU memory constraints during large‑scale inference workloads. By offloading the KV cache to a high‑speed NVMe‑backed service, MemKV lets every GPU in a cluster access a common context without repeatedly recomputing it. Key performance figures show up to a 43‑fold increase in concurrent request capacity per GPU, with an 800‑Gbps network delivering roughly 97 GB/s of aggregate throughput using just two MemKV servers and an eight‑GPU H200 node. In a simulated super‑pod of 1,000 GPUs across 125 nodes, the system could sustain 16,000 concurrent requests and approach 12 TB/s of total data flow, which is what linear scaling from the single‑node result predicts.

The presenters stress that the metric most users care about is time‑to‑first‑token (TTFT). When the KV cache for a request is already available to the GPU, TTFT drops dramatically; otherwise, large contexts (100k‑256k tokens) force costly recomputation during pre‑fill. MemKV relies on exact‑match prefix reuse: even a single changed token invalidates the cached entry, which preserves correctness while eliminating redundant work.

For enterprises deploying agentic AI workloads, MemKV translates into lower latency, higher GPU utilization, and the ability to run more simultaneous sessions without adding GPUs. The solution bridges the gap between storage and memory, offering a purpose‑built, low‑latency layer that scales with modern AI inference demands.
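The exact‑match prefix reuse described in the summary can be illustrated with a minimal sketch. It uses chained block hashing, a common way to implement prefix caching in inference servers; the source does not say how MemKV implements matching, so the block size, the `PrefixStore` class, and the key format below are all hypothetical. In this sketch a changed token invalidates its own block and every later block, while earlier blocks still match; whether MemKV reuses those leading blocks is not stated.

```python
import hashlib

BLOCK_TOKENS = 4  # hypothetical block size, kept tiny for demonstration


def block_keys(token_ids):
    """Return one chained hash key per full block of tokens.

    Each key covers the block's tokens and the previous block's key,
    so a key identifies the entire prefix up to the end of that block.
    """
    keys = []
    prev = b""
    full = len(token_ids) - len(token_ids) % BLOCK_TOKENS
    for start in range(0, full, BLOCK_TOKENS):
        block = token_ids[start:start + BLOCK_TOKENS]
        payload = prev + b"|" + ",".join(map(str, block)).encode()
        prev = hashlib.sha256(payload).digest()
        keys.append(prev.hex())
    return keys


class PrefixStore:
    """Hypothetical stand-in for a shared KV cache keyed by exact prefix."""

    def __init__(self):
        self._blocks = {}  # key -> opaque KV payload (placeholder strings here)

    def insert(self, token_ids):
        for i, key in enumerate(block_keys(token_ids)):
            self._blocks.setdefault(key, f"kv-block-{i}")

    def reusable_tokens(self, token_ids):
        """Count leading tokens whose KV blocks can be fetched, not recomputed."""
        hits = 0
        for key in block_keys(token_ids):
            if key not in self._blocks:
                break  # first miss ends the reusable prefix
            hits += 1
        return hits * BLOCK_TOKENS


store = PrefixStore()
session = list(range(1, 13))           # 12 tokens of prior context
store.insert(session)

follow_up = session + [13, 14, 15, 16]  # same context plus a new turn
edited = session.copy()
edited[5] = 999                         # one token changed in the second block
rewritten = [0] + session[1:]           # first token changed

print(store.reusable_tokens(follow_up))
print(store.reusable_tokens(edited))
print(store.reusable_tokens(rewritten))
```

With these inputs the follow‑up request reuses all 12 cached tokens, the edited request reuses only the first block of 4, and the rewritten request reuses none, so every token after the first mismatch has to go through pre‑fill again.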

Original Description

This presentation introduces MinIO MemKV, a critical new memory layer designed to address the challenges of scaling AI inference workloads, particularly for agentic applications. The core problem stems from the increasing size of context memory, known as the KV cache, which frequently exceeds the high-bandwidth memory (HBM) capacity of GPUs. In agentic workloads, where requests build upon previous interactions, this context memory constantly expands. When HBM is exhausted, GPUs resort to evicting and recomputing the KV cache, leading to wasted cycles during the "pre-fill" phase and significantly increasing the "Time To First Token" (TTFT) for users. This inefficient utilization plagues modern inference deployments, where GPUs might appear 100% utilized but are largely performing redundant computations.
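To see why the KV cache outgrows HBM, a rough sizing calculation helps. The sketch below uses the standard estimate for transformer attention (one key and one value vector per layer, per KV head, per token). The model shape is a hypothetical 70B‑class configuration with grouped‑query attention and 16‑bit values; it is not taken from the presentation, and real deployments vary with quantization and architecture.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers.

    The leading factor 2 accounts for storing both a key and a value vector.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape (assumption, not from the source).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)
print(f"per token: {per_token} bytes ({per_token / 1024:.0f} KiB)")

# Context lengths mentioned in the summary: 100k and 256k tokens.
for context_tokens in (100_000, 256_000):
    total_gib = per_token * context_tokens / 2**30
    print(f"{context_tokens:>7} tokens -> {total_gib:.1f} GiB of KV cache")
```

Under these assumptions each token costs 327,680 bytes (320 KiB), so a single 100k‑token session needs about 30.5 GiB and a 256k‑token session about 78.1 GiB, before counting model weights. A few such sessions per GPU are enough to exhaust HBM and trigger the evict‑and‑recompute cycle described above.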
MinIO MemKV offers a purpose-built solution by creating a distributed shared memory layer that GPUs can access quickly. It bypasses traditional file systems and kernel overhead by leveraging direct NVMe access to achieve microsecond latency. Unlike conventional enterprise storage, MemKV is engineered as an extension of memory, free from the "baggage" of durability features unnecessary for transient KV cache data. Benchmarks demonstrate impressive gains: a single H200 GPU node with two MemKV servers can handle up to 43 times more concurrent requests and achieve aggregate throughput of nearly 97 gigabytes per second. The design scales linearly to support vast superpods with thousands of GPUs and petabytes of context memory, delivering 16,000 concurrent requests and roughly 12 terabytes per second of throughput. This dramatically reduces TTFT and ensures that GPUs focus on useful decoding rather than repetitive recomputation.
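The benchmark figures can be cross‑checked with simple arithmetic. The sketch below takes the numbers quoted on this page (800 Gbps link, ~97 GB/s per eight‑GPU node, 125 nodes, 16,000 concurrent requests) and assumes perfectly linear scaling and decimal units; the variable names are illustrative only.

```python
# Figures quoted in the presentation summary.
LINK_GBPS = 800                # network bandwidth per node, gigabits per second
NODE_THROUGHPUT_GBS = 97       # measured aggregate throughput, gigabytes per second
GPUS_PER_NODE = 8
NODES = 125
CONCURRENT_REQUESTS = 16_000

# 8 bits per byte: the theoretical ceiling of the link in GB/s.
line_rate_gbs = LINK_GBPS / 8
link_utilization = NODE_THROUGHPUT_GBS / line_rate_gbs

# Assumption: throughput scales linearly with node count.
total_gpus = GPUS_PER_NODE * NODES
aggregate_tbs = NODE_THROUGHPUT_GBS * NODES / 1000

requests_per_node = CONCURRENT_REQUESTS / NODES
requests_per_gpu = CONCURRENT_REQUESTS / total_gpus

print(f"line rate:          {line_rate_gbs:.0f} GB/s")
print(f"link utilization:   {link_utilization:.0%}")
print(f"total GPUs:         {total_gpus}")
print(f"aggregate:          {aggregate_tbs:.3f} TB/s")
print(f"requests per node:  {requests_per_node:.0f}")
print(f"requests per GPU:   {requests_per_gpu:.0f}")
```

The 97 GB/s result is 97% of the 100 GB/s that an 800 Gbps link can carry, and 125 nodes at that rate give 12.125 TB/s, which matches the "approach 12 TB/s" figure in terabytes rather than terabits. The 16,000 concurrent requests work out to 128 per node, or 16 per GPU.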
The speakers highlight that this bottleneck is fundamentally a "software problem" best addressed through optimized software, rather than complex hardware solutions like CXL, which is deemed too low-level and slow to adapt. MemKV's approach allows for a "G3.5" memory tier that efficiently serves large KV cache blocks (2MB to 64MB tensors) directly from NVMe, avoiding file system metadata overhead. This enables superior effective GPU utilization, leading to significant cost reductions, estimated at over $2 million per year for a single H200 node in the public cloud. By providing a fast, scalable, and shared context memory, MemKV ensures GPUs perform meaningful work, boosting efficiency and handling the high concurrency demanded by modern AI inference.
Presented by AB Periasamy, co-CEO and co-founder, MinIO, and Dil Radhakrishnan, Architect, MinIO. Recorded live at AI Infrastructure Field Day in Millbrae, California, on June 11th, 2026. Watch the entire presentation at https://techfieldday.com/appearance/solidigm-and-minio-present-at-ai-infrastructure-field-day-5/ or visit https://techfieldday.com/event/aiifd5/ or https://www.minio.com for more information.
