Penguin Solutions Introduces Industry’s First Production-Ready CXL-Based KV Cache Server

HPCwire
Mar 17, 2026

Key Takeaways

  • First production‑ready CXL KV cache server
  • Provides up to 11 TB memory capacity
  • Cuts inference latency and improves token throughput
  • Offers 10× faster access than NVMe storage
  • Compatible with NVIDIA Dynamo for KV offloading

Summary

Penguin Solutions unveiled the MemoryAI KV cache server, the industry's first production-ready key-value (KV) cache appliance built on Compute Express Link (CXL) memory. The appliance combines 3 TB of DDR5 with up to eight 1 TB CXL add-in cards, delivering up to 11 TB of disaggregated memory for AI inference. By expanding the memory pool available to GPUs, it reduces latency, boosts token throughput, and cuts GPU idle time for large-scale, latency-sensitive workloads. The system is compatible with NVIDIA Dynamo and promises up to ten-fold faster KV access than traditional NVMe solutions.

Pulse Analysis

CXL’s emergence as a high‑speed, cache‑coherent interconnect is reshaping data‑center architecture, allowing memory to be disaggregated from compute nodes without sacrificing bandwidth. Penguin Solutions leverages this capability to create a dedicated KV cache tier that sits between GPU memory and traditional DRAM, effectively extending the memory hierarchy. This approach mitigates the long‑standing "memory wall" that has limited inference performance, especially as large language models grow in parameter count and context length.
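The tiering idea described above can be sketched in a few lines: a small, fast "GPU" tier spills least-recently-used KV entries into a much larger "CXL" tier, and a miss in both tiers falls back to recomputation. This is an illustrative toy model of the memory hierarchy, not Penguin's or NVIDIA Dynamo's actual API; the class name and capacities are invented for the sketch.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier KV cache: a small hot tier (standing in for GPU HBM)
    backed by a large capacity tier (standing in for a CXL memory pool).
    Capacities are entry counts, purely for illustration."""

    def __init__(self, gpu_capacity, cxl_capacity):
        self.gpu = OrderedDict()  # hot tier, LRU order
        self.cxl = OrderedDict()  # capacity tier, LRU order
        self.gpu_capacity = gpu_capacity
        self.cxl_capacity = cxl_capacity

    def put(self, key, value):
        self.gpu[key] = value
        self.gpu.move_to_end(key)
        while len(self.gpu) > self.gpu_capacity:
            # Evict the least-recently-used entry down to the CXL tier
            old_key, old_val = self.gpu.popitem(last=False)
            self.cxl[old_key] = old_val
            while len(self.cxl) > self.cxl_capacity:
                self.cxl.popitem(last=False)  # overflow beyond CXL is dropped

    def get(self, key):
        if key in self.gpu:
            self.gpu.move_to_end(key)
            return self.gpu[key]
        if key in self.cxl:
            value = self.cxl.pop(key)  # promote back to the hot tier
            self.put(key, value)
            return value
        return None  # miss in both tiers: caller must recompute the KV entry
```

The point of the sketch is the cost asymmetry: a hit in the CXL tier is a memory copy, while a full miss forces the GPU to recompute attention state, which is exactly the idle time the article says the appliance targets.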

For enterprises deploying real‑time AI services—such as financial news parsing, retrieval‑augmented generation over massive regulatory filings, or conversational agents—the latency of each token matters. By offloading KV data to an 11 TB CXL pool, the MemoryAI server reduces the number of costly GPU recompute cycles and shortens time‑to‑first‑token. Early benchmarks indicate up to 30 % lower GPU idle time and a measurable increase in token‑per‑second rates, translating directly into higher throughput and lower operational costs for inference clusters.
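A rough back-of-the-envelope calculation shows why 11 TB is a meaningful number for KV caching. The standard sizing formula is two tensors (K and V) per layer per token; the model dimensions below are hypothetical, chosen to resemble a 70B-class transformer with grouped-query attention, and do not come from the article.

```python
# Hypothetical model dimensions (70B-class GQA transformer), fp16 weights.
n_layers = 80
n_kv_heads = 8
head_dim = 128
dtype_bytes = 2  # fp16/bf16

# K and V tensors per layer, per token.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
pool_bytes = 11 * 10**12  # 11 TB pool (decimal terabytes)

tokens_cached = pool_bytes // kv_bytes_per_token
print(kv_bytes_per_token)  # 327680 bytes, i.e. 320 KiB per token
print(tokens_cached)       # 33569335, roughly 33.6 million tokens
```

Tens of millions of cached context tokens is far beyond what fits in GPU HBM, which is why a disaggregated tier can avoid recomputing prefill for long or recurring contexts.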

The introduction of a production‑ready CXL KV cache also signals a shift in competitive dynamics. Vendors that previously relied on NVMe or purely on‑board HBM must now consider memory‑disaggregation to stay relevant. Penguin’s alignment with NVIDIA Dynamo further integrates the solution into existing AI software stacks, easing adoption. As more organizations seek to run larger models with tighter SLAs, CXL‑based memory expansion is poised to become a standard component of AI‑focused data centers, driving both hardware innovation and new pricing models.
