
By extending effective memory capacity without adding GPUs, H³ cuts inference latency, capital costs, and energy use for LLM services that require massive context windows.
The explosive growth of large language models has exposed a fundamental memory bottleneck in inference servers. GPUs rely on high‑bandwidth memory (HBM) to feed their cores at terabyte‑per‑second rates, yet current HBM stacks top out at a few hundred gigabytes, far short of the multi‑terabyte key‑value (KV) caches required for 10‑million‑token contexts. Engineers have therefore been forced to spill cache data to local NVMe SSDs, incurring PCIe transfer latency and reducing throughput. High‑bandwidth flash (HBF), a NAND‑based memory that delivers bandwidth comparable to HBM's while offering terabyte‑scale capacity, promises to bridge this gap.
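To see why 10‑million‑token contexts imply multi‑terabyte KV caches, a back‑of‑envelope estimate helps. The model dimensions below are illustrative assumptions, chosen to resemble a modern 70B‑class model with grouped‑query attention; they are not drawn from the article or any specific deployment:

```python
# Back-of-envelope KV-cache sizing. All dimensions are hypothetical,
# illustrative values, not figures from SK Hynix or Nvidia.
num_layers = 80      # transformer layers
num_kv_heads = 8     # KV heads (grouped-query attention)
head_dim = 128       # dimension per attention head
dtype_bytes = 2      # fp16/bf16 storage

# Each token stores one key and one value vector per layer.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")

for tokens in (1_000_000, 10_000_000):
    total_tb = tokens * bytes_per_token / 1e12
    print(f"{tokens:>10,} tokens -> {total_tb:.2f} TB of KV cache")
```

Under these assumptions each token costs about 320 KiB, so a 10‑million‑token context needs roughly 3.3 TB of KV cache, an order of magnitude beyond what today's few hundred gigabytes of HBM can hold.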
SK Hynix’s H³ architecture places HBM and HBF stacks side by side on a common interposer, exposing a unified address space to the GPU. The GPU addresses HBF directly through the HBM base die, while a latency‑hiding buffer prefetches data to mask the latency gap, spanning nanoseconds to microseconds, between the two memories. In simulations of an Nvidia Blackwell B200 GPU equipped with eight HBM3E stacks and eight HBF stacks, tokens‑per‑second throughput rose 1.25× for one‑million‑token sequences and 6.14× for ten‑million‑token sequences. Power efficiency improved 2.69×, and the system handled 18.8× more simultaneous queries, effectively cutting the number of GPUs required.
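Conceptually, the latency‑hiding buffer works like classic double buffering: while the GPU consumes one block of KV data, the next block is already in flight from the slower flash tier. The sketch below illustrates that overlap in plain Python, with a worker thread standing in for the hardware prefetch engine; all function names and timings are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_from_hbf(block_id):
    """Simulate a slow HBF read (stand-in for flash access latency)."""
    time.sleep(0.001)
    return f"kv-block-{block_id}"

def compute_on_gpu(block):
    """Simulate attention compute over one resident KV block."""
    time.sleep(0.001)
    return len(block)

def stream_blocks(num_blocks):
    """Double buffering: fetch block i+1 while the GPU works on block i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_from_hbf, 0)        # prime the pipeline
        for i in range(num_blocks):
            current = future.result()                  # wait for block i
            if i + 1 < num_blocks:
                future = pool.submit(fetch_from_hbf, i + 1)  # prefetch i+1
            results.append(compute_on_gpu(current))    # overlaps the fetch
    return results

print(stream_blocks(4))
```

Because the fetch of block i+1 runs concurrently with the compute on block i, total time approaches the larger of the two costs per block rather than their sum, which is the effect the H³ buffer aims for in hardware.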
The commercial impact is immediate: cloud providers and enterprises can serve longer context windows without expanding GPU farms, lowering both capital expenditure and energy bills. Because HBF's endurance characteristics favor read‑heavy workloads, the hybrid design is especially attractive for cache‑augmented generation and other inference patterns that reuse a largely static KV cache. With HBM capacity growing only slowly across generations, H³ offers a pragmatic bridge until next‑generation memory arrives. Early adoption could also spur standards for interposer‑based memory hierarchies, positioning SK Hynix as a key enabler of cost‑effective, high‑throughput LLM inference.