Scaling the Memory Wall: Towards 3D-DRAM-Based Accelerators for Efficient Generative Inference
Why It Matters
Accelerating the decode phase with 3D‑DRAM reduces latency and energy costs, making real‑time AI services economically sustainable.
Key Takeaways
- LLM inference is dominated by memory bandwidth, not compute.
- KV-cache size grows linearly with token count, limiting capacity.
- GPUs excel at pre-fill but struggle with decode-phase latency.
- 3D-DRAM accelerators can boost bandwidth and reduce energy use.
- Batching improves compute utilization but inflates KV-cache memory demands.
Summary
The talk addresses the growing "memory wall" that hampers generative AI inference and proposes 3D-DRAM-based accelerators as a hardware remedy. Processor speed has risen 50-60% per year while memory bandwidth has grown only 7-8% per year, which turns the decode stage of large language models into a bandwidth-bound bottleneck.
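To give a feel for how quickly those two growth rates diverge, here is a minimal sketch that compounds them. The midpoint values (55% and 7.5%) are assumptions chosen for illustration; the talk only gives the ranges.

```python
# Assumed midpoints of the ranges quoted in the summary (not figures from the talk).
COMPUTE_GROWTH = 0.55     # midpoint of 50-60% per year
BANDWIDTH_GROWTH = 0.075  # midpoint of 7-8% per year


def compute_to_bandwidth_gap(years: int) -> float:
    """Factor by which compute outgrows memory bandwidth after `years` years."""
    return ((1 + COMPUTE_GROWTH) / (1 + BANDWIDTH_GROWTH)) ** years


for years in (5, 10):
    print(f"after {years} years: {compute_to_bandwidth_gap(years):.1f}x")
```

With these assumed midpoints, the gap between compute and bandwidth approaches 40x after a decade, which is the sense in which the wall keeps growing.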
Key insights include the stark contrast between the compute-heavy pre-fill phase, which GPUs handle well, and the decode phase, where generating each token requires a full read of the model weights plus a growing KV-cache. For a 70-billion-parameter model, weights occupy ~70 GB and each token adds ~0.16 MB of KV data, which quickly exhausts GPU HBM capacity when serving multiple users. The arithmetic intensity of such workloads hovers around 2 FLOPs/byte, far below the roofline ridge point of modern HBM-equipped chips.
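The capacity pressure described above can be sketched with simple arithmetic. The weight size and per-token KV figure come from the summary; the 80 GB HBM capacity and the 8,192-token context are hypothetical values chosen only for illustration.

```python
GB = 10**9

WEIGHT_BYTES = 70 * GB        # ~70 GB of weights (figure from the summary)
KV_BYTES_PER_TOKEN = 160_000  # ~0.16 MB of KV data per token (figure from the summary)


def kv_cache_bytes(context_tokens: int, users: int = 1) -> int:
    """KV-cache footprint: linear in context length and in concurrent users."""
    return KV_BYTES_PER_TOKEN * context_tokens * users


def max_concurrent_users(capacity_bytes: int, context_tokens: int) -> int:
    """Users that fit once the weights are resident; 0 if nothing is left over."""
    free = capacity_bytes - WEIGHT_BYTES
    if free <= 0:
        return 0
    return free // kv_cache_bytes(context_tokens)


# Hypothetical deployment (assumed, not from the talk): 80 GB HBM, 8,192-token contexts.
HBM_CAPACITY = 80 * GB
CONTEXT = 8192

print(f"KV cache per user: {kv_cache_bytes(CONTEXT) / GB:.2f} GB")
print(f"users that fit: {max_concurrent_users(HBM_CAPACITY, CONTEXT)}")
```

Under these assumptions each user needs about 1.3 GB of KV cache, so only 7 users fit in the 10 GB left after the weights are loaded.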
The speaker illustrates the problem with a historical analogy: just as the 1858 Atlantic telegraph cable shrank communication latency, modern AI demands a comparable reduction in data-movement latency. He cites concrete numbers: with 18 TB/s of HBM bandwidth, the roofline ridge point sits at an arithmetic intensity of 278 FLOPs/byte, yet the workload's arithmetic intensity (AI*) is only ~2 FLOPs/byte, leaving most of the compute idle. Attempts to prune the KV-cache or compress data only partially alleviate the issue; the fundamental limitation remains memory bandwidth and capacity.
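The roofline argument can be checked with a few lines of arithmetic. This is a minimal sketch of the standard roofline model using the three figures quoted above; the peak compute value is derived here from the ridge point and bandwidth, and is not a number stated in the talk.

```python
BANDWIDTH = 18e12   # bytes/s, HBM bandwidth quoted in the summary
RIDGE_AI = 278.0    # FLOPs/byte, roofline ridge point quoted in the summary
WORKLOAD_AI = 2.0   # FLOPs/byte, decode arithmetic intensity quoted in the summary

# Peak compute implied by the two quoted figures (derived, not stated in the talk).
PEAK_FLOPS = RIDGE_AI * BANDWIDTH


def attainable_flops(ai: float,
                     peak_flops: float = PEAK_FLOPS,
                     bandwidth: float = BANDWIDTH) -> float:
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, ai * bandwidth)


decode_flops = attainable_flops(WORKLOAD_AI)
utilization = decode_flops / PEAK_FLOPS

print(f"attainable: {decode_flops / 1e12:.0f} TFLOP/s")  # prints "attainable: 36 TFLOP/s"
print(f"utilization: {utilization:.2%}")                 # prints "utilization: 0.72%"
```

At an arithmetic intensity of 2, the memory system can feed well under 1% of the implied peak compute, which is why the summary describes most of the compute as idle during decode.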
Implications are clear: without redesigning the memory subsystem, interactive AI services will suffer high latency and prohibitive energy costs, undermining commercial viability. 3D‑DRAM stacks promise higher bandwidth per watt and larger on‑chip capacity, directly targeting the decode bottleneck and enabling scalable, low‑latency inference for next‑generation models.