Scaling the Memory Wall: Towards 3D-DRAM-Based Accelerators for Efficient Generative Inference

Onur Mutlu Lectures
Apr 28, 2026

Why It Matters

Accelerating the decode phase with 3D‑DRAM reduces latency and energy costs, making real‑time AI services economically sustainable.

Key Takeaways

  • LLM inference is dominated by memory bandwidth, not compute.
  • KV‑cache size grows linearly with token count, limiting capacity.
  • GPUs excel at pre‑fill but struggle with decode phase latency.
  • 3D‑DRAM accelerators can boost bandwidth and reduce energy use.
  • Batching improves compute but inflates KV‑cache memory demands.
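The batching trade-off in the last takeaway can be sketched with a rough estimate. The sketch below uses the figures quoted in the summary (~70 GB of weights for a 70-billion-parameter model, ~0.16 MB of KV data per token). The 8192-token context, the batch sizes, and the decision to ignore attention FLOPs are illustrative assumptions, not figures from the talk.

```python
# Figures quoted in the summary: ~70 GB of weights for a 70B-parameter model
# and ~0.16 MB of KV data per token.
PARAMS = 70e9
WEIGHT_BYTES = 70e9
KV_BYTES_PER_TOKEN = 0.16e6

# Assumption for illustration only: every user has an 8192-token context.
CONTEXT_TOKENS = 8192


def decode_step_bytes(batch, context_tokens):
    """Bytes read for one decode step: weights once, plus each user's KV-cache."""
    return WEIGHT_BYTES + batch * context_tokens * KV_BYTES_PER_TOKEN


def arithmetic_intensity(batch, context_tokens):
    """Approximate FLOPs per byte for one decode step.

    Simplification: counts one multiply-add (2 FLOPs) per weight per user
    and ignores attention FLOPs.
    """
    flops = 2 * PARAMS * batch
    return flops / decode_step_bytes(batch, context_tokens)


for batch in (1, 8, 32):
    kv_gb = batch * CONTEXT_TOKENS * KV_BYTES_PER_TOKEN / 1e9
    ai = arithmetic_intensity(batch, CONTEXT_TOKENS)
    print(f"batch={batch:2d}  KV-cache={kv_gb:6.1f} GB  AI={ai:5.1f} FLOPs/byte")
```

Under these assumptions, the weight reads are shared across the batch, so arithmetic intensity rises with batch size, but the KV-cache term grows with every added user and keeps the intensity below the ideal of 2 FLOPs/byte per user.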

Summary

The talk addresses the growing "memory wall" that hampers generative AI inference and proposes 3D-DRAM-based accelerators as a hardware remedy. While processor performance has grown 50-60% per year, memory bandwidth has grown only 7-8% per year, and this widening gap turns the decode stage of large language models into a bandwidth-bound bottleneck.
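To see how quickly these two growth rates diverge, the following minimal sketch compounds them over several years. The midpoint rates (55% and 7.5%) are assumptions picked from the quoted ranges, and `growth_gap` is a hypothetical helper name.

```python
def growth_gap(years, compute_rate=0.55, bandwidth_rate=0.075):
    """Factor by which compute outgrows memory bandwidth after `years`.

    The default rates are assumed midpoints of the ranges quoted above.
    """
    return ((1 + compute_rate) / (1 + bandwidth_rate)) ** years


for years in (1, 5, 10):
    print(f"after {years:2d} years the compute/bandwidth gap is {growth_gap(years):.1f}x")
```

Because the gap compounds, even a modest difference in yearly rates becomes a gap of more than an order of magnitude within a decade under these assumed rates.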

The talk contrasts the compute-heavy pre-fill phase, which GPUs handle well, with the decode phase, where generating each token requires a full read of the model weights and of a KV-cache that grows with every token. For a 70-billion-parameter model, the weights occupy ~70 GB (about one byte per parameter) and each token adds ~0.16 MB of KV data, so GPU HBM capacity is quickly exhausted when serving multiple users. The arithmetic intensity of decode hovers around 2 FLOPs/byte, far below the ridge point of the roofline for modern HBM-equipped chips.
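The ~0.16 MB per-token figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes a Llama-3.1-70B-like shape (80 layers, 8 KV heads, head dimension 128) and one byte per cached element. The summary does not state the model shape or precision, so these are assumptions, as are the 192 GB memory capacity and the helper names. The 128K-token context comes from the seminar abstract.

```python
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=1):
    """KV-cache bytes added per token: one key and one value vector per layer.

    Defaults assume a Llama-3.1-70B-like shape with 1-byte elements.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_concurrent_users(capacity_gb, weight_gb, context_tokens, per_token_bytes):
    """Users whose full-context KV-caches fit in the memory left after weights."""
    free_bytes = (capacity_gb - weight_gb) * 1e9
    per_user_bytes = context_tokens * per_token_bytes
    return int(free_bytes // per_user_bytes)


per_token = kv_bytes_per_token()
print(f"KV-cache per token: {per_token / 1e6:.2f} MB")

# Hypothetical device with 192 GB of memory, 70 GB of it taken by weights.
users = max_concurrent_users(192, 70, 128 * 1024, per_token)
print(f"128K-token users that fit: {users}")
```

Under these assumptions, a single 128K-token context needs roughly 21 GB of KV-cache, which is why capacity, and not only bandwidth, limits how many users one device can serve.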

The speaker illustrates the problem with historical analogies: just as the 1858 Atlantic telegraph cable shrank communication latency, modern AI demands a comparable reduction in data-movement latency. He cites concrete numbers: with 18 TB/s of HBM bandwidth, the roofline ridge point sits at an arithmetic intensity (AI) of about 278 FLOPs/byte, yet the decode workload's AI is only ~2, leaving most of the compute idle. Pruning or compressing the KV-cache only partially alleviates the issue; the fundamental limitation remains memory bandwidth and capacity.
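The roofline argument can be made concrete with a small sketch. It takes the two quoted figures (18 TB/s of bandwidth and 278 as the arithmetic intensity at the ridge point) and derives the implied peak compute from them. That peak is not stated in the summary, so it is an inferred value, and `attainable_flops` is a hypothetical helper name.

```python
def attainable_flops(ai, bandwidth_bytes_per_s, peak_flops):
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, ai * bandwidth_bytes_per_s)


BANDWIDTH = 18e12              # 18 TB/s, as quoted
RIDGE_AI = 278                 # FLOPs/byte at the ridge point, as quoted
PEAK = RIDGE_AI * BANDWIDTH    # implied peak FLOP/s (inferred, not stated in the talk summary)
DECODE_AI = 2                  # approximate decode arithmetic intensity, as quoted

utilization = attainable_flops(DECODE_AI, BANDWIDTH, PEAK) / PEAK
print(f"compute utilization during decode: {utilization:.2%}")
```

In this model, any workload whose arithmetic intensity is below the ridge point is bandwidth-bound, and its compute utilization is simply its intensity divided by the ridge-point intensity.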

The implications are clear: without a redesigned memory subsystem, interactive AI services will suffer high latency and prohibitive energy costs, undermining their commercial viability. 3D-DRAM stacked directly on the accelerator logic promises higher bandwidth per watt together with HBM-class capacity, directly targeting the decode bottleneck and enabling scalable, low-latency inference for next-generation models.

Original Description

SAFARI Live Seminar — Scaling the Memory Wall: Towards 3D-DRAM-based Accelerators for Efficient Generative Inference
Speaker: Prof. Prashant Nair
Slides (pdf): 
Abstract:
Generative AI now underpins search, digital assistants, and media applications, making inference cost a first-order design constraint. Unlike traditional compute-bound workloads, large language and speech models are typically limited by memory bandwidth and capacity rather than raw arithmetic throughput. Thus, their inference cost is driven as much by data movement as by compute, and therefore hinges on the memory system’s design. This concern is especially acute during autoregressive decoding, which must repeatedly stream model weights and key-value (KV) caches at high bandwidths and low latencies while also providing enough capacity to support long context windows and several concurrent users. To make matters worse, these demands are accelerating with state-of-the-art models now exceeding hundreds of billions of parameters, context windows expanding from 4K to 128K tokens and beyond, and mixture-of-experts designs introducing additional irregularity in memory access patterns. Thus, today’s memory technologies force difficult trade-offs. SRAM can deliver extremely high bandwidth, but at prohibitive area and capacity limits. HBM offers higher capacity, but remains constrained by achievable bandwidth and I/O power. Closing this gap will require a fundamental rethinking of how memory is integrated with accelerator logic.
In this talk I will introduce our upcoming ISCA 2026 paper on the upcoming memory-centric accelerator from d-Matrix. This accelerator vertically integrates logic with 3D-stacked DRAM to deliver SRAM-level bandwidth and HBM-class capacity while substantially reducing energy consumption. I will describe the architectural challenges addressed by workload-aware channel mapping, optimized power management, topology-preserving redundancy, and thermal-aware reliability mechanisms, enabling the practical deployment of 3D-DRAM. Evaluations using models such as Llama-3.1, DeepSeek-V3, Canary, and Whisper show that our accelerator achieves significantly higher throughput and interactivity compared to HBM-based alternatives. I will conclude by examining the broader implications for computer architecture, particularly how advanced logic-memory integration through hybrid bonding and multi-high stacking can reshape inference cost structures and enable the next generation of trillion-parameter models.
Biography: Prashant J. Nair is an Associate Professor at the University of British Columbia (UBC) and also the lead architect of the 3D-memory architecture at d-Matrix. He leads the Systems and Architectures (STAR) Lab at UBC and is also an Affiliate Fellow of the Quantum Algorithms Institute. His research focuses on memory architectures and systems. Dr. Nair’s recognitions include the 2024 TCCA Young Architect Award, the 2025 DSN Test of Time Award, the HPCA 2023 Best Paper Award, a MICRO 2024 Best Paper nomination, and the HPCA 2025 Distinguished Artifact Award. Over the past decade, he has published more than 40 papers in top-tier venues. Prior to his promotion to Associate Professor, as an Assistant Professor, he was inducted into all three halls of fame of computer architecture: ISCA, MICRO, and HPCA.
