Microarchitecture Tailored to 3D-Stacked Near-Memory Processing LLM Decoding (U. of Edinburgh, Peking U., Cambridge et al.)

Semiconductor Engineering
Apr 28, 2026

Why It Matters

LLM inference dominates data‑center workloads, and accelerating decoding with area‑efficient, high‑bandwidth NMP chips can slash latency and power costs, reshaping AI hardware economics.

Key Takeaways

  • Systolic array replaces MAC tree for area efficiency
  • Reconfigurable array shape matches diverse decode operator patterns
  • Unified vector core reduces buffer needs, cuts silicon area
  • Multi‑core scheduler yields 2.91× speedup over Stratum
  • Energy efficiency improves 2.40× for dense and MoE models

Pulse Analysis

The surge in generative AI has made large-language-model decoding a central performance bottleneck, primarily because the task's low arithmetic intensity leaves it bound by memory bandwidth. Conventional off-chip interfaces cannot feed data fast enough, prompting researchers to explore 3D-stacked near-memory processing (NMP), in which memory sits directly atop the compute die. That proximity delivers orders-of-magnitude higher local bandwidth, but it also pushes many decode operators into a compute-bound regime, demanding a rethink of the underlying compute substrate.
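To make that regime shift concrete, here is a minimal roofline sketch in Python. All throughput and bandwidth numbers are illustrative placeholders, not figures from the paper; the point is simply that a decode-step GEMV at roughly 1 FLOP/byte is bandwidth-capped behind a conventional off-chip interface but becomes compute-bound once 3D stacking raises local bandwidth while the area-constrained NMP die offers a more modest compute peak.

```python
def attainable_tflops(intensity_flops_per_byte: float,
                      peak_tflops: float,
                      bandwidth_tbps: float) -> float:
    """Roofline model: attainable throughput is the lesser of the compute
    peak and the bandwidth-limited rate (bandwidth * arithmetic intensity)."""
    return min(peak_tflops, bandwidth_tbps * intensity_flops_per_byte)

# fp16 GEMV: ~2 FLOPs per 2-byte weight streamed once -> ~1 FLOP/byte.
intensity = 1.0

# Hypothetical machine parameters (illustrative, not from the paper):
# a big GPU-class die behind a ~1 TB/s off-chip link, vs. a smaller
# NMP compute die under ~50 TB/s of stacked local bandwidth.
off_chip = attainable_tflops(intensity, peak_tflops=100.0, bandwidth_tbps=1.0)
stacked = attainable_tflops(intensity, peak_tflops=20.0, bandwidth_tbps=50.0)

print(f"off-chip interface: {off_chip:.1f} TFLOP/s (memory-bound)")
print(f"3D-stacked NMP:     {stacked:.1f} TFLOP/s (compute-bound)")
```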

The new study tackles this challenge by replacing the legacy MAC-tree units with a compact systolic array whose shape can be reconfigured on the fly to match the heterogeneous tensor dimensions typical of LLM decoding. Because the high-bandwidth memory reduces the need for large on-chip buffers, the authors repurpose an existing vector core, originally intended for auxiliary tensor work, to provide the necessary control logic and multi-ported buffering. This unification slashes silicon area while preserving fine-grained reconfigurability, a critical factor for maintaining high utilization across both dense and mixture-of-experts (MoE) models.
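The utilization argument can be illustrated with a toy model. The sketch below assumes a hypothetical 256-PE budget and a simple edge-padding cost model; neither the PE count nor the operator shapes are from the paper. It shows why a fixed square array wastes most of its PEs on GEMV-shaped decode operators, while an array that can trade rows for columns keeps them busy.

```python
from math import ceil

PE_BUDGET = 256  # hypothetical PE count; the paper's array size may differ

def factor_shapes(budget: int):
    """All rows x cols factorizations of the PE budget."""
    return [(r, budget // r) for r in range(1, budget + 1) if budget % r == 0]

def utilization(m: int, n: int, rows: int, cols: int) -> float:
    """Fraction of PE-cycles doing useful work when an m x n output
    is tiled onto a rows x cols array (edge tiles are zero-padded)."""
    padded = ceil(m / rows) * rows * ceil(n / cols) * cols
    return (m * n) / padded

def best_shape(m: int, n: int):
    """Pick the array aspect ratio that maximizes utilization for this operator."""
    return max(factor_shapes(PE_BUDGET), key=lambda rc: utilization(m, n, *rc))

# Decode operators span very different shapes: a single-token GEMV
# projection (m = 1) vs. a batched MoE expert GEMM (m = 32, say).
for m, n in [(1, 4096), (32, 4096), (50, 700)]:
    r, c = best_shape(m, n)
    print(f"{m:>3}x{n}: reshaped {r}x{c} util={utilization(m, n, r, c):.2f} "
          f"| fixed 16x16 util={utilization(m, n, 16, 16):.2f}")
```

For the GEMV case the reshaped 1x256 array reaches full utilization, whereas the fixed 16x16 array idles 15 of every 16 rows.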

Performance results are striking: the prototype delivers a 2.91× speedup and a 2.40× improvement in energy efficiency compared with the state-of-the-art Stratum design. For cloud providers and enterprises running inference workloads at scale, such gains translate into lower latency, reduced power bills, and more queries served per server. The paper's co-design of microarchitecture and scheduling also sets a roadmap for future NMP chips, suggesting that tighter integration of memory and compute will become a cornerstone of next-generation AI accelerators.
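The article does not detail the scheduler's policy, but a generic greedy longest-processing-time baseline conveys the flavor of spreading heterogeneous decode operators across NMP cores. The operator names and per-step costs below are hypothetical, and this is a stand-in illustration rather than the paper's actual algorithm.

```python
import heapq

def schedule_lpt(op_costs: dict[str, float], num_cores: int) -> dict[int, list[str]]:
    """Greedy longest-processing-time-first assignment across cores.
    A generic load-balancing baseline, not the paper's scheduler."""
    loads = [(0.0, core) for core in range(num_cores)]  # (current load, core id)
    heapq.heapify(loads)
    assignment: dict[int, list[str]] = {core: [] for core in range(num_cores)}
    for op, cost in sorted(op_costs.items(), key=lambda kv: -kv[1]):
        load, core = heapq.heappop(loads)  # always extend the least-loaded core
        assignment[core].append(op)
        heapq.heappush(loads, (load + cost, core))
    return assignment

# Hypothetical per-operator latencies (ms) for one decode step:
ops = {"qkv_proj": 0.40, "attention": 0.55, "out_proj": 0.35,
       "moe_expert_0": 0.25, "moe_expert_1": 0.25, "router": 0.05}
print(schedule_lpt(ops, num_cores=3))
```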
