
Replacing GPU Compute Dies With PNM-Enabled HBM Cubes For Long-Context Decode Attention (UCSD, Columbia, Yonsei U., NVIDIA, Samsung)
Why It Matters
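AMMA is a memory‑centric accelerator design for ultra‑long‑context LLM serving. The paper reports 15.5× lower attention latency and 6.9× lower energy consumption than an NVIDIA H100, which points to faster and cheaper inference for enterprise workloads.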
Key Takeaways
- AMMA swaps GPU compute dies for HBM‑PNM memory cubes
- Memory bandwidth roughly doubles versus traditional GPU designs
- Latency for 1M‑token attention drops 15.5×
- Energy consumption drops 6.9× compared with NVIDIA H100
- Hybrid two‑level parallelism reduces intra‑chip die‑to‑die traffic
Pulse Analysis
The surge in foundation models has pushed inference workloads toward ever‑longer context windows, often exceeding a million tokens for complex reasoning or agentic tasks. Conventional GPU‑centric servers, while powerful for compute‑heavy kernels, struggle with the decode phase of attention, which is fundamentally memory‑bound: each generated token must read the entire key‑value cache while performing only a few operations per byte. This mismatch inflates latency and leaves compute silicon idle, prompting researchers to rethink the hardware hierarchy. AMMA addresses the bottleneck directly by replacing the GPU's compute dies with high‑bandwidth memory (HBM) cubes equipped with processing‑near‑memory (PNM) capabilities, which delivers roughly twice the bandwidth of existing GPU memory stacks.
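The memory‑bound claim can be checked with a rough roofline calculation. The sketch below is a back‑of‑the‑envelope estimate, not a figure from the paper: the model shape (128‑dimensional heads, 8 key‑value heads, 8 query heads per key‑value head, 16‑bit cache) and the hardware peaks (1e15 FLOP/s, 3e12 bytes/s) are round numbers assumed for illustration.

```python
# Back-of-the-envelope roofline check for decode-phase attention.
# Every model and hardware number here is an illustrative assumption,
# not a value taken from the AMMA paper or a vendor datasheet.

ASSUMED_PEAK_FLOPS = 1e15      # hypothetical accelerator compute peak, FLOP/s
ASSUMED_PEAK_BANDWIDTH = 3e12  # hypothetical memory bandwidth, bytes/s


def decode_attention_cost(context_len, head_dim, num_kv_heads, q_per_kv,
                          bytes_per_elem=2):
    """Return (flops, bytes_read) for one decode step of one attention layer."""
    num_q_heads = num_kv_heads * q_per_kv
    # QK^T and the attention-weighted sum over V each cost about
    # 2 * context_len * head_dim FLOPs per query head.
    flops = 4 * context_len * head_dim * num_q_heads
    # The cached K and V tensors are each read once per key-value head.
    bytes_read = 2 * context_len * head_dim * num_kv_heads * bytes_per_elem
    return flops, bytes_read


def bound_by(intensity, peak_flops, peak_bandwidth):
    """Classify a kernel by comparing its intensity to the machine balance."""
    machine_balance = peak_flops / peak_bandwidth  # FLOPs available per byte
    return "memory" if intensity < machine_balance else "compute"


flops, bytes_read = decode_attention_cost(
    context_len=1_000_000, head_dim=128, num_kv_heads=8, q_per_kv=8)
intensity = flops / bytes_read
verdict = bound_by(intensity, ASSUMED_PEAK_FLOPS, ASSUMED_PEAK_BANDWIDTH)
print(f"{intensity:.1f} FLOPs/byte -> {verdict}-bound")
```

Under these assumptions the kernel performs 8 FLOPs per byte read, while the assumed hardware can sustain about 333 FLOPs per byte, so adding bandwidth helps far more than adding compute. The context length cancels out of the ratio, which is why the imbalance persists however long the prompt grows.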
At the heart of AMMA lies a lightweight logic die that orchestrates per‑cube bandwidth, paired with a two‑level hybrid parallelism that balances intra‑cube and inter‑cube workloads. The architecture also introduces a reordered collective flow, slashing the die‑to‑die communication overhead that typically hampers multi‑chiplet systems. Design‑space exploration in the paper shows that even modest compute power per cube, combined with high‑speed die‑to‑die (D2D) links, yields substantial latency reductions. The reported 15.5× improvement in 1M‑token attention latency and 6.9× drop in energy consumption versus the NVIDIA H100 underscore the efficiency gains achievable when memory, rather than raw compute, becomes the primary engine for attention.
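To make the idea of spreading one attention computation across several memory cubes concrete, here is a minimal sketch of a widely used split‑KV scheme: each shard computes attention over its own slice of the cache, and the partial results are merged with a log‑sum‑exp correction. This is an assumption for illustration only. The summary does not say that AMMA partitions work this way, and the function names and the mapping of one shard to one cube are hypothetical.

```python
import math

# Illustrative split-KV decode attention. One "shard" stands in for one
# memory cube; this mapping is an assumption, not the paper's scheme.


def partial_attention(q, keys, values):
    """Attention over one KV shard: returns (max_score, sum_exp, weighted_sum)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    # Subtract the local max before exponentiating for numerical stability.
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    acc = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return m, denom, acc


def merge_partials(partials):
    """Combine per-shard results into the exact softmax-weighted output."""
    global_max = max(m for m, _, _ in partials)
    total = 0.0
    out = [0.0] * len(partials[0][2])
    for m, denom, acc in partials:
        # Rescale each shard from its local max to the global max.
        correction = math.exp(m - global_max)
        total += denom * correction
        for j, a in enumerate(acc):
            out[j] += a * correction
    return [o / total for o in out]


def sharded_decode_attention(q, keys, values, num_shards):
    """Split the KV cache into contiguous shards and merge the partials."""
    shard = math.ceil(len(keys) / num_shards)
    partials = [
        partial_attention(q, keys[i:i + shard], values[i:i + shard])
        for i in range(0, len(keys), shard)
    ]
    return merge_partials(partials)


# Small deterministic example: 10 cached tokens, 4-dim keys, 3-dim values.
q = [0.1, -0.2, 0.3, 0.4]
keys = [[math.sin(i + j) for j in range(4)] for i in range(10)]
values = [[math.cos(i * j) for j in range(3)] for i in range(10)]
reference = sharded_decode_attention(q, keys, values, num_shards=1)
sharded = sharded_decode_attention(q, keys, values, num_shards=4)
```

In this scheme only a small triple per shard (a maximum, a normalizer, and one output‑sized vector) has to cross between shards, while the bulk of the cache is read locally. That property is what makes this kind of partitioning attractive when die‑to‑die links are the scarce resource.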
For data‑center operators and AI service providers, AMMA’s gains translate into faster response times and lower operational costs, especially as models grow in size and context demands. The memory‑centric approach could reshape future AI accelerators, encouraging a shift toward modular, chiplet‑based designs that prioritize bandwidth and energy efficiency. As the industry moves toward heterogeneous compute fabrics, AMMA offers a concrete blueprint for integrating PNM‑enabled HBM cubes into existing infrastructure, potentially accelerating the rollout of ultra‑long‑context AI services across cloud and edge environments.