Scaling Intelligence Through the Memory Hierarchy with Solidigm

Tech Field Day, May 21, 2026

Why It Matters

Memory capacity is a decisive lever for AI performance; expanding the KV cache and tiered storage directly raises throughput, accuracy, and cost efficiency of large‑scale inference workloads.

Key Takeaways

  • Memory capacity directly boosts LLM throughput by reducing recomputation.
  • Larger KV cache enables higher “tenacity,” improving complex problem scores.
  • Tiered storage hierarchy (HBM → SSD) balances speed and capacity.
  • Parallel chain‑of‑thought models need >16 GB HBM for optimal performance.
  • Truncating reasoning tokens drops accuracy to zero, highlighting capacity importance.

Summary

Kapil Karkra, senior principal engineer at Solidigm, argued that scaling AI intelligence requires a third, often overlooked axis: memory capacity. While larger models and more compute dominate headlines, the talk demonstrated how the memory hierarchy, from high‑bandwidth HBM down to NVMe SSD tiers, directly influences model performance and quality.

Using a single RTX 6000 Pro GPU with 96 GB HBM, Karkra showed that when the KV cache fits within HBM, the system achieves 29 requests per second (system‑one recall). Once the working set grows beyond the 22 GB of usable cache, the GPU must recompute earlier states and throughput drops to 2.68 rps (system‑two). Similarly, on the AIME 2024 math benchmark, a 32‑billion‑parameter model scored 7% on the first run but rose to 83% when given sufficient token budget, illustrating how extra capacity fuels tenacity and higher accuracy.
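To make the capacity arithmetic concrete, here is a rough sketch of how a KV cache budget translates into cached tokens. The per-token formula (a key and a value vector per layer per KV head) is the standard one for transformer attention. The model shape below (64 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical illustration, not the configuration used in the talk, and the 22 GB figure is treated as GiB for simplicity. The throughput numbers are the ones quoted above.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache needed for one token of context."""
    # Factor 2: each layer stores one key and one value vector per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def tokens_that_fit(cache_bytes, per_token_bytes):
    """Whole tokens of context that fit in a given cache budget."""
    return cache_bytes // per_token_bytes


# Hypothetical model shape (an assumption, not taken from the talk).
per_token = kv_bytes_per_token(n_layers=64, n_kv_heads=8, head_dim=128)

# 22 GB of usable cache from the article, treated here as 22 GiB.
budget = 22 * 1024**3
max_tokens = tokens_that_fit(budget, per_token)

# Throughput figures quoted in the article: cache fits vs. recomputation.
slowdown = 29 / 2.68

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Tokens that fit in budget: {max_tokens}")
print(f"Slowdown when recomputing: {slowdown:.1f}x")
```

Under these assumed dimensions the budget holds roughly 90 thousand tokens across all concurrent requests, which a handful of long-context sessions could exhaust. The quoted throughput figures correspond to a slowdown of about 10.8x once the working set no longer fits.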

Key examples included a jump from 67% to 80% and then 83% on the math test by allocating parallel chain‑of‑thought instances within the 22 GB HBM limit, and the stark finding that any truncation of reasoning tokens resulted in a 0% score. These data points underscore that memory capacity, not just compute, determines whether a model can retain context, reason deeply, and deliver reliable outputs.

The implication for enterprises is clear: investing in tiered memory architectures and expanding KV cache capacity can dramatically improve AI service throughput, reduce latency, and boost the quality of complex inference tasks. Companies that overlook this lever risk slower, less accurate AI deployments and higher operational costs.

Original Description

Solidigm's presentation at AI Field Day 8, led by Kapil Karkra, highlighted memory capacity as a critical, often overlooked, third axis for scaling AI intelligence, alongside model size and compute power. Solidigm introduced its "CRAFT" framework to define and measure AI intelligence across five dimensions: Comprehension, Recall, Adaptability, Fluency, and Tenacity. The core argument is that expanding memory capacity beyond the GPU's high-bandwidth memory (HBM) to system DRAM and NVMe SSDs dramatically improves AI performance and quality by enabling more efficient inference and preventing costly recomputations.
Through various benchmarks and experiments, Solidigm demonstrated the impact of memory capacity on each CRAFT dimension. For Recall, offloading Key-Value (KV) cache to SSDs prevented the GPU from recomputing previous states, significantly boosting throughput. Tenacity was illustrated with an AIME 2024 math test, where increased output token capacity allowed the model to deliberate longer and achieve a higher score, showcasing how more "scratch space" leads to better reasoning quality. Adaptability, measured by requests per second, and Fluency, indicated by inter-token latency, both saw substantial improvements (up to 4x throughput and 21x better latency) when NVMe SSDs extended the KV cache, allowing the system to handle more concurrent requests without compromising responsiveness. Similarly, Comprehension, tested with a "needle in a haystack" benchmark, showed 78 times faster reading when context fit in the extended cache.
The presentation concluded that while higher bandwidth storage is beneficial when working sets fit within faster tiers, ultimately, sheer capacity becomes paramount for larger, more complex AI workloads involving multiple agents and extensive context lengths. The discussion emphasized the need for a tiered memory hierarchy, where automatic caching across HBM, DRAM, and NVMe SSDs optimizes resource utilization and avoids GPU stalls. This approach allows organizations to balance performance and cost effectively, ensuring that AI systems can sustain deeper reasoning, handle greater concurrency, and deliver higher quality, more fluent responses by leveraging expanded memory capacity.
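The tiered caching idea described above can be sketched as a toy LRU cache that spills entries from a fast tier to slower ones instead of discarding them. This is a minimal illustration under stated assumptions, not Solidigm's implementation: the class name `TieredKVCache`, the tier names, and the entry-count capacities are all hypothetical, and real systems track bytes and move data asynchronously.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy LRU cache spanning several tiers, fastest tier first."""

    def __init__(self, capacities):
        # capacities: list of (tier_name, max_entries), fastest tier first.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities]
        self.recomputes = 0

    def _insert(self, level, key, value):
        if level >= len(self.tiers):
            return  # Fell off the slowest tier; a later lookup must recompute.
        _, cap, store = self.tiers[level]
        store[key] = value
        store.move_to_end(key)  # Mark as most recently used in this tier.
        if len(store) > cap:
            # Demote the least recently used entry to the next, slower tier.
            old_key, old_value = store.popitem(last=False)
            self._insert(level + 1, old_key, old_value)

    def get(self, key, compute):
        """Return (value, source); source is a tier name or 'recomputed'."""
        for name, _, store in self.tiers:
            if key in store:
                value = store.pop(key)
                self._insert(0, key, value)  # Promote back to the fastest tier.
                return value, name
        self.recomputes += 1
        value = compute(key)
        self._insert(0, key, value)
        return value, "recomputed"


def run_workload(cache, n_keys=6):
    """Touch n_keys distinct contexts, then revisit the first one."""
    for k in range(n_keys):
        cache.get(k, compute=lambda key: f"kv-state-{key}")
    return cache.get(0, compute=lambda key: f"kv-state-{key}")


hbm_only = TieredKVCache([("hbm", 2)])
tiered = TieredKVCache([("hbm", 2), ("dram", 2), ("ssd", 4)])

print(run_workload(hbm_only), hbm_only.recomputes)
print(run_workload(tiered), tiered.recomputes)
```

In this toy workload the single-tier cache has already discarded the first context by the time it is revisited, so it pays for a seventh computation, while the tiered cache finds it in the slowest tier and promotes it. That mirrors the Recall argument: a slower read from a larger tier stands in for a GPU recomputation.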
Presented by Kapil Karkra, Sr. Principal Engineer AI Solutions and Software, Solidigm. Recorded live at AI Field Day 8 in San Jose, California on May 14, 2026. Watch the entire presentation at https://techfieldday.com/appearance/solidigm-presents-at-ai-field-day-8/ or visit https://TechFieldDay.com/event/aifd8/ or https://Solidigm.com for more information.
