IndexCache, a New Sparse Attention Optimizer, Delivers 1.82x Faster Inference on Long-Context AI Models

VentureBeat
Mar 27, 2026

Why It Matters

By slashing inference latency and compute cost for ultra‑long contexts, IndexCache makes enterprise‑scale LLM deployments more affordable and responsive, unlocking new use cases such as retrieval‑augmented generation (RAG) and multi‑step agentic workflows.

Key Takeaways

  • IndexCache eliminates up to 75% of DSA indexer computation
  • 1.82× faster prefill at 200K tokens
  • Throughput gains reach 1.48× for generation
  • No quality loss on long‑context benchmarks
  • Compatible with serving stacks like vLLM

Pulse Analysis

The quadratic scaling of traditional self‑attention has long been a choke point for large language models, especially when handling extended contexts such as full documents or multi‑turn reasoning. Sparse attention architectures like DeepSeek Sparse Attention (DSA) already reduce the core attention cost from quadratic to linear, but their auxiliary indexer modules still operate quadratically at every layer, inflating prefill latency as token windows grow.

IndexCache tackles this hidden bottleneck by exploiting the observation that adjacent transformer layers select largely overlapping token subsets—often 70‑100% similarity. By designating a small set of "full" layers to compute fresh indices and allowing subsequent "shared" layers to reuse those cached selections, the method eliminates up to three‑quarters of indexer work. The approach is deployment‑friendly: a training‑free greedy calibration automatically determines optimal layer placement, while a training‑aware variant integrates a multi‑layer distillation loss for models still in development. Both paths preserve benchmark performance, even yielding marginal gains on challenging math reasoning tasks.
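The full/shared layer split can be sketched in a few lines. This is an illustrative toy, not IndexCache's actual code: the function names, the choice of which layers are "full", and the top‑k scoring stand‑in for DSA's indexer are all assumptions made here.

```python
def select_top_tokens(scores, k):
    # Stand-in for a DSA indexer pass: pick the k highest-scoring token positions.
    return frozenset(sorted(range(len(scores)), key=scores.__getitem__)[-k:])

def run_with_index_cache(layer_scores, full_layers, k):
    """For each layer, recompute token selections only on designated 'full'
    layers; 'shared' layers reuse the most recent cached selection."""
    cached = None
    selections = []
    for layer, scores in enumerate(layer_scores):
        if layer in full_layers or cached is None:
            cached = select_top_tokens(scores, k)  # fresh indexer pass
        selections.append(cached)                  # shared layers reuse the cache
    return selections

# Example: 8 layers, fresh indices only at layers 0 and 4,
# so 6 of 8 indexer passes (75%) are skipped.
scores = [[float((i * 7 + j * 3) % 16) for j in range(16)] for i in range(8)]
selections = run_with_index_cache(scores, full_layers={0, 4}, k=4)
```

Because adjacent layers tend to pick 70–100% overlapping token subsets, reusing the cached selection costs little accuracy while skipping the quadratic indexer work on shared layers.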

For businesses, the practical payoff is immediate. On a 30‑billion‑parameter GLM model, IndexCache cuts prefill latency from 19.5 seconds to 10.7 seconds and lifts throughput from 58 to 86 tokens per second, translating into roughly 20% lower compute spend for long‑context workloads. The open‑source patches integrate seamlessly with popular inference engines such as vLLM and SGLang, reducing engineering overhead. As AI products increasingly rely on real‑time, document‑heavy interactions, techniques like IndexCache signal a broader shift toward inference‑aware model design, ensuring that scaling model size does not come at the expense of operational efficiency.
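The headline speedups follow directly from those reported figures; a quick sanity check of the arithmetic:

```python
# Figures reported for the 30B-parameter GLM model at a 200K-token context.
prefill_before_s, prefill_after_s = 19.5, 10.7   # prefill latency, seconds
tps_before, tps_after = 58, 86                   # generation throughput, tokens/s

prefill_speedup = prefill_before_s / prefill_after_s
decode_speedup = tps_after / tps_before

print(f"{prefill_speedup:.2f}x prefill, {decode_speedup:.2f}x generation")
# prints: 1.82x prefill, 1.48x generation
```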
