IndexCache, a New Sparse Attention Optimizer, Delivers 1.82x Faster Inference on Long-Context AI Models
Why It Matters
By slashing inference latency and compute cost for ultra‑long contexts, IndexCache makes enterprise‑scale LLM deployments more affordable and responsive, unlocking new use cases such as RAG and multi‑step agentic workflows.
Key Takeaways
- IndexCache removes up to 75% of DSA indexer computation
- 1.82× faster prefill at 200K tokens
- Throughput gains reach 1.48× for generation
- No quality loss on long‑context benchmarks
- Compatible with serving stacks like vLLM
Pulse Analysis
The quadratic scaling of traditional self‑attention has long been a choke point for large language models, especially when handling extended contexts such as full documents or multi‑turn reasoning. Sparse attention architectures like DeepSeek Sparse Attention (DSA) already reduce the core attention cost from quadratic to linear, but their auxiliary indexer modules still operate quadratically at every layer, inflating prefill latency as token windows grow.
IndexCache tackles this hidden bottleneck by exploiting the observation that adjacent transformer layers select largely overlapping token subsets—often 70‑100% similarity. By designating a small set of "full" layers to compute fresh indices and allowing subsequent "shared" layers to reuse those cached selections, the method eliminates up to three‑quarters of indexer work. The approach is deployment‑friendly: a training‑free greedy calibration automatically determines optimal layer placement, while a training‑aware variant integrates a multi‑layer distillation loss for models still in development. Both paths preserve benchmark performance, even yielding marginal gains on challenging math reasoning tasks.
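The full/shared layer mechanism described above can be sketched in plain Python. This is an illustrative toy, not the released implementation: the function names, the similarity threshold, and the list-of-scores representation are all assumptions made for clarity.

```python
def select_topk_indices(scores, k):
    """The indexer's job: pick the k highest-scoring token positions
    for sparse attention. This is the costly step IndexCache avoids
    repeating at every layer."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def greedy_calibrate(similarities, threshold=0.7):
    """Training-free placement sketch: mark a layer 'full' (recompute
    indices) only when its selection overlap with the previous layers
    drops below a threshold; otherwise it becomes a 'shared' layer.
    `similarities[i]` is a hypothetical per-layer overlap score."""
    full_layers = {0}  # the first layer always computes fresh indices
    for layer_id, sim in enumerate(similarities[1:], start=1):
        if sim < threshold:
            full_layers.add(layer_id)
    return full_layers

def forward_with_index_cache(layer_scores, k, full_layers):
    """Run a stack of layers: 'full' layers compute fresh top-k indices,
    'shared' layers skip the indexer and reuse the cached selection."""
    cached = None
    selections = []
    for layer_id, scores in enumerate(layer_scores):
        if layer_id in full_layers or cached is None:
            cached = select_topk_indices(scores, k)  # fresh (quadratic) pass
        selections.append(cached)  # shared layers just reuse `cached`
    return selections
```

With one full layer out of four, three of four indexer passes are skipped, which is the intuition behind the "up to 75% of indexer work removed" figure; the real system determines the placement per model via calibration rather than a fixed ratio.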
For businesses, the practical payoff is immediate. On a 30‑billion‑parameter GLM model, IndexCache cuts prefill latency from 19.5 seconds to 10.7 seconds and lifts token‑per‑second throughput from 58 to 86, translating into roughly 20% lower compute spend for long‑context workloads. The open‑source patches integrate seamlessly with popular inference engines such as vLLM and SGLang, reducing engineering overhead. As AI products increasingly rely on real‑time, document‑heavy interactions, techniques like IndexCache signal a broader shift toward inference‑aware model design, ensuring that scaling model size does not come at the expense of operational efficiency.