LLM System Design Interview #50 - The Rejection Sampling Paradox

AI Interview Prep, May 13, 2026

Key Takeaways

  • A Token Acceptance Rate below roughly 30% nullifies speculative decoding speed gains
  • Align draft model via logits distillation to reduce KL divergence
  • Fine‑tune draft on target domain data to boost acceptance
  • Dynamically adjust lookahead length based on real‑time acceptance metrics
  • Monitoring acceptance rate is essential for latency‑critical AI services
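
The last two takeaways (monitoring acceptance and adjusting lookahead) can be sketched together as a small controller. This is a minimal illustration, not a production policy: the class name `LookaheadController`, the window size, the bounds on the draft length, and the 0.7 upper threshold are all assumptions made for the example. Only the roughly 30% lower threshold comes from the article.

```python
from collections import deque


class LookaheadController:
    """Tracks a rolling token acceptance rate and adapts the draft length.

    All defaults are illustrative assumptions except `low`, which mirrors
    the ~30% threshold quoted in the article.
    """

    def __init__(self, k_init=4, k_min=1, k_max=8, window=50,
                 low=0.3, high=0.7):
        self.k = k_init                      # current number of draft tokens
        self.k_min, self.k_max = k_min, k_max
        self.low, self.high = low, high
        # One (accepted, proposed) pair per verification step.
        self.history = deque(maxlen=window)

    def acceptance_rate(self):
        """Accepted / proposed draft tokens over the window, or None if empty."""
        proposed = sum(p for _, p in self.history)
        accepted = sum(a for a, _ in self.history)
        return accepted / proposed if proposed else None

    def update(self, accepted, proposed):
        """Record one verification step and return the next draft length."""
        self.history.append((accepted, proposed))
        rate = self.acceptance_rate()
        if rate is None:
            return self.k
        if rate < self.low:
            self.k = max(self.k_min, self.k - 1)   # draft is misaligned: back off
        elif rate > self.high:
            self.k = min(self.k_max, self.k + 1)   # draft is reliable: look further
        return self.k
```

A serving loop would call `update(accepted, proposed)` after each verification pass and export `acceptance_rate()` as a metric alongside latency and GPU utilization.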

Pulse Analysis

Speculative decoding promises to roughly double inference throughput by letting a lightweight draft model propose several tokens that the heavyweight target model then verifies in a single forward pass. In theory, pairing a 1‑billion‑parameter draft with a 70‑billion‑parameter target should cut latency in half, but only if the target accepts most of the draft’s proposals. When the draft’s probability distribution diverges from the target’s, as reflected in a high KL divergence, the target rejects drafted tokens early in each block and discards everything after the first rejection. Generation then advances about one token per target pass, as in plain autoregressive decoding, while still paying for the draft, and the speed advantage disappears.
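
The verification step behind the "rejection sampling" in the title can be sketched as follows. This follows the commonly described speculative sampling rule (accept a drafted token with probability min(1, p/q), otherwise resample from the normalized positive part of p − q and stop); the article itself does not spell out the rule, so treat the function name `verify_draft` and the list-based data layout as assumptions for illustration.

```python
import random


def verify_draft(draft_tokens, q_dists, p_dists, rng=random):
    """Return the tokens kept after verifying one drafted block.

    draft_tokens: token ids proposed by the draft model
    q_dists[i]:   draft distribution (list of probabilities) that token i
                  was sampled from
    p_dists[i]:   target distribution at the same position; p_dists holds
                  one extra entry used for a bonus token when every
                  drafted token is accepted
    """
    kept = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            kept.append(tok)
            continue
        # First rejection: resample from the residual max(0, p - q),
        # then discard every later drafted token.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        kept.append(rng.choices(range(len(p)), weights=residual)[0])
        return kept
    # Every drafted token was accepted: sample one bonus token from the target.
    bonus = p_dists[-1]
    kept.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return kept
```

Each call yields between 1 and `len(draft_tokens) + 1` tokens for one target pass, which is why a poorly aligned draft collapses toward one token per pass.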

The hidden metric that determines success is the Token Acceptance Rate: the fraction of drafted tokens that the target model accepts. Below roughly 30 percent, the overhead of running both models outweighs the gains, and speculative decoding becomes a latency penalty. Engineers must therefore focus on statistical alignment rather than raw hardware speed. Direct logits distillation trains the draft to match the target’s output distribution, which lowers KL divergence. Domain‑specific fine‑tuning is equally important: a draft trained on generic web text will stumble on specialized code or scientific data, leading to frequent rejections. Adaptive lookahead, which adjusts the number of drafted tokens based on the real‑time acceptance rate, further safeguards performance.
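
To make the alignment objective concrete, here is a minimal pure‑Python sketch of a logits distillation loss. The direction of the divergence (KL from target to draft), the temperature parameter, and all function names are assumptions for illustration; the article does not specify them. The helper `expected_acceptance` uses the standard result that, under the min(1, p/q) acceptance rule, the per‑position acceptance probability is the overlap Σ min(p, q), which is why closer distributions mean fewer rejections.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(target_logits, draft_logits, temperature=1.0):
    """Mean per-position KL(target || draft) over a sequence of logit vectors."""
    losses = [
        kl_divergence(softmax(t, temperature), softmax(d, temperature))
        for t, d in zip(target_logits, draft_logits)
    ]
    return sum(losses) / len(losses)


def expected_acceptance(p, q):
    """Probability that a token drawn from q is accepted against target p."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))
```

In a real training loop the same loss would be computed on tensors with an autodiff framework and minimized with respect to the draft model's parameters, ideally on prompts drawn from the target domain.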

For AI infrastructure teams, monitoring acceptance rates becomes as critical as tracking GPU utilization. A disciplined pipeline that measures, distills, and fine‑tunes drafts can reclaim the promised 2× speedup, reducing cloud spend and meeting strict latency SLAs. The lesson also resonates in hiring: candidates who recognize the statistical nature of the bottleneck demonstrate the depth of expertise prized by leading labs like DeepMind. As speculative decoding matures, alignment‑first strategies will define the next wave of efficient large‑model deployments.
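
How the acceptance rate translates into speedup can be estimated with a simplified model. The sketch below assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per step, and that one draft forward pass costs `cost_ratio` times one target pass. These parameter names and the independence assumption are illustrative and do not come from the article; real workloads will deviate from them.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens produced per target pass under i.i.d. acceptance.

    alpha: per-token acceptance probability (0..1)
    gamma: number of drafted tokens per step (lookahead length)
    """
    if alpha >= 1.0:
        return gamma + 1  # all drafted tokens plus the bonus token
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def estimated_speedup(alpha, gamma, cost_ratio):
    """Estimated speedup over plain autoregressive decoding.

    One step costs gamma draft passes plus one target pass, measured in
    units of a single target pass. Values below 1.0 mean a slowdown.
    """
    step_cost = gamma * cost_ratio + 1
    return expected_tokens_per_step(alpha, gamma) / step_cost
```

Sweeping `alpha` for a measured `cost_ratio` gives a deployment‑specific break‑even point, which may sit above or below the article's rough 30 percent figure depending on the lookahead length and the relative cost of the draft.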
