LLM System Design Interview #40 - The Expert Capacity Paradox

AI Interview Prep, May 3, 2026

Key Takeaways

  • Expert capacity caps trigger token dropping in Mixture‑of‑Experts (MoE) models.
  • Dropped tokens bypass the expert MLP and pass through the residual connection unchanged.
  • Batch composition determines which experts overflow, so the same prompt can produce different outputs.
  • Raising the capacity factor or using drop‑free routing with better load balancing removes this source of nondeterminism.

Pulse Analysis

Mixture‑of‑Experts (MoE) architectures have become a cornerstone for scaling large language models: a router sends each token to a small subset of specialized expert MLPs rather than through one dense MLP. In theory, setting the sampling temperature to zero should make outputs deterministic, a requirement for many enterprise applications. However, the inference engine must also respect hardware constraints such as GPU memory and compute limits. To keep per‑expert buffers at a fixed size and prevent out‑of‑memory crashes, many MoE systems set an expert capacity factor, a multiplier that determines the expert capacity: a hard ceiling on the number of tokens each expert can process per forward pass. When a batch contains a surge of tokens that all favor the same expert, that expert reaches its limit and any surplus tokens are dropped from its computation.
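
To make the ceiling concrete, here is a minimal sketch of one common way to derive expert capacity from the capacity factor, followed by a toy top‑1 assignment that drops overflow. The function names, the `top_k` multiplier, and the first‑come‑first‑served order are illustrative assumptions; real inference engines may compute capacity differently or prioritize tokens by router score.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor, top_k=1):
    """Token slots per expert for one forward pass.

    Assumed formula: capacity_factor * (tokens * top_k / experts), rounded up.
    """
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)

def assign_with_capacity(expert_choices, num_experts, capacity):
    """Give each token a slot on its chosen expert, in batch order.

    Tokens that arrive after the expert is full are dropped.
    Returns (kept, dropped) lists of token indices.
    """
    load = [0] * num_experts
    kept, dropped = [], []
    for token_idx, expert in enumerate(expert_choices):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token_idx)
        else:
            dropped.append(token_idx)
    return kept, dropped

# 8 tokens, 4 experts, capacity factor 1.25 -> ceil(2.5) = 3 slots per expert.
capacity = expert_capacity(num_tokens=8, num_experts=4, capacity_factor=1.25)

# Hypothetical router decisions: five of the eight tokens favor expert 0.
choices = [0, 0, 0, 0, 0, 1, 2, 3]
kept, dropped = assign_with_capacity(choices, num_experts=4, capacity=capacity)

print(capacity)  # 3
print(dropped)   # [3, 4]
```

With perfectly balanced routing each expert would receive two tokens, so a factor of 1.25 leaves one spare slot per expert; the fourth and fifth tokens that pick expert 0 still overflow.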

The dropped tokens do not undergo the expert’s transformation; instead, they travel straight through the residual connection, so the MoE layer adds nothing to their hidden state, as if the expert’s output had been multiplied by zero. Because batches are assembled from many users’ requests, the competition for any given expert fluctuates second by second. Consequently, the same prompt can be fully processed at one moment and partially dropped at another, producing subtly different completions even with temperature = 0. For enterprise clients, this hidden stochasticity can break reproducibility guarantees, complicate debugging, and erode confidence in AI‑driven services.
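
The batch dependence can be illustrated with a toy layer. This is a sketch under simplifying assumptions: hidden states are single floats, the two "experts" are hypothetical stand‑ins for expert MLPs, routing is top‑1, and slots are granted in batch order.

```python
def moe_layer(hidden, expert_choices, experts, capacity):
    """Top-1 MoE layer with a per-expert capacity limit.

    Routed tokens get residual + expert output; dropped tokens keep
    only the residual, i.e. they leave the layer unchanged.
    """
    load = [0] * len(experts)
    out = []
    for x, e in zip(hidden, expert_choices):
        if load[e] < capacity:
            load[e] += 1
            out.append(x + experts[e](x))  # residual + expert update
        else:
            out.append(x)                  # dropped: expert contributes nothing
    return out

# Toy stand-ins for two expert MLPs (illustrative only).
experts = [lambda x: 2.0 * x, lambda x: -1.0 * x]

our_token, our_choice = 1.0, 0  # the same token, routed to expert 0 both times

# Batch A: the other users' tokens route to expert 1, so expert 0 has room.
batch_a = moe_layer([3.0, 4.0, our_token], [1, 1, our_choice], experts, capacity=2)

# Batch B: the other users' tokens fill expert 0 before our token arrives.
batch_b = moe_layer([3.0, 4.0, our_token], [0, 0, our_choice], experts, capacity=2)

print(batch_a[-1])  # 3.0 (processed: 1.0 + 2.0 * 1.0)
print(batch_b[-1])  # 1.0 (dropped: residual only)
```

The token and the weights are identical in both calls; only the neighbors differ. In a real model that difference in hidden state propagates through later layers and can change which token wins the argmax, even at temperature 0.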

Mitigating the expert capacity paradox requires a trade‑off between efficiency and determinism. Operators can increase the capacity factor so that every expert is padded with enough slots to avoid overflow, though this consumes additional memory and FLOPs. More sophisticated solutions involve drop‑free routing algorithms and dynamic, sequence‑wise load balancing, as seen in newer models like DeepSeek‑V3. Implementing these strategies removes batch‑dependent token dropping as a source of variance, helps meet service‑level agreements, and positions organizations to safely scale MoE deployments in production environments.
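
The cost of the simplest fix, raising the capacity factor, can be estimated with a short calculation. The sketch below assumes top‑1 routing and that every expert is padded to the same capacity; the helper name and the routing decisions are hypothetical.

```python
from collections import Counter

def min_drop_free_capacity(expert_choices):
    """Smallest per-expert capacity at which no token in this batch is dropped."""
    return max(Counter(expert_choices).values())

# Hypothetical router decisions: five of eight tokens favor expert 0.
choices = [0, 0, 0, 0, 0, 1, 2, 3]
num_experts = 4

cap = min_drop_free_capacity(choices)      # busiest expert holds 5 tokens
factor = cap * num_experts / len(choices)  # capacity factor that yields that cap
slots = num_experts * cap                  # slots computed if all experts are padded
wasted = slots - len(choices)              # padded slots that carry no token

print(cap, factor, slots, wasted)  # 5 2.5 20 12
```

For this skewed batch, guaranteeing no drops through padding alone means computing 20 expert slots for 8 tokens. That overhead is the efficiency side of the trade‑off, and it is why drop‑free routing and better load balancing are attractive alternatives.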
