
ReMix: Reinforcement Routing for Mixtures of LoRAs in LLM Finetuning

Key Takeaways
- Routing weight collapse limits Mixture-of-LoRAs efficiency.
- ReMix fixes weights and uses reinforcement learning for router training.
- Top‑k deterministic selection yields optimal inference performance.
- Outperforms SOTA on GSM8K, HumanEval, ARC‑c benchmarks.
- Scales with more adapters and compute, using fewer parameters.
Summary
The paper introduces ReMix, a reinforcement‑learning based routing strategy for Mixture‑of‑LoRAs that eliminates the common “routing weight collapse” where a single adapter dominates. By assigning constant, equal weights to all activated adapters and training the router as a policy, ReMix forces diverse adapter usage. During inference it selects the top‑k adapters deterministically, delivering higher accuracy with fewer trainable parameters. Benchmarks across GSM8K, HumanEval and ARC‑c show consistent gains over existing parameter‑efficient fine‑tuning methods.
Pulse Analysis
Mixture‑of‑LoRAs has become a popular approach for parameter‑efficient fine‑tuning, allowing multiple low‑rank adapters to specialize on different input types. However, the learned routing scores often converge to a single dominant adapter, a phenomenon known as routing weight collapse. This bottleneck wastes the capacity of the remaining adapters and hampers the model’s ability to capture nuanced patterns, especially in tasks that benefit from heterogeneous expertise such as mathematical reasoning or code synthesis.
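To make the failure mode concrete, here is a minimal sketch of a conventional Mixture‑of‑LoRAs layer with a gradient‑trained softmax router, the setup in which collapse arises. All names, shapes, and initialization choices below are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class MixtureOfLoRAs(nn.Module):
    """Illustrative Mixture-of-LoRAs layer with a learned softmax router.

    Because the routing weights are trained by gradient descent, the softmax
    scores can concentrate nearly all mass on one adapter ("routing weight
    collapse"), leaving the remaining adapters effectively unused.
    """

    def __init__(self, d_in, d_out, num_adapters=4, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)           # stands in for the frozen pretrained weight
        self.base.requires_grad_(False)
        self.router = nn.Linear(d_in, num_adapters)  # learned per-token routing scores
        # Low-rank adapter factors: delta_W_e = A_e @ B_e for each adapter e
        self.A = nn.Parameter(torch.randn(num_adapters, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_adapters, rank, d_out))

    def forward(self, x):
        # x: (batch, d_in)
        weights = torch.softmax(self.router(x), dim=-1)              # (batch, num_adapters)
        # Per-adapter low-rank updates: (batch, num_adapters, d_out)
        delta = torch.einsum("bd,edr,ero->beo", x, self.A, self.B)
        mixed = torch.einsum("be,beo->bo", weights, delta)           # weighted adapter mix
        return self.base(x) + mixed
```

In this formulation nothing prevents `weights` from saturating toward a one‑hot vector during training, which is exactly the degeneracy ReMix removes by taking the weights out of the learnable set.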
ReMix tackles the collapse by discarding learnable routing weights altogether. Instead, every activated adapter receives a fixed, equal contribution, guaranteeing that the computational budget is shared. Because the routing decisions are no longer differentiable, the authors recast router training as a reinforcement‑learning problem, using the negative supervised loss as the reward and applying the Reinforce Leave‑One‑Out (RLOO) variance‑reduction technique. During inference, the system switches to a deterministic top‑k selection, which the authors prove is optimal once the policy is well‑trained. This design preserves the flexibility of mixture models while eliminating the instability of gradient‑based routing.
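The two mechanisms above can be sketched compactly: an RLOO advantage estimate for the policy‑gradient router update, and a deterministic top‑k selection with fixed equal weights at inference. This is a hedged reconstruction from the description in the text; the function names and the exact update step are assumptions, not the authors' code:

```python
import torch

def rloo_advantages(rewards):
    """Reinforce Leave-One-Out baseline: each sample's advantage is its
    reward minus the mean reward of the *other* k-1 samples in the group."""
    k = rewards.shape[0]
    baseline = (rewards.sum() - rewards) / (k - 1)   # leave-one-out mean
    return rewards - baseline

def router_policy_loss(logprobs, losses):
    """One illustrative policy-gradient step: sample k adapter subsets from
    the router, score each with the negative supervised loss as its reward,
    and weight the sampled log-probabilities by their RLOO advantages."""
    rewards = -losses                                # negative supervised loss as reward
    adv = rloo_advantages(rewards)
    return -(adv.detach() * logprobs).mean()         # REINFORCE-style surrogate

def topk_inference(router_logits, k=2):
    """Deterministic top-k selection at inference: the k highest-scoring
    adapters are activated, each with the same fixed weight 1/k."""
    idx = router_logits.topk(k, dim=-1).indices
    weights = torch.full_like(idx, 1, dtype=router_logits.dtype) / k
    return idx, weights
```

Note that the leave‑one‑out baseline makes the advantages in each group sum to zero, which is the variance‑reduction property that lets the router learn from relative adapter quality rather than absolute loss scale.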
Empirical results validate ReMix’s advantages: on GSM8K, HumanEval, and ARC‑c, the method achieves higher accuracy with a fraction of the trainable parameters compared to prior Mixture‑of‑LoRAs and other PEFT techniques. Moreover, performance scales smoothly with additional compute and more activated adapters, demonstrating that the approach effectively leverages diverse adapter combinations. For enterprises deploying LLMs, ReMix offers a cost‑effective path to tailor models for niche domains without incurring the heavy parameter overhead typical of full fine‑tuning, positioning it as a compelling tool in the evolving AI‑ops toolkit.