
The approach demonstrates that efficient inference can be retrofitted onto existing large reasoning models without full retraining, offering a cost‑effective path for enterprises seeking faster LLM deployment.
The Apriel‑H1 effort reshapes how organizations think about model efficiency. Rather than rebuilding a reasoning model from scratch, ServiceNow‑AI showed that strategic distillation, focused on the teacher's SFT data where multi‑step reasoning is explicit, preserves the nuanced attention patterns that drive high‑quality answers. By replacing attention layers with linear‑complexity Mamba mixers and training with reverse KL divergence, the student model inherits the teacher's confident predictions while shedding the quadratic cost of attention, especially for long contexts.
A key operational insight is the staged replacement methodology. Initial leave‑one‑out analysis pinpoints layers that contribute little to performance, allowing a safe first wave of Mamba integration. Subsequent progressive substitution, guided by a loss‑based metric, respects inter‑layer dependencies and prevents the degradation that a naïve bulk swap would cause. The final fine‑tuning on SFT data consolidates reasoning abilities, producing a model that matches or exceeds the teacher on benchmarks such as MATH‑500, MTBench, and GSM8k while delivering up to 2.1× higher throughput.
The open‑source Fast‑LLM framework underpins reproducibility, offering modular configurations that treat attention and Mamba as interchangeable mixers. This modularity not only accelerates research cycles but also lowers the barrier for enterprises to adopt hybrid architectures tailored to their latency and compute constraints. As the industry grapples with the trade‑off between model size and inference cost, Apriel‑H1 provides a pragmatic template: leverage existing strong models, select distillation data that mirrors the target capability, and employ systematic layer replacement to achieve efficient, high‑performing reasoning systems.
Community Article · Published November 19, 2025
Authors: Torsten Scholak (tscholak), Oleksiy Ostapenko (ostapeno), Raymond Li (RaymondLi), Luke Kumar (nitsanluke), Joel Lamy‑Poirier (jlamypoirier) – ServiceNow‑AI
We converted our 15 B reasoning model to a Mamba hybrid achieving 2.1× throughput with minimal quality loss. The key? A non‑obvious insight about what data to distill on, and why intuition fails here.
When MiniMax published their M2 post‑mortem in October explaining why they abandoned efficient attention at 230 B scale, the narrative briefly became “efficient attention is dead.” Within days, Kimi Linear proved otherwise. The real lesson: it depends on your constraints.
Our constraint was simple: we had a strong 15 B reasoning model and needed to make it efficient without starting over. No infinite compute for 20 T‑token pretraining. No luxury of architectural co‑design from day one. Just a practical question: can you retrofit efficiency into an existing model through distillation?
Spoiler: yes, but only if you're willing to ignore your intuition about what data to use.
The Apriel‑H1 family (seven checkpoints spanning 25‑40 Mamba layers out of 50 total) shows the complete efficiency‑quality frontier. Our flagship Apriel‑H1‑15b‑Thinker‑SFT achieves 2.1× throughput with minimal quality loss:
| Benchmark | Teacher → Student | Δ |
|-----------|-------------------|---|
| MATH‑500 | 0.90 → 0.92 | +0.02 |
| MTBench | 8.30 → 8.58 | +0.28 |
| GSM8k | 0.97 → 0.95 | –0.02 |
| GPQA | 0.59 → 0.55 | –0.04 |
| AIME24 | 0.70 → 0.65 | –0.05 |
Total training: 76.8 B tokens.
Apriel‑H1‑15b‑Thinker‑SFT (green) vs full‑attention teacher (blue). Reasoning quality stays nearly flat across benchmarks while throughput increases 1.89‑2.09× depending on context length.
Full details are in our Apriel‑H1 paper. Below we focus on the key insight that made it work.
What we initially thought would work: distill on pretraining data and round it out with some SFT.
The reasoning seemed solid. We were inserting completely new Mamba layers that had never seen data. These linear SSMs need to learn general‑purpose token mixing from scratch. How could they become effective mixers unless they got exposure to the same broad distribution the original attention layers saw?
So we tried it—mixing pretraining and SFT data. It didn’t work. The distilled hybrids lost reasoning quality, sometimes dramatically.
What actually worked: high‑quality reasoning traces from the teacher’s SFT dataset.
Distilling a reasoning model isn’t about transferring general next‑token prediction. The base model already has that, and we started from a strong 15 B foundation. What we’re preserving is specific and fragile: the teacher’s multi‑step reasoning patterns.
Those patterns emerge from intricate attention mechanisms—retrieval heads pulling context from thousands of tokens, induction heads recognizing and continuing logical chains, long‑range dependencies connecting premises to conclusions many steps later. When you replace attention wholesale with Mamba’s linear recurrence, these computational mechanisms are disrupted. The hybrid must discover new paths to the same reasoning outcomes.
That discovery requires explicit examples where reasoning structure is visible and correct:
- Multi‑step math proofs where each thought follows from the previous
- Coding tasks with clear logical dependencies
- Scientific analysis with detailed explanatory chains
Pretraining data, on the other hand, is too noisy and too diffuse. The reasoning signal gets lost. You need concentrated examples of the specific capability you’re trying to preserve.
Once we understood the data choice, our distillation method became clear too. We used reverse KL divergence (temperature 1) rather than forward KL. Reverse KL consistently won because we’re training on problems where the teacher has high confidence and clear structure. Reverse KL’s mode‑seeking behavior encourages the student to commit to those high‑confidence predictions. When your teacher is confident and correct, you want your student to be confident too.
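To make the direction of the divergence concrete, here are the two token‑level objectives in standard notation (teacher distribution p, student q_θ, vocabulary V); this is the textbook definition, not code from our training stack:

```latex
% Teacher p, student q_\theta, vocabulary V; losses at token position t.
\mathcal{L}_{\mathrm{fwd}}(t) = \sum_{v \in V} p(v \mid x_{<t}) \,\log \frac{p(v \mid x_{<t})}{q_\theta(v \mid x_{<t})}
\qquad\text{(forward KL, mass-covering)}

\mathcal{L}_{\mathrm{rev}}(t) = \sum_{v \in V} q_\theta(v \mid x_{<t}) \,\log \frac{q_\theta(v \mid x_{<t})}{p(v \mid x_{<t})}
\qquad\text{(reverse KL, mode-seeking)}
```

Because the reverse direction is weighted by the student's own probabilities, the cheapest way to lower it is to concentrate mass on the teacher's dominant prediction, which is exactly the behavior we want when the teacher is confident and correct.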
Key takeaway: match your distillation data to the capability you’re preserving, not the capability you’re building.
You can’t just swap 40 attention layers for Mamba and hope. We learned this the hard way and eventually developed a staged distillation procedure to get there reliably.
We performed a Leave‑One‑Out (LOO) analysis on MMLU: replace each layer, one at a time, with an identity mapping and measure the resulting drop in accuracy. After ranking layers by this importance score, we replaced the 25 least important ones with Mamba‑in‑Llama (MIL)‑initialized mixers and distilled end‑to‑end. This produced our H‑25 checkpoint.
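In pseudocode, the LOO scoring looks roughly like this (a simplified sketch; `replace_with_identity` and `evaluate` stand in for the actual ablation and MMLU evaluation machinery, which are not shown here):

```python
def loo_importance(model, replace_with_identity, evaluate, num_layers=50):
    """Score each layer by how much accuracy drops when its mixer is skipped."""
    baseline = evaluate(model)
    scores = {}
    for i in range(num_layers):
        ablated = replace_with_identity(model, i)  # copy of the model with layer i's mixer -> identity
        scores[i] = baseline - evaluate(ablated)   # larger drop = more important layer
    return scores

# Usage (with the hypothetical helpers above):
# scores = loo_importance(teacher, replace_mixer_with_identity, evaluate_mmlu)
# first_wave = sorted(scores, key=scores.get)[:25]   # 25 least important layers -> Mamba
```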
LOO broke down past 25 layers because layers that were unimportant in isolation became critical in combination. To address this, we introduced MIL‑Mamba‑Replacement (MMR) scoring (sketched below):
1. For each remaining attention layer, initialize a Mamba mixer with MIL.
2. Run 100 training steps and record the distillation loss.
3. Layers converging to a lower loss are "easier" to replace.
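A minimal sketch of that scoring loop, assuming hypothetical `init_mil_at_layer` and `short_distill` helpers rather than our actual Fast‑LLM jobs:

```python
def mmr_rank(student, remaining_layers, init_mil_at_layer, short_distill):
    """Rank remaining attention layers by how easily a MIL-initialized Mamba
    mixer can imitate them: lower loss after a short run = easier to replace."""
    losses = {}
    for i in remaining_layers:
        candidate = init_mil_at_layer(student, i)             # swap layer i's attention for a MIL-initialized Mamba mixer
        losses[i] = short_distill(candidate, num_steps=100)   # final distillation loss after 100 steps
    return sorted(losses, key=losses.get)                     # easiest-to-replace layers first
```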
We progressed incrementally: 25 → 27 → 30 → 34 → 37 → 40 Mamba layers, grouping replacements by MMR scores. Each checkpoint distilled from the previous one.
After reaching the target Mamba layer count, we performed a final SFT pass until reasoning performance stabilized. After 55.9 B distillation tokens and 20.9 B SFT tokens, we obtained the final Apriel‑H1‑15b‑Thinker‑SFT model.
The complete efficiency frontier. Each checkpoint shows cumulative training tokens. Our flagship H‑30‑SFT (released as Apriel‑H1‑15b‑Thinker‑SFT) used 76.8 B total for 2.1× throughput at 0.76 average score. The aggressively converted H‑40 variant used 136.5 B tokens for 3.4× throughput. For reference, NVIDIA’s Nemotron‑Nano‑9B‑v2 achieves 4.6× at 0.77 score but required training from scratch with orders of magnitude more compute.
We built all this on Fast‑LLM, our open‑source training framework. The core architectural principle: large language model transformers should be modular. Attention and Mamba are different implementations of the same “mixing” interface and can be swapped freely.
```yaml
decoder:
  type: "pattern"
  blocks:
    attention_block:
      mixer:
        type: "attention"
        heads: 32
        head_groups: 8
        head_size: 128
      mlp:
        type: "gated"
        activation: "silu"
    mamba_block:
      mixer:
        type: "mamba"
        d_inner: 4096
        state_size: 16
        dt_rank: 16
      mlp:
        type: "gated"
        activation: "silu"
  num_blocks: 50
  pattern: ["attention_block", "attention_block", "mamba_block", ...]  # specify order
```
For Apriel‑H1‑15b‑Thinker‑SFT, the pattern assigns 30 of the 50 blocks to mamba_block and the remaining 20 to attention_block, with placement determined by layer importance.
```yaml
model:
  base_model:
    head:
      distillation_model: teacher
      distillation_loss_implementation: reverse_kl
reference_models:
  teacher:
    pretrained:
      format: mistral
      path: path/to/Apriel-Nemotron-15b-Thinker
```
Fast‑LLM handles gradient accumulation, distributed training, tensor parallelism, checkpointing, and everything needed for large‑scale experimentation. It is open source (Apache 2.0), so you can reproduce this work.
Why release all checkpoints?
Because the optimal choice depends on your constraints. H‑30 offers the best balance; H‑40 maximizes throughput for latency‑critical workloads. Intermediate checkpoints let you pick the exact trade‑off you need.
Why do speedups vary with context length?
Mamba's linear‑complexity advantage grows with sequence length, while attention's compute and memory costs grow quadratically.
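In rough asymptotic terms (ignoring constants and hardware effects):

```latex
% Decode cost at context position n:
\text{attention: } \mathcal{O}(n) \text{ per token} \;\Rightarrow\; \mathcal{O}(n^2) \text{ for the full sequence}
\qquad
\text{Mamba: } \mathcal{O}(1) \text{ per token} \;\Rightarrow\; \mathcal{O}(n) \text{ for the full sequence}
```

Attention also carries a KV cache that grows with context length, whereas Mamba keeps a fixed‑size recurrent state, which is why the throughput gap widens at longer contexts and larger batches.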
Why only Mamba?
We used Mamba‑1 for three reasons: proven distillation track record, strong empirical performance, and simplicity of implementation. This let us focus on the data question first.
What were the Mamba hyper‑parameters?
State size = 16, dt rank = 16, inner dimension = 4096. For our GQA setup, we expanded the input projection B and the state x to match the total number of attention heads, following M1.
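For intuition, a MIL‑style initialization reuses the attention projections to seed the Mamba mixer (queries → C, keys → B, values → x, attention output projection → Mamba output projection). The sketch below is a hedged illustration of that mapping with illustrative attribute names and the GQA expansion done by repeating each KV head's block; it is not our exact initialization code:

```python
def mil_init_from_attention(attn, n_heads=32, n_kv_heads=8):
    """Sketch of MIL-style weight reuse to seed a Mamba mixer from a GQA attention layer.
    Attribute names (q_proj, k_proj, v_proj, o_proj) are illustrative, not Fast-LLM's."""
    repeat = n_heads // n_kv_heads                     # 4 query heads per KV head in this setup
    kv_dim, hidden = attn.k_proj.weight.shape          # (n_kv_heads * head_dim, hidden)
    head_dim = kv_dim // n_kv_heads

    def expand(w):
        # Repeat each KV head's projection block so every query head gets a copy.
        return (w.view(n_kv_heads, head_dim, hidden)
                 .repeat_interleave(repeat, dim=0)
                 .reshape(n_heads * head_dim, hidden))

    return {
        "C": attn.q_proj.weight.clone(),               # C projection   <- queries
        "B": expand(attn.k_proj.weight),               # B projection   <- keys, expanded for GQA
        "x": expand(attn.v_proj.weight),               # x projection   <- values, expanded for GQA
        "out_proj": attn.o_proj.weight.clone(),        # output proj    <- attention output projection
    }
```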
Why not try more advanced conversion methods?
We used Mamba‑in‑Llama initialization and knowledge distillation rather than the multi‑stage procedure from MOHAWK (paper) because preliminary experiments showed no significant advantage.
Why only SFT the H‑30 model?
We applied SFT solely to H‑30 because it provided the best balance of efficiency and quality for downstream tasks; further conversion stages did not benefit noticeably from additional SFT.