
The approach demonstrates that efficient inference can be retrofitted onto existing large reasoning models without full retraining, offering a cost‑effective path for enterprises seeking faster LLM deployment.
The Apriel‑H1 effort reshapes how organizations think about model efficiency. Rather than rebuilding a reasoning model from scratch, ServiceNow‑AI showed that strategic distillation, focused on the teacher's SFT data where multi‑step reasoning is explicit, preserves the nuanced attention patterns that drive high‑quality answers. By replacing attention layers with linear‑complexity Mamba mixers and training with reverse KL divergence, the student model inherits the teacher's confident predictions while shedding the quadratic cost of attention, especially for long contexts.
A key operational insight is the staged replacement methodology. Initial leave‑one‑out analysis pinpoints layers that contribute little to performance, allowing a safe first wave of Mamba integration. Subsequent progressive substitution, guided by a loss‑based metric, respects inter‑layer dependencies and prevents the degradation that a naïve bulk swap would cause. The final fine‑tuning on SFT data consolidates reasoning abilities, producing a model that matches or exceeds the teacher on benchmarks such as MATH‑500, MTBench, and GSM8k while delivering up to 2.1× higher throughput.
The open‑source Fast‑LLM framework underpins reproducibility, offering modular configurations that treat attention and Mamba as interchangeable mixers. This modularity not only accelerates research cycles but also lowers the barrier for enterprises to adopt hybrid architectures tailored to their latency and compute constraints. As the industry grapples with the trade‑off between model size and inference cost, Apriel‑H1 provides a pragmatic template: leverage existing strong models, select distillation data that mirrors the target capability, and employ systematic layer replacement to achieve efficient, high‑performing reasoning systems.
Community Article · Published November 19, 2025
Authors: Torsten Scholak (tscholak), Oleksiy Ostapenko (ostapeno), Raymond Li (RaymondLi), Luke Kumar (nitsanluke), Joel Lamy‑Poirier (jlamypoirier) – ServiceNow‑AI
We converted our 15 B reasoning model to a Mamba hybrid achieving 2.1× throughput with minimal quality loss. The key? A non‑obvious insight about what data to distill on, and why intuition fails here.
When MiniMax published their M2 post‑mortem in October explaining why they abandoned efficient attention at 230 B scale, the narrative briefly became “efficient attention is dead.” Within days, Kimi Linear proved otherwise. The real lesson: it depends on your constraints.
Our constraint was simple: we had a strong 15 B reasoning model and needed to make it efficient without starting over. No infinite compute for 20 T‑token pretraining. No luxury of architectural co‑design from day one. Just a practical question: can you retrofit efficiency into an existing model through distillation?
Spoiler: yes, but only if you're willing to ignore your intuition about what data to use.
The Apriel‑H1 family (seven checkpoints spanning 25‑40 Mamba layers out of 50 total) shows the complete efficiency‑quality frontier. Our flagship Apriel‑H1‑15b‑Thinker‑SFT achieves 2.1× throughput with minimal quality loss:
| Benchmark | Teacher → Student | Δ |
|-----------|-------------------|---|
| MATH‑500 | 0.90 → 0.92 | +0.02 |
| MTBench | 8.30 → 8.58 | +0.28 |
| GSM8k | 0.97 → 0.95 | –0.02 |
| GPQA | 0.59 → 0.55 | –0.04 |
| AIME24 | 0.70 → 0.65 | –0.05 |
Total training: 76.8 B tokens.
Apriel‑H1‑15b‑Thinker‑SFT (green) vs full‑attention teacher (blue). Reasoning quality stays nearly flat across benchmarks while throughput increases 1.89‑2.09× depending on context length.
Full details are in our Apriel‑H1 paper. Below we focus on the key insight that made it work.
What we initially thought would work: distill on pretraining data and round it out with some SFT.
The reasoning seemed solid. We were inserting completely new Mamba layers that had never seen data. These linear SSMs need to learn general‑purpose token mixing from scratch. How could they become effective mixers unless they got exposure to the same broad distribution the original attention layers saw?
So we tried it—mixing pretraining and SFT data. It didn’t work. The distilled hybrids lost reasoning quality, sometimes dramatically.
What actually worked: high‑quality reasoning traces from the teacher’s SFT dataset.
Distilling a reasoning model isn’t about transferring general next‑token prediction. The base model already has that, and we started from a strong 15 B foundation. What we’re preserving is specific and fragile: the teacher’s multi‑step reasoning patterns.
Those patterns emerge from intricate attention mechanisms—retrieval heads pulling context from thousands of tokens, induction heads recognizing and continuing logical chains, long‑range dependencies connecting premises to conclusions many steps later. When you replace attention wholesale with Mamba’s linear recurrence, these computational mechanisms are disrupted. The hybrid must discover new paths to the same reasoning outcomes.
That discovery requires explicit examples where reasoning structure is visible and correct:
- Multi‑step math proofs where each thought follows from the previous
- Coding tasks with clear logical dependencies
- Scientific analysis with detailed explanatory chains
Pretraining data, on the other hand, is too noisy and too diffuse. The reasoning signal gets lost. You need concentrated examples of the specific capability you’re trying to preserve.
Once we understood the data choice, our distillation method became clear too. We used reverse KL divergence (temperature 1) rather than forward KL. Reverse KL consistently won because we’re training on problems where the teacher has high confidence and clear structure. Reverse KL’s mode‑seeking behavior encourages the student to commit to those high‑confidence predictions. When your teacher is confident and correct, you want your student to be confident too.
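To make the direction of the divergence concrete, here are the two token‑level objectives in standard notation (teacher distribution p, student q_θ, vocabulary V); this is the textbook definition, not code from our training stack:

```latex
% Teacher p, student q_\theta, vocabulary V; losses at token position t.
\mathcal{L}_{\mathrm{fwd}}(t) = \sum_{v \in V} p(v \mid x_{<t}) \,\log \frac{p(v \mid x_{<t})}{q_\theta(v \mid x_{<t})}
\qquad\text{(forward KL, mass-covering)}

\mathcal{L}_{\mathrm{rev}}(t) = \sum_{v \in V} q_\theta(v \mid x_{<t}) \,\log \frac{q_\theta(v \mid x_{<t})}{p(v \mid x_{<t})}
\qquad\text{(reverse KL, mode-seeking)}
```

Because the reverse direction is weighted by the student's own probabilities, the cheapest way to lower it is to concentrate mass on the teacher's dominant prediction, which is exactly the behavior we want when the teacher is confident and correct.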
Key takeaway: match your distillation data to the capability you’re preserving, not the capability you’re building.
You can’t just swap 40 attention layers for Mamba and hope. We learned this the hard way and eventually developed a staged distillation procedure to get there reliably.
We performed a Leave‑One‑Out (LOO) analysis on MMLU: replace each layer, one at a time, with an identity mapping and measure the resulting drop in accuracy. After ranking layers by this importance score, we replaced the 25 least important ones with Mamba‑in‑Llama (MIL)‑initialized mixers and distilled end‑to‑end. This produced our H‑25 checkpoint.
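In pseudocode, the LOO scoring looks roughly like this (a simplified sketch; `replace_with_identity` and `evaluate` stand in for the actual ablation and MMLU evaluation machinery, which are not shown here):

```python
def loo_importance(model, replace_with_identity, evaluate, num_layers=50):
    """Score each layer by how much accuracy drops when its mixer is skipped."""
    baseline = evaluate(model)
    scores = {}
    for i in range(num_layers):
        ablated = replace_with_identity(model, i)  # copy of the model with layer i's mixer -> identity
        scores[i] = baseline - evaluate(ablated)   # larger drop = more important layer
    return scores

# Usage (with the hypothetical helpers above):
# scores = loo_importance(teacher, replace_mixer_with_identity, evaluate_mmlu)
# first_wave = sorted(scores, key=scores.get)[:25]   # 25 least important layers -> Mamba
```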
LOO broke down past 25 layers because layers that were unimportant in isolation became critical in combination. To address this, we introduced MIL‑Mamba‑Replacement (MMR) scoring (sketched below):
1. For each remaining attention layer, initialize a Mamba mixer with MIL.
2. Run 100 training steps and record the distillation loss.
3. Layers converging to a lower loss are "easier" to replace.
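A minimal sketch of that scoring loop, assuming hypothetical `init_mil_at_layer` and `short_distill` helpers rather than our actual Fast‑LLM jobs:

```python
def mmr_rank(student, remaining_layers, init_mil_at_layer, short_distill):
    """Rank remaining attention layers by how easily a MIL-initialized Mamba
    mixer can imitate them: lower loss after a short run = easier to replace."""
    losses = {}
    for i in remaining_layers:
        candidate = init_mil_at_layer(student, i)             # swap layer i's attention for a MIL-initialized Mamba mixer
        losses[i] = short_distill(candidate, num_steps=100)   # final distillation loss after 100 steps
    return sorted(losses, key=losses.get)                     # easiest-to-replace layers first
```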
We progressed incrementally: 25 → 27 → 30 → 34 → 37 → 40 Mamba layers, grouping replacements by MMR scores. Each checkpoint distilled from the previous one.
After reaching the target Mamba layer count, we performed a final SFT pass until reasoning performance stabilized. After 55.9 B distillation tokens and 20.9 B SFT tokens, we obtained the final Apriel‑H1‑15b‑Thinker‑SFT model.
The complete efficiency frontier. Each checkpoint shows cumulative training tokens. Our flagship H‑30‑SFT (released as Apriel‑H1‑15b‑Thinker‑SFT) used 76.8 B total for 2.1× throughput at 0.76 average score. The aggressively converted H‑40 variant used 136.5 B tokens for 3.4× throughput. For reference, NVIDIA’s Nemotron‑Nano‑9B‑v2 achieves 4.6× at 0.77 score but required training from scratch with orders of magnitude more compute.
We built all this on Fast‑LLM, our open‑source training framework. The core architectural principle: large language model transformers should be modular. Attention and Mamba are different implementations of the same “mixing” interface and can be swapped freely.
```yaml
decoder:
  type: "pattern"
  blocks:
    attention_block:
      mixer:
        type: "attention"
        heads: 32
        head_groups: 8
        head_size: 128
      mlp:
        type: "gated"
        activation: "silu"
    mamba_block:
      mixer:
        type: "mamba"
        d_inner: 4096
        state_size: 16
        dt_rank: 16
      mlp:
        type: "gated"
        activation: "silu"
  num_blocks: 50
  pattern: ["attention_block", "attention_block", "mamba_block", ...]  # specify order
```
For Apriel‑H1‑15b‑Thinker‑SFT, the pattern assigns 30 of the 50 blocks to mamba_block and the remaining 20 to attention_block, with placement determined by layer importance.
```yaml
model:
  base_model:
    head:
      distillation_model: teacher
      distillation_loss_implementation: reverse_kl
reference_models:
  teacher:
    pretrained:
      format: mistral
      path: path/to/Apriel-Nemotron-15b-Thinker
```
Fast‑LLM handles gradient accumulation, distributed training, tensor parallelism, checkpointing, and everything needed for large‑scale experimentation. It is open source (Apache 2.0), so you can reproduce this work.
Why release all checkpoints?
Because the optimal choice depends on your constraints. H‑30 offers the best balance; H‑40 maximizes throughput for latency‑critical workloads. Intermediate checkpoints let you pick the exact trade‑off you need.
Why do speedups vary with context length?
Mamba's linear‑complexity advantage grows with sequence length, while attention's compute and memory costs grow quadratically.
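In rough asymptotic terms (ignoring constants and hardware effects):

```latex
% Decode cost at context position n:
\text{attention: } \mathcal{O}(n) \text{ per token} \;\Rightarrow\; \mathcal{O}(n^2) \text{ for the full sequence}
\qquad
\text{Mamba: } \mathcal{O}(1) \text{ per token} \;\Rightarrow\; \mathcal{O}(n) \text{ for the full sequence}
```

Attention also carries a KV cache that grows with context length, whereas Mamba keeps a fixed‑size recurrent state, which is why the throughput gap widens at longer contexts and larger batches.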
Why only Mamba?
We used Mamba‑1 for three reasons: proven distillation track record, strong empirical performance, and simplicity of implementation. This let us focus on the data question first.
What were the Mamba hyper‑parameters?
State size = 16, dt rank = 16, inner dimension = 4096. For our GQA setup, we expanded the input projection B and the state x to match the total number of attention heads, following M1.
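For intuition, a MIL‑style initialization reuses the attention projections to seed the Mamba mixer (queries → C, keys → B, values → x, attention output projection → Mamba output projection). The sketch below is a hedged illustration of that mapping with illustrative attribute names and the GQA expansion done by repeating each KV head's block; it is not our exact initialization code:

```python
def mil_init_from_attention(attn, n_heads=32, n_kv_heads=8):
    """Sketch of MIL-style weight reuse to seed a Mamba mixer from a GQA attention layer.
    Attribute names (q_proj, k_proj, v_proj, o_proj) are illustrative, not Fast-LLM's."""
    repeat = n_heads // n_kv_heads                     # 4 query heads per KV head in this setup
    kv_dim, hidden = attn.k_proj.weight.shape          # (n_kv_heads * head_dim, hidden)
    head_dim = kv_dim // n_kv_heads

    def expand(w):
        # Repeat each KV head's projection block so every query head gets a copy.
        return (w.view(n_kv_heads, head_dim, hidden)
                 .repeat_interleave(repeat, dim=0)
                 .reshape(n_heads * head_dim, hidden))

    return {
        "C": attn.q_proj.weight.clone(),               # C projection   <- queries
        "B": expand(attn.k_proj.weight),               # B projection   <- keys, expanded for GQA
        "x": expand(attn.v_proj.weight),               # x projection   <- values, expanded for GQA
        "out_proj": attn.o_proj.weight.clone(),        # output proj    <- attention output projection
    }
```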
Why not try more advanced conversion methods?
We used Mamba‑in‑Llama initialization and knowledge distillation rather than the multi‑stage procedure from MOHAWK (paper) because preliminary experiments showed no significant advantage.
Why only SFT the H‑30 model?
We applied SFT solely to H‑30 because it provided the best balance of efficiency and quality for downstream tasks; further conversion stages did not benefit noticeably from additional SFT.