Differential Transformer V2

AI • Hugging Face • January 20, 2026

Companies Mentioned

Microsoft (MSFT)
Why It Matters

DIFF V2 boosts LLM efficiency and training stability without sacrificing speed, offering a practical upgrade for production‑scale transformers.

Key Takeaways

  • Doubles query heads without extra KV heads
  • Removes per-head RMSNorm, stabilizing gradients
  • Achieves decoding speed comparable to standard Transformers
  • Saves ~25% of attention-module parameters
  • Improves language-modeling loss by 0.02–0.03

Pulse Analysis

Differential Transformer V2 builds on the original DIFF concept by re‑engineering the attention block to use twice as many query heads while sharing the same key‑value cache. This architectural tweak aligns query, key, and value dimensions, allowing the use of off‑the‑shelf FlashAttention kernels and avoiding the custom kernels that slowed DIFF V1. The per‑token, per‑head λ parameter, passed through a sigmoid, scales the subtraction between paired attention heads, effectively controlling the context RMS and eliminating the need for a separate RMSNorm layer.

From a performance standpoint, DIFF V2 raises the arithmetic intensity of the attention module during decoding, a phase that is typically memory‑bound for large language models. Because KV heads remain unchanged, the value cache is loaded only once, preserving the throughput of standard Transformers on H‑series and B‑series GPUs. The result is decoding latency that matches baseline models while the output projection matrix stays the same size, delivering roughly a 25% reduction in attention‑module parameters that can be reallocated to other network components.

Empirical results from trillion‑token pre‑training runs, including a 30‑billion‑parameter mixture‑of‑experts model, indicate that DIFF V2 consistently lowers language‑modeling loss by 0.02–0.03 points and curtails gradient spikes even at aggressive learning rates (6e‑4 to 1e‑3). The removal of RMSNorm also mitigates activation outliers, enhancing numerical stability. These gains make DIFF V2 an attractive drop‑in upgrade for organizations seeking faster, more stable LLM training and inference without extensive code rewrites.

Differential Transformer V2

Authors: Tianzhu Ye, Li Dong, Yutao Sun, Furu Wei

GitHub: https://github.com/microsoft/unilm/blob/master/Diff-Transformer/Diff-Transformer-V2


Code

We compare DIFF V2 with DIFF V1 below. For simplicity, we omit the batch dimension and assume that both the inputs and the output of flash_attn_func are three‑dimensional tensors of shape (tokens, heads, head dimension); heads belonging to the same GQA group are arranged contiguously in the kernel output.


def DiffAttnV1(
        layer_index, q1, q2, k1, k2, v,
        lam_q1, lam_k1, lam_q2, lam_k2,
):
    """
    q1, q2: (N, h/2, d)
    k1, k2: (N, h_kv/2, d)
    v:      (N, h_kv/2, 2d)
    lam_*:  (d,)
    """
    # Two kernel calls share the same value cache, so v is loaded twice during decoding.
    attn1 = flash_attn_func(q1, k1, v)
    attn2 = flash_attn_func(q2, k2, v)

    # Scalar lambda re-parameterized from learnable vectors, with a depth-dependent initialization.
    lam_init = 0.8 - 0.6 * exp(-0.3 * layer_index)
    lam1 = exp(sum(lam_q1 * lam_k1))
    lam2 = exp(sum(lam_q2 * lam_k2))
    lam = lam1 - lam2 + lam_init
    attn = attn1 - lam * attn2

    # Per-head RMSNorm on the context vectors, followed by a fixed rescale.
    attn = rmsnorm(attn)
    attn = attn * (1 - lam_init)
    return attn


def DiffAttnV2(
        q, k, v, lam
):
    """
    q:   (N, 2h, d)
    k:   (N, h_kv, d)
    v:   (N, h_kv, d)
    lam: (N, h, 1)
    """
    # Single kernel call: query heads are doubled while KV heads stay unchanged,
    # so the KV cache is read only once.
    attn = flash_attn_func(q, k, v)
    # Adjacent output heads belong to the same GQA group and form a differential pair.
    attn1, attn2 = (attn[:, 0::2], attn[:, 1::2])

    # Per-token, per-head lambda squashed by a sigmoid; no RMSNorm afterwards.
    lam_val = sigmoid(lam)
    attn = attn1 - lam_val * attn2
    return attn

Full code: https://github.com/microsoft/unilm/tree/master/Diff-Transformer/Diff-Transformer-V2

In the script, h is the number of query heads, h_kv the number of key‑value heads, and d the head dimension. The λ in DIFF V2 is projected from X for each token and each head.

DIFF V2 doubles the number of query heads while keeping the number of KV heads unchanged; the extra dimension is reduced back to h*d after the differential operation, so the W_O projection remains the same as in a baseline Transformer.
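
To make the shapes concrete, the following is a minimal, hypothetical PyTorch sketch of a full DIFF V2 attention layer, including the per-token λ projection and the unchanged W_O. It is not the official implementation: it substitutes torch.nn.functional.scaled_dot_product_attention with manually repeated KV heads for flash_attn_func, and all module and dimension names are illustrative.

import torch
import torch.nn.functional as F
from torch import nn

class DiffAttnV2Layer(nn.Module):
    """Hypothetical DIFF V2 attention layer sketch (shapes only, not the reference code)."""
    def __init__(self, dim, h, h_kv, d):
        super().__init__()
        self.h, self.h_kv, self.d = h, h_kv, d
        self.q_proj = nn.Linear(dim, 2 * h * d, bias=False)    # doubled query heads
        self.k_proj = nn.Linear(dim, h_kv * d, bias=False)     # KV heads unchanged
        self.v_proj = nn.Linear(dim, h_kv * d, bias=False)
        self.lam_proj = nn.Linear(dim, h, bias=False)          # per-token, per-head lambda
        self.o_proj = nn.Linear(h * d, dim, bias=False)        # W_O same size as baseline

    def forward(self, x):                                       # x: (B, T, dim)
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, 2 * self.h, self.d).transpose(1, 2)   # (B, 2h, T, d)
        k = self.k_proj(x).view(B, T, self.h_kv, self.d).transpose(1, 2)    # (B, h_kv, T, d)
        v = self.v_proj(x).view(B, T, self.h_kv, self.d).transpose(1, 2)

        # Repeat KV heads so each contiguous GQA group of query heads shares one KV head
        # (a stand-in for a grouped-query FlashAttention kernel).
        g = 2 * self.h // self.h_kv
        k = k.repeat_interleave(g, dim=1)
        v = v.repeat_interleave(g, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)      # (B, 2h, T, d)

        # Differential operation: adjacent output heads (same GQA group) form a pair.
        attn1, attn2 = attn[:, 0::2], attn[:, 1::2]                          # (B, h, T, d)
        lam = torch.sigmoid(self.lam_proj(x)).transpose(1, 2).unsqueeze(-1)  # (B, h, T, 1)
        out = attn1 - lam * attn2                                            # no per-head RMSNorm

        out = out.transpose(1, 2).reshape(B, T, self.h * self.d)
        return self.o_proj(out)                                              # (B, T, dim)

A quick shape check: DiffAttnV2Layer(4096, 32, 8, 128)(torch.randn(2, 16, 4096)) returns a (2, 16, 4096) tensor, with attention computed over 64 query heads but only 8 KV heads.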


Motivation

Faster Decoding & No Custom Kernels

DIFF V2 adds query heads but does not increase KV heads. Because LLM decoding is typically memory‑bound, this design lets DIFF V2 achieve decoding speeds comparable to a standard Transformer. Moreover, since query, key, and value dimensions are aligned, no custom attention kernels are required. In contrast, DIFF V1 can be slower during decoding because the value cache must be loaded twice, and a custom kernel is needed. DIFF V2 also raises the arithmetic intensity of the attention module during decoding.
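
As a back-of-the-envelope illustration of this argument, the sketch below counts KV-cache bytes and attention FLOPs per decoded token under assumed dimensions (h = 32 query heads, h_kv = 8, d = 128, 8K context, bf16 cache). The 1.5× cache traffic for DIFF V1 reflects the twice-loaded value cache described above; everything else is an assumption for illustration.

# Rough per-token decode accounting under a hypothetical configuration.
h, h_kv, d, ctx, elem = 32, 8, 128, 8192, 2           # heads, KV heads, head dim, context, bytes/element (bf16)

# KV-cache bytes read per decoded token.
baseline_bytes = 2 * h_kv * d * ctx * elem            # K and V, each read once
diff_v1_bytes  = 3 * h_kv * d * ctx * elem            # value cache read twice (two kernel calls)
diff_v2_bytes  = baseline_bytes                       # same KV cache, still read once

# Attention FLOPs per decoded token (QK^T plus AV, ~2*d multiply-adds per query head and position).
baseline_flops = 4 * h * d * ctx
diff_v2_flops  = 2 * baseline_flops                   # query heads doubled

print(f"KV bytes/token  baseline={baseline_bytes/1e6:.1f}MB  v1={diff_v1_bytes/1e6:.1f}MB  v2={diff_v2_bytes/1e6:.1f}MB")
print(f"arithmetic intensity  baseline={baseline_flops/baseline_bytes:.1f}  v2={diff_v2_flops/diff_v2_bytes:.1f} FLOP/byte")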

During pre‑training, when using FlashAttention kernels on H‑series and B‑series GPUs, the throughput reduction introduced by DIFF V2 is negligible. For long‑sequence prefilling we recommend combining DIFF V2 with techniques such as YOCO (also used in Gemma 3n), which already reduces prefilling complexity to linear time with respect to sequence length.

An alternative perspective is to compare DIFF V2 with a Transformer that has the same query dimension 2h*d. Under this comparison, both models exhibit the same attention‑kernel speed, while DIFF V2 has fewer parameters and FLOPs in the output projection.

Softmax Magnitude Constraint

In standard Scaled Dot‑Product Attention (SDPA), let \(Q, K, V \in \mathbb{R}^{n \times d}\) be the queries, keys, and values. The context vector \(C\) is

\[
C = \operatorname{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V .
\]

Denote the attention weight matrix by \(A \in \mathbb{R}^{n \times n}\). For a single row, the context vector \(\mathbf{c}_i\) is

\[
\mathbf{c}_i = \sum_{j=1}^{n} a_{ij}\,\mathbf{v}_j .
\]

The Context RMS (Root Mean Square) of this output is

\[
\operatorname{RMS}(\mathbf{c}_i) = \sqrt{\frac{1}{d}\,\lVert\mathbf{c}_i\rVert^{2}} .
\]

Assuming the value vectors are uncorrelated with RMS = 1, the Context RMS is bounded in \(\bigl[\tfrac{1}{\sqrt{n}},\,1\bigr)\):

  • Focused attention on a single token → RMS approaches 1.

  • Uniform attention across all tokens → RMS = \(1/\sqrt{n}\).

In DIFF V1 we add a per‑head RMSNorm on context vectors:

\[
\hat{\mathbf{c}}_i = \frac{\mathbf{c}_i}{\operatorname{RMS}(\mathbf{c}_i)} .
\]

When the model learns a uniform attention distribution, RMSNorm must multiply the vector by \(\sqrt{n}\) (≈ 90.5 for \(n = 8192\)), leading to a ~100× magnification of the output. In large‑scale pre‑training this causes massive gradients and numerical instability.
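
A quick numerical check of this magnification, under the same assumption of uncorrelated unit‑RMS value vectors (illustrative only):

import torch

n, d = 8192, 128
v = torch.randn(n, d)                      # roughly uncorrelated values with RMS ~ 1
c = torch.full((n,), 1.0 / n) @ v          # context vector under a uniform attention row

rms = c.pow(2).mean().sqrt()
print(rms.item())                          # ~ 1/sqrt(n) ~ 0.011
print((1.0 / rms).item())                  # RMSNorm scale ~ sqrt(n) ~ 90.5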

DIFF V2 removes the per‑head RMSNorm, bringing the gradient‑norm scale back in line with a standard Transformer and reducing gradient spikes.

Beyond Softmax Constraint & Elimination of Attention Sinks

DIFF V2 can overcome the Softmax magnitude constraint and help eliminate attention sinks.

Standard Softmax attention

\[
a_{ij} = \frac{\exp(z_{ij})}{\sum_{k=1}^{n}\exp(z_{ik})}, \qquad
\mathbf{c}_i = \sum_{j=1}^{n} a_{ij}\,\mathbf{v}_j, \qquad
\operatorname{RMS}(\mathbf{c}_i) \in \Bigl[\frac{1}{\sqrt{n}},\,1\Bigr).
\]

DIFF V2

\[
\mathbf{c}_i = \sum_{j=1}^{n}\Bigl(\operatorname{Softmax}(z_{ij}^{1}) - \sigma(\lambda_i)\,\operatorname{Softmax}(z_{ij}^{2})\Bigr)\mathbf{v}_j,
\qquad
\operatorname{RMS}(\mathbf{c}_i) \in \bigl(0,\,\sqrt{2}\,\bigr).
\]

The projected \(\lambda_i\) (passed through a sigmoid) controls the context RMS. Lowering the lower bound of the RMS to zero is crucial: it helps eliminate attention sinks and improves training stability, while the upper bound only needs to remain finite.
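
The two extremes can be checked numerically under the same unit‑RMS value assumption as above (an illustrative sketch, not a proof):

import torch

n, d = 8192, 128
v = torch.randn(n, d)
v = v / v.pow(2).mean(dim=1, keepdim=True).sqrt()      # force each value vector to RMS = 1

def context_rms(w):                                    # w: one row of differential attention weights
    c = w @ v
    return c.pow(2).mean().sqrt().item()

# Both softmax heads agree and sigma(lambda) ~ 1: the weights nearly cancel, RMS -> 0.
a = torch.softmax(torch.randn(n), dim=0)
print(context_rms(a - 0.999 * a))                      # ~ 0

# The two heads each focus on a different token with sigma(lambda) ~ 1: RMS near sqrt(2) ~ 1.41
# (the bound holds up to sampling noise in the correlation of the two value vectors).
w = torch.zeros(n)
w[0], w[1] = 1.0, -0.999
print(context_rms(w))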

Other recent works that relax the Softmax constraint:

  • Attention Is Off‑By‑One – adds a constant term in the denominator, yielding RMS in \((0,1)\).

  • gpt‑oss – introduces a learnable scalar \(s\) per head, also giving RMS in \((0,1)\).

  • Gated Attention – multiplies the Softmax output by a sigmoid gate, again bounding RMS in \((0,1)\).


Experimental Observations

We conducted pre‑training experiments on production‑scale LLMs, including dense models and a 30‑B MoE trained on trillions of tokens with large learning rates (6e‑4 – 1e‑3). The experiments are still running; current observations:

  • Lower language‑modeling loss compared to a baseline Transformer (gap of 0.02 – 0.03).

  • Reduced loss and gradient spikes during training, especially under large learning‑rate settings where the Transformer baseline becomes unstable.

  • Reduced activation outlier magnitude.

Future work will explore:

  • Learning efficiency in mid‑ and post‑training phases.

  • Performance on downstream long‑context benchmarks (alleviating “context rot”).


Discussions

Construction of Differential Operation

A standard Transformer with \(2h\) attention heads could, in theory, learn the differential operation by setting \(W_O^{2i} = -W_O^{2i+1}\) for each pair of heads. In practice this exact cancellation is hard to discover through optimization.

Compared with a model that has to learn the differential operation on its own, explicitly constructing it before the output projection (as in DIFF V2) saves half of the \(W_O\) parameters. Under the current GQA setting, about 25% of the attention‑module parameters can be saved and re‑allocated elsewhere.
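
As a sanity check on that figure, here is the parameter arithmetic under an assumed GQA configuration (D = 4096, h = 32 query heads, h_kv = 8 KV heads, d = 128), comparing DIFF V2 against a standard GQA Transformer with 2h query heads as described above; the numbers are illustrative.

# Attention-module parameter accounting (illustrative configuration).
D, h, h_kv, d = 4096, 32, 8, 128

# Standard GQA Transformer with 2h query heads: W_Q, W_K, W_V, W_O.
baseline_2h = D * (2 * h * d) + 2 * D * (h_kv * d) + (2 * h * d) * D

# DIFF V2: same W_Q/W_K/W_V, but the differential op halves W_O; add a small lambda projection.
diff_v2 = D * (2 * h * d) + 2 * D * (h_kv * d) + (h * d) * D + D * h

print(f"attention parameters saved: {1 - diff_v2 / baseline_2h:.1%}")    # roughly a quarter

With even smaller key‑value projections the \(W_O\) term dominates more of the total, and the saving moves closer to the quoted 25%.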

Even if DIFF V2 only matches the baseline loss after re‑allocation, the method is still valuable because it can improve training stability, control outliers, or increase efficiency—similar to the benefits of GQA.

Design Ablations

  1. Wrong grouping of heads – subtracting two heads that are not in the same GQA group (i.e., they do not share KV). This leads to training instability and higher loss. The correct implementation uses attn[:, 0::2] and attn[:, 1::2] to pair heads within the same group (see the index check after this list).

  2. Omitting the λ scaling factor – using attn1 - attn2 instead of attn1 - lam_val * attn2 yields an excessively small context RMS at initialization and hurts performance.

  3. Skipping the sigmoid on λ – using the raw projected λ makes the context RMS unbounded from above, leading to less stable training.

Ablation 2 remains relatively stable but incurs higher loss; ablation 3 is less stable than DIFF V2 but still better than ablation 1.
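
For ablation 1, a tiny index check, assuming the contiguous per-group head layout described in the Code section (numbers are illustrative):

h, h_kv = 32, 8                        # query heads before doubling, and KV heads
num_out_heads = 2 * h                  # heads in the attention-kernel output
g = num_out_heads // h_kv              # query heads per GQA group (contiguous in the output)

group = lambda head: head // g

# Correct pairing: attn[:, 0::2] with attn[:, 1::2] -> each pair stays inside one group (shared KV).
print(all(group(2 * i) == group(2 * i + 1) for i in range(h)))   # True

# Wrong pairing: first half vs second half -> pairs straddle different groups.
print(all(group(i) == group(i + h) for i in range(h)))           # False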

Miscellaneous

  • In DIFF, qk‑logit outliers are smaller than in the baseline. This may help mitigate attention rounding errors discussed in recent blogs and papers.

  • DIFF V2 is compatible with sparse‑attention frameworks. When using block‑sparse attention, query heads within the same GQA group must attend to the same KV blocks. For DIFF V2, the block‑selection strategy can treat each pair of differential heads together or simply average attention logits across the larger GQA group; no fundamental changes are required (see the sketch below).
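
A minimal sketch of the group-averaging option, assuming per-head importance scores for each KV block are already available (all names and shapes here are hypothetical, not taken from an existing sparse-attention framework):

import torch

num_q_heads, h_kv, num_blocks, topk = 64, 8, 256, 32
g = num_q_heads // h_kv                               # query heads per GQA group

# Hypothetical per-head block-importance scores (e.g., pooled attention logits).
scores = torch.randn(num_q_heads, num_blocks)

# Average over each GQA group, which also keeps every differential pair together,
# then give all query heads in the group the same top-k KV blocks.
group_scores = scores.view(h_kv, g, num_blocks).mean(dim=1)      # (h_kv, num_blocks)
selected = group_scores.topk(topk, dim=-1).indices               # (h_kv, topk)
selected_per_head = selected.repeat_interleave(g, dim=0)         # (num_q_heads, topk)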
