Moonshot AI Releases Attention Residuals to Replace Fixed Residual Mixing with Depth-Wise Attention for Better Scaling in Transformers

MarkTechPost • Mar 16, 2026

Why It Matters

Learnable depth‑wise residuals stabilize deep Transformer training and boost scaling efficiency, giving large language models measurable performance gains without major hardware overhead.

Key Takeaways

  • AttnRes uses depth-wise softmax attention instead of fixed residuals
  • Block AttnRes reduces memory from O(Ld) to O(Nd)
  • Training overhead under pipeline parallelism stays below 4%
  • MMLU improves from 73.5 to 74.6
  • GPQA-Diamond jumps from 36.9 to 44.4

Pulse Analysis

Residual connections have been a cornerstone of Transformer stability, yet their uniform mixing of all prior layer outputs can cause hidden‑state magnitude to balloon and dilute the influence of individual layers. Moonshot AI’s Attention Residuals reframe this fixed recurrence as a learnable depth‑wise attention mechanism, allowing each layer to weight earlier representations dynamically. By introducing a pseudo‑query per layer, AttnRes preserves the simplicity of PreNorm architectures while granting selective access to the depth dimension, addressing the three bottlenecks of standard residuals: lack of selectivity, irreversible blending, and output growth.
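To make the depth-wise mixing concrete, here is a minimal PyTorch sketch of the idea: each layer holds a learned pseudo-query and softmax-attends over the outputs of all earlier layers instead of summing them uniformly. The key projection and the exact scoring used here are illustrative assumptions, not the paper's stated formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthWiseAttnResidual(nn.Module):
    """Sketch of a depth-wise attention residual for one layer.

    A plain PreNorm residual implicitly sums all earlier layer outputs
    with equal weight. Here the layer instead forms a learned pseudo-query
    and softmax-attends over the stack of previous outputs. The key
    projection below is an assumption made for illustration.
    """
    def __init__(self, d_model: int):
        super().__init__()
        self.pseudo_query = nn.Parameter(torch.randn(d_model) / d_model ** 0.5)
        self.key_proj = nn.Linear(d_model, d_model, bias=False)  # assumed

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (L_prev, batch, seq, d_model) — outputs of all earlier layers
        keys = self.key_proj(history)                              # (L, B, T, D)
        scores = torch.einsum("lbtd,d->lbt", keys, self.pseudo_query)
        scores = scores / history.shape[-1] ** 0.5
        weights = F.softmax(scores, dim=0)                         # softmax over depth
        # weighted mix of earlier outputs replaces the plain residual sum
        return torch.einsum("lbt,lbtd->btd", weights, history)
```

In a PreNorm block, this mixed state would feed the layer's attention or MLP sublayer in place of the usual running sum, which is how the mechanism gains selective access to depth while keeping per-layer outputs bounded.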

The research team presents two practical implementations. Full AttnRes attends to every preceding layer, delivering the most expressive depth‑wise mixing at a cost of O(L²d) arithmetic and O(Ld) memory per token. To make the method viable at trillion‑parameter scale, Block AttnRes partitions the network into N blocks, collapses each block’s outputs into a single summary representation, and applies attention only across these summaries. This cuts memory and communication from O(Ld) to O(Nd), adding less than 4% training overhead under pipeline parallelism and under 2% inference latency. Scaling experiments across five model sizes show that both variants reach lower validation loss than a strong PreNorm baseline, with Block AttnRes matching performance the baseline would only attain with roughly 25% more compute.
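The sketch below illustrates the Block AttnRes reduction under stated assumptions: the L stored layer outputs are split into N contiguous blocks, each block is collapsed to one summary (mean pooling is assumed here, since the article does not specify the collapse operator), and the pseudo-query attends only over those N summaries.

```python
import torch
import torch.nn.functional as F

def block_attn_res(history: torch.Tensor, num_blocks: int,
                   pseudo_query: torch.Tensor) -> torch.Tensor:
    """Illustrative Block AttnRes mixing (assumptions noted below).

    history:      (L, B, T, D) outputs of the L preceding layers
    num_blocks:   N << L; only N block summaries are kept per token,
                  so the residual state drops from O(L*D) to O(N*D)
    pseudo_query: (D,) learned query for the current layer

    The collapse of each block into one summary is not specified in the
    article; mean pooling over the block's layers is assumed here.
    """
    L, B, T, D = history.shape
    # split layers into N contiguous blocks and collapse each to one summary
    blocks = torch.chunk(history, num_blocks, dim=0)
    summaries = torch.stack([blk.mean(dim=0) for blk in blocks])   # (N, B, T, D)
    scores = torch.einsum("nbtd,d->nbt", summaries, pseudo_query) / D ** 0.5
    weights = F.softmax(scores, dim=0)        # attention over N summaries only
    return torch.einsum("nbt,nbtd->btd", weights, summaries)
```

For example, with L = 64 layers and N = 4 blocks, only four d-dimensional summaries per token need to be stored and passed between pipeline stages instead of sixty-four, which is the source of the O(Ld) → O(Nd) saving the article reports.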

When integrated into Kimi Linear, Moonshot’s 48‑billion‑parameter MoE model, AttnRes stabilizes gradient flow and bounds output magnitudes across depth, translating into tangible benchmark gains: MMLU rises from 73.5 to 74.6, GPQA‑Diamond from 36.9 to 44.4, and HumanEval reaches 62.2. These results suggest that depth‑wise residual attention can serve as a drop‑in upgrade for existing large‑scale language models, offering a cost‑effective path to higher accuracy without extensive architectural redesign. As the industry pushes toward ever deeper and wider Transformers, mechanisms like AttnRes offer a scalable answer to the residual bottleneck, potentially reshaping best practices for model training and deployment.
