
Moonshot AI Releases Attention Residuals (AttnRes) to Replace Fixed Residual Mixing with Depth-Wise Attention for Better Scaling in Transformers
Why It Matters
Learnable depth-wise residuals stabilize deep Transformer training and boost scaling efficiency, giving large language models measurable performance gains without major hardware overhead.
Key Takeaways
- AttnRes uses depth-wise softmax attention instead of fixed residuals
- Block AttnRes reduces memory from O(Ld) to O(Nd)
- Training overhead under pipeline parallelism stays below 4%
- MMLU improves from 73.5 to 74.6
- GPQA-Diamond jumps from 36.9 to 44.4
Pulse Analysis
Residual connections have been a cornerstone of Transformer stability, yet their uniform mixing of all prior layer outputs can cause hidden-state magnitude to balloon and dilute the influence of individual layers. Moonshot AI's Attention Residuals reframe this fixed recurrence as a learnable depth-wise attention mechanism, allowing each layer to weight earlier representations dynamically. By introducing a pseudo-query per layer, AttnRes preserves the simplicity of PreNorm architectures while granting selective access to the depth dimension, addressing the three bottlenecks of standard residuals: lack of selectivity, irreversible blending, and output growth.
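The core idea can be sketched in a few lines. In this minimal NumPy illustration (the exact score function and query parameterization used by Moonshot are assumptions here), a layer's learnable pseudo-query scores every preceding layer's output, and a softmax over depth replaces the fixed uniform residual sum. Because the weights form a convex combination, the mixed residual's magnitude stays bounded by the largest input, which is the output-growth property the article describes.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def depthwise_residual(layer_outputs, pseudo_query):
    """Mix the outputs of all preceding layers with depth-wise softmax
    attention instead of a fixed residual sum (Full AttnRes sketch).

    layer_outputs: list of (d,) hidden states, one per preceding layer
    pseudo_query:  (d,) learnable query for the current layer
                   (scaled dot-product scoring is an assumption)
    """
    H = np.stack(layer_outputs)               # (L, d) depth-wise stack
    d = H.shape[1]
    scores = H @ pseudo_query / np.sqrt(d)    # (L,) one logit per layer
    weights = softmax(scores)                 # convex weights over depth
    return weights @ H                        # (d,) bounded mixed residual

rng = np.random.default_rng(0)
outs = [rng.standard_normal(16) for _ in range(6)]
q = rng.standard_normal(16)
mixed = depthwise_residual(outs, q)
```

Per token, Full AttnRes must keep all L layer outputs around (the O(Ld) memory the article cites), and scoring each layer against each depth position gives the O(L²d) arithmetic.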
The research team presents two practical implementations. Full AttnRes attends to every preceding layer, delivering the most expressive depth-wise mixing at the cost of O(L²d) arithmetic and O(Ld) memory per token. To make the method viable for trillion-parameter models, Block AttnRes partitions the network into N blocks, collapsing each block's outputs into a single representation and applying attention only across these summaries. This reduces memory and communication from O(Ld) to O(Nd), incurring less than 4% additional training overhead under pipeline parallelism and under 2% inference latency. Scaling experiments across five model sizes reveal that both variants achieve lower validation loss than a strong PreNorm baseline, with Block AttnRes reaching a level of performance the baseline would need roughly 25% more compute to match.
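The block variant changes only what is attended over: layers are grouped into N blocks, each block is collapsed to one summary vector, and the depth-wise attention runs across those N summaries. A minimal sketch, assuming mean pooling as the block summarizer (the article does not specify how a block's outputs are collapsed):

```python
import numpy as np

def block_attnres(layer_outputs, pseudo_query, num_blocks):
    """Block AttnRes sketch: collapse each block of layers to one summary,
    then attend only across the N summaries, so the depth-wise state kept
    per token shrinks from O(L*d) to O(N*d).

    Mean pooling within a block is an illustrative assumption.
    """
    H = np.stack(layer_outputs)                     # (L, d)
    d = H.shape[1]
    blocks = np.array_split(H, num_blocks)          # N groups of layers
    S = np.stack([b.mean(axis=0) for b in blocks])  # (N, d) block summaries
    scores = S @ pseudo_query / np.sqrt(d)          # (N,) logits, not (L,)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ S                                    # (d,) mixed residual

rng = np.random.default_rng(1)
outs = [rng.standard_normal(32) for _ in range(24)]  # L = 24 layers
q = rng.standard_normal(32)
mixed = block_attnres(outs, q, num_blocks=4)         # only 4 summaries attended
```

With N fixed as depth grows, the cross-stage communication cost under pipeline parallelism scales with N rather than L, which is why the reported training overhead stays small.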
When integrated into Kimi Linear, Moonshot's 48-billion-parameter MoE model, AttnRes stabilizes gradient flow and bounds output magnitudes across depth, translating into tangible benchmark improvements: MMLU rises to 74.6, GPQA-Diamond to 44.4, and HumanEval to 62.2, among others. These gains demonstrate that depth-wise residual attention can be a drop-in upgrade for existing large-scale language models, offering a cost-effective path to higher accuracy without extensive architectural redesign. As the industry pushes toward ever deeper and wider Transformers, mechanisms like AttnRes provide a scalable solution to the residual bottleneck, potentially reshaping best practices for model training and deployment.