Attention ISN'T All You Need?! New Qwen3 Variant Brumby-14B-Base Leverages Power Retention Technique

VentureBeat AI · Nov 4, 2025

Why It Matters

By eliminating the quadratic attention bottleneck at a fraction of traditional training costs, Brumby demonstrates that attention‑free models can match transformer performance, potentially democratizing large‑scale AI development and opening new possibilities for efficient long‑context applications.

Summary

Manifest AI unveiled Brumby-14B-Base, a 14‑billion‑parameter model derived from Qwen3-14B-Base that replaces all attention layers with a novel Power Retention mechanism. The model was retrained in 60 hours on 32 H100 GPUs for roughly $4,000 and achieves accuracy on par with leading transformers such as Qwen3‑14B and GLM‑4.5‑Air, while delivering constant‑time per‑token computation and reported 100× speedups on very long contexts. Benchmarks show parity or modest gains on math and long‑context reasoning tasks, though it trails slightly on knowledge‑heavy evaluations like MMLU‑Pro. The architecture can be adopted with a simple code swap followed by a few thousand fine‑tuning steps.
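
The article does not reproduce Manifest AI's kernels, and the exact Power Retention update is not spelled out here. Purely as an intuition for why a retention‑style layer sidesteps the quadratic attention bottleneck, the sketch below shows a generic linear‑recurrence update with a fixed‑size state: each token folds its key/value pair into a state matrix, so per‑token work stays constant regardless of context length. The function names (`retention_step`, `retention_read`) and the scalar decay factor are illustrative assumptions, not the published Power Retention formulation.

```python
import numpy as np

def retention_step(state, k, v, decay=0.99):
    """Fold one key/value pair into a fixed-size state matrix.
    Cost is O(d_k * d_v) per token, independent of how many tokens
    came before -- the constant-time property the article attributes
    to attention-free retention layers."""
    return decay * state + np.outer(k, v)

def retention_read(state, q):
    """Query the accumulated state instead of attending over all past tokens."""
    return q @ state  # shape: (d_v,)

# Toy run over a sequence: per-token work never grows with sequence length.
d_k, d_v, seq_len = 8, 8, 16
rng = np.random.default_rng(0)
state = np.zeros((d_k, d_v))
for t in range(seq_len):
    k = rng.normal(size=d_k)
    v = rng.normal(size=d_v)
    q = rng.normal(size=d_k)
    state = retention_step(state, k, v)
    out = retention_read(state, q)
print(out.shape)  # (8,)
```

By contrast, standard attention recomputes scores against every previous token, so per‑token cost grows with context length; that gap is where the reported long‑context speedups come from.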
