Alibaba's Qwen Team Makes AI Models Think Deeper with New Algorithm

THE DECODER
Apr 5, 2026

Why It Matters

FIPO demonstrates that smarter reward allocation can break reasoning ceilings, offering a more efficient path to higher‑performing LLMs for complex problem solving.

Key Takeaways

  • FIPO rewards tokens based on downstream influence
  • Doubles chain‑of‑thought length to over 10,000 tokens
  • Lifts AIME‑2024 accuracy from 50 % to a peak of 58 %
  • Eliminates auxiliary value model, matching PPO performance
  • Training stability requires discount factor and drift filtering

Pulse Analysis

Reinforcement learning for large language models has long suffered from a blunt credit‑assignment scheme: every token receives the same end‑of‑answer reward, regardless of its logical importance. The Qwen team’s Future‑KL Influenced Policy Optimization (FIPO) reframes this by estimating how each token shifts the probability distribution of all subsequent tokens, effectively measuring its downstream influence. This nuanced signal lets the optimizer amplify tokens that spark productive reasoning chains while dampening those that lead to dead ends, all without the need for an auxiliary value network.
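The article does not reproduce FIPO's exact formulation, but the core idea — crediting each token by how much it shifts the distributions over subsequent tokens — can be sketched in a toy form. Everything below is illustrative: the function names, the discount, and the normalization scheme are assumptions, and the distributions stand in for what a real language model would produce.

```python
import numpy as np

def kl(p, q):
    # KL divergence between two discrete next-token distributions
    return float(np.sum(p * np.log(p / q)))

def future_kl_weights(dists_with, dists_base, gamma=0.9):
    """Toy future-influence credit weights (hypothetical names).

    dists_with[t][k]: next-token distribution at future step k,
                      conditioned on keeping token t.
    dists_base[t][k]: the same distribution under a counterfactual
                      baseline without token t's contribution.
    Each token's raw weight is the discounted sum of the KL shifts
    it induces on all later positions.
    """
    weights = []
    for t in range(len(dists_with)):
        w = 0.0
        for k, (p, q) in enumerate(zip(dists_with[t], dists_base[t])):
            w += (gamma ** k) * kl(np.asarray(p), np.asarray(q))
        weights.append(w)
    # Normalize so the weights redistribute the scalar end-of-answer
    # reward across tokens rather than inflating it.
    total = sum(weights) or 1.0
    return [w / total for w in weights]
```

A token whose presence sharply reshapes later predictions (a productive reasoning step) gets a large share of the reward; a token that barely moves the downstream distributions (a dead end or filler) gets almost none — which is the intuition the paragraph above describes.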

The performance impact is immediate. Training Qwen2.5‑32B‑Base with FIPO more than doubles the average chain‑of‑thought length, pushing past the 10,000‑token barrier that earlier methods like DAPO could not breach. On the AIME‑2024 benchmark, accuracy climbs from a baseline 50 % to a peak of 58 %, outperforming DeepSeek‑R1‑Zero’s 47 % and edging past OpenAI’s o1‑mini at 56 %. The algorithm also introduces stability guardrails—discount factors that prioritize nearby tokens and filters that discard extreme drift—preventing the training collapse observed in earlier experiments.
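The two guardrails can also be sketched in toy form. This is not the team's implementation — the parameter names (`gamma`, `drift_cap`) and the threshold behavior are assumptions — but it shows how a discount factor keeps credit local and how extreme-drift samples can be filtered out before they destabilize training.

```python
import numpy as np

def stabilized_credit(kl_shifts, gamma=0.95, drift_cap=10.0):
    """Illustrative stability guardrails (hypothetical names).

    kl_shifts[t]: the KL shifts token t induces at successive later
                  steps.
    - gamma exponentially down-weights distant shifts, so a token is
      credited mostly for its near-term influence;
    - samples whose total drift exceeds drift_cap are discarded,
      mimicking the filtering of extreme-drift outliers.
    Returns per-token credits, or None if the sample is filtered out.
    """
    credits = []
    for shifts in kl_shifts:
        discounts = gamma ** np.arange(len(shifts))
        credits.append(float(np.sum(discounts * np.asarray(shifts))))
    if sum(credits) > drift_cap:  # extreme drift: drop the sample
        return None
    return credits
```

Dropping a whole sample may look wasteful, but a handful of runaway-drift trajectories can dominate the gradient; filtering them is a common stabilization trade-off in RL fine-tuning.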

While the results are promising, they raise new considerations for the industry. Longer reasoning sequences increase compute and memory demands, potentially limiting deployment in cost‑sensitive environments. Moreover, FIPO’s current validation is confined to math problems and a single public dataset, leaving open questions about its transferability to code generation, symbolic logic, or multimodal tasks. The Qwen team’s plan to open‑source the training framework will enable broader testing, and could accelerate a shift toward more sophisticated, reward‑aware reinforcement learning as a core technique for next‑generation AI reasoning.
