Mercury 2's generation speed reduces latency for real‑time AI services, enabling more responsive user experiences without sacrificing reasoning quality, and it gives developers a practical edge when building high‑throughput applications.
Diffusion‑based language models represent a shift from traditional autoregressive architectures that emit tokens one at a time. By treating token generation as a denoising process, Mercury 2 can refine multiple token candidates simultaneously, dramatically increasing throughput. This parallelism translates to roughly 1,000 tokens per second, a rate well beyond the throughput of most commercial LLMs, and it opens the door to real‑time applications previously constrained by response time.
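To build intuition for how parallel denoising differs from left‑to‑right decoding, here is a toy sketch in PyTorch. It is illustrative only, not Mercury 2's actual (proprietary) architecture: the `model`, the `MASK_ID` placeholder token, and the confidence‑based unmasking schedule are all assumptions made for the example.

```python
# Toy sketch of parallel denoising decoding, for intuition only; not Mercury 2's
# actual architecture. `model` is a hypothetical network that scores every
# position of the sequence in a single forward pass.
import math
import torch

MASK_ID = 0  # hypothetical id of the [MASK] placeholder token


def denoise_generate(model, prompt_ids, gen_len=64, steps=8):
    """Start from an all-masked completion and unmask it over a few passes."""
    device = prompt_ids.device
    blanks = torch.full((1, gen_len), MASK_ID, dtype=torch.long, device=device)
    seq = torch.cat([prompt_ids, blanks], dim=1)

    for step in range(steps):
        masked = seq == MASK_ID
        remaining = int(masked.sum())
        if remaining == 0:
            break

        # One forward pass proposes a token for every position at once.
        logits = model(seq)                      # (1, seq_len, vocab_size)
        probs = logits.softmax(dim=-1)
        conf, preds = probs.max(dim=-1)          # per-position confidence + guess

        # Commit only the most confident still-masked positions this round, so
        # many tokens are refined in parallel rather than one at a time.
        k = math.ceil(remaining / (steps - step))
        conf = conf.masked_fill(~masked, float("-inf"))
        picked = conf.topk(k, dim=-1).indices
        seq[0, picked[0]] = preds[0, picked[0]]

    return seq
```

Because each pass can commit many tokens at once instead of exactly one, the number of forward passes no longer scales with the output length, which is where the throughput advantage over strictly sequential decoding comes from.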
In practical benchmarks, Mercury 2 was tasked with producing a functional checkers game and a full chess engine. The model delivered working code in a matter of seconds, and subsequent prompt refinements were applied just as quickly. When pitted against Claude Haiku, Mercury 2 not only finished the tasks faster but also maintained comparable logical consistency, suggesting that speed need not come at the cost of reasoning depth. For developers, this means faster iteration cycles, lower API costs, and the ability to embed sophisticated logic generation into interactive tools.
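To see what this looks like from the application side, here is a minimal sketch that times a code‑generation request through the OpenAI Python client. The endpoint URL, API key, and model name are placeholders for illustration, not documented Mercury values.

```python
# Hypothetical sketch: timing a code-generation request via an
# OpenAI-compatible client. The base_url, api_key, and model name are
# placeholders, not documented Mercury 2 values.
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://api.example.com/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                 # placeholder credential
)

start = time.perf_counter()
response = client.chat.completions.create(
    model="mercury-2",  # placeholder model name
    messages=[{"role": "user", "content": "Write a minimal checkers game in Python."}],
)
elapsed = time.perf_counter() - start

code = response.choices[0].message.content
print(f"Received {len(code)} characters in {elapsed:.2f} s")
```

At roughly 1,000 tokens per second, a request like this returns fast enough to sit inside an interactive editing loop rather than a batch job.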
The implications for the broader AI market are significant. High‑speed reasoning models like Mercury 2 are poised to power next‑generation AI agents, voice‑driven assistants, and search platforms where milliseconds matter. Customer‑service bots can now handle complex queries without noticeable lag, and developers can build more responsive coding assistants and automation pipelines. As enterprises prioritize both performance and intelligence, diffusion LLMs may become the new standard for latency‑sensitive workloads, prompting a wave of integration across cloud APIs and on‑premise solutions.