AI

ElevenLabs and Google Dominate Artificial Analysis' Updated Speech-to-Text Benchmark

THE DECODER • March 1, 2026

Why It Matters

Lower transcription error rates enable more reliable voice interfaces, accelerating adoption across consumer and enterprise applications. The benchmark demonstrates that general‑purpose multimodal models can now rival dedicated ASR systems, reshaping competitive dynamics.

Key Takeaways

  • Scribe v2 achieves 2.3% WER, the best overall
  • Gemini 3 Pro reaches 2.9% WER without transcription training
  • Whisper Large v3 sits at 4.2% WER, mid-range
  • Voice-assistant test shows Scribe v2 at 1.6% WER
  • Alibaba, Amazon, and Rev AI trail with >5% WER

Pulse Analysis

Accurate speech‑to‑text conversion remains a cornerstone of emerging AI services, from virtual assistants to real‑time transcription tools. Artificial Analysis’ AA‑WER v2.0 benchmark, the latest iteration of its industry‑focused evaluation suite, pits 12 commercial and open‑source models against a diverse set of audio samples. By measuring word error rate (WER) across both generic dictation and voice‑assistant queries, the test provides a granular view of each system’s robustness under varied acoustic conditions. The rankings reveal a tightening gap between traditional ASR specialists and newer multimodal platforms, signaling a shift in how transcription quality is achieved.
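For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model's output (substitutions plus deletions plus insertions), divided by the number of reference words. The minimal, library-independent Python sketch below shows the standard computation; the function name and example strings are illustrative and not drawn from the benchmark itself. Production evaluations typically normalize casing and punctuation before scoring.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # match or substitution
            )
    return dp[len(ref)][len(hyp)] / len(ref)

print(f"{wer('the cat sat on the mat', 'the cat sat on a mat') * 100:.1f}% WER")
# -> 16.7% WER (one substitution out of six reference words)
```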

ElevenLabs’ Scribe v2 clinches the top spot with a 2.3% WER, a notable improvement over its predecessor and a clear indicator of focused model refinement. Google’s Gemini 3 Pro, despite not being explicitly trained for transcription, posts a competitive 2.9% WER, leveraging its broader multimodal architecture to generalize across speech tasks. This performance underscores the growing power of large‑scale foundation models that can repurpose vision‑language training for audio processing, reducing the need for dedicated data pipelines. For developers, the result means faster deployment cycles and lower maintenance overhead when integrating voice capabilities.
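As a rough illustration of that lower integration overhead, the hedged sketch below transcribes an audio file with a general-purpose Gemini model through the google-genai Python SDK. The model ID "gemini-3-pro" simply mirrors the article's naming and is an assumption, as is the placeholder file "meeting.mp3"; substitute whichever audio-capable model your account exposes.

```python
# Hedged sketch: using a general-purpose Gemini model as a transcriber.
# Requires `pip install google-genai` and an API key in the environment.
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

audio = client.files.upload(file="meeting.mp3")  # placeholder local file
response = client.models.generate_content(
    model="gemini-3-pro",  # assumed ID, mirroring the article
    contents=["Transcribe this audio verbatim.", audio],
)
print(response.text)
```

The point of the pattern is that the same client and call shape used for text and vision prompts handles audio, so no separate ASR pipeline or vendor SDK has to be maintained.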

The benchmark’s findings have immediate commercial implications. Enterprises seeking to embed voice interfaces can now consider general‑purpose models like Gemini as cost‑effective alternatives to niche ASR vendors, while ElevenLabs demonstrates that specialized offerings still hold a performance edge for high‑precision use cases. Open‑source solutions such as Whisper remain viable, especially where transparency and on‑premise deployment are priorities, but they lag behind the leading proprietary systems in raw accuracy. As multimodal models continue to evolve, we can expect further compression of error rates, driving broader adoption of voice‑first experiences across industries.
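For teams weighing the on-premise route mentioned above, here is a minimal sketch of running Whisper Large v3 locally with the open-source openai-whisper package (ffmpeg must also be installed); "audio.wav" is a placeholder path.

```python
# Minimal local transcription with openai-whisper:
# pip install -U openai-whisper
import whisper

model = whisper.load_model("large-v3")  # downloads weights on first use
result = model.transcribe("audio.wav")  # language is auto-detected
print(result["text"])
```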
