Automatic speech recognition (ASR) has entered a period of rapid expansion, with more than 150 audio‑text models now available on major hubs. This abundance creates a selection dilemma for businesses that need reliable transcription across diverse use cases. Community‑driven benchmarks like the Open ASR Leaderboard provide a critical yardstick, measuring not only word error rate (WER) but also efficiency metrics such as inverse real‑time factor (RTFx). By aggregating results from over 60 open and closed‑source models, the leaderboard offers a single source of truth for performance comparison.
The latest leaderboard data highlights three clear trends. First, Conformer encoders combined with large language model (LLM) decoders now dominate English transcription accuracy, achieving record‑low WERs. Second, speed‑focused architectures—CTC and TDT decoders—deliver throughput gains of up to two orders of magnitude, making them ideal for real‑time or batch processing of meetings and podcasts. Third, multilingual models broaden language coverage but typically incur a penalty in single‑language precision, while closed‑source offerings continue to outperform open alternatives on long‑form audio due to proprietary optimizations.
For industry stakeholders, these insights translate into actionable decisions. Companies prioritizing multilingual reach may opt for fine‑tuned Whisper variants or Meta’s MMS, accepting modest accuracy trade‑offs. Organizations requiring high‑volume, low‑latency transcription should consider CTC‑based Conformers, especially for English‑only pipelines. Meanwhile, the open‑source community is poised to close the long‑form gap as more datasets and fine‑tuning guides become available. Continued contributions to the Open ASR Leaderboard will drive transparency, foster competition, and accelerate innovation across the global speech AI ecosystem.
Enterprises can use these insights to balance accuracy, speed, and language support when selecting ASR solutions, accelerating deployment in global and real‑time applications.
Published November 21, 2025
Authors: Eric Bezzam, Steven Zheng, Eustache Le Bihan, Vaibhav Srivastav
While everyone (and their grandma 👵) is spinning up new ASR models, picking the right one for your use case can feel more overwhelming than choosing your next Netflix show. As of 21 Nov 2025, there are 150 Audio‑Text‑to‑Text and 27 K ASR models on the Hub 🤯
Most benchmarks focus on short‑form English transcription (<30 s) and overlook other important tasks, such as (1) multilingual performance and (2) model throughput, which can be a deciding factor for long‑form audio like meetings and podcasts.
Over the past two years, the Open ASR Leaderboard has become a standard for comparing open and closed‑source models on both accuracy and efficiency. Recently, multilingual and long‑form transcription tracks have been added to the leaderboard 🎉
📝 New preprint on ASR trends from the leaderboard: https://hf.co/papers/2510.06961
🧠 Best accuracy: Conformer encoder + LLM decoders (open‑source ftw 🥳)
⚡ Fastest: CTC / TDT decoders
🌍 Multilingual: Comes at the cost of single‑language performance
⌛ Long‑form: Closed‑source systems still lead (for now 😉)
🧑‍💻 Fine‑tuning guides: Parakeet, Voxtral, Whisper – to continue pushing performance
As of 21 Nov 2025, the Open ASR Leaderboard compares 60+ open and closed‑source models from 18 organizations, across 11 datasets.
In a recent preprint, we dive into the technical setup and highlight some key trends in modern ASR. Here are the big takeaways 👇
Models combining Conformer encoders with large language model (LLM) decoders currently lead in English transcription accuracy. Examples include:
NVIDIA Canary‑Qwen‑2.5B
IBM Granite‑Speech‑3.3‑8B
Microsoft Phi‑4‑Multimodal‑Instruct
These achieve the lowest word error rates (WER), showing that integrating LLM reasoning can significantly boost ASR accuracy.
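As a refresher, WER is the word‑level edit distance (substitutions + deletions + insertions) between the reference transcript and the model output, normalized by the reference length. A minimal sketch of the computation (the leaderboard itself normalizes text before scoring, which this toy version skips):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Rolling 1-D dynamic-programming row: d[j] = distance between
    # the first i reference words and the first j hypothesis words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i  # prev holds the diagonal cell dp[i-1][j-1]
        for j, h in enumerate(hyp, 1):
            cur = min(
                d[j] + 1,           # deletion (reference word missing)
                d[j - 1] + 1,       # insertion (extra hypothesis word)
                prev + (r != h),    # substitution, or free match
            )
            prev, d[j] = d[j], cur
    return d[len(hyp)] / len(ref)
```

For example, `wer("the cat sat on the mat", "the cat sit on mat")` gives 2/6 ≈ 0.33: one substitution (sat → sit) and one deletion (the).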
💡 Pro‑tip: NVIDIA introduced Fast Conformer, a 2× faster variant of the Conformer, used in their Canary and Parakeet suite of models.
While highly accurate, LLM decoders tend to be slower than simpler approaches. On the Open ASR Leaderboard, efficiency is measured using inverse real‑time factor (RTFx), where higher is better.
For even faster inference, CTC and TDT decoders deliver 10–100× higher throughput than LLM decoders, albeit with slightly higher error rates. This makes them ideal for real‑time, offline, or batch transcription tasks (e.g., meetings, lectures, podcasts).
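Concretely, RTFx is just the total audio duration divided by the total processing time, so RTFx = 100 means an hour of audio is transcribed in 36 seconds. A minimal sketch:

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio transcribed per second
    of compute. Higher is better; RTFx = 1 means exactly real time."""
    return audio_seconds / processing_seconds

# A system that transcribes a 1-hour recording (3600 s) in 36 s
# runs at RTFx = 100, i.e. 100x faster than real time.
```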
OpenAI Whisper Large v3 remains a strong multilingual baseline, supporting 99 languages. However, fine‑tuned or distilled variants such as Distil‑Whisper and CrisperWhisper often outperform the original on English‑only tasks, showing how targeted fine‑tuning can improve specialization.
Focusing on English tends to reduce multilingual coverage → a classic trade‑off between specialization and generalization. Self‑supervised systems like Meta’s Massively Multilingual Speech (MMS) and Omnilingual ASR can support 1 K+ languages but still trail language‑specific encoders in accuracy.
⭐ While only five languages are currently benchmarked, we plan to expand to more languages and welcome new dataset and model contributions via GitHub pull requests.
Community‑driven leaderboards also exist for individual languages, e.g.:
Open Universal Arabic ASR Leaderboard – evaluates models on Modern Standard Arabic and regional dialects.
Russian ASR Leaderboard – focuses on Russian‑specific phonology and morphology.
These localized efforts mirror the broader multilingual leaderboard’s mission to encourage dataset sharing, fine‑tuned checkpoints, and transparent model comparisons, especially for languages with fewer ASR resources.
For long‑form audio (podcasts, lectures, meetings), closed‑source systems still edge out open ones, likely due to domain tuning, custom chunking, or production‑grade optimization.
Among open models, Whisper Large v3 performs the best. For throughput, CTC‑based Conformers shine:
NVIDIA Parakeet CTC 1.1B – RTFx = 2793.75, WER = 6.68%
Whisper Large v3 – RTFx = 68.56, WER = 6.43%
The trade‑off? Parakeet delivers roughly 40× the throughput at a slightly higher WER, and it is English‑only, again highlighting the multilingual vs. specialization tension.
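The "custom chunking" mentioned above typically means splitting a long recording into overlapping fixed windows, transcribing each window, and merging the overlapping transcripts. A minimal sketch of the windowing step (the 30 s window and 5 s overlap are illustrative defaults, not what any particular system uses):

```python
def chunk_spans(total_seconds: float, window: float = 30.0, overlap: float = 5.0):
    """Return (start, end) spans covering the audio with overlapping windows.

    Consecutive windows share `overlap` seconds so that words cut at a
    boundary appear whole in at least one window.
    """
    step = window - overlap
    spans, start = [], 0.0
    while start < total_seconds:
        spans.append((start, min(start + window, total_seconds)))
        start += step
    return spans
```

For a 100‑second file, `chunk_spans(100)` yields `[(0, 30), (25, 55), (50, 80), (75, 100)]`. The remaining (and harder) step, merging the overlapping transcripts without duplicating or dropping words, is where production systems differentiate themselves.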
⭐ While closed systems still lead, there’s huge potential for open‑source innovation. Long‑form ASR remains one of the most exciting frontiers for the community to tackle next!
Given how fast ASR is evolving, we’re excited to see what new architectures push performance and efficiency, and how the Open ASR Leaderboard continues to serve as a transparent, community‑driven benchmark for the field—and as a reference for other leaderboards (Russian, Arabic, Speech DeepFake Detection).
We’ll keep expanding the Open ASR Leaderboard with more models, more languages, and more datasets, so stay tuned 👀
👉 Want to contribute? Head over to the GitHub repo https://github.com/huggingface/open_asr_leaderboard to open a pull request 🚀