AI

Open ASR Leaderboard: Trends and Insights with New Multilingual & Long-Form Tracks

Hugging Face • November 21, 2025

Companies Mentioned

  • NVIDIA (NVDA)
  • OpenAI
  • Meta (META)
  • Microsoft (MSFT)
  • IBM (IBM)
  • GitHub

Why It Matters

Enterprises can use these insights to balance accuracy, speed, and language support when selecting ASR solutions, accelerating deployment in global and real‑time applications.

Key Takeaways

  • Conformer‑LLM models set new English WER records.
  • CTC/TDT decoders deliver 10‑100× faster inference.
  • Multilingual models sacrifice single‑language accuracy.
  • Closed‑source systems dominate long‑form transcription performance.
  • Open ASR Leaderboard drives transparent model comparison.

Pulse Analysis

Automatic speech recognition (ASR) has entered a period of rapid expansion, with more than 150 audio‑text models now available on major hubs. This abundance creates a selection dilemma for businesses that need reliable transcription across diverse use cases. Community‑driven benchmarks like the Open ASR Leaderboard provide a critical yardstick, measuring not only word error rate (WER) but also efficiency metrics such as inverse real‑time factor (RTFx). By aggregating results from over 60 open and closed‑source models, the leaderboard offers a single source of truth for performance comparison.
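The headline accuracy metric above, word error rate, can be sketched in a few lines: it is the word-level edit distance between reference and hypothesis, normalized by reference length. This `wer` helper is a minimal illustration, not the leaderboard's exact text-normalization pipeline.

```python
# Minimal WER sketch: Levenshtein distance over word sequences,
# divided by the number of reference words. Hypothetical helper,
# not the leaderboard's actual normalization/scoring code.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

Production evaluations additionally normalize casing, punctuation, and number formatting before scoring, which can shift WER noticeably.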

The latest leaderboard data highlights three clear trends. First, Conformer encoders combined with large language model (LLM) decoders now dominate English transcription accuracy, achieving record‑low WERs. Second, speed‑focused architectures—CTC and TDT decoders—deliver throughput gains of up to two orders of magnitude, making them ideal for real‑time or batch processing of meetings and podcasts. Third, multilingual models broaden language coverage but typically incur a penalty in single‑language precision, while closed‑source offerings continue to outperform open alternatives on long‑form audio due to proprietary optimizations.

For industry stakeholders, these insights translate into actionable decisions. Companies prioritizing multilingual reach may opt for fine‑tuned Whisper variants or Meta’s MMS, accepting modest accuracy trade‑offs. Organizations requiring high‑volume, low‑latency transcription should consider CTC‑based Conformers, especially for English‑only pipelines. Meanwhile, the open‑source community is poised to close the long‑form gap as more datasets and fine‑tuning guides become available. Continued contributions to the Open ASR Leaderboard will drive transparency, foster competition, and accelerate innovation across the global speech AI ecosystem.

Open ASR Leaderboard: Trends and Insights with New Multilingual & Long-Form Tracks

Published November 21, 2025

Authors: Eric Bezzam, Steven Zheng, Eustache Le Bihan, Vaibhav Srivastav


While everyone (and their grandma 👵) is spinning up new ASR models, picking the right one for your use case can feel more overwhelming than choosing your next Netflix show. As of 21 Nov 2025, there are 150 Audio‑Text‑to‑Text and 27 K ASR models on the Hub 🤯

Most benchmarks focus on short‑form English transcription (<30 s) and overlook other important tasks, such as (1) multilingual performance and (2) model throughput, which can be a deciding factor for long‑form audio like meetings and podcasts.

Over the past two years, the Open ASR Leaderboard has become a standard for comparing open and closed‑source models on both accuracy and efficiency. Recently, multilingual and long‑form transcription tracks have been added to the leaderboard 🎉

TL;DR – Open ASR Leaderboard

  • 📝 New preprint on ASR trends from the leaderboard: https://hf.co/papers/2510.06961

  • 🧠 Best accuracy: Conformer encoder + LLM decoders (open‑source ftw 🥳)

  • ⚡ Fastest: CTC / TDT decoders

  • 🌍 Multilingual: Comes at the cost of single‑language performance

  • ⌛ Long‑form: Closed‑source systems still lead (for now 😉)

  • 🧑‍💻 Fine‑tuning guides: Parakeet, Voxtral, Whisper – to continue pushing performance


Takeaways from 60+ models

As of 21 Nov 2025, the Open ASR Leaderboard compares 60+ open and closed‑source models from 18 organizations, across 11 datasets.

In a recent preprint, we dive into the technical setup and highlight some key trends in modern ASR. Here are the big takeaways 👇

1. Conformer encoder 🤝 LLM decoder tops the charts 📈

Models combining Conformer encoders with large language model (LLM) decoders currently lead in English transcription accuracy. Examples include:

  • NVIDIA Canary‑Qwen‑2.5B

  • IBM Granite‑Speech‑3.3‑8B

  • Microsoft Phi‑4‑Multimodal‑Instruct

These achieve the lowest word error rates (WER), showing that integrating LLM reasoning can significantly boost ASR accuracy.

💡 Pro‑tip: NVIDIA introduced Fast Conformer, a 2× faster variant of the Conformer, used in their Canary and Parakeet suite of models.

2. Speed–accuracy tradeoffs ⚖️

While highly accurate, LLM decoders tend to be slower than simpler approaches. On the Open ASR Leaderboard, efficiency is measured using inverse real‑time factor (RTFx), where higher is better.

For even faster inference, CTC and TDT decoders deliver 10–100× faster throughput, albeit with slightly higher error rates. This makes them ideal for real‑time, offline, or batch transcription tasks (e.g., meetings, lectures, podcasts).
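To make the efficiency metric concrete: RTFx is simply audio duration divided by processing time, so higher is better and RTFx = 1 means exactly real time. A tiny sketch (the meeting-length numbers are illustrative, not measured):

```python
# Inverse real-time factor (RTFx): seconds of audio transcribed
# per second of compute. RTFx > 1 means faster than real time.

def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    return audio_seconds / processing_seconds

# e.g., a model that transcribes a 60-minute meeting in 90 seconds:
print(rtfx(60 * 60, 90))  # 40.0 -> 40x faster than real time
```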

3. Multilingual 🌍

OpenAI Whisper Large v3 remains a strong multilingual baseline, supporting 99 languages. However, fine‑tuned or distilled variants such as Distil‑Whisper and CrisperWhisper often outperform the original on English‑only tasks, showing how targeted fine‑tuning can improve specialization.

Focusing on English tends to reduce multilingual coverage → a classic trade‑off between specialization and generalization. Self‑supervised systems like Meta’s Massively Multilingual Speech (MMS) and Omnilingual ASR can support 1 K+ languages but still trail language‑specific encoders in accuracy.

⭐ While only five languages are currently benchmarked, we plan to expand to more languages and welcome new dataset and model contributions via GitHub pull requests.

Community‑driven leaderboards also exist for individual languages, e.g.:

  • Open Universal Arabic ASR Leaderboard – evaluates models on Modern Standard Arabic and regional dialects.

  • Russian ASR Leaderboard – focuses on Russian‑specific phonology and morphology.

These localized efforts mirror the broader multilingual leaderboard’s mission to encourage dataset sharing, fine‑tuned checkpoints, and transparent model comparisons, especially for languages with fewer ASR resources.

4. Long‑form transcription is a different game ⏳

For long‑form audio (podcasts, lectures, meetings), closed‑source systems still edge out open ones, likely due to domain tuning, custom chunking, or production‑grade optimization.

Among open models, Whisper Large v3 performs the best. For throughput, CTC‑based Conformers shine:

  • NVIDIA Parakeet CTC 1.1B – RTFx = 2793.75, WER = 6.68

  • Whisper Large v3 – RTFx = 68.56, WER = 6.43

The trade‑off? Parakeet is English‑only, again highlighting the multilingual vs. specialization tension.
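The "custom chunking" mentioned above usually means slicing long audio into overlapping windows that fit a model's short-form input limit, then merging transcripts at the overlaps. Here is a minimal sketch of computing the window boundaries; the 30 s window and 5 s overlap are illustrative defaults, not any system's actual settings.

```python
# Sketch of fixed-window chunking with overlap for long-form audio,
# for models that only accept short (~30 s) inputs. Window/overlap
# values are hypothetical examples.

def chunk_spans(total_s: float, window_s: float = 30.0,
                overlap_s: float = 5.0) -> list[tuple[float, float]]:
    """Return (start, end) spans in seconds covering [0, total_s]."""
    stride = window_s - overlap_s
    spans, start = [], 0.0
    while start < total_s:
        spans.append((start, min(start + window_s, total_s)))
        if start + window_s >= total_s:
            break  # this window already reaches the end
        start += stride
    return spans

print(chunk_spans(70.0))  # [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

The hard part in practice is not the spans but stitching: deduplicating tokens in the overlap regions, which is where closed-source systems' production tuning likely pays off.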

⭐ While closed systems still lead, there’s huge potential for open‑source innovation. Long‑form ASR remains one of the most exciting frontiers for the community to tackle next!


🎤 The Show Must Go On

Given how fast ASR is evolving, we’re excited to see what new architectures push performance and efficiency, and how the Open ASR Leaderboard continues to serve as a transparent, community‑driven benchmark for the field—and as a reference for other leaderboards (Russian, Arabic, Speech DeepFake Detection).

We’ll keep expanding the Open ASR Leaderboard with more models, more languages, and more datasets, so stay tuned 👀

👉 Want to contribute? Head over to the GitHub repo https://github.com/huggingface/open_asr_leaderboard to open a pull request 🚀
