Speech to Text Is Harder Than You Think

December 16, 2025

Louis Bouchard

Why It Matters

Accurate, low‑latency, multilingual speech‑to‑text is essential for reliable voice agents; without it, transcription errors can cripple CRM workflows, erode customer trust, and inflate operational costs.

Summary

The video challenges the misconception that speech‑to‑text (STT) is just a matter of converting audio into words. It argues that for production voice agents, transcription is only the first step; the harder problems are extracting entities precisely, keeping latency low, and handling multilingual code‑switching. The speaker uses a Montreal‑based scenario, in which a Canadian French accent leads a model to hear “Austin” instead of “Boston”, to illustrate how a single word error can cascade into CRM failures, misfired automations, or incorrect appointments.

The key insights fall under three technical pillars. First, entity accuracy matters more than raw word error rate (WER) because errors do not all carry equal business weight; misrecognizing an email address or phone number can break downstream processes even when every other word is correct. Second, latency is reduced by streaming partial transcripts within milliseconds, which lets a large language model begin inference before the speaker finishes and so mimics natural human turn‑taking. Third, true multilingual readiness requires on‑the‑fly language identification and code‑switching without degrading accuracy, a challenge amplified in bilingual markets like Montreal.
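To make the first pillar concrete, here is a minimal Python sketch of why a low WER can hide a costly entity error. The sentences, the slot names (`city`, `day`), and the helper functions are illustrative assumptions and do not come from the video; WER is computed the usual way, as word‑level edit distance divided by the reference length.

```python
from typing import Dict


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


def entity_accuracy(expected: Dict[str, str], extracted: Dict[str, str]) -> float:
    """Fraction of expected slots whose extracted value matches exactly."""
    if not expected:
        raise ValueError("expected must contain at least one slot")
    correct = sum(1 for slot, value in expected.items() if extracted.get(slot) == value)
    return correct / len(expected)


# Hypothetical utterance: one substituted word out of ten.
reference = "please book my flight to boston on friday at nine"
hypothesis = "please book my flight to austin on friday at nine"

# Hypothetical slots a downstream CRM or booking workflow would consume.
expected_slots = {"city": "boston", "day": "friday"}
heard_slots = {"city": "austin", "day": "friday"}

wer = word_error_rate(reference, hypothesis)
slots_ok = entity_accuracy(expected_slots, heard_slots)
print(f"WER: {wer:.0%}, entity accuracy: {slots_ok:.0%}")
```

In this toy case a single substitution gives a WER of 10%, which looks good on a leaderboard, yet half of the slots the workflow depends on are wrong and the booking goes to the wrong city.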

The speaker highlights Gladia as a concrete example that addresses these gaps. According to the video, Gladia stress‑tests its models on noisy, accent‑heavy datasets, includes built‑in named‑entity recognition for emails, names, and numbers, and supports real‑time code‑switching across more than 100 languages. Its partial results arrive in roughly 100 ms, which delivers the responsiveness needed for natural voice interactions without extensive custom infrastructure. The “Austin vs. Boston” anecdote underscores how even a minor transcription slip can carry outsized operational costs.
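The value of partial transcripts can be sketched with a simulated stream. The code below is not Gladia's API: the `TranscriptEvent` shape, the `simulated_stream` generator, and the keyword‑based `detect_intent` are all hypothetical stand‑ins, used only to show downstream logic starting before the final transcript arrives.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass
class TranscriptEvent:
    """Hypothetical event shape: transcript text plus a finality flag."""
    text: str
    is_final: bool


def simulated_stream() -> Iterator[TranscriptEvent]:
    """Stand-in for a real-time STT connection; yields growing partials."""
    partials = [
        "I'd like",
        "I'd like to book",
        "I'd like to book a flight",
        "I'd like to book a flight to Boston",
    ]
    for text in partials[:-1]:
        yield TranscriptEvent(text, is_final=False)
    yield TranscriptEvent(partials[-1], is_final=True)


def detect_intent(text: str) -> Optional[str]:
    """Toy keyword matcher standing in for an LLM or intent classifier."""
    return "booking" if "book" in text.lower() else None


def run_agent(events: Iterable[TranscriptEvent]) -> Tuple[Optional[str], Optional[int], Optional[str]]:
    """Return (intent, index of the event where it was first seen, final text)."""
    intent: Optional[str] = None
    seen_at: Optional[int] = None
    final_text: Optional[str] = None
    for index, event in enumerate(events):
        # Start downstream work on the earliest partial that reveals the intent.
        if intent is None:
            intent = detect_intent(event.text)
            if intent is not None:
                seen_at = index
        # Entities such as the city are only read from the final transcript.
        if event.is_final:
            final_text = event.text
            break
    return intent, seen_at, final_text


intent, seen_at, final_text = run_agent(simulated_stream())
print(intent, seen_at, final_text)
```

Here the intent is available at the second event, two events before the utterance ends. One reasonable design choice, assumed in this sketch, is to act on partials only for work that is cheap to discard (such as warming up a booking flow) and to commit entities like the city name only once the transcript is final, since partials may still be revised.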

For enterprises building voice agents, the implication is clear: selecting an STT solution that prioritizes entity fidelity, ultra‑low latency, and multilingual agility can dramatically improve user experience and reduce costly automation errors. Companies that overlook these dimensions risk deploying agents that sound competent but fail in real‑world interactions, whereas leveraging a proven platform like Gladia can accelerate time‑to‑market and safeguard downstream business logic.

Original Description

Most people think speech to text is just “audio in, words out.”

That’s fine… until you build a real voice agent.

Then one misheard city name breaks your CRM.

One wrong digit fires the wrong automation.

And suddenly your “great WER” means nothing.

What actually matters is entity accuracy, latency measured in milliseconds, and handling messy human speech. Accents. Code switching. Half finished sentences. Real conversations don’t wait for perfect transcripts.

This is why partial transcripts, fast turn taking, and strong NER matter more than leaderboard metrics. And why Montreal is the ultimate stress test for STT systems.

This is also why I like how Gladia approaches the problem. No hype. Just engineering for how people actually speak.

If you’re building voice agents, ask a better question than “what’s the best STT?”

Ask whether it understands humans in the real world. 🎙️

I’m Louis-François, PhD dropout, now CTO & co-founder at Towards AI. Follow me for tomorrow’s no-BS AI roundup 🚀

#AIEngineering #VoiceAI #SpeechToText #short
