
Google Health AI Releases MedASR: A Conformer-Based Medical Speech-to-Text Model for Clinical Dictation
Why It Matters
MedASR provides a free, high‑accuracy, domain‑specific ASR solution that lowers entry barriers for developers building automated clinical documentation, speeding up EHR integration and reducing physician transcription burden.
The Google Health AI team has released MedASR, an open‑weights medical speech‑to‑text model that targets clinical dictation and physician‑patient conversations and is designed to plug directly into modern AI workflows.
What MedASR is and where it fits
MedASR is a speech‑to‑text model based on the Conformer architecture, pre‑trained for medical dictation and transcription. It is positioned as a starting point for developers building healthcare voice applications such as radiology dictation tools or visit‑note capture systems.

The model has 105 million parameters and accepts mono‑channel audio at 16 kHz with 16‑bit integer waveforms. It produces text‑only output, so it drops directly into downstream natural‑language‑processing or generative models such as MedGemma.
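Since the model expects 16 kHz mono audio, recordings in other formats need conversion first. Below is a minimal sketch using librosa and numpy; the libraries and the file name are assumptions for illustration, not part of the MedASR release:

```python
import librosa
import numpy as np

# Load any recording, downmixing to mono and resampling to 16 kHz,
# the input format MedASR expects ("visit_recording.wav" is hypothetical).
waveform, sr = librosa.load("visit_recording.wav", sr=16000, mono=True)

# librosa returns float32 in [-1, 1]; scale to 16-bit integer samples
# to match the stated waveform format.
pcm16 = (np.clip(waveform, -1.0, 1.0) * 32767).astype(np.int16)
```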
MedASR sits inside the Health AI Developer Foundations portfolio, alongside MedGemma, MedSigLIP and other domain‑specific medical models that share common terms of use and a consistent governance story.
Training data and domain specialization
MedASR is trained on a diverse corpus of de‑identified medical speech. The dataset includes about 5 000 hours of physician dictations and clinical conversations across radiology, internal medicine and family medicine.

The training pairs audio segments with transcripts and metadata. Subsets of the conversational data are annotated with medical named entities including symptoms, medications and conditions. This gives the model strong coverage of clinical vocabulary and phrasing patterns that appear in routine documentation.
The model is English‑only, and most training audio comes from speakers for whom English is a first language and who were raised in the United States. The documentation notes that performance may be lower for other speaker profiles or noisy microphones and recommends fine‑tuning for such settings.
Architecture and decoding
MedASR follows the Conformer encoder design. Conformer combines convolution blocks with self‑attention layers so it can capture local acoustic patterns and longer‑range temporal dependencies in the same stack.
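To make that structure concrete, here is a minimal, illustrative Conformer block in PyTorch. This is a generic sketch of the Conformer design, not MedASR's actual implementation; all dimensions and hyperparameters are placeholder values:

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Depthwise-convolution branch: captures local acoustic patterns."""
    def __init__(self, dim, kernel_size=31, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise_in = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.batch_norm = nn.BatchNorm1d(dim)
        self.activation = nn.SiLU()
        self.pointwise_out = nn.Conv1d(dim, dim, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                     # x: (batch, time, dim)
        x = self.norm(x).transpose(1, 2)      # Conv1d wants (batch, dim, time)
        x = self.glu(self.pointwise_in(x))
        x = self.activation(self.batch_norm(self.depthwise(x)))
        x = self.dropout(self.pointwise_out(x))
        return x.transpose(1, 2)

class ConformerBlock(nn.Module):
    """Half-step FFN -> self-attention -> convolution -> half-step FFN,
    each applied as a residual branch, as in the published Conformer design."""
    def __init__(self, dim=256, num_heads=4, ff_mult=4, dropout=0.1):
        super().__init__()
        def feed_forward():
            return nn.Sequential(
                nn.LayerNorm(dim),
                nn.Linear(dim, dim * ff_mult), nn.SiLU(), nn.Dropout(dropout),
                nn.Linear(dim * ff_mult, dim), nn.Dropout(dropout),
            )
        self.ff1 = feed_forward()
        self.attn_norm = nn.LayerNorm(dim)
        # Self-attention captures longer-range temporal dependencies.
        self.attn = nn.MultiheadAttention(dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.conv = ConvModule(dim, dropout=dropout)
        self.ff2 = feed_forward()
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x):                     # x: (batch, time, dim)
        x = x + 0.5 * self.ff1(x)
        attn_in = self.attn_norm(x)
        x = x + self.attn(attn_in, attn_in, attn_in, need_weights=False)[0]
        x = x + self.conv(x)
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)

# Example: a batch of one 100-frame feature sequence.
out = ConformerBlock()(torch.randn(1, 100, 256))
```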
The model is exposed as an automatic speech recognition model with a CTC‑style interface. In the reference implementation, developers use AutoProcessor to create input features from waveform audio and AutoModelForCTC to produce token sequences. Decoding is greedy by default; the model can also be paired with an external six‑gram language model and beam search of width 8 to reduce word error rate.
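To illustrate the external‑language‑model path, the sketch below uses the open‑source pyctcdecode package with a KenLM‑style n‑gram model. pyctcdecode, the language‑model file and the audio file name are assumptions for illustration; none of them ship with MedASR:

```python
import librosa
import torch
from transformers import AutoProcessor, AutoModelForCTC
from pyctcdecode import build_ctcdecoder  # third-party CTC beam-search decoder

processor = AutoProcessor.from_pretrained("google/medasr")
model = AutoModelForCTC.from_pretrained("google/medasr")

# Recover the token vocabulary in id order for the decoder.
vocab = sorted(processor.tokenizer.get_vocab().items(), key=lambda kv: kv[1])
labels = [token for token, _ in vocab]

# Hypothetical six-gram KenLM file used to rescore beam-search hypotheses.
decoder = build_ctcdecoder(labels, kenlm_model_path="medical_6gram.arpa")

waveform, _ = librosa.load("dictation.wav", sr=16000, mono=True)
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0].numpy()

# Beam search of width 8 over the CTC logits, as described above.
print(decoder.decode(logits, beam_width=8))
```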
MedASR training uses JAX and ML Pathways on TPUv4p, TPUv5p and TPUv5e hardware. These systems provide the scale needed for large speech models and align with Google’s broader foundation‑model training stack.
Performance on medical speech tasks
Key results, reported as word error rate (lower is better), for greedy decoding versus decoding with the six‑gram language model:
| Task | MedASR greedy | MedASR + LM | Gemini 2.5 Pro | Gemini 2.5 Flash | Whisper v3 Large |
|------|---------------|------------|----------------|------------------|------------------|
| RAD DICT (radiologist dictation) | 6.6 % | 4.6 % | 10.0 % | 24.4 % | 25.3 % |
| GENERAL DICT (general & internal medicine) | 9.3 % | 6.9 % | 16.4 % | 27.1 % | 33.1 % |
| FM DICT (family medicine) | 8.1 % | 5.8 % | 14.6 % | 19.9 % | 32.5 % |
| Eye Gaze (998 MIMIC chest‑X‑ray cases) | 6.6 % | 5.2 % | 5.9 % | 9.3 % | 12.5 % |
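For context, word error rate counts substitutions, deletions and insertions against the reference word count. A minimal check with the open‑source jiwer package (an assumption for illustration; the transcripts are invented):

```python
import jiwer

reference = "no acute cardiopulmonary abnormality"    # invented ground truth
hypothesis = "no acute cardio pulmonary abnormality"  # invented model output

# WER = (substitutions + deletions + insertions) / reference word count
print(jiwer.wer(reference, hypothesis))  # 0.5: one substitution + one insertion over 4 words
```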
Developer workflow and deployment options
A minimal pipeline example:
```python
import huggingface_hub
from transformers import pipeline

# Fetch the sample audio file bundled with the model repository.
audio = huggingface_hub.hf_hub_download("google/medasr", "test_audio.wav")

# Chunked long-form transcription: 20 s windows with 2 s of stride overlap.
pipe = pipeline("automatic-speech-recognition", model="google/medasr")
result = pipe(audio, chunk_length_s=20, stride_length_s=2)
print(result)
```
For more control, developers can load AutoProcessor and AutoModelForCTC, resample audio to 16 kHz with librosa, move tensors to CUDA if available, run a forward pass, take the frame‑wise argmax over the logits (greedy CTC decoding), and call processor.batch_decode on the predicted ids.
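A hedged sketch of that lower‑level path, assuming a hypothetical local audio file:

```python
import librosa
import torch
from transformers import AutoProcessor, AutoModelForCTC

processor = AutoProcessor.from_pretrained("google/medasr")
model = AutoModelForCTC.from_pretrained("google/medasr")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Resample the recording to the 16 kHz mono input the model expects.
waveform, _ = librosa.load("dictation.wav", sr=16000, mono=True)

inputs = processor(waveform, sampling_rate=16000, return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: frame-wise argmax, then batch_decode collapses
# repeats and blank tokens into the final transcript.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```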
Key Takeaways
- MedASR is a lightweight, open‑weights, Conformer‑based medical ASR model: 105 M parameters, trained specifically for medical dictation and transcription, released under the Health AI Developer Foundations program (English‑only).
- Domain‑specific training on ~5 000 hours of de‑identified medical audio: physician dictations and clinical conversations across radiology, internal medicine and family medicine give strong coverage of clinical terminology.
- Competitive or better word‑error rates on medical dictation benchmarks: on internal radiology, general medicine, family medicine and Eye Gaze datasets, MedASR (greedy or with a language model) matches or outperforms large general models such as Gemini 2.5 Pro, Gemini 2.5 Flash and Whisper v3 Large.
Asif Razzaq
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in‑depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.