
Eliminating word‑level alignment removes a major data bottleneck, enabling faster, cheaper deployment of high‑quality real‑time translation across more languages.
The demand for real‑time speech‑to‑speech translation has outpaced the supply of models that can operate at low latency without sacrificing naturalness. Traditional pipelines depend on painstakingly curated word‑level alignments, a bottleneck that limits language coverage and inflates development costs. Kyutai’s Hibiki‑Zero disrupts this paradigm by eliminating the need for such fine‑grained supervision. Leveraging only sentence‑level alignments and a reinforcement‑learning fine‑tuning stage, the system learns when to listen and when to speak, opening the door to rapid deployment across under‑resourced languages.
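The "when to listen, when to speak" behavior described above can be pictured as a streaming loop in which a learned policy gates emission at every audio frame. The sketch below is illustrative only; all names (`simultaneous_translate`, `policy`, `decoder`) are hypothetical and are not Kyutai's API.

```python
# Illustrative sketch (not Kyutai's API): a simultaneous-translation loop in
# which a learned policy decides, at each incoming audio frame, whether to
# keep listening (emit silence) or to speak (emit a target token). In
# Hibiki-Zero this decision is learned via RL rather than from word-level
# alignments; here the policy is a stand-in callable.

def simultaneous_translate(source_frames, policy, decoder, silence_token=0):
    """Interleave listening and speaking over a stream of source frames."""
    target_tokens = []
    context = []
    for frame in source_frames:
        context.append(frame)
        if policy(context):                      # enough context: speak
            target_tokens.append(decoder(context))
        else:
            target_tokens.append(silence_token)  # not yet: keep listening
    return target_tokens

# Toy usage: speak once at least 3 frames of context have accumulated.
out = simultaneous_translate(
    source_frames=[1, 2, 3, 4, 5],
    policy=lambda ctx: len(ctx) >= 3,
    decoder=lambda ctx: ctx[-1] + 100,
)
```

The point of the structure is that nothing in the loop requires knowing which source word maps to which target word; the policy only needs a scalar go/no-go decision per frame, which is exactly what sentence-level supervision plus RL can provide.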
At its core, Hibiki‑Zero is a 3‑billion‑parameter decoder‑only model that processes three synchronized streams: source audio tokens, target audio tokens, and an internal text monologue. The architecture builds on the Mimi causal audio codec, which compresses waveforms into discrete tokens at 12.5 Hz, and an RQ‑Transformer whose temporal and depth transformers run 28 and 6 layers deep, respectively. Reinforcement‑learning fine‑tuning with Group Relative Policy Optimization (GRPO) uses the BLEU score as a reward signal, iteratively reducing average lag while preserving translation fidelity. This multistream‑plus‑RL combination yields an average lag of 2.3 seconds on long‑form benchmarks, well below competing systems.
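GRPO's defining trick is that it needs no learned value function: a group of sampled rollouts for the same source utterance is scored, and each rollout's advantage is its reward relative to the group mean. The sketch below assumes a reward that combines BLEU with a lag penalty; the penalty weight is a hypothetical choice for illustration, not a published value.

```python
# Sketch of GRPO's group-relative advantage computation. The reward shaping
# (BLEU minus a weighted lag penalty) is an assumption made for illustration.

from statistics import mean, pstdev

def reward(bleu, avg_lag_s, lag_weight=0.1):
    """Higher BLEU is better; longer average lag is penalised (assumed form)."""
    return bleu - lag_weight * avg_lag_s

def grpo_advantages(group):
    """group: list of (bleu, avg_lag_s) pairs for G rollouts of one utterance."""
    rewards = [reward(b, lag) for b, lag in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    # Normalised, group-relative advantages: zero-centred by construction,
    # so no critic network is needed to estimate a baseline.
    return [(r - mu) / sigma if sigma > 0 else 0.0 for r in rewards]

# Toy group of four rollouts: (BLEU, average lag in seconds).
adv = grpo_advantages([(30.0, 4.0), (28.0, 2.0), (32.0, 5.0), (27.0, 2.3)])
```

Rollouts that translate accurately with low lag land above the group mean and get positive advantage, which is what pushes the policy toward speaking earlier without degrading fidelity.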
The performance gains translate into tangible business value. On five X‑to‑English tasks Hibiki‑Zero outperforms Meta’s Seamless model, delivering a 30‑point lead in speaker similarity and higher ASR‑BLEU scores, while maintaining comparable latency. Its ability to adapt to a new source language—Italian—with fewer than 1,000 hours of speech data demonstrates a scalable path for multilingual product rollouts. Enterprises seeking to embed live translation in conferencing, customer support, or content localization can now consider a solution that reduces data collection overhead, shortens time‑to‑market, and enhances user experience across diverse linguistic markets.