Tencent Released Tencent HY-Motion 1.0: A Billion-Parameter Text-to-Motion Model Built on the Diffusion Transformer (DiT) Architecture and Flow Matching

AI • MarkTechPost • December 31, 2025

Companies Mentioned

  • Tencent Cloud
  • Telegram
  • Hugging Face
  • GitHub
  • Reddit
  • X (formerly Twitter)

Why It Matters

The model brings billion‑scale generative motion capability to developers, enabling rapid creation of high‑fidelity 3D character animations for games, film, and digital humans, and sets a new performance baseline for text‑to‑motion AI.

Key Takeaways

  • Billion‑parameter Diffusion Transformer for text‑to‑motion
  • 3,000‑hour motion corpus with 200+ categories
  • Hybrid dual‑stream DiT with asymmetric text‑motion attention
  • Prompt‑rewrite module predicts duration and normalizes user input (see the sketch after this list)
  • Open‑source code, checkpoints, and Gradio UI for developers
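
To picture what the prompt‑rewrite takeaway means in practice, here is a toy, purely heuristic stand‑in (the actual module is learned, and its interface is not documented in this summary): it normalizes the raw prompt and derives a frame count from an explicit duration if the user supplied one.

```python
import re

# Toy stand-in for a prompt-rewrite step: collapse whitespace, pull an
# explicit duration out of the text if present, and convert it to frames.
DEFAULT_SECONDS = 4.0

def rewrite_prompt(raw: str, fps: int = 30) -> tuple[str, int]:
    text = " ".join(raw.strip().split())            # normalize whitespace
    m = re.search(r"(\d+(?:\.\d+)?)\s*(?:s|sec|seconds)\b", text, re.I)
    seconds = float(m.group(1)) if m else DEFAULT_SECONDS
    return text, int(seconds * fps)                 # (clean prompt, frames)

print(rewrite_prompt("a person  waves   for 2 seconds"))
# ('a person waves for 2 seconds', 60)
```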

Pulse Analysis

The rise of text‑to‑motion generation marks a pivotal shift in how creators produce 3D content. Traditional pipelines rely on manual key‑framing or costly motion‑capture sessions, limiting scalability. By harnessing diffusion models and large language models, HY‑Motion 1.0 translates simple textual descriptions into realistic SMPL‑H skeleton sequences, dramatically lowering the barrier for studios and indie developers to populate virtual worlds with diverse, context‑aware animations.
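
To make that output format concrete, here is a minimal sketch of how a generated clip could be stored, assuming the standard SMPL‑H parameterization (a per‑frame root translation plus axis‑angle rotations for 52 joints); the `MotionClip` schema is illustrative, not HY‑Motion's actual data format.

```python
from dataclasses import dataclass
import numpy as np

# SMPL-H parameterizes a body with 52 joints (22 body + 30 hand),
# each as a 3-D axis-angle rotation, plus a global root translation.
NUM_JOINTS = 52

@dataclass
class MotionClip:
    """One generated motion clip at a fixed frame rate (illustrative schema)."""
    fps: int            # frames per second, e.g. 30
    trans: np.ndarray   # (T, 3) root translation per frame
    poses: np.ndarray   # (T, 52, 3) axis-angle joint rotations

    @property
    def duration_s(self) -> float:
        return self.trans.shape[0] / self.fps

# A 2-second placeholder clip at 30 fps:
T = 60
clip = MotionClip(fps=30,
                  trans=np.zeros((T, 3)),
                  poses=np.zeros((T, NUM_JOINTS, 3)))
print(clip.duration_s)  # 2.0
```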

Technically, HY‑Motion 1.0 distinguishes itself through a hybrid DiT architecture that blends dual‑stream and single‑stream processing, allowing motion tokens to query rich semantic features from powerful Qwen3 and CLIP‑L encoders while preserving modality‑specific structure. Flow Matching replaces conventional denoising diffusion, offering stable training on long sequences and efficient inference via ordinary‑differential‑equation integration. The model’s three‑stage curriculum—large‑scale pretraining, high‑quality fine‑tuning, and reinforcement‑learning alignment—leverages a 3,000‑hour curated dataset and a dedicated prompt‑rewrite module to ensure both semantic fidelity and physical plausibility.
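
As a rough, self‑contained illustration of the Flow Matching recipe described above (not Tencent's training code), the toy sketch below regresses a velocity field on linear noise‑to‑data paths and then samples by Euler‑integrating the learned ODE; the real model swaps the tiny MLP for the text‑conditioned DiT.

```python
import torch
import torch.nn as nn

# Tiny velocity network standing in for the DiT; HY-Motion additionally
# conditions on text embeddings, which this toy omits.
dim = 16
v_net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))
opt = torch.optim.Adam(v_net.parameters(), lr=1e-3)

def fm_loss(x1):
    """Conditional flow matching on the linear path x_t = (1-t)*x0 + t*x1."""
    x0 = torch.randn_like(x1)                   # noise endpoint
    t = torch.rand(x1.shape[0], 1)              # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                  # point on the path
    target = x1 - x0                            # constant path velocity
    pred = v_net(torch.cat([xt, t], dim=-1))
    return ((pred - target) ** 2).mean()

for _ in range(100):                            # toy training loop
    x1 = torch.randn(32, dim) + 3.0             # stand-in "data"
    opt.zero_grad()
    fm_loss(x1).backward()
    opt.step()

# Inference: integrate dx/dt = v(x, t) from noise (t=0) toward data (t=1).
x = torch.randn(8, dim)
steps = 20
for i in range(steps):
    t = torch.full((8, 1), i / steps)
    x = x + v_net(torch.cat([x, t], dim=-1)) / steps  # Euler step
```

Because sampling is a plain ODE integration, the step count is a direct quality/speed knob, which is one reason Flow Matching lends itself to efficient inference on long sequences.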

For the industry, the open‑source release on GitHub and Hugging Face democratizes access to state‑of‑the‑art motion synthesis. Developers can integrate the Gradio interface or CLI into existing pipelines, accelerating content creation for games, cinematic pre‑visualization, and interactive avatars. As the taxonomy expands and future iterations scale beyond the billion‑parameter mark, HY‑Motion sets a foundation for increasingly nuanced, multi‑modal generative systems that could eventually combine motion, speech, and facial expression in a single, coherent AI‑driven workflow.
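
For teams wiring the release into their own pipelines, the Gradio side can be as small as the sketch below; `generate_motion` is a hypothetical stand‑in for whatever entry point the HY‑Motion repository actually exposes.

```python
import gradio as gr

def generate_motion(prompt: str, seconds: float) -> str:
    # Hypothetical stand-in: call the HY-Motion pipeline here and return
    # a path to a rendered preview or an exported motion file.
    return f"(would generate {seconds:.1f}s of motion for: {prompt})"

demo = gr.Interface(
    fn=generate_motion,
    inputs=[gr.Textbox(label="Motion prompt"),
            gr.Slider(1, 10, value=4, label="Duration (s)")],
    outputs=gr.Textbox(label="Result"),
    title="Text-to-motion demo (illustrative wrapper)",
)

if __name__ == "__main__":
    demo.launch()
```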
