Meta Brings Segment Anything to Audio, Letting Editors Pull Sounds From Video with a Click or Text Prompt

THE DECODER • December 26, 2025

Why It Matters

SAM Audio democratizes advanced sound‑source separation, giving creators, broadcasters, and accessibility tech a flexible, real‑time tool that can streamline production and improve listening experiences.

Key Takeaways

•SAM Audio isolates sounds via text, clicks, or time markers
•Supports real‑time processing at model sizes up to 3 billion parameters
•New benchmarks evaluate separation without reference tracks
•Open‑source code and weights available in Segment Anything Playground
•Partnerships target hearing‑aid accessibility and audio editing workflows

Pulse Analysis

Meta’s foray into audio segmentation marks a significant evolution of its Segment Anything framework, originally built for images and 3‑D objects. By integrating a flow‑matching diffusion transformer with the Perception Encoder Audiovisual (PE‑AV), SAM Audio can synchronize visual cues from video frames with corresponding audio waveforms, allowing precise extraction of sounds anchored to on‑screen entities. This multimodal alignment not only improves isolation accuracy but also supports diverse input modalities—textual descriptions, direct clicks, and temporal span prompts—making the system adaptable to a wide range of creative workflows.

The practical implications are immediate for industries that rely on clean audio tracks. Music producers can pull individual instruments from mixed recordings, podcasters can mute background chatter with a single command, and film editors can strip traffic noise from location shoots without re‑recording dialogue in ADR. Meta’s collaboration with Starkey and the incubator 2gether‑International hints at broader accessibility applications, such as enhancing hearing‑aid performance by isolating speech from noisy environments. By offering the model’s code and weights openly, Meta encourages third‑party innovation, potentially accelerating the development of niche tools for sound design, game audio, and live‑stream moderation.

Evaluation of sound separation has traditionally hinged on metrics that compare the output against a clean reference track, a limitation SAM Audio addresses through its new SAM Audio‑Bench and SAM Audio Judge. These tools assess perceptual fidelity without clean references, aligning model scores more closely with human listening judgments. While the system still struggles with highly similar sources, such as separating a single vocalist from a choir, it delivers real‑time processing even at the 3‑billion‑parameter scale. As competitors roll out comparable multimodal audio tools, Meta’s open‑source stance and robust benchmarking suite position SAM Audio as a foundational platform for the next generation of AI‑driven audio editing and accessibility solutions.
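
To make the reference‑track limitation concrete, here is a minimal sketch of one widely used reference‑based separation metric, scale‑invariant signal‑to‑distortion ratio (SI‑SDR). The article does not say which metrics SAM Audio‑Bench replaces, so the choice of SI‑SDR is an assumption for illustration, and the function names `si_sdr` and `tone` are hypothetical helpers, not part of Meta's code. The point is that the score cannot be computed at all without the clean `reference` signal, which real‑world recordings rarely have.

```python
import math


def tone(freq_hz, n_samples=8000, sample_rate=8000):
    """Synthetic sine wave, standing in for an isolated sound source."""
    return [math.sin(2.0 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB (higher is better).

    Illustrative only: assumes both signals are equal-length lists of
    floats and that a clean reference for the target source exists.
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")

    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    ref_energy = sum(r * r for r in ref)
    if ref_energy == 0.0:
        raise ValueError("reference is silent; SI-SDR is undefined")

    # Project the estimate onto the reference: this is the part of the
    # estimate that matches the target, regardless of overall volume.
    scale = sum(e * r for e, r in zip(est, ref)) / ref_energy
    target = [scale * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    if residual_energy == 0.0:
        return math.inf
    if target_energy == 0.0:
        return -math.inf
    return 10.0 * math.log10(target_energy / residual_energy)


if __name__ == "__main__":
    clean = tone(440)    # the sound we want to isolate
    leak = tone(1000)    # an interfering sound
    # Pretend a separator left 10% of the interferer's amplitude behind.
    separated = [c + 0.1 * l for c, l in zip(clean, leak)]
    print(f"SI-SDR: {si_sdr(separated, clean):.2f} dB")
```

In this toy setup the two tones are orthogonal and equally loud, so leaving 10% of the interferer's amplitude in the output works out to about 20 dB. A reference‑free judge, as the article describes, has to estimate quality from the separated audio (and the prompt) alone.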