By automating and massively scaling high-quality image annotation, Meta eased the data bottleneck for object recognition, yielding substantial performance gains and tighter integration of vision and language capabilities that could accelerate multimodal AI applications and product features.
Meta's SAM 3 adds text prompting to its segmentation model: users type a short phrase, and the model automatically finds and segments every matching object. To scale annotated training data, Meta used fine-tuned Llama-based AI annotators that learned from human examples to produce both positive and negative labels, enabling faster, more accurate mask creation at much larger scale. The resulting dataset emphasizes short, diverse phrases, and Meta reports roughly double the performance of competing systems. SAM 3 is positioned to integrate with large language models for more complex multimodal tasks, and Meta frames it as a vision agent that enhances language models' visual perception.
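To make the interaction concrete, here is a minimal sketch of what text-prompted segmentation looks like from the caller's side. The `TextPromptedSegmenter` class and `Mask` type are hypothetical stand-ins, not Meta's published API; they only illustrate the contract the article describes: a short noun phrase in, one mask per matching instance out.

```python
# Hypothetical sketch of concept-prompted segmentation, SAM 3-style.
# This mock interface is an assumption for illustration, not Meta's API.
from dataclasses import dataclass

import numpy as np


@dataclass
class Mask:
    pixels: np.ndarray  # boolean array, shape (H, W); True = object pixel
    score: float        # confidence that this instance matches the phrase


class TextPromptedSegmenter:
    """Mock model: returns one dummy instance so the example runs end to end."""

    def segment(self, image: np.ndarray, phrase: str) -> list[Mask]:
        # A real model would encode the image, embed the phrase, and decode
        # one mask per matching instance; here we fabricate a single box.
        h, w = image.shape[:2]
        dummy = np.zeros((h, w), dtype=bool)
        dummy[h // 4 : h // 2, w // 4 : w // 2] = True  # pretend detection
        return [Mask(pixels=dummy, score=0.9)]


if __name__ == "__main__":
    image = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder image
    model = TextPromptedSegmenter()
    for mask in model.segment(image, "yellow school bus"):
        print(f"score={mask.score:.2f}, area={mask.pixels.sum()} px")
```

The key difference from earlier, point-and-box-prompted SAM versions is the return type: a list of masks, one per instance of the named concept, rather than a single mask at a clicked location.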
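The data engine can be sketched in the same spirit. Below, a mock "AI annotator" stands in for the fine-tuned Llama-based annotators the article mentions: it accepts or rejects candidate phrase-mask pairs, and both verdicts are kept, positives as training masks and negatives as hard negatives. All names, the quality signal, and the verification logic are illustrative assumptions, not Meta's pipeline.

```python
# Illustrative sketch of an AI-assisted annotation loop: an automated
# annotator accepts or rejects candidate phrase-mask pairs, keeping both
# verdicts (positives for training, negatives as hard negatives).
# Names and logic are assumptions for illustration, not Meta's pipeline.
from dataclasses import dataclass


@dataclass
class Candidate:
    image_id: str
    phrase: str          # short noun phrase, e.g. "striped umbrella"
    mask_quality: float  # stand-in for whatever signal the annotator scores


def ai_annotator_verdict(c: Candidate, threshold: float = 0.5) -> bool:
    """Mock annotator: accept candidates whose quality clears a threshold.

    A real system would use a fine-tuned LLM, trained on human examples,
    judging whether the mask actually matches the phrase.
    """
    return c.mask_quality >= threshold


def build_training_set(candidates: list[Candidate]):
    positives, hard_negatives = [], []
    for c in candidates:
        (positives if ai_annotator_verdict(c) else hard_negatives).append(c)
    return positives, hard_negatives


if __name__ == "__main__":
    pool = [
        Candidate("img_001", "yellow school bus", 0.92),
        Candidate("img_001", "yellow taxi", 0.31),  # wrong concept: negative
        Candidate("img_002", "striped umbrella", 0.77),
    ]
    pos, neg = build_training_set(pool)
    print(f"{len(pos)} positives, {len(neg)} hard negatives")
```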