Why Vision Language Models Ignore What They See with Munawar Hayat - #758
The TWIML AI Podcast • December 9, 2025 • 57 min

Key Takeaways

  • Vision tokens receive insufficient attention in current VLMs.
  • Physics-aware generation remains a major limitation for VLMs.
  • Cross‑attention layers improve visual grounding and reduce hallucinations.
  • Benchmark bias lets language models answer without looking at images.
  • Offline segmentation masks enable efficient training without inference overhead.

Pulse Analysis

In this episode, Munawar Hayat from Qualcomm AI Research explains why modern vision‑language models (VLMs) often ignore visual input, leading to hallucinations and poor physics‑aware generation. He highlights that when a language model is fused with a vision encoder, the massive textual pre‑training dwarfs the visual signal, causing the system to answer from parametric memory rather than the image. The discussion references recent NeurIPS papers that systematically analyze this failure, showing tangled similarity distributions between image and text embeddings and demonstrating that many benchmarks can be solved without actually looking at the picture.
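One way to make this imbalance concrete is to measure how much attention mass the text tokens actually place on the vision tokens. The sketch below is illustrative only, not the method from the papers: it assumes you have already extracted a softmaxed attention tensor from one decoder layer of a VLM and know which key positions correspond to image patches.

```python
import torch

def vision_attention_share(attn: torch.Tensor, vision_mask: torch.Tensor) -> torch.Tensor:
    """Fraction of attention mass that queries place on vision tokens.

    attn:        (batch, heads, query_len, key_len) softmaxed attention
                 weights from one decoder layer of a VLM.
    vision_mask: (key_len,) bool tensor, True where the key position is an
                 image-patch token.
    """
    return attn[..., vision_mask].sum(dim=-1).mean()

# Toy example: 576 image-patch tokens followed by 64 text tokens.
key_len = 576 + 64
attn = torch.softmax(torch.randn(1, 8, 64, key_len), dim=-1)
vision_mask = torch.zeros(key_len, dtype=torch.bool)
vision_mask[:576] = True
# With ~90% of keys being vision tokens, a share well below 0.9 hints that
# text queries are leaning on language priors instead of the image.
print(f"attention share on vision tokens: {vision_attention_share(attn, vision_mask):.3f}")
```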

Hayat’s team tackles the problem by redesigning the transformer architecture: instead of concatenating vision and text tokens at the input, they insert cross‑attention modules every few layers and add an auxiliary loss that forces attention onto relevant visual regions. Offline segmentation masks, derived from models like SAM, guide the loss, ensuring the model focuses on the correct object while keeping inference speed unchanged. This hierarchical injection replaces the quadratic self‑attention cost of the concatenated sequence, O((M+N)^2), with O(N^2 + MN) for text self‑attention plus text‑to‑vision cross‑attention, where M is the number of vision tokens and N the number of text tokens, enabling longer contexts and more efficient training for multimodal generative AI on mobile hardware.
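Here is a minimal PyTorch sketch of that design, not Qualcomm's implementation: text tokens self-attend among themselves (cost N^2) and cross-attend to M vision tokens (cost MN), and a hypothetical KL-based guidance loss pushes the cross-attention map toward patches flagged by an offline segmentation mask.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttnBlock(nn.Module):
    """Text self-attention, O(N^2), plus text-to-vision cross-attention, O(N*M)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text, vision):
        # Vision tokens never enter self-attention, so its cost stays N^2.
        t = self.norm1(text)
        text = text + self.self_attn(t, t, t, need_weights=False)[0]
        # Queries are text tokens; keys/values are the M vision tokens.
        out, attn = self.cross_attn(self.norm2(text), vision, vision,
                                    need_weights=True, average_attn_weights=True)
        return text + out, attn  # attn: (batch, N_text, M_vision)

def attention_guidance_loss(attn, patch_mask):
    """Illustrative auxiliary loss: pull each text token's cross-attention
    toward a uniform distribution over patches inside an offline
    segmentation mask (e.g., precomputed with SAM before training).
    attn:       (batch, N_text, M_vision), rows sum to 1
    patch_mask: (batch, M_vision) in {0, 1}, 1 = patch overlaps the object
    """
    target = patch_mask / patch_mask.sum(dim=-1, keepdim=True).clamp(min=1)
    return F.kl_div(attn.clamp_min(1e-8).log(),
                    target.unsqueeze(1).expand_as(attn),
                    reduction="batchmean")

# Toy shapes: 16 text tokens attending to 36 image patches.
blk = CrossAttnBlock(dim=64)
text, vision = torch.randn(2, 16, 64), torch.randn(2, 36, 64)
mask = (torch.rand(2, 36) > 0.7).float()
out, attn = blk(text, vision)
print(attention_guidance_loss(attn, mask))
```

Because the mask is computed offline, the extra loss term costs nothing at inference time, which matches the on-device efficiency theme of the episode.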

The conversation also critiques existing benchmarks, noting that datasets such as ScienceQA or AI2D allow language‑only shortcuts, while newer vision‑centric suites like CVBench demand true visual grounding. By emphasizing spatial correspondence and object‑centric queries, these benchmarks expose the vision‑ignoring behavior and drive research toward robust visual understanding. Hayat predicts that as Qualcomm scales these techniques, we’ll see more reliable VLMs on edge devices, better physics‑aware image synthesis, and fewer hallucinations—advancing both practical applications and fundamental AI research.
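A common diagnostic for such shortcuts, generic rather than tied to any of the suites named above, is to re-run evaluation with the image replaced by a blank and compare accuracies; `predict` below is a placeholder for whatever VLM inference call you have.

```python
from typing import Callable, Iterable, Tuple
from PIL import Image

def shortcut_gap(predict: Callable[[str, Image.Image], str],
                 dataset: Iterable[Tuple[str, Image.Image, str]]) -> Tuple[float, float]:
    """Accuracy with the real image vs. with a blank image.

    A small gap between the two numbers means the benchmark is largely
    answerable from language priors alone, the failure mode discussed here.
    """
    correct_real = correct_blind = total = 0
    for question, image, answer in dataset:
        blank = Image.new("RGB", image.size, (128, 128, 128))
        correct_real += predict(question, image).strip() == answer
        correct_blind += predict(question, blank).strip() == answer
        total += 1
    return correct_real / total, correct_blind / total
```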

Episode Description

In this episode, we’re joined by Munawar Hayat, researcher at Qualcomm AI Research, to discuss a series of papers presented at NeurIPS 2025 focusing on multimodal and generative AI. We dive into the persistent challenge of object hallucination in Vision-Language Models (VLMs), why models often discard visual information in favor of pre-trained language priors, and how his team used attention-guided alignment to enforce better visual grounding. We also explore a novel approach to generalized contrastive learning designed to solve complex, composed retrieval tasks—such as searching via combined text and image queries—without increasing inference costs. Finally, we cover the difficulties generative models face when rendering multiple human subjects, and the new "MultiHuman Testbench" his team created to measure and mitigate issues like identity leakage and attribute blending. Throughout the discussion, we examine how these innovations align with the need for efficient, on-device AI deployment.
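As background on the composed-retrieval task mentioned above, here is a bare-bones sketch of the training signal, not the generalized contrastive method from the paper: a query image embedding is fused with a modifying-text embedding and trained with InfoNCE to rank the correct target image first. The additive fusion is an assumption for illustration; real systems learn this fusion.

```python
import torch
import torch.nn.functional as F

def compose(img_q: torch.Tensor, txt_q: torch.Tensor) -> torch.Tensor:
    # Naive late fusion of query image and modifying text embeddings.
    return F.normalize(F.normalize(img_q, dim=-1) + F.normalize(txt_q, dim=-1), dim=-1)

def info_nce(queries: torch.Tensor, targets: torch.Tensor, temperature: float = 0.07):
    # Each composed query should rank its own target image above every
    # other target in the batch.
    logits = queries @ F.normalize(targets, dim=-1).T / temperature
    return F.cross_entropy(logits, torch.arange(len(queries)))

# Toy batch: 8 (image, text) queries and 8 target images, 512-d embeddings.
img_q, txt_q, targets = (torch.randn(8, 512) for _ in range(3))
loss = info_nce(compose(img_q, txt_q), targets)
```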

The complete show notes for this episode can be found at https://twimlai.com/go/758.
