Over‑reliance on text makes VLMs prone to hallucinations, jeopardizing safety and accuracy in sectors such as healthcare and autonomous robotics.
The video highlights a growing concern in the field of vision‑language models (VLMs): they tend to lean heavily on textual cues at the expense of visual grounding, leading to what researchers call "text‑driven hallucinations." Leticia, a recent PhD graduate specializing in VLMs, explains that the models can misinterpret straightforward visual queries simply because the phrasing of a question aligns with more common textual patterns seen during training.
Key data points from her research show that when asked how many cats appear in an image containing five felines, many state‑of‑the‑art VLMs answer "two"—the most frequent numeral in the training set for similar prompts. Attempts to correct this bias by injecting a broader range of numerals into the training corpus have limited scalability; while adding examples for one, five, or six cats is feasible, covering extreme cases like 123 cats quickly becomes impractical.
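To make this failure mode concrete, a counting probe along the following lines can be run against any VLM. This is a minimal sketch, not the speaker's evaluation setup: `query_vlm`, the file names, and the stubbed answer are hypothetical placeholders standing in for a real model call.

```python
from collections import Counter

WORD_TO_INT = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}

def query_vlm(image_path: str, question: str) -> str:
    """Hypothetical VLM wrapper -- replace with a real model call."""
    return "two"  # placeholder echoing the bias described in the video

# Small probe set of (image, true cat count); file names are illustrative.
probe_set = [
    ("one_cat.jpg", 1),
    ("two_cats.jpg", 2),
    ("five_cats.jpg", 5),
    ("six_cats.jpg", 6),
]

answers = Counter()
correct = 0
for image, true_count in probe_set:
    raw = query_vlm(image, "How many cats are in this image?").strip().lower()
    answers[raw] += 1
    correct += WORD_TO_INT.get(raw, -1) == true_count

print(f"accuracy: {correct}/{len(probe_set)}")
print("answer distribution:", dict(answers))
# A distribution that collapses onto "two" regardless of the image is the
# text-prior failure described above.
```

If the answer distribution barely moves as the images change, the model is reading the question, not the picture.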
In her dissertation, Leticia introduced metrics such as MM-SHAP and CC-SHAP to quantify the relative contributions of the textual versus visual streams. The findings consistently reveal that textual tokens dominate the model's decision-making, often outweighing image patches by a large margin. This imbalance is especially problematic for high-stakes domains like medical imaging and robotics, where reliable visual grounding is essential.
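The core idea behind an MM-SHAP-style score is to measure each modality's share of the total absolute Shapley-attribution mass for a prediction. The sketch below illustrates that ratio; the attribution arrays are made-up numbers, not results from the dissertation, and in practice they would come from running a Shapley-value explainer over the model's text tokens and image patches.

```python
def modality_shares(text_attr, image_attr):
    """Return each modality's share of total |Shapley| attribution mass."""
    text_mass = sum(abs(v) for v in text_attr)
    image_mass = sum(abs(v) for v in image_attr)
    total = text_mass + image_mass
    if total == 0:  # degenerate case: no attribution anywhere
        return 0.5, 0.5
    return text_mass / total, image_mass / total

# Per-token attributions for one prediction (illustrative values only).
text_attributions = [0.42, -0.31, 0.18, 0.25]   # one value per text token
image_attributions = [0.03, -0.02, 0.05, 0.01]  # one value per image patch

t_share, v_share = modality_shares(text_attributions, image_attributions)
print(f"text share: {t_share:.2f}, visual share: {v_share:.2f}")
# A heavy skew toward the text share (as with these numbers) indicates the
# textual stream is doing most of the work, i.e. weak visual grounding.
```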
The broader implication is clear: practitioners must rigorously test VLMs for grounding before deployment, rather than assuming inherent reliability. Without explicit verification, reliance on these models could propagate errors in critical applications, undermining trust and potentially causing costly or dangerous outcomes.
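As a concrete starting point for such verification, a minimal contrast test checks that the answer actually tracks the image rather than the question. This is a sketch of the general idea, reusing the hypothetical `query_vlm` wrapper from the earlier sketch; it is not a protocol prescribed in the video.

```python
def grounding_check(query_vlm, image_a, image_b, question,
                    expected_a, expected_b):
    """Ask the same question about two images that differ in the queried
    attribute; pass only if the answer changes correctly with the image."""
    ok_a = query_vlm(image_a, question).strip().lower() == expected_a
    ok_b = query_vlm(image_b, question).strip().lower() == expected_b
    return ok_a and ok_b

# Example usage (file names and expected answers are illustrative):
# passed = grounding_check(query_vlm,
#                          "two_cats.jpg", "five_cats.jpg",
#                          "How many cats are in this image?",
#                          "two", "five")
```

A text-prior model gives the same answer for both images and fails the check, which is exactly the kind of cheap pre-deployment gate the video argues for.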