Over‑reliance on text makes VLMs prone to hallucinations, jeopardizing safety and accuracy in sectors such as healthcare and autonomous robotics.
The video highlights a growing concern in the field of vision‑language models (VLMs): they tend to lean heavily on textual cues at the expense of visual grounding, leading to what researchers call "text‑driven hallucinations." Leticia, a recent PhD graduate specializing in VLMs, explains that the models can misinterpret straightforward visual queries simply because the phrasing of a question aligns with more common textual patterns seen during training.
Key data points from her research show that when asked how many cats appear in an image containing five felines, many state‑of‑the‑art VLMs answer "two"—the most frequent numeral in the training set for similar prompts. Attempts to correct this bias by injecting a broader range of numerals into the training corpus have limited scalability; while adding examples for one, five, or six cats is feasible, covering extreme cases like 123 cats quickly becomes impractical.
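To make this failure mode concrete, a counting probe along the following lines can be run against any VLM. This is a minimal sketch, not the speaker's evaluation setup: `query_vlm`, the file names, and the stubbed answer are hypothetical placeholders standing in for a real model call.

```python
from collections import Counter

WORD_TO_INT = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}

def query_vlm(image_path: str, question: str) -> str:
    """Hypothetical VLM wrapper -- replace with a real model call."""
    return "two"  # placeholder echoing the bias described in the video

# Small probe set of (image, true cat count); file names are illustrative.
probe_set = [
    ("one_cat.jpg", 1),
    ("two_cats.jpg", 2),
    ("five_cats.jpg", 5),
    ("six_cats.jpg", 6),
]

answers = Counter()
correct = 0
for image, true_count in probe_set:
    raw = query_vlm(image, "How many cats are in this image?").strip().lower()
    answers[raw] += 1
    correct += WORD_TO_INT.get(raw, -1) == true_count

print(f"accuracy: {correct}/{len(probe_set)}")
print("answer distribution:", dict(answers))
# A distribution that collapses onto "two" regardless of the image is the
# text-prior failure described above.
```

If the answer distribution barely moves as the images change, the model is reading the question, not the picture.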
In her dissertation, Leticia introduced metrics such as MM-SHAP and CC-SHAP to quantify the relative contributions of the textual versus visual streams. The findings consistently reveal that textual tokens dominate the model's decision-making, often outweighing image patches by a large margin. This imbalance is especially problematic for high-stakes domains like medical imaging and robotics, where reliable visual grounding is essential.
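The core idea behind an MM-SHAP-style score is to measure each modality's share of the total absolute Shapley-attribution mass for a prediction. The sketch below illustrates that ratio; the attribution arrays are made-up numbers, not results from the dissertation, and in practice they would come from running a Shapley-value explainer over the model's text tokens and image patches.

```python
def modality_shares(text_attr, image_attr):
    """Return each modality's share of total |Shapley| attribution mass."""
    text_mass = sum(abs(v) for v in text_attr)
    image_mass = sum(abs(v) for v in image_attr)
    total = text_mass + image_mass
    if total == 0:  # degenerate case: no attribution anywhere
        return 0.5, 0.5
    return text_mass / total, image_mass / total

# Per-token attributions for one prediction (illustrative values only).
text_attributions = [0.42, -0.31, 0.18, 0.25]   # one value per text token
image_attributions = [0.03, -0.02, 0.05, 0.01]  # one value per image patch

t_share, v_share = modality_shares(text_attributions, image_attributions)
print(f"text share: {t_share:.2f}, visual share: {v_share:.2f}")
# A heavy skew toward the text share (as with these numbers) indicates the
# textual stream is doing most of the work, i.e. weak visual grounding.
```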
The broader implication is clear: practitioners must rigorously test VLMs for grounding before deployment, rather than assuming inherent reliability. Without explicit verification, reliance on these models could propagate errors in critical applications, undermining trust and potentially causing costly or dangerous outcomes.
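As a concrete starting point for such verification, a minimal contrast test checks that the answer actually tracks the image rather than the question. This is a sketch of the general idea, reusing the hypothetical `query_vlm` wrapper from the earlier sketch; it is not a protocol prescribed in the video.

```python
def grounding_check(query_vlm, image_a, image_b, question,
                    expected_a, expected_b):
    """Ask the same question about two images that differ in the queried
    attribute; pass only if the answer changes correctly with the image."""
    ok_a = query_vlm(image_a, question).strip().lower() == expected_a
    ok_b = query_vlm(image_b, question).strip().lower() == expected_b
    return ok_a and ok_b

# Example usage (file names and expected answers are illustrative):
# passed = grounding_check(query_vlm,
#                          "two_cats.jpg", "five_cats.jpg",
#                          "How many cats are in this image?",
#                          "two", "five")
```

A text-prior model gives the same answer for both images and fails the check, which is exactly the kind of cheap pre-deployment gate the video argues for.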