Your LLM Issues Are Really Data Issues

Stack Overflow Podcast

Apr 28, 2026

Why It Matters

Understanding that AI’s limitations often stem from messy, undocumented data helps organizations prioritize metadata governance and data cataloging before deploying LLMs. As more businesses rely on AI for analytics and decision‑making, solving these data issues is essential to avoid costly misinterpretations and to unlock the true value of AI in production environments.

Key Takeaways

  • LLMs struggle with structured, evolving production data semantics.
  • Metadata discovery and ownership are critical for data reliability.
  • Uber’s data issues highlight universal challenges across company sizes.
  • OpenMetadata creates standardized schemas linking lineage, quality, governance.
  • Human-defined semantics must be explicit before AI can interpret.

Pulse Analysis

The episode opens with Harsha Chintalapani explaining why large language models falter when faced with real‑time, structured production data. He cites Uber’s "trips" table problem, where analysts queried stale or duplicate tables, along with the chaos that follows schema changes, ambiguous definitions of core concepts like "location," and manual, GDPR‑driven data classification. These examples illustrate that the bottleneck is not model capability but the underlying data ecosystem (its semantics, freshness, and discoverability), especially when organizations attempt to feed raw enterprise data directly into LLMs.

To address these pain points, Harsha describes the OpenMetadata platform, a schema‑first knowledge graph that automatically ingests metadata from sources such as Hive, Snowflake, Kafka, and Airflow. By standardizing table definitions, ownership, data‑quality signals, and lineage relationships, the system turns opaque data lakes into searchable catalogs. Analysts can instantly locate the correct customer table, verify freshness, and understand privacy classifications without manual hunting. This metadata‑centric approach not only accelerates business‑intelligence workflows but also prepares data for AI consumption, ensuring that models receive well‑documented, trustworthy inputs.
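To make the catalog idea concrete, here is a minimal sketch of a metadata record that carries a definition, an owner, a freshness timestamp, and upstream lineage. It is an illustration only, not OpenMetadata's actual schema or API: the TableEntry and Catalog classes, the table names, and the owners are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class TableEntry:
    """One catalog record (hypothetical shape, not OpenMetadata's schema)."""
    name: str
    description: str
    owner: str
    last_updated: datetime
    tags: set[str] = field(default_factory=set)
    upstream: list[str] = field(default_factory=list)  # direct parent tables


class Catalog:
    """In-memory catalog supporting search, freshness checks, and lineage."""

    def __init__(self) -> None:
        self._tables: dict[str, TableEntry] = {}

    def register(self, entry: TableEntry) -> None:
        self._tables[entry.name] = entry

    def search(self, keyword: str) -> list[TableEntry]:
        # Case-insensitive match on name or description, sorted by name.
        kw = keyword.lower()
        hits = [
            t for t in self._tables.values()
            if kw in t.name.lower() or kw in t.description.lower()
        ]
        return sorted(hits, key=lambda t: t.name)

    def is_fresh(self, name: str, max_age: timedelta, now: datetime) -> bool:
        # A table is fresh if it was updated within max_age of `now`.
        return now - self._tables[name].last_updated <= max_age

    def lineage(self, name: str) -> list[str]:
        # Collect all transitive upstream table names without repeats.
        seen: list[str] = []
        stack = list(self._tables[name].upstream)
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.append(current)
            if current in self._tables:
                stack.extend(self._tables[current].upstream)
        return seen


# Hypothetical example data: one source table, one curated table, one stale copy.
now = datetime(2026, 4, 28, tzinfo=timezone.utc)
catalog = Catalog()
catalog.register(TableEntry(
    "raw.trips", "Raw trip events", "ingest-team", now - timedelta(hours=1)))
catalog.register(TableEntry(
    "analytics.trips_daily", "Daily trip aggregates", "analytics-team",
    now - timedelta(hours=2), upstream=["raw.trips"]))
catalog.register(TableEntry(
    "tmp.trips_copy", "Ad hoc copy of trips", "unknown",
    now - timedelta(days=30), upstream=["raw.trips"]))

candidates = catalog.search("trips")
stale = [t.name for t in candidates
         if not catalog.is_fresh(t.name, timedelta(days=1), now)]
```

With this data, a search for "trips" returns all three tables, and the freshness check flags only tmp.trips_copy as stale, which is the kind of signal an analyst (or an AI agent) would need before choosing a table.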

Finally, the conversation broadens to emphasize that these challenges transcend Uber’s scale. Start‑ups and mid‑size firms face the same ownership ambiguity and semantic drift once they democratize data access. The key takeaway for leaders is to embed explicit, human‑crafted semantics into metadata repositories before deploying AI agents. By doing so, organizations can reduce onboarding time, improve decision‑making speed, and unlock the true potential of LLMs in production environments. The future of data governance lies in automated metadata capture paired with clear, organization‑wide definitions.
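One way to picture "explicit, human-crafted semantics" is a small glossary whose definitions are attached to a question before it reaches a model. The sketch below is an assumption-laden illustration, not a method described in the episode: the GLOSSARY entries, their wording, and the build_context helper are all hypothetical.

```python
import re

# Hypothetical organization-wide definitions, written by humans.
GLOSSARY = {
    "customer": "An account with at least one completed order (hypothetical).",
    "location": "The pickup point of a trip as latitude/longitude (hypothetical).",
}


def build_context(question: str, glossary: dict[str, str]) -> str:
    """Prefix the question with definitions for any glossary terms it uses.

    Matching is exact on whole lowercase words, so plurals such as
    "customers" are not matched in this simplified sketch.
    """
    words = set(re.findall(r"[a-z]+", question.lower()))
    matched = [term for term in sorted(glossary) if term in words]
    if not matched:
        return question
    lines = [f"- {term}: {glossary[term]}" for term in matched]
    return "Definitions:\n" + "\n".join(lines) + "\n\nQuestion: " + question


prompt = build_context("How did each customer location change?", GLOSSARY)
```

The point of the design is that the definitions live in one shared place and travel with every question, so the model is never left to guess what "customer" means.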

Episode Description

Ryan welcomes Harsha Chintalapani, co-founder and CTO at Collate and co-creator of OpenMetadata, to the show to discuss why AI and LLMs struggle with real-time, structured production data. They explore how schema changes, inconsistent definitions (like “customer”), and weak governance can break both your analytics and your ML models, and what companies can do to get their data AI-ready, from metadata management to observability.

Episode Notes: 

Collate is a semantic intelligence platform built on a semantic metadata graph for discovery, governance, and AI observability across your data ecosystem.

Connect with Harsha on LinkedIn. 

Congrats to user buttonsrtoys, who won a Famous Question badge for their question Possible to edit PDF without embedded font installed?.
