Generative AI in the Real World: Chang She on Data Infrastructure for AI

May 14, 2026

O’Reilly Media

Why It Matters

By providing an open, unified platform for multimodal AI data, LanceDB enables enterprises to cut pipeline complexity, accelerate model development, and unlock value from previously untapped media assets.

Key Takeaways

•Traditional tools such as pandas and Parquet struggle with multimodal AI data.
•LanceDB offers a lakehouse format optimized for embeddings and large media.
•Vector databases focus on retrieval, lacking end‑to‑end data pipeline support.
•The Lance format integrates with Spark, Arrow, and DuckDB, so switching from Parquet can be a single‑line code change.
•Multimodal lakehouse unifies batch, online, and GPU‑intensive AI workloads.

Summary

The podcast spotlights the growing gap between legacy analytics stacks and the data demands of generative AI. Chang She, CEO of LanceDB, explains how his experience building embeddings at Tubi TV revealed that tools such as pandas, Spark, and Parquet struggle with multimodal assets like video, audio, and image embeddings, which prompted him to build a new data infrastructure.

Key insights include the limitations of conventional vector databases, which assume pre‑computed embeddings and handle only narrow retrieval tasks, and the need for an end‑to‑end platform that manages data ingestion, metadata, indexing, and GPU‑ready serving. LanceDB’s open‑source Lance format acts as a “Parquet for AI,” offering smaller file sizes, efficient random access, and a table‑format layer that supports versioning, branching, and multimodal indexing.

Chang cites real‑world examples: enterprises facing “trillion is the new billion” data volumes that move to a unified lakehouse by changing a single line of code from Parquet to Lance, and that use the format to feed embeddings directly into distributed training pipelines. He also notes that traditional firms in insurance, finance, and other sectors already hold massive multimodal archives that AI can now unlock.

The broader implication is a shift toward a multimodal lakehouse that consolidates batch analytics, online serving, and GPU‑intensive workloads under one open architecture. This promises reduced data‑movement costs, faster model iteration, and a scalable foundation for enterprises eager to monetize their unstructured data assets.

Original Description

As a pandas core contributor and early Parquet adopter who built AI data pipelines at streaming company Tubi TV, Chang She saw firsthand why the traditional data stack breaks down for AI workloads—and founded LanceDB to fix it. Chang joined Ben Lorica to explain why vector databases are too narrow a solution for modern AI data needs, and what a true multimodal data infrastructure actually looks like. Chang and Ben get into why the Lance file format is quickly becoming the open source standard for multimodal data, how the rise of agents is exploding data infrastructure demands, why open-weight models are the enterprise cost shift to watch in the next 12 months, and more. "Trillion is the new billion," Chang says, and the enterprises that set up their data infrastructure now for that scale will be the ones that succeed.

Follow O'Reilly on:

LinkedIn: https://www.linkedin.com/company/oreilly/

Facebook: http://facebook.com/OReilly

Instagram: https://www.instagram.com/oreillymedia

BlueSky: https://bsky.app/profile/oreilly.bsky.social
