
The episode explains how Apache Parquet’s hybrid columnar‑row format optimizes storage and query performance for large datasets. It contrasts row‑wise and pure columnar layouts, highlighting the inefficiencies of each, and then describes Parquet’s structure of row groups, column chunks, and pages, along with its self‑describing metadata that enables column pruning and efficient reads. The host also notes Parquet’s origins at Twitter and Cloudera and points listeners to a newsletter for deeper data‑engineering content.
In this episode, Dr. Bob Jarvie, Associate CMIO and Medical Director for Population Health Analytics at Corewell Health, explains why the health system built its own internal population health data platform instead of relying on external vendors. He highlights the...

In this episode, ThoughtSpot CEO Ketan Karkhanis discusses how AI agents are reshaping data analytics, turning self‑service BI from a long‑standing promise into a reality. He showcases ThoughtSpot’s agents—Spotter, Spotter Model, and SpotterWiz—that can answer business questions, automate data engineering...

In this episode, Camille Bank reveals how mid‑size companies are paying upwards of $800K annually for data stacks that solve far smaller problems, exposing hidden costs in Snowflake compute, connector services like Fivetran, BI tools, and the salaries of multiple...

The episode traces the evolution from Google’s MapReduce model to Apache Spark, explaining how Spark’s in‑memory processing and the Resilient Distributed Dataset (RDD) abstraction overcome MapReduce’s limitations for iterative and interactive workloads. It breaks down Spark’s core concepts—transformations vs. actions,...

In this episode of the Dashboard Effect podcast, hosts Brick Thompson and Landon Oaks explore why the most valuable dashboards are often the simplest in appearance, yet the most complex to build behind the scenes. They share real‑world examples—including a...

In this episode, host Dan Beach chats with data engineering veteran Daniel Aronovich about his 15‑year journey from MATLAB‑based signal processing at Intel to Python, Spark, and his current startup, True Data Flynn. Daniel explains how he transitioned from data...
In this episode, Aravind Suresh, head of OpenAI's real‑time infrastructure team, explains how the company built a highly reliable, scalable streaming backbone for products like ChatGPT using Kafka and Flink. He describes the challenges of scaling a streaming platform tenfold...

In this episode, Danielle Crop, EVP of Digital Strategy & Alliances at WNS, discusses the rapid rise of AI agents in enterprises, emphasizing the need to evaluate whether they deliver real value and operate securely. She advocates a balanced mindset...

In this episode, Dan Beach chats with State Farm staff engineer Matt Martin about his journey from industrial engineering to data engineering, his deep involvement with DuckDB, and the evolving landscape of data platforms. Matt shares how early automation with...
In this episode, Tim talks with Gunnar Morling, a principal technologist at Confluent and a key contributor to projects like Hibernate and Debezium, about his "One Billion Row Challenge"—a viral coding contest he launched for the Java community in January...

In this re‑aired episode, hosts Eric Dotz and John Wessel chat with regular guest Matt, the Cynical Data Guy, about the rise of low‑code data tools like Clay and the evolving role of the "GTM engineer." They debate whether such...

In this episode, Anders Swanson, a developer experience advocate at dbt Labs, walks through the current state of the Apache Iceberg ecosystem, covering how open‑source and cloud vendors are converging on shared standards, the rise of external catalog integrations, and...

In this episode the hosts explore whether a true single source of truth (SSOT) for construction project data is achievable or merely aspirational. NuFORMA’s Dave Wagner and Carl Beillette argue that a single vendor solution is unrealistic; instead, the goal...

In this episode, Luke Flemmer, head of private assets at MSCI, explains how standardizing and normalizing data can unlock transparency, price formation, and liquidity in private markets, drawing parallels to past evolutions in bonds, FX, and equities. He argues that...