When "Garbage In, Garbage Out" Gets It Wrong
In this episode, Terence Lee St. John, founder of Enly and lead author of the paper "From Garbage to Gold: A Data Architectural Theory of Predictive Robustness," explains why machine‑learning models can achieve state‑of‑the‑art performance even when trained on noisy, error‑filled data. He argues that traditional "garbage in, garbage out" thinking overlooks two distinct sources of noise: observational error and structural uncertainty arising from variables that are merely proxies for latent drivers. By expanding the predictor set rather than over‑cleaning a limited set, models can triangulate these latent factors and achieve robust predictions, an insight with major implications for regulated fields like healthcare and finance. The discussion also touches on how theoretical grounding can guide data collection, model design, and user‑experience strategies for gaining stakeholder trust.

Re-Air: The Rise of the Citizen Developer: Solving Business Problems with Alteryx and AI with Andy Macmillan
In this re‑aired episode, Alteryx CEO Andy Macmillan discusses the evolution of the citizen developer—business users with enough technical skill to build data solutions—and how AI is reshaping that role. He explains Alteryx’s mission to democratize data preparation and analytics,...

Snap’s Secret to Processing 10 Petabytes a Day: GPU-Accelerated Spark | NVIDIA AI Podcast Ep. 298
Snap’s engineering platform head, Prudvi Vatala, explains how the company slashed data‑processing costs by 76% and reduced core usage by 62% by migrating its 10‑petabyte‑per‑day experimentation pipeline to GPU‑accelerated Spark using NVIDIA Spark RAPIDS on Google Cloud. The move delivered...

Your LLM Issues Are Really Data Issues
In this episode, Ryan Donovan talks with Harsha Chintalapani, co‑founder and CTO of Collate, about why the biggest challenges facing LLMs in production are actually data problems. Harsha explains how issues like schema drift, ambiguous business definitions, data discovery, lineage,...

Perceptron Network – A Thousand Eyes, One Vision for Decentralized AI Data
In this episode, Andy Pickering talks with Peter Anthony, co‑founder of Perceptron, about the company’s decentralized data infrastructure that taps idle user bandwidth to collect real‑time, geographically diverse web data for AI training. Peter explains how the "thousand eyes, one...

Building Banking Systems with Kafka Streams with Mateo Rojas | Ep. 28
In this episode, Mateo Rojas recounts his early‑day experiences building a policy‑management platform for a banking‑type application using Kafka Streams when the technology was still nascent. He describes the challenges of orchestrating multiple microservices via stream joins, handling windowing limits,...

Scaling Regulated Data Workflows Without Lock‑In - with Juan Orlandini of Insight
In this episode, Juan Orlandini, CTO of North America at Insight, explains how finance leaders can modernize chaotic, regulated data environments by integrating AI thoughtfully rather than layering it on outdated systems. He stresses that generative AI excels at pattern...

Postgres Can Be Your Data Lake (Pg_lake)
In this episode Marco introduces PgLake, an extension that lets PostgreSQL query and manage data lakes stored as Iceberg tables in object storage. By delegating analytical queries to DuckDB’s vectorized engine, PgLake can achieve up to 100× faster performance than...

#354 Beyond BI: Decision Intelligence with Graphs with Jamie Hutton, CTO at Quantexa
In this episode, CTO Jamie Hutton of Quantexa explains how decision intelligence extends beyond traditional business intelligence by using graph‑based context and entity resolution to create a single, trustworthy view of people, companies, and relationships. He details how Quantexa’s platform...

Parquet Fundamentals in 3 Mins
The episode explains how Apache Parquet’s hybrid columnar‑row format optimizes storage and query performance for large datasets. It contrasts row‑wise and pure columnar layouts, highlighting the inefficiencies of each, and then describes Parquet’s structure of row groups, column chunks, and...

Corewell Health’s Jarve Says Population Health Data Challenges Demand Internal Builds
In this episode, Dr. Bob Jarvie, Associate CMIO and Medical Director for Population Health Analytics at Corewell Health, explains why the health system built its own internal population health data platform instead of relying on external vendors. He highlights the...

#353 The Data Team's Agentic Future with Ketan Karkhanis, CEO at ThoughtSpot
In this episode, ThoughtSpot CEO Ketan Karkhanis discusses how AI agents are reshaping data analytics, turning self‑service BI from a long‑standing promise into a reality. He showcases ThoughtSpot’s agents—Spotter, Spotter Model, and SpotterWiz—that can answer business questions, automate data engineering...

Your Data Vendor Is Charging You $800K to Solve a $100K Problem
In this episode, Camille Bank reveals how mid‑size companies are paying upwards of $800K annually for data stacks that solve far smaller problems, exposing hidden costs in Snowflake compute, connector services like Fivetran, BI tools, and the salaries of multiple...

(Video) What Is Apache Spark?
The episode traces the evolution from Google’s MapReduce model to Apache Spark, explaining how Spark’s in‑memory processing and the Resilient Distributed Dataset (RDD) abstraction overcome MapReduce’s limitations for iterative and interactive workloads. It breaks down Spark’s core concepts—transformations vs. actions,...

The Hidden Complexity Behind Simple Dashboards
In this episode of the Dashboard Effect podcast, hosts Brick Thompson and Landon Oaks explore why the most valuable dashboards are often the simplest in appearance, yet the most complex to build behind the scenes. They share real‑world examples—including a...