Big Data Podcasts: Trending Episodes & Updates

Podcast•May 21, 2026•43 min

When "Garbage In, Garbage Out" Gets It Wrong

In this episode, Terence Lee St. John, founder of Enly and lead author of the paper "From Garbage to Gold: A Data Architectural Theory of Predictive Robustness," explains why machine‑learning models can achieve state‑of‑the‑art performance even when trained on noisy, error‑filled data. He argues that traditional "garbage in, garbage out" thinking overlooks two distinct sources of noise: observational error and structural uncertainty arising from variables that are merely proxies for latent drivers. By expanding the predictor set rather than over‑cleaning a limited set, models can triangulate these latent factors and achieve robust predictions, an insight with major implications for regulated fields like healthcare and finance. The discussion also touches on how theoretical grounding can guide data collection, model design, and user‑experience strategies for gaining stakeholder trust.

By The Data Exchange
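
The core intuition in the episode above (many noisy proxies can recover a latent driver better than a few cleaner ones) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the method from the paper: the latent variable, the independent Gaussian noise on each proxy, the noise levels, the simple averaging estimator, and the function names `pearson` and `proxy_average_correlation` are all hypothetical choices made for illustration. It models observational noise only and does not attempt to capture the structural uncertainty the episode also discusses.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def proxy_average_correlation(n_proxies, noise_sd, n_samples=2000, seed=0):
    """Correlate a latent driver with the average of its noisy proxies.

    Assumption: each proxy equals the latent value plus independent
    Gaussian noise with standard deviation `noise_sd`.
    """
    rng = random.Random(seed)
    latent, estimate = [], []
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)  # unobserved latent driver
        proxies = [z + rng.gauss(0.0, noise_sd) for _ in range(n_proxies)]
        latent.append(z)
        estimate.append(statistics.fmean(proxies))  # naive "triangulation"
    return pearson(latent, estimate)


# A small, relatively clean predictor set versus a large, noisier one.
few_clean = proxy_average_correlation(n_proxies=2, noise_sd=1.0)
many_noisy = proxy_average_correlation(n_proxies=50, noise_sd=2.0)

print(f"2 proxies, noise sd 1.0:  r = {few_clean:.3f}")
print(f"50 proxies, noise sd 2.0: r = {many_noisy:.3f}")
```

Under these assumptions the averaged noise variance shrinks as the number of proxies grows, so the larger and noisier set tracks the latent driver more closely. Real predictors are rarely independent, so the gain in practice may be smaller.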

Podcast•May 20, 2026•50 min

Re-Air: The Rise of the Citizen Developer: Solving Business Problems with Alteryx and AI with Andy Macmillan

In this re‑aired episode, Alteryx CEO Andy Macmillan discusses the evolution of the citizen developer—business users with enough technical skill to build data solutions—and how AI is reshaping that role. He explains Alteryx’s mission to democratize data preparation and analytics,...

By The Data Stack Show

Podcast•May 13, 2026•23 min

Snap’s Secret to Processing 10 Petabytes a Day: GPU-Accelerated Spark | NVIDIA AI Podcast Ep. 298

Snap’s engineering platform head, Prudvi Vatala, explains how the company slashed data‑processing costs by 76% and reduced core usage by 62% by migrating its 10‑petabyte‑per‑day experimentation pipeline to GPU‑accelerated Spark using NVIDIA Spark RAPIDS on Google Cloud. The move delivered...

By The AI Podcast (NVIDIA)

Podcast•Apr 28, 2026•31 min

Your LLM Issues Are Really Data Issues

In this episode, Ryan Donovan talks with Harsha Chintalapani, co‑founder and CTO of Collate, about why the biggest challenges facing LLMs in production are actually data problems. Harsha explains how issues like schema drift, ambiguous business definitions, data discovery, lineage,...

By Stack Overflow Podcast

Podcast•Apr 22, 2026•28 min

Perceptron Network – A Thousand Eyes, One Vision for Decentralized AI Data

In this episode, Andy Pickering talks with Peter Anthony, co‑founder of Perceptron, about the company’s decentralized data infrastructure that taps idle user bandwidth to collect real‑time, geographically diverse web data for AI training. Peter explains how the "thousand eyes, one...

By The Crypto Conversation

Podcast•Apr 20, 2026•44 min

Building Banking Systems with Kafka Streams with Mateo Rojas | Ep. 28

In this episode, Mateo Rojas recounts his early experiences building a policy‑management platform for a banking‑type application using Kafka Streams when the technology was still nascent. He describes the challenges of orchestrating multiple microservices via stream joins, handling windowing limits,...

By Streaming Audio (Kafka / Confluent)

Podcast•Apr 17, 2026•22 min

Scaling Regulated Data Workflows Without Lock‑In - with Juan Orlandini of Insight

In this episode, Juan Orlandini, CTO of North America at Insight, explains how finance leaders can modernize chaotic, regulated data environments by integrating AI thoughtfully rather than layering it on outdated systems. He stresses that generative AI excels at pattern...

By The AI in Business Podcast

Podcast•Apr 9, 2026•0 min

Postgres Can Be Your Data Lake (Pg_lake)

In this episode, Marco introduces PgLake, an extension that lets PostgreSQL query and manage data lakes stored as Iceberg tables in object storage. By delegating analytical queries to DuckDB’s vectorized engine, PgLake can achieve up to 100× faster performance than...

By Stanislav’s Big Data Stream (Substack)

Podcast•Apr 6, 2026•46 min

#354 Beyond BI: Decision Intelligence with Graphs with Jamie Hutton, CTO at Quantexa

In this episode, CTO Jamie Hutton of Quantexa explains how decision intelligence extends beyond traditional business intelligence by using graph‑based context and entity resolution to create a single, trustworthy view of people, companies, and relationships. He details how Quantexa’s platform...

By DataFramed

Podcast•Apr 3, 2026•0 min

Parquet Fundamentals in 3 Mins

The episode explains how Apache Parquet’s hybrid columnar‑row format optimizes storage and query performance for large datasets. It contrasts row‑wise and pure columnar layouts, highlighting the inefficiencies of each, and then describes Parquet’s structure of row groups, column chunks, and...

By VuTrinh (Substack)

Podcast•Apr 1, 2026•37 min

Corewell Health’s Jarvie Says Population Health Data Challenges Demand Internal Builds

In this episode, Dr. Bob Jarvie, Associate CMIO and Medical Director for Population Health Analytics at Corewell Health, explains why the health system built its own internal population health data platform instead of relying on external vendors. He highlights the...

By healthsystemCIO

Podcast•Mar 30, 2026•49 min

#353 The Data Team's Agentic Future with Ketan Karkhanis, CEO at ThoughtSpot

In this episode, ThoughtSpot CEO Ketan Karkhanis discusses how AI agents are reshaping data analytics, turning self‑service BI from a long‑standing promise into a reality. He showcases ThoughtSpot’s agents—Spotter, Spotter Model, and SpotterWiz—that can answer business questions, automate data engineering...

By DataFramed

Podcast•Mar 28, 2026•0 min

Your Data Vendor Is Charging You $800K to Solve a $100K Problem

In this episode, Camille Bank reveals how mid‑size companies are paying upwards of $800K annually for data stacks that solve far smaller problems, exposing hidden costs in Snowflake compute, connector services like Fivetran, BI tools, and the salaries of multiple...

By AI Adopters Club

Podcast•Mar 26, 2026•0 min

(Video) What Is Apache Spark?

The episode traces the evolution from Google’s MapReduce model to Apache Spark, explaining how Spark’s in‑memory processing and the Resilient Distributed Dataset (RDD) abstraction overcome MapReduce’s limitations for iterative and interactive workloads. It breaks down Spark’s core concepts—transformations vs. actions,...
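
The transformations versus actions distinction mentioned in the Spark episode above can be sketched in plain Python. This is a toy illustration, not Spark's implementation or API: the `LazyDataset` class is hypothetical, it runs on a single machine, and it only borrows the method names `map`, `filter`, `collect`, and `count` to show the idea that transformations record work lazily while actions trigger it.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # recorded transformations, not yet run

    # Transformations: return a new dataset with one more recorded step.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    # Actions: run every recorded step and produce a concrete result.
    def collect(self):
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):  # a filter step rejected the item
                    keep = False
                    break
            if keep:
                out.append(item)
        return out

    def count(self):
        return len(self.collect())


calls = []


def square(x):
    calls.append(x)  # record each invocation to show when work happens
    return x * x


ds = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)  # nothing has run yet

result = ds.collect()  # the action triggers the recorded steps
print(calls_before_action, result, len(calls))
```

Building the pipeline does no work; `square` is only called once `collect` runs. In this toy version each action recomputes the pipeline from the source, which is one reason a real engine offers ways to keep intermediate results in memory.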