As AI models grow larger, the infrastructure required to train them becomes a critical bottleneck for businesses. Understanding how to design and manage high-cost GPU clusters helps organizations deliver AI reliably and affordably, which makes this episode relevant for engineers and decision-makers navigating the AI boom.
The episode opens with a striking fact: NVIDIA’s GB200 GPU rack carries a $3 million price tag, yet extracting that value hinges on how tightly its GPUs are packed together. Kubernetes’ default scheduler and other traditional schedulers weren’t built for this level of contiguity, so platform teams are now chasing topology-aware scheduling to keep latency low and bandwidth high. Projects in the CNCF ecosystem and classic HPC tools such as SLURM are being retrofitted to understand GPU proximity, turning a hardware cost into a performance advantage for large-scale AI workloads.
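To make the topology idea concrete, here is a minimal sketch (not from the episode) of greedy, topology-aware placement. The node names and the rack/NVLink-domain labels are invented for illustration rather than a real Kubernetes or SLURM convention; the point is only that the scheduler prefers placements that keep a job inside a single high-bandwidth domain.

```python
# Toy sketch of topology-aware placement: given nodes labeled with a
# (hypothetical) rack / NVLink-domain identifier, pick hosts for a GPU job
# so that all GPUs land in one high-bandwidth domain.
from collections import defaultdict

# node name -> (topology domain label, free GPUs); labels are illustrative only
nodes = {
    "node-a1": ("rack-1/nvl-domain-0", 4),
    "node-a2": ("rack-1/nvl-domain-0", 4),
    "node-b1": ("rack-1/nvl-domain-1", 8),
    "node-c1": ("rack-2/nvl-domain-0", 2),
}

def place(gpus_needed: int) -> list[str] | None:
    """Greedy placement: prefer the domain that satisfies the request with the
    fewest nodes, so collective traffic stays on the local interconnect."""
    by_domain = defaultdict(list)
    for name, (domain, free) in nodes.items():
        by_domain[domain].append((free, name))

    best = None
    for domain, members in by_domain.items():
        members.sort(reverse=True)              # biggest nodes first
        chosen, remaining = [], gpus_needed
        for free, name in members:
            if remaining <= 0:
                break
            chosen.append(name)
            remaining -= free
        if remaining <= 0 and (best is None or len(chosen) < len(best)):
            best = chosen
    return best

print(place(8))  # -> ['node-b1']: one node, one NVLink domain, no cross-rack hops
```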
Beyond placement, the conversation dives into the hidden variability of seemingly identical GPUs. Differences in cooling, power delivery, and even firmware can cause a single rack to exhibit a wide performance spread, jeopardizing critical training jobs. To tame this, engineers are gathering fine-grained telemetry and feeding it into variability-aware schedulers. The hosts highlight emerging chaos-engineering practices for GPUs, such as injecting faults via NVIDIA’s DCGM, simulating noisy neighbors, and testing failover paths, to surface weaknesses before they reach production. Open-source schedulers like Kai and SkyPilot, along with extensions to the Kubernetes API, are gaining traction, but they demand dedicated investment.
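As one concrete illustration of the telemetry-gathering step, the sketch below uses NVIDIA’s NVML Python bindings (pynvml) to sample per-GPU clocks, temperature, and power. The 10% clock-spread threshold and the idea of excluding slow GPUs from tight collectives are illustrative assumptions, not guidance from the episode.

```python
# Minimal telemetry sweep with NVIDIA's NVML Python bindings (pip install nvidia-ml-py).
# A variability-aware scheduler could feed these readings into placement decisions.
import pynvml

pynvml.nvmlInit()
try:
    readings = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        readings.append({
            "gpu": i,
            "sm_clock_mhz": pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM),
            "temp_c": pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU),
            "power_w": pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0,  # NVML reports milliwatts
        })
finally:
    pynvml.nvmlShutdown()

clocks = [r["sm_clock_mhz"] for r in readings]
if clocks:
    spread = (max(clocks) - min(clocks)) / max(clocks)
    print(readings)
    print(f"SM clock spread across 'identical' GPUs: {spread:.1%}")
    if spread > 0.10:  # illustrative threshold for flagging a rack that has drifted apart
        print("warning: high variability; consider excluding slow GPUs from tight collectives")
```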
Finally, the discussion frames these technical shifts as a broader evolution of delivery systems in the AI era. What once qualified as a “small” language model now stretches across multiple $3 million racks, forcing platform engineers to rethink scaling, cost, and reliability. Thoughtworks’ upcoming Technology Radar and Bryan Oliver’s forthcoming O’Reilly book promise deeper guidance on AI-centric scheduling, chaos testing, and monitoring. For businesses eyeing cloud-native AI, the takeaway is clear: mastering GPU topology, variability, and resilience is no longer optional; it’s a prerequisite for competitive, production-grade AI.
Adi Polak talks to Bryan Oliver (Thoughtworks) about his career in platform engineering and large-scale AI infrastructure. Bryan’s first job: building pools and teaching swimming lessons. His challenge: running large-scale GPU data centers while keeping AI workloads predictable and reliable.
SEASON 2
Hosted by Tim Berglund, Adi Polak and Viktor Gamov
Produced and Edited by Noelle Gallagher, Peter Furia and Nurie Mohamed
Music by Coastal Kites
Artwork by Phil Vo
🎧 Subscribe to Confluent Developer wherever you listen to podcasts.
▶️ Subscribe on YouTube, and hit the 🔔 to catch new episodes.
👍 If you enjoyed this, please leave us a rating.
🎧 Confluent also has a podcast for tech leaders: "Life Is But A Stream" hosted by our friend, Joseph Morais.