Why AI Needs a New Kind of Supercomputer Network — the OpenAI Podcast Ep. 18

OpenAI
May 6, 2026

Why It Matters

Without purpose‑built, resilient networking, AI training stalls at scale, inflating costs and slowing innovation; OpenAI’s approach sets a new blueprint for future supercomputer design.

Key Takeaways

  • Traditional internet networking fails for synchronized GPU training workloads.
  • OpenAI built Multipath Reliable Connection (MRC) to minimize worst-case latency.
  • Network failures become frequent at scale; design must be failure‑resilient.
  • Co‑design between researchers and infrastructure teams accelerates model training.
  • New protocols replace statistical multiplexing with deterministic performance and low tail latency.
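
The takeaways about synchronized workloads and frequent failures can be illustrated with a little arithmetic. The sketch below is not from the episode: the function names, the per-link slowdown probability, and the latency figures are all hypothetical, chosen only to show why the slowest participant sets the pace of a synchronized step and why rare per-link problems become near-certain at scale.

```python
import random


def step_time(latencies_ms):
    """A synchronized step finishes only when the slowest participant does."""
    return max(latencies_ms)


def prob_any_slow(num_links, p_slow):
    """Probability that at least one of num_links independent links is slow."""
    return 1.0 - (1.0 - p_slow) ** num_links


def mean_step_time(num_links, p_slow, fast_ms=1.0, slow_ms=50.0,
                   steps=1000, seed=0):
    """Average synchronized step time under a toy two-speed link model.

    All parameters are illustrative assumptions, not measured values.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        latencies = [slow_ms if rng.random() < p_slow else fast_ms
                     for _ in range(num_links)]
        total += step_time(latencies)
    return total / steps


if __name__ == "__main__":
    for n in (1, 100, 10_000):
        print(n, round(prob_any_slow(n, 0.001), 4))
```

Under this toy model the average link stays fast no matter how large the cluster is, but the chance that some link is slow on a given step climbs toward 1 as the link count grows, which is why the worst case rather than the average governs training speed.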

Summary

This episode of the OpenAI Podcast explains why training AI models demands a fundamentally different kind of supercomputer network. As clusters grow to thousands of GPUs, internet-style networking, which relies on statistical multiplexing, can’t keep up with the tightly synchronized, single-task workloads of modern AI training.

OpenAI’s Mark Handley and Greg Steinbrecher explain that the bottleneck is no longer average bandwidth but worst-case latency across millions of optical links. Because the GPUs move in lockstep, a single slow or failed GPU stalls the entire cluster, so network congestion becomes a critical failure point. To address this, OpenAI engineered Multipath Reliable Connection (MRC), a protocol that actively selects paths through the network and routes around component failures without interrupting training.
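
The general idea of spreading traffic over several paths and steering around a failed one can be sketched in a few lines. This is a toy illustration only and not the MRC protocol, whose actual mechanisms are not described in this summary; the `MultipathSender` class, its method names, and the round-robin policy are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class MultipathSender:
    """Toy sender that sprays packets over paths and skips failed ones.

    Hypothetical illustration; not the MRC design.
    """
    paths: list
    failed: set = field(default_factory=set)
    _next: int = 0

    def mark_failed(self, path):
        self.failed.add(path)

    def mark_recovered(self, path):
        self.failed.discard(path)

    def pick_path(self):
        # Round-robin over all paths, skipping any currently marked failed.
        for _ in range(len(self.paths)):
            path = self.paths[self._next % len(self.paths)]
            self._next += 1
            if path not in self.failed:
                return path
        raise RuntimeError("no healthy path available")

    def send(self, packets):
        # Return (packet, path) pairs instead of touching a real network.
        return [(pkt, self.pick_path()) for pkt in packets]


if __name__ == "__main__":
    sender = MultipathSender(["path-a", "path-b", "path-c"])
    print(sender.send([0, 1, 2, 3]))
    sender.mark_failed("path-b")
    print(sender.send([4, 5, 6]))
```

After `mark_failed("path-b")`, later packets are assigned only to the remaining paths, so the transfer continues without waiting for the failed path to come back. A real protocol would also need failure detection, retransmission, and reordering, which this sketch leaves out.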

Handley cites his earlier work on video-conferencing standards that are now used in 4G/5G networks, highlighting the shift from standards that need global consensus to designs specific to the data center. Steinbrecher describes the sheer scale, with thousands of switches and millions of lasers, where even a minor link glitch can crash a training run. Their team stays on call and iterates with researchers to pinpoint pain points and redesign network layers for deterministic performance.

The implications are significant: AI workloads are reshaping data-center architecture, forcing a move away from generic web-scale networks toward purpose-built, failure-aware systems. Companies that co-design hardware, software, and research pipelines will iterate on models faster, lower their costs, and maintain a competitive advantage as AI models continue to grow.

Original Description

Training frontier models isn’t as simple as adding more GPUs—one small problem and the whole coordinated dance falls apart. OpenAI’s Mark Handley and Greg Steinbrecher discuss how a new supercomputer network design, used to train some of the company’s latest models, keeps the whole system moving in lockstep, even with record numbers of GPUs.
They break down Multipath Reliable Connection, a new protocol OpenAI developed with AMD, Broadcom, Intel, Microsoft, and Nvidia, and why they’re making it available for the whole industry to use.

Chapters
00:00 Intro
00:39 Greg and Mark's paths to OpenAI
04:34 Why training AI stresses networks differently
10:05 Bottlenecks, failures, and the cost of waiting
15:19 How Multipath Reliable Connection works
18:59 A protocol to route around failures
25:05 Why OpenAI is making MRC an open standard
35:09 Could AI compute move to space?
