
Venture Capital Pulse



Nvidia CTO Michael Kagan: Scaling Beyond Moore's Law to Million-GPU Clusters

Sequoia Capital • October 28, 2025

Why It Matters

The pivot reshapes the competitive landscape of AI hardware, making networking as critical as silicon and accelerating the rollout of massive AI services worldwide.

Key Takeaways

  • Mellanox buy enabled Nvidia's full-stack AI infrastructure
  • Scaling to 100K GPUs demands advanced networking fabrics
  • Network latency now limits AI training efficiency
  • Inference workloads drive demand for distributed GPU clusters
  • AI viewed as tool to discover new physics

Pulse Analysis

Nvidia’s strategic acquisition of Mellanox in 2019 marked a watershed moment for the company, expanding its portfolio beyond GPUs into high‑performance networking. By integrating Mellanox’s InfiniBand and Ethernet technologies, Nvidia now offers a cohesive hardware stack that addresses both compute and data movement, positioning it as the de facto architect of AI infrastructure. This vertical integration reduces latency bottlenecks and simplifies system design for cloud providers and hyperscale data centers, reinforcing Nvidia’s dominance in the AI market.

Scaling AI workloads to 100,000-plus GPUs introduces unprecedented engineering challenges. Traditional PCIe and Ethernet fabrics cannot sustain the bandwidth and latency requirements of modern deep‑learning models, prompting Nvidia to develop custom silicon‑level networking solutions and software stacks that orchestrate data across thousands of nodes. Kagan highlighted that network performance has become the primary limiter of training speed, eclipsing raw GPU throughput. Achieving million‑GPU clusters will therefore hinge on innovations in topology, congestion control, and fault‑tolerant routing, areas where Nvidia’s Mellanox heritage provides a decisive edge.
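To get intuition for why the network, rather than raw GPU throughput, becomes the limiter at scale, here is a back-of-envelope sketch using the standard ring all-reduce cost model. All the numbers (link speed, per-hop latency, gradient size) are hypothetical assumptions for illustration, not Nvidia specifications, and real clusters use more sophisticated topologies and collectives than a simple ring.

```python
def ring_allreduce_seconds(grad_bytes: float, n_gpus: int,
                           link_gbps: float, latency_s: float = 5e-6) -> float:
    """Estimate one gradient all-reduce using the ring algorithm's cost model:
    each GPU moves 2*(n-1)/n of the data over its link, plus 2*(n-1) hop
    latencies. Purely illustrative; parameters are assumptions."""
    bw_bytes_per_s = link_gbps * 1e9 / 8          # Gbit/s -> bytes/s
    bandwidth_term = 2 * (n_gpus - 1) / n_gpus * grad_bytes / bw_bytes_per_s
    latency_term = 2 * (n_gpus - 1) * latency_s   # grows linearly with cluster size
    return bandwidth_term + latency_term

# Hypothetical example: ~140 GB of fp16 gradients (a 70B-parameter model),
# 400 Gbit/s links, 5 microseconds per hop.
for n in (8, 1024, 100_000):
    t = ring_allreduce_seconds(140e9, n, link_gbps=400)
    print(f"{n:>7} GPUs: {t:.2f} s per gradient sync")
```

The bandwidth term flattens out near 2·D/BW as the cluster grows, but the latency term keeps growing with the number of hops; at 100,000 GPUs the hop latencies alone contribute on the order of a second per sync under these assumptions. That is precisely why topology, congestion control, and fault-tolerant routing, rather than per-GPU compute, dominate the engineering problem at million-GPU scale.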

The broader industry impact is profound. As inference workloads dominate production AI, the need for distributed, low‑latency GPU clusters grows, enabling real‑time services such as recommendation engines and autonomous systems. Moreover, Kagan’s vision of AI as a “spaceship of the mind” suggests a future where massive compute fabrics accelerate scientific discovery, potentially uncovering new physical laws. Companies that can harness both compute and networking at scale will capture the next wave of AI‑driven value, making Nvidia’s integrated approach a critical benchmark for competitors and partners alike.

Original Description

Recorded live at Sequoia’s Europe100 event: Michael Kagan, co-founder of Mellanox and CTO of Nvidia, explains how the $7 billion Mellanox acquisition helped transform Nvidia from a chip company into the architect of AI infrastructure. Kagan breaks down the technical challenges of scaling from single GPUs to 100K and eventually million-GPU data centers. He reveals why network performance, not just compute power, determines AI system efficiency. He discusses the shift from training to inference workloads, his vision for AI as humanity's "spaceship of the mind," and why he thinks AI may help us discover laws of physics we haven't yet imagined.
Hosted by Sonya Huang and Pat Grady
