Fixing GPU Starvation in Large-Scale Distributed Training

MLOps Community
Apr 10, 2026

Why It Matters

Eliminating data‑I/O bottlenecks directly lowers cloud‑GPU costs and speeds up model development, giving firms a tangible edge in AI‑driven markets.

Key Takeaways

  • GPU starvation stems from data I/O bottlenecks, not model inefficiency.
  • Caching training data on local SSD dramatically raises GPU utilization.
  • Instrumenting Uber’s open-source Petastorm library exposed producer‑consumer queue imbalances during training.
  • Profiling and tracing pinpoint remote file reads as the choke point.
  • Optimizing data layout and avoiding duplicate transfers cuts latency.

Summary

The video examines a pervasive problem in large‑scale distributed machine learning: GPUs sit idle because the data pipeline cannot feed them fast enough. Engineers at Uber and former Google staff explain that the bottleneck is not model architecture or quantization, but the latency of reading massive Parquet datasets from remote storage and shuffling them into the GPU.

Key insights include the discovery that GPU utilization on A100s fell to 15‑20 % despite powerful hardware. By instrumenting Uber’s open‑source Petastorm library, the team identified a producer‑consumer imbalance: the producer’s remote reads left the queue empty, starving the consumer that drives the GPU. Simple experiments—loading a slice into RAM—boosted utilization to 85 %, confirming that the model itself was fine.
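The diagnostic pattern described above—timing how long each side of the queue blocks—can be sketched with a minimal instrumented queue. This is an illustration of the general technique, not Petastorm's actual internals; the latency numbers are made up to simulate a slow remote read feeding a fast GPU step.

```python
import queue
import threading
import time

class InstrumentedQueue:
    """Bounded queue that records how long the producer blocks on put()
    (queue full: the consumer is the bottleneck) and how long the consumer
    blocks on get() (queue empty: the producer is the bottleneck)."""

    def __init__(self, maxsize=8):
        self._q = queue.Queue(maxsize=maxsize)
        self.producer_wait = 0.0  # total seconds blocked in put()
        self.consumer_wait = 0.0  # total seconds blocked in get()
        self._lock = threading.Lock()

    def put(self, item):
        t0 = time.monotonic()
        self._q.put(item)
        with self._lock:
            self.producer_wait += time.monotonic() - t0

    def get(self):
        t0 = time.monotonic()
        item = self._q.get()
        with self._lock:
            self.consumer_wait += time.monotonic() - t0
        return item

def producer(q, n, read_latency):
    for i in range(n):
        time.sleep(read_latency)  # simulate a slow remote Parquet read
        q.put(i)

def consumer(q, n, step_time):
    for _ in range(n):
        q.get()
        time.sleep(step_time)  # simulate a fast GPU training step

q = InstrumentedQueue()
t = threading.Thread(target=producer, args=(q, 20, 0.01))
t.start()
consumer(q, 20, 0.001)
t.join()
# consumer_wait >> producer_wait means the GPU is starved by the data
# pipeline, not the other way around.
print(f"consumer starved {q.consumer_wait:.3f}s, "
      f"producer blocked {q.producer_wait:.3f}s")
```

In this setup the consumer's accumulated wait dwarfs the producer's, which is exactly the signature the team saw: an empty queue on the GPU side pointing back at remote reads as the choke point.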

Concrete examples illustrate the fix: caching each epoch’s data on the local SSD of the GPU host eliminates repeated network calls, and restructuring data to avoid duplicate query features reduces transfer overhead. The engineers also highlighted the importance of tracing at both producer and consumer stages to locate choke points quickly.
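The local-SSD caching fix can be sketched as a download-once wrapper around the remote read. The `fetch` callable, the cache layout, and the demo paths are hypothetical stand-ins, not Uber's implementation; the idea is simply that the first epoch pays the network cost and later epochs hit local disk.

```python
import hashlib
import os
import tempfile

def cached_path(remote_path, fetch, cache_dir):
    """Return a local copy of remote_path, downloading it at most once.

    `fetch(remote_path, local_path)` is a hypothetical callable that does
    the actual remote read (e.g. from HDFS or object storage). Subsequent
    epochs resolve to the local SSD copy instead of the network."""
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(remote_path.encode()).hexdigest()
    local = os.path.join(cache_dir, key + ".parquet")
    if not os.path.exists(local):
        fetch(remote_path, local)  # first epoch: pay the network cost once
    return local  # later epochs: local read only

# Demo with a fake fetch that records every remote read.
fetches = []
def fake_fetch(remote, local):
    fetches.append(remote)
    with open(local, "w") as f:
        f.write("parquet bytes")

cache = tempfile.mkdtemp()  # stand-in for the GPU host's local SSD
for epoch in range(3):      # three epochs, but only one network read
    cached_path("hdfs://warehouse/shard-0.parquet", fake_fetch, cache)
print(f"remote reads: {len(fetches)}")
```

A real cache would also need size limits and invalidation when the underlying shard changes, but the core win—turning N epochs of network reads into one—is captured by this shape.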

The broader implication is clear: as GPUs become faster, data‑movement inefficiencies become cost‑driving liabilities. Companies that invest in pipeline profiling, local caching, and smarter data layouts can cut training time by as much as 80 %, reclaim wasted GPU spend, accelerate model iteration, and maintain competitive advantage in AI‑intensive services.
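The data-layout point raised earlier—avoiding duplicate query features—can be sketched as a simple normalization pass. In ranking datasets, every candidate row often repeats the same query-level feature vector; storing it once per query shrinks what crosses the network. The row schema here is a hypothetical illustration, not Uber's actual format.

```python
def dedupe_query_features(rows):
    """Split flat rows into per-query features (stored once) and
    per-candidate rows that reference them by query_id."""
    queries, candidates = {}, []
    for row in rows:
        qid = row["query_id"]
        queries.setdefault(qid, row["query_features"])  # keep one copy
        candidates.append({
            "query_id": qid,
            "candidate_features": row["candidate_features"],
        })
    return queries, candidates

# Three candidates for one query: the 128-float query vector appears
# three times in the flat layout but only once after deduplication.
rows = [
    {"query_id": 1, "query_features": [0.1] * 128, "candidate_features": [1.0]},
    {"query_id": 1, "query_features": [0.1] * 128, "candidate_features": [2.0]},
    {"query_id": 1, "query_features": [0.1] * 128, "candidate_features": [3.0]},
]
queries, candidates = dedupe_query_features(rows)
print(f"{len(queries)} query vector(s) for {len(candidates)} candidates")
```

The consumer side then joins candidates back to their query vector by `query_id` at batch-assembly time, trading a cheap in-memory lookup for a large reduction in bytes transferred.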

Original Description

Kashish Mittal is a Staff Software Engineer at Uber, working on large-scale distributed systems and core backend infrastructure.
Fixing GPU Starvation in Large-Scale Distributed Training // MLOps Podcast #367 with Kashish Mittal, Staff Software Engineer at Uber
// Abstract
Kashish zooms out to discuss a universal industry pattern: how infrastructure—specifically data loading—is almost always the hidden constraint for ML scaling.
The conversation dives deep into a recent architectural war story. Kashish walks through the full-stack profiling and detective work required to solve a massive GPU starvation bottleneck. By redesigning the Petastorm caching layer to bypass CPU transformation walls and uncovering hidden distributed race conditions, his team boosted GPU utilization to 60%+ and cut training time by 80%. Kashish also shares his philosophy on the fundamental trade-offs between latency and efficiency in GPU serving.
// Bio
Kashish Mittal is a Staff Software Engineer at Uber, where he architects the hyperscale machine learning infrastructure that powers Uber’s core mobility and delivery marketplaces. Prior to Uber, Kashish spent nearly a decade at Google building highly scalable, low-latency distributed ML systems for flagship products including YouTube Ads and Core Search Ranking. His engineering expertise lies at the intersection of distributed systems and AI—specifically focusing on large-scale data processing, eliminating critical I/O bottlenecks, and maximizing GPU efficiency for petabyte-scale training pipelines. When he isn't hunting down distributed race conditions, he is a passionate advocate for open-source architecture and building reproducible, high-throughput ML systems.
// Related Links
Getting Humans Out of the Way: How to Work with Teams of Agents // MLOps Podcast #368 with Rob Ennals, the Creator of Broomy: https://www.youtube.com/watch?v=ie1M8p-SVfM
~~~~~~~~ ✌️Connect With Us ✌️ ~~~~~~~
Catch all episodes, blogs, newsletters, and more: https://go.mlops.community/TYExplore
Join our Slack community [https://go.mlops.community/slack]
Follow us on X/Twitter [@mlopscommunity](https://x.com/mlopscommunity) or [LinkedIn](https://go.mlops.community/linkedin)
Sign up for the next meetup: [https://go.mlops.community/register]
Connect with Demetrios on LinkedIn: /dpbrinkm
Connect with Kashish on LinkedIn: /kashishmittal/
Timestamps:
[00:00] Local dataset caching
[00:30] Engineers' Evolving Roles
[04:44] GPU Resource Management
[10:21] GPU Utilization Issues
[21:49] More GPU War Stories
[32:12] Model Serving Issues
[39:58] Reflective Learning in Coding
[43:23] Workflow and Reflective Skills
[52:30] Wrap up
