SkyPilot at Shopify: Multi-Cloud GPUs without the Pain (2026) – Shopify

Summary
In this episode, the Shopify ML Platform team explains how they use the open‑source SkyPilot framework to run multi‑cloud GPU workloads at scale, routing jobs across Nebius and GCP based on resource requirements. They detail a custom SkyPilot plugin that enforces cost‑center labeling, fair‑share scheduling with Kueue, and automatic configuration for specialized hardware like H200 GPUs with InfiniBand. The talk also covers developer‑focused features such as one‑click dev environments, automatic caching of model weights, and policies that prevent idle GPU waste. Overall, the system lets engineers declare resources in YAML while the platform handles cloud selection, cost tracking, and scheduling, simplifying multi‑cloud GPU management.
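To make the "declare resources in YAML, let the platform decide" idea concrete, here is a minimal sketch of a policy that rejects tasks without a cost-center label and picks a cloud from the requested accelerator. It is plain Python over a parsed task spec, not Shopify's plugin and not SkyPilot's API. The `apply_policy` function, the `PolicyError` class, the `cost-center` label key, and the accelerator-to-cloud routing table are all assumptions made for illustration.

```python
from typing import Any

# Hypothetical routing table (accelerator name -> cloud). The real mapping
# used in the talk is not given in the summary; this is illustrative only.
ROUTES = {"H200": "nebius", "L4": "gcp"}
DEFAULT_CLOUD = "gcp"  # assumed fallback


class PolicyError(ValueError):
    """Raised when a task spec violates the (hypothetical) platform policy."""


def apply_policy(task: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the task with policy applied; raise if it is invalid."""
    resources = dict(task.get("resources", {}))
    labels = dict(resources.get("labels", {}))

    # Enforce cost-center labeling so every job can be attributed.
    if not labels.get("cost-center"):
        raise PolicyError("task must set resources.labels['cost-center']")

    # Accelerators are written as "NAME:COUNT", e.g. "H200:8".
    gpu_name = str(resources.get("accelerators", "")).split(":")[0]

    # Route by accelerator unless the user already pinned a cloud.
    resources.setdefault("cloud", ROUTES.get(gpu_name, DEFAULT_CLOUD))
    resources["labels"] = labels
    return {**task, "resources": resources}


if __name__ == "__main__":
    task = {
        "resources": {
            "accelerators": "H200:8",
            "labels": {"cost-center": "ml-platform"},
        }
    }
    print(apply_policy(task)["resources"]["cloud"])
```

The function returns a new dictionary rather than mutating its input, so the original user request stays available for logging or auditing.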