Performance Optimization and Software/Hardware Co-Design Across PyTorch, CUDA, and NVIDIA GPUs

MLOps Community
Mar 23, 2026

Why It Matters

Democratizing GPU‑software co‑design equips more engineers to extract peak performance, shortening development cycles and cutting cloud costs for AI workloads.

Key Takeaways

  • SageMaker HyperPod provides pre‑warmed GPU standby for instant scaling
  • Co‑design of PyTorch, CUDA, and NVIDIA hardware boosts performance
  • Modern apps favor rapid prototyping over traditional software engineering rigor
  • AI debugging agents and playground skills streamline code troubleshooting
  • Book aims to democratize hardware‑software co‑design knowledge for millions of engineers

Summary

The conversation centers on performance optimization and software‑hardware co‑design spanning PyTorch, CUDA, and NVIDIA GPUs, opening with SageMaker HyperPod, a service that keeps standby GPUs pre‑warmed for instant swapping. Chris Fregly also discusses his new O'Reilly book, which stitches together the three layers of hardware, software, and algorithms.

Key insights include the value of warm‑standby GPU pools for latency‑critical workloads, the continued need for skilled software engineers despite the trend toward low‑code, throwaway apps, and the rise of AI‑driven debugging agents such as Claude‑based playground skills that generate diagrams and Mermaid visualizations. The discussion also touches on hacky personal workflows that use Notion as a database, and the challenges of maintaining such ad‑hoc systems.
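
As a concrete illustration of the kernel‑level visibility such debugging workflows build on, here is a generic PyTorch sketch (not the speaker's actual setup; the toy model and sizes are assumptions) showing how the built‑in profiler surfaces which CUDA kernels dominate a forward pass:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Hypothetical toy model; any CUDA-resident module profiles the same way.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

# Record both CPU-side launch overhead and GPU-side kernel execution time.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x)

# Rank operators by total time spent in CUDA kernels.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

Output like this is exactly the kind of structured trace an AI debugging agent can consume to explain where a workload is spending its time.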

A memorable example is the speaker’s custom bot that injects AI into Google Docs, allowing on‑the‑fly queries and automatic note‑taking. He recounts pitching his book to O'Reilly using a Sequoia‑style deck, likening the process to VC fundraising, and notes the difficulty of obtaining official NVIDIA reviewers, underscoring the scarcity of expertise that bridges hardware and software.

By demystifying CUDA and GPU internals for a broader audience, the book aims to expand the pool of engineers capable of co‑designing efficient AI pipelines, accelerating innovation while reducing reliance on proprietary, opaque documentation. For enterprises, this translates into faster model deployment, lower compute costs, and more resilient production systems.
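
To give a flavor of the measurement discipline this co‑design approach rests on, here is a minimal PyTorch sketch (illustrative only; the matmul workload and sizes are assumptions, not taken from the book) that times a GPU operation with CUDA events, which record timestamps on the device stream and so are not skewed by asynchronous kernel launches:

```python
import torch

# Hypothetical workload; any GPU-resident op is timed the same way.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

# Warm up so one-time CUDA context and kernel-selection costs
# don't pollute the measurement.
for _ in range(3):
    torch.matmul(a, b)
torch.cuda.synchronize()

# CUDA events measure device time, not Python-side overhead.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
c = torch.matmul(a, b)
end.record()
torch.cuda.synchronize()  # wait for the GPU before reading the timer

print(f"matmul took {start.elapsed_time(end):.3f} ms")
```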

Original Description

Chris Fregly is currently focused on building and scaling high-performance AI systems, writing and teaching about AI infrastructure, helping organizations adopt generative AI and performance engineering principles on AWS, and fostering large developer communities around these topics.
Performance Optimization and Software/Hardware Co-design across PyTorch, CUDA, and NVIDIA GPUs // MLOps Podcast #363 with Chris Fregly, Founder, AI Performance Engineer, and Investor
// Abstract
In today’s era of massive generative models, it's important to understand the full scope of AI systems performance engineering. This talk discusses the new O'Reilly book, AI Systems Performance Engineering, and the accompanying GitHub repo (https://github.com/cfregly/ai-performance-engineering).
This talk provides engineers, researchers, and developers with a set of actionable optimization strategies. You'll learn techniques to co-design and co-optimize hardware, software, and algorithms to build resilient, scalable, and cost-effective AI systems for both training and inference.
// Bio
Chris Fregly is an AI performance engineer and startup founder with experience at AWS, Databricks, and Netflix. He is the author of three O'Reilly books: Data Science on AWS (2021), Generative AI on AWS (2023), and AI Systems Performance Engineering (2025). He also runs the global AI Performance Engineering meetup and speaks at many AI-related conferences, including NVIDIA GTC, ODSC, Big Data London, and more.
// Related Links
AI Systems Performance Engineering: Optimizing Model Training and Inference Workloads with GPUs, CUDA, and PyTorch 1st Edition by Chris Fregly: https://www.amazon.com/Systems-Performance-Engineering-Optimizing-Algorithms/dp/B0F47689K8/
Coding Agents Conference: https://luma.com/codingagents
~~~~~~~~ ✌️Connect With Us ✌️ ~~~~~~~
Catch all episodes, blogs, newsletters, and more: https://go.mlops.community/TYExplore
Join our Slack community: https://go.mlops.community/slack
Follow us on X/Twitter [@mlopscommunity](https://x.com/mlopscommunity) or [LinkedIn](https://go.mlops.community/linkedin)
Sign up for the next meetup: https://go.mlops.community/register
Connect with Demetrios on LinkedIn: /dpbrinkm
Connect with Chris on LinkedIn: /cfregly
Timestamps:
[00:00] SageMaker HyperPod Resilience
[00:27] Book Creation and Software Engineering
[04:57] Software Engineers and Maintenance
[11:49] AI Systems Performance Engineering
[22:03] Cognitive Biases and Optimization / "Mechanical Sympathy"
[29:36] GPU Rack-Scale Architecture
[33:58] Data Center Reliability Issues
[43:52] AI Compute Platforms
[49:05] Hardware vs Ecosystem Choice
[1:00:05] Claude vs Codex vs Gemini
[1:14:53] Kernel Budget Allocation
[1:18:49] Steerable Reasoning Challenges
[1:24:18] Data Chain Value Awareness
