Performance Optimization and Software/Hardware Co-Design Across PyTorch, CUDA, and NVIDIA GPUs

MLOps Community
Mar 23, 2026

Why It Matters

Democratizing GPU‑software co‑design equips more engineers to extract peak performance, shortening development cycles and cutting cloud costs for AI workloads.

Key Takeaways

  • SageMaker HyperPod provides pre‑warmed GPU standby for instant scaling
  • Co‑design of PyTorch, CUDA, and NVIDIA hardware boosts performance
  • Modern apps favor rapid prototyping over traditional software engineering rigor
  • AI debugging agents and playground skills streamline code troubleshooting
  • Book aims to democratize hardware‑software co‑design knowledge for millions of engineers

Summary

The conversation centers on performance optimization and software‑hardware co‑design spanning PyTorch, CUDA, and NVIDIA GPUs, opening with SageMaker HyperPod, a service that keeps standby GPUs pre‑warmed for instant swapping. Chris Fregly also discusses his new O'Reilly book, which stitches together the three layers of hardware, software, and algorithms.

Key insights include the value of warm‑standby GPU pools for latency‑critical workloads, the continued need for skilled software engineers despite the trend toward low‑code, throwaway apps, and the rise of AI‑driven debugging agents such as Claude‑based playground skills that generate diagrams and Mermaid visualizations. The discussion also touches on hacky personal workflows that use Notion as a database, and the challenges of maintaining such ad‑hoc systems.
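
As a concrete illustration of the kernel‑level visibility such debugging workflows build on, here is a generic PyTorch sketch (not the speaker's actual setup; the toy model and sizes are assumptions) showing how the built‑in profiler surfaces which CUDA kernels dominate a forward pass:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Hypothetical toy model; any CUDA-resident module profiles the same way.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

# Record both CPU-side launch overhead and GPU-side kernel execution time.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x)

# Rank operators by total time spent in CUDA kernels.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

Output like this is exactly the kind of structured trace an AI debugging agent can consume to explain where a workload is spending its time.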

A memorable example is the speaker’s custom bot that injects AI into Google Docs, allowing on‑the‑fly queries and automatic note‑taking. He recounts pitching his book to O'Reilly using a Sequoia‑style deck, likening the process to VC fundraising, and notes the difficulty of obtaining official NVIDIA reviewers, underscoring the scarcity of expertise that bridges hardware and software.

By demystifying CUDA and GPU internals for a broader audience, the book aims to expand the pool of engineers capable of co‑designing efficient AI pipelines, accelerating innovation while reducing reliance on proprietary, opaque documentation. For enterprises, this translates into faster model deployment, lower compute costs, and more resilient production systems.
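
To give a flavor of the measurement discipline this co‑design approach rests on, here is a minimal PyTorch sketch (illustrative only; the matmul workload and sizes are assumptions, not taken from the book) that times a GPU operation with CUDA events, which record timestamps on the device stream and so are not skewed by asynchronous kernel launches:

```python
import torch

# Hypothetical workload; any GPU-resident op is timed the same way.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

# Warm up so one-time CUDA context and kernel-selection costs
# don't pollute the measurement.
for _ in range(3):
    torch.matmul(a, b)
torch.cuda.synchronize()

# CUDA events measure device time, not Python-side overhead.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
c = torch.matmul(a, b)
end.record()
torch.cuda.synchronize()  # wait for the GPU before reading the timer

print(f"matmul took {start.elapsed_time(end):.3f} ms")
```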

Original Description

Chris Fregly is currently focused on building and scaling high-performance AI systems, writing and teaching about AI infrastructure, helping organizations adopt generative AI and performance engineering principles on AWS, and fostering large developer communities around these topics.
Performance Optimization and Software/Hardware Co-design across PyTorch, CUDA, and NVIDIA GPUs // MLOps Podcast #363 with Chris Fregly, Founder, AI Performance Engineer, and Investor
// Abstract
In today’s era of massive generative models, it's important to understand the full scope of AI systems performance engineering. This talk discusses the new O'Reilly book, AI Systems Performance Engineering, and the accompanying GitHub repo (https://github.com/cfregly/ai-performance-engineering).
This talk provides engineers, researchers, and developers with a set of actionable optimization strategies. You'll learn techniques to co-design and co-optimize hardware, software, and algorithms to build resilient, scalable, and cost-effective AI systems for both training and inference.
// Bio
Chris Fregly is an AI performance engineer and startup founder with experience at AWS, Databricks, and Netflix. He is the author of three O'Reilly books: Data Science on AWS (2021), Generative AI on AWS (2023), and AI Systems Performance Engineering (2025). He also runs the global AI Performance Engineering meetup and speaks at many AI-related conferences, including NVIDIA GTC, ODSC, Big Data London, and more.
// Related Links
AI Systems Performance Engineering: Optimizing Model Training and Inference Workloads with GPUs, CUDA, and PyTorch 1st Edition by Chris Fregly: https://www.amazon.com/Systems-Performance-Engineering-Optimizing-Algorithms/dp/B0F47689K8/
Coding Agents Conference: https://luma.com/codingagents
~~~~~~~~ ✌️Connect With Us ✌️ ~~~~~~~
Catch all episodes, blogs, newsletters, and more: https://go.mlops.community/TYExplore
Join our Slack community: https://go.mlops.community/slack
Follow us on X/Twitter [@mlopscommunity](https://x.com/mlopscommunity) or [LinkedIn](https://go.mlops.community/linkedin)
Sign up for the next meetup: https://go.mlops.community/register
Connect with Demetrios on LinkedIn: /dpbrinkm
Connect with Chris on LinkedIn: /cfregly
Timestamps:
[00:00] SageMaker HyperPod Resilience
[00:27] Book Creation and Software Engineering
[04:57] Software Engineers and Maintenance
[11:49] AI Systems Performance Engineering
[22:03] Cognitive Biases and Optimization / "Mechanical Sympathy"
[29:36] GPU Rack-Scale Architecture
[33:58] Data Center Reliability Issues
[43:52] AI Compute Platforms
[49:05] Hardware vs Ecosystem Choice
[1:00:05] Claude vs Codex vs Gemini
[1:14:53] Kernel Budget Allocation
[1:18:49] Steerable Reasoning Challenges
[1:24:18] Data Chain Value Awareness
