How to Engineer AI Inference Systems [Philip Kiely] - 766

TWiML AI (This Week in Machine Learning & AI)
Apr 30, 2026

Why It Matters

Because inference determines the speed, cost, and user experience of AI products, mastering it gives companies a competitive advantage and fuels a booming demand for specialized engineers.

Key Takeaways

  • Inference support for a new model can land in hours, while training can take weeks
  • Building inference systems requires multidisciplinary expertise, akin to mixed martial arts
  • Rapid research-to-production cycle: new techniques deployed within a day
  • Demand for inference engineers will grow tenfold in coming years
  • Effective inference strategy differentiates fast, magical products from sluggish ones

Summary

This podcast episode with host Sam Charrington and guest Philip Kiely focuses on the emerging discipline of AI inference engineering, highlighting how inference has become the most critical and fastest‑moving workload in the AI stack.

Kiely explains that, unlike model training, which can take weeks, inference support for a new model architecture can be added in hours. He cites the 31‑hour turnaround for the PolarQuant CUDA kernel and the need to juggle GPU programming, quantization, speculative decoding, and large‑scale distributed systems, all while meeting sub‑200 ms latency SLAs.

He uses a mixed‑martial‑arts metaphor, saying inference engineers must master many “techniques” from CUDA to cloud orchestration. He also notes that the community of inference engineers has exploded from a few hundred to tens of thousands, and that the research‑to‑production pipeline is arguably the fastest in any industry.

The rapid pace and high stakes mean companies that can deliver low‑latency, cost‑effective inference gain a competitive edge, while the talent shortage creates a hiring frenzy. Mastery of inference will therefore be a decisive factor for AI‑native products and for any firm building its own generative‑AI services.

Original Description

In this episode, Philip Kiely, head of AI education at Baseten, joins us to unpack the fast-evolving discipline of inference engineering. We explore why inference has become the stickiest and most critical workload in AI, how it blends GPU programming, applied research, and large-scale distributed systems, and where the line sits between inference and model serving. Philip shares how research-to-production can move in hours, not months, and why understanding “the knobs” of inference—batching, quantization, speculation, and KV cache reuse—lets teams design better products and SLAs. We trace the inference maturity journey from closed APIs to dedicated deployments and in-house platforms, discuss GPU lifecycles, and survey today’s runtime landscape, including vLLM, SGLang, and TensorRT LLM. Finally, we look ahead to agents and multimodality, making the case for specialized, workload-specific runtimes when performance and efficiency matter most.
🗒️ For the full list of resources for this episode, visit the show notes page: https://twimlai.com/go/766.
🔔 Subscribe to our channel for more great content just like this: https://youtube.com/twimlai?sub_confirmation=1
🗣️ CONNECT WITH US!
===============================
Subscribe to the TWIML AI Podcast: https://twimlai.com/podcast/twimlai/
Follow us on Twitter: https://twitter.com/twimlai
Join our Slack Community: https://twimlai.com/community/
Subscribe to our newsletter: https://twimlai.com/newsletter/
Want to get in touch? Send us a message: https://twimlai.com/contact/
📖 CHAPTERS
===============================
00:00 - Introduction
03:40 - Why inference is the most important AI workload?
06:21 - Inference vs model serving
07:18 - Inference challenges
09:57 - Pace of inference research to production timeline
13:41 - Reasons to care about inference engineering
15:49 - Considerations in build vs buy decisions
22:08 - Product maturity cycle
27:14 - GPU lifecycles in inference maturity
32:14 - LLM-assisted inference
36:46 - Agents and multimodal models in specialized inference optimization
47:21 - Open source runtimes: vLLM, SGLang, and TensorRT LLM
49:50 - Specialized AI hardware
51:24 - Future trends and predictions
52:36 - Where to find the inference engineering book
