Building AI Agents That Survive Production

May 14, 2026

MLOps Community

Why It Matters

Resilient agent architectures turn experimental prototypes into reliable products, protecting revenue and user trust as AI services scale.

Key Takeaways

• Production agents face crashes, memory limits, and API throttling.
• Design agents to tolerate failures rather than prevent them entirely.
• Platforms must provide dynamism, durability, and secure execution environments.
• Declare required infrastructure in code to enable automatic retries.
• Record actions for deterministic crash recovery and avoid redundant user prompts.
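
As a rough illustration of the fourth takeaway, the sketch below declares resources next to the step that needs them and lets a small runner retry with more memory after an out-of-memory failure. Everything here is hypothetical: `Resources`, `SimulatedOOM`, `run_with_retries`, and the doubling policy are assumptions made for the example, not the API of Flyte, Union.ai, or any other platform.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource declaration; not a real platform API."""
    cpu: int = 1
    memory_gb: int = 2
    gpu: int = 0


class SimulatedOOM(Exception):
    """Stand-in for an out-of-memory kill reported by a runtime."""


def run_with_retries(step: Callable[[Resources], str],
                     resources: Resources,
                     max_attempts: int = 3) -> tuple[str, Resources]:
    """Run `step`, doubling the declared memory after each simulated OOM."""
    current = resources
    for attempt in range(1, max_attempts + 1):
        try:
            return step(current), current
        except SimulatedOOM:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the failure to the caller.
            current = replace(current, memory_gb=current.memory_gb * 2)
    raise RuntimeError("max_attempts must be at least 1")


def summarize_catalog(res: Resources) -> str:
    # Assumption for the demo: this step needs at least 8 GB to finish.
    if res.memory_gb < 8:
        raise SimulatedOOM(f"needed 8 GB, had {res.memory_gb} GB")
    return "summary ready"


result, final = run_with_retries(summarize_catalog, Resources(memory_gb=2))
print(result, final.memory_gb)  # prints "summary ready 8"
```

Because the resource request is ordinary data living beside the code, the retry policy can inspect and change it. A request hidden in a separate deployment manifest would be harder to adjust at run time.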

Summary

The AI Agents 2026 conference in Seattle opened with Demetrios Brinkmann introducing Union.ai CTO Haytham Abuelfutuh, who framed the session around building AI agents that can survive real-world production. Haytham highlighted the gap between lab-tested prototypes and the realities of deployment: memory exhaustion, API throttling, spot-instance loss, and long-running user sessions that can span weeks. He argued that engineers should stop trying to create flawless agents and instead design them to tolerate inevitable failures.

Three platform pillars emerged, the talk's "3 D's". Dynamic: developers write agents in familiar Python without restrictive DSLs. Durable: automatic retries, crash recovery, and state logging preserve context. Defended: generated code runs in a secure sandbox, with controlled escalation when agents hit their limits. A personal anecdote about a honeymoon travel agent illustrated user expectations: agents must remember prior interactions and resume seamlessly after interruptions.

Haytham demonstrated practical tactics such as declaring required CPU, memory, and GPU resources directly in code, which lets the runtime re-allocate or retry failed jobs, and logging every action so a crashed session can replay recorded steps deterministically without re-prompting users.

The broader implication is a shift in MLOps mindset: resilient agent architectures become a prerequisite for scaling AI services. Companies that adopt platforms offering dynamic resource specification, built-in durability, and secure execution can cut token waste, improve user experience, and shorten time-to-market for AI-driven products.
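
The record-and-replay tactic from the summary can be sketched in plain Python. This is a minimal illustration and not the mechanism shown in the talk. `ActionJournal`, the JSON-lines file layout, the step name, and the "Lisbon" answer are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


class ActionJournal:
    """Append-only log of step results, keyed by step name (hypothetical design)."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.entries: dict[str, str] = {}
        if path.exists():
            # Reload every result recorded before the crash.
            for line in path.read_text().splitlines():
                record = json.loads(line)
                self.entries[record["step"]] = record["result"]

    def run(self, step: str, action: Callable[[], str]) -> str:
        """Return the recorded result if present; otherwise run and record it."""
        if step in self.entries:
            return self.entries[step]
        result = action()
        with self.path.open("a") as f:
            f.write(json.dumps({"step": step, "result": result}) + "\n")
        self.entries[step] = result
        return result


calls = []


def ask_user_destination() -> str:
    # Stand-in for a non-deterministic step such as prompting the user.
    calls.append("ask")
    return "Lisbon"


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "session.jsonl"
    first = ActionJournal(log).run("destination", ask_user_destination)
    # Simulate a crash: a fresh journal object reloads state from disk.
    second = ActionJournal(log).run("destination", ask_user_destination)

print(first, second, len(calls))  # prints "Lisbon Lisbon 1"
```

The second run returns the stored answer without calling `ask_user_destination` again, which is the behaviour the travel-agent anecdote asks for: the user is never asked the same question twice.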

Original Description

Haytham Abuelfutuh, Co-founder and CTO of Union.ai and co-author of the open-source orchestrator Flyte, opens the AI Agents 2026 conference in Seattle with a brutally simple message: stop trying to design AI agents that never fail. Build agents that fail cheaply and recover automatically.

In this 25-minute talk, Haytham walks through the three design principles every production agent needs (the 3 D's: Dynamic, Durable, and Defended) and shows what each one actually requires from your platform. He grounds it in a real case study with Dragonfly, who took a laptop prototype to a production agent system on Flyte 2 in a single sitting, indexing 250,000+ products.

Topics covered:

- The travel agent thought experiment: what 18 years of human agents teach us about long-running sessions, dropped calls, and not asking the user the same question twice

- The show-of-hands problem: why so many teams build agents but so few ever ship them

- The full taxonomy of agent failure: semantic errors, infrastructure errors, network errors, API throttling, and corrupt context

- Dynamic: why agent platforms must run native Python instead of forcing you into a constrained DSL for branching and loops

- Durable: declaring infrastructure inside your code so agents can react to OOMs, spot machine preemption, and crashes

- Crash recovery for long-running sessions: caching non-deterministic LLM calls and tool calls so agents can resume from the last checkpoint

- Cross-session caching: when to share LLM outputs across users and when to recompute

- Defended: sandboxing agent-generated code with Pydantic Monty and network-isolated execution environments

- Human-in-the-loop bailouts when the agent has exhausted its retries

- Dragonfly case study: a four-tier agent architecture (catalog, coordinator, researcher, tools) for product recommendation across 250K+ products

- Q&A: why Union.ai uses Go and Rust under the Python SDK, and how platform teams can shift agent infrastructure left to developers without losing control
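
Two of the topics above, native Python control flow and human-in-the-loop bailouts, can be sketched together. The example is a hedged illustration, not code from the talk. `ToolError`, `call_with_bailout`, `flaky_price_lookup`, and `human_operator` are hypothetical names, and the retry limit of three is an arbitrary choice.

```python
from typing import Callable, Optional


class ToolError(Exception):
    """Stand-in for a failed tool or API call (for example, throttling)."""


def call_with_bailout(tool: Callable[[], str],
                      ask_human: Callable[[str], str],
                      max_retries: int = 3) -> str:
    """Try `tool` up to `max_retries` times, then hand off to a person."""
    last_error: Optional[ToolError] = None
    for _ in range(max_retries):  # An ordinary Python loop, no DSL needed.
        try:
            return tool()
        except ToolError as err:
            last_error = err
    return ask_human(f"Tool failed {max_retries} times: {last_error}")


attempts = 0


def flaky_price_lookup() -> str:
    # Always fails, to force the bailout path in this demo.
    global attempts
    attempts += 1
    raise ToolError("429 Too Many Requests")


def human_operator(question: str) -> str:
    # A real system would notify a person; here the reply is hard-coded.
    return "human: use cached price"


answer = call_with_bailout(flaky_price_lookup, human_operator)
print(answer, attempts)  # prints "human: use cached price 3"
```

Bounding the retries keeps a persistent failure cheap: the agent stops spending tokens and API calls after a fixed number of attempts and escalates with the last error attached.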

For ML engineers, platform engineers, and anyone who has built an agent on their laptop and watched it crashloop the moment it hit production traffic.

Links and Resources:

- Union.ai: https://www.union.ai/

- Flyte (open source): https://flyte.org/

- Flyte 2 announcement: https://www.union.ai/flyte/2-0-announcement

- Haytham Abuelfutuh on LinkedIn: https://www.linkedin.com/in/haythamafutuh/

- Pydantic Monty (sandboxed Python execution): https://github.com/pydantic/monty

- Union.ai $19M Series A (GeekWire): https://www.geekwire.com/2026/seattle-area-startup-union-ai-raises-19m-to-fuel-ai-workflow-platform/

- AI Agents 2026 conference (MLOps Community): https://mlops.community/

Timestamps (approximate):

00:00 Conference opening and housekeeping (David, MLOps Community)

01:09 Demetrios on stage: AI Agents 2026 in Seattle

03:52 Introducing Haytham Abuelfutuh, CTO of Union.ai

04:39 Haytham takes the stage

05:28 The travel agent story: what a great human agent looks like

06:26 Show of hands: who has shipped an agent to production?

08:08 Categorizing agent failures: semantic, infrastructure, network, API

09:32 The 3 D's framework introduced

09:49 D #1: Dynamic — write native Python, not a constrained DSL

10:33 D #2: Durable — surviving crashes, OOMs, and spot preemption

11:15 D #3: Defended — sandboxing untrusted agent-generated code

15:58 Durability deep dive: long-running sessions and crash recovery

17:58 Cross-session caching: when to share LLM and tool calls

19:02 Making failures cheap as a first principle

20:22 Defended in practice: secure code execution

21:34 Pydantic Monty for sandboxed Python

24:03 Case study: Dragonfly's 250K+ product agent catalog

26:26 From laptop prototype to production in one sitting

26:56 The 3 D's quick recap quiz

27:59 Q&A begins

28:27 Is Union.ai built on Erlang? (Go and Rust under the Python SDK)

29:05 Platform teams vs. developers: how to shift agent infra left

31:53 Closing

#AIAgents #DurableExecution #Flyte
