Building AI Agents That Survive Production

MLOps Community
May 14, 2026

Why It Matters

Resilient agent architectures turn experimental prototypes into reliable products, protecting revenue and user trust as AI services scale.

Key Takeaways

  • Production agents face crashes, memory limits, and API throttling.
  • Design agents to tolerate failures rather than prevent them entirely.
  • Platforms must provide dynamism, durability, and secure execution environments.
  • Declare required infrastructure (CPU, memory, GPU) in code so the runtime can re-allocate resources and retry failed jobs automatically.
  • Record every action so a crashed session can replay completed steps instead of prompting the user again.
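As a rough illustration of the "declare infrastructure in code" takeaway, the sketch below attaches a resource request to a step and retries with a larger memory request after an out-of-memory failure. It is plain Python under stated assumptions: `Resources`, `run_with_retries`, and `index_catalog` are hypothetical names invented for this example, not the API of Flyte or any other platform, and the out-of-memory condition is simulated.

```python
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource declaration; not any real platform's API."""
    cpu: int = 1
    memory_mb: int = 512


def run_with_retries(
    step: Callable[[Resources], T],
    resources: Resources,
    max_attempts: int = 3,
) -> tuple[T, Resources]:
    """Run `step`, doubling the memory request after each out-of-memory failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    current = resources
    for attempt in range(1, max_attempts + 1):
        try:
            return step(current), current
        except MemoryError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure to the caller
            # Because the request is declared in code, the runtime can change it.
            current = Resources(cpu=current.cpu, memory_mb=current.memory_mb * 2)
    raise RuntimeError("unreachable")


def index_catalog(res: Resources) -> str:
    # Stand-in for real work: pretend the job needs at least 2048 MB.
    if res.memory_mb < 2048:
        raise MemoryError(f"{res.memory_mb} MB is not enough")
    return "indexed"


result, final = run_with_retries(index_catalog, Resources(memory_mb=512))
print(result, final.memory_mb)  # prints "indexed 2048"
```

The point of the design is that the failure is cheap: the step itself stays unchanged, and only the declared request grows between attempts. When the attempts run out, the error propagates so a caller can escalate, for example to a human.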

Summary

The AI Agents 2026 conference in Seattle opened with Demetrios Brinkmann introducing Haytham Abuelfutuh, co-founder and CTO of Union.ai, who framed his session around building AI agents that can survive real-world production. Haytham highlighted the gap between lab-tested prototypes and the realities of deployment: memory exhaustion, API throttling, spot-instance loss, and long-running user sessions that can span weeks. He argued that engineers should stop trying to create flawless agents and instead design them to tolerate inevitable failures.

Three platform pillars emerged, which the talk calls the 3 D's. Dynamic: developers write agents in familiar Python rather than a restrictive DSL. Durable: the platform provides automatic retries, crash recovery, and state logging to preserve context. Defended: generated code runs in a secure sandbox, and the agent escalates in a controlled way when it hits its limits. An anecdote about a honeymoon travel agent illustrated user expectations: agents must remember prior interactions and resume seamlessly after interruptions.

Haytham demonstrated practical tactics such as declaring required CPU, memory, and GPU resources directly in code, which lets the runtime re-allocate resources or retry failed jobs, and logging every action so that a crashed session can replay its completed steps without prompting the user again.

The broader implication is a shift in MLOps mindset: resilient agent architectures become a prerequisite for scaling AI services. Companies that adopt platforms offering dynamic resource specification, built-in durability, and secure execution can cut token waste, improve user experience, and shorten time-to-market for AI-driven products.
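The action-logging tactic described in this summary can be sketched as a small journal that records each step's result and returns the recorded value on replay. This is a minimal illustration under assumptions, not the talk's implementation: `ActionJournal`, the step names, and the JSON-lines file format are all hypothetical, and the "LLM call" and "user prompt" are stand-in functions.

```python
import json
import os
import tempfile
from typing import Any, Callable


class ActionJournal:
    """Append-only log of step results, keyed by step name (hypothetical helper)."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.results: dict[str, Any] = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    entry = json.loads(line)
                    self.results[entry["step"]] = entry["result"]

    def run(self, step: str, action: Callable[[], Any]) -> Any:
        """Return the recorded result if present; otherwise run and record it."""
        if step in self.results:
            return self.results[step]
        result = action()  # e.g. an LLM call, a tool call, or a user prompt
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"step": step, "result": result}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make the record durable before moving on
        self.results[step] = result
        return result


calls: list[str] = []  # tracks which actions actually executed


def ask_user() -> str:
    calls.append("ask_user")
    return "Lisbon"  # stand-in for a real user prompt


def plan_trip(destination: str, crash: bool) -> str:
    calls.append("plan_trip")
    if crash:
        raise RuntimeError("simulated crash")
    return f"3 nights in {destination}"  # stand-in for an LLM call


def session(path: str, crash: bool) -> str:
    journal = ActionJournal(path)
    destination = journal.run("destination", ask_user)
    return journal.run("itinerary", lambda: plan_trip(destination, crash))


path = os.path.join(tempfile.mkdtemp(), "session.jsonl")
try:
    session(path, crash=True)  # first run crashes after the user has answered
except RuntimeError:
    pass
itinerary = session(path, crash=False)  # the restart replays the recorded answer
print(itinerary, calls.count("ask_user"))  # prints "3 nights in Lisbon 1"
```

Only the step that failed runs again after the restart, and the user is asked once. A production system would also need stable step identifiers and handling for partially written records, which this sketch leaves out.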

Original Description

Haytham Abuelfutuh, Co-founder and CTO of Union.ai and co-author of the open-source orchestrator Flyte, opens the AI Agents 2026 conference in Seattle with a brutally simple message: stop trying to design AI agents that never fail. Build agents that fail cheaply and recover automatically.
In this 25-minute talk, Haytham walks through the three design principles every production agent needs — the 3 D's: Dynamic, Durable, and Defended — and shows what each one actually requires from your platform. He grounds it in a real case study with Dragonfly, who took a laptop prototype to a production agent system indexing 250,000+ products in a single sitting on Flyte 2.
Topics covered:
- The travel agent thought experiment: what 18 years of human agents teach us about long-running sessions, dropped calls, and not asking the user the same question twice
- The show-of-hands problem: why so many teams build agents but so few ever ship them
- The full taxonomy of agent failure: semantic errors, infrastructure errors, network errors, API throttling, and corrupt context
- Dynamic: why agent platforms must run native Python instead of forcing you into a constrained DSL for branching and loops
- Durable: declaring infrastructure inside your code so agents can react to OOMs, spot machine preemption, and crashes
- Crash recovery for long-running sessions: caching non-deterministic LLM calls and tool calls so agents can resume from the last checkpoint
- Cross-session caching: when to share LLM outputs across users and when to recompute
- Defended: sandboxing agent-generated code with Pydantic Monty and network-isolated execution environments
- Human-in-the-loop bailouts when the agent has exhausted its retries
- Dragonfly case study: a four-tier agent architecture (catalog, coordinator, researcher, tools) for product recommendation across 250K+ products
- Q&A: why Union.ai uses Go and Rust under the Python SDK, and how platform teams can shift agent infrastructure left to developers without losing control
For ML engineers, platform engineers, and anyone who has built an agent on their laptop and watched it crashloop the moment it hit production traffic.
Links and Resources:
- Flyte (open source): https://flyte.org/
- Haytham Abuelfutuh on LinkedIn: https://www.linkedin.com/in/haythamafutuh/
- Pydantic Monty (sandboxed Python execution): https://github.com/pydantic/monty
- AI Agents 2026 conference (MLOps Community): https://mlops.community/
Timestamps (approximate):
00:00 Conference opening and housekeeping (David, MLOps Community)
01:09 Demetrios on stage: AI Agents 2026 in Seattle
03:52 Introducing Haytham Abuelfutuh, CTO of Union.ai
04:39 Haytham takes the stage
05:28 The travel agent story: what a great human agent looks like
06:26 Show of hands: who has shipped an agent to production?
08:08 Categorizing agent failures: semantic, infrastructure, network, API
09:32 The 3 D's framework introduced
09:49 D #1: Dynamic — write native Python, not a constrained DSL
10:33 D #2: Durable — surviving crashes, OOMs, and spot preemption
11:15 D #3: Defended — sandboxing untrusted agent-generated code
15:58 Durability deep dive: long-running sessions and crash recovery
17:58 Cross-session caching: when to share LLM and tool calls
19:02 Making failures cheap as a first principle
20:22 Defended in practice: secure code execution
21:34 Pydantic Monty for sandboxed Python
24:03 Case study: Dragonfly's 250K+ product agent catalog
26:26 From laptop prototype to production in one sitting
26:56 The 3 D's quick recap quiz
27:59 Q&A begins
28:27 Is Union.ai built on Erlang? (Go and Rust under the Python SDK)
29:05 Platform teams vs. developers: how to shift agent infra left
31:53 Closing
#AIAgents #DurableExecution #Flyte
