The TWIML AI Podcast

Proactive Agents for the Web with Devi Parikh - #756

November 18, 2025 • 56 min

Why It Matters

Visual‑first agents promise more resilient web automation, reshaping enterprise workflows and reducing reliance on brittle code‑based scrapers.

Key Takeaways

• Visual models operate on screenshots rather than the DOM, which makes them more robust
• Yutori's training pipeline now adds rejection sampling and reinforcement learning to supervised fine‑tuning
• “Scouts” coordinate tools and sub‑agents to solve complex queries
• Ambient agents run continuously in the background, monitoring web tasks
• The episode outlines a path from simple monitoring to full web automation

Pulse Analysis

The shift from DOM‑centric automation to screenshot‑based visual models marks a fundamental change in how AI interacts with the web. Traditional scripts falter when page structures evolve, but visual grounding treats the rendered page as an image, allowing agents to recognize buttons, forms, and dynamic elements regardless of underlying HTML changes. This approach mirrors human perception, delivering a more generalizable solution that can scale across heterogeneous sites without constant re‑engineering.

Yutori’s training pipeline illustrates the maturation of web‑agent learning. It began with supervised fine‑tuning on curated datasets and now also incorporates rejection sampling, which keeps only the sampled outputs that pass a quality check, and reinforcement learning, which rewards successful task completions. This multi‑stage regimen is meant to reduce error propagation and give agents the adaptability needed for real‑world browsing, where unexpected pop‑ups and layout shifts are common.
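As a rough illustration of the rejection sampling step mentioned above, the sketch below samples several candidate trajectories per task and keeps only the ones flagged as successful, which could then serve as fine‑tuning data. It is a generic toy version of the technique, not Yutori's pipeline: `Trajectory`, `sample_trajectory`, `rejection_sample`, the action names, and the random success signal are all hypothetical stand‑ins.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    task: str
    actions: List[str]
    succeeded: bool


def sample_trajectory(task: str, rng: random.Random) -> Trajectory:
    """Hypothetical stand-in for running an agent once on a task."""
    n_steps = rng.randint(1, 5)
    actions = [rng.choice(["click", "type", "scroll"]) for _ in range(n_steps)]
    # Toy success signal (assumption); a real system would use a task checker.
    succeeded = rng.random() < 0.3
    return Trajectory(task, actions, succeeded)


def rejection_sample(tasks: List[str], samples_per_task: int,
                     rng: random.Random) -> List[Trajectory]:
    """Sample several attempts per task and keep only the successful ones."""
    kept = []
    for task in tasks:
        for _ in range(samples_per_task):
            traj = sample_trajectory(task, rng)
            if traj.succeeded:  # reject everything that failed the check
                kept.append(traj)
    return kept


rng = random.Random(0)  # seeded so the run is repeatable
tasks = ["find flight", "track price"]
dataset = rejection_sample(tasks, samples_per_task=8, rng=rng)
print(f"kept {len(dataset)} of {len(tasks) * 8} sampled trajectories")
```

The surviving trajectories form a filtered dataset; in a pipeline like the one described, such data would typically feed another round of supervised fine‑tuning before or alongside reinforcement learning.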

The “Scouts” architecture extends these capabilities by orchestrating a suite of specialized sub‑agents, each handling distinct subtasks such as data extraction, form filling, or navigation. Operating in an ambient mode, Scouts continuously monitor web contexts, seamlessly transitioning from passive observation to active execution when conditions align. This modular, tool‑driven strategy paves the way for end‑to‑end web automation, promising enterprises faster workflow integration, lower maintenance costs, and new opportunities for AI‑driven digital assistants.
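The orchestration and ambient monitoring described above can be sketched in a few lines. This is a minimal toy under stated assumptions, not Yutori's design: the `Scout` class, the `navigate` and `extract` sub‑agents, the `monitor` loop, and the example URL and prices are all hypothetical. The orchestrator dispatches each plan step to a named sub‑agent, and the monitoring loop acts only when an observed value changes.

```python
from typing import Callable, Dict, List, Tuple

SubAgent = Callable[[str], str]


def navigate(target: str) -> str:
    """Hypothetical navigation sub-agent."""
    return f"opened {target}"


def extract(field: str) -> str:
    """Hypothetical data-extraction sub-agent."""
    return f"extracted {field}"


class Scout:
    """Toy orchestrator that routes plan steps to registered sub-agents."""

    def __init__(self, sub_agents: Dict[str, SubAgent]):
        self.sub_agents = sub_agents

    def run(self, plan: List[Tuple[str, str]]) -> List[str]:
        results = []
        for name, arg in plan:
            agent = self.sub_agents.get(name)
            if agent is None:
                results.append(f"no sub-agent for {name}")
                continue
            results.append(agent(arg))
        return results


def monitor(observations: List[str], scout: Scout,
            plan: List[Tuple[str, str]]) -> List[List[str]]:
    """Ambient loop: stay passive until an observation differs from the last."""
    last = None
    reports = []
    for obs in observations:
        if last is not None and obs != last:
            reports.append(scout.run(plan))  # switch from watching to acting
        last = obs
    return reports


scout = Scout({"navigate": navigate, "extract": extract})
plan = [("navigate", "example.com/pricing"), ("extract", "price")]
reports = monitor(["$10", "$10", "$12"], scout, plan)
print(reports)
```

Here the plan runs once, when the observed value changes from "$10" to "$12". Keeping sub‑agents behind a simple name‑to‑callable registry is one way to let new tools be added without touching the orchestration loop.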

Episode Description

Today, we're joined by Devi Parikh, co-founder and co-CEO of Yutori, to discuss browser use models and a future where we interact with the web through proactive, autonomous agents. We explore the technical challenges of creating reliable web agents, the advantages of visually-grounded models that operate on screenshots rather than the browser’s more brittle document object model, or DOM, and why this counterintuitive choice has proven far more robust and generalizable for handling complex web interfaces. Devi also shares insights into Yutori’s training pipeline, which has evolved from supervised fine-tuning to include rejection sampling and reinforcement learning. Finally, we discuss how Yutori’s “Scouts” agents orchestrate multiple tools and sub-agents to handle complex queries, the importance of background, "ambient" operation for these systems, and what the path looks like from simple monitoring to full task automation on the web.

The complete show notes for this episode can be found at https://twimlai.com/go/756.
