DataFramed

#358 How AI Agents Will Work While You Sleep | Ruslan Salakhutdinov, Professor at Carnegie Mellon

May 4, 2026 • 58 min

Why It Matters

Understanding how AI agents can operate for extended periods and manage complex workflows is crucial for businesses looking to automate routine work and boost productivity. As these systems become more reliable, they will reshape job roles, allowing workers to focus on higher‑value decisions while agents handle repetitive tasks, making the discussion timely for anyone navigating the AI‑driven future of work.

Key Takeaways

• AI agents now handle coding tasks autonomously.
• Long-horizon agents can run for hours, approaching days.
• Reward design shifts from binary to rubric-based intermediate feedback.
• Multi-agent architectures enable task parallelism and cheaper sub-models.
• Automation promises overnight experiment fixes and streamlined job searches.

Pulse Analysis

The conversation with Carnegie Mellon professor Ruslan Salakhutdinov highlights how AI agents have moved from experimental demos to practical tools that can write code, fill in forms, and browse the web. Recent releases from Anthropic, OpenAI, and other frontier labs show agents achieving 45-50% success on complex tasks, a level that rivals early human-in-the-loop workflows. By automating routine computer-use activities, these systems free professionals to focus on higher-value decisions, while the underlying models benefit from larger training datasets and improved reasoning pipelines. This shift toward autonomy is reshaping productivity across software development, data analysis, and everyday office work.

Despite the progress, extending agent operation to long‑horizon tasks remains a research bottleneck. Tasks that last several hours—or eventually days—require more than a single binary reward; they need intermediate signals that tell the model which steps are correct. Salakhutdinov describes the emerging “rubric‑based” approach, where partial credit is assigned to sub‑tasks, addressing the classic credit‑assignment problem in reinforcement learning. By providing richer feedback, agents can backtrack, retry, and refine their plans, moving beyond simple unit‑test verification toward robust, multi‑step problem solving in domains such as scientific computing and complex web workflows.
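The contrast between a binary reward and rubric-based partial credit can be sketched in a few lines of Python. This is a minimal illustration, not the method described in the episode: the rubric items, their weights, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One checkable sub-task and its weight (both hypothetical)."""
    name: str
    weight: float


def binary_reward(rubric, passed):
    # All-or-nothing: 1.0 only if every rubric item passed.
    return 1.0 if all(passed.get(item.name, False) for item in rubric) else 0.0


def rubric_reward(rubric, passed):
    # Partial credit: weighted fraction of rubric items that passed.
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric if passed.get(item.name, False))
    return earned / total


# Assumed rubric for an overnight experiment-running task.
RUBRIC = [
    RubricItem("loads_dataset", 1.0),
    RubricItem("trains_without_crash", 2.0),
    RubricItem("reports_metric", 1.0),
]

# The agent completed the first two steps but failed the last one.
trajectory = {
    "loads_dataset": True,
    "trains_without_crash": True,
    "reports_metric": False,
}

print(binary_reward(RUBRIC, trajectory))  # 0.0
print(rubric_reward(RUBRIC, trajectory))  # 0.75
```

Under the binary reward, this mostly successful run is indistinguishable from a run that failed at the first step. The rubric score keeps the information about which steps went right, which is the intermediate signal the paragraph above refers to.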

The next wave focuses on multi‑agent orchestration, where a powerful planner delegates subtasks to smaller, cheaper models that execute locally. This swarm‑like architecture enables parallel processing, reduces inference costs, and improves scalability for enterprise deployments. For businesses, the payoff is tangible: overnight experiment monitoring, automated job‑search aggregation, and continuous form‑filling without human supervision. As agents become reliable enough to run while employees sleep, organizations can capture hidden productivity, cut compute waste, and accelerate decision cycles. The convergence of better reward structures and coordinated multi‑agent systems signals a near‑term transition toward truly autonomous digital assistants in the workplace.
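The planner-and-workers pattern can be sketched with Python's standard library. This is an illustration under stated assumptions only: `plan` and `cheap_worker` are hypothetical stand-ins for model calls (no real model or vendor API is invoked), and the semicolon-splitting "planner" is a placeholder for whatever a strong model would actually do.

```python
from concurrent.futures import ThreadPoolExecutor


def plan(task: str) -> list[str]:
    # Hypothetical stand-in for the expensive planner model:
    # here it simply splits the task description on semicolons.
    return [part.strip() for part in task.split(";") if part.strip()]


def cheap_worker(subtask: str) -> str:
    # Hypothetical stand-in for a smaller, cheaper model executing one subtask.
    return f"done: {subtask}"


def orchestrate(task: str, max_workers: int = 4) -> list[str]:
    # The planner decomposes the task once; workers run the subtasks in parallel.
    subtasks = plan(task)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the submitted subtasks.
        return list(pool.map(cheap_worker, subtasks))


results = orchestrate("check experiment logs; restart failed job; email summary")
for line in results:
    print(line)
```

The design point is that the costly call happens once, in `plan`, while the many `cheap_worker` calls run concurrently. A real system would also need retries, result verification, and a way for the planner to revise the plan when a worker fails.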

Episode Description

Almost every AI agent demo lands in roughly the same place: it works most of the time, looks remarkable, and then fails in a way no one anticipated. Self-driving cars hit this wall a decade ago, and agents are running into it now. For data and AI teams, the question is no longer whether agents can complete a task — it's whether they can complete it reliably enough to remove the human reviewer. Which categories of work tolerate a 90% success rate? Which absolutely don't? And where should the next layer of guardrails sit?

Ruslan Salakhutdinov is a UPMC Professor of Computer Science at Carnegie Mellon University and one of Geoffrey Hinton's former PhD students. He has previously served as Director of AI Research at Apple and VP of Research in Generative AI at Meta. His research focuses on deep learning, reasoning, and AI agents.

In the episode, Richie and Russ explore the most exciting use cases of AI agents today, long-horizon tasks, the credit-assignment problem, multi-agent systems, designing reliable human-in-the-loop workflows, agent safety and guardrails, embodied and physical AI, lessons from self-driving cars, the difference between academia and industry, and much more.

Links Mentioned in the Show:

• Claude Code (Anthropic)

• Yutori

• Waymo

• Apple Project Titan

• DeepSeek-V3 Technical Report

• Kimi K2 Technical Report

• Connect with Ruslan: LinkedIn

• AI-Native Course: Intro to AI for Work

• Related Episode: AI Agents at Work: What Actually Breaks (and How to Fix It) with Danielle Crop
