AI Dev 26 X SF | Ara Khan: Evals Are Broken Use Them Anyway

DeepLearning.AI
May 22, 2026

Why It Matters

For AI teams and product leaders, this reframes evals from showy marketing metrics into actionable tools for development and risk management: used properly, evals can guide model selection, debugging, and deployment decisions; misused, they can mislead engineering and business choices. Treating benchmarks with nuance helps prioritize real-world performance and user experience over headline scores.

Summary

Speaker Ara Khan argues that AI evals—particularly for coding agents—are often misunderstood: they’re neither gospel nor worthless. She identifies two flawed extremes—blind faith in benchmark scores from model labs and pure “taste”-based judgments—and urges a pragmatic middle path using heuristics. Khan recommends using evals to interpret others’ results, to iteratively improve your own agents, and to build bespoke evals only if you have the resources. She emphasizes treating public benchmarks as approximate signals, staying current without reflexively chasing earliest results, and integrating evals into agent workflows rather than relying on them alone.

Original Description

In this talk, Cline's Ara Khan explains why they went from "evals are useless" to using them as a core part of their agent improvement loop. Khan shares practical heuristics for interpreting, running, and creating evals, and explains why running them anyway is better than relying on pure "vibes".
