AI Dev 26 X SF | Ara Khan: Evals Are Broken Use Them Anyway
Why It Matters
For AI teams and product leaders, this reframes evals from showy marketing metrics into actionable tools for development and risk management: used properly, evals can guide model selection, debugging, and deployment decisions; misused, they can mislead engineering and business choices. Treating benchmarks with nuance helps prioritize real-world performance and user experience over headline scores.
Summary
Speaker Ara Khan argues that AI evals, particularly for coding agents, are often misunderstood: they’re neither gospel nor worthless. She identifies two flawed extremes, blind faith in benchmark scores from model labs and pure “taste”-based judgments, and urges a pragmatic middle path using heuristics. Khan recommends using evals to interpret others’ results and to iteratively improve your own agents, and building bespoke evals only if you have the resources. She emphasizes treating public benchmarks as approximate signals, staying current without reflexively chasing the earliest results, and integrating evals into agent workflows rather than relying on them alone.