Blog • Apr 16, 2026
Open-World Evaluations for Measuring Frontier AI Capabilities
The paper introduces “open‑world evaluations,” a new class of AI testing that places agents in messy, real‑world tasks rather than tidy benchmarks. It surveys ten recent experiments, outlines best practices, and launches the CRUX collaboration of 17 researchers to run such evaluations regularly. In CRUX’s first study, an AI agent built and published an iOS app to the Apple App Store, incurring roughly $1,000 in costs and making only two errors, one of which required a manual fix. The authors argue that these evaluations provide early warnings of emerging capabilities, such as automated app‑store spam.
By AI as Normal Technology