Blog•Mar 16, 2026
The Free Willy Test: Which AIs Will Help Me Steal An Orca?
The author argues that conventional AI benchmarks focus on abstract tasks like coding or exams and ignore how everyday users actually interact with conversational models. He introduces the "Free Willy Test", a harmless scenario about stealing an orca, to expose "false refusals" caused by over-cautious safety guardrails. The post labels this over-engineering "safety mission creep" and shows how it erodes trust and stifles creative brainstorming. By testing whether models will comply with benign requests, developers can gauge a system’s practical usefulness beyond raw benchmark scores.