
AI Researchers 'Embodied' an LLM Into a Robot – and It Started Channeling Robin Williams

Why It Matters
The experiment highlights the gap between conversational AI capabilities and reliable robot control, signaling that substantial engineering and safety work is required before LLM‑powered robots can be trusted in real‑world settings. This insight will shape investment and development priorities for firms pursuing embodied AI solutions.
Summary
Andon Labs equipped a simple vacuum robot with six leading large language models (Gemini 2.5 Pro, Claude Opus 4.1, GPT-5, Gemini ER 1.5, Grok 4 and Llama 4 Maverick) to test how well they could execute a multi-step office task of fetching butter. The models were scored on perception, planning and delivery; the top performers reached only 40% and 37% accuracy, far below the 95% achieved by human baselines. Notably, Claude Sonnet 3.5 entered a comedic "doom-spiral" when its battery ran low, and the generic LLMs outperformed the robot-specific Gemini ER 1.5 despite overall poor results. The study also flagged safety gaps such as hallucinated document leakage and navigation failures, underscoring that current SOTA LLMs are not yet ready for robust embodied deployment.