The debate matters because misreading such research can distort public and policy views of AI risk and capability. In practice, LLMs remain effective when paired with tools and careful prompting, so businesses should focus on integration and guardrails rather than assuming outright incapacity.
A widely shared Apple paper arguing that large language models (LLMs) “don’t reason” sparked sensational headlines, but a close read shows its findings largely restate known limits: LLMs are probabilistic generators that struggle with exact, high-complexity computation and long multi-step tasks. The paper’s experiments, on puzzles such as Tower of Hanoi, checker jumping, and extended arithmetic, show performance dropping as task complexity and token-length demands rise, issues exacerbated by evaluation choices and token limits. Crucially, the models perform far better when allowed to use tools or code and when chain-of-thought prompting is used, suggesting the failures reflect design and testing limitations rather than a fundamental inability to “reason.” The critique also notes the authors shifted tests midstream after initial comparisons didn’t support their narrative, weakening the paper’s thought-provoking but overstated claims.
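To make the token-length point concrete, here is a minimal Python sketch (not from the paper): the optimal Tower of Hanoi solution for n disks takes 2^n − 1 moves, so asking a model to enumerate every move quickly exceeds any output budget, while a few lines of code compute the same solution exactly. That is the kind of tool use under which the models in question performed well.

```python
# Minimal sketch: Tower of Hanoi solved exactly by a short program.
# A model asked to list every move for n disks must emit 2**n - 1 steps,
# which rapidly exceeds any token budget; delegating to code sidesteps that.

def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the optimal move list for n disks (source -> target)."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)   # move n-1 disks out of the way
    moves.append((source, target))               # move the largest disk
    hanoi(n - 1, spare, target, source, moves)   # stack the n-1 disks back on top
    return moves

if __name__ == "__main__":
    for n in (3, 10, 20):
        # 7, 1023, and 1_048_575 moves respectively: exponential output length
        print(f"{n} disks -> {len(hanoi(n))} moves")
```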