LLMs Crush Coding and Math but Choke on Casual Questions, and That's Not a Contradiction

THE DECODER
Apr 10, 2026

Why It Matters

The divergence signals where AI investment will yield immediate productivity gains—high‑value, verifiable workloads—while highlighting the need for better feedback mechanisms in broader, less structured use cases.

Key Takeaways

  • High‑tier LLMs solve codebase restructuring in hours.
  • Same models stumble on simple everyday questions.
  • Verifiable tasks accelerate AI reinforcement learning gains.
  • Coding and math see measurable improvements versus fuzzy writing tasks.
  • OpenAI’s rumored universal verifier remains unreleased.

Pulse Analysis

The public’s perception of AI often hinges on free, consumer‑grade chat interfaces that still produce hallucinations and miss basic facts. Andrej Karpathy argues that this view is outdated; enterprise‑grade models, accessed through specialized coding agents like Codex or Claude Code, now handle intricate programming challenges, from refactoring massive repositories to probing security flaws. These capabilities stem from recent model scaling and targeted reinforcement‑learning pipelines that reward concrete, testable outcomes, underscoring a widening gap between hobbyist and professional AI experiences.

At the heart of this gap lies verifiability. In software‑centric tasks, correctness can be automatically validated—unit tests, compiler errors, or security scans provide immediate pass/fail signals. Such clear feedback loops allow reinforcement learning to fine‑tune models efficiently, a principle Karpathy calls the "Software 2.0" paradigm. By contrast, open‑ended domains like casual conversation or creative writing lack objective metrics, making it harder to apply the same training rigor. This explains why coding and mathematics have surged ahead, while conversational fluency still wrestles with inconsistency.
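The pass/fail loop described above can be made concrete with a minimal sketch: run a model‑proposed solution against unit tests and emit a binary reward. All names here are illustrative assumptions, not part of any specific RL pipeline, but the core idea holds: correctness is checked mechanically, so no human judgment is needed to score the output.

```python
import subprocess
import sys
import tempfile

def unit_test_reward(candidate_code: str, test_code: str) -> float:
    """Return 1.0 if the candidate passes all tests, else 0.0.

    A toy stand-in for the verifiable feedback signal used when
    training models on coding tasks: the tests act as an automatic
    judge, producing an unambiguous pass/fail outcome.
    """
    program = candidate_code + "\n" + test_code
    # Write the combined program to a temp file and execute it in a
    # subprocess; a nonzero exit code means a test assertion failed.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    result = subprocess.run([sys.executable, path],
                            capture_output=True, timeout=10)
    return 1.0 if result.returncode == 0 else 0.0

# Example: a model-proposed function and tests that verify it.
candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(unit_test_reward(candidate, tests))  # → 1.0
```

By contrast, there is no analogous `creative_writing_reward` one could write: a poem or a casual answer has no assertion that cleanly returns pass or fail, which is exactly the asymmetry the paragraph describes.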

For businesses, the takeaway is strategic: prioritize AI deployment in areas with measurable outcomes—code generation, data analysis, and quantitative research—where ROI can be quantified. Meanwhile, the industry watches for a universal verifier that could extend reinforcement‑learning benefits to less deterministic tasks; OpenAI’s rumored tool has yet to materialize, and recent leadership exits hint at internal challenges. Until such feedback mechanisms mature, the most reliable AI gains will continue to emerge from domains where success is objectively verifiable.
