AI Chatbots Miss Initial Diagnoses 80% of the Time: Mass General Brigham Study

Becker’s Hospital Review, Apr 13, 2026

Why It Matters

The findings expose a critical gap in AI’s ability to perform early diagnostic reasoning, limiting its safe deployment in frontline patient care and underscoring the need for human oversight.

Key Takeaways

  • 21 AI chatbots failed at differential diagnosis in over 80% of cases
  • Final‑diagnosis error rates fell below 40%, reaching 9% at best when more data were supplied
  • Study used 29 MSD Manual cases, 16,254 responses, no web search
  • AI models lack reasoning needed for safe, unsupervised clinical use
  • Targeted, clinician‑supervised deployment recommended for low‑uncertainty tasks

Pulse Analysis

The rapid rise of generative AI has sparked optimism that chatbots could soon assist physicians in real‑time decision making. Yet the Mass General Brigham analysis, encompassing 21 models and over sixteen thousand responses, reveals that the technology still falters at the most fundamental step of patient evaluation: generating a plausible differential diagnosis. By disabling web‑search capabilities and feeding the models standardized case vignettes from the MSD Manual, researchers isolated pure language‑model performance, showing that even the most advanced versions miss the correct diagnosis list in eight out of ten cases.

This shortfall matters because differential diagnosis is the cornerstone of safe clinical reasoning. An inaccurate or incomplete list can delay critical testing, misguide treatment, and erode patient trust. The study’s secondary findings—final‑diagnosis error rates dropping to as low as 9% when more data are supplied—suggest that AI can be useful once the information landscape is narrowed, but it remains unreliable for the open‑ended, high‑uncertainty scenarios that dominate primary care and emergency medicine. Compared with earlier research, the persistent weakness indicates that scaling model size alone does not resolve the underlying lack of clinical reasoning.

Given these constraints, the authors advocate a measured rollout: AI tools should be confined to low‑uncertainty tasks such as documentation assistance, guideline retrieval, or triage support under direct clinician supervision. Regulatory bodies and health systems will need to embed robust validation pipelines and continuous monitoring to prevent overreliance. Future advancements must focus on integrating structured medical knowledge and causal reasoning, moving beyond pattern matching toward true clinical cognition.
