
Which Is The Best AI For Medical Questions? Here’s The Winner
Healthcare · AI · HealthTech


Forbes – Healthcare • March 4, 2026

Why It Matters

The results reshape how health systems evaluate AI decision‑support tools, directly influencing patient safety, regulatory scrutiny, and market adoption of medical AI platforms.

Key Takeaways

  • AMBOSS LiSA 1.0 leads with 62.3% accuracy.
  • AI models still produce harmful advice in 22% of cases.
  • Safety peaks at moderate restraint; excess caution increases harm.
  • Multi‑agent AI combos boost top‑quartile safety sixfold.
  • The best AI beats internists by over 15 points.

Pulse Analysis

The rapid diffusion of large language models into clinical workflows has outpaced traditional validation methods that rely on multiple‑choice exams. Recognizing this gap, researchers from Stanford, Harvard and partner institutions built a dataset of 100 authentic consult questions drawn from Stanford Health Care’s electronic records. Each case was annotated by 29 specialist physicians, generating over 12,000 expert judgments that captured not only diagnostic accuracy but also the appropriateness and potential harm of every recommended action. This real‑world benchmark, dubbed NOHARM, provides a far more stringent yardstick for medical AI than prior knowledge‑only tests.

When the 31 evaluated models were scored against the NOHARM dataset, AMBOSS LiSA 1.0 emerged as the overall leader with a 62.3% alignment to expert recommendations, closely followed by Google’s Gemini 2.5 Pro and OpenAI’s GPT‑5. However, the study revealed a striking safety‑restraint paradox: models that were either too aggressive or overly conservative generated the most severe errors. The safest systems occupied a middle ground, balancing completeness with restraint. Notably, configurations that paired a primary AI with one or two “guardian” models—creating an automated second opinion—improved top‑quartile safety performance by nearly six times, especially when the agents came from diverse vendors.
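The "guardian" configuration described above amounts to an automated second opinion: a primary model answers, and one or two independent models must concur before the recommendation is auto-accepted. A minimal sketch of that pattern follows; the model callables, the `second_opinion` helper, and the agree/disagree check are all illustrative assumptions, not the study's actual evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable, List

# A "model" here is simply a callable mapping a question to a recommended
# action. Real systems would wrap vendor APIs; these are toy stand-ins.
Model = Callable[[str], str]

@dataclass
class Verdict:
    action: str          # the primary model's recommendation
    approved: bool       # True only if every guardian concurs
    dissents: List[str]  # names of guardian models that disagreed

def second_opinion(question: str, primary: Model,
                   guardians: List[Model]) -> Verdict:
    """Ask the primary model, then have each guardian answer independently.

    Any disagreement withholds approval, routing the case to human
    review instead of auto-acceptance (hypothetical policy).
    """
    action = primary(question)
    dissents = [g.__name__ for g in guardians if g(question) != action]
    return Verdict(action=action, approved=not dissents, dissents=dissents)

# Toy vendor-diverse models for illustration only.
def model_a(q: str) -> str: return "order troponin"
def model_b(q: str) -> str: return "order troponin"
def model_c(q: str) -> str: return "discharge home"

verdict = second_opinion("chest pain workup?", model_a, [model_b, model_c])
print(verdict.approved, verdict.dissents)  # one guardian dissents
```

Diversity matters in this design: the study found the safety gain was largest when the guardian agents came from different vendors, presumably because correlated models make correlated mistakes.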

For health‑tech investors and providers, these insights carry immediate strategic implications. AI tools that integrate curated medical knowledge bases and adopt moderate restraint are poised to earn clinician trust and regulatory approval, while open‑source or purely generative models may require additional safety layers. The public NOHARM leaderboard sets a new standard for continuous, transparent evaluation as models evolve, encouraging vendors to iterate toward safer, more reliable decision‑support. Ultimately, the study suggests that AI will augment—not replace—physicians, offering a powerful safety net when deployed within robust, multi‑agent frameworks that prioritize both accuracy and patient protection.


Read Original Article