AI Legal Research Startup Descrybe Launches ‘Legal Reasoning’ Tool; Says It Outperforms ChatGPT, Claude, And Gemini On Bar Exam Benchmark
Above the Law
Mar 12, 2026

Why It Matters

A perfect benchmark score suggests that purpose‑built legal AI can deliver more reliable reasoning than general‑purpose models, potentially reducing verification costs for law firms and reshaping vendor selection.

Key Takeaways

  • DescrybeLM scored 100% on a 200‑question NCBE MBE benchmark
  • General‑purpose models missed 13‑23 questions (88.5‑93.5% accuracy)
  • Overconfidence flagged in ChatGPT, Claude, Gemini; none in DescrybeLM
  • Benchmark methodology and data released for public replication
  • Purpose‑built legal AI reduces "confidently wrong" risk

Pulse Analysis

The legal technology sector has seen a surge of foundation models repurposed for law, yet most vendors rely on generic large language models that were not trained on structured jurisprudence. Descrybe’s launch of DescrybeLM marks a deliberate shift toward purpose‑built AI, drawing on a curated corpus of over 100 million primary‑law records and a preprocessing pipeline spanning more than 100 billion tokens. By positioning the engine as a reasoning and drafting workspace rather than a simple search tool, the company aims to deliver authority‑grounded analysis that meets the rigorous standards of bar‑exam evaluation.

The benchmark against ChatGPT 5.2, Claude Opus 4.5, and Gemini 3 Pro used 200 NCBE multiple‑choice items, a standard proxy for legal reasoning skill. DescrybeLM answered every question correctly, while the general‑purpose models achieved 88.5‑93.5% accuracy and produced 52 incorrect answers between them, 49 of which were flagged as “confidently wrong.” That pattern forces attorneys to spend additional time verifying fluent but inaccurate prose. Notably, DescrybeLM recorded zero overconfidence flags, suggesting its confidence signals are better calibrated, a critical advantage when legal decisions hinge on precise rule application.

By publishing its full methodology, scoring rubric, and per‑output logs, Descrybe invites independent replication, a rare practice in the fast‑moving legal AI space. Replication will test whether the observed performance gap holds across different question sets, jurisdictions, and model updates, and will help the industry establish transparent benchmarks. If purpose‑built systems consistently outperform foundation models on reasoning tasks, law firms and corporate legal departments may prioritize specialized vendors, reshaping procurement strategies and accelerating adoption of AI that reduces costly verification cycles. The move also pressures larger AI providers to deepen their legal data pipelines.