Databricks' OfficeQA Uncovers Disconnect: AI Agents Ace Abstract Tests but Stall at 45% on Enterprise Docs

VentureBeat • December 9, 2025

Companies Mentioned

Databricks

OpenAI

Why It Matters

OfficeQA exposes a critical gap between headline AI benchmark scores and actual enterprise document workflows, indicating that current agents are not yet reliable for high‑stakes business decisions. The benchmark gives companies a concrete tool to assess and improve AI pipelines before deployment.

Key Takeaways

•AI agents score <45% on raw enterprise PDFs.
•Pre‑parsed documents raise accuracy to ~68% for Claude and 53% for GPT.
•Parsing errors, document versioning, and visual reasoning are the primary gaps.
•OfficeQA uses Treasury Bulletins to mimic complex corporate docs.
•Benchmark enables targeted parsing improvements without human labeling.

Pulse Analysis

The AI community has long celebrated impressive scores on benchmarks like Humanity's Last Exam, ARC‑AGI, and GDPval, yet those tests focus on abstract reasoning or niche tasks. Databricks recognized that enterprises spend the majority of their AI budget on extracting insights from sprawling, unstructured document corpora. By curating a 246‑question suite from Treasury Bulletins spanning decades, rich in scanned pages, nested tables, and charts, OfficeQA mirrors the messiness of real corporate data while remaining publicly accessible for research.

Performance results are sobering. Claude Opus 4.5 and GPT‑5.1 agents score below 45% on raw PDFs, a figure that rises to roughly 68% for Claude and 53% for GPT once the documents are pre‑parsed with Databricks’ ai_parse_document. This gap points to parsing, rather than the models’ reasoning capabilities, as the primary bottleneck. Agents also stumble on versioned documents and visual chart queries, gaps that could lead to costly misinterpretations in finance, compliance, or supply‑chain analytics. Because the benchmark ships with ground‑truth answers, it also supports reinforcement‑learning loops without human labeling, giving vendors a way to iterate rapidly on parsing pipelines and retrieval strategies.
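To make the "ground‑truth answers without human labeling" point concrete, here is a minimal Python sketch of an automatic scorer that could double as a reward signal. It is an illustration only: the article does not describe OfficeQA's actual scoring rule, so the function names (reward, accuracy), the 1% numeric tolerance, and the string normalization are all assumptions.

```python
import math
import re

# Matches a single plain number such as "12", "-3.5" or "1234.50".
_NUM = re.compile(r"-?\d+(?:\.\d+)?")


def _to_number(text):
    """Return text as a float if it is one plain number, else None.

    Thousands separators, a leading "$" and a trailing "%" are ignored.
    (Hypothetical normalization, not taken from the benchmark.)
    """
    cleaned = text.strip().replace(",", "").lstrip("$").rstrip("%")
    return float(cleaned) if _NUM.fullmatch(cleaned) else None


def reward(predicted, truth, rel_tol=0.01):
    """Return 1.0 if the prediction matches the ground truth, else 0.0.

    Numeric answers are compared with a relative tolerance (assumed 1%);
    everything else falls back to a case-insensitive exact match.
    """
    p, t = _to_number(predicted), _to_number(truth)
    if p is not None and t is not None:
        return 1.0 if math.isclose(p, t, rel_tol=rel_tol) else 0.0
    return 1.0 if predicted.strip().lower() == truth.strip().lower() else 0.0


def accuracy(pairs):
    """Mean reward over (predicted, truth) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no answers to score")
    return sum(reward(p, t) for p, t in pairs) / len(pairs)
```

Because the check needs only the stored answer, the same function can grade a nightly evaluation run or score rollouts in a training loop. The trade-off is that a strict matcher like this one will reject correct answers phrased in an unexpected format.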

For enterprises, OfficeQA serves as a reality check before large‑scale deployment. Companies should first audit their document complexity against the Treasury Bulletins profile, then allocate resources to custom OCR and table‑extraction solutions rather than relying on generic APIs. Continuous evaluation with OfficeQA can surface failure modes early, allowing human oversight where agents still lag, especially on multi‑step calculations or visual reasoning. As the industry pivots toward grounded AI, benchmarks that reflect true business workloads will become essential levers for competitive advantage.
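One way to act on the continuous‑evaluation advice is to break accuracy down by question type and route weak categories to human review. The sketch below assumes each graded result is a dict with a "category" label and a boolean "correct" flag. Those field names, the category labels in the usage example, and the 0.8 threshold are hypothetical and not part of OfficeQA.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """Map each category label to the fraction of its questions answered correctly."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        hits[r["category"]] += int(r["correct"])
    return {cat: hits[cat] / totals[cat] for cat in totals}


def needs_human_review(results, threshold=0.8):
    """Return the categories whose accuracy falls below the (assumed) threshold."""
    by_cat = accuracy_by_category(results)
    return sorted(cat for cat, acc in by_cat.items() if acc < threshold)


# Made-up results for illustration only; not benchmark data.
sample = [
    {"category": "table_lookup", "correct": True},
    {"category": "table_lookup", "correct": True},
    {"category": "multi_step_calculation", "correct": True},
    {"category": "multi_step_calculation", "correct": False},
    {"category": "visual_reasoning", "correct": False},
]
flagged = needs_human_review(sample)
```

With the sample above, the multi‑step calculation and visual reasoning categories fall below the threshold and are flagged, while table lookups pass.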