
A Modest Proposal: Reformat Everything to Make Documents More Palatable to AI
Why It Matters
DocLang promises to slash token consumption and improve accuracy, directly lowering AI operating costs for enterprises that ingest large volumes of documents. Its open‑standard approach could reshape how companies build scalable, reliable AI document‑processing workflows.
Key Takeaways
- DocLang is an open XML‑based format optimized for LLM tokenizers.
- IBM’s Docling toolkit converts PDFs, HTML, etc., into AI‑ready data.
- Benchmarks show 4×‑30× token cost reduction versus traditional PDFs.
- Standard preserves layout, tables, formulas, and provenance metadata.
- Early adopters include IBM, NVIDIA, Red Hat, ABBYY, HumanSignal, Forgis.
Pulse Analysis
Enterprises are rapidly discovering that traditional document formats (PDF, Markdown, HTML, LaTeX) are ill‑suited for large‑language‑model ingestion. These formats were engineered for human rendering, so they drop or obscure the semantic cues and structural relationships that AI models need to interpret content accurately. As organizations scale AI‑driven knowledge extraction, the token overhead of parsing PDFs can become a hidden expense, especially when using high‑cost frontier models. The industry’s response has been a patchwork of custom parsers, each adding engineering debt and increasing hallucination risk.
DocLang, spearheaded by a Linux Foundation working group, tackles the problem at its core by defining a minimal XML vocabulary designed to tokenize compactly under LLM tokenizers. The format builds on IBM’s open‑source Docling toolkit, which converts legacy files into a lossless, AI‑native representation that retains tables, formulas, charts and provenance metadata. Early benchmarks from ABBYY show input token counts dropping from 8,421 to 5,310 for a typical annual report, a reduction of about 37%, with latency improving from 4.2 seconds to 2.7 seconds. That single example works out to roughly a 1.6‑fold saving; the 4‑to‑30‑fold token cost reductions reported against traditional PDFs are a separate, larger claim, and savings on that scale would make large‑scale document ingestion financially viable and technically more reliable.
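To put the quoted figures in proportion, the arithmetic can be sketched in a few lines of Python. The token and latency numbers are the ABBYY figures cited above; the per‑token price and the document volume are hypothetical values chosen only for illustration and do not come from any vendor or from the benchmark.

```python
# Figures quoted in the article (ABBYY benchmark, one annual report).
TOKENS_BEFORE = 8_421
TOKENS_AFTER = 5_310
LATENCY_BEFORE_S = 4.2
LATENCY_AFTER_S = 2.7

# ASSUMPTIONS for illustration only; neither value appears in the article.
PRICE_PER_MILLION_INPUT_TOKENS_USD = 3.00
DOCUMENTS_PER_MONTH = 100_000


def reduction_factor(before: float, after: float) -> float:
    """How many times smaller 'after' is than 'before' (e.g. 2.0 means halved)."""
    return before / after


def percent_saved(before: float, after: float) -> float:
    """Share of the original quantity that was eliminated, in percent."""
    return 100.0 * (before - after) / before


def monthly_input_cost_usd(tokens_per_document: int) -> float:
    """Input-token cost per month under the assumed price and volume."""
    total_tokens = tokens_per_document * DOCUMENTS_PER_MONTH
    return total_tokens * PRICE_PER_MILLION_INPUT_TOKENS_USD / 1_000_000


token_factor = reduction_factor(TOKENS_BEFORE, TOKENS_AFTER)
token_saving_pct = percent_saved(TOKENS_BEFORE, TOKENS_AFTER)
latency_saving_pct = percent_saved(LATENCY_BEFORE_S, LATENCY_AFTER_S)

cost_before = monthly_input_cost_usd(TOKENS_BEFORE)
cost_after = monthly_input_cost_usd(TOKENS_AFTER)

# Token count this document would need to reach the low end of the 4x-30x claim.
tokens_needed_for_4x = TOKENS_BEFORE / 4

print(f"Token reduction factor: {token_factor:.2f}x")
print(f"Tokens saved:           {token_saving_pct:.1f}%")
print(f"Latency saved:          {latency_saving_pct:.1f}%")
print(f"Assumed monthly cost:   ${cost_before:,.2f} -> ${cost_after:,.2f}")
print(f"Tokens needed for 4x:   {tokens_needed_for_4x:,.0f}")
```

Under these assumptions the annual‑report example saves a little over a third of the input bill, while reaching even the low end of the 4×‑30× range would require the same document to fit in about 2,105 tokens.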
Beyond immediate cost benefits, DocLang’s open‑standard model invites broad ecosystem participation, encouraging vendors and enterprises to build interoperable tools without licensing barriers. By preserving document governance data, it also addresses regulatory and compliance concerns that often arise when metadata is stripped during OCR. As AI becomes a central layer in enterprise knowledge management, a standardized, token‑efficient format could become as foundational as PDF was for digital publishing, reshaping how businesses automate insight extraction from their ever‑growing document repositories.