The guide argues that open models offer better cost efficiency and privacy, and gives firms concrete criteria for selecting and deploying OCR pipelines that preserve layout, reduce hallucinations, and integrate with LLMs, which is key for automating document workflows and analytics.
TL;DR
The rise of powerful vision‑language models has transformed document AI. Each model comes with unique strengths, making it tricky to choose the right one. Open‑weight models offer better cost efficiency and privacy. To help you get started with them, we’ve put together this guide.
In this guide, you’ll learn:
The landscape of current models and their capabilities
When to fine‑tune models vs. use models out‑of‑the‑box
Key factors to consider when selecting a model for your use case
How to move beyond OCR with multimodal retrieval and document QA
By the end, you’ll know how to choose the right OCR model, start building with it, and gain deeper insights into document AI. Let’s go!
Optical Character Recognition (OCR) is one of the earliest and longest‑running challenges in computer vision. Many of AI’s first practical applications focused on turning printed text into digital form. With the surge of vision‑language models (VLMs), OCR has advanced significantly, and many recent OCR models are fine‑tunes of existing VLMs. Today’s capabilities extend far beyond OCR: you can retrieve documents by query or answer questions about them directly. Thanks to stronger vision features, these models can also handle low‑quality scans, interpret complex elements like tables, charts, and images, and fuse text with visuals to answer open‑ended questions across documents.
Recent models transcribe text into a machine‑readable format. The input can include:
Handwritten text
Various scripts like Latin, Arabic, and Japanese characters
Mathematical expressions
Chemical formulas
Image/Layout/Page number tags
OCR models convert them into machine‑readable text that comes in many different formats such as HTML, Markdown and more.
On top of text, some models can also recognize:
Images
Charts
Tables
Some models know where images sit inside the document, extract their coordinates, and insert them in the right place within the text. Other models generate captions for images and insert those where the images appear, which is especially useful if you are feeding the machine‑readable output into an LLM. Example models are OlmOCR by AllenAI and PaddleOCR‑VL by PaddlePaddle.
Models use different machine‑readable output formats, such as DocTags, HTML or Markdown (explained in the next section). The way a model handles tables and charts often depends on the output format it uses. Some models treat charts like images: they are kept as is. Other models convert charts into Markdown tables or JSON, e.g., a bar chart can be converted as follows.
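For illustration (a made‑up example rather than the output of any specific model), a bar chart showing revenue per quarter might come out as the following Markdown table:

| Quarter | Revenue (USD M) |
|---|---|
| Q1 | 12 |
| Q2 | 18 |
| Q3 | 25 |
| Q4 | 31 |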
Similarly for tables, cells are converted into a machine‑readable format while retaining context from headings and columns.
Different OCR models have different output formats. Briefly, here are the common output formats used by modern models.
DocTags: DocTags is an XML‑like format for documents that expresses location, text formatting, component‑level information, and more. This format is used by the open Docling models.
HTML: HTML is one of the most popular output formats used for document parsing as it properly encodes structure and hierarchical information.
Markdown: Markdown is the most human‑readable format. It’s simpler than HTML but not as expressive. For example, it can’t represent tables with cells that span multiple columns.
JSON: JSON is not a format that models use for the entire output, but it can be used to represent information in tables or charts.
The right model depends on how you plan to use its outputs:
| Use case | Recommended format | Reason |
|---|---|---|
| Digital reconstruction | DocTags or HTML | Preserves layout |
| LLM input or Q&A | Markdown with image captions | Natural language style |
| Programmatic use | JSON | Structured data for analysis |
Documents can have complex structures, such as multi‑column text blocks and floating figures. Older OCR pipelines handled them by detecting words first and then reconstructing the page layout in post‑processing to render the text in reading order, which is brittle. Modern OCR models, on the other hand, incorporate layout metadata that helps preserve reading order and accuracy. This metadata, often called an “anchor”, typically comes in the form of bounding boxes. The process is also known as grounding or anchoring, because tying the output to concrete locations on the page helps reduce hallucinations.
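To make the idea concrete, here is a minimal sketch of anchoring: it extracts word‑level bounding boxes from a born‑digital PDF with pdfplumber and builds a compact anchor string to prepend to a VLM prompt. Both the library choice and the anchor format are our own illustrative assumptions, not the internal pipeline of any particular model.

```python
# Minimal anchoring sketch: build layout metadata ("anchors") from a PDF page
# and prepend it to a VLM prompt. pdfplumber and the anchor format shown here
# are illustrative choices, not what any specific OCR model uses internally.
import pdfplumber

def build_anchor_text(pdf_path: str, page_index: int = 0, max_words: int = 200) -> str:
    """Return a compact "word @ (x, y)" listing for one page."""
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        words = page.extract_words()[:max_words]  # each word comes with its bounding box
        lines = [
            f"{w['text']} @ ({int(w['x0'])}, {int(w['top'])})"
            for w in words
        ]
        return f"Page size: {int(page.width)}x{int(page.height)}\n" + "\n".join(lines)

anchor = build_anchor_text("report.pdf")
prompt = (
    "Below is the text found in the PDF together with its coordinates. "
    "Use it only as a layout hint and transcribe the attached page image "
    "in natural reading order as Markdown.\n\n" + anchor
)
print(prompt[:500])
```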
OCR models take an image and, depending on the architecture and pre‑training setup, an optional text prompt. Some models support prompt‑based task switching: granite‑docling, for example, can parse an entire page with the prompt “Convert this page to Docling”, or convert a page full of formulas when prompted with “Convert this formula to LaTeX”. Other models are trained only to parse entire pages and are conditioned to do so through a system prompt; OlmOCR by AllenAI, for instance, takes a long conditioning prompt. Like many others, OlmOCR is technically an OCR fine‑tune of an existing VLM (Qwen2.5‑VL in this case), so you can prompt it for other tasks, but the performance will not be on par with its OCR capabilities.
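Here is a minimal sketch of prompt‑based task switching through transformers. The Hub model id and the loading classes are assumptions on our side (check the model card before use); the prompt itself is what selects the task.

```python
# Minimal sketch: prompt-based task switching with Granite-Docling via transformers.
# Model id and classes are assumptions based on the model card; the prompt selects the task.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "ibm-granite/granite-docling-258M"  # assumed Hub id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

page = Image.open("page.png").convert("RGB")

# Swap the text below for e.g. "Convert formula to LaTeX." to switch tasks.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this page to Docling."},
    ],
}]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[page], return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=4096)

new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])  # DocTags output
```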
We’ve seen an incredible wave of new models this past year. Because so much of this work happens in the open, these teams build on and benefit from each other’s work. A great example is AllenAI’s OlmOCR, which shipped not only a model but also the dataset used to train it, letting others build on that work in new directions. The field is incredibly active, but it’s not always obvious which model to use.
| Model Name | Output formats | Features | Model Size | Multilingual? |
|---|---|---|---|---|
| Nanonets‑OCR2‑3B | Structured Markdown with semantic tagging (plus HTML tables, etc.) | Captions images in the documents; Signature & watermark extraction; Handles checkboxes, flowcharts, and handwriting | 4 B | ✅ Supports English, Chinese, French, Arabic and more |
| PaddleOCR‑VL | Markdown, JSON, HTML tables and charts | Handles handwriting, old documents; Allows prompting; Converts tables & charts to HTML; Extracts and inserts images directly | 0.9 B | ✅ Supports 109 languages |
| dots.ocr | Markdown, JSON | Grounding; Extracts and inserts images; Handles handwriting | 3 B | ✅ Multilingual (supported languages not specified) |
| OlmOCR | Markdown, HTML, LaTeX | Grounding; Optimized for large‑scale batch processing | 8 B | ❎ English‑only |
| Granite‑Docling‑258M | DocTags | Prompt‑based task switching; Ability to prompt element locations with location tokens | 258 M | ✅ Supports English, Japanese, Arabic and Chinese |
| DeepSeek‑OCR | Markdown + HTML | Supports general visual understanding; Can parse and re‑render all charts, tables, and more into HTML; Handles handwriting | 3 B | ✅ Supports nearly 100 languages |
Here’s a small demo for you to try some of the latest models and compare their outputs.
There’s no single best model, as every problem has different needs. Should tables be rendered in Markdown or HTML? Which elements should we extract? How should we quantify text accuracy and error rates? There are many evaluation datasets and tools, but most of them don’t answer these questions, so we suggest using the following benchmarks:
OmniDocBench – This widely used benchmark stands out for its diverse document types: books, magazines, and textbooks. Its evaluation criteria are well designed, accepting tables in both HTML and Markdown formats. A novel matching algorithm evaluates reading order, and formulas are normalized before evaluation. Most metrics rely on edit distance, or tree edit distance for tables (a small sketch of the edit‑distance idea follows this list). Notably, the annotations used for evaluation are not solely human‑generated but are partly produced with state‑of‑the‑art VLMs or conventional OCR methods.
OlmOCR‑Bench – OlmOCR‑Bench takes a different approach: they treat the evaluation as a set of unit tests. For example, table evaluation is done by checking the relation between selected cells of a given table. They use PDFs from public sources, and annotations are done using a wide range of closed‑source VLMs.
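As referenced above, here is a small, self‑contained sketch of a character‑level normalized edit‑distance score, the metric OmniDocBench’s text scores build on. It is our own illustration, not the benchmark’s exact implementation, which also normalizes formulas, matches reading order, and uses tree edit distance for tables.

```python
# Character-level normalized edit distance between OCR output and a reference.
# Illustrative only; the real benchmark adds normalization, reading-order
# matching, and tree edit distance for tables.

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def normalized_score(prediction: str, reference: str) -> float:
    """1.0 means a perfect match, 0.0 means completely different."""
    if not prediction and not reference:
        return 1.0
    dist = edit_distance(prediction, reference)
    return 1.0 - dist / max(len(prediction), len(reference))

print(normalized_score("Tota1 revenue: $12M", "Total revenue: $12M"))  # ~0.95
```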