Small Yet Mighty: Improve Accuracy In Multimodal Search and Visual Document Retrieval with Llama Nemotron RAG Models

Hugging Face · Jan 6, 2026

Why It Matters

Accurate multimodal retrieval reduces hallucinations and speeds up AI‑driven document Q&A, unlocking reliable knowledge extraction from PDFs, charts, and screenshots at scale.

Authors: Ronay Ak, Gabriel de Souza Pereira Moreira, Bo Liu

In real applications, data is not just text. It lives in PDFs with charts, scanned contracts, tables, screenshots, and slide decks, so a text‑only retrieval system will miss important information. Multimodal RAG pipelines change this by enabling retrieval and reasoning over text, images, and layouts together, leading to more accurate and actionable answers.

This post walks through two small Llama Nemotron models for multimodal retrieval over visual documents:

  • llama‑nemotron‑embed‑vl‑1b‑v2 – a dense single‑vector multimodal (image + text) embedding model for page‑level retrieval and similarity search.

  • llama‑nemotron‑rerank‑vl‑1b‑v2 – a cross‑encoder reranking model for query–page relevance scoring.

Both models are:

  • Small enough to run on most NVIDIA GPUs

  • Compatible with standard vector databases (single dense vector per page)

  • Designed to reduce hallucinations by grounding generation on better evidence, not longer prompts

We will show how they behave on realistic document benchmarks below.


Why Multimodal RAG Needs World‑Class Retrieval

Multimodal RAG pipelines combine a retriever with a vision‑language model (VLM) so responses are grounded in both retrieved page text and visual content, not just raw text prompts.

  • Embeddings control which pages are retrieved and shown to the VLM.

  • Reranking models decide which of those pages are most relevant and should influence the answer.

If either step is inaccurate, the VLM is more likely to hallucinate—often with high confidence. Using multimodal embeddings together with a multimodal reranker keeps generation grounded in the correct page images and text.
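As a rough sketch, the two‑stage flow looks like this in Python. The `embed`, `vector_search`, `rerank`, and `vlm` callables are hypothetical stand‑ins for the embedding model, your vector database client, the reranker, and the generation model, not actual APIs of the Llama Nemotron models:

```python
# Two-stage multimodal retrieval sketch; all callables are hypothetical stand-ins.
def answer_question(query, vlm, embed, vector_search, rerank, k=20, n=5):
    # Stage 1: dense retrieval. The embedding decides WHICH pages the
    # VLM will ever get to see.
    query_vec = embed(text=query)                   # one dense vector per query
    candidates = vector_search(query_vec, top_k=k)  # approximate nearest neighbors

    # Stage 2: cross-encoder reranking. Each (query, page) pair is scored
    # jointly, deciding which pages actually influence the answer.
    scored = sorted(candidates, key=lambda page: rerank(query, page), reverse=True)

    # Generation grounded in the top-n page images and text.
    return vlm(query, context=scored[:n])
```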


The State‑of‑the‑Art in Commercial Multimodal Search

The llama‑nemotron‑embed‑vl‑1b‑v2 and llama‑nemotron‑rerank‑vl‑1b‑v2 models are designed for developers building multimodal question‑answering and search over large corpora of PDFs and images.

  • llama‑nemotron‑embed‑vl‑1b‑v2 – a single‑vector (dense) embedding model that efficiently condenses visual and textual information into a single representation. This design ensures compatibility with all standard vector databases and enables millisecond‑latency search at enterprise scale.

  • llama‑nemotron‑rerank‑vl‑1b‑v2 – a cross‑encoder reranking model that reorders the top retrieved candidates to improve relevance, boosting downstream answer quality without changing your storage or index format.
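Because every page is condensed into one dense vector, the index side is plain vector search. Below is a minimal sketch with FAISS, assuming 2048‑dimensional, L2‑normalized page embeddings; the random arrays are placeholders standing in for real model output:

```python
import faiss
import numpy as np

# Placeholder for an (N, 2048) float32 matrix of page embeddings produced
# by llama-nemotron-embed-vl-1b-v2 (which outputs 2048-dim dense vectors).
dim = 2048
page_vectors = np.random.rand(1000, dim).astype("float32")
faiss.normalize_L2(page_vectors)  # normalize so inner product == cosine

index = faiss.IndexFlatIP(dim)    # exact inner-product index
index.add(page_vectors)

# Query with a normalized query embedding; returns top-5 page ids and scores.
query_vec = np.random.rand(1, dim).astype("float32")  # placeholder query embedding
faiss.normalize_L2(query_vec)
scores, page_ids = index.search(query_vec, 5)
```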

We evaluated these models on five visual‑document retrieval datasets: the popular ViDoRe V1 and V2 benchmarks; ViDoRe V3, a realistic enterprise visual‑document retrieval benchmark composed of eight public datasets; and two internal datasets:

  • DigitalCorpora‑10k – over 1,300 questions based on a corpus of 10,000 documents from DigitalCorpora, containing a mix of text, tables, and charts.

  • Earnings V2 – an internal retrieval dataset of 287 questions based on 500 PDFs, mostly earnings reports from big‑tech companies.


Visual Document Retrieval (Page Retrieval) Benchmarks

Average retrieval accuracy (Recall@5) across the five datasets

| Model | Text | Image | Image + Text |
|---|---|---|---|
| llama‑nemotron‑embed‑1b‑v2 | 69.35% | – | – |
| llama‑3.2‑nemoretriever‑1b‑vlm‑embed‑v1 | 71.07% | 70.46% | 71.71% |
| llama‑nemotron‑embed‑vl‑1b‑v2 | 71.04% | 71.20% | 73.24% |
| llama‑nemotron‑embed‑vl‑1b‑v2 + llama‑nemotron‑rerank‑vl‑1b‑v2 | 76.12% | 76.12% | 77.64% |

Note: Image + Text modality means that both the page image and its extracted text (e.g., via NV‑Ingest) are fed to the embedding model for a richer representation.
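In code terms, the three modality columns correspond to what you hand the embedding model per page. The `embed` stub below is hypothetical; the actual call signature is defined by the model's processor, so consult the model card:

```python
from typing import Optional

def embed(image: Optional[object] = None, text: Optional[str] = None) -> list[float]:
    """Hypothetical stand-in for the model's embedding call."""
    ...

page_image = ...  # rendered page image (e.g., a PIL.Image)
page_text = "extracted page text"  # e.g., produced by NV-Ingest

text_vec = embed(text=page_text)                        # Text modality
image_vec = embed(image=page_image)                     # Image modality
combined_vec = embed(image=page_image, text=page_text)  # Image + Text modality
```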

Reranker comparison

| Model | Text | Image | Image + Text |
|---|---|---|---|
| llama‑nemotron‑rerank‑vl‑1b‑v2 | 76.12% | 76.12% | 77.64% |
| jina‑reranker‑m0 | 69.31% | 78.33% | N/A |
| MonoQwen2‑VL‑v0.1 | 74.70% | 75.80% | 75.98% |

jina‑reranker‑m0 performs well on image‑only tasks but its weights are restricted to non‑commercial use (CC‑BY‑NC). In contrast, llama‑nemotron‑rerank‑vl‑1b‑v2 offers superior performance across Text and combined Image + Text modalities and carries a permissive commercial license.


Architectural Highlights & Training Methodology

  • Embedding model: llama‑nemotron‑embed‑vl‑1b‑v2 is a transformer‑based bi‑encoder (~1.7 B parameters), fine‑tuned from the NVIDIA Eagle family and combining the Llama 3.2 1B language model with a SigLIP2 400M vision encoder. It mean‑pools the language‑model token embeddings and outputs a 2048‑dimensional dense vector (see the pooling sketch after this list). Contrastive learning trains the model to increase similarity for relevant query‑document pairs and decrease it for negatives.

  • Reranker model: llama‑nemotron‑rerank‑vl‑1b‑v2 is a cross‑encoder (~1.7 B parameters), also built on an NVIDIA Eagle‑family backbone. Final hidden states are mean‑pooled and fed to a binary classification head fine‑tuned for ranking (sketched below). Training uses cross‑entropy loss on publicly available and synthetically generated datasets.
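For intuition, here is a hedged PyTorch sketch of the masked mean‑pooling step described for the embedding model; tensor names are illustrative, assuming standard language‑model outputs:

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq_len, hidden_dim) token embeddings from the LM.
    # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding.
    mask = attention_mask.unsqueeze(-1).float()
    summed = (hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    # L2-normalize so dot products between vectors are cosine similarities,
    # which is what contrastive training pushes up for relevant pairs.
    return F.normalize(summed / counts, dim=-1)
```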
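And a similarly illustrative sketch of the reranker's scoring head: the mean‑pooled hidden states for each (query, page) pair feed a binary classification head whose logit serves as the relevance score. The module names are assumptions, not the model's actual internals:

```python
import torch

class RerankHead(torch.nn.Module):
    # Illustrative binary classification head over mean-pooled hidden states.
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.classifier = torch.nn.Linear(hidden_dim, 1)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, hidden_dim) mean-pooled encoder states, one row per
        # (query, page) pair; the output logit is the relevance score used
        # to reorder the retrieved candidates.
        return self.classifier(pooled).squeeze(-1)
```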


How Organizations are Using These Models

Cadence – design and EDA workflows

Cadence embeds logic‑design assets (micro‑architecture specs, constraints, verification collateral) as multimodal documents. Engineers can ask, "I want to extend the interrupt controller to support a low‑power state; show me which spec sections need changes," and instantly retrieve the relevant pages, receive alternative update strategies, and generate spec edits.

IBM – domain‑heavy storage and infra docs

IBM Storage treats each page of long PDFs (product guides, configuration manuals, architecture diagrams) as a multimodal document, embeds it, and uses the reranker to prioritize pages where domain‑specific terms appear in the correct context before passing them to downstream LLMs. This improves AI interpretation of storage concepts and reasoning over complex infrastructure documentation.

ServiceNow – chat over large sets of PDFs

ServiceNow indexes pages from organizational PDFs with multimodal embeddings and applies the reranker to select the most relevant pages for each user query in its “Chat with PDF” experience. By keeping high‑scoring pages in context across turns, agents maintain more coherent conversations and help users navigate large document collections more effectively.


Get Started

You can try the models directly:

  • Run llama‑nemotron‑embed‑vl‑1b‑v2 in your vector database of choice to power multimodal search over PDFs and images.

  • Add llama‑nemotron‑rerank‑vl‑1b‑v2 as a second‑stage reranker on your top‑k results to improve retrieval quality without changing your index.

  • Download the Nemotron RAG collection for end‑to‑end agent components. The models can be integrated into ingestion pipelines or combined with other open models on Hugging Face to build multimodal agents that understand your PDFs, not just their extracted text.
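If you load the checkpoints from Hugging Face, a minimal starting point might look like the sketch below. The repository IDs and the `trust_remote_code` flag are assumptions; check each model card for the exact loading and inference API.

```python
from transformers import AutoModel

# Assumed repository IDs -- verify on the model cards before use.
embedder = AutoModel.from_pretrained(
    "nvidia/llama-nemotron-embed-vl-1b-v2", trust_remote_code=True
)
reranker = AutoModel.from_pretrained(
    "nvidia/llama-nemotron-rerank-vl-1b-v2", trust_remote_code=True
)

# Each model card documents its own encode/score interface; the typical flow:
# 1) embed page images (plus extracted text) into one 2048-dim vector per page,
# 2) ANN-search your vector database for the top-k pages,
# 3) rerank (query, page) pairs and keep the highest-scoring pages for the VLM.
```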

Stay up to date on NVIDIA Nemotron by following NVIDIA AI on LinkedIn, X, YouTube, and the Nemotron Discord channel.
