Nemotron ColEmbed V2: Raising the Bar for Multimodal Retrieval with ViDoRe V3’s Top Model
Hugging Face•Feb 4, 2026

Companies Mentioned

  • NVIDIA (NVDA)
  • Meta (META)
  • Google (GOOG)

Why It Matters

By pushing multimodal retrieval accuracy to new levels, Nemotron ColEmbed V2 enables enterprise RAG systems to extract information from complex visual documents, a critical capability for next‑generation AI search and knowledge management.

Key Takeaways

  • Nemotron ColEmbed V2 achieves state‑of‑the‑art retrieval accuracy on ViDoRe V1, V2, and V3
  • The 8B model tops the ViDoRe V3 leaderboard with an NDCG@10 of 63.42
  • Late‑interaction architecture enables fine‑grained token matching
  • The 4B and 8B models build on Qwen3‑VL; the 3B model builds on SigLIP 2 and Llama 3.2 3B
  • Multi‑vector embeddings require more storage than single‑vector embeddings

Pulse Analysis

Modern enterprise search increasingly confronts heterogeneous documents—pages that blend text, tables, charts, and graphics. Traditional single‑vector embeddings compress an entire document into one point, sacrificing the nuance needed to distinguish visual elements. Late‑interaction models, pioneered by ColBERT, retain token‑level embeddings and compute relevance through a MaxSim operation, allowing each query token to find its strongest match across the document’s visual and textual tokens. This approach, while storage‑intensive, yields markedly higher retrieval fidelity, especially for visually rich assets.

NVIDIA’s Nemotron ColEmbed V2 series translates this concept to the multimodal domain. The 4B and 8B variants are built on Qwen3‑VL, while the 3B variant combines a SigLIP 2 vision encoder with Llama 3.2 3B. All three use bi‑directional self‑attention and are trained as bi‑encoders with a contrastive objective, on training data that includes text‑only and text‑image pairs. On the ViDoRe V3 benchmark, an enterprise‑focused evaluation of visual document retrieval, the 8B model reaches an NDCG@10 of 63.42, ahead of the earlier 3B V1 release (57.26) and first on the leaderboard. Post‑training model merging and enriched multilingual synthetic data further stabilize accuracy without adding inference latency.

The implications for businesses are immediate. High‑accuracy multimodal retrieval powers next‑generation Retrieval‑Augmented Generation (RAG) workflows, enabling conversational AI to cite exact chart values, table rows, or infographic details. Companies can integrate the ColEmbed V2 models via NVIDIA’s NeMo Retriever suite or NGC containers, balancing storage costs against the need for precise document understanding. As visual document volumes grow, these models set a new standard for AI‑driven knowledge extraction, positioning NVIDIA as a key enabler of enterprise‑grade multimodal AI.


NVIDIA introduces the Nemotron ColEmbed V2 family

Modern search systems are increasingly designed to process heterogeneous document images that may contain text, tables, charts, figures, and other visual components. In this context, accurately retrieving relevant information across these diverse modalities is a central challenge. Multimodal embedding models built on top of foundational vision–language models (VLMs) map diverse content types into a shared representation space, enabling unified retrieval over text, images, and structured visual elements. Encoding an entire query or candidate document into a single vector is common practice, exemplified by our recently released commercial‑ready Llama‑Nemotron‑Embed‑VL‑1B, which prioritizes efficiency and low storage. A growing line of research instead explores multi‑vector, late‑interaction embedding architectures, which allow fine‑grained interaction between query tokens and document tokens. By keeping one embedding per token, these models capture more detailed semantic relationships and have shown higher accuracy on a range of benchmarks, including multimodal ones.

NVIDIA introduces the Nemotron ColEmbed V2 family, a set of late‑interaction embedding models available in three sizes—3B, 4B, and 8B—designed for highly accurate multimodal retrieval. These models adopt a unified approach to text–image retrieval and achieve state‑of‑the‑art performance on the ViDoRe V1, V2, and V3 benchmarks.

Nemotron ColEmbed V2 Highlights (TL;DR)

The nemotron-colembed-vl-8b-v2, nemotron-colembed-vl-4b-v2, and llama-nemotron-colembed-vl-3b-v2 are state‑of‑the‑art late‑interaction embedding models. As of Feb 3, 2026, they rank 1st, 3rd, and 6th on the ViDoRe V3 benchmark, a comprehensive evaluation of visual document retrieval for enterprise use cases, and each is the highest‑ranked model in its weight class.

[Figure: late interaction diagram]

The late interaction mechanism introduced by ColBERT for multi‑vector embedding matching has been extended in our work to a multimodal setting, enabling fine‑grained interactions between query and document tokens, whether textual or visual. As illustrated in the figure, each query token embedding interacts with all document token embeddings via the MaxSim operator, which selects the maximum similarity for each query token and then sums these maxima to produce the final relevance score. This approach requires storing the token embeddings for the entire document corpus, whether textual or visual, thereby increasing storage requirements. During inference, query token embeddings are computed and matched against the stored document embeddings using the same MaxSim operation.
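
The scoring rule described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the models' actual inference code: it assumes a plain dot product as the similarity function, and the function name `maxsim_score` and the toy 2‑dimensional embeddings are made up for the example.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction relevance score.

    query_emb: (n_query_tokens, dim) token embeddings of the query.
    doc_emb:   (n_doc_tokens, dim) token embeddings of the document.
    Similarity is a dot product here (an assumption for this sketch).
    """
    sim = query_emb @ doc_emb.T          # (n_query_tokens, n_doc_tokens)
    best_per_query_token = sim.max(axis=1)  # MaxSim: best document match per query token
    return float(best_per_query_token.sum())

# Toy data: two query tokens, two candidate documents.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # has an exact match for each query token
doc_b = np.array([[0.5, 0.5], [0.5, 0.5]])              # only partial matches

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because each query token picks its own best match, `doc_a` scores higher than `doc_b` even though it also contains a token that matches neither query token well.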

Nemotron ColEmbed V2 models are intended for researchers exploring visual document retrieval applications where accuracy is paramount. This distinguishes them from our 1B single‑vector model released last month, which was designed for commercial environments requiring minimal storage and high throughput. The V2 models are instrumental in multimodal RAG systems, where textual queries can be used to retrieve document images—pages, text, charts, tables, or infographics. The models output multi‑vector embeddings for input queries and documents. Potential applications include multimedia search engines, cross‑modal retrieval systems, and conversational AI with rich input understanding.

As a new benchmark, ViDoRe V3 is designed to set an industry standard for multimodal enterprise document retrieval. It tackles a key challenge in production RAG systems: accurately extracting information from complex, visually‑rich documents. With its strong multimodal document retrieval capability, the nemotron-colembed-vl-8b-v2 model ranks #1 on the ViDoRe V3 leaderboard, setting a new standard for accuracy.

Visual Document Retrieval benchmark (page retrieval) – Avg NDCG@10 on ViDoRe V3 public and private tasks

| Model | Emb dim | # of parameters | ViDoRe V3 Accuracy (NDCG@10) |
|-------|---------|----------------|------------------------------|
| nemotron-colembed-vl-8b-v2 | 4096 | 8.8 B | 63.42 |
| nemotron-colembed-vl-4b-v2 | 2560 | 4.8 B | 61.54 |
| llama-nemotron-colembed-vl-3b-v2 | 3072 | 4.4 B | 59.79 |
| llama-nemoretriever-colembed-3b-v1 | 3072 | 4.4 B | 57.26 |
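
For readers unfamiliar with the metric, NDCG@10 rewards rankings that place relevant pages near the top of the first ten results. The sketch below assumes linear gain and a log2 rank discount, which is one common convention; the benchmark's exact evaluation code may differ, and the table values appear to be on a 0 to 100 scale rather than 0 to 1. The function names are illustrative.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """NDCG@k for one query.

    ranked_relevances: relevance labels of the retrieved pages, in ranked order.
    all_relevances:    relevance labels of every judged page for the query,
                       used to build the ideal ranking.
    """
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

judged = [1, 1, 0]                         # two relevant pages, one irrelevant
perfect = ndcg_at_k([1, 1, 0], judged)     # relevant pages ranked first
worse = ndcg_at_k([0, 1, 1], judged)       # irrelevant page ranked first
print(perfect, worse)
```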

Models’ Architecture

The llama-nemotron-colembed-vl-3b-v2 is a transformer‑based multimodal embedding model built on top of a VLM based on google/siglip2-giant-opt-patch16-384 and meta-llama/Llama-3.2-3B. The nemotron-colembed-vl-8b-v2 and nemotron-colembed-vl-4b-v2 multimodal encoder models were built from Qwen3‑VL‑8B‑Instruct and Qwen3‑VL‑4B‑Instruct, respectively.

Architecture modifications

  • Our models use bi‑directional self‑attention instead of the original uni‑directional causal self‑attention from the LLM decoder models. This allows the model to learn rich representations from the whole input sequence.

  • ColBERT‑style late interaction mechanism – for each input token, each model outputs an n‑dimensional embedding vector of floating‑point values, where n is determined by the model’s hidden size.
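
To make the storage trade‑off concrete, here is a back‑of‑the‑envelope estimate using the embedding dimensions from the table above. The corpus size, the number of tokens per page, and the 2 bytes per value (16‑bit floats) are hypothetical placeholders, not figures from the models' documentation.

```python
def embedding_storage_bytes(num_pages, vectors_per_page, emb_dim, bytes_per_value=2):
    """Raw storage for page embeddings, ignoring index overhead and compression."""
    return num_pages * vectors_per_page * emb_dim * bytes_per_value

# Embedding dimensions taken from the ViDoRe V3 results table.
EMB_DIMS = {
    "nemotron-colembed-vl-8b-v2": 4096,
    "nemotron-colembed-vl-4b-v2": 2560,
    "llama-nemotron-colembed-vl-3b-v2": 3072,
}

NUM_PAGES = 1_000_000     # hypothetical corpus size
TOKENS_PER_PAGE = 1_000   # hypothetical number of token embeddings per page

# Multi-vector: one embedding per token.
storage_gb = {
    name: embedding_storage_bytes(NUM_PAGES, TOKENS_PER_PAGE, dim) / 1e9
    for name, dim in EMB_DIMS.items()
}

# Single-vector baseline at the same dimension: one embedding per page.
single_vector_gb = embedding_storage_bytes(NUM_PAGES, 1, 4096) / 1e9

for name, gb in storage_gb.items():
    print(f"{name}: {gb:,.0f} GB")
print(f"single-vector baseline (dim 4096): {single_vector_gb:.3f} GB")
```

Under these assumptions the multi‑vector index is larger than a single‑vector index by a factor equal to the number of tokens per page, which is why the storage cost matters when sizing a deployment.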

Training methodology

The nemotron-colembed-vl-8b-v2, nemotron-colembed-vl-4b-v2, and llama-nemotron-colembed-vl-3b-v2 models were each trained separately using a bi‑encoder architecture, in which the two inputs of a pair (e.g., a query and a document) are encoded independently by the same embedding model. Using contrastive learning, the model maximizes the late‑interaction similarity between the query and the document that contains the answer, while minimizing similarity with sampled negative documents.
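
The contrastive objective can be illustrated with an InfoNCE‑style loss over late‑interaction scores. This is a sketch under assumptions: the exact loss, the `temperature` value, and the dot‑product similarity are not given in the text and are chosen here only for illustration.

```python
import numpy as np

def maxsim(query_emb, doc_emb):
    """Late-interaction score: sum over query tokens of the best dot-product match."""
    return float((query_emb @ doc_emb.T).max(axis=1).sum())

def contrastive_loss(query_emb, positive_emb, negative_embs, temperature=1.0):
    """Cross-entropy over [positive, negatives], with the positive as the target.

    temperature is a hypothetical hyperparameter for this sketch.
    """
    scores = np.array(
        [maxsim(query_emb, positive_emb)]
        + [maxsim(query_emb, neg) for neg in negative_embs]
    ) / temperature
    scores = scores - scores.max()                    # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return float(-log_probs[0])                       # negative log-likelihood of the positive

query = np.array([[1.0, 0.0], [0.0, 1.0]])
relevant_doc = np.array([[1.0, 0.0], [0.0, 1.0]])     # MaxSim score 2.0
irrelevant_doc = np.array([[0.5, 0.5]])               # MaxSim score 1.0

loss_good = contrastive_loss(query, relevant_doc, [irrelevant_doc])
loss_bad = contrastive_loss(query, irrelevant_doc, [relevant_doc])
print(loss_good, loss_bad)
```

The loss is small when the positive document already outscores the negatives and large when a negative outscores it, which is the signal that pushes the embeddings apart during training.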

The llama-nemotron-colembed-vl-3b-v2 model was trained in a two‑stage pipeline: first fine‑tuned with 12.5 M text‑QA pairs, then fine‑tuned with text–image pairs. The nemotron-colembed-vl-8b-v2 and nemotron-colembed-vl-4b-v2 models were fine‑tuned using only text‑image pairs (second stage).

Our training datasets contain both text‑only and text‑image pairs, and we apply hard negative mining following the positive‑aware hard negative mining methods presented in the NV‑Retriever paper to improve retrieval performance.
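
As a rough illustration of positive‑aware mining, the sketch below drops candidate negatives whose retrieval score is too close to the positive's score (they are likely unlabeled positives) and keeps the hardest of the rest. The percentage‑of‑positive rule, the `max_fraction` value of 0.95, and all names are assumptions for this example; the exact rule and thresholds used to train these models are not stated here.

```python
def mine_hard_negatives(candidate_scores, positive_score, max_fraction=0.95, top_k=2):
    """Select hard negatives for one query.

    candidate_scores: dict mapping document id to its retrieval score for the query.
    positive_score:   retrieval score of the labeled positive document.
    max_fraction:     candidates scoring at or above max_fraction * positive_score
                      are treated as probable false negatives and discarded
                      (hypothetical threshold).
    """
    threshold = positive_score * max_fraction
    kept = [(doc_id, score) for doc_id, score in candidate_scores.items() if score < threshold]
    kept.sort(key=lambda item: item[1], reverse=True)  # hardest (highest-scoring) first
    return [doc_id for doc_id, _ in kept[:top_k]]

candidates = {"doc_a": 0.79, "doc_b": 0.70, "doc_c": 0.55, "doc_d": 0.20}
hard_negatives = mine_hard_negatives(candidates, positive_score=0.80)
print(hard_negatives)
```

Here `doc_a` is excluded because it scores almost as high as the positive, while `doc_b` and `doc_c` are kept as the most informative remaining negatives.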

Key improvements over V1

  • Advanced Model Merging – Utilizes post‑training model merging to combine the strengths of multiple fine‑tuned checkpoints, delivering the accuracy stability of an ensemble without any additional inference latency.

  • Enhanced Synthetic Data – Significantly enriched our training mixture with diverse multilingual synthetic data, improving semantic alignment across languages and complex document types.
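
The merging method itself is not specified. As a minimal sketch, assuming plain weighted averaging of parameters across checkpoints that share one architecture, merging could look like the following; the function name, the dictionary‑of‑arrays checkpoint format, and the toy values are all hypothetical.

```python
import numpy as np

def merge_checkpoints(checkpoints, weights=None):
    """Average parameters across checkpoints with identical parameter names and shapes.

    checkpoints: list of dicts mapping parameter name to np.ndarray.
    weights:     optional per-checkpoint weights; defaults to a uniform average.
    """
    if weights is None:
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    merged = {}
    for name in checkpoints[0]:
        merged[name] = sum(w * ckpt[name] for w, ckpt in zip(weights, checkpoints))
    return merged

# Two toy "fine-tuned checkpoints" of the same tiny model.
ckpt_a = {"layer.weight": np.array([1.0, 2.0]), "layer.bias": np.array([0.0])}
ckpt_b = {"layer.weight": np.array([3.0, 4.0]), "layer.bias": np.array([2.0])}

merged = merge_checkpoints([ckpt_a, ckpt_b])
print(merged)
```

The merged result is a single set of weights, so serving it costs the same as serving one checkpoint, which is consistent with the claim of no additional inference latency.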

[Figure: model performance on ViDoRe V3]

Start building with Nemotron ColEmbed V2

Nemotron ColEmbed V2 models mark a major step forward in high‑accuracy text–image retrieval, delivering state‑of‑the‑art results on the ViDoRe V1, V2, and V3 benchmarks. The availability of 3B, 4B and 8B model variants further establishes a solid foundation for future research and advanced experimentation in multimodal retrieval applications.

Get started by downloading the models:

  • nemotron-colembed-vl-8b-v2

  • nemotron-colembed-vl-4b-v2

  • llama-nemotron-colembed-vl-3b-v2

Learn more about the NVIDIA NeMo Retriever family of Nemotron RAG models on the product page, or access the microservice container from NVIDIA NGC. This is an excellent opportunity to explore state‑of‑the‑art retrieval in your own applications and workflows.

Try the NVIDIA Enterprise RAG Blueprint, using the Nemotron RAG models that are powered by the same technology behind our ViDoRe V3 winning results.
