Nemotron ColEmbed V2: Raising the Bar for Multimodal Retrieval with ViDoRe V3’s Top Model

Hugging Face, Feb 4, 2026

Why It Matters

By pushing multimodal retrieval accuracy to new levels, Nemotron ColEmbed V2 enables enterprise RAG systems to extract information from complex visual documents, a critical capability for next‑generation AI search and knowledge management.

Key Takeaways

  • Nemotron ColEmbed V2 achieves state‑of‑the‑art NDCG@10 on ViDoRe V3
  • 8B model tops the ViDoRe V3 leaderboard with an NDCG@10 of 63.42
  • Late‑interaction architecture enables fine‑grained token matching
  • Models built on Qwen3‑VL and SigLIP foundations
  • Multi‑vector embeddings require more storage than single‑vector embeddings

Pulse Analysis

Modern enterprise search increasingly confronts heterogeneous documents: pages that blend text, tables, charts, and graphics. Traditional single‑vector embeddings compress an entire document into one point, sacrificing the nuance needed to distinguish visual elements. Late‑interaction models, pioneered by ColBERT, instead retain token‑level embeddings and compute relevance through a MaxSim operation: each query token is matched to its most similar document token, visual or textual, and these per‑token maxima are summed into the document score. This approach is storage‑intensive, since every document keeps one vector per token, but it yields markedly higher retrieval fidelity, especially for visually rich assets.
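The MaxSim scoring described above can be sketched in a few lines. This is a minimal illustration in plain Python, not NVIDIA's implementation: the function names and the two-dimensional toy vectors are invented for the example, and real systems run the same computation as batched matrix operations over normalized embeddings.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Late-interaction relevance: for each query token, keep its best
    similarity over all document tokens, then sum those maxima."""
    if not query_tokens or not doc_tokens:
        raise ValueError("query and document need at least one token vector each")
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy data (hypothetical): two query tokens and two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # has a strong match for each query token
page_b = [[0.6, 0.4], [0.5, 0.5]]              # only mediocre matches

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
ranking = sorted([("page_a", score_a), ("page_b", score_b)],
                 key=lambda item: item[1], reverse=True)
print(ranking)
```

Because each query token picks its own best match, page_a outranks page_b even though no single vector on page_a resembles the whole query.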

NVIDIA’s Nemotron ColEmbed V2 series applies this concept to the multimodal domain. Built on Qwen3‑VL and SigLIP backbones, the 3B, 4B, and 8B variants employ bi‑directional self‑attention and a contrastive bi‑encoder training pipeline that mixes text‑only and text‑image pairs. On the ViDoRe V3 benchmark, a rigorous enterprise‑focused evaluation, the 8B model reaches an NDCG@10 of 63.42, outpacing prior releases and securing the top leaderboard spot. Advanced model merging and enriched synthetic multilingual data further stabilize performance without adding inference latency.
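For readers unfamiliar with the metric, NDCG@10 for a single query can be computed as below. This is a sketch under stated assumptions: it uses the linear-gain form of DCG (some evaluations use an exponential gain instead), the function names and relevance labels are made up for illustration, and the article does not say how the benchmark aggregates per-query values into the reported 63.42.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain: each relevance is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_relevances: Sequence[float],
              judged_relevances: Sequence[float],
              k: int = 10) -> float:
    """NDCG@k for one query.

    ranked_relevances: relevance of each returned page, in the order returned.
    judged_relevances: relevance labels of all judged pages for the query,
                       used to build the ideal ordering.
    """
    ideal_dcg = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant pages exist for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical query: two relevant pages exist; the system returned them
# at ranks 1 and 3, with an irrelevant page in between.
example = ndcg_at_k(ranked_relevances=[1, 0, 1], judged_relevances=[1, 1])
print(round(example, 4))
```

A perfect ranking scores 1.0, and relevant pages pushed further down the list are penalized logarithmically, which is why the metric rewards putting the right page near the top.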

The implications for businesses are immediate. High‑accuracy multimodal retrieval powers next‑generation Retrieval‑Augmented Generation (RAG) workflows, enabling conversational AI to cite exact chart values, table rows, or infographic details. Companies can integrate the ColEmbed V2 models via NVIDIA’s NeMo Retriever suite or NGC containers, balancing storage costs against the need for precise document understanding. As visual document volumes grow, these models set a new standard for AI‑driven knowledge extraction, positioning NVIDIA as a key enabler of enterprise‑grade multimodal AI.
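The storage trade-off mentioned above can be estimated with simple arithmetic. The sketch below is illustrative only: the page count, tokens per page, embedding dimensions, and 16-bit precision are hypothetical values chosen for the example, not figures published for the ColEmbed V2 models.

```python
def multi_vector_bytes(num_pages: int, tokens_per_page: int,
                       dim: int, bytes_per_value: int = 2) -> int:
    """Raw index size when every page stores one vector per token."""
    return num_pages * tokens_per_page * dim * bytes_per_value


def single_vector_bytes(num_pages: int, dim: int, bytes_per_value: int = 2) -> int:
    """Raw index size when every page is compressed into one vector."""
    return num_pages * dim * bytes_per_value


# Hypothetical corpus and model settings (assumptions, not published values).
NUM_PAGES = 1_000_000
TOKENS_PER_PAGE = 1024   # assumed number of token vectors kept per page
MULTI_DIM = 128          # assumed per-token embedding size
SINGLE_DIM = 2048        # assumed single-vector embedding size

multi = multi_vector_bytes(NUM_PAGES, TOKENS_PER_PAGE, MULTI_DIM)
single = single_vector_bytes(NUM_PAGES, SINGLE_DIM)

print(f"multi-vector index:  {multi / 1e9:.1f} GB")
print(f"single-vector index: {single / 1e9:.1f} GB")
print(f"ratio: {multi / single:.0f}x")
```

Under these assumed settings the multi-vector index is 64 times larger, which is the kind of gap that lower-precision storage or token pooling would need to close.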
