Generative AI in the Real World: Douwe Kiela on Why RAG Isn’t Dead

O’Reilly Media
May 14, 2026

Why It Matters

RAG 2.0 is pitched as a cost-effective, scalable way to bridge the gap between demo prototypes and enterprise-grade production, which would make advanced language models financially viable for real-world applications.

Key Takeaways

  • RAG remains essential despite models' expanding context windows.
  • Combining retrieval with long contexts reduces compute waste and improves accuracy.
  • RAG 2.0 focuses on jointly optimized, end‑to‑end system components.
  • Document parsing, hierarchy, and smart chunking outweigh embedding tricks.
  • Hybrid mixtures of retrievers, not vector search alone, remain critical in production.

Summary

This podcast episode with Douwe Kiela, CEO of Contextual AI, tackles the question of whether Retrieval-Augmented Generation (RAG) is obsolete in the era of massive-context language models. Kiela argues that larger context windows target the same problem RAG addresses, namely getting relevant information in front of the model, but that filling them indiscriminately is computationally wasteful and can even degrade performance beyond a few thousand tokens.
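To make the "computationally wasteful" point concrete, here is a back-of-the-envelope sketch. Every number in it is an illustrative assumption (a hypothetical flat price per million input tokens, five retrieved chunks of 500 tokens each), not a figure from the episode or from any vendor's price list. It also ignores the cost of building and querying the retrieval index.

```python
def prompt_cost(tokens: int, price_per_million: float) -> float:
    """Cost of processing `tokens` input tokens at a flat per-million-token price."""
    return tokens / 1_000_000 * price_per_million


# All values below are illustrative assumptions, not measured or quoted figures.
PRICE_PER_MILLION = 1.0            # hypothetical price, arbitrary currency units
FULL_CORPUS_TOKENS = 10_000_000    # whole corpus placed in the context window
RETRIEVED_TOKENS = 5 * 500         # 5 retrieved chunks of 500 tokens each

full_cost = prompt_cost(FULL_CORPUS_TOKENS, PRICE_PER_MILLION)
rag_cost = prompt_cost(RETRIEVED_TOKENS, PRICE_PER_MILLION)

# With a flat price, the cost ratio equals the token ratio.
ratio = FULL_CORPUS_TOKENS / RETRIEVED_TOKENS

print(f"full context: {full_cost:.4f}, retrieved context: {rag_cost:.4f}")
print(f"the full-context prompt is {ratio:.0f}x larger per query")
```

Under these assumptions the gap is paid on every query, which is why the per-query prompt size matters more than the one-time cost of indexing.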

A central observation is that feeding an entire 10-million-token context to the model is inefficient, while a well-designed RAG pipeline can deliver the same answers at a fraction of the cost. Kiela introduces "RAG 2.0," an end-to-end system in which the retriever, reranker, chunker, and language model are jointly optimized. Contextual AI's platform automates document extraction, layout segmentation, hierarchical metadata capture, and smart chunking, removing much of the trial and error that plagued early RAG implementations.
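As a rough illustration of what hierarchical metadata capture plus chunking can look like, the sketch below splits a document into chunks and tags each one with the path of headings above it. This is not Contextual AI's implementation. It assumes the parser has already produced text with Markdown-style `#` headings, and the names `Chunk` and `chunk_by_hierarchy` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    section_path: tuple[str, ...]  # headings from the document root down to this chunk
    text: str


def chunk_by_hierarchy(doc: str, max_words: int = 50) -> list[Chunk]:
    """Split `doc` into chunks of at most `max_words` words, never crossing a heading."""
    path: list[str] = []     # current heading path, e.g. ["Handbook", "Leave"]
    buffer: list[str] = []   # body lines collected for the current section
    chunks: list[Chunk] = []

    def flush() -> None:
        # Emit the buffered section text as one or more chunks under the current path.
        words = " ".join(buffer).split()
        for i in range(0, len(words), max_words):
            chunks.append(Chunk(tuple(path), " ".join(words[i:i + max_words])))
        buffer.clear()

    for line in doc.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            flush()  # close the previous section before the path changes
            level = len(stripped) - len(stripped.lstrip("#"))
            del path[level - 1:]  # drop headings at this depth or deeper
            path.append(stripped.lstrip("#").strip())
        elif stripped:
            buffer.append(stripped)
    flush()
    return chunks


doc = "# Handbook\nWelcome.\n## Leave\nStaff get 20 days.\n## Expenses\nFile within 30 days."
for chunk in chunk_by_hierarchy(doc):
    print(" > ".join(chunk.section_path), "|", chunk.text)
```

Keeping the heading path with each chunk lets a retriever or reranker use the document's structure as a signal, instead of treating every chunk as an isolated string.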

One example illustrates the point: answering "who is the headmaster in Harry Potter?" shouldn't require reading all the books, just a targeted retrieval. Kiela also highlights Contextual AI's two-API approach (ingestion and query), which mirrors OpenAI's completion endpoint but is grounded in the user's own data. He stresses that document parsing, not just embedding selection, is the true bottleneck, and that building a hybrid "mixture of retrievers" (including graph-based and BM25 methods) remains the hardest part of the pipeline.
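A minimal sketch of a mixture of retrievers follows, assuming two rankers whose results are merged with reciprocal rank fusion (RRF). RRF is a common fusion technique chosen here for illustration; the episode summary does not say how Contextual AI combines its retrievers. The BM25 ranker is a small pure-Python version, and the character-trigram ranker is only a stand-in for a dense vector retriever so that the example stays self-contained.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def bm25_rank(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Return document indices sorted by BM25 score, best first."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


def trigram_rank(query: str, docs: list[str]) -> list[int]:
    """Stand-in for a dense retriever: rank by Jaccard overlap of character trigrams."""
    def grams(text: str) -> set[str]:
        s = text.lower()
        return {s[i:i + 3] for i in range(len(s) - 2)}

    q = grams(query)

    def jaccard(doc: str) -> float:
        g = grams(doc)
        return len(q & g) / len(q | g) if q | g else 0.0

    return sorted(range(len(docs)), key=lambda i: jaccard(docs[i]), reverse=True)


def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Merge rankings: each document scores sum(1 / (k + rank)) over all rankers."""
    fused: Counter = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


docs = [
    "Albus Dumbledore is the headmaster of Hogwarts.",
    "Quidditch is played on broomsticks.",
    "Hagrid is the gamekeeper at Hogwarts.",
]
query = "who is the headmaster"
fused = reciprocal_rank_fusion([bm25_rank(query, docs), trigram_rank(query, docs)])
print(docs[fused[0]])
```

RRF uses only rank positions, so it avoids having to calibrate BM25 scores against similarity scores from a different retriever. That convenience is also its limit, since it discards how confident each retriever was.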

For enterprises, the implication is that a production-ready RAG 2.0 stack can cut latency and cloud spend, scale to millions of complex PDFs, and avoid the brittle, stitched-together "Frankenstein" pipelines common in early demos. Companies that invest in robust retrieval, hierarchical metadata, and joint system optimization stand to gain a competitive edge as AI moves from proof of concept to mission-critical workloads.

Original Description

Join Ben and Douwe Kiela, cofounder of Contextual AI and author of the first paper on RAG, to find out why RAG remains as relevant as ever. Regardless of what you call it, retrieval is at the heart of generative AI. Find out why—and how to build effective RAG-based systems.