Generative AI in the Real World: Shreya Shankar on AI for Corporate Data Processing

O’Reilly Media
May 8, 2026

Why It Matters

By turning unstructured corporate data into reliable, searchable structures, LLM‑powered pipelines can cut analytics costs and speed decision‑making, but only if enterprises enforce guardrails against hallucinations and run‑to‑run variability in model output.

Key Takeaways

  • LLMs enable semantic extraction from unstructured enterprise documents.
  • DocETL provides map‑reduce style pipelines driven by LLM prompts.
  • Non‑technical analysts can run partial pipelines without coding.
  • The DocWrangler IDE adds observability, prompt engineering, and guardrails.
  • The accuracy vs. creativity trade‑off is managed via temperature and validation loops.

Summary

In this podcast, UC Berkeley PhD candidate Shreya Shankar explains how generative AI is reshaping enterprise data processing. She highlights the long‑standing challenge of extracting structure from unstructured assets (PDFs, transcripts, logs) and shows how large language models now make that feasible. Shankar’s dissertation project, DocETL, reimagines the classic map‑reduce workflow: LLM‑driven prompts perform the map step, extracting themes, entities, or pain points from each document, while a semantic reduce step groups and aggregates those results into reports or knowledge‑graph tables.
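The map and reduce steps described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `map_extract`, `reduce_by_theme`, and `THEME_KEYWORDS` are hypothetical names, not the tool's API, and keyword matching stands in for the LLM call so the sketch runs offline.

```python
from collections import defaultdict

# ASSUMPTION: a keyword table stands in for an LLM prompt. A real pipeline
# would send each review to a model and parse its structured response.
THEME_KEYWORDS = {
    "battery": "battery life",
    "shipping": "delivery",
    "late": "delivery",
}


def map_extract(review: str) -> list[dict]:
    """Map step: emit one {theme, quote} record per theme found in a review."""
    lowered = review.lower()
    seen = set()
    records = []
    for keyword, theme in THEME_KEYWORDS.items():
        if keyword in lowered and theme not in seen:
            seen.add(theme)  # avoid duplicate records for the same theme
            records.append({"theme": theme, "quote": review})
    return records


def reduce_by_theme(records: list[dict]) -> dict[str, dict]:
    """Reduce step: group records by theme and aggregate their quotes."""
    groups = defaultdict(list)
    for record in records:
        groups[record["theme"]].append(record["quote"])
    return {
        theme: {"count": len(quotes), "quotes": quotes}
        for theme, quotes in groups.items()
    }


reviews = [
    "The battery dies after two hours.",
    "Shipping was late and the box was crushed.",
    "Great screen, but the battery drains fast.",
]

# Flatten the per-review map outputs, then aggregate them.
mapped = [record for review in reviews for record in map_extract(review)]
report = reduce_by_theme(mapped)

for theme, summary in report.items():
    print(f"{theme}: {summary['count']} review(s)")
```

In a real pipeline the grouping key would itself come from a model (so that "battery drain" and "short battery life" land in the same group), which is what makes the reduce step semantic rather than an exact‑match group‑by.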

Key technical insights include task decomposition (splitting large prompt batches into smaller, related subtasks), temperature control to trade deterministic outputs against creative ones, and a plug‑in architecture that lets users incorporate external parsers, OCR tools, or Gemini’s native PDF handling. The companion IDE, DocWrangler, adds observability, automatic prompt generation, incremental execution, and LLM‑based guardrails that validate outputs and retry until expectations are met. Users can run full pipelines or isolated map operations, enabling analysts without coding expertise to extract insights from clinical notes, contracts, or customer reviews.
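The validate‑and‑retry loop mentioned above might look roughly like the following. It is a minimal sketch, not the IDE's implementation: the model is a hypothetical stub that returns a truncated response first and a valid one second, the stub ignores its `temperature` argument, and a rule‑based schema check is swapped in for the LLM judge described in the episode.

```python
import json

REQUIRED_FIELDS = {"theme", "quote"}


def make_fake_model():
    """ASSUMPTION: a stub model that fails once, then returns valid JSON."""
    responses = iter([
        '{"theme": "delivery"',  # truncated, invalid JSON
        '{"theme": "delivery", "quote": "arrived late"}',
    ])

    def call(prompt: str, temperature: float = 0.0) -> str:
        # A real client would pass temperature to the model; 0.0 asks for
        # the most deterministic output. The stub ignores both arguments.
        return next(responses)

    return call


def validate(raw: str):
    """Return (record, None) if the output passes, else (None, reason)."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(record, dict):
        return None, "expected a JSON object"
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        return None, f"missing fields: {sorted(missing)}"
    return record, None


def extract_with_retries(call_model, prompt: str, max_attempts: int = 3):
    """Call the model until its output validates or attempts run out."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        raw = call_model(prompt + feedback, temperature=0.0)
        record, error = validate(raw)
        if error is None:
            return record, attempt
        # Feed the rejection reason back into the next prompt.
        feedback = f"\nPrevious output was rejected ({error}). Try again."
    raise ValueError(f"no valid output after {max_attempts} attempts")


record, attempts = extract_with_retries(
    make_fake_model(), "Extract one pain point as JSON."
)
print(record, "after", attempts, "attempt(s)")
```

Capping the number of attempts matters in practice: each retry is another model call, so an unbounded loop can turn one bad document into an open‑ended bill.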

Shankar illustrates the approach with concrete examples: extracting “pain points” and associated quotes from product reviews, then summarizing each theme for executive briefs. She likens the resulting data layers to a “semantic bronze‑silver‑gold” hierarchy, where raw extracted tables become queryable assets for downstream retrieval‑augmented generation or traditional analytics. Guardrails, meaning LLM judges that check for hallucinations or enforce schema constraints, provide a safety net comparable to the data‑quality expectations that teams define on data‑warehouse tables.
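One simple hallucination check for the pain‑point example is to confirm that every extracted quote actually appears in the source review before a row is promoted to the next layer. The sketch below is an assumption‑laden illustration: the `bronze` and `silver` lists are one possible reading of the hierarchy, `is_grounded` is a hypothetical helper, and a substring test stands in for the LLM judge.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def is_grounded(quote: str, source: str) -> bool:
    """True if the quote appears (near-)verbatim in the source text."""
    return normalize(quote) in normalize(source)


source_review = "Setup took an hour, and the app crashed twice during pairing."

# ASSUMPTION: "bronze" holds raw extracted rows, some possibly hallucinated.
bronze = [
    {"theme": "setup time", "quote": "Setup took an hour"},
    {"theme": "stability", "quote": "the app crashed  twice"},
    {"theme": "price", "quote": "way too expensive"},  # not in the source
]

# "Silver" keeps only rows whose quote is supported by the source.
silver = [row for row in bronze if is_grounded(row["quote"], source_review)]
rejected = [row for row in bronze if not is_grounded(row["quote"], source_review)]

print("kept:", [row["theme"] for row in silver])
print("rejected:", [row["theme"] for row in rejected])
```

A substring test only catches fabricated quotes; it cannot tell whether a real quote was assigned to the wrong theme, which is the kind of judgment a model‑based check is meant to cover.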

The broader implication is that enterprises can rapidly convert massive unstructured corpora into structured, queryable formats without building bespoke ML pipelines. This lowers the barrier for business units to gain actionable insights, accelerates time‑to‑value, and opens new opportunities for integrating LLM‑generated data into existing data‑lake or knowledge‑graph architectures, while highlighting the need for robust validation to manage LLM nondeterminism.

Original Description

Businesses have a lot of data—but most of that data is unstructured textual data: reports, catalogs, emails, notes, and much more. Without structure, business analysts can’t make sense of the data; there is value in the data, but it can’t be put to use.
AI can be a tool for finding and extracting the structure that’s hidden in textual data. In this episode, Ben and Shreya talk about a new generation of tooling that brings AI to enterprise data processing.