Generative AI in the Real World: Shreya Shankar on AI for Corporate Data Processing
Why It Matters
LLM‑powered pipelines that turn unstructured corporate data into reliable, searchable structures can cut analytics costs and speed decision‑making, but only if enterprises enforce guardrails against hallucinations and output variability.
Key Takeaways
- LLMs enable semantic extraction from unstructured enterprise documents.
- DocETL provides map‑reduce style pipelines driven by LLM prompts.
- Non‑technical analysts can run partial pipelines without coding.
- The DocWrangler IDE adds observability, prompt engineering, and guardrails.
- The trade‑off between accuracy and creativity is managed via temperature and validation loops.
Summary
In this podcast, UC Berkeley PhD candidate Shreya Shankar explains how generative AI is reshaping enterprise data processing. She highlights the long‑standing challenge of extracting structure from unstructured assets such as PDFs, transcripts, and logs, and shows how large language models now make that feasible. Shankar’s dissertation project, DocETL, reimagines the classic map‑reduce workflow: LLM‑driven prompts perform the map step, extracting themes, entities, or pain points from each document, while a semantic reduce step groups and aggregates the results into reports or knowledge‑graph tables.
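The map and reduce steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual API: `fake_llm_extract` is a hypothetical keyword‑based stand‑in for an LLM prompt, and the reduce step groups by exact theme label, whereas a semantic reduce would also merge labels that mean the same thing.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical stand-in for an LLM "map" prompt. A real pipeline would call a
# model here; keyword rules keep the sketch runnable and deterministic.
def fake_llm_extract(review: str) -> list[dict]:
    rules = {"battery": "battery life", "slow": "performance", "crash": "stability"}
    lowered = review.lower()
    return [
        {"theme": theme, "quote": review}
        for keyword, theme in rules.items()
        if keyword in lowered
    ]

def map_step(docs: list[str], extract: Callable[[str], list[dict]]) -> list[dict]:
    """Apply the extraction function to every document and flatten the results."""
    return [item for doc in docs for item in extract(doc)]

def reduce_step(items: list[dict]) -> dict[str, list[str]]:
    """Group supporting quotes by theme (exact-match grouping, an assumption)."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for item in items:
        grouped[item["theme"]].append(item["quote"])
    return dict(grouped)

reviews = [
    "The battery dies by noon.",
    "App is slow and the battery drains fast.",
    "It crashes whenever I open settings.",
]
report = reduce_step(map_step(reviews, fake_llm_extract))
for theme, quotes in report.items():
    print(f"{theme}: {len(quotes)} quote(s)")
```

Passing the extraction function as a parameter keeps the map step independent of any particular model or prompt, so the stub could later be swapped for a real LLM call.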
Key technical insights include task decomposition (splitting a large, complex prompt into smaller, related subtasks), temperature control for deterministic versus creative outputs, and a plug‑in architecture that lets users incorporate external parsers, OCR tools, or Gemini’s native PDF handling. The companion IDE, DocWrangler, adds observability, automatic prompt generation, incremental execution, and LLM‑based guardrails that validate outputs and retry until expectations are met. Users can run full pipelines or isolated map operations, which lets analysts without coding expertise extract insights from clinical notes, contracts, or customer reviews.
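A validate‑and‑retry guardrail of the kind described above might look like the following minimal sketch. All names here (`run_with_guardrail`, `stub_generate`, `stub_validate`) are hypothetical, and the attempt cap is an assumption added so the loop cannot run forever; in a real pipeline the generator would be an LLM call and the validator could be an LLM judge.

```python
from typing import Callable, Optional

def run_with_guardrail(
    generate: Callable[[Optional[str]], dict],
    validate: Callable[[dict], tuple[bool, Optional[str]]],
    max_attempts: int = 3,
) -> tuple[dict, int]:
    """Regenerate until the validator accepts the output or attempts run out."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        output = generate(feedback)          # feedback from the last failure, if any
        ok, feedback = validate(output)
        if ok:
            return output, attempt
    raise ValueError(f"no valid output after {max_attempts} attempts: {feedback}")

# Stub generator: omits a field on the first try, fixes it once given feedback.
def stub_generate(feedback: Optional[str]) -> dict:
    if feedback is None:
        return {"theme": "battery life"}
    return {"theme": "battery life", "quote": "The battery dies by noon."}

# Stub validator: checks that the required fields are present.
def stub_validate(output: dict) -> tuple[bool, Optional[str]]:
    missing = [key for key in ("theme", "quote") if key not in output]
    if missing:
        return False, f"missing fields: {missing}"
    return True, None

result, attempts = run_with_guardrail(stub_generate, stub_validate)
print(result, attempts)
```

Feeding the validator's message back into the next generation attempt is one plausible design; a simpler variant would just resample without feedback.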
Shankar illustrates the approach with concrete examples: extracting “pain points” and associated quotes from product reviews, then summarizing each theme for executive briefs. She likens the resulting data layers to a “semantic bronze‑silver‑gold” hierarchy, where raw extracted tables become queryable assets for downstream retrieval‑augmented generation or traditional analytics. Guardrails, in the form of LLM judges that check for hallucinations or enforce schema constraints, provide a safety net comparable to data‑warehouse expectations.
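To make the two guardrail checks concrete, here is a small deterministic sketch. It is a simplification rather than the LLM‑judge approach the podcast describes: the schema check only requires non‑empty `theme` and `quote` strings (field names assumed for illustration), and the hallucination check only tests whether the quote appears verbatim in the source text.

```python
def check_row(row: dict, source: str) -> list[str]:
    """Return a list of problems found in one extracted row; empty means it passes."""
    problems = []
    # Schema constraint: both fields must be non-empty strings.
    for field in ("theme", "quote"):
        value = row.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{field} must be a non-empty string")
    # Grounding check: a quote that is not in the source was likely invented.
    quote = row.get("quote")
    if isinstance(quote, str) and quote.strip() and quote not in source:
        problems.append("quote not found verbatim in source")
    return problems

source_review = "App is slow and the battery drains fast."
good_row = {"theme": "performance", "quote": "App is slow"}
bad_row = {"theme": "performance", "quote": "The app freezes every day"}

print(check_row(good_row, source_review))
print(check_row(bad_row, source_review))
```

A verbatim match is strict and would reject paraphrased quotes, which is where a model‑based judge could be more forgiving.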
The broader implication is that enterprises can convert large unstructured corpora into structured, queryable formats without building bespoke ML pipelines. This lowers the barrier for business units to gain actionable insights, shortens time‑to‑value, and opens new opportunities for integrating LLM‑generated data into existing data‑lake or knowledge‑graph architectures. It also underscores the need for robust validation to manage LLM nondeterminism.