
Beyond the Gold Standard: Evaluating and Trusting Agents in the Wild // Sanjana Sharma

MLOps Community • February 19, 2026

Why It Matters

Without reliable, context‑rich AI agents, regulated enterprises risk costly errors and lost competitive edge, making reliability infrastructure a strategic imperative.

Key Takeaways

  • Reliability depends on context, not just model size
  • Embedding tacit expert rules dramatically boosts agent accuracy
  • Adopt system-first thinking: versioned context, evaluation, and monitoring throughout
  • Treat AI agents like software: CI/CD, unit tests, and audits
  • Organizational readiness (clear processes, rules, data) is a prerequisite for successful automation

Summary

AI agents look impressive in demos, but production reliability hinges on context, evaluation, and trust. Sanjana Sharma argues enterprises must shift from model‑first to system‑first thinking, embedding explicit business rules, subject‑matter‑expert (SME) heuristics, and versioned context layers.

The talk outlines three context layers—structured data, unstructured documents, and undocumented tacit expertise—and shows how missing the latter caps reliability at 70‑80%. A healthcare case study raised reliability from 73% to 91% by codifying clinicians’ hidden rules without changing the underlying model.
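To make the "codifying tacit rules" idea concrete, here is a minimal, hypothetical sketch (not from the talk): undocumented clinical heuristics such as "a discharge date cannot precede admission" become explicit validators that run over the agent's output, without touching the underlying model. All field names, rules, and values are illustrative assumptions.

```python
from datetime import date

def validate_clinical_dates(record: dict) -> list[str]:
    """Apply codified SME rules to an agent-extracted record.

    Returns a list of rule violations; an empty list means the record
    passes. Each rule is a hypothetical example of tacit expertise made
    explicit -- the checks layer on top of the model, not inside it.
    """
    violations = []
    admit = record.get("admit_date")
    discharge = record.get("discharge_date")
    if admit and discharge and discharge < admit:
        violations.append("discharge_before_admission")
    if admit and admit > date.today():
        violations.append("admission_in_future")
    if record.get("bed_level") not in {"ICU", "step-down", "general"}:
        violations.append("unknown_bed_level")
    return violations

# An agent output that a pure accuracy benchmark might still accept:
record = {"admit_date": date(2025, 3, 10),
          "discharge_date": date(2025, 3, 8),
          "bed_level": "ICU"}
print(validate_clinical_dates(record))  # → ['discharge_before_admission']
```

The point of the pattern is that each violation is auditable and versionable, so reliability gains come from the rule library rather than from retraining.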

She likens building reliable agents to established software engineering practice: unit tests, CI/CD pipelines, version control, and audit logs. Examples include supply-chain agents that orchestrate responses to disruptions and AI "apprentices" that internalize company policies, illustrating the need for uncertainty handling, fallback logic, and human-in-the-loop feedback.
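The "treat agents like software" point could look like the following sketch: a CI-style regression check over a small versioned gold set, plus fallback logic that routes low-confidence outputs to a human review queue. The threshold, the toy agent, and all names here are illustrative assumptions, not the speaker's implementation.

```python
CONFIDENCE_THRESHOLD = 0.85  # illustrative value; tuned per use case in practice

def route(prediction: str, confidence: float) -> tuple[str, str]:
    """Fallback logic: auto-apply high-confidence answers and
    escalate everything else to a human-in-the-loop review queue."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return prediction, "auto"
    return prediction, "human_review"

def regression_suite(agent, gold_cases: list[tuple[str, str]]) -> float:
    """CI-style check: run the agent over a versioned gold set and
    return accuracy; a real pipeline would fail builds below a floor."""
    correct = sum(1 for q, expected in gold_cases if agent(q) == expected)
    return correct / len(gold_cases)

# Hypothetical stub standing in for a deployed agent:
def toy_agent(question: str) -> str:
    return {"2+2?": "4", "capital of France?": "Paris"}.get(question, "unknown")

gold = [("2+2?", "4"), ("capital of France?", "Paris"), ("bed level?", "ICU")]
accuracy = regression_suite(toy_agent, gold)

print(route("ICU", 0.62))  # low confidence → ('ICU', 'human_review')
print(round(accuracy, 2))
```

Running the gold set on every deploy is what turns "it worked in the demo" into a monitored, auditable guarantee.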

Enterprises that invest now in reliability infrastructure—clear processes, rule libraries, continuous evaluation—will unlock autonomous, trustworthy agents, while those that ignore these fundamentals will remain stuck in firefighting mode.

Original Description

March 3rd, Computer History Museum CODING AGENTS CONFERENCE, come join us while there are still tickets left.
https://luma.com/codingagents
Thanks to @ProsusGroup for collaborating on the Agents in Production Virtual Conference 2025.
Abstract //
Building agents is easy; trusting them in production is hard. Accuracy benchmarks and gold datasets only get you so far - once agents are deployed, they face ambiguous data, edge cases, and workflows that don’t exist in neat benchmarks. In this talk, I’ll share technical lessons from deploying agents in high-stakes environments, where reliability matters as much as innovation. Starting from gold datasets, I’ll show how we layered in structured feedback from subject matter experts to build “living ground truth” that evolves with the system. Using healthcare examples - like validating clinical dates and bed levels where 80% accuracy isn’t good enough - I’ll illustrate frameworks for auditing, measuring, and improving agent reliability. The insights generalize beyond healthcare: whether in e-commerce, fraud detection, or logistics, the key challenge is the same - how do you know your agent is ready for production, and how do you keep it trustworthy once it’s there?
Bio //
Passionate about designing AI that serves people, powers enterprises, and teaches AI to understand AI.
A Prosus | MLOps Community Production