Sayd Agzamkhodjaev: “Users Don’t Trust that the System Never Makes Mistakes; They Trust that It Can Safely Recover.”

AI Time Journal · Jan 8, 2026

Why It Matters

The methodology shows how enterprises can achieve reliable, interpretable AI deployments, reducing operational risk while preserving automation speed, and offers a scalable blueprint for high‑stakes environments.

Sayd Agzamkhodjaev: “Users don’t trust that the system never makes mistakes; they trust that it can safely recover.”

Sayd Agzamkhodjaev · Founding Engineer at Treater · "A properly organized pipeline and AI‑agent analytics turn complex LLMs into practical, reliable business tools."

In 2025, companies around the world are actively adopting generative AI technologies and large language models (LLMs). About 72 % of enterprises plan to increase their investments in these technologies over the next year. This creates enormous opportunities for improving efficiency and automation, but it also raises questions about trust in the outputs generated by such systems: how can organizations ensure the stability, interpretability, and scalability of LLM‑based solutions?

Sayd Agzamkhodjaev is a Founding Engineer at Treater with prior experience at Meta, Cohere, and Instabase, where he built LLM pipelines and products for millions of users, as well as corporate AI agents that saved tens of thousands of hours of manual work. His expertise is particularly valuable in the context of global AI adoption: the systematic approaches he developed help organizations trust LLM outputs, scale them, and turn complex technologies into manageable business tools.

In this exclusive interview, Sayd explains how his engineering and product methodologies — from multi‑layer LLM evaluation to AI‑agent analytics — ensure the reliability and interpretability of AI systems, and how to design AI tools so that their outputs can be interpreted, verified, and safely scaled.


“LLM reliability is built through multi‑layer validation”

You created a multi‑layer LLM evaluation pipeline at Treater that reduced errors by roughly 40 %. How did you achieve such reliability and model quality?

The principle was simple: you cannot rely on a single check. We combined multiple perspectives on quality.

  1. Deterministic checks – schemas, types, business rules like “sum cannot be negative” or “retail store IDs must match real ones.”

  2. LLM‑as‑a‑Judge – the model evaluates its own outputs based on rubrics we developed with domain experts.

  3. User feedback – we record user edits and repeat them as tests.

Multi‑layer validation lets us detect problems immediately and address them across different layers.
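
To make the layering concrete, here is a minimal sketch of how such a pipeline could be wired together. It is illustrative only: the `Check` structure, the `judge` callable, and names like `known_store_ids` are assumptions, not Treater's actual code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    passed: bool
    layer: str
    reason: str = ""


def deterministic_checks(output: dict, known_store_ids: set[str]) -> Check:
    # Layer 1: schema, types, and business rules.
    total = output.get("total")
    if not isinstance(total, (int, float)):
        return Check(False, "deterministic", "total is missing or not a number")
    if total < 0:
        return Check(False, "deterministic", "sum cannot be negative")
    if output.get("store_id") not in known_store_ids:
        return Check(False, "deterministic", "store ID does not match a real store")
    return Check(True, "deterministic")


def judge_check(output: dict, rubric: str, judge: Callable[[dict, str], dict]) -> Check:
    # Layer 2: LLM-as-a-Judge scoring the output against an expert-written rubric.
    verdict = judge(output, rubric)  # assumed to return {"pass": bool, "explanation": str}
    return Check(verdict["pass"], "llm_judge", verdict.get("explanation", ""))


def feedback_check(output: dict, past_edits: list[dict]) -> Check:
    # Layer 3: user edits replayed as regression tests.
    for edit in past_edits:
        if output.get(edit["field"]) == edit["rejected_value"]:
            return Check(False, "user_feedback", f"repeats a corrected mistake in '{edit['field']}'")
    return Check(True, "user_feedback")


def validate(output, rubric, judge, known_store_ids, past_edits) -> list[Check]:
    # Run every layer so a failure can be fixed at the layer it belongs to.
    return [
        deterministic_checks(output, known_store_ids),
        judge_check(output, rubric, judge),
        feedback_check(output, past_edits),
    ]
```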

How did your experience at Meta/WhatsApp with millions of users influence your approach to LLM quality control?

I realized that evaluating quality means looking at result distributions, not searching for a single “correct” string. We used impact metrics, not just correctness: A/B tests, gradual rollouts, and rollbacks. It’s important to minimize the “blast radius”: if something goes wrong, the failure should be local, not global. At Treater we applied the same philosophy—guardrails for edge cases, error monitoring, and tracking user behavior.

At Treater, you implemented LLM‑as‑a‑Judge with mandatory explanations for failures. How does this improve interpretability and speed up problem resolution?

Every “failed” output comes with an explanation of why it didn’t pass. This gives engineers and managers insight into where the model misunderstood the task, the data, or the prompt. Errors are grouped by type—“missing price,” “incorrect store,” “hallucinated metric”—and we fix them at the appropriate layer. Over time, recurring patterns become rules for prompts or data checks. Essentially, this is an automated bug‑reporting system for LLMs.
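
As a rough illustration of that bug‑reporting idea, the sketch below groups judge failures by error type. The error categories come from the interview; the `JudgeFailure` record itself is an assumption.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class JudgeFailure:
    output_id: str
    error_type: str   # e.g. "missing price", "incorrect store", "hallucinated metric"
    explanation: str  # mandatory: why the output did not pass


def summarize_failures(failures: list[JudgeFailure]) -> Counter:
    # Count failures per type so recurring patterns can be promoted
    # to prompt rules or deterministic data checks.
    return Counter(f.error_type for f in failures)


failures = [
    JudgeFailure("out-17", "missing price", "No unit price given for SKU 4421."),
    JudgeFailure("out-18", "hallucinated metric", "Cites a 'weekly velocity' field absent from the data."),
    JudgeFailure("out-19", "missing price", "Promotion block omits the discounted price."),
]

print(summarize_failures(failures))
# Counter({'missing price': 2, 'hallucinated metric': 1})
```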


“Self‑correction increases trust”

Your auto‑rewrite cycle allows the system to correct its own mistakes. What did you learn about user trust in LLMs from this feature?

The main takeaway: users don’t trust that the system never makes mistakes; they trust that it can safely recover. The model generates an output, passes it through validations, and if there are fixable errors, it rewrites itself. Attempts are strictly limited, each attempt is logged, and human intervention occurs if the system cannot resolve the issue. Users appreciate the system gradually reaching the correct result rather than trying to be perfect from the start. Self‑correction increases trust, which is evident in daily interactions with LLMs.
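
A bounded self‑correction loop along those lines might look like the sketch below, where `generate`, `validate`, and `rewrite` stand in for the real model calls and `MAX_ATTEMPTS` is an assumed limit.

```python
import logging
from typing import Callable

logger = logging.getLogger("auto_rewrite")

MAX_ATTEMPTS = 3  # assumed value; the point is that the limit is strict and small


def auto_rewrite(task: str,
                 generate: Callable[[str], dict],
                 validate: Callable[[dict], list[str]],
                 rewrite: Callable[[dict, list[str]], dict]) -> dict:
    output = generate(task)
    for attempt in range(1, MAX_ATTEMPTS + 1):
        errors = validate(output)
        logger.info("attempt %d: %d validation errors", attempt, len(errors))
        if not errors:
            return {"status": "ok", "output": output, "attempts": attempt}
        if attempt < MAX_ATTEMPTS:
            # Feed the failure explanations back to the model and try again.
            output = rewrite(output, errors)
    # Could not recover safely: hand off to a human instead of guessing.
    return {"status": "needs_human_review", "output": output, "attempts": MAX_ATTEMPTS}
```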

You analyzed user edits and integrated them into prompt rules. How does this improve model reliability in production?

Every edit is valuable real‑world data. We keep the diff before and after, include context, identify recurring patterns, and turn them into rules: what never to do, what must always be mentioned in certain situations. Over time, the model behaves like an experienced analyst who has internalized all business rules and company style. Reliability grows because the system learns from real data.
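
One way to capture such an edit is as a diff plus context, roughly as sketched below; the `EditRecord` structure is an assumption, not a description of Treater's internal format.

```python
import difflib
from dataclasses import dataclass


@dataclass
class EditRecord:
    context: dict   # e.g. task type, account, data snapshot used
    before: str     # model output
    after: str      # what the user changed it to

    def diff(self) -> str:
        return "\n".join(difflib.unified_diff(
            self.before.splitlines(), self.after.splitlines(),
            fromfile="model_output", tofile="user_edit", lineterm=""))


record = EditRecord(
    context={"task": "weekly_summary", "account": "demo"},
    before="Revenue grew 12% week over week.",
    after="Revenue grew 12% week over week (excluding returns).",
)
print(record.diff())
```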

Which guardrails and deterministic checks were most critical when scaling LLM infrastructure?

The most important are schema and type checks, business rules, allowlists/denylists, idempotency, and safe fallbacks. They may not look flashy, but they make LLMs reliable for enterprise use. When something goes wrong, we prefer “do nothing and ask a human” rather than guessing.
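
For illustration, a minimal version of an allowlist, an idempotency key, and the "do nothing and ask a human" fallback could look like this; the action names and in‑memory store are placeholders.

```python
import hashlib
import json

ALLOWED_ACTIONS = {"draft_email", "update_forecast"}   # allowlist of safe actions
_seen_requests: set[str] = set()                       # naive idempotency store


def idempotency_key(payload: dict) -> str:
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


def execute_safely(action: str, payload: dict) -> dict:
    if action not in ALLOWED_ACTIONS:
        # Unknown or denied action: do nothing and escalate instead of guessing.
        return {"status": "escalated_to_human", "reason": f"action '{action}' not on allowlist"}
    key = idempotency_key({"action": action, **payload})
    if key in _seen_requests:
        # Safe to retry: the same request never executes twice.
        return {"status": "skipped_duplicate"}
    _seen_requests.add(key)
    return {"status": "executed", "action": action}
```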


“Simulators reveal systemic errors”

You built a simulator modeling 8–10 LLM calls in a chain. How does this help detect systemic regressions?

Most failures don’t occur on the third or seventh call but in the interaction of all steps. The simulator runs realistic end‑to‑end flows, compares the final output to a reference, and shows what changed. Simulators uncover systemic errors and allow us to precisely understand what has been validated and how results evolved.
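
In spirit, a simulator run reduces to replaying scenarios end to end and diffing the final output against a reference, as in this sketch. Here `run_pipeline` stands in for the full multi‑step chain, and a real comparison would likely be field‑level or judge‑based rather than an exact string match.

```python
import difflib
from typing import Callable


def simulate(scenarios: list[dict], run_pipeline: Callable[[dict], str]) -> list[dict]:
    regressions = []
    for scenario in scenarios:
        actual = run_pipeline(scenario["input"])
        if actual != scenario["reference_output"]:
            diff = "\n".join(difflib.unified_diff(
                scenario["reference_output"].splitlines(),
                actual.splitlines(),
                fromfile="reference", tofile="actual", lineterm=""))
            regressions.append({"name": scenario["name"], "diff": diff})
    return regressions  # non-empty list means a systemic regression to inspect
```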

At Treater, you built a corporate AI analyst — the Treater Agent — which saves tens of thousands of hours of manual work. What principles of trust and interpretability did you use in its design?

We designed it so every output is understandable: sources, data, and time windows. The agent explains how it reached its conclusion, shows confidence, and presents alternative actions. Risky actions go through human review. Users feel they are interacting not with a black box, but with a transparent, fast junior analyst.

How did your experience deploying LLM pipelines at Instabase and Cohere influence your approach to production model quality?

At Instabase we worked with banks and government clients, where rare cases are the norm. This taught me to care about long‑tail errors and build configurable validation layers, not rely on a single model. At Cohere I saw the importance of real business metrics: response speed, CSAT, and problem resolution. At Treater we combined both approaches: we view quality as a property of the entire system, not of one model.


“Offline metrics and online behavior are two sides of the same coin”

How do offline metrics differ from online quality evaluations, and how has this experience improved reliability at Treater?

Offline metrics come from static test sets: accuracy, F1, rubric scores. Online metrics capture what actually happens in production: user edits, rollbacks, business KPIs. Offline metrics are good for quick iteration and catching obvious regressions. But users ask new questions, data changes, and priorities shift. Offline metrics and online behavior are two sides of the same coin, and we use both to guide pipeline adjustments.
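
As a toy example of using both sides, one could compare an offline rubric pass rate with an online "accepted without edits" rate and flag divergence for review; the 10‑point threshold below is purely an assumption.

```python
def offline_pass_rate(judge_results: list[bool]) -> float:
    # Share of test-set outputs that pass the rubric offline.
    return sum(judge_results) / len(judge_results)


def online_clean_rate(outputs_shipped: int, outputs_edited: int) -> float:
    # Share of production outputs users accepted without editing.
    return 1 - outputs_edited / outputs_shipped


offline = offline_pass_rate([True, True, False, True, True])        # 0.80
online = online_clean_rate(outputs_shipped=200, outputs_edited=70)  # 0.65
if abs(offline - online) > 0.10:
    print(f"offline ({offline:.2f}) and online ({online:.2f}) diverge; trust online")
```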

What impact do online signals have on pipeline performance and system reliability?

They show how the system behaves in the real world—for example, the percentage of editable outputs or how often users override recommendations. When online and offline results diverge, we trust online; it’s the real measure of business trust and value.

Which interpretability practices have proven most useful for teams and clients?

Simple approaches work best.

  • Natural language explanations: “I selected these stores because …”

  • Source tracing: click to see the underlying data.

  • Evidence highlighting: specific metrics or lines.

  • Rules: “three business rules triggered.”

People don’t need complex SHAP plots; they want a clear story and the ability to verify details.
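
The practices above suggest an explanation object roughly like the following sketch, with field names that are assumptions rather than an actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Explanation:
    summary: str                                            # "I selected these stores because ..."
    sources: list[str] = field(default_factory=list)        # links/IDs of the underlying data
    evidence: list[str] = field(default_factory=list)       # specific metrics or lines
    triggered_rules: list[str] = field(default_factory=list)


explanation = Explanation(
    summary="I selected these 12 stores because their promotion lift exceeded 15%.",
    sources=["sales_weekly_2025_w41.csv"],
    evidence=["store 8841: +18.2% lift", "store 7210: +16.7% lift"],
    triggered_rules=["exclude stores with fewer than 4 weeks of history"],
)
```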


“You don’t remove uncertainty, but you build a system resilient to it”

What challenges arise when scaling LLMs for enterprise clients, and how do multi‑layer pipelines help solve them?

The main challenges are non‑determinism, compliance, security, performance, and cost. Multi‑layer pipelines help structure the process: typed outputs, checks, and clear failure scenarios. You can swap models or prompts without breaking guardrails. Cheaper models run early; expensive ones handle critical steps.
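
A simple way to express the "cheaper models early, expensive ones for critical steps" idea is a routing table keyed by pipeline step, as in this sketch with placeholder model names.

```python
STEP_TIERS = {
    "extract_fields": "small-fast-model",
    "draft_summary": "small-fast-model",
    "final_recommendation": "large-careful-model",  # critical step gets the stronger model
}


def pick_model(step: str, default: str = "small-fast-model") -> str:
    # Models or prompts can be swapped here without touching the guardrails.
    return STEP_TIERS.get(step, default)
```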

How do you balance automation (auto‑rewrite, eval pipeline) with human oversight to maintain trust in production AI?

We use risk‑based separation.

  • Low‑risk actions are heavily automated.

  • Medium‑risk actions go through more layers of review with selective human oversight.

  • High‑risk actions require drafts or mandatory human review.

Automation speeds up processes; humans make judgment calls where needed. We track telemetry from both sides and gradually expand what we trust.
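
A minimal sketch of that risk‑based split, with illustrative tier names and routing outcomes, could look like this.

```python
def route(action: str, risk: str) -> str:
    # Risk tiers are assumptions; the real policy would be richer than three strings.
    if risk == "low":
        return "auto_execute"                       # heavily automated
    if risk == "medium":
        return "auto_execute_with_sampled_review"   # extra layers, selective human oversight
    return "draft_for_human_approval"               # high risk: mandatory human review
```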

If you were advising other engineers on building reliable LLM systems, what would you highlight?

  1. Treat prompts and evaluations like code — version, test, validate.

  2. Multi‑layer evaluation — deterministic checks, LLM‑as‑a‑Judge, user feedback.

  3. End‑to‑end simulators to validate complete flows.

  4. Safe self‑correction, measuring online behavior, and aligning with business metrics.

You don’t remove uncertainty, but you build a system resilient to it — that’s real trust in production LLMs.
