LF Live Webinar: Context Engineering for Self-Healing AI SRE

The Linux Foundation
Mar 17, 2026

Why It Matters

Context-engineered AI aims to move incident response from manual, error-prone triage toward automated, rapid remediation, with the goal of measurably reducing mean time to recovery (MTTR) for enterprises managing thousands of Kubernetes clusters.

Key Takeaways

  • Komodor processed over one million real-world Kubernetes incidents.
  • Simple deterministic runbooks failed to scale; context-driven AI was needed.
  • Incident categories expanded beyond six to dozens of nuanced buckets.
  • Memory‑leak vs. memory‑limit cases illustrate need for deep context.
  • Context engineering enables automated root‑cause analysis and remediation.

Summary

The LF Live webinar featured Assaf Saf Salvich, AI Engineering Group Manager at Komodor, outlining the company’s journey toward self-healing, AI-driven Site Reliability Engineering (SRE). He described how Komodor has amassed close to two million real-world Kubernetes incidents and initially tried to address them with deterministic runbooks, before realizing that the approach could not scale.

Early categorization into six broad buckets quickly proved insufficient: the incident taxonomy grew to dozens of nuanced sub-categories, each demanding distinct remediation logic. To cut through the noise, Komodor introduced a “context engine” that aggregates organizational, cluster, cloud, and historical incident data and feeds it into machine-learning models that generate dynamic, situation-specific runbooks.
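As a rough illustration of the aggregation step described above, the following Python sketch merges the four context sources into one record and flattens it into text that a model could consume. The class name `IncidentContext`, its fields, and the `to_prompt` method are hypothetical; the webinar summary does not describe the real context engine's data model or interfaces, and the sample values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentContext:
    """Hypothetical container for the four context sources named above."""
    symptom: str
    organization: dict = field(default_factory=dict)  # e.g. owning team, policies
    cluster: dict = field(default_factory=dict)       # e.g. namespace, resource limits
    cloud: dict = field(default_factory=dict)         # e.g. provider, region
    history: list = field(default_factory=list)       # short notes on past incidents

    def to_prompt(self) -> str:
        """Flatten the context into plain text, one fact per line."""
        lines = [f"symptom: {self.symptom}"]
        sections = (
            ("organization", self.organization),
            ("cluster", self.cluster),
            ("cloud", self.cloud),
        )
        for name, data in sections:
            # Sort keys so the same context always yields the same text.
            for key, value in sorted(data.items()):
                lines.append(f"{name}.{key}: {value}")
        for index, note in enumerate(self.history, start=1):
            lines.append(f"past_incident.{index}: {note}")
        return "\n".join(lines)


# Illustrative values only.
context = IncidentContext(
    symptom="OOMKilled",
    cluster={"namespace": "payments", "memory_limit_mib": 512},
    history=["similar crash resolved by raising the memory limit"],
)
prompt = context.to_prompt()
```

Keeping the sources as separate fields, rather than one merged dictionary, preserves where each fact came from, which matters when two sources disagree.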

Illustrative examples highlighted the perils of shallow analysis: two services—Cash Loader and Event Processor—both exhibited out‑of‑memory crashes, yet one required a simple memory‑limit increase while the other stemmed from a memory leak that would be exacerbated by the same fix. A second case contrasted an order‑processing chain with a data‑analytics pipeline, showing identical storage‑service symptoms but divergent root causes, underscoring the necessity of deep contextual signals.
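The memory-limit versus memory-leak distinction above can be sketched as a toy heuristic: fit a trend line to the memory samples collected before the crash and treat sustained growth as a sign of a leak. This is an illustrative assumption, not the method presented in the webinar; the function names, the 1% growth threshold, and the sample values are all hypothetical.

```python
from statistics import mean


def memory_slope(samples_mib):
    """Least-squares slope of memory usage, in MiB per sample interval."""
    n = len(samples_mib)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = mean(samples_mib)
    numerator = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples_mib))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def classify_oom(samples_mib, limit_mib, growth_threshold=0.01):
    """Classify an out-of-memory crash from pre-crash memory samples.

    Assumption: usage that keeps climbing by more than `growth_threshold`
    of the limit per interval suggests a leak; usage that hovers near the
    limit without a trend suggests the limit is simply too low.
    """
    relative_growth = memory_slope(samples_mib) / limit_mib
    if relative_growth > growth_threshold:
        return "suspected-leak"
    return "limit-too-low"


# Illustrative samples in MiB, both against a 512 MiB limit.
climbing = [100, 150, 200, 250, 300, 350]
plateau = [480, 495, 490, 500, 485, 498]

leak_verdict = classify_oom(climbing, limit_mib=512)
limit_verdict = classify_oom(plateau, limit_mib=512)
```

A trend line alone would misread a service that is still warming up, which is the kind of gap that additional context such as deployment history or traffic patterns is meant to close.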

The broader implication is a shift from static, one-size-fits-all runbooks to adaptive, AI-powered incident remediation. By automating root-cause identification and prescribing context-aware fixes, Komodor aims to shrink mean time to recovery (MTTR) for SRE teams operating at massive scale in cloud-native environments.

Original Description

Sponsored by Komodor
In this webinar, we’ll trace our own reliability journey, from reactive incident chaos to data-driven prevention and, ultimately, AI-powered self-healing. After analyzing over a million real production incidents, we hit the predictability paradox: if most Kubernetes outages follow recognizable patterns that we can systematically address, why do repeatable failures still catch teams off guard?
We discovered the undeniable truth that in modern sprawling Cloud-Native infrastructures, no two issues are the same, and none exist in isolation. Deterministic approaches break at a certain scale, and AI agents can’t replace humans by executing a simple runbook. We’ll review the 6 main categories of failures, how the same error can have different root causes, why the same fix doesn’t always apply, and how to provide AI agents with the right context to achieve human-level reasoning during RCA.
We’ll conclude with a forward-looking view of AI agents as reliability partners, a short demo, and a set of immediate, actionable steps attendees can take to reduce toil and begin building toward autonomous, self-healing operations.
