
The LF Live webinar featured Assaf Saf Salvich, AI Engineering Group Manager at Komodor, outlining the company's journey toward self-healing, AI-driven Site Reliability Engineering (SRE). He described how Komodor has collected close to two million real-world Kubernetes incidents and first tried to address them with deterministic runbooks, before concluding that the approach could not scale. An early categorization into six broad buckets quickly proved insufficient: the incident taxonomy grew to dozens of nuanced sub-categories, each demanding its own remediation logic. To cut through the noise, Komodor introduced a "context engine" that aggregates organizational, cluster, cloud, and historical incident data and feeds it into machine-learning models that generate dynamic, situation-specific runbooks. Two examples illustrated the perils of shallow analysis. In the first, two services, Cache Loader and Event Processor, both crashed with out-of-memory errors, yet one needed only a higher memory limit while the other had a memory leak that the same fix would have made worse. In the second, an order-processing chain and a data-analytics pipeline showed identical storage-service symptoms but had different root causes, underscoring the need for deep contextual signals. The broader implication is a shift from static, one-size-fits-all runbooks to adaptive, AI-powered incident remediation. By automating root-cause identification and prescribing context-aware fixes, the company aims to shrink mean time to recovery (MTTR) for SRE teams operating at large scale in cloud-native environments.
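The out-of-memory example can be made concrete with a small sketch. The code below is not the context engine described in the webinar; it is a hypothetical illustration of why two identical OOM symptoms can call for different remediations once one extra contextual signal, the memory-usage trend before the crash, is taken into account. The names `OomContext`, `growth_per_sample`, and `recommend`, the service labels, the sample data, and the 1% threshold are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OomContext:
    """Hypothetical context record for one out-of-memory incident."""
    service: str
    memory_limit_mib: int
    # Memory usage (MiB) sampled at equal intervals before the OOM kill.
    usage_mib: List[float]


def growth_per_sample(usage: List[float]) -> float:
    """Least-squares slope of usage against sample index (MiB per sample)."""
    n = len(usage)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(usage) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(usage))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def recommend(ctx: OomContext, leak_slope_fraction: float = 0.01) -> str:
    """Pick a remediation from the usage trend (assumed heuristic).

    Sustained growth faster than `leak_slope_fraction` of the limit per
    sample is treated as a probable leak; raising the limit would only
    postpone the next crash. Flat usage near the limit suggests the
    limit is simply too low for the steady-state working set.
    """
    slope = growth_per_sample(ctx.usage_mib)
    if slope > leak_slope_fraction * ctx.memory_limit_mib:
        return "investigate-memory-leak"
    return "raise-memory-limit"


# Two made-up incidents with the same symptom (OOM kill at a 512 MiB limit).
steady = OomContext("steady-service", 512, [500, 498, 503, 499, 502, 500])
leaky = OomContext("leaky-service", 512, [200, 260, 320, 385, 450, 505])

for ctx in (steady, leaky):
    print(ctx.service, "->", recommend(ctx))
```

A static runbook keyed only on the symptom would return the same action for both incidents. A real system would presumably combine many more signals (deployment history, cluster state, past incidents) than this single trend check.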
