GitHub’s CTO Vlad Fedorov detailed three recent availability incidents: a Feb. 9 database overload, a Feb. 2 security-policy-induced failover, and a Mar. 5 Redis failover that left writes disabled. The post explains how a new AI model release, a reduced cache TTL, and peak traffic combined to saturate the database, while telemetry gaps and configuration errors amplified the impact of the failovers. GitHub says it will add finer-grained traffic-shaping controls and improve its incident-response tooling. The post underscores the company’s commitment to greater transparency about outages.

AI SRE platforms from PagerDuty, Datadog, and several startups are emerging to automate incident diagnostics and mitigation, but they largely ignore the coordination side of incident response. The author argues that incident management, which involves aligning multiple responders, preventing fixation, and maintaining...