Surfing Complexity

Publication

Incident analysis, resilience engineering, human factors in SRE/ops; practical guidance on coordination during incidents.

News•Mar 13, 2026

Quick Thoughts on GitHub CTO’s Post on Availability

GitHub’s CTO Vlad Fedorov detailed three recent availability incidents: a Feb. 9 database overload, a Feb. 2 security-policy-induced failover, and a Mar. 5 Redis failover that left writes disabled. The post explains how a new AI model release, a reduced cache TTL, and peak traffic combined to saturate the database, while telemetry gaps and configuration errors amplified the impact of the failovers. GitHub says it will add finer-grained traffic-shaping controls and improve incident-response tooling. The post underscores the company’s commitment to greater transparency about outages.

By Surfing Complexity

News•Feb 14, 2026

Lots of AI SRE, No AI Incident Management

AI SRE tools from PagerDuty, Datadog, and several startups are emerging to automate incident diagnostics and mitigation, but they largely ignore the coordination side of incident response. The author argues that incident management, which involves aligning multiple responders, preventing fixation, and maintaining...

By Surfing Complexity
