Infinite Curiosity
Site reliability engineering (SRE) remains the backbone of modern infrastructure, covering alerting, triage, and remediation across cloud and on-prem environments. Traditional automation has long filtered noisy alerts and grouped related incidents, but the growing volume and variety of telemetry, from logs in Elasticsearch to metrics in Prometheus, makes that approach hard to scale. Generative AI agents can pattern-match across large amounts of historical data and make human-like triage decisions, which reduces mean time to resolution (MTTR) and frees engineers to focus on strategic work.
Deploying Hawkeye, NeuBird's AI SRE product, starts on day one with connecting the agent to existing observability stacks through simple API-key integrations. Within the first week, Hawkeye learns the organization's incident patterns while teams assess its recommendations. Over a 100-day horizon, the system transitions to a "silent" assistant that autonomously handles routine fixes, submits pull requests, and updates ITSM tickets. Crucially, a human-in-the-loop model persists: engineers approve actions, provide feedback, and gradually grant broader autonomy, mirroring the trust-building journey seen in self-driving car deployments.
Transparency is built into Hawkeye through a documented chain of thought, which lets operators query the reasoning behind each remediation and version-control successful strategies. Access control mirrors established API permissions: write capabilities are limited to safe channels such as ITSM notes, while telemetry stays read-only. NeuBird's AI-first development approach, in which agents argue, review code, and iterate rapidly, compresses years of engineering effort into months. The result is a productivity boost, reduced incident volume, and a clearer path for emerging startups to embed agentic systems into their infrastructure pipelines.
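As a rough illustration of the access-control idea described above (read-only telemetry, writes limited to safe channels, human approval before riskier actions), here is a minimal Python sketch. It is not NeuBird's implementation. The class name AgentPolicy, the resource names, and the three decision strings are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass

# Hypothetical action names, for illustration only.
READ = "read"
WRITE = "write"


@dataclass(frozen=True)
class AgentPolicy:
    """Allow-list policy for an agent (assumed design, not a real product API)."""
    readable: frozenset          # resources the agent may read
    writable: frozenset          # resources the agent may write to
    needs_approval: frozenset = frozenset()  # writes that require human sign-off

    def decide(self, action: str, resource: str, approved: bool = False) -> str:
        """Return 'allow', 'deny', or 'pending_approval' for a requested action."""
        if action == READ:
            return "allow" if resource in self.readable else "deny"
        if action == WRITE:
            if resource not in self.writable:
                return "deny"
            if resource in self.needs_approval and not approved:
                return "pending_approval"
            return "allow"
        # Unknown actions are denied by default.
        return "deny"


# Example policy: telemetry is read-only, ITSM notes are writable,
# and pull requests are writable only after an engineer approves.
policy = AgentPolicy(
    readable=frozenset({"logs", "metrics", "itsm_notes"}),
    writable=frozenset({"itsm_notes", "pull_requests"}),
    needs_approval=frozenset({"pull_requests"}),
)

print(policy.decide(READ, "metrics"))
print(policy.decide(WRITE, "metrics"))
print(policy.decide(WRITE, "itsm_notes"))
print(policy.decide(WRITE, "pull_requests"))
print(policy.decide(WRITE, "pull_requests", approved=True))
```

In this sketch, granting the agent broader autonomy over time would amount to removing entries from needs_approval or adding entries to writable, which keeps every change in trust explicit and reviewable.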
Gou Rao is the CEO of NeuBird, which builds an agentic AI Site Reliability Engineer for IT teams. The company has raised $44.5 million from Mayfield and M12. He was previously the CTO of Citrix and Portworx.
(00:01) Introduction
(01:07) What Does an SRE Do?
(02:19) Inside a Typical Incident Flow
(04:16) What Can Be Automated?
(05:52) Deploying Hawkeye: Day 1 to Day 100
(11:59) Earning Trust for Autonomous Agents
(14:57) Versioning Agent Behavior & Chain of Thought
(17:02) Building Agentic Infra Products
(18:38) Access Control for Agents
(20:29) Company Building in the AI Era
(23:53) Competitive Edge in AI + Infra
(26:35) Model Choice & Agent Reasoning Quality
(29:33) Biggest Product Bet
(31:22) Exciting AI Advancements
(33:04) Rapid Fire Round
Where to find Gou Rao:
LinkedIn: https://www.linkedin.com/in/gouthamrao/
Where to find Prateek Joshi:
Research Column: https://www.infrastartups.com
Newsletter: https://prateekjoshi.substack.com
Website: https://prateekj.com
LinkedIn: https://www.linkedin.com/in/prateek-joshi-infinite
X: https://x.com/prateekj