NSDI '26 - Skyline: A Cloud Centric Internet Monitoring Engine
Why It Matters
Skyline gives cloud operators real‑time, end‑to‑end visibility into the public internet. It shortens outage detection and cuts false‑positive alerts (97% precision in production), which translates into higher service reliability and lower operational costs.
Key Takeaways
- Public internet causes 63% of cloud network incidents.
- Bidirectional probing with programmable IGW ensures full cloud subnet coverage.
- Flex Pass uses segment routing to monitor primary and backup paths.
- Flow Hunter provides zero‑overhead, tenant‑level TCP analysis via optical splitters.
- Skyline achieves >99% recall and 97% precision in production.
Summary
The video introduces Skyline, a cloud‑centric internet monitoring engine co‑developed by ByteDance, the University of Hong Kong, and the University of Michigan. It targets the public‑internet segment, which is opaque, massive, and fault‑diverse, and which accounts for 63% of cloud‑network incidents.
Skyline addresses three coverage gaps: bidirectional reachability, incident‑life‑cycle visibility, and tenant‑level diagnostics. It combines BiProp for inbound/outbound probing via programmable IGW switches, Flex Pass that leverages segment‑routing MPLS tunnels to steer probes along primary and backup AS paths, and Flow Hunter that duplicates traffic with optical splitters for zero‑overhead, per‑tenant TCP‑state analysis.
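To illustrate why probing in both directions matters, here is a minimal Python sketch that compares outbound and inbound loss for one cloud subnet and labels which direction is degraded. The names `ProbeStats` and `classify_direction` and the 5% loss threshold are assumptions made for illustration; they are not taken from the talk and do not describe Skyline's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class ProbeStats:
    """Probe counters for one direction (hypothetical structure)."""
    sent: int
    received: int

    @property
    def loss_rate(self) -> float:
        # Treat "no probes sent" as zero loss to avoid dividing by zero.
        return 0.0 if self.sent == 0 else 1.0 - self.received / self.sent


def classify_direction(outbound: ProbeStats, inbound: ProbeStats,
                       threshold: float = 0.05) -> str:
    """Label which direction exceeds the (assumed) loss threshold."""
    out_bad = outbound.loss_rate > threshold
    in_bad = inbound.loss_rate > threshold
    if out_bad and in_bad:
        return "both"
    if in_bad:
        return "inbound-only"
    if out_bad:
        return "outbound-only"
    return "healthy"


# Outbound pings look normal, but 20% of inbound probes are lost.
print(classify_direction(ProbeStats(100, 100), ProbeStats(100, 80)))
```

An outbound‑only monitor would report this subnet as healthy, which is the kind of hidden loss that inbound probing is meant to expose.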
The presenters cite concrete data: 63% of network incidents occur on the public internet, 78% of P0/P1 outages are internet‑related, and Skyline has already detected over 2,000 events. Real‑world cases show inbound probing exposing hidden loss despite normal outbound pings, Flex Pass identifying path repair 20 minutes earlier than ISP notifications, and Flow Hunter catching early TCP‑reset storms caused by misconfigured security devices.
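A TCP‑reset storm of the kind mentioned above could be flagged per tenant with a simple ratio check over one time window. The sketch below is only an assumption about how such a check might look: the function name `rst_storm_tenants`, the input format, and both thresholds are hypothetical and are not described in the talk.

```python
from collections import defaultdict


def rst_storm_tenants(flows, min_flows=100, rst_ratio=0.3):
    """Return tenants whose share of reset-terminated flows is unusually high.

    flows: iterable of (tenant_id, ended_with_rst) pairs observed in one
    time window. Both thresholds are illustrative assumptions.
    """
    totals = defaultdict(int)
    resets = defaultdict(int)
    for tenant, ended_with_rst in flows:
        totals[tenant] += 1
        if ended_with_rst:
            resets[tenant] += 1
    # Require a minimum flow count so tiny samples do not trigger alerts.
    return sorted(
        tenant for tenant, count in totals.items()
        if count >= min_flows and resets[tenant] / count >= rst_ratio
    )


window = (
    [("tenant-a", True)] * 60 + [("tenant-a", False)] * 40   # 60% resets
    + [("tenant-b", False)] * 200                            # no resets
    + [("tenant-c", True)] * 5                               # too few flows
)
print(rst_storm_tenants(window))
```

Grouping by tenant keeps one tenant's misconfigured security device from being averaged away by healthy traffic from everyone else.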
Across two years of production data, Skyline achieves >99% recall and 97% precision. That accuracy, combined with low monitoring overhead, lets cloud providers diagnose and remediate internet faults and assure service continuity faster than traditional tools, improving customer experience and operational efficiency.
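For readers unfamiliar with the two metrics, recall is the share of real incidents that were detected, and precision is the share of alerts that were real incidents. The short Python sketch below computes both from alert counts; the counts are hypothetical numbers chosen to land near the reported figures and are not data from the talk.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int):
    """Standard definitions: precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    alerts = true_pos + false_pos
    incidents = true_pos + false_neg
    precision = true_pos / alerts if alerts else 0.0
    recall = true_pos / incidents if incidents else 0.0
    return precision, recall


# Hypothetical counts: 970 correct alerts, 30 false alarms, 5 missed incidents.
precision, recall = precision_recall(true_pos=970, false_pos=30, false_neg=5)
print(f"precision={precision:.2%} recall={recall:.2%}")
```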