NSDI '26 - Skyline: A Cloud Centric Internet Monitoring Engine

USENIX · Jun 4, 2026

Why It Matters

Skyline gives cloud operators visibility into the public internet, a segment they depend on but do not control. It covers both traffic directions, the full incident lifecycle, and individual tenants, and in production it reports >99% recall and 97% precision, which shortens reaction time and reduces false alarms.

Key Takeaways

  • Public internet causes 63% of cloud network incidents.
  • Bidirectional probing with programmable IGW ensures full cloud subnet coverage.
  • Flex Pass uses segment routing to monitor primary and backup paths.
  • Flow Hunter provides zero‑overhead, tenant‑level TCP analysis via optical splitters.
  • Skyline achieves >99% recall and 97% precision in production.

Summary

The video introduces Skyline, a cloud‑centric internet monitoring engine co‑developed by ByteDance, the University of Hong Kong, and the University of Michigan. It tackles the opaque, massive, and fault‑diverse public‑internet segment that accounts for the majority of cloud‑network incidents.

Skyline addresses three coverage gaps: bidirectional reachability, incident‑life‑cycle visibility, and tenant‑level diagnostics. It combines three components: BiProp, which probes both the inbound and outbound directions through programmable IGW switches; Flex Pass, which uses segment‑routing MPLS tunnels to steer probes along primary and backup AS paths; and Flow Hunter, which duplicates traffic with optical splitters for zero‑overhead, per‑tenant TCP‑state analysis.
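To make the per‑tenant TCP analysis idea concrete, here is a minimal Python sketch of how a passive analyzer might compute a reset ratio per tenant from mirrored segments. It is not Skyline's implementation: the `MirroredSegment` record, the function names, and the `threshold` and `min_segments` defaults are all assumptions chosen for illustration. Only the RST flag bit (0x04) comes from the TCP specification.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List

TCP_RST = 0x04  # RST bit in the TCP flags byte


@dataclass(frozen=True)
class MirroredSegment:
    """One TCP segment seen on the mirrored copy (hypothetical record shape)."""
    tenant_id: str
    tcp_flags: int


def reset_ratio_by_tenant(segments: Iterable[MirroredSegment]) -> Dict[str, float]:
    """Return the fraction of segments carrying RST, keyed by tenant."""
    totals: Counter = Counter()
    resets: Counter = Counter()
    for seg in segments:
        totals[seg.tenant_id] += 1
        if seg.tcp_flags & TCP_RST:
            resets[seg.tenant_id] += 1
    return {tenant: resets[tenant] / totals[tenant] for tenant in totals}


def tenants_over_threshold(
    segments: Iterable[MirroredSegment],
    threshold: float = 0.2,   # assumed alerting threshold, not from the talk
    min_segments: int = 10,   # assumed minimum sample size per tenant
) -> List[str]:
    """List tenants whose reset ratio exceeds the threshold, sorted by id."""
    materialized = list(segments)
    totals = Counter(seg.tenant_id for seg in materialized)
    ratios = reset_ratio_by_tenant(materialized)
    return sorted(
        tenant
        for tenant, ratio in ratios.items()
        if totals[tenant] >= min_segments and ratio > threshold
    )
```

Because the analysis runs on a copy of the traffic, a sketch like this adds no load to the forwarding path; a production system would also need windowing and per‑flow state, which are omitted here.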

The presenters cite concrete data: 63% of network incidents occur on the public internet, 78% of P0/P1 outages are internet‑related, and Skyline identified over 2,000 incidents in 2025. Real‑world cases show inbound probing exposing hidden loss while outbound pings looked normal, Flex Pass detecting a path repair 20 minutes before the ISP's notification, and Flow Hunter catching early TCP‑reset storms caused by misconfigured security devices.
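The inbound‑loss case shows why each direction has to be measured separately. The following Python sketch classifies a destination from two independent probe counters. The `ProbeStats` type, the label strings, and the 5% loss threshold are assumptions for illustration and do not describe how Skyline itself classifies incidents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeStats:
    """Probe counters for one direction (hypothetical record shape)."""
    sent: int
    received: int

    @property
    def loss_rate(self) -> float:
        if self.sent <= 0:
            raise ValueError("sent must be positive")
        return 1.0 - self.received / self.sent


def classify_direction(
    outbound: ProbeStats,
    inbound: ProbeStats,
    loss_threshold: float = 0.05,  # assumed threshold, not from the talk
) -> str:
    """Label which direction, if any, shows loss above the threshold."""
    out_bad = outbound.loss_rate > loss_threshold
    in_bad = inbound.loss_rate > loss_threshold
    if out_bad and in_bad:
        return "both_directions"
    if in_bad:
        return "inbound_only"
    if out_bad:
        return "outbound_only"
    return "healthy"
```

With clean outbound counters and 20% inbound loss, `classify_direction(ProbeStats(100, 100), ProbeStats(100, 80))` returns `"inbound_only"`, the kind of fault an outbound‑only monitor would miss.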

With >99% recall and 97% precision across two years of production data, Skyline delivers accurate, low‑overhead monitoring that lets cloud providers diagnose and remediate internet faults faster than traditional tools, improving the reliability of the cloud network for its tenants.
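For readers less familiar with the two metrics, the Python sketch below shows how precision and recall follow from alert counts. The counts in the usage note are invented to land near the reported figures; they are not data from the talk or the paper.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of raised alerts that matched a confirmed incident."""
    raised = true_positives + false_positives
    if raised == 0:
        raise ValueError("no alerts were raised")
    return true_positives / raised


def recall(true_positives: int, false_negatives: int) -> float:
    """Share of confirmed incidents that the monitor actually flagged."""
    confirmed = true_positives + false_negatives
    if confirmed == 0:
        raise ValueError("no confirmed incidents")
    return true_positives / confirmed
```

As an illustration, 970 correct alerts, 30 false alarms, and 5 missed incidents give `precision(970, 30)` of 0.97 and `recall(970, 5)` of about 0.995. High recall means few incidents go unnoticed; high precision means operators rarely chase false alarms.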

Original Description

Skyline: A Cloud Centric Internet Monitoring Engine
Shixian Guo, ByteDance; Ziqian Liu, The University of Hong Kong; Yangyang Bai, Yuan Chen, Kefei Liu, Qi Zhang, Songlin Liu, Yang Lv, Jianwei Hu, Gen Li, Zhenyang Zhong, Sisi Wen, Yongbin Dong, Feng Luo, Anjian Chen, Rui Han, Jiale Feng, Lingpei Meng, Siwan Chen, Hang Li, Shuai Xu, Juntao Zhong, and Chaoran Hu, ByteDance; Yibo Huang, University of Michigan; Yiming Qiu, The University of Hong Kong
Cloud providers depend on the public Internet to connect tenants and their clients, yet Internet faults are a leading cause of cloud outages: in our organization, more than 60% of network incidents happen in the Internet and account for close to 80% of user-impacting events. Effectively monitoring the Internet is challenging for cloud providers because they lack direct control and visibility into Internet internals. Our key insight is to treat coverage as a first-class goal and decompose the monitoring requirements into three coverage dimensions—traffic direction, incident lifecycle, and tenant granularity—then resolve each independently. We present Skyline, a cloud-centric Internet monitoring system that addresses all coverage dimensions at scale by combining purpose-built dataplane hooks with lightweight software control to minimize resource overhead, shorten reaction time, and preserve non-intrusiveness. Skyline has been deployed for more than two years. In 2025, it identified over 2,000 incidents with very high precision and recall over confirmed issues, thereby significantly improving the reliability of our cloud network.
