
How Skyscanner Scales OpenTelemetry: Managing Collectors Across 24 Production Clusters
Why It Matters
The approach proves that a small platform team can deliver enterprise‑scale, cost‑effective observability without locking into a single vendor, setting a template for other microservice‑heavy firms. It also demonstrates how strategic collector design can reduce data volume, improve metric quality, and streamline incident response.
Key Takeaways
- Skyscanner runs OpenTelemetry collectors across 24 Kubernetes clusters
- Central DNS with Istio routes telemetry to the nearest collector
- Span metrics connector converts Istio spans into low‑cardinality metrics
- Filter processor unsets error status on routine 404 responses so they do not skew error‑based trace sampling
- Base Java image disables all instrumentations by default, then enables a curated subset
Pulse Analysis
Observability at scale is a persistent challenge for companies that run thousands of microservices. Skyscanner’s decision in 2021 to adopt the OpenTelemetry Collector as a vendor‑agnostic backbone allowed it to sidestep lock‑in while standardizing data collection across a globally distributed fleet. By centralizing the entry point with a DNS address and letting Istio perform intelligent, nearest‑node routing, the architecture minimizes latency and simplifies network topology, a model that other enterprises can replicate without extensive custom code.
The technical design hinges on two collector patterns: a Gateway replica set that processes bulk OTLP traces and metrics, and an Agent daemon set that scrapes Prometheus endpoints for services lacking native OTLP support. A standout innovation is the span‑metrics connector, which ingests Istio‑generated spans, transforms them to semantic‑convention names, and emits aggregated metrics such as http.client.duration. This eliminates the need for per‑service instrumentation while keeping cardinality under control. Additionally, a targeted filter processor unsets error status on routine 404 responses from cache services, preventing the collector from inflating error‑based sampling rates.
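The two processing steps described above can be illustrated with a minimal Python sketch. It is not the collector's implementation or Skyscanner's configuration: the `Span` type, the `CACHE_SERVICES` set, and the bucket boundaries are all hypothetical, chosen only to show how clearing the error status on routine cache 404s and aggregating spans by a few low‑cardinality attributes might work.

```python
import bisect
from collections import defaultdict
from dataclasses import dataclass

METRIC_NAME = "http.client.duration"  # metric name mentioned in the article

# Assumptions for illustration only: service names and bucket bounds
# are hypothetical, not taken from Skyscanner's setup.
CACHE_SERVICES = {"cache-service"}
BUCKETS_MS = (5, 10, 25, 50, 100, 250, 500, 1000)


@dataclass
class Span:
    service: str
    http_status: int
    duration_ms: float
    is_error: bool
    trace_id: str = ""  # high-cardinality field, never used as a metric key


def unset_routine_404(span: Span) -> Span:
    """Clear the error flag on expected 404s from cache services."""
    if span.service in CACHE_SERVICES and span.http_status == 404:
        span.is_error = False
    return span


def span_metrics(spans):
    """Aggregate spans into per-(service, status) duration histograms.

    Only low-cardinality attributes form the key, so the number of
    series stays bounded no matter how many spans arrive.
    """
    histograms = defaultdict(lambda: [0] * (len(BUCKETS_MS) + 1))
    for span in spans:
        key = (span.service, span.http_status)
        # Index of the first bucket whose upper bound is >= the duration;
        # the final slot collects everything above the largest bound.
        idx = bisect.bisect_left(BUCKETS_MS, span.duration_ms)
        histograms[key][idx] += 1
    return dict(histograms)
```

Keying the histogram on service and status code alone is what keeps cardinality in check: per‑request identifiers such as trace IDs stay on the spans and never become metric dimensions.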
Operationally, Skyscanner’s platform team embeds auto‑instrumentation into a shared Java base image, disabling all instrumentations by default and enabling only a curated subset (HTTP, gRPC, runtime). This opinionated stance reduces noise and cost, while still allowing service teams to opt in to additional probes. Collectors are updated on a six‑month cadence with progressive rollouts via Argo CD, catching configuration issues early. The experience underscores that modestly sized teams can manage sophisticated telemetry pipelines, offering a repeatable blueprint for firms seeking reliable, cost‑efficient observability at scale.
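As a rough sketch of the "disable all, enable a curated subset" pattern, the helper below builds environment variables following the OpenTelemetry Java agent's documented naming convention (`OTEL_INSTRUMENTATION_COMMON_DEFAULT_ENABLED` and `OTEL_INSTRUMENTATION_<NAME>_ENABLED`). The function name `java_agent_env` and the instrumentation names in `CURATED` are assumptions for illustration; the article does not list Skyscanner's actual selection.

```python
def java_agent_env(enabled):
    """Build env vars that turn every instrumentation off, then re-enable a subset."""
    env = {"OTEL_INSTRUMENTATION_COMMON_DEFAULT_ENABLED": "false"}
    for name in enabled:
        # Instrumentation names map to env vars by upper-casing
        # and replacing hyphens with underscores.
        key = "OTEL_INSTRUMENTATION_" + name.upper().replace("-", "_") + "_ENABLED"
        env[key] = "true"
    return env


# Hypothetical curated list standing in for "HTTP, gRPC, runtime".
CURATED = ["java-http-client", "grpc", "runtime-telemetry"]

if __name__ == "__main__":
    # Emit lines that could be pasted into a base image's Dockerfile.
    for key, value in java_agent_env(CURATED).items():
        print(f"ENV {key}={value}")
```

Because the defaults live in the base image, a service team wanting an extra probe would only need to set one more `..._ENABLED=true` variable in its own deployment.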