SREcon26 Americas - Intelligent Load Balancing in Kubernetes
Why It Matters
By eliminating request skew at scale, Databricks reduces error rates, improves latency, and saves compute resources, a critical advantage for any multi‑cloud, high‑throughput SaaS provider.
Key Takeaways
- Kubernetes balances connections, not requests, causing request skew.
- gRPC over HTTP/2 multiplexing amplifies pod traffic imbalance.
- Resetting connections, headless services, and service mesh proved insufficient.
- Custom intelligent load balancer built for fairness, efficiency, and zone awareness.
- Push‑based client‑side routing eliminates extra hops and improves latency.
Summary
The SREcon26 talk details Databricks’ effort to fix request imbalance in its Kubernetes‑based services by replacing the platform’s default load balancing with a custom, intelligent load balancer.
Databricks discovered that Kubernetes distributes connections uniformly, not individual requests. Because its traffic relies heavily on gRPC over HTTP/2, a single long‑lived connection can carry thousands of multiplexed requests, so an even spread of connections still produced 4‑5× traffic skew across pods, causing 5xx errors and P99 latency spikes.
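The gap between connection-level and request-level balancing can be illustrated with a small, deterministic sketch. The pod count, connection count, and per-connection request volumes below are illustrative assumptions, not figures from the talk; the point is only that evenly spread connections do not imply evenly spread requests.

```python
from collections import Counter

NUM_PODS = 10
NUM_CONNS = 40

# Assumed workload: 4 "heavy" long-lived connections multiplex many requests,
# while the remaining 36 carry light traffic. These numbers are made up.
requests_per_conn = [5000 if i < 4 else 100 for i in range(NUM_CONNS)]

# Connection-level balancing: every connection is pinned to one pod.
# Round-robin gives each pod exactly the same number of connections.
conn_to_pod = [i % NUM_PODS for i in range(NUM_CONNS)]
conns_per_pod = Counter(conn_to_pod)

# Requests follow their connection, so pod load depends on which
# connections a pod happened to receive.
load_by_connection = Counter()
for conn, n in enumerate(requests_per_conn):
    load_by_connection[conn_to_pod[conn]] += n

# Request-level balancing: each individual request is spread round-robin.
total_requests = sum(requests_per_conn)
load_by_request = Counter(i % NUM_PODS for i in range(total_requests))


def skew(load):
    """Ratio of the busiest pod's load to the mean pod load (1.0 = even)."""
    loads = [load.get(p, 0) for p in range(NUM_PODS)]
    return max(loads) / (sum(loads) / NUM_PODS)


if __name__ == "__main__":
    print("connections per pod:", sorted(conns_per_pod.values()))
    print("skew, connection-level:", round(skew(load_by_connection), 2))
    print("skew, request-level:  ", round(skew(load_by_request), 2))
```

In this toy setup every pod holds the same number of connections, yet the pods that received a heavy connection serve several times the requests of the others, while per-request distribution stays flat.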
Initial mitigations (periodic connection resets, headless services for client‑side DNS load balancing, and a full service mesh) each fell short: they added CPU overhead, ran into DNS‑caching limits, or introduced prohibitive proxy hops. One line quoted in the talk summed up the problem: “all pods are equal, but some pods are more equal than others.”
The team ultimately built a push‑based, client‑library‑driven balancer that incorporates pod load, health, and zone metadata, delivering uniform request distribution without extra network hops. This architecture enables Databricks to run 1,500+ clusters across three clouds while maintaining low latency and cost efficiency.
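As a rough sketch of what a push-based, client-side balancer of this kind might look like, the example below keeps a local view of endpoints that a control plane pushes to the client, then picks a healthy endpoint per request, preferring the client's own zone and the lowest in-flight count. The `Endpoint` and `ClientSideBalancer` names, the `on_push` callback, and the least-in-flight policy are all hypothetical choices for illustration; the summary does not describe Databricks' actual selection algorithm or API.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """One pod as seen by the client (hypothetical shape)."""
    address: str
    zone: str
    healthy: bool = True
    in_flight: int = 0  # requests this client currently has outstanding


class ClientSideBalancer:
    """Picks a pod per request from a pushed endpoint snapshot (sketch only)."""

    def __init__(self, local_zone: str):
        self.local_zone = local_zone
        self.endpoints: dict[str, Endpoint] = {}

    def on_push(self, snapshot: list[Endpoint]) -> None:
        # Assumed: a control plane pushes the full endpoint list on changes,
        # so the client never polls DNS. In-flight counts reset in this sketch.
        self.endpoints = {e.address: e for e in snapshot}

    def pick(self) -> Endpoint:
        healthy = [e for e in self.endpoints.values() if e.healthy]
        if not healthy:
            raise RuntimeError("no healthy endpoints")
        # Zone awareness: stay in the local zone when it has healthy pods.
        local = [e for e in healthy if e.zone == self.local_zone]
        candidates = local or healthy
        # Fairness: choose the least-loaded candidate; address breaks ties.
        chosen = min(candidates, key=lambda e: (e.in_flight, e.address))
        chosen.in_flight += 1
        return chosen

    def done(self, endpoint: Endpoint) -> None:
        # Call when the request completes so the load estimate stays current.
        endpoint.in_flight = max(0, endpoint.in_flight - 1)


if __name__ == "__main__":
    lb = ClientSideBalancer(local_zone="zone-a")
    lb.on_push([
        Endpoint("10.0.0.1", "zone-a"),
        Endpoint("10.0.0.2", "zone-a"),
        Endpoint("10.0.0.3", "zone-a", healthy=False),
        Endpoint("10.0.1.1", "zone-b"),
    ])
    print([lb.pick().address for _ in range(4)])
```

Because the decision is made inside the client library, each request goes straight to the chosen pod with no intermediate proxy. A production balancer would also need server-reported load, connection management, and thread safety, none of which are modelled here.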