How Netflix Serves ML Predictions to 250M Users at 1 Million Requests Per Second

Better Engineers•May 30, 2026

Key Takeaways

•Switchboard processes >1M ML requests per second for 250M users
•Supports hundreds of model types on shared infrastructure
•Enables instant rollbacks and rapid model experiments
•Abstracts model versioning from client services
•Handles multiple concurrent A/B tests per user

Pulse Analysis

The serving layer is the hidden engine behind every Netflix interaction, yet it receives far less attention than model training. At a scale of 250 million users and more than one million inference calls each second, traditional API gateways or generic service meshes cannot meet the latency and reliability demands. Netflix’s Switchboard was created to fill this gap, providing a purpose‑built router that can dispatch requests to the appropriate model version in microseconds while maintaining high throughput and fault tolerance.

Switchboard’s architecture is organized into three logical layers. The front‑end router ingests incoming traffic, performs feature extraction, and dynamically selects the correct model based on user context, experiment flags, and service contracts. Behind it, a pool of model execution nodes runs containerized versions of each ML service, isolated for quick rollbacks and seamless A/B testing. A control plane orchestrates version deployments, health checks, and traffic shadowing, allowing engineers to push new models without disrupting the user experience. This separation of concerns lets Netflix experiment with dozens of model variants simultaneously, each serving a fraction of the traffic while the router guarantees consistent latency.
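The routing step described above, picking a model version from user context and experiment configuration, can be illustrated with a minimal sketch. This is not Switchboard's code: the `Variant` type, the `bucket` and `select_model` functions, the experiment name, and the version labels are all hypothetical, and deterministic hash bucketing is simply a common technique for splitting traffic, assumed here for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One arm of an experiment (hypothetical structure, for illustration)."""
    model_version: str    # label of the model version to route to
    traffic_share: float  # fraction of traffic; shares are assumed to sum to 1.0


def bucket(user_id: str, experiment: str) -> float:
    """Map a (user, experiment) pair to a stable number in [0, 1).

    Hashing the experiment name together with the user ID means the same
    user can land in different arms of different concurrent experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    # Keep 53 bits so the quotient is exactly representable and below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def select_model(user_id: str, experiment: str, variants: list[Variant]) -> str:
    """Return the model version whose traffic slice contains the user's bucket."""
    point = bucket(user_id, experiment)
    cumulative = 0.0
    for variant in variants:
        cumulative += variant.traffic_share
        if point < cumulative:
            return variant.model_version
    # Fallback if the shares sum to slightly less than 1.0 due to rounding.
    return variants[-1].model_version


# Hypothetical experiment: 90% of traffic on the stable model, 10% on a candidate.
ranking_test = [Variant("ranker-v1", 0.9), Variant("ranker-v2", 0.1)]
# Rolling back amounts to pointing all traffic at the stable version again.
rolled_back = [Variant("ranker-v1", 1.0)]

if __name__ == "__main__":
    print(select_model("user-42", "ranking_test", ranking_test))
```

Because the assignment depends only on the user ID and the experiment name, a client service never needs to know which version served it, and a rollback is a change to the variant table rather than to any caller.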

For the broader tech industry, Switchboard illustrates that scaling ML inference is as much an engineering problem as a data‑science one. Companies aiming for real‑time personalization must invest in routing intelligence, version isolation, and observability to avoid bottlenecks that can erode user satisfaction. Netflix’s open‑source contributions around request routing and model lifecycle management are likely to influence emerging standards for ML serving platforms, pushing the ecosystem toward more modular, low‑latency solutions that can keep pace with ever‑growing AI workloads.
