Online Retailer Zalando Trims AWS Bill by Getting Manual with Flink Stream Filtering

The Stack (TheStack.technology) • March 9, 2026

Why It Matters

The case highlights how high‑level stream‑processing abstractions can become cost‑prohibitive at scale, prompting enterprises to balance elegance against operational efficiency. It offers a concrete roadmap for other retailers facing similar state‑management challenges.

Key Takeaways

•Flink 1.20 chained joins inflated state to 235‑245 GB per job
•Hourly snapshots maxed out CPU; restarts took 12‑20 minutes
•Custom KeyedProcessFunction cut AWS bill by 13%
•Restart time fell to 4‑5 minutes after redesign
•Flink 2.1 multi‑join may solve issue natively

Pulse Analysis

Zalando’s data‑engineering team runs a large real‑time pipeline that merges pricing, inventory, and sponsored‑product streams into a unified view for shoppers. Apache Flink’s Table API and SQL promise declarative simplicity, but the retailer found that Flink 1.20 treats each join as an isolated operator, each one retaining every incoming record in RocksDB. With four joins chained together, state compounds to 235‑245 GB per job, and the hourly snapshots of that state monopolize CPU for more than ten minutes, inflating AWS spend and destabilizing the service.
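A back-of-the-envelope model can make the state growth concrete. The sketch below is not Flink code; it assumes one retained row per key per stream, that every binary join stores both of its inputs, and that the joined row passed to the next join is as wide as its inputs combined. The function names and the 100-byte row sizes are hypothetical.

```python
def chained_join_state(row_sizes):
    """Bytes per key held by a chain of binary joins (toy model).

    Assumption: each join keeps its left and right input in state, and
    the left input of every join after the first is the widened output
    of the previous join.
    """
    total = 0
    left_width = row_sizes[0]
    for right_width in row_sizes[1:]:
        total += left_width + right_width  # this join stores both sides
        left_width += right_width          # the joined row grows wider
    return total


def single_operator_state(row_sizes):
    """Bytes per key if one keyed operator holds each stream's row once."""
    return sum(row_sizes)


# Five input streams joined by four chained joins, 100 bytes per row.
sizes = [100, 100, 100, 100, 100]
chained = chained_join_state(sizes)
single = single_operator_state(sizes)
print(chained, single, chained / single)
```

Under these assumptions the chained plan holds 1,400 bytes per key against 500 for a single operator, and the gap widens with every additional join. Real figures depend on row widths, key cardinality, and how many records per key each join must retain.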

To regain control, Zalando engineers rewrote the pipeline as a single DataStream job keyed by SKU, built around a custom MultiStreamJoinProcessor. By discarding out‑of‑order events before they reached state, the processor avoids unnecessary writes and shrinks the overall state footprint. The result was a 13% reduction in the AWS bill for this workload and a drop in restart time from 12‑20 minutes to roughly four to five minutes. The hands‑on approach is more verbose than the original SQL, but it proved essential for maintaining latency guarantees and operational reliability at a scale of billions of events per day.
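The idea of filtering stale events before they touch state can be illustrated with a small plain-Python model. This is a sketch under stated assumptions, not Zalando's MultiStreamJoinProcessor and not the Flink API: the class name, stream names, integer timestamps, and the rule that ties count as stale are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass
class MultiStreamJoinModel:
    """Toy per-SKU join: keeps only the newest event of each stream."""

    streams: Tuple[str, ...]
    # state[sku][stream] = (timestamp, payload)
    state: Dict[str, Dict[str, Tuple[int, Any]]] = field(default_factory=dict)
    dropped: int = 0

    def process(self, stream: str, sku: str, ts: int,
                payload: Any) -> Optional[Dict[str, Any]]:
        per_key = self.state.setdefault(sku, {})
        current = per_key.get(stream)
        # Assumption: an event no newer than the stored one is stale,
        # so it is dropped without any state write.
        if current is not None and ts <= current[0]:
            self.dropped += 1
            return None
        per_key[stream] = (ts, payload)
        # Emit the merged view of whatever streams have arrived so far.
        return {name: per_key[name][1]
                for name in self.streams if name in per_key}


joiner = MultiStreamJoinModel(streams=("price", "inventory", "sponsored"))
joiner.process("price", "sku-1", 10, 19.99)
joiner.process("inventory", "sku-1", 11, 5)
late = joiner.process("price", "sku-1", 9, 24.99)   # older than ts=10
fresh = joiner.process("price", "sku-1", 12, 17.99)
print(late, fresh, joiner.dropped)
```

In this model the late price event returns nothing and leaves state untouched, while the newer one replaces the stored price and emits an updated merged row. State stays bounded at one entry per stream per SKU, which is the property the redesign relies on.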

The experience underscores a broader lesson for enterprises adopting managed stream‑processing services: abstractions are valuable, but they must be evaluated against real‑world cost and performance metrics. Flink 2.1’s experimental multi‑way join operator promises to address the state‑bloat issue natively, potentially allowing companies to retain the elegance of SQL without the hidden infrastructure overhead. Until such features mature, a hybrid strategy—leveraging declarative APIs for the majority of use cases while falling back to custom low‑level processing for edge cases—offers the best balance of developer productivity and fiscal responsibility.