Serverless Spark Isn't Always the Answer: A Case Study

DZone – Big Data Zone, Jan 12, 2026

Why It Matters

The hybrid Glue + DuckDB approach slashes costs by an order of magnitude while delivering required latency, enabling businesses to scale analytics without ballooning budgets or ops burden.

Key Takeaways

  • Pre‑aggregate 80% of the data to shrink the real‑time workload.
  • Glue ETL handles batch processing for under $500 per month.
  • DuckDB on Fargate serves on‑demand queries in under 2 minutes.
  • EMR Serverless could cost $15‑30k per month for the same load.
  • Pick technology by data volatility, SLA, and team skill set.

Pulse Analysis

Processing billions of records with strict latency constraints has traditionally pushed enterprises toward heavyweight solutions like EMR Serverless or managed data warehouses. While Spark offers unparalleled scalability, its cold‑start penalties, per‑vCPU pricing, and shared‑account resource limits can quickly erode cost efficiency, especially for workloads that spike intermittently. Moreover, the operational overhead of monitoring multiple Spark UI instances and managing concurrency caps adds hidden complexity that many data teams struggle to absorb.

The case study demonstrates a pragmatic alternative: use AWS Glue ETL to run batch jobs that pre‑aggregate the bulk of static and slowly‑changing data, then hand off the trimmed dataset to DuckDB running in ECS Fargate for real‑time user requests. Glue’s 1‑2‑minute initialization is negligible when jobs execute only a handful of times per day, and its Spark engine still handles the heavy lifting of processing 500 M+ rows. DuckDB, with its in‑process vectorized engine, loads the pre‑aggregated Parquet files in milliseconds and executes SQL logic within a sub‑two‑minute window, all without the need for a persistent cluster. This split architecture drives monthly spend below $500, compared with the $15‑30 k projected for a pure EMR Serverless deployment, and keeps weekly ops time under three hours.

For organizations evaluating similar workloads, the key is to map data volatility against service level objectives and internal skill sets. If more than 80% of the data can be refreshed daily or less often, a batch‑first strategy with Glue and a lightweight query engine like DuckDB or even SQLite can satisfy most SLAs. Conversely, use EMR Serverless, StarRocks, or other MPP databases when sub‑second responses, continuous streaming, or complex analytical SQL are non‑negotiable. The decision framework outlined (assess volatility, define SLA tiers, audit team expertise, and model cost at scale) provides a repeatable path to an architecture that balances performance, cost, and operational simplicity.
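The decision framework above can be encoded as a simple heuristic. The 80% volatility threshold and the two‑minute SLA come from the case study; the function name, signature, and the "sub‑second" streaming cutoff are this sketch's own assumptions, not the article's.

```python
def recommend_engine(static_fraction: float, sla_seconds: float,
                     needs_streaming: bool) -> str:
    """Suggest a query architecture from data volatility and SLA.

    static_fraction: share of data refreshable daily or less often (0..1)
    sla_seconds: maximum acceptable query latency
    needs_streaming: whether continuous ingestion is a hard requirement
    """
    if needs_streaming or sla_seconds < 1:
        # Sub-second answers or continuous streams call for an
        # always-on MPP or Spark stack.
        return "EMR Serverless / StarRocks"
    if static_fraction >= 0.8 and sla_seconds >= 120:
        # Mostly-static data plus a minutes-level SLA fits the
        # batch-first Glue + DuckDB split.
        return "Glue ETL + DuckDB on Fargate"
    # Everything in between deserves cost modeling at scale.
    return "model costs case by case"

print(recommend_engine(0.85, 120, False))  # Glue ETL + DuckDB on Fargate
```

The point is not the exact thresholds but that the inputs (volatility, SLA, streaming needs) are cheap to measure before any architecture is committed to.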
