How We Rebuilt a Legacy HBase + Elasticsearch System Using Apache Iceberg, Spark, Trino, and Doris

DZone – Big Data Zone•Mar 10, 2026

Why It Matters

The shift demonstrates how legacy data pipelines can be modernized into cost‑effective lakehouse architectures while preserving real‑time analytics, a blueprint for enterprises facing similar scalability and budget pressures.

Key Takeaways

•Legacy HBase + Elasticsearch stack was costly to operate and difficult to scale
•Iceberg provides ACID, partition evolution, rollback
•Spark Structured Streaming commits micro‑batches to Iceberg every five minutes
•Doris outperforms Trino 2–3× for real‑time queries
•Lakehouse reduces infrastructure complexity and storage costs

Pulse Analysis

Enterprises still running monolithic HBase and Elasticsearch pipelines often wrestle with high operational spend, slow scaling, and brittle code paths. The platform in this case audits every user action, and its consumers can tolerate a few minutes of latency, which made a strategic pivot toward a lakehouse model possible. By moving raw events into cloud object storage and adopting Apache Iceberg as the table format, the team gained transactional safety, flexible partitioning, and snapshot‑based rollback, features that are hard to achieve with traditional NoSQL stores.
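As a rough illustration of the partition evolution and rollback features mentioned above, the sketch below builds the Spark SQL statements that Iceberg's SQL extensions accept for both operations. The catalog name `lake`, the table `audit.events`, the column `event_time`, and the snapshot id are all hypothetical; the source does not give the team's actual names or commands.

```python
def add_partition_field_sql(table: str, transform: str) -> str:
    """Build an Iceberg partition-evolution statement (needs Iceberg's Spark SQL extensions)."""
    return f"ALTER TABLE {table} ADD PARTITION FIELD {transform}"


def rollback_to_snapshot_sql(catalog: str, table: str, snapshot_id: int) -> str:
    """Build a call to Iceberg's rollback_to_snapshot stored procedure."""
    return f"CALL {catalog}.system.rollback_to_snapshot('{table}', {snapshot_id})"


def run_statements(spark, statements):
    """Run each statement on an existing SparkSession configured with an Iceberg catalog."""
    for statement in statements:
        spark.sql(statement)


# Hypothetical names, for illustration only.
EXAMPLE_STATEMENTS = [
    add_partition_field_sql("lake.audit.events", "days(event_time)"),
    rollback_to_snapshot_sql("lake", "audit.events", 42),
]
```

Partition evolution of this kind changes table metadata only, so existing data files are not rewritten, and a rollback simply points the table back at an earlier snapshot.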

The ingestion layer was rebuilt around Apache Spark Structured Streaming, which reads from Kafka, processes micro‑batches, and commits to Iceberg in five‑minute windows. Spark’s native Iceberg support simplifies file compaction and schema evolution, while PySpark accelerated development cycles. Parquet was selected for its columnar compression and query performance, further reducing storage costs compared with Avro. This combination delivers a near‑real‑time data pipeline that satisfies both data‑science model training and ad‑hoc customer queries without the overhead of a continuously running service.
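The ingestion path described above can be sketched in PySpark roughly as follows. This is a minimal sketch under stated assumptions, not the team's code: the event schema, topic, table name, and checkpoint path are hypothetical, and the Spark session is assumed to already have the Kafka connector and an Iceberg catalog configured.

```python
TRIGGER_INTERVAL = "5 minutes"  # matches the five-minute commit window in the article


def kafka_source_options(bootstrap_servers: str, topic: str) -> dict:
    """Options for Spark's Kafka source; values are supplied by the caller."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def start_audit_stream(spark, bootstrap_servers, topic, target_table, checkpoint_dir):
    """Read JSON events from Kafka and append them to an Iceberg table in micro-batches."""
    # Imported here so the module can be loaded without PySpark installed.
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    # Hypothetical event schema; the real audit schema is not given in the source.
    schema = StructType([
        StructField("user_id", StringType()),
        StructField("action", StringType()),
        StructField("event_time", TimestampType()),
    ])

    raw = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    # Kafka delivers the payload as bytes in the "value" column.
    events = raw.select(
        from_json(col("value").cast("string"), schema).alias("e")
    ).select("e.*")

    return (
        events.writeStream.format("iceberg")
        .outputMode("append")
        .trigger(processingTime=TRIGGER_INTERVAL)
        .option("checkpointLocation", checkpoint_dir)
        .toTable(target_table)
    )
```

A call such as `start_audit_stream(spark, "broker:9092", "audit-events", "lake.audit.events", "s3://bucket/checkpoints/audit")` (all values hypothetical) would return a `StreamingQuery` that commits one Iceberg snapshot per trigger.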

Query performance was the final differentiator. Initial trials with Trino showed that it handled federated analytics well but could not deliver sub‑second search latency. After evaluating StarRocks and Apache Doris, the team settled on Doris for its aggressive caching and external table capabilities, achieving a 2–3× speedup over Trino on typical audit queries. The result is a streamlined, cost‑efficient architecture that balances analytical depth with real‑time responsiveness, an increasingly common requirement as businesses turn data into a competitive asset.
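A speedup figure like the one reported is usually obtained by timing the same query against both engines. The sketch below shows one way to do that; it is an assumed measurement harness, not the team's benchmark. `run_query` is a hypothetical zero-argument callable that would wrap a real client call to Trino or Doris.

```python
import statistics
import time


def median_latency(run_query, repeats: int = 5) -> float:
    """Return the median wall-clock time in seconds of calling run_query() `repeats` times."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds
```

Using the median instead of the mean keeps a single slow run, for example a cold cache on the first query, from dominating the comparison. That matters when one engine relies heavily on caching.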