DZone – Big Data Zone

Publication

Community and editorial coverage on Big Data tools, streaming, data lakes, and engineering patterns.

Boost Your Spark Jobs: How Photon Accelerates Apache Spark Performance
News · Apr 13, 2026

Databricks introduced Photon, a native C++ engine that replaces Spark’s JVM‑based runtime. By using vectorized, columnar processing and zero‑copy memory management, Photon delivers 3–7× faster query execution and 30–50% lower memory consumption. The engine integrates as a shared library, letting...

By DZone – Big Data Zone
Schema Evolution in Delta Lake: Designing Pipelines That Never Break
News · Apr 10, 2026

Schema drift—unexpected column additions or type changes—frequently breaks Spark pipelines. Delta Lake mitigates this risk with two complementary features: schema enforcement, which rejects mismatched writes, and schema evolution, which can automatically merge new columns when explicitly enabled. Each schema change...
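A minimal PySpark sketch of the two behaviors described, assuming a Spark session with the delta-spark package configured; the table path and column names are illustrative:

```python
# Sketch of Delta Lake schema enforcement vs. schema evolution.
# Assumes a Spark session with delta-spark configured; path is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_new = spark.createDataFrame(
    [(1, "alice", "US")], ["id", "name", "country"]  # 'country' is a new column
)

# Schema enforcement (the default): this append is rejected if 'country'
# is not already part of the target table's schema.
# df_new.write.format("delta").mode("append").save("/tmp/events")

# Schema evolution: explicitly opt in so the new column is merged in.
df_new.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/tmp/events")
```

The key design point is that evolution is opt-in per write, so an unexpected upstream column still fails loudly unless a pipeline deliberately allows it.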

By DZone – Big Data Zone
Why Queues Don’t Fix Scaling Problems
News · Apr 8, 2026

The article argues that inserting a queue between two overloaded services masks a capacity problem rather than solving it. While queues can absorb brief traffic spikes, sustained overload causes the queue to grow, leading to downstream failures such as database...

By DZone – Big Data Zone
Delta Change Data Feed Deep Dive: Building Incremental Pipelines Without Complexity
News · Apr 1, 2026

Delta Lake’s Change Data Feed (CDF) lets engineers capture row‑level changes as soon as they occur, turning a Delta table into a built‑in change‑data‑capture engine. By enabling the table property delta.enableChangeDataFeed, only modified rows are read, eliminating costly full‑table scans for...
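A hedged sketch of the workflow described, assuming delta-spark is configured; the table name `events` and starting version are illustrative:

```python
# Sketch: reading row-level changes from a Delta table with CDF enabled.
# Assumes delta-spark is configured; table name and version are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One-time table property that starts recording row-level changes:
spark.sql(
    "ALTER TABLE events SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
)

# Incremental read: only rows changed since version 5, with _change_type,
# _commit_version, and _commit_timestamp metadata columns attached.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .table("events")
)
changes.select("id", "_change_type", "_commit_version").show()
```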

By DZone – Big Data Zone
Queues Don't Absorb Load — They Delay Bankruptcy
News · Mar 30, 2026

Backend teams often add a queue during traffic spikes, seeing immediate latency drops, but the queue merely postpones work. As consumer throughput lags, queue depth grows unchecked, turning milliseconds into minutes of processing delay and eventually causing memory exhaustion or...
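The "milliseconds into minutes" dynamic falls out of simple arithmetic. A toy model with illustrative numbers:

```python
# Toy back-of-the-envelope model of queue growth under sustained overload:
# if producers enqueue faster than consumers drain, depth (and therefore
# queuing delay) grows linearly with time. All numbers are illustrative.

def queue_depth(arrival_rate, service_rate, seconds, initial=0):
    """Depth after `seconds` of sustained load (never below zero)."""
    return max(initial + (arrival_rate - service_rate) * seconds, 0)

arrival, service = 1_200, 1_000   # msgs/sec in vs. out: 20% over capacity
depth = queue_depth(arrival, service, seconds=600)   # 10 minutes of spike
delay = depth / service            # wait time for the newest message

print(depth)   # 120000 messages backlogged
print(delay)   # 120.0 seconds of added latency
```

A mere 20% sustained overload turns a sub-second pipeline into a two-minute one within ten minutes, which is exactly why a queue buys time rather than capacity.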

By DZone – Big Data Zone
Scaling Kafka Consumers: Proxy Vs. Client Library for High-Throughput Architectures
News · Mar 30, 2026

Apache Kafka’s pull‑based model excels for event‑driven microservices, but scaling consumer groups creates operational overhead, head‑of‑line blocking, and complex error handling. Large enterprises such as Wix and Uber have addressed these limits by deploying a centralized push‑based consumer proxy, achieving...
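Head-of-line blocking is the easiest of these limits to see in miniature: within a partition, messages are consumed strictly in order, so one slow message delays everything behind it. A small simulation with made-up service times:

```python
# Minimal simulation of head-of-line blocking in a Kafka partition:
# in-order processing means one slow message delays all later ones.
# Service times below are illustrative, in milliseconds.

def completion_times(service_times_ms):
    """Cumulative finish time of each message processed strictly in order."""
    done, t = [], 0
    for s in service_times_ms:
        t += s
        done.append(t)
    return done

# Nine 10 ms messages stuck behind one 5-second "poison" message:
times = completion_times([5000] + [10] * 9)
print(times[-1])  # 5090 — the last fast message waits on the slow one
```

A push-based proxy, as in the Wix and Uber designs the article cites, can dispatch to parallel workers and sidestep this ordering constraint where per-key ordering is not required.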

By DZone – Big Data Zone
How Piezoelectric Energy Harvesting Is Solving the Battery Waste Crisis in Industrial IoT
News · Mar 18, 2026

Industrial IoT deployments rely on millions of short‑life batteries, creating a looming waste problem that could reach 1.4 million metric tons by 2030. High‑temperature piezoelectric energy harvesting converts machine vibration into electricity, tolerating up to 350 °C and eliminating the need for...

By DZone – Big Data Zone
Online Feature Store for AI and Machine Learning with Apache Kafka and Flink
News · Mar 16, 2026

Wix.com has built a real‑time online feature store using Apache Kafka and Apache Flink to power personalized recommendations for its 200 million users. The architecture streams over 70 billion events per day through 50,000 Kafka topics, with FlinkSQL performing low‑latency transformations and...

By DZone – Big Data Zone
How We Rebuilt a Legacy HBase + Elasticsearch System Using Apache Iceberg, Spark, Trino, and Doris
News · Mar 10, 2026

A fintech audit platform replaced its monolithic HBase + Elasticsearch stack with a lakehouse built on Apache Iceberg, Parquet, and Spark Structured Streaming. Data is ingested from Kafka every five minutes, written to Iceberg tables, and queried via Apache Doris for low‑latency...

By DZone – Big Data Zone
Square, SumUp, Shopify: Data Streaming for Real-Time Point-of-Sale (POS)
News · Mar 9, 2026

Point‑of‑sale systems are evolving from simple cash registers into real‑time, connected platforms that handle payments, inventory, and customer insights. Mobile payment leaders Square, SumUp, and Shopify now offer SMBs enterprise‑grade POS capabilities, blurring the line between payment processors and commerce...

By DZone – Big Data Zone
Databricks Lakeflow Spark Declarative Pipelines Migration From Non‑Unity Catalog to Unity Catalog
News · Mar 4, 2026

Migrating Delta Live Tables pipelines from legacy Hive Metastore workspaces to Unity Catalog‑enabled environments requires consistent code refactoring and governance adjustments. Teams must adopt three‑level catalog.schema.table references, replace input_file_name() calls with the built‑in _metadata struct, and migrate notebook...
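A before-and-after sketch of the input_file_name() change mentioned above, assuming PySpark 3.2+; the table and catalog names are illustrative:

```python
# Sketch of one Unity Catalog migration change: the legacy
# input_file_name() function gives way to the built-in _metadata column
# (Spark 3.2+). Table and catalog names below are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, input_file_name

spark = SparkSession.builder.getOrCreate()

# Before (Hive Metastore pipeline, two-level table name):
legacy = (
    spark.read.table("raw_events")
    .withColumn("source_file", input_file_name())
)

# After (Unity Catalog pipeline): three-level catalog.schema.table naming
# plus the _metadata struct for file provenance.
migrated = (
    spark.read.table("main.bronze.raw_events")
    .withColumn("source_file", col("_metadata.file_path"))
)
```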

By DZone – Big Data Zone
The Hidden Cost of Custom Logic: A Performance Showdown in Apache Spark
News · Feb 26, 2026

A recent benchmark shows that standard Python UDFs in PySpark dramatically slow pipelines because each row must be serialized to a Python worker. Using Pandas (vectorized) UDFs cuts execution time roughly fourfold by leveraging Apache Arrow’s columnar transfer. Native...
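A sketch contrasting the two UDF styles, assuming PySpark with pyarrow installed; the string-cleaning logic is illustrative:

```python
# Row-at-a-time Python UDF vs. vectorized pandas UDF in PySpark.
# Assumes pyspark and pyarrow are installed; the cleaning logic is a stand-in.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, pandas_udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("  Alice ",), ("BOB",)], ["name"])

# Row-at-a-time: every single value is serialized to a Python worker and back.
@udf(StringType())
def clean_slow(s):
    return s.strip().lower()

# Vectorized: whole Arrow batches are handed to pandas in one call.
@pandas_udf(StringType())
def clean_fast(s: pd.Series) -> pd.Series:
    return s.str.strip().str.lower()

df.select(clean_fast("name").alias("name")).show()
```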

By DZone – Big Data Zone
AWS SageMaker HyperPod: Distributed Training for Foundation Models at Scale
News · Feb 19, 2026

Amazon Web Services introduced SageMaker HyperPod, a managed, persistent GPU‑cluster service built for training foundation models at massive scale. HyperPod automates node recovery, uses Elastic Fabric Adapter for ultra‑low‑latency interconnect, and integrates with SageMaker Distributed, PyTorch FSDP, and DeepSpeed. The...

By DZone – Big Data Zone
A Pattern for Intelligent Ticket Routing in ITSM
News · Feb 10, 2026

The article presents an architecture that replaces manual ticket dispatch with a machine‑learning core and a real‑time workload scheduler. Historical ticket data is vectorized with TF‑IDF and classified via Logistic Regression to predict the best resolver. Availability is verified through...
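A minimal scikit-learn sketch of the ML core described (TF‑IDF plus logistic regression); the tickets and resolver groups below are invented for illustration:

```python
# Minimal sketch of the routing core: TF-IDF vectorization of ticket text
# feeding a logistic regression that predicts a resolver group.
# Tickets and group names are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tickets = [
    "VPN connection drops every hour",
    "cannot reset my email password",
    "laptop battery drains too fast",
    "mail server rejects outgoing messages",
]
groups = ["network", "identity", "hardware", "email"]

router = make_pipeline(TfidfVectorizer(), LogisticRegression())
router.fit(tickets, groups)

# Route a new ticket to its predicted resolver group.
print(router.predict(["password reset link not working"])[0])
```

In the full pattern the predicted group is only a candidate: the real-time scheduler then checks resolver availability before final assignment.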

By DZone – Big Data Zone