Big Data

Healing Tables: When Day-by-Day Backfills Become a Slow-Motion Disaster

Ghost in the data • February 6, 2026

Why It Matters

Healing Tables eliminate error propagation and performance decay inherent in traditional backfills, delivering reliable, maintainable SCD 2 dimensions for enterprise data warehouses.

Key Takeaways

  • Day-by-day backfills compound errors and degrade performance.
  • Healing Tables separate change detection from period construction.
  • A six-step pipeline rebuilds dimensions deterministically from source data.
  • Hashing and row compression reduce storage and processing time.
  • Validation tests ensure temporal integrity before loading.

Pulse Analysis

Traditional day‑by‑day backfills for slowly changing dimensions (SCD 2) seem simple but quickly become a performance nightmare. Each incremental run compares incoming rows to an ever‑growing target, causing non‑linear runtime growth and compounding any logic errors. Source systems often emit multiple updates per day, deletions, and back‑dated changes, which the incremental loop either misses or mishandles, leading to overlapping records and timeline gaps that force a full rebuild anyway. The industry has long accepted this fragility as a cost of historical data reconstruction.

Healing Tables overturn that paradigm by treating the dimension as a pure function of source data. The six‑step framework first builds an Effectivity Table that captures every genuine change point, then creates contiguous time slices with deterministic `valid_from`/`valid_to` boundaries. By joining all source systems on a unified timeline, computing key and row hashes, and compressing consecutive identical states, the process eliminates redundant rows and dramatically reduces storage. A final validation suite checks for a single current record per key, non‑overlapping intervals, and proper ordering, guaranteeing temporal integrity before the load.

Adopting Healing Tables is most beneficial when complete source history is available and rebuild windows are acceptable. Data teams gain a path‑independent pipeline: fixing a detection bug or adding a new attribute simply requires re‑running the framework, producing the same result every time. This reproducibility aligns with modern data‑ops practices, lowers maintenance overhead, and improves confidence in downstream analytics. Organizations that replace fragile incremental backfills with Healing Tables can expect faster issue resolution, lower compute costs, and more trustworthy dimensional models across the enterprise.

Healing Tables: When Day-by-Day Backfills Become a Slow-Motion Disaster

It was 2 AM on a Saturday when I realized we’d been loading data wrong for six months.

The situation: a customer dimension with three years of history needed to be backfilled after a source system migration. The previous team’s approach was straightforward—run the daily incremental process 1,095 times, once for each day of history. They estimated three weeks to complete.

What they hadn’t accounted for was how errors compound. By the time I looked at the data, we had 47,000 records with overlapping date ranges, 12,000 timeline gaps where customers seemed to vanish and reappear, and an unknowable number of missed changes from when source systems updated the same record multiple times in a single day.

The dimension wasn’t just wrong. It was unfixably wrong using traditional methods. Every incremental run had layered new errors on top of old ones, creating a Jenga tower of data quality issues that couldn’t be untangled without starting over.

That night, I started building what I now call a Healing Table—a dimension that can be completely rebuilt from source data at any point, “healing” whatever accumulated inconsistencies have crept in over months or years of incremental loads.

This isn’t just a technique for disaster recovery. It’s a fundamentally different approach to SCD Type 2 that separates change detection from period construction, processes entire historical ranges in a single pass, and produces dimensions that are deterministically reproducible from source data.

Here’s how it works.


Why Day‑by‑Day Backfills Fail

The traditional approach to backfilling historical SCD2 dimensions follows a pattern most data engineers know well: take your daily incremental process, wrap it in a loop, and execute it once for each historical date. Simple, reuses existing code, and catastrophically fragile.

Errors compound across iterations. If your change detection logic misses an edge case on January 3rd, that incorrect record becomes the baseline for January 4th’s comparison. The error propagates forward through every subsequent run. By the time you discover the issue months later, you can’t simply fix January 3rd because hundreds of downstream records depend on that incorrect state.

Performance degrades non‑linearly. Each daily run needs to compare incoming records against the existing dimension. As the dimension grows through backfill, comparison costs increase. A process that takes 5 minutes for a single day’s delta might take 45 minutes per day when the target table contains three years of history. That three‑week estimate becomes three months.

Source system quirks multiply. Real source systems don’t change data once per day at a predictable time. They update records multiple times, delete and recreate rows, and occasionally backdate changes. Day‑by‑day processing either misses these patterns entirely or handles them inconsistently across runs.

Recovery requires complete rebuild anyway. When—not if—something goes wrong, fixing it requires blowing away the dimension and starting over. But if you’re going to rebuild from scratch regardless, why not design for that from the beginning?

The Healing Tables framework embraces this reality. Instead of trying to prevent rebuilds, it makes them fast, reliable, and deterministic.


The Six‑Step Framework

Healing Tables work by separating two concerns that traditional SCD2 implementations conflate: change detection (identifying when attributes changed) and period construction (building valid time slices with proper start and end dates).

Traditional approaches detect changes and construct periods simultaneously, comparing incoming records to existing dimension state. This creates tight coupling between current dimension contents and processing logic—if the dimension is wrong, future processing will also be wrong.

Healing Tables decouple these concerns through a six‑step pipeline that operates entirely on source data, constructing the dimension from scratch without reference to any existing target state:

  1. Effectivity Table Creation – Extract all change points from sources

  2. Time Slice Generation – Build date ranges with proper valid_from/valid_to

  3. Source Table Joining – Conform attributes from multiple sources

  4. Hash Computation – Enable efficient change detection

  5. Row Compression – Eliminate consecutive identical states

  6. Validation Testing – Verify temporal integrity before loading

Let me walk through each step.


Step 1: Building the Effectivity Table

The Effectivity Table captures every moment when any tracked attribute changed for any business key. Think of it as a timeline of “something happened here” markers that we’ll later fill in with actual attribute values.

For sources with explicit timestamps, extraction looks like this:


-- Extract all change points from source
WITH source_ordered AS (
    SELECT
        customer_id,
        customer_name,
        customer_status,
        credit_limit,
        updated_at AS change_timestamp,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at) AS version_sequence
    FROM raw_customers
    WHERE updated_at IS NOT NULL
),

-- Detect actual changes using LAG comparison
change_detection AS (
    SELECT
        *,
        LAG(customer_name)   OVER (PARTITION BY customer_id ORDER BY change_timestamp) AS prev_name,
        LAG(customer_status) OVER (PARTITION BY customer_id ORDER BY change_timestamp) AS prev_status,
        LAG(credit_limit)    OVER (PARTITION BY customer_id ORDER BY change_timestamp) AS prev_credit_limit
    FROM source_ordered
)

-- Keep only first record OR records where attributes actually changed
SELECT
    customer_id,
    customer_name,
    customer_status,
    credit_limit,
    change_timestamp
FROM change_detection
WHERE version_sequence = 1
   OR customer_name   IS DISTINCT FROM prev_name
   OR customer_status IS DISTINCT FROM prev_status
   OR credit_limit    IS DISTINCT FROM prev_credit_limit;

The critical insight is filtering to actual changes only. Raw source tables often contain records where nothing meaningful changed—perhaps a batch process touched every row, or a status field was updated and then immediately reverted. Including these non‑changes creates unnecessary dimension versions that complicate queries and waste storage.

When working with daily snapshot sources (no change timestamps, just daily extracts), the pattern shifts to detecting differences between consecutive snapshots:


WITH daily_ordered AS (
    SELECT
        customer_id,
        customer_name,
        customer_status,
        snapshot_date,
        LAG(customer_name)   OVER (PARTITION BY customer_id ORDER BY snapshot_date) AS prev_name,
        LAG(customer_status) OVER (PARTITION BY customer_id ORDER BY snapshot_date) AS prev_status
    FROM daily_customer_snapshots
)

SELECT
    customer_id,
    customer_name,
    customer_status,
    snapshot_date AS effective_from
FROM daily_ordered
WHERE prev_name IS NULL
   OR customer_name   IS DISTINCT FROM prev_name
   OR customer_status IS DISTINCT FROM prev_status;

IS DISTINCT FROM handles NULL comparisons correctly, a common source of bugs in change detection logic. A standard != or <> comparison returns unknown (NULL) when either operand is NULL, so a genuine change to or from NULL never satisfies the filter; IS DISTINCT FROM treats NULL as a concrete, comparable value.
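
A quick standalone illustration of the difference (not tied to any table; the aliases are just for readability):

-- != / <> returns unknown when either side is NULL, so a change between 'Gold'
-- and NULL slips past a WHERE filter; IS DISTINCT FROM catches it.
SELECT
    'Gold' <> CAST(NULL AS VARCHAR)                               AS neq_result,    -- NULL: change missed
    'Gold' IS DISTINCT FROM CAST(NULL AS VARCHAR)                 AS is_distinct,   -- TRUE: change detected
    CAST(NULL AS VARCHAR) IS DISTINCT FROM CAST(NULL AS VARCHAR)  AS null_vs_null;  -- FALSE: no spurious version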


Step 2: Generating Time Slices

Time Slices transform point‑in‑time change events into date ranges with valid_from and valid_to columns. The LEAD window function does the heavy lifting, calculating when each version expires based on when the next version begins:


WITH effectivity_data AS (
    SELECT customer_id, customer_name, customer_status, change_timestamp
    FROM effectivity_table
)

SELECT
    customer_id,
    customer_name,
    customer_status,
    change_timestamp AS valid_from,
    COALESCE(
        LEAD(change_timestamp) OVER (PARTITION BY customer_id ORDER BY change_timestamp),
        DATE '9999-12-31'   -- High date for current records
    ) AS valid_to,
    CASE
        WHEN LEAD(change_timestamp) OVER (PARTITION BY customer_id ORDER BY change_timestamp) IS NULL
        THEN TRUE
        ELSE FALSE
    END AS is_current
FROM effectivity_data;

High dates vs NULLs for current records. Using 9999‑12‑31 simplifies BETWEEN queries but can confuse some BI tools. Including an explicit is_current boolean flag gives query authors flexibility without relying on a magic date.
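
For example, a current-state query can lean on the flag instead of the sentinel date (a minimal sketch against the dim_customer table used throughout):

-- Current-state lookup via the explicit flag
SELECT *
FROM dim_customer
WHERE is_current = TRUE;

-- Equivalent lookup via the sentinel date (works, but bakes the magic date into every query)
SELECT *
FROM dim_customer
WHERE valid_to = DATE '9999-12-31';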

Interval conventions. Use left‑closed, right‑open intervals (valid_from <= date < valid_to). This ensures every point in time maps to exactly one dimension version without ambiguity at boundaries.

Point‑in‑time lookup pattern:


SELECT d.*
FROM dim_customer d
WHERE @lookup_date >= d.valid_from
  AND @lookup_date <  d.valid_to;


Step 3: Joining Multiple Sources

When dimension attributes come from multiple systems, you need to conform those sources before SCD2 processing begins. The solution is a Unified Timeline that collects all change points from all sources, then joins each source back using as‑of‑date logic:


-- Collect ALL change points from ALL sources
WITH unified_change_points AS (
    SELECT customer_id, change_timestamp FROM crm_customers
    UNION
    SELECT customer_id, change_timestamp FROM erp_customers
    UNION
    SELECT customer_id, change_timestamp FROM web_profiles
),

-- Build time slices from the unified timeline
unified_time_slices AS (
    SELECT
        customer_id,
        change_timestamp AS valid_from,
        COALESCE(
            LEAD(change_timestamp) OVER (PARTITION BY customer_id ORDER BY change_timestamp),
            TIMESTAMP '9999-12-31 23:59:59'
        ) AS valid_to
    FROM unified_change_points
),

-- Join each source back using as-of logic
final_dimension AS (
    SELECT
        ts.customer_id,
        ts.valid_from,
        ts.valid_to,
        crm.customer_name,
        erp.credit_limit,
        erp.payment_terms,
        web.email_opt_in,
        web.preferred_contact_method
    FROM unified_time_slices ts
    LEFT JOIN crm_customers crm
      ON ts.customer_id = crm.customer_id
     AND crm.change_timestamp = (
         SELECT MAX(change_timestamp)
         FROM crm_customers
         WHERE customer_id = ts.customer_id
           AND change_timestamp <= ts.valid_from
     )
    LEFT JOIN erp_customers erp
      ON ts.customer_id = erp.customer_id
     AND erp.change_timestamp = (
         SELECT MAX(change_timestamp)
         FROM erp_customers
         WHERE customer_id = ts.customer_id
           AND change_timestamp <= ts.valid_from
     )
    LEFT JOIN web_profiles web
      ON ts.customer_id = web.customer_id
     AND web.change_timestamp = (
         SELECT MAX(change_timestamp)
         FROM web_profiles
         WHERE customer_id = ts.customer_id
           AND change_timestamp <= ts.valid_from
     )
)

SELECT * FROM final_dimension;

Document attribute ownership explicitly. When two sources provide the same attribute, decide which wins and codify that rule (e.g., using COALESCE or CASE). Keeping this logic in code rather than tribal knowledge prevents future conflicts.
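
For instance, a precedence rule might look like this (a minimal sketch with made-up sample rows; the table and column names here are illustrative, not part of the framework):

-- CRM wins for customer_name; ERP is the fallback when CRM has no value.
WITH crm(customer_id, customer_name) AS (
    VALUES (1, 'Acme Corp'), (2, CAST(NULL AS VARCHAR))
),
erp(customer_id, customer_name) AS (
    VALUES (1, 'ACME CORPORATION'), (2, 'Globex LLC')
)
SELECT
    crm.customer_id,
    COALESCE(crm.customer_name, erp.customer_name) AS customer_name   -- the ownership rule lives in code
FROM crm
LEFT JOIN erp ON erp.customer_id = crm.customer_id;
-- Returns 'Acme Corp' for customer 1 (CRM wins) and 'Globex LLC' for customer 2 (ERP fallback)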


Step 4: Hash Computation

Hash‑based change detection replaces expensive multi‑column comparisons with a single‑column hash comparison. The Healing Tables framework uses a two‑hash strategy:

  • key_hash – identifies the business entity.

  • row_hash – detects when any tracked attribute changes.

Example (Trino/Starburst syntax):


-- Business-key hash
lower(to_hex(sha256(to_utf8(
    COALESCE(CAST(source_system AS VARCHAR), '') || '|' ||
    COALESCE(CAST(customer_id   AS VARCHAR), '')
)))) AS key_hash

-- Row-level hash (attributes only)
lower(to_hex(sha256(to_utf8(
    COALESCE(customer_name, '^^NULL^^') || '|' ||
    COALESCE(customer_status, '^^NULL^^') || '|' ||
    COALESCE(CAST(credit_limit AS VARCHAR), '^^NULL^^') || '|' ||
    COALESCE(CAST(birth_date   AS VARCHAR), '^^NULL^^')
)))) AS row_hash

Use a placeholder (^^NULL^^) for NULL values so that ('John', NULL) and ('John', '') produce different hashes.
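
To see why, compare the two conventions directly (Trino syntax, matching the expressions above):

-- ('John', NULL) vs ('John', ''): with '' as the NULL fallback both concatenate to 'John|',
-- so the hashes collide; with '^^NULL^^' the two states hash differently.
SELECT
    to_hex(sha256(to_utf8('John' || '|' || COALESCE(CAST(NULL AS VARCHAR), ''))))
      = to_hex(sha256(to_utf8('John' || '|' || '')))   AS collides_with_empty_fallback,   -- TRUE
    to_hex(sha256(to_utf8('John' || '|' || COALESCE(CAST(NULL AS VARCHAR), '^^NULL^^'))))
      = to_hex(sha256(to_utf8('John' || '|' || '')))   AS collides_with_placeholder;      -- FALSE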

Why not CONCAT_WS? It silently drops NULL arguments, so rows that differ only in which column is NULL concatenate to the same string and hash identically, hiding real changes. Always wrap columns in COALESCE before concatenation.

Column ordering best practice: concatenate columns in a fixed, documented order (alphabetical is easy to enforce). A consistent order keeps hashes deterministic even as the surrounding query is refactored or reimplemented on another platform.

Platform‑specific hash functions:

| Platform   | Function         | Syntax                              |
|------------|------------------|-------------------------------------|
| Snowflake  | SHA2             | SHA2(expression, 256)               |
| BigQuery   | SHA256 + TO_HEX  | TO_HEX(SHA256(expression))          |
| Trino      | sha256 + to_hex  | to_hex(sha256(to_utf8(expression))) |
| SQL Server | HASHBYTES        | HASHBYTES('SHA2_256', expression)   |


Step 5: Row Compression

Row compression eliminates consecutive duplicate states—adjacent time periods where all attributes are identical. This is an “Islands and Gaps” pattern:


-- Detect group boundaries
WITH boundary_detection AS (
    SELECT
        *,
        CASE
            WHEN LAG(valid_to) OVER (PARTITION BY customer_id ORDER BY valid_from) = valid_from
             AND row_hash = LAG(row_hash) OVER (PARTITION BY customer_id ORDER BY valid_from)
            THEN 0   -- Continuation of previous group
            ELSE 1   -- Start of new group
        END AS is_group_start
    FROM dimension_with_hashes
),

-- Assign group IDs via running sum
group_assignment AS (
    SELECT
        *,
        SUM(is_group_start) OVER (PARTITION BY customer_id ORDER BY valid_from
                                  ROWS UNBOUNDED PRECEDING) AS group_id
    FROM boundary_detection
)

-- Collapse each group to a single row
SELECT
    customer_id,
    MIN(valid_from) AS valid_from,
    MAX(valid_to)   AS valid_to,
    MAX(CASE WHEN is_current THEN 1 ELSE 0 END) = 1 AS is_current,
    MAX(customer_name)   AS customer_name,
    MAX(customer_status) AS customer_status,
    MAX(row_hash)        AS row_hash,
    MAX(key_hash)        AS key_hash,
    COUNT(*)             AS compressed_row_count   -- audit
FROM group_assignment
GROUP BY customer_id, group_id
ORDER BY customer_id, valid_from;

Rows are compressed only when both:

  1. Temporal contiguity – previous valid_to equals current valid_from.

  2. Attribute identity – row_hash values match.

This prevents merging separate business states that happen to share the same attribute values but are separated by a gap in time.
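
Here is a tiny end-to-end illustration of both rules on made-up slices (the sample data is an assumption; the logic mirrors the compression query above):

WITH sample_slices(customer_id, valid_from, valid_to, row_hash) AS (
    VALUES
        (42, DATE '2024-01-01', DATE '2024-02-01', 'abc'),   -- contiguous with the next slice
        (42, DATE '2024-02-01', DATE '2024-03-01', 'abc'),   -- same hash: merges with the previous one
        (42, DATE '2024-04-01', DATE '2024-05-01', 'abc')    -- same hash, but a March gap: kept separate
),
boundary_detection AS (
    SELECT
        *,
        CASE
            WHEN LAG(valid_to) OVER (PARTITION BY customer_id ORDER BY valid_from) = valid_from
             AND row_hash = LAG(row_hash) OVER (PARTITION BY customer_id ORDER BY valid_from)
            THEN 0 ELSE 1
        END AS is_group_start
    FROM sample_slices
),
group_assignment AS (
    SELECT
        *,
        SUM(is_group_start) OVER (PARTITION BY customer_id ORDER BY valid_from
                                  ROWS UNBOUNDED PRECEDING) AS group_id
    FROM boundary_detection
)
SELECT customer_id, MIN(valid_from) AS valid_from, MAX(valid_to) AS valid_to, MAX(row_hash) AS row_hash
FROM group_assignment
GROUP BY customer_id, group_id
ORDER BY valid_from;
-- Expected output: two rows, 2024-01-01 to 2024-03-01 and 2024-04-01 to 2024-05-01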


Step 6: Validation Testing

Production SCD2 dimensions need comprehensive validation to ensure temporal integrity. Tests should run after every load and block pipeline completion on failure.

Test 1 – Single current record per key


SELECT customer_id, COUNT(*) AS current_record_count
FROM dim_customer
WHERE is_current = TRUE
GROUP BY customer_id
HAVING COUNT(*) <> 1;
-- Expected: zero rows

Test 2 – No overlapping date ranges


SELECT a.*, b.*
FROM dim_customer a
JOIN dim_customer b
  ON a.customer_id = b.customer_id
 AND a.valid_from <> b.valid_from   -- compare distinct versions of the same key
WHERE a.valid_from < b.valid_to
  AND a.valid_to   > b.valid_from;
-- Expected: zero rows

Test 3 – No timeline gaps (if continuity required)


WITH gap_check AS (
    SELECT
        customer_id,
        valid_from,
        LAG(valid_to) OVER (PARTITION BY customer_id ORDER BY valid_from) AS prev_valid_to
    FROM dim_customer
)
SELECT customer_id, prev_valid_to AS gap_start, valid_from AS gap_end
FROM gap_check
WHERE prev_valid_to IS NOT NULL
  AND prev_valid_to <> valid_from;
-- Expected: zero rows (or documented acceptable gaps)

Test 4 – Valid date ordering


SELECT *
FROM dim_customer
WHERE valid_from >= valid_to;
-- Expected: zero rows

Test 5 – No consecutive duplicate versions after compression


WITH version_check AS (
    SELECT
        customer_id,
        valid_from,
        row_hash,
        LAG(row_hash) OVER (PARTITION BY customer_id ORDER BY valid_from) AS prev_row_hash,
        LAG(valid_to) OVER (PARTITION BY customer_id ORDER BY valid_from) AS prev_valid_to
    FROM dim_customer
)
SELECT *
FROM version_check
WHERE row_hash = prev_row_hash
  AND valid_from = prev_valid_to;
-- Expected: zero rows

In dbt, these can be expressed as reusable tests in schema.yml.
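
As a sketch of what that can look like, a reusable generic test for the overlap check might be defined roughly like this (the macro name, file path, and arguments are illustrative, not from the original article):

-- tests/generic/no_overlapping_ranges.sql (hypothetical file)
{% test no_overlapping_ranges(model, key_column, from_column, to_column) %}
select a.{{ key_column }}
from {{ model }} a
join {{ model }} b
  on a.{{ key_column }} = b.{{ key_column }}
 and a.{{ from_column }} <> b.{{ from_column }}
where a.{{ from_column }} < b.{{ to_column }}
  and a.{{ to_column }}   > b.{{ from_column }}
{% endtest %}

Once defined, the test can be attached to dim_customer in schema.yml alongside dbt's built-in uniqueness and not-null tests, so a failing overlap check blocks the load just like any other test.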


The Self‑Healing Property

Healing Tables are path‑independent: the dimension produced depends only on source data contents, not on the history of load operations. Re‑run any date range with corrected logic, and the result is identical to a full rebuild.

This property changes how teams handle data‑quality issues:

  • Bug in change detection logic? Fix it, reprocess the affected range, done.

  • Source system correction applied retroactively? Reprocess automatically incorporates it.

  • New attribute added? Reprocess rebuilds all historical versions with the new column.

  • Suspected data corruption? Rebuild and compare; differences reveal the problem.

The dimension becomes reproducible in the same way good software builds are reproducible. Given the same inputs and logic, you always get the same outputs. This isn’t just convenient for debugging—it’s a fundamental quality characteristic that enables reliable data pipelines.


When to Use This Approach

Healing Tables work best when:

  • Complete source history is available.

  • Source data is contiguous (no unexplained gaps).

  • Rebuild time is acceptable for the volume of data.

  • Sources use logical deletes rather than physical row removal.

They are less suitable for:

  • Very high‑volume dimensions where full rebuilds are impractical.

  • Real‑time or near‑real‑time requirements.

  • Situations where source history is unavailable.

  • Simple, stable dimensions with well‑understood sources.

Many teams combine approaches: run Healing Tables periodically (e.g., weekly or monthly) to “heal” the dimension, while keeping a lightweight incremental process for day‑to‑day updates.


Wrapping Up

That 2 AM debugging session taught me something important: the traditional incremental SCD2 approach creates tight coupling between historical load operations and current data quality. Every bug, edge case, and source‑system quirk gets baked into the dimension permanently.

Healing Tables break that coupling. By extracting change points from source data, constructing time slices independently, compressing duplicate states, and validating results before loading, you create dimensions that are deterministically reproducible from source data.

The six‑step framework isn’t complicated, but it requires discipline:

  1. Build the Effectivity Table with actual changes only.

  2. Generate Time Slices with proper valid_from/valid_to boundaries.

  3. Join multiple sources using a unified timeline.

  4. Compute hashes for efficient change detection.

  5. Compress consecutive identical states.

  6. Validate temporal integrity before loading.

Implement those steps correctly, and you’ll have dimensions that can heal themselves from whatever accumulated issues have crept in over months or years of operation.

The next time someone asks you to backfill three years of history by running the daily process 1,095 times, you’ll have a better answer.
