Healing Tables eliminate error propagation and performance decay inherent in traditional backfills, delivering reliable, maintainable SCD 2 dimensions for enterprise data warehouses.
Traditional day‑by‑day backfills for slowly changing dimensions (SCD 2) seem simple but quickly become a performance nightmare. Each incremental run compares incoming rows to an ever‑growing target, causing non‑linear runtime growth and compounding any logic errors. Source systems often emit multiple updates per day, deletions, and back‑dated changes, which the incremental loop either misses or mishandles, leading to overlapping records and timeline gaps that force a full rebuild anyway. The industry has long accepted this fragility as a cost of historical data reconstruction.
Healing Tables overturn that paradigm by treating the dimension as a pure function of source data. The six‑step framework first builds an Effectivity Table that captures every genuine change point, then creates contiguous time slices with deterministic `valid_from`/`valid_to` boundaries. By joining all source systems on a unified timeline, computing key and row hashes, and compressing consecutive identical states, the process eliminates redundant rows and dramatically reduces storage. A final validation suite checks for a single current record per key, non‑overlapping intervals, and proper ordering, guaranteeing temporal integrity before the load.
Adopting Healing Tables is most beneficial when complete source history is available and rebuild windows are acceptable. Data teams gain a path‑independent pipeline: fixing a detection bug or adding a new attribute simply requires re‑running the framework, producing the same result every time. This reproducibility aligns with modern data‑ops practices, lowers maintenance overhead, and improves confidence in downstream analytics. Organizations that replace fragile incremental backfills with Healing Tables can expect faster issue resolution, lower compute costs, and more trustworthy dimensional models across the enterprise.
It was 2 AM on a Saturday when I realized we’d been loading data wrong for six months.
The situation: a customer dimension with three years of history needed to be backfilled after a source system migration. The previous team’s approach was straightforward—run the daily incremental process 1,095 times, once for each day of history. They estimated three weeks to complete.
What they hadn’t accounted for was how errors compound. By the time I looked at the data, we had 47,000 records with overlapping date ranges, 12,000 timeline gaps where customers seemed to vanish and reappear, and an unknowable number of missed changes from when source systems updated the same record multiple times in a single day.
The dimension wasn’t just wrong. It was unfixably wrong using traditional methods. Every incremental run had layered new errors on top of old ones, creating a Jenga tower of data quality issues that couldn’t be untangled without starting over.
That night, I started building what I now call a Healing Table—a dimension that can be completely rebuilt from source data at any point, “healing” whatever accumulated inconsistencies have crept in over months or years of incremental loads.
This isn’t just a technique for disaster recovery. It’s a fundamentally different approach to SCD Type 2 that separates change detection from period construction, processes entire historical ranges in a single pass, and produces dimensions that are deterministically reproducible from source data.
Here’s how it works.
The traditional approach to backfilling historical SCD2 dimensions follows a pattern most data engineers know well: take your daily incremental process, wrap it in a loop, and execute it once for each historical date. Simple, reuses existing code, and catastrophically fragile.
Errors compound across iterations. If your change detection logic misses an edge case on January 3rd, that incorrect record becomes the baseline for January 4th’s comparison. The error propagates forward through every subsequent run. By the time you discover the issue months later, you can’t simply fix January 3rd because hundreds of downstream records depend on that incorrect state.
Performance degrades non‑linearly. Each daily run needs to compare incoming records against the existing dimension. As the dimension grows through backfill, comparison costs increase. A process that takes 5 minutes for a single day’s delta might take 45 minutes per day when the target table contains three years of history. That three‑week estimate becomes three months.
Source system quirks multiply. Real source systems don’t change data once per day at a predictable time. They update records multiple times, delete and recreate rows, and occasionally backdate changes. Day‑by‑day processing either misses these patterns entirely or handles them inconsistently across runs.
Recovery requires complete rebuild anyway. When—not if—something goes wrong, fixing it requires blowing away the dimension and starting over. But if you’re going to rebuild from scratch regardless, why not design for that from the beginning?
The Healing Tables framework embraces this reality. Instead of trying to prevent rebuilds, it makes them fast, reliable, and deterministic.
Healing Tables work by separating two concerns that traditional SCD2 implementations conflate: change detection (identifying when attributes changed) and period construction (building valid time slices with proper start and end dates).
Traditional approaches detect changes and construct periods simultaneously, comparing incoming records to existing dimension state. This creates tight coupling between current dimension contents and processing logic—if the dimension is wrong, future processing will also be wrong.
Healing Tables decouple these concerns through a six‑step pipeline that operates entirely on source data, constructing the dimension from scratch without reference to any existing target state:
Effectivity Table Creation – Extract all change points from sources
Time Slice Generation – Build date ranges with proper valid_from/valid_to
Source Table Joining – Conform attributes from multiple sources
Hash Computation – Enable efficient change detection
Row Compression – Eliminate consecutive identical states
Validation Testing – Verify temporal integrity before loading
Let me walk through each step.
The Effectivity Table captures every moment when any tracked attribute changed for any business key. Think of it as a timeline of “something happened here” markers that we’ll later fill in with actual attribute values.
For sources with explicit timestamps, extraction looks like this:
-- Extract all change points from source
WITH source_ordered AS (
    SELECT
        customer_id,
        customer_name,
        customer_status,
        credit_limit,
        updated_at AS change_timestamp,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at) AS version_sequence
    FROM raw_customers
    WHERE updated_at IS NOT NULL
),
-- Detect actual changes using LAG comparison
change_detection AS (
    SELECT
        *,
        LAG(customer_name) OVER (PARTITION BY customer_id ORDER BY change_timestamp) AS prev_name,
        LAG(customer_status) OVER (PARTITION BY customer_id ORDER BY change_timestamp) AS prev_status,
        LAG(credit_limit) OVER (PARTITION BY customer_id ORDER BY change_timestamp) AS prev_credit_limit
    FROM source_ordered
)
-- Keep only first record OR records where attributes actually changed
SELECT
    customer_id,
    customer_name,
    customer_status,
    credit_limit,
    change_timestamp
FROM change_detection
WHERE version_sequence = 1
   OR customer_name IS DISTINCT FROM prev_name
   OR customer_status IS DISTINCT FROM prev_status
   OR credit_limit IS DISTINCT FROM prev_credit_limit;
The critical insight is filtering to actual changes only. Raw source tables often contain records where nothing meaningful changed—perhaps a batch process touched every row, or a status field was updated and then immediately reverted. Including these non‑changes creates unnecessary dimension versions that complicate queries and waste storage.
When working with daily snapshot sources (no change timestamps, just daily extracts), the pattern shifts to detecting differences between consecutive snapshots:
WITH daily_ordered AS (
    SELECT
        customer_id,
        customer_name,
        customer_status,
        snapshot_date,
        LAG(customer_name) OVER (PARTITION BY customer_id ORDER BY snapshot_date) AS prev_name,
        LAG(customer_status) OVER (PARTITION BY customer_id ORDER BY snapshot_date) AS prev_status
    FROM daily_customer_snapshots
)
SELECT
    customer_id,
    customer_name,
    customer_status,
    snapshot_date AS effective_from
FROM daily_ordered
WHERE prev_name IS NULL
   OR customer_name IS DISTINCT FROM prev_name
   OR customer_status IS DISTINCT FROM prev_status;
IS DISTINCT FROM handles NULL comparisons correctly—a common source of bugs in change detection logic. With standard !=, any comparison involving NULL evaluates to unknown, so the WHERE clause silently drops rows where a value changed to or from NULL; IS DISTINCT FROM treats NULL as a comparable value and catches those changes.
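A quick illustration of the difference—nothing here is specific to the customer data, just standard NULL semantics you can verify in Trino or Postgres:

-- != yields unknown when either side is NULL; IS DISTINCT FROM yields a definite TRUE/FALSE
SELECT
    'Active' <> CAST(NULL AS VARCHAR)               AS neq_result,          -- NULL (unknown): the change would be missed
    'Active' IS DISTINCT FROM CAST(NULL AS VARCHAR) AS is_distinct_result;  -- TRUE: the NULL-to-'Active' change is detected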
Time Slices transform point‑in‑time change events into date ranges with valid_from and valid_to columns. The LEAD window function does the heavy lifting, calculating when each version expires based on when the next version begins:
WITH effectivity_data AS (
    SELECT customer_id, customer_name, customer_status, change_timestamp
    FROM effectivity_table
)
SELECT
    customer_id,
    customer_name,
    customer_status,
    change_timestamp AS valid_from,
    COALESCE(
        LEAD(change_timestamp) OVER (PARTITION BY customer_id ORDER BY change_timestamp),
        DATE '9999-12-31' -- High date for current records
    ) AS valid_to,
    CASE
        WHEN LEAD(change_timestamp) OVER (PARTITION BY customer_id ORDER BY change_timestamp) IS NULL
            THEN TRUE
        ELSE FALSE
    END AS is_current
FROM effectivity_data;
High dates vs NULLs for current records. Using 9999‑12‑31 simplifies BETWEEN queries but can confuse some BI tools. Including an explicit is_current boolean flag gives query authors flexibility without relying on a magic date.
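In practice that means current-state queries can lean on the flag instead of the magic date—a trivial sketch, but it's the pattern I reach for:

-- Current records via the explicit flag (no knowledge of the high-date convention required)
SELECT *
FROM dim_customer
WHERE is_current = TRUE;
-- Equivalent, but tied to the magic-date convention:
-- WHERE valid_to = DATE '9999-12-31'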
Interval conventions. Use left‑closed, right‑open intervals (valid_from <= date < valid_to). This ensures every point in time maps to exactly one dimension version without ambiguity at boundaries.
Point‑in‑time lookup pattern:
SELECT d.*
FROM dim_customer d
WHERE @lookup_date >= d.valid_from
AND @lookup_date < d.valid_to;
When dimension attributes come from multiple systems, you need to conform those sources before SCD2 processing begins. The solution is a Unified Timeline that collects all change points from all sources, then joins each source back using as‑of‑date logic:
-- Collect ALL change points from ALL sources
WITH unified_change_points AS (
    SELECT customer_id, change_timestamp FROM crm_customers
    UNION
    SELECT customer_id, change_timestamp FROM erp_customers
    UNION
    SELECT customer_id, change_timestamp FROM web_profiles
),
-- Build time slices from unified timeline
unified_time_slices AS (
    SELECT
        customer_id,
        change_timestamp AS valid_from,
        COALESCE(
            LEAD(change_timestamp) OVER (PARTITION BY customer_id ORDER BY change_timestamp),
            TIMESTAMP '9999-12-31 23:59:59'
        ) AS valid_to
    FROM unified_change_points
),
-- Join each source back using as-of logic
final_dimension AS (
    SELECT
        ts.customer_id,
        ts.valid_from,
        ts.valid_to,
        crm.customer_name,
        erp.credit_limit,
        erp.payment_terms,
        web.email_opt_in,
        web.preferred_contact_method
    FROM unified_time_slices ts
    LEFT JOIN crm_customers crm
        ON ts.customer_id = crm.customer_id
        AND crm.change_timestamp = (
            SELECT MAX(change_timestamp)
            FROM crm_customers
            WHERE customer_id = ts.customer_id
              AND change_timestamp <= ts.valid_from
        )
    LEFT JOIN erp_customers erp
        ON ts.customer_id = erp.customer_id
        AND erp.change_timestamp = (
            SELECT MAX(change_timestamp)
            FROM erp_customers
            WHERE customer_id = ts.customer_id
              AND change_timestamp <= ts.valid_from
        )
    LEFT JOIN web_profiles web
        ON ts.customer_id = web.customer_id
        AND web.change_timestamp = (
            SELECT MAX(change_timestamp)
            FROM web_profiles
            WHERE customer_id = ts.customer_id
              AND change_timestamp <= ts.valid_from
        )
)
SELECT * FROM final_dimension;
Document attribute ownership explicitly. When two sources provide the same attribute, decide which wins and codify that rule (e.g., using COALESCE or CASE). Keeping this logic in code rather than tribal knowledge prevents future conflicts.
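Here's a minimal sketch of what a codified ownership rule looks like, using inline sample data and a hypothetical rule that CRM owns customer_name with ERP as the fallback:

-- Hypothetical precedence rule: CRM wins; ERP only fills gaps when CRM has no value
WITH crm(customer_id, customer_name) AS (
    VALUES (1, 'Acme Corp'), (2, CAST(NULL AS VARCHAR))
),
erp(customer_id, customer_name) AS (
    VALUES (1, 'ACME CORPORATION'), (2, 'Globex Ltd')
)
SELECT
    crm.customer_id,
    COALESCE(crm.customer_name, erp.customer_name) AS customer_name  -- ownership rule lives in code
FROM crm
LEFT JOIN erp ON erp.customer_id = crm.customer_id;
-- Returns 'Acme Corp' for customer 1 (CRM wins) and 'Globex Ltd' for customer 2 (fallback).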
Hash‑based change detection replaces expensive multi‑column comparisons with a single‑column hash comparison. The Healing Tables framework uses a two‑hash strategy:
key_hash – identifies the business entity.
row_hash – detects when any tracked attribute changes.
Example (Trino/Starburst syntax):
-- Business-key hash
lower(to_hex(sha256(to_utf8(
    COALESCE(CAST(source_system AS VARCHAR), '') || '|' ||
    COALESCE(CAST(customer_id AS VARCHAR), '')
)))) AS key_hash;

-- Row-level hash (attributes only)
lower(to_hex(sha256(to_utf8(
    COALESCE(customer_name, '^^NULL^^') || '|' ||
    COALESCE(customer_status, '^^NULL^^') || '|' ||
    COALESCE(CAST(credit_limit AS VARCHAR), '^^NULL^^') || '|' ||
    COALESCE(CAST(birth_date AS VARCHAR), '^^NULL^^')
)))) AS row_hash;
Use a placeholder (^^NULL^^) for NULL values so that ('John', NULL) and ('John', '') produce different hashes.
Why not CONCAT_WS? In most dialects it silently skips NULL arguments, so two genuinely different states can collapse to the same hash input and register as unchanged. Always wrap columns in COALESCE before concatenation (see the sketch after these notes).
Column ordering best practice: concatenate columns in a fixed, documented order (alphabetical works well). A stable order keeps hashes deterministic across models and developers and makes it unambiguous where new attributes slot in as the schema evolves.
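Here's a small sketch (Trino syntax, toy inline data) of why the placeholder matters: a NULL status and an empty-string status are different states, and the COALESCE pattern keeps their hash inputs distinct.

-- NULL vs empty string must produce different hash inputs
SELECT
    customer_name,
    customer_status,
    COALESCE(customer_name, '^^NULL^^') || '|' || COALESCE(customer_status, '^^NULL^^') AS hash_input
FROM (VALUES ('John', CAST(NULL AS VARCHAR)), ('John', '')) AS t(customer_name, customer_status);
-- hash_input: 'John|^^NULL^^' vs 'John|' -- distinct inputs, therefore distinct row_hash values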
Platform‑specific hash functions:
| Platform | Function | Syntax |
|------------|------------------------------|------------------------------------------|
| Snowflake | SHA2 | SHA2(expression, 256) |
| BigQuery | SHA256 + TO_HEX | TO_HEX(SHA256(expression)) |
| Trino | sha256 + to_hex | to_hex(sha256(to_utf8(expression))) |
| SQL Server | HASHBYTES | HASHBYTES('SHA2_256', expression) |
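For reference, here's how the row-level hash from the Trino example might look in Snowflake—a sketch assuming the same column names; Snowflake's SHA2 already returns a hex-encoded string, so no to_hex wrapper is needed:

-- Row-level hash, Snowflake syntax (sketch)
SELECT
    SHA2(
        COALESCE(customer_name, '^^NULL^^') || '|' ||
        COALESCE(customer_status, '^^NULL^^') || '|' ||
        COALESCE(CAST(credit_limit AS VARCHAR), '^^NULL^^') || '|' ||
        COALESCE(CAST(birth_date AS VARCHAR), '^^NULL^^'),
        256
    ) AS row_hash
FROM raw_customers;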
Row compression eliminates consecutive duplicate states—adjacent time periods where all attributes are identical. This is an “Islands and Gaps” pattern:
-- Detect group boundaries
WITH boundary_detection AS (
    SELECT
        *,
        CASE
            WHEN LAG(valid_to) OVER (PARTITION BY customer_id ORDER BY valid_from) = valid_from
             AND row_hash = LAG(row_hash) OVER (PARTITION BY customer_id ORDER BY valid_from)
                THEN 0 -- Continuation of previous group
            ELSE 1 -- Start of new group
        END AS is_group_start
    FROM dimension_with_hashes
),
-- Assign group IDs via running sum
group_assignment AS (
    SELECT
        *,
        SUM(is_group_start) OVER (
            PARTITION BY customer_id
            ORDER BY valid_from
            ROWS UNBOUNDED PRECEDING
        ) AS group_id
    FROM boundary_detection
)
-- Collapse each group to a single row
SELECT
    customer_id,
    MIN(valid_from) AS valid_from,
    MAX(valid_to) AS valid_to,
    MAX(CASE WHEN is_current THEN 1 ELSE 0 END) = 1 AS is_current,
    MAX(customer_name) AS customer_name,
    MAX(customer_status) AS customer_status,
    MAX(row_hash) AS row_hash,
    MAX(key_hash) AS key_hash,
    COUNT(*) AS compressed_row_count -- audit
FROM group_assignment
GROUP BY customer_id, group_id
ORDER BY customer_id, valid_from;
Rows are compressed only when both:
Temporal contiguity – previous valid_to equals current valid_from.
Attribute identity – row_hash values match.
This prevents merging separate business states that happen to share the same attribute values but are separated by a gap in time.
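A minimal sketch with toy data makes the point: two rows with identical hashes but a gap between them stay separate, because the contiguity check fails.

-- Matching hashes alone do not merge rows; the periods must also be contiguous
WITH dimension_with_hashes(customer_id, valid_from, valid_to, row_hash) AS (
    VALUES
        (42, DATE '2024-01-01', DATE '2024-02-01', 'abc'),  -- Active in January
        (42, DATE '2024-03-01', DATE '2024-04-01', 'abc')   -- Active again in March, after a gap
)
SELECT
    *,
    CASE
        WHEN LAG(valid_to) OVER (PARTITION BY customer_id ORDER BY valid_from) = valid_from
         AND row_hash = LAG(row_hash) OVER (PARTITION BY customer_id ORDER BY valid_from)
            THEN 0
        ELSE 1
    END AS is_group_start   -- both rows return 1: two groups, no merge
FROM dimension_with_hashes;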
Production SCD2 dimensions need comprehensive validation to ensure temporal integrity. Tests should run after every load and block pipeline completion on failure.
Test 1 – Single current record per key
SELECT customer_id, COUNT(*) AS current_record_count
FROM dim_customer
WHERE is_current = TRUE
GROUP BY customer_id
HAVING COUNT(*) <> 1;
-- Expected: zero rows
Test 2 – No overlapping date ranges
SELECT a.*, b.*
FROM dim_customer a
JOIN dim_customer b
  ON a.customer_id = b.customer_id
 AND a.valid_from <> b.valid_from   -- different versions of the same key
WHERE a.valid_from < b.valid_to
  AND a.valid_to > b.valid_from;
-- Expected: zero rows
Test 3 – No timeline gaps (if continuity required)
WITH gap_check AS (
    SELECT
        customer_id,
        valid_from,
        LAG(valid_to) OVER (PARTITION BY customer_id ORDER BY valid_from) AS prev_valid_to
    FROM dim_customer
)
SELECT customer_id, prev_valid_to AS gap_start, valid_from AS gap_end
FROM gap_check
WHERE prev_valid_to IS NOT NULL
  AND prev_valid_to <> valid_from;
-- Expected: zero rows (or documented acceptable gaps)
Test 4 – Valid date ordering
SELECT *
FROM dim_customer
WHERE valid_from >= valid_to;
-- Expected: zero rows
Test 5 – No consecutive duplicate versions after compression
WITH version_check AS (
    SELECT
        customer_id,
        valid_from,
        row_hash,
        LAG(row_hash) OVER (PARTITION BY customer_id ORDER BY valid_from) AS prev_row_hash,
        LAG(valid_to) OVER (PARTITION BY customer_id ORDER BY valid_from) AS prev_valid_to
    FROM dim_customer
)
SELECT *
FROM version_check
WHERE row_hash = prev_row_hash
  AND valid_from = prev_valid_to;
-- Expected: zero rows
In dbt, these checks can be packaged as reusable generic tests and attached to the model in schema.yml, so they run on every build; a sketch of one follows.
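Here's the first check written as a dbt generic test—the file path and test name are my own conventions, not anything standardized, and it assumes the model exposes the is_current column built above:

-- tests/generic/assert_single_current_record.sql
{% test assert_single_current_record(model, column_name) %}
select {{ column_name }}
from {{ model }}
where is_current = true
group by {{ column_name }}
having count(*) <> 1
{% endtest %}

Attach it to the dimension's business key in schema.yml (e.g., under the customer_id column's tests), and dbt fails the build whenever the query returns rows.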
Healing Tables are path‑independent: the dimension produced depends only on source data contents, not on the history of load operations. Re‑run any date range with corrected logic, and the result is identical to a full rebuild.
This property changes how teams handle data‑quality issues:
Bug in change detection logic? Fix it, reprocess the affected range, done.
Source system correction applied retroactively? Reprocess automatically incorporates it.
New attribute added? Reprocess rebuilds all historical versions with the new column.
Suspected data corruption? Rebuild and compare; differences reveal the problem.
The dimension becomes reproducible in the same way good software builds are reproducible. Given the same inputs and logic, you always get the same outputs. This isn’t just convenient for debugging—it’s a fundamental quality characteristic that enables reliable data pipelines.
Healing Tables work best when:
Complete source history is available.
Source data is contiguous (no unexplained gaps).
Rebuild time is acceptable for the volume of data.
Sources use logical deletes rather than physical row removal.
They are less suitable for:
Very high‑volume dimensions where full rebuilds are impractical.
Real‑time or near‑real‑time requirements.
Situations where source history is unavailable.
Simple, stable dimensions with well‑understood sources.
Many teams combine approaches: run Healing Tables periodically (e.g., weekly or monthly) to “heal” the dimension, while keeping a lightweight incremental process for day‑to‑day updates.
That 2 AM debugging session taught me something important: the traditional incremental SCD2 approach creates tight coupling between historical load operations and current data quality. Every bug, edge case, and source‑system quirk gets baked into the dimension permanently.
Healing Tables break that coupling. By extracting change points from source data, constructing time slices independently, compressing duplicate states, and validating results before loading, you create dimensions that are deterministically reproducible from source data.
The six‑step framework isn’t complicated, but it requires discipline:
Build the Effectivity Table with actual changes only.
Generate Time Slices with proper valid_from/valid_to boundaries.
Join multiple sources using a unified timeline.
Compute hashes for efficient change detection.
Compress consecutive identical states.
Validate temporal integrity before loading.
Implement those steps correctly, and you’ll have dimensions that can heal themselves from whatever accumulated issues have crept in over months or years of operation.
The next time someone asks you to backfill three years of history by running the daily process 1,095 times, you’ll have a better answer.