Why Schema Drift Breaks Pipelines (And What to Do About It)

Michael Adeyemi

February 11, 2025 9 min read

Database schema diagram showing field rename from order_amount to total_value

Schema drift is not a bug in your pipeline code. It is not a failure in your orchestrator. It is a structural property of any system that ingests data from a source it does not control. The schema of that source will change. The only question is whether your pipeline finds out before or after the damage is done.

Most find out after. That's the problem this article is about.

What schema drift actually looks like

When engineers talk about schema drift, they usually mean one of a handful of change types:

Column rename: order_amount becomes total_value. The column still exists. The data is still there. But any downstream reference to the old name now resolves to NULL.
Column drop: A field disappears entirely from the source table. If your pipeline casts it as NOT NULL at the destination, ingestion fails. If it's nullable, ingestion succeeds and you have a column of NULLs you didn't expect.
Type change: VARCHAR(32) becomes TEXT. Often backward-compatible in theory, but warehouse-specific casting rules may reject it. Or FLOAT becomes DECIMAL(10,4), and your dbt model that does arithmetic on that column starts producing rounding differences.
Type widening: INT32 becomes INT64. Usually safe, but not always — if you're storing IDs in a column typed INT32 and the vendor upgrades their ID generation, values above 2,147,483,647 will overflow without warning at the destination.
Nullability change: A column that was always populated becomes nullable — or vice versa. The pipeline doesn't care. The data contract you had with your downstream models does.
Column reorder: Column ordinal positions change. Pipelines that map by position rather than by name silently misalign all values. A COPY INTO that loads CSV by column position will assign customer_id values into the product_sku column without a single error.

All of these cause different failure modes downstream. But they share one property: the pipeline usually keeps running after they happen. That's what makes them so dangerous.

Why pipelines keep running through schema drift

Consider a Postgres → Snowflake pipeline built on a standard EL tool. The tool connects to Postgres, reads rows from the orders table, and writes them to Snowflake. It knows the schema because it introspected it once, when you first set up the connector.

Now the source team runs an ALTER TABLE orders RENAME COLUMN order_amount TO total_value. From the EL tool's perspective, nothing visible has changed. It's still reading from the orders table. Rows still come back. The row count hasn't changed. The ingestion completes with status "success."

What's happened in Snowflake: the order_amount column at the destination now receives NULL for every row, because the source stopped emitting that field name. If Snowflake's destination table allows NULLs on that column (which most do, by default), the rows land cleanly. Zero errors. Zero alerts.

Your revenue dashboard is now showing $0 for every order from that point forward. The chart still renders. The green checkmark on the pipeline run still shows. Nobody knows anything is wrong until someone looks at the actual revenue numbers and asks why they dropped 100% on a Tuesday.

This isn't a pathological edge case. A mid-size analytics team pulling from 15–25 SaaS sources should expect schema changes at a rate of roughly 2–5 per month across the source portfolio. The question isn't if you'll hit this — it's whether your detection layer is running before the data lands or after the Monday morning standup.

Why it's so hard to catch

Three reasons schema drift is so consistently caught late:

1. Orchestrators validate execution, not schema contracts. Airflow, Prefect, and Dagster all provide excellent DAG execution monitoring — run durations, failure rates, retry counts. They tell you whether your tasks completed. They do not, by default, tell you whether the data that moved through those tasks matched the schema your downstream models expect. Those are different questions, and most orchestrators only answer the first one.

2. The signal is absence, not error. When a column drops or renames, what you're looking for is NULL in a column that shouldn't be NULL, or a column that used to have data now being zero. These are statistical signals in your data, not execution signals in your infrastructure. You find them by looking at dashboards or running data quality tests — not by reading log files.

The Snowflake INFORMATION_SCHEMA.COLUMNS view will tell you the current schema. It won't tell you that it changed on Friday at 6pm or that it was different three months ago. Schema history is not stored by default in most data warehouses.

3. Vendor API change windows don't align with your monitoring windows. SaaS vendors push schema changes during their own maintenance windows. Those maintenance windows are often Friday evening or weekend deployments. Your data quality tests might run once a day, at 3am. There's a window of 24–72 hours between when the change lands and when your tests catch it — if your tests catch it at all. If they don't have an assertion on that specific column, they won't.

dbt's source freshness checks will tell you whether data arrived recently. They won't tell you whether the data that arrived has the structure you expected. Those are different guardrails for different failure modes.

The mechanics of a real-world incident

Here's how a typical schema drift incident unfolds in a team without automated detection:

Friday 7:42pm: Vendor pushes API v2.1 to production. order_amount renamed to total_value. No deprecation warning in the release notes (or one sentence buried in a changelog nobody reads).
Friday 8pm: the EL tool's next scheduled run ingests from v2.1. The column mapping silently fails. NULLs land in order_amount in the warehouse. Run exits with status "completed".
Saturday, Sunday: Weekend runs continue. NULLs accumulate. Nobody is watching dashboards.
Monday 9:15am: Revenue standup. Someone pulls the weekend report. Revenue shows zero. Pandemonium.
Monday 10:30am: After checking the pipeline logs (which show no errors), someone finally queries INFORMATION_SCHEMA.COLUMNS in the source and notices the column name changed. The root cause is identified.
Monday 11am: The pipeline is manually updated. A backfill is triggered for the weekend data — but only if the vendor's API supports historical replay. If not, that 3 days of transaction data is corrupted at the destination and cannot be recovered without a full re-ingestion from the source, which may no longer be available in that form.

The total elapsed time from schema change to resolution: ~62 hours. The total elapsed time from schema change to detection: ~61 hours. A 1-hour window for the actual fix.

The backfill question is worth expanding on. Suppose the vendor's API does support historical replay for up to 30 days. You can re-ingest the affected window, but you'll need to handle deduplication at the destination — the rows from Friday and the weekend already landed (as NULLs) and those records exist in Snowflake. A naïve re-run with INSERT will produce duplicates. You need a MERGE with the source record ID as the key, which requires that you designed your destination schema with a stable primary key from the source. If you used a surrogate key generated at ingest time, you cannot deduplicate the backfill.

This is why idempotency design and schema drift detection are connected concerns. Teams that handle both well suffer through the outage faster.

What detection actually requires

To catch schema drift before data moves, you need schema comparison that runs at ingest time, not at query time.

The principle: before you read any row from a source, fingerprint its current schema and compare it against the last known good schema. If they differ, evaluate whether the difference is safe (a column was added — fine) or breaking (a column was renamed or dropped — stop and alert).

Schema fingerprinting works by computing a hash or canonical representation of a table's column set — names, types, nullability, and order. Storing this fingerprint after each successful run creates a schema history that most warehouses don't maintain natively. On the next run, a mismatch between the current fingerprint and the stored fingerprint triggers the evaluation logic.

This is fundamentally different from running dbt tests after the fact. dbt tests run on data that's already in your warehouse. By the time they fire, the corrupted data is already there. Schema drift detection at the connector level runs before any data moves, and can pause ingestion before the first NULL row reaches your destination table.

The change types that matter most for automated pausing:

Column rename: Always pause unless you have an explicit remap rule. A rename is a data contract violation.
Column drop (NOT NULL destination column): Always pause. The destination schema will reject or NULL-fill every row.
Type change (non-widening): Pause and alert. VARCHAR to INT is never safe without manual review.

Changes that are generally safe to auto-migrate:

Column add: New column appears at source. Add it to the destination table automatically. This is an additive, non-breaking change.
Type widening (numeric): INT32 → INT64, FLOAT32 → FLOAT64. Safe to auto-migrate in most cases.
Ordering change: Column ordinal positions changed. Irrelevant if you're mapping by name, not by position.

The nuance is that different pipelines have different risk tolerances. A pipeline feeding a financial reporting table has zero tolerance for silent type changes. A pipeline feeding a marketing analytics table might allow type widening without an alert. This is why drift rules need to be configurable per-pipeline, not global.

In Queryvine's drift_rules.yaml, you specify this per source:

source: salesforce.opportunities
on_column_rename: pause_and_alert
on_column_drop: pause_and_alert
on_type_widening: auto_migrate
on_column_add: auto_migrate
notify:
  channel: "#data-alerts"
  severity: high

Running qv schema diff salesforce.opportunities will show you the current live schema against the last stored fingerprint before any run executes. This is useful during incident triage — you can manually check whether a source's schema has changed without waiting for the next scheduled run.

The limits of drift detection — what it doesn't cover

We're not saying schema drift detection replaces data quality testing. It doesn't.

Schema drift detection tests the structure of data — whether the fields that were supposed to arrive actually arrived with the types and names you expected. dbt tests and Great Expectations test the content of data — whether revenue values are in a plausible range, whether foreign keys resolve, whether a column is ever NULL that shouldn't be based on business rules.

Both layers are necessary. A pipeline can pass schema drift detection (the column is present with the right type) and still deliver data with the wrong values — for example, a vendor that changes the currency unit of a monetary column from USD to EUR without renaming the column. That's a semantic change, not a structural one. Schema fingerprinting won't catch it. A data quality test asserting plausible revenue ranges will.

Similarly, schema drift detection won't catch incorrect data from source-side business logic changes — when a vendor changes what "active customer" means in their CRM without changing the field name. You find that class of problem through anomaly detection on statistical distributions, not through structural schema checks.

Running only schema drift detection without data quality tests leaves you blind to semantic anomalies. Running only data quality tests without schema drift detection leaves you blind to the structural failures that produce NULL rows. Neither is a substitute for the other.

Building toward resilience

Schema drift is reliable. Upstream schemas will change. Vendors will push API updates. Application teams will run ALTER TABLE without notifying the data engineering team. The only question is whether your pipeline detects the change before it reaches your warehouse or after the Monday morning standup.

Detection at the edge — at ingest time, before any row moves — is the only approach that prevents corrupted data from landing. Everything else is forensics: tracing backward from wrong dashboard numbers through pipeline logs to the source change, then figuring out how to reconstruct the corrupted data. That forensics work routinely takes hours, sometimes days, and sometimes involves data that simply cannot be recovered because the source no longer has it in the form you need.

The practical path forward is incremental. Start by auditing which of your sources have ever experienced a silent schema change — most teams doing this audit for the first time find 2–4 incidents in the previous 12 months that they thought were "data issues" but were actually structural schema failures. Use that history to prioritize which sources need pre-ingest schema fingerprinting first. Add drift rules for your highest-criticality sources. Expand from there.

The goal is not zero schema changes — those will always happen. The goal is that every schema change is either handled automatically (additive) or caught at the pipeline edge before it touches your warehouse (breaking). That boundary, once in place, changes the character of schema drift from an invisible chronic risk to an operational event you can respond to in minutes.