Five DAG Dependency Anti-Patterns That Create Silent Data Bugs

Priya Nair

September 8, 2025 · 8 min read

[Figure: DAG showing anti-patterns such as the god task and circular dependencies]

DAG-based orchestration looks simple on paper: define tasks, declare dependencies, run in order. In practice, DAG design accrues technical debt in specific, repeatable ways. The patterns below are not obvious failures — they work fine at first. They become problems months later, when the DAG has grown, requirements have changed, and the original assumptions no longer hold.

Each of these anti-patterns causes a specific class of silent failure — the kind where the orchestrator reports "succeeded" but the data in your warehouse is wrong. Silent data bugs are the most expensive kind, because they compound before anyone finds them.

Anti-pattern 1: The god task

A god task is a single DAG task that does too many things: extracts data, transforms it, runs validation checks, loads it to the destination, sends a notification, and updates a status table — all in one step.

The failure mode: if any single step within the god task fails, the entire task fails and is retried from the beginning. If step 4 (loading to the destination) fails after step 3 (validation) passed, the retry re-runs steps 1 through 3 before reaching step 4 again. At best, this wastes compute. At worst, it creates duplicate data in intermediate tables that weren't designed for reprocessing, because the retry re-runs an extraction that already wrote partial data.

God tasks also defeat monitoring. When a task shows "failed" in the orchestrator, you want to know immediately what operation caused the failure. A god task that includes both schema validation and warehouse loading forces you to read application logs to determine which step failed — the orchestrator's task-level status is useless as a signal.

The fix: decompose into atomic tasks with explicit dependencies. Each task does one thing and has a clear success/failure state. Retries are scoped to the failed step, not the entire operation chain.

tasks:
  - id: extract_orders          # pulls rows from source API
  - id: validate_schema         # schema drift check, run_if: upstream succeeded
    depends_on: [extract_orders]
  - id: transform_revenue       # dbt model run
    depends_on: [validate_schema]
  - id: load_to_warehouse       # COPY INTO snowflake
    depends_on: [transform_revenue]
  - id: notify_on_complete      # Slack notification
    depends_on: [load_to_warehouse]

If load_to_warehouse fails, only that task retries. Steps 1–3 are not re-executed. The run_if condition noted on validate_schema means the schema check runs only after a fresh extract has just succeeded, not when replaying a partially complete state.
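To make the retry scoping concrete, here is a minimal Python sketch of a runner that retries only the step that failed. It is an illustration, not how any particular orchestrator is implemented; run_chain, flaky_load, and the max_retries parameter are hypothetical names, and the step names are borrowed from the config above.

```python
from typing import Callable, Dict, List


def run_chain(
    steps: List[str],
    actions: Dict[str, Callable[[], None]],
    max_retries: int = 2,
) -> Dict[str, int]:
    """Run steps in order, retrying only the step that failed.

    Returns the number of attempts each step needed.
    """
    attempts: Dict[str, int] = {}
    for name in steps:
        for attempt in range(1, max_retries + 2):
            attempts[name] = attempt
            try:
                actions[name]()
                break  # step succeeded, move on to the next one
            except Exception:
                if attempt == max_retries + 1:
                    raise  # retries exhausted, surface the failure


    return attempts


# Hypothetical load step that fails once, then succeeds.
load_calls = {"n": 0}


def flaky_load() -> None:
    load_calls["n"] += 1
    if load_calls["n"] == 1:
        raise RuntimeError("warehouse timeout")


steps = ["extract_orders", "validate_schema", "transform_revenue", "load_to_warehouse"]
actions: Dict[str, Callable[[], None]] = {name: (lambda: None) for name in steps}
actions["load_to_warehouse"] = flaky_load

attempts = run_chain(steps, actions)
```

In this run the three upstream steps execute once each and only load_to_warehouse is attempted twice. In a god task, the same failure would have re-run all four operations.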

Anti-pattern 2: Fan-out without barriers

Fan-out without a barrier task is a common source of unexpected partial-aggregation bugs. The pattern: a single task fans out to many parallel downstream tasks, and a final aggregation task is supposed to run only when all of them complete. Without an explicit barrier, the aggregation task may run on incomplete upstream data.

Consider: you have 12 regional ingestion tasks running in parallel, each loading a region's orders into a regional staging table. A final aggregation model joins all 12 tables to produce global revenue totals. In most orchestrators, you declare the aggregation task as depending on all 12 ingestion tasks. But if one of the 12 fails and is being retried, the orchestrator's behavior depends on its trigger rule for partial upstream completion.

In Airflow, the default trigger rule is ALL_SUCCESS — a task does not run until all its upstream dependencies succeed. This is the correct behavior. But if someone on the team used trigger_rule="all_done" (which runs as long as all upstreams have finished, regardless of their status), the aggregation will run even if one regional ingestion failed. It will aggregate 11 of 12 regions and produce a "succeeded" status on an incomplete result.
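The difference between the two rules can be modeled in a few lines of Python. This is a simplified model of the semantics described above, not Airflow's implementation; should_run and the state strings are assumptions made for the sketch.

```python
from typing import Iterable

# States in which an upstream task is considered finished (simplified).
FINISHED = {"success", "failed", "skipped", "upstream_failed"}


def should_run(trigger_rule: str, upstream_states: Iterable[str]) -> bool:
    """Decide whether a downstream task may start under a given trigger rule."""
    states = list(upstream_states)
    if trigger_rule == "all_success":
        # Every upstream must have succeeded.
        return all(state == "success" for state in states)
    if trigger_rule == "all_done":
        # Every upstream must have finished, whatever its outcome.
        return all(state in FINISHED for state in states)
    raise ValueError(f"unsupported trigger rule: {trigger_rule}")


# 11 regional ingestions succeeded, one failed.
regional_states = ["success"] * 11 + ["failed"]

runs_under_all_success = should_run("all_success", regional_states)
runs_under_all_done = should_run("all_done", regional_states)
```

With one failed region, the aggregation is held back under all_success but is allowed to start under all_done, which is exactly the case that produces an 11-of-12 result.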

The fix: be explicit about trigger rules on every aggregation and merge task. Default to ALL_SUCCESS. Review any all_done or one_success trigger rules in your DAG codebase and confirm they're intentional. Add an explicit barrier task after the fan-out group that validates the count of successful upstreams before the aggregation starts.
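A barrier task can be as small as a function that counts successful upstreams and fails loudly on a shortfall. The sketch below assumes the orchestrator can hand the barrier a mapping of upstream task ids to final states; check_barrier and the state strings are hypothetical.

```python
from typing import Dict


def check_barrier(upstream_states: Dict[str, str], expected_count: int) -> None:
    """Raise unless exactly `expected_count` upstream tasks succeeded."""
    succeeded = {task for task, state in upstream_states.items() if state == "success"}
    if len(succeeded) != expected_count:
        not_ok = sorted(set(upstream_states) - succeeded)
        raise RuntimeError(
            f"barrier failed: {len(succeeded)}/{expected_count} upstreams succeeded; "
            f"not successful: {not_ok}"
        )


# Twelve regional ingestion tasks, one of which failed.
states = {f"ingest_region_{i:02d}": "success" for i in range(1, 13)}
states["ingest_region_07"] = "failed"

try:
    check_barrier(states, expected_count=12)
    barrier_passed = True
except RuntimeError as err:
    barrier_passed = False
    barrier_message = str(err)
```

Because the barrier raises, the run ends in an explicit failure instead of a "succeeded" aggregation over 11 regions, even if a permissive trigger rule slipped in upstream.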

Anti-pattern 3: Implicit scheduling assumptions

Implicit scheduling assumptions are hardcoded time-based expectations embedded in task logic rather than declared in the DAG structure. The classic example: a pipeline that runs at 3am assumes the source database's ETL process (which it doesn't control) completes by 2:50am. This works for 11 months. Then the upstream ETL is slow one night and finishes at 3:05am. Your pipeline runs on yesterday's data without error, without warning, without visibility.

The subtler version: a dbt model assumes that a raw table was refreshed sometime earlier in the day. There's no explicit dependency between the ingestion pipeline and the dbt run; they're on separate schedules that happen to align most of the time. When the ingestion pipeline is delayed by maintenance, dbt runs on stale data. dbt reports "succeeded." Your BI users see yesterday's numbers. The source freshness check does not flag the table as stale, because the table did receive some data today, just not the batch your model was expecting.

Implicit scheduling assumptions are invisible to the orchestrator because the orchestrator doesn't know they exist. The task succeeds regardless of whether the upstream data is current.

The fix: make all data dependencies explicit. If your dbt model requires fresh data from a specific pipeline, that dbt run should be downstream of that pipeline in the DAG, not on a separate schedule. The orchestrator can then enforce the dependency.

For cross-pipeline dependencies (where the pipelines live in different orchestrators or different DAGs), use dataset sensors or external task sensors that wait for a signal from the upstream pipeline before proceeding. In Airflow, this is ExternalTaskSensor. In Dagster, asset dependencies handle this natively. Don't rely on wall-clock time as a proxy for data availability.
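The waiting behavior of a sensor can be sketched in plain Python: poll for a signal that the upstream batch exists, and fail explicitly on timeout instead of proceeding on stale data. This is not the ExternalTaskSensor API; wait_for_upstream, is_ready, and the injectable sleep and clock parameters are assumptions made so the sketch is self-contained and testable.

```python
import time
from typing import Callable


def wait_for_upstream(
    is_ready: Callable[[], bool],
    timeout_s: float,
    poke_interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Block until is_ready() returns True; raise TimeoutError otherwise.

    Returns the number of polls that were needed.
    """
    deadline = clock() + timeout_s
    polls = 0
    while True:
        polls += 1
        if is_ready():
            return polls
        if clock() >= deadline:
            raise TimeoutError("upstream batch not available before timeout")
        sleep(poke_interval_s)


# Hypothetical readiness signal: the upstream batch lands on the third poll.
signals = iter([False, False, True])
polls_needed = wait_for_upstream(
    is_ready=lambda: next(signals),
    timeout_s=60,
    poke_interval_s=0,
    sleep=lambda _seconds: None,  # no real sleeping in the example
)
```

In practice is_ready would check something specific to the expected batch, such as a completion marker for the run's logical date, so that "some data arrived today" does not count as ready.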

Anti-pattern 4: Partial-refresh chains

A partial-refresh chain is a sequence of tasks where later tasks perform incremental operations on data produced by earlier tasks, but the earlier tasks might have only refreshed a subset of the data. The chain produces a consistent-looking result that is actually based on mixed-vintage data.

Example: a pipeline has three stages. Stage 1 refreshes the last 7 days of raw orders from a Fivetran Postgres connector. Stage 2 runs a dbt incremental model that appends new records to a fct_revenue table. Stage 3 runs a full-refresh model that aggregates from fct_revenue into rpt_monthly_revenue.

The problem: stage 1 only looks back 7 days. If a SaaS vendor applies a historical correction to orders from 30 days ago (refund reprocessing, currency exchange rate adjustment, subscription backdating), stage 1 never picks it up. Stage 2's incremental model appends what stage 1 gave it, so fct_revenue keeps the uncorrected rows for the older period. Stage 3 aggregates that stale history and produces wrong monthly figures with a "succeeded" status.

The fix: be explicit about what a partial refresh covers and test that it covers what you think it covers. For pipelines where historical corrections are possible, use a watermark strategy that includes an overlap window: refresh the last 30 days even in "incremental" mode, so corrections within that window are picked up. In dbt, the microbatch incremental strategy exposes this as the lookback configuration; with other incremental strategies, the overlap is usually written into the model's is_incremental() filter. In Queryvine, the backfill_overlap_days setting on a pipeline definition extends the extraction window on each run to account for late-arriving records.
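As a rough sketch of the overlap-window idea, the function below widens the extraction window by a fixed number of days behind the last watermark. The names extraction_window and upsert_by_key are hypothetical and do not mirror any dbt or Queryvine API. Note the second half: re-extracted rows have to be merged by key, otherwise the overlap itself creates duplicates.

```python
from datetime import date, timedelta
from typing import Dict, Tuple


def extraction_window(
    last_watermark: date, run_date: date, overlap_days: int
) -> Tuple[date, date]:
    """Return (start, end) for this run, reaching overlap_days behind the watermark."""
    if overlap_days < 0:
        raise ValueError("overlap_days must be non-negative")
    return last_watermark - timedelta(days=overlap_days), run_date


def upsert_by_key(target: Dict[str, float], rows: Dict[str, float]) -> None:
    """Merge re-extracted rows by order id so overlapping rows replace, not duplicate."""
    target.update(rows)


start, end = extraction_window(date(2025, 9, 7), date(2025, 9, 8), overlap_days=30)

# A vendor correction dated 2025-08-20 falls inside the 30-day overlap...
correction_date = date(2025, 8, 20)
covered_with_30 = start <= correction_date <= end

# ...but outside a 7-day overlap.
short_start, _ = extraction_window(date(2025, 9, 7), date(2025, 9, 8), overlap_days=7)
covered_with_7 = short_start <= correction_date

# Re-extracting order "o2" with a corrected amount replaces the old row.
fct_revenue = {"o1": 100.0, "o2": 200.0}
upsert_by_key(fct_revenue, {"o2": 150.0})
```

The overlap length is a trade-off: a longer window catches older corrections at the cost of re-reading more rows on every run, and corrections older than the window still require a periodic full refresh.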

Anti-pattern 5: Diamond dependencies with implicit join assumptions

A diamond dependency pattern is: task A fans out to tasks B and C, which both feed into task D. The diamond looks clean in the DAG. The problem emerges when tasks B and C process different time windows or use different watermark policies, and task D joins them.

Example: task A extracts raw orders from Postgres. Task B transforms orders by customer segment (incremental, last 24 hours). Task C transforms orders by product category (incremental, last 7 days). Task D joins the outputs of B and C to produce a customer-product revenue breakdown.

Task D's output looks correct on normal days. The problem appears when an upstream correction touches orders that are older than 24 hours but newer than 7 days: task C's 7-day window reprocesses the corrected rows, while task B's 24-hour window never sees them. The join then produces revenue figures that don't match either branch's individual aggregates. The mismatch is subtle (typically 0.3–2% of total revenue) and may go undetected for months until an audit surfaces the discrepancy.
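A toy illustration of how the two branches drift apart, using made-up order amounts. The incremental_refresh function is a hypothetical stand-in for each branch's incremental logic: it overwrites only the rows whose age falls inside the branch's window.

```python
from typing import Dict, Tuple

# order_id -> (age in days, amount)
Source = Dict[str, Tuple[int, float]]


def incremental_refresh(
    table: Dict[str, float], source: Source, window_days: float
) -> Dict[str, float]:
    """Overwrite rows in `table` for orders younger than `window_days`."""
    for order_id, (age_days, amount) in source.items():
        if age_days < window_days:
            table[order_id] = amount
    return table


# Initial state: both branches were built from the same source.
source_before: Source = {"o1": (0, 100.0), "o2": (3, 200.0)}
by_segment = incremental_refresh({}, source_before, window_days=float("inf"))
by_category = dict(by_segment)

# A correction lowers the 3-day-old order o2 from 200 to 150.
source_after: Source = {"o1": (0, 100.0), "o2": (3, 150.0)}
incremental_refresh(by_segment, source_after, window_days=1)   # 24-hour branch
incremental_refresh(by_category, source_after, window_days=7)  # 7-day branch

segment_total = sum(by_segment.values())    # still includes the old amount for o2
category_total = sum(by_category.values())  # reflects the correction
```

Both branches report success, yet the segment branch totals 300.0 and the category branch totals 250.0. Any join across them inherits that disagreement.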

The fix: when two branches of a diamond will be joined downstream, they must process the same time window or partition key. Either both are full-refresh, or both are incremental with the same watermark and the same overlap window. The simplest enforcement: inject a shared watermark parameter at the DAG level and pass it to both branches explicitly:

tasks:
  - id: set_window              # computes one window for the whole run
    type: compute
    output:
      window_start: "{{ execution_date - interval_days(7) }}"
      window_end: "{{ execution_date }}"

  - id: transform_by_segment    # branch B: reads the shared window
    depends_on: [set_window]
    watermark_start: "{{ tasks.set_window.output.window_start }}"
    watermark_end: "{{ tasks.set_window.output.window_end }}"

  - id: transform_by_category   # branch C: reads the same shared window
    depends_on: [set_window]
    watermark_start: "{{ tasks.set_window.output.window_start }}"
    watermark_end: "{{ tasks.set_window.output.window_end }}"

  - id: join_revenue            # joins two branches built over identical windows
    depends_on: [transform_by_segment, transform_by_category]

Both branches receive the same window boundaries from set_window. The shared parameter is the single source of truth — not each branch's individual state store.
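A cheap extra guard is to have the join task verify that the branches really did process the same window before joining. The sketch below assumes each branch records the window it used; assert_aligned and the Window alias are hypothetical.

```python
from datetime import date
from typing import Dict, Tuple

Window = Tuple[date, date]


def assert_aligned(branch_windows: Dict[str, Window]) -> Window:
    """Return the shared window, or raise if the branches disagree."""
    distinct = set(branch_windows.values())
    if len(distinct) != 1:
        raise ValueError(f"branches processed different windows: {branch_windows}")
    return distinct.pop()


shared = (date(2025, 9, 1), date(2025, 9, 8))

# Both branches took their window from the same source: the check passes.
window = assert_aligned(
    {"transform_by_segment": shared, "transform_by_category": shared}
)

# One branch fell back to its own 24-hour window: the check fails loudly.
try:
    assert_aligned(
        {
            "transform_by_segment": (date(2025, 9, 7), date(2025, 9, 8)),
            "transform_by_category": shared,
        }
    )
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

This turns a silent 0.3–2% discrepancy into an explicit task failure at the point where the branches meet.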

Diagnosis and prevention

The five anti-patterns above share a common root: dependencies that are real but not declared, assumptions that are real but not tested, and time windows that should be aligned but are not enforced. All of them cause silent data bugs — wrong results with a "succeeded" status, which are harder to find and more expensive to fix than explicit failures.

We're not saying these patterns are always avoidable. In early-stage pipelines, a god task gets you moving quickly. Implicit scheduling assumptions are often acceptable when you control both sides of a dependency. The point is that these patterns become liabilities as the pipeline grows, and converting them later is harder than designing them correctly from the start.

Prevention starts with DAG review discipline. For any new DAG or significant change to an existing one, answer these questions before merging:

Is every data dependency declared as an explicit depends_on? If task D uses data produced by task B, is that relationship declared?
Are trigger rules explicit on every aggregation or merge task? Is ALL_SUCCESS the default?
Are time window assumptions enforced by sensors, not by wall-clock schedule alignment?
Do diamond branches use the same watermark parameter from a shared source?
Does each task do exactly one logical operation?

These questions take 10 minutes per DAG to answer during review. They surface the assumptions that, if left implicit, will create a data bug in production somewhere between 3 weeks and 3 months after deployment — typically at the worst possible time.
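Some of these checks can be partly automated. The sketch below lints a DAG definition loaded into Python dictionaries shaped like the config examples in this article. The trigger_rule key and the lint_dag function are assumptions for illustration, and the lint covers only three of the questions: undeclared dependencies, missing trigger rules on merge tasks, and mismatched watermarks across the branches of a merge.

```python
from typing import Any, Dict, List


def lint_dag(tasks: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable findings for a list of task definitions."""
    by_id = {task["id"]: task for task in tasks}
    findings: List[str] = []
    for task in tasks:
        deps = task.get("depends_on", [])
        # Every declared dependency must point at a task that exists.
        for dep in deps:
            if dep not in by_id:
                findings.append(f"{task['id']}: depends_on references undeclared task '{dep}'")
        if len(deps) > 1:
            # Merge and aggregation tasks should state their trigger rule.
            if "trigger_rule" not in task:
                findings.append(f"{task['id']}: merge task has no explicit trigger_rule")
            # Branches feeding a merge should share one watermark expression.
            starts = {by_id[d].get("watermark_start") for d in deps if d in by_id}
            if len(starts) > 1:
                findings.append(f"{task['id']}: upstream branches use different watermark_start values")
    return findings


shared_start = "{{ tasks.set_window.output.window_start }}"
dag = [
    {"id": "set_window"},
    {"id": "transform_by_segment", "depends_on": ["set_window"], "watermark_start": shared_start},
    {"id": "transform_by_category", "depends_on": ["set_window"], "watermark_start": shared_start},
    {"id": "join_revenue", "depends_on": ["transform_by_segment", "transform_by_category"]},
]

findings = lint_dag(dag)
```

For the diamond above, the only finding is the missing trigger rule on join_revenue. A lint like this does not replace review, since implicit scheduling assumptions by definition do not appear in the DAG file.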