Diagnosing Lag in Real-Time Data Pipelines: A Field Guide

Marcus Okafor

May 6, 2025 11 min read

Consumer lag is growing and you don't know why. Your Kafka consumer group is behind by 200,000 messages and the number is still climbing. The SLA requires data to land in the warehouse within 15 minutes of event production. You have 10 minutes before the breach.

This is not a hypothetical. It happens to every team running real-time pipelines long enough. What you need in that moment is not a general explanation of Kafka architecture — you need a systematic method that tells you where to look first and what each finding implies about where to look next.

This article is that method. It's written for staff-level data engineers and platform engineers who operate real-time pipelines in production and need a structured approach to lag triage that goes beyond "restart the consumer and hope."

Before you start: what consumer lag actually measures

Consumer lag is the difference between the latest offset in a topic partition and the last committed offset of your consumer group. A lag of 200,000 messages means your consumer group's last committed offset is 200,000 messages behind where the topic's current end is. It does not, by itself, tell you whether this is a new problem (lag spiked in the last 5 minutes) or a chronic condition (lag has been growing for 3 hours).

Your first data point is the lag trend: is the number growing, stable, or shrinking? A growing lag means your consumer is currently processing slower than the producer is producing. A stable lag means consumer throughput matches producer throughput, so the existing backlog is not being worked down. A shrinking lag means the consumer is catching up. The treatment differs depending on which of these you're in.
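As a rough illustration of that three-way split, here is a minimal sketch that classifies a short series of total-lag samples. The function name, the 5% tolerance, and the sample values are assumptions made for the example, not something prescribed by Kafka or by any metrics system.

```python
from typing import Sequence


def classify_lag_trend(samples: Sequence[int], tolerance: float = 0.05) -> str:
    """Classify total-lag samples (oldest first) as growing, stable, or shrinking.

    The tolerance is a hypothetical noise band: relative changes smaller
    than it are treated as "stable".
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to see a trend")
    first, last = samples[0], samples[-1]
    baseline = max(first, 1)  # avoid dividing by zero when lag started at 0
    change = (last - first) / baseline
    if change > tolerance:
        return "growing"
    if change < -tolerance:
        return "shrinking"
    return "stable"


print(classify_lag_trend([150_000, 170_000, 200_000]))  # growing
```

In practice you would feed this from whatever lag metric you already scrape; the point is only that the trend, not the absolute number, drives the next decision.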

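Step 1: Check how lag is distributed across partitions

Pull the lag trend from your metrics system before touching anything, then get the per-partition breakdown from the broker: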

kafka-consumer-groups.sh \
  --bootstrap-server kafka:9092 \
  --group your-consumer-group \
  --describe

The output shows you: consumer group ID, topic, partition, current offset, log end offset, and the lag (log end offset minus current offset). You're looking for which partitions are lagging and whether the lag is distributed evenly across partitions or concentrated in a few.

If lag is concentrated in 1-2 partitions: This usually points to a skewed partition (one partition is receiving significantly more messages than others), a consumer instance that's overloaded or restarting, or a consumer assignment imbalance. Skip to Step 3.

If lag is evenly distributed across all partitions: This indicates a systemic throughput issue — your consumer group as a whole is not keeping up with producer throughput. Go to Step 2.
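The concentrated-versus-uniform call can be made mechanically once you have lag per partition. The sketch below is one way to do it, assuming a hypothetical rule of thumb: if the two worst partitions hold at least 80% of total lag, treat it as concentrated. Both numbers are illustrative and worth tuning to your partition count.

```python
from typing import Dict


def classify_lag_distribution(
    lag_by_partition: Dict[int, int], top_n: int = 2, threshold: float = 0.8
) -> str:
    """Label per-partition lag as 'concentrated' or 'uniform'.

    top_n and threshold are assumptions for illustration, not Kafka settings.
    """
    if len(lag_by_partition) <= top_n:
        # With so few partitions the top-N share is always 100%.
        return "undetermined"
    total = sum(lag_by_partition.values())
    if total == 0:
        return "no lag"
    worst = sorted(lag_by_partition.values(), reverse=True)[:top_n]
    top_share = sum(worst) / total
    return "concentrated" if top_share >= threshold else "uniform"


skewed = {0: 180_000, 1: 5_000, 2: 5_000, 3: 10_000}
even = {0: 50_000, 1: 50_000, 2: 50_000, 3: 50_000}
print(classify_lag_distribution(skewed))  # concentrated
print(classify_lag_distribution(even))    # uniform
```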

Step 2: Check producer throughput vs consumer throughput

Pull producer and consumer message rates from your metrics system (Prometheus / Grafana / Kafka JMX). You're looking for the ratio:

Producer messages/sec over the last 15 minutes
Consumer messages/sec over the last 15 minutes

If the producer rate now exceeds the consumer rate, lag will grow until either the producer rate normalizes or consumer throughput increases. The next question is which of the two rates moved.

If producer rate has not changed: Consumer throughput has dropped. Something changed in the consumer. The most common causes:

A schema change in the topic that's causing the consumer to throw deserialization errors
Slow downstream write operations (database writes, warehouse bulk loads) creating backpressure in the consumer loop
Consumer instances restarting due to out-of-memory or application errors

If producer rate spiked: This is a traffic burst. Common causes: end-of-day batch processing that generates events, a marketing event that drove user activity, or a bug in a producer that emitted duplicate events. Check whether the burst is still ongoing or whether it peaked and will naturally resolve. If it's already past peak, your consumer will catch up on its own — the question is whether it will catch up before your SLA window closes.
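Whether the consumer catches up before the SLA window closes is simple arithmetic: backlog divided by the surplus of consumer rate over producer rate. A minimal sketch, with made-up rates chosen only to show the calculation:

```python
from typing import Optional


def minutes_to_catch_up(
    backlog: int, consumer_rate: float, producer_rate: float
) -> Optional[float]:
    """Estimate minutes until lag reaches zero, given steady rates in messages/sec.

    Returns None when the consumer has no surplus throughput, in which case
    the backlog never drains on its own.
    """
    surplus = consumer_rate - producer_rate
    if surplus <= 0:
        return None
    return backlog / surplus / 60.0


# Hypothetical numbers: 200,000 messages behind, consuming 1,500/s
# while the producer has settled back to 1,000/s.
eta = minutes_to_catch_up(200_000, consumer_rate=1_500, producer_rate=1_000)
print(f"{eta:.1f} minutes")  # 6.7 minutes
```

Compare the estimate against the time left before the breach. If it is longer, waiting is not an option and you need more consumer throughput.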

Step 3: Check consumer group status for rebalancing

Under the eager (stop-the-world) rebalance protocol, a Kafka consumer group rebalance suspends all consumption while partition assignments are redistributed among consumer instances. If a rebalance is in progress, no messages are being consumed; if one recently completed, lag accumulated during that window. This is by far the most common cause of sudden lag spikes that don't correspond to any visible application error.

kafka-consumer-groups.sh \
  --bootstrap-server kafka:9092 \
  --group your-consumer-group \
  --describe --state

Check the STATE field. If the consumer group is in PreparingRebalance or CompletingRebalance, consumption is suspended. If it recently transitioned through a rebalance, lag may have accumulated during that window.

Frequent rebalancing is often caused by:

Consumer heartbeat and poll-interval timeouts: The consumer instance is alive but is not calling poll() within the max.poll.interval.ms window, usually because its processing loop is blocked on a long database write or a slow external API call. In clients since Kafka 0.10.1, heartbeats are sent from a background thread, so a blocked processing loop does not by itself trip session.timeout.ms; that setting catches crashed, frozen, or unreachable processes. The fix is to increase max.poll.interval.ms to match your actual processing time per batch, or reduce the processing time per batch (for example by lowering max.poll.records) so it fits within the existing max.poll.interval.ms threshold. The default is 300 seconds: if your batch writes to Snowflake synchronously and a MERGE takes 90 seconds, you're well inside that window. If your client configuration or framework sets a tighter value, you may be getting evicted at 30 seconds.
Consumer instance restarts: OOM kills, application crashes, or rolling deployments trigger rebalances. Check your consumer pod/process restart counts in your container orchestration metrics.
New consumer instance joining: If you recently scaled up the consumer group, the scale event triggers a rebalance. This is expected and temporary. With cooperative rebalancing (introduced in Kafka 2.4 with the CooperativeStickyAssignor), only the partitions that actually move to another instance are revoked and reassigned; consumers keep processing the partitions they retain. If you're still on the older RangeAssignor or RoundRobinAssignor, a scale event triggers a full stop-the-world rebalance.
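To make the 90-second versus 30-second comparison concrete, the sketch below computes how much of the poll interval a batch leaves unused. The function name is hypothetical; the 300,000 ms default is the value cited above for max.poll.interval.ms.

```python
def poll_interval_headroom(
    batch_seconds: float, max_poll_interval_ms: int = 300_000
) -> float:
    """Fraction of max.poll.interval.ms left over after one batch.

    A negative result means the batch outlasts the interval, so the
    consumer would be removed from the group before it polls again.
    """
    return 1.0 - (batch_seconds * 1000.0) / max_poll_interval_ms


# A 90-second MERGE against the default 300-second window: ample headroom.
print(round(poll_interval_headroom(90), 2))

# The same MERGE against a tightened 30-second window: negative headroom.
print(round(poll_interval_headroom(90, max_poll_interval_ms=30_000), 2))
```

Feeding this the slowest batches you observe (rather than the average) gives a more honest picture, since a single overrun is enough to trigger a rebalance.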

Step 4: Check for deserialization errors

If a producer pushed a schema change that your consumer's Avro/Protobuf deserializer cannot decode, the consumer throws a deserialization error on every new message. Depending on the framework and its configuration, the consumer then stops processing, retries the same offset indefinitely, or routes the failed message to a dead letter queue (DLQ). The first two cause lag to grow directly. DLQ routing keeps offsets advancing, but the data still doesn't land in the warehouse, so the SLA is breached either way.

Check your consumer logs for lines like:

org.apache.kafka.common.errors.SerializationException: Error deserializing key/value for partition orders-0 at offset 1847291

If you see serialization exceptions, the root cause is a schema incompatibility. The producer's schema changed in a way that the consumer's registered schema version cannot decode. This is schema drift at the streaming layer — structurally identical to the column-rename problem in batch pipelines, just manifesting as deserialization failures rather than silent NULLs.

The immediate fix is to update the consumer so it can decode the new schema. The longer-term fix is to enforce schema compatibility at the schema registry level. Confluent Schema Registry and its open-source equivalents let you set a compatibility mode, and the registry then rejects any new schema version that violates it. BACKWARD is the default mode, but check which direction it protects before relying on it.

A nuance worth understanding: even with BACKWARD compatibility enforcement, the guarantee is one-directional. BACKWARD means consumers on the new schema can read messages produced with the old schema. It does not protect consumers on the old schema from new messages; that is what FORWARD covers. Teams with heterogeneous consumers (different consumers on different schema versions) need FULL compatibility, where the new schema must be both backward and forward compatible. FULL is much more restrictive: it allows adding or removing only fields that have default values, and rules out common changes like adding or removing a required field.

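Teams using Confluent Schema Registry should also be aware that the schema ID is encoded in a 5-byte prefix on the message payload (the Confluent wire format: one magic byte followed by a 4-byte schema ID). If a producer writes a message with schema ID 7 and your consumer only has schemas 1–6 in its local cache, deserialization depends on the consumer resolving ID 7 at read time. The stock Confluent deserializers look up unknown IDs in the registry on a cache miss, so a failure here usually means the consumer cannot reach the registry, points at a different registry than the producer, or uses a deserializer with a pinned or preloaded schema set. Note that auto.register.schemas is a serializer (producer-side) setting: with auto.register.schemas=false, a producer fails to serialize a record whose schema is not yet registered, which is a common production gotcha in its own right.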
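When you are staring at a failing message, it helps to read the schema ID straight out of the raw bytes. The following is a minimal sketch of decoding the 5-byte prefix (a zero magic byte, then a 4-byte big-endian schema ID). The function name and the sample payload are invented for illustration.

```python
import struct

MAGIC_BYTE = 0  # first byte of a Confluent wire-format payload


def read_schema_id(payload: bytes) -> int:
    """Return the schema ID from the 5-byte prefix of a raw message value."""
    if len(payload) < 5:
        raise ValueError("payload is shorter than the 5-byte prefix")
    # ">bI" = big-endian, 1 signed byte (magic) + 4-byte unsigned int (schema ID)
    magic, schema_id = struct.unpack(">bI", payload[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}; not Confluent wire format")
    return schema_id


# Hypothetical payload: magic byte, schema ID 7, then the encoded record body.
sample = bytes([0]) + (7).to_bytes(4, "big") + b"encoded-record-body"
print(read_schema_id(sample))  # 7
```

Comparing that ID with what the registry reports for the subject tells you quickly whether the producer moved to a schema version the consumer has never seen.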

Step 5: Check downstream write performance

In a pipeline that reads from Kafka and writes to a warehouse (Snowflake, BigQuery, Redshift), the consumer throughput is bounded by the write throughput of the destination. If your warehouse writes are slow, your consumer loop blocks waiting for the write to complete, the gap between poll() calls exceeds max.poll.interval.ms, and you end up back at the rebalancing scenario in Step 3.

Indicators of a downstream write bottleneck:

Consumer CPU and memory are low (not using available resources)
The consumer is processing batches of 100 records but taking 4+ seconds per batch
Warehouse query metrics show slow bulk load operations
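The second indicator translates directly into a throughput ceiling. Assuming each consumer instance writes one batch at a time and waits for it to finish, a rough upper bound looks like this (the function name is made up for the example):

```python
def max_write_bound_throughput(
    batch_size: int, seconds_per_batch: float, instances: int = 1
) -> float:
    """Upper bound on messages/sec when every batch waits on a synchronous write."""
    return instances * batch_size / seconds_per_batch


# 100 records per batch at 4 seconds per batch, one consumer instance.
print(max_write_bound_throughput(100, 4.0))                # 25.0
# The same batch profile across 12 instances.
print(max_write_bound_throughput(100, 4.0, instances=12))  # 300.0
```

If the producer rate sits above that ceiling, no amount of consumer CPU will close the gap; the write path has to get faster or wider.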

For Snowflake: check the QUERY_HISTORY view for slow COPY INTO or MERGE operations that correspond to the lag window. If a single MERGE is taking 30+ seconds and a poll cycle issues several of them, or your max.poll.interval.ms has been tightened, the consumer will exceed its poll interval waiting on the warehouse. Snowflake's virtual warehouse queuing behavior can also introduce unexpected latency: if a concurrent query is holding a lot of warehouse resources, your streaming consumer's small MERGE may sit in the queue longer than expected.

For BigQuery: streaming inserts have a separate quota from batch load jobs, and the streaming insert API enforces its own throughput limits (check the current quotas for your project and region). If you're approaching that limit, inserts will return quota errors, and your consumer will retry, blocking the processing loop.

Mitigations: reduce batch size so individual writes complete faster; increase parallelism if the destination can handle concurrent writes; switch from synchronous to asynchronous writes with a completion callback. For very high-volume pipelines, consider a staged write pattern — write to object storage first, then use the warehouse's native bulk load (Snowflake COPY INTO, BigQuery load jobs) in a separate process.

Step 6: Evaluate whether to scale the consumer group

If the root cause is sustained producer throughput exceeding consumer throughput (and it's not a burst), the answer is more consumer instances — up to the limit of the number of partitions in the topic. Kafka's parallelism ceiling is the partition count; adding more consumers than partitions leaves some consumers idle.

The relevant constraint: if you have 12 partitions and 4 consumer instances, each consumer handles 3 partitions. If you add 8 more instances, each handles 1 partition. If you add a 13th instance, it gets no partitions assigned and sits idle.
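The same arithmetic as a small sketch, assuming a single topic and an assignor that spreads partitions as evenly as possible (real assignments can differ when a group subscribes to several topics):

```python
from typing import List


def partitions_per_instance(partitions: int, instances: int) -> List[int]:
    """Evenly spread partitions over instances; leftover instances get zero."""
    if instances <= 0:
        raise ValueError("need at least one consumer instance")
    base, extra = divmod(partitions, instances)
    # The first `extra` instances take one additional partition each.
    return [base + 1 if i < extra else base for i in range(instances)]


print(partitions_per_instance(12, 4))   # [3, 3, 3, 3]
print(partitions_per_instance(12, 13))  # twelve 1s followed by a 0: one idle instance
```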

Before scaling: verify that the downstream write system can handle increased concurrent write load. Scaling consumers without scaling warehouse write capacity just moves the bottleneck downstream. This is a common mistake — adding 8 consumer instances solves the Kafka lag immediately, but 8 concurrent MERGE operations against a small Snowflake warehouse may saturate the compute and create a new bottleneck that's harder to observe.

Also consider partition topology. If your topic was originally created with 8 partitions and you now have 16 consumer instances, 8 instances are permanently idle. Adding partitions to an existing Kafka topic is supported, but it triggers a rebalance and changes the key-to-partition mapping for new messages. Existing messages stay where they were written, so records for the same key can end up split across two partitions, which breaks ordering guarantees for anything that relied on partition-level ordering by key. Plan partition count at topic creation with headroom for growth.
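To see why adding partitions disturbs per-key ordering, consider how a hash-based partitioner picks a partition: hash of the key modulo the partition count. The sketch below uses zlib.crc32 purely as a stand-in hash (Kafka's default partitioner uses murmur2, so actual partition numbers will differ); the key names are invented.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Pick a partition as hash(key) mod partition count.

    crc32 is a stand-in for illustration only; it is not Kafka's hash.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"customer-{i}" for i in range(1000)]
moved = sum(partition_for(k, 8) != partition_for(k, 16) for k in keys)
print(f"{moved} of {len(keys)} keys map to a different partition after 8 -> 16")
```

Any key counted in `moved` now has its history on one partition and its new events on another, and Kafka makes no ordering promise across the two.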

Step 7: Assess the backfill window

Once lag is resolved and consumers are healthy, you need to assess how much data went unprocessed during the lag window and what the downstream impact is.

In a Kafka consumer, data that fell behind is still available in the topic (up to the retention period — typically 7 days to 30 days depending on configuration). The consumer will replay it automatically by continuing to consume from its last committed offset. This is one of Kafka's strongest properties: unlike a queue that drops unread messages, the topic retains messages for replay.

The important question is whether your downstream pipeline is idempotent — whether replaying the lagged messages will produce correct results without duplication. If your warehouse destination uses INSERT without deduplication, replaying messages will create duplicate rows. If it uses MERGE or UPSERT with a stable event ID as the key, replay is safe.
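The difference between the two write styles shows up as soon as the same batch is delivered twice. This sketch models the destination table in memory, with a list standing in for plain INSERT and a dict keyed on a hypothetical event_id standing in for MERGE; it is an illustration of the idea, not warehouse code.

```python
from typing import Dict, List

events = [
    {"event_id": "e1", "amount": 10},
    {"event_id": "e2", "amount": 25},
]


def insert_all(table: List[dict], batch: List[dict]) -> None:
    """Append-only write: every delivery adds rows, including replays."""
    table.extend(batch)


def upsert_all(table: Dict[str, dict], batch: List[dict]) -> None:
    """Keyed write: a replayed event overwrites its earlier copy."""
    for event in batch:
        table[event["event_id"]] = event


append_table: List[dict] = []
keyed_table: Dict[str, dict] = {}

for _ in range(2):  # first delivery, then a replay of the same batch
    insert_all(append_table, events)
    upsert_all(keyed_table, events)

print(len(append_table), len(keyed_table))  # 4 2
```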

Schema drift events that occurred during the lag window require special handling. If the consumer was paused due to a deserialization error on schema version N+1, replaying messages from during that window means replaying messages in schema N+1 format. Ensure both the consumer's schema-handling logic and the destination table accept the post-change schema before re-enabling consumption. Replaying messages in the new schema format against a destination table that still expects the old schema will cause the same deserialization (or NULL-fill) failure you were trying to recover from.

What monitoring you need to catch this in under 2 minutes

The teams that catch lag incidents in under 2 minutes are not smarter than teams that take 30 minutes — they have different dashboards. Specifically:

Consumer group state transitions: Alert on any transition to PreparingRebalance or Empty state. This is the single most predictive leading indicator of an impending lag spike.
Consumer error rate by partition: A consumer that's throwing serialization errors at a steady rate is silently building lag. This metric shows up before lag does.
Consumer processing time per batch: Track the 95th and 99th percentile of processing time. A P99 approaching your max.poll.interval.ms threshold is a pre-failure signal; a P99 above it means some batches are already getting the consumer evicted.
Downstream write latency: Track warehouse write duration. A P95 above 10 seconds for MERGE operations is a problem even if it hasn't triggered a consumer timeout yet.

Lag itself is a lagging indicator. The causal signals are consumer group state, error rate, processing time, and downstream write latency. Building dashboards around those four metrics reduces mean time to diagnosis from "check logs manually" to "glance at the dashboard and see the answer highlighted."
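As one example of turning those signals into an alert, here is a minimal sketch of the state-transition check. It assumes you already sample the group state periodically (for instance from the --state output) into (timestamp, state) pairs; the sampling mechanism, the function name, and the timestamps are hypothetical.

```python
from typing import List, Optional, Tuple

ALERT_STATES = {"PreparingRebalance", "Empty"}


def state_transition_alerts(
    observations: List[Tuple[str, str]]
) -> List[Tuple[str, Optional[str], str]]:
    """Return (timestamp, previous_state, new_state) for each entry into an alert state."""
    alerts = []
    previous: Optional[str] = None
    for timestamp, state in observations:
        # Fire only on the transition, not on every sample spent in the state.
        if state in ALERT_STATES and state != previous:
            alerts.append((timestamp, previous, state))
        previous = state
    return alerts


samples = [
    ("12:00:00", "Stable"),
    ("12:00:15", "Stable"),
    ("12:00:30", "PreparingRebalance"),
    ("12:00:45", "CompletingRebalance"),
    ("12:01:00", "Stable"),
]
print(state_transition_alerts(samples))
# [('12:00:30', 'Stable', 'PreparingRebalance')]
```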

Diagnostic summary

Is lag growing, stable, or shrinking? → Determine urgency; growing = active problem
Is lag concentrated in specific partitions or uniform? → Skew/instance issue vs systemic throughput issue
Did producer throughput spike? → Traffic burst vs consumer degradation
Is the consumer group rebalancing? → Heartbeat timeout / instance restart / scale event
Are there deserialization errors? → Schema change at producer; schema registry compatibility or cache issue
Is downstream write latency high? → Destination bottleneck; reduce batch size or increase parallelism
Is sustained throughput the issue? → Scale consumer group up to partition count; verify destination can absorb load
After recovery: is replay idempotent? → Verify MERGE key before allowing consumer catch-up to complete

Most lag incidents resolve at steps 3 or 4. Rebalancing and schema incompatibility are by far the most frequent causes of unexpected consumer lag spikes in production pipelines. Both are detectable in real time with the right monitoring — and both are preventable with correct configuration before they occur.