Data Pipeline Monitoring: Detecting Failures Before Your Users Do

How to build monitoring and alerting for data pipelines — the failure modes that matter (volume drops, schema changes, freshness delays, quality degradation), the tools and patterns for detecting each, and the on-call escalation structure that prevents data quality issues from reaching dashboards undetected.

Data pipelines fail silently. A broken extract refresh does not send a push notification to the CTO. A schema change in a source system does not interrupt anyone's morning. A subtle data quality degradation — fewer records than yesterday, a null rate that's crept up 2% — will not appear as an incident in PagerDuty. But the consequences — incorrect dashboards, missed alerts, decisions made on stale data — are very visible to the people who depend on the analytics.

Monitoring data pipelines is not optional for production data environments. This guide covers the failure modes that matter, the tools and patterns for detecting them, and the escalation structure that prevents data quality issues from reaching users.

The Four Failure Categories

**Freshness failures:** The pipeline ran, but the data is older than expected. An extract that refreshes at 6 AM produces data as of yesterday's close. If the extract fails silently and the dashboard shows two-day-old data, users making decisions based on "today's numbers" are working with incorrect information.

**Volume failures:** The pipeline ran, but the row count is materially different from expected. A source table dropped from 1 million rows to 500,000 between runs, or a filter condition changed upstream and silently eliminated half the records. Volume drops look like data quality problems to users — they see lower revenue, fewer customers, or incomplete order counts.

**Schema failures:** The source system changed a column name, added a required column, or changed a data type. The pipeline either fails immediately (if the new schema is incompatible with the transformation logic) or continues silently with incorrect data (if the change is backward-compatible but breaks business logic).

**Quality failures:** The data loaded successfully, the row count looks normal, but metric values are wrong — unexpected nulls, out-of-range values, broken referential integrity, or metric definitions that have drifted from the certified definition. These are the hardest failures to detect because they look like data, not errors.

Freshness Monitoring

Freshness monitoring answers: "Was this table updated when it should have been?"

**The last_updated timestamp pattern.** Every important table should have a row that tracks when it was last successfully loaded. Query that timestamp and alert if it is older than the expected refresh interval plus a tolerance buffer.

In a Snowflake or BigQuery environment, INFORMATION_SCHEMA.TABLE_STORAGE_HISTORY and INFORMATION_SCHEMA.TABLES provide table modification timestamps. A monitoring query:

SELECT

table_name,

DATEDIFF('hour', last_altered, CURRENT_TIMESTAMP()) as hours_since_update

FROM information_schema.tables

WHERE table_schema = 'analytics'

AND hours_since_update > 25 -- alert if not updated in 25 hours for a daily refresh

**Orchestrator-level monitoring.** dbt Cloud, Airflow, Dagster, and Prefect all provide job run history and status. Configure alerts on failed runs and on runs that exceed their expected duration (a sign of performance degradation even if they eventually succeed). Most orchestrators support webhook notifications to Slack or PagerDuty on failure.

**dbt source freshness tests.** dbt's source freshness feature validates that source tables have been updated within a defined window. Define freshness expectations in sources.yml:

sources:

- name: crm

freshness:

warn_after: {count: 24, period: hour}

error_after: {count: 36, period: hour}

loaded_at_field: _extracted_at

tables:

- name: customers

Running dbt source freshness checks the loaded_at_field against the defined thresholds and reports warn/error status. Integrate this check into the monitoring run on the same schedule as pipeline executions.

Volume Monitoring

Volume monitoring answers: "Does the table have roughly as many rows as expected?"

The simplest approach: compare today's row count to the rolling average of the last 7 or 14 days. Alert if the current count deviates by more than X% from the rolling average.

A reusable pattern as a dbt model or monitoring query:

WITH daily_counts AS (

SELECT

DATE_TRUNC('day', loaded_at) as load_date,

COUNT(*) as row_count

FROM fct_orders

WHERE loaded_at >= CURRENT_DATE - 30

GROUP BY 1

stats AS (

SELECT

AVG(row_count) as avg_count,

STDDEV(row_count) as stddev_count

FROM daily_counts

WHERE load_date < CURRENT_DATE

)

SELECT

today.row_count,

stats.avg_count,

ABS(today.row_count - stats.avg_count) / NULLIF(stats.avg_count, 0) as pct_deviation

FROM daily_counts today, stats

WHERE today.load_date = CURRENT_DATE

AND pct_deviation > 0.20 -- alert if more than 20% deviation

Set thresholds based on historical volatility of each table. A table that varies +-5% daily needs a tighter threshold than one that varies +-40% due to weekend vs weekday patterns. Segment volume checks by meaningful dimensions (by region, by source system) to catch partial failures that aggregate volume checks miss.

Schema Monitoring

Schema changes in source systems are a common cause of silent pipeline failures. Monitoring approaches:

**Schema snapshot comparison.** At each pipeline run, capture the source table schema (column names, types, and nullability). Compare to the previous snapshot. Alert on any additions, removals, type changes, or nullability changes.

In Snowflake, INFORMATION_SCHEMA.COLUMNS provides column-level schema information. Store a snapshot after each successful run and diff against it on the next run.

**Contract testing.** Great Expectations and dbt-expectations both support schema contract tests — asserting that specific columns exist with specific types:

- dbt_expectations.expect_table_columns_to_match_ordered_list:

column_list: ["order_id", "customer_id", "order_date", "status", "total_amount"]

A test failure means the source schema has changed. This surfaces schema changes as test failures during the dbt run, before data reaches production tables.

**Source system change communication.** For source systems controlled by your organisation (operational applications, third-party SaaS tools), establish a process for source system owners to notify the data team before schema changes. Not all schema changes are detectable programmatically before they cause failures — a change that is backward-compatible for the operational system may be breaking for the data pipeline.

Quality Monitoring

Quality monitoring answers: "Is the data within the ranges and relationships we expect?"

**Null rate tracking.** For columns that should have low null rates, track the percentage of null values over time. Alert when the null rate increases significantly. A column with a 2% null rate suddenly showing 15% nulls indicates a source system change or pipeline breakage.

**Range checks.** Revenue should not be negative for most businesses. Order quantities should be positive integers. Ages should be between 0 and 120. Metric values outside defined ranges indicate data quality failures. Implement range checks as dbt tests with dbt-expectations:

- dbt_expectations.expect_column_values_to_be_between:

min_value: 0

max_value: 1000000

**Metric anomaly detection.** Track key business metrics (daily revenue, order count, customer count, conversion rate) and alert when values deviate significantly from their historical baseline. Statistical approaches range from simple (3-sigma alert if the value exceeds three standard deviations from the mean) to more sophisticated (seasonal decomposition to account for day-of-week patterns).

**Tools for automated anomaly detection:** Monte Carlo, Bigeye, Soda, and Anomalo provide automated data quality monitoring with anomaly detection, schema change detection, and lineage tracking. These replace the custom SQL monitoring patterns above with a managed service. They are appropriate for data platforms where the volume of tables to monitor exceeds what custom SQL monitoring can cover efficiently.

dbt's built-in tests (not_null, unique, relationships, accepted_values) plus dbt-expectations cover most quality monitoring needs for teams already using dbt. Run dbt tests as part of every pipeline execution — not just on initial deployment.

Alerting and Escalation

The goal of monitoring is actionable alerting — notifications that reach the right person, with enough context to act on, at the right time.

Alert routing by severity:

- Critical: data completely unavailable or systemically incorrect (Tableau Server down, fact table empty, primary key violations) → PagerDuty or SMS to on-call engineer, immediate response required

- High: data delayed or a major table with quality failures → Slack channel + email to data team lead, same-day resolution

- Medium: volume anomaly or freshness warning → Slack channel, resolution within business hours

- Low: edge-case test failures, minor anomalies → weekly digest, triage in next sprint

**Include context in alerts.** An alert that says "fct_orders test failed" is less actionable than one that says "fct_orders: row count 420,000 vs 7-day average of 850,000 (51% drop) — last successful load 2024-03-15 18:30 UTC." The more context in the alert, the faster the diagnosis.

**Avoid alert fatigue.** A monitoring system that generates 20 alerts per day will be ignored. Tune thresholds regularly. If a check is consistently triggering false positives, either raise the threshold or investigate why the underlying metric is volatile. Alert noise is as damaging as no alerting.

**SLA tracking.** Define SLAs for data freshness by table and publish them to data consumers. If the SLA for fct_orders is "data current as of 6 AM daily", document it, monitor it, and report on SLA achievement over time. This sets expectations for business users and creates accountability for pipeline reliability.

Our data engineering consulting practice designs monitoring frameworks for production data pipelines — contact us to discuss your data reliability requirements.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →