Data Quality Monitoring: Automated Checks, Alerting, and Remediation

Bad data in a production analytics environment is worse than no data — it produces confident wrong answers that spread through the organisation before anyone notices. Automated data quality monitoring catches freshness failures, volume anomalies, distribution shifts, and referential integrity violations before they reach dashboards.

Data quality failures are the most expensive invisible problem in analytics. A dashboard that displays wrong numbers looks the same as a dashboard that displays correct numbers. Users act on both with the same confidence. The difference is discovered weeks or months later when someone notices that the reported revenue does not match the finance system, or when a product decision built on bad data produces a bad outcome.

Most data quality failures are not dramatic. They are not "the database is down" failures that are immediately visible. They are subtle: an extract that ran with no errors but captured yesterday's data twice instead of today's. A join condition that started producing duplicates when a source system changed its key structure. A currency conversion field that returns NULL for transactions in a currency added six months ago.

Automated data quality monitoring catches these before they reach the dashboard.

The Four Categories of Data Quality Checks

**Freshness checks** verify that data has been updated within expected time windows. A daily extract should have a MAX(load_timestamp) that is within the last 25 hours (allowing for scheduling variance). A near-real-time streaming pipeline should have records with event timestamps within the last 15 minutes. Freshness violations are often the first signal that a pipeline has silently failed — the data arrived, it was just old data re-loaded.

**Volume checks** verify that the row count or aggregate metric is within expected bounds. A transaction table that yesterday had 45,000 rows and today has 300 rows is not necessarily wrong — but it warrants investigation. Volume checks use rolling baselines (average of the last 7 or 30 days) and flag deviations beyond a configurable threshold (e.g., more than 20% below the rolling average). Volume checks catch truncation errors, filter bugs, and upstream pipeline failures that still produce some output.

**Distribution checks** verify that the statistical distribution of a column is consistent with historical patterns. A numeric column's mean, standard deviation, and null rate should not change dramatically overnight without a known cause. A categorical column's value distribution should be stable — if 'PENDING' orders historically represent 12% of order statuses and suddenly jump to 45%, either something significant happened in the business or there is a pipeline problem.

**Referential integrity checks** verify that foreign key relationships hold. Orders should reference customer IDs that exist in the customer table. Products referenced in line items should exist in the product table. Broken referential integrity often indicates a source system problem — a change in how source system IDs are generated, a data deletion that the pipeline did not propagate, or a pipeline that loaded child records before parent records.

Implementing Checks With dbt Tests

dbt's test framework is the most common implementation approach for teams using dbt for transformation. Built-in generic tests cover the most common checks:

models/marts/fct_orders.yml:

models:

- name: fct_orders

columns:

- name: order_id

tests:

- unique

- not_null

- name: customer_id

tests:

- not_null

- relationships:

to: ref('dim_customers')

field: customer_id

- name: status

tests:

- accepted_values:

values: ['PENDING', 'PROCESSING', 'SHIPPED', 'DELIVERED', 'CANCELLED']

These tests run as assertions after each dbt model is built. A failing test means the model produced results that violate the contract — stop and investigate.

For freshness checks, dbt source freshness tests verify that raw source tables have been updated recently:

sources:

- name: production_db

freshness:

warn_after: {count: 12, period: hour}

error_after: {count: 24, period: hour}

tables:

- name: orders

loaded_at_field: updated_at

Run 'dbt source freshness' before 'dbt run' in your pipeline. If sources are stale, fail the pipeline before building stale models.

For volume and distribution checks, dbt does not have built-in generic tests — use custom singular tests (SQL files in the tests/ directory that return rows when the assertion fails) or the dbt-expectations or dbt-utils packages, which provide additional generic test types including row count bounds, statistical distribution checks, and regex pattern matching.

The Great Expectations Framework

Great Expectations (GX) is a Python library for defining and running data quality expectations against any data source — Pandas DataFrames, SQL databases, Spark DataFrames. It is more flexible than dbt tests and appropriate for teams not using dbt or for quality checks outside the transformation layer.

The core concept is an Expectation Suite — a collection of expectations applied to a dataset. Examples:

expect_column_values_to_not_be_null(column='order_id')

expect_column_values_to_be_between(column='revenue', min_value=0, max_value=1000000)

expect_table_row_count_to_be_between(min_value=10000, max_value=200000)

expect_column_mean_to_be_between(column='order_value', min_value=85, max_value=115)

GX generates Data Docs — HTML reports showing which expectations passed and which failed, with sample failing records. These are useful for audits and for communicating data quality issues to stakeholders.

GX integrates with Airflow, Prefect, and other orchestration systems — run expectation suites as checkpoint steps in your pipeline DAG and fail the pipeline if critical expectations are violated.

Alerting and Escalation

Automated checks without alerting are valueless. When a check fails, someone must be notified immediately — not at the next morning's standup.

The alerting design depends on severity. Define check severities explicitly:

**Critical**: freshness violation on a data source that feeds executive dashboards or regulatory reporting. Alert immediately to the on-call data engineer and the relevant stakeholder. Block downstream processes from running.

**High**: volume anomaly on a core fact table beyond a 20% threshold. Alert the data engineering team within 15 minutes. Investigate before end of business.

**Medium**: distribution shift or referential integrity violation on non-critical tables. Alert in a shared Slack channel. Investigate within 24 hours.

**Low**: documentation-only quality issues — unexpected values in low-traffic columns, minor distribution shifts within tolerable bounds. Log for review. Weekly summary report.

Implement critical and high alerts via PagerDuty or equivalent for out-of-hours on-call coverage. Medium and low alerts go to Slack or email.

Quarantine Patterns

When a data quality check fails, the affected data should not silently flow into the production analytics environment. Two approaches:

**Pipeline halt**: fail the pipeline job when a quality check fails. Downstream models do not run. The production tables retain their last-known-good state. Alert fires. Analyst investigates before clearing the pipeline to rerun. This is the right approach for critical checks.

**Quarantine table**: route records that fail quality checks to a quarantine table rather than the production table. The production table gets only clean records. Quarantine records are reviewed and either corrected and reprocessed or discarded. This is appropriate for checks where partial failure is acceptable — some bad records in a large dataset, handled separately, while the good records continue to flow.

Building a Quality Dashboard

A quality dashboard gives the data team and stakeholders visibility into the health of the data environment over time. Key elements:

Check pass rate by table over the last 30 days — a heatmap-style view where colour indicates pass/fail. Persistent failures on specific tables that keep getting cleared indicate an upstream problem that needs a proper fix rather than recurring manual remediation.

Mean time to detection (MTTD) — how long between when a quality issue occurs and when the check catches it. If freshness checks run every hour but a pipeline fails silently for 4 hours before the check runs, MTTD is 4 hours. Reducing check frequency improves MTTD.

Mean time to resolution (MTTR) — from when the alert fires to when the data is correct in production. High MTTR indicates either inadequate runbook documentation (engineers spend time diagnosing rather than executing known remediation steps) or insufficient access to fix the root cause quickly.

Our data engineering practice designs and implements data quality monitoring systems — contact us to discuss automated quality monitoring for your data environment.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →