BlogData Engineering

Building a Data Quality Framework: Rules, Testing, and Monitoring at Scale

James Okafor
James Okafor
Data & Cloud Engineer
·June 23, 202611 min read

Bad data costs more than the effort to prevent it. Here is the end-to-end framework for defining, testing, and monitoring data quality across a production data platform.

The quick answer

A data quality framework is the combination of rules, tests, monitoring, and ownership structures that ensures data in your platform is fit for its intended use. Without a framework, data quality is reactive — analysts discover problems when dashboards break or numbers do not reconcile, and the root cause is weeks or months of bad data that nobody noticed. With a framework, quality is enforced at ingestion and transformation, monitored continuously, and owned by defined stakeholders. This guide covers the full implementation from rule definition through production monitoring.

The five dimensions of data quality

Most data quality frameworks measure quality across five dimensions:

**Completeness**: do all required fields have values? A customer record with no email address is incomplete. An order record with no product line items is incomplete. Completeness tests catch missing data before it propagates to analytics.

**Accuracy**: does the data reflect the real-world state it is supposed to represent? Accuracy is hardest to test automatically (it requires ground truth) but can be approximated through range checks, statistical baseline comparisons, and reconciliation against source system totals.

**Consistency**: is data consistent across systems and over time? If the customer count in the CRM is 12,450 but the billing system shows 11,982 active subscribers, you have a consistency problem. Consistency checks reconcile values across systems or detect unexpected deviations from expected distributions.

**Timeliness**: is data arriving on schedule? A sales dashboard built on yesterday's data while showing today's date is a timeliness failure. Source freshness checks (measuring when data was last loaded) catch pipeline delays before users encounter them.

**Uniqueness**: are records that should be unique actually unique? Duplicate customers in the CRM, duplicate order IDs in the transaction table, duplicate date rows in a daily aggregate — uniqueness failures cause double-counting and are among the most common analytical data errors.

Rule definition

Rules are the specific tests you run against your data. Good rules are:

**Unambiguous**: the rule defines exactly what constitutes a pass or fail. "customer_id must not be null" is unambiguous. "revenue should look reasonable" is not.

**Traceable**: each rule is linked to a dimension (completeness, uniqueness, etc.) and to the business requirement that motivates it.

**Prioritised**: not every failed rule is a pipeline-stopping error. Categorise rules by severity — critical (stop the pipeline, alert immediately), high (alert, do not stop), warning (log, review in next cycle).

Common rule types:

**Schema rules**: column not-null, data type validation, allowed value sets (status IN the set of active, inactive, pending), string pattern matching (email format, phone format).

**Statistical rules**: value ranges (revenue between 0 and 10,000,000), distribution checks (null rate below 5%), row count expectations (between 10,000 and 50,000 rows per daily load).

**Referential integrity**: foreign key checks — every order.customer_id must exist in customers.customer_id. These are the most commonly violated rules in analytics pipelines and often indicate upstream data bugs.

**Cross-system reconciliation**: total revenue in the warehouse matches total revenue in the source billing system (within a 0.1% tolerance). Requires connecting to source system APIs or exports.

**Freshness**: the most recent record timestamp is within the expected window — updated within the last 25 hours for a daily pipeline.

Implementation tooling

**dbt tests**: the lowest-friction starting point for teams already using dbt. Generic tests (not_null, unique, accepted_values, relationships) cover the most common rules with a few lines of YAML. Singular tests (SQL queries that return zero rows on pass) cover custom business logic. dbt-expectations extends the test library with statistical tests.

**Great Expectations**: a dedicated data quality framework with a Python API. Supports defining expectations across data sources (Pandas DataFrames, SQL warehouses, Spark), running validation suites, and generating HTML data docs showing validation results. More powerful than dbt tests for complex statistical checks; more setup overhead.

**Soda Core**: an open-source data quality tool (scan + monitor pattern). Define quality checks in YAML, run scans against SQL warehouses or Spark, ship results to Soda Cloud for monitoring and alerting. Strong Snowflake and BigQuery integration.

**Monte Carlo / Acceldata / Bigeye**: ML-powered data observability tools that learn normal data patterns and alert on anomalies without requiring explicit rule definition. Suitable for organisations with complex pipelines and too much data to manually define rules for every table. Higher cost; better coverage at scale.

Testing architecture

Quality tests should run at two points:

**At transformation**: dbt tests run at the end of each dbt model build. If tests fail, the pipeline fails (for critical tests) or warns (for non-critical). This prevents bad data from reaching BI-facing mart models.

**At ingestion**: source freshness checks and row-count checks run when new data lands from source systems, before transformation begins. This catches upstream data delivery failures before they propagate.

For both: tag tests with severity (warn vs error) and configure appropriate blocking behaviour. A warning should generate a Slack or email notification; an error should halt the downstream pipeline.

Monitoring in production

Testing catches known failure modes. Monitoring catches unknown anomalies — the gradual drift in data distributions, the sudden spike in null rate, the unexpected table that stopped updating.

**Row count monitoring**: track row counts per table per load cycle. Alert when counts fall outside expected ranges or deviate significantly from recent history.

**Null rate monitoring**: track null rates for key columns over time. A column that was 0.5% null and is now 8% null indicates an upstream change or bug.

**Distribution monitoring**: for numerical columns, track mean, median, and standard deviation. Unexpected distribution shifts indicate data changes worth investigating.

**Cross-table consistency**: track aggregate metrics that should be stable (total active customers, total monthly revenue) and alert when they deviate unexpectedly from prior periods.

Most mature teams implement this monitoring in one of two ways: custom SQL checks scheduled via Airflow, writing results to a monitoring table; or a dedicated observability tool (Monte Carlo, Bigeye) that automates the baselining and alerting.

Ownership model

Data quality without ownership is a framework that does not work in practice. Every data asset needs a defined owner who:

- Is notified when quality checks on that asset fail

- Is responsible for investigating and resolving failures

- Reviews and updates rules as the data evolves

- Documents known quality issues and their business impact

For data products built by domain teams (in a data mesh model), ownership is clear — the domain team owns the product and its quality. For a centralised data platform, assign ownership at the business domain level — the Finance analytics lead owns finance data quality, not the central data engineering team. The data engineering team builds and operates the framework; domain owners are accountable for the quality of their data.

Incident response

When a quality check fails in production, the response process should be documented before the first failure, not improvised during it:

1. Alert to the appropriate owner (Slack, PagerDuty, email — based on severity)

2. Initial triage: is this a data delivery failure (pipeline issue) or a source system issue (upstream bug)?

3. Impact assessment: which dashboards, reports, or downstream processes are affected?

4. Resolution path: pause dependent pipelines, fix or backfill the data, rerun tests

5. Post-resolution: document root cause, update rules if the failure revealed a gap, update monitoring thresholds if the alert was a false positive

Tracking quality incidents over time (failed tests, incident counts, resolution times) provides the metrics to justify ongoing investment in data quality infrastructure.

For the transformation layer where most tests run, see dbt best practices. For the observability and monitoring that complements data quality testing, see data observability. For the governance framework that data quality feeds into, see data governance framework.

Our data architecture consulting practice designs and implements data quality frameworks for production data platforms — from rule definition through monitoring architecture and ownership models. If your team is dealing with recurring data quality issues or building a quality framework from scratch, book a free 30-minute audit.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →