BlogData Engineering

What Is a Data Quality Test? Validating Your Data Pipeline with dbt

James Okafor
James Okafor
Senior Data Engineer
·August 7, 202810 min read

Data quality tests are automated checks that run in the data pipeline to validate that data meets defined expectations — uniqueness, referential integrity, non-nullness, and business-rule compliance. This guide explains the types of data quality tests, how dbt implements them, and how to build a testing strategy.

A data quality test is an automated check that runs during or after a data pipeline to validate that data meets defined expectations. Tests catch problems introduced in ingestion, transformation, or source systems before they propagate to dashboards and analytics that stakeholders depend on. Without tests, data quality failures are discovered by end users — the wrong number in a board-level report, a spike that turns out to be a pipeline bug rather than a business event.

Tests shift data quality failure discovery from downstream to upstream — from the stakeholder reading the dashboard to the pipeline engineer who can fix it before anyone else notices.

The Four Generic dbt Test Types

dbt provides four built-in generic tests that can be applied to any model column with a single line of configuration:

**unique:** Asserts that every value in the column is distinct. Appropriate for columns that should be unique — primary keys (customer_key, order_id), natural business keys, and any column where duplicate values indicate a data quality problem (duplicate rows from a broken deduplication step).

**not_null:** Asserts that no values in the column are NULL. Required fields that should never be empty — order date on an orders table, customer_id on a customer dimension. A not_null test failure on a non-nullable column indicates a broken ingestion step or source system bug.

**accepted_values:** Asserts that all values in the column belong to a defined list. For status columns (order_status should only ever be 'pending', 'processing', 'shipped', 'delivered', 'cancelled'), payment type columns, or any categorical field with a bounded valid set. Failures indicate new categories appearing in source data without being handled in transformation logic.

**relationships:** Asserts referential integrity — every value in the tested column has a matching row in a specified reference column of another model. Tests that every customer_key in the orders fact table exists in the customer dimension. Referential integrity failures indicate broken joins or incomplete dimension population.

Schema Tests vs Data Tests

**Schema tests** (the four generic tests above) validate structural properties of the data — uniqueness, nullability, referential integrity. They are fast to run and validate fundamental assumptions about the model.

**Data tests** (custom singular tests in dbt) validate business rules that require SQL logic beyond structural checks. Examples:

- Revenue should never be negative

- Order ship date should never be before order date

- Total payments applied to an invoice should not exceed invoice amount

- Every order should have at least one order line item

These tests are implemented as SQL SELECT queries that return failing rows. dbt treats any rows returned as test failures. This model is powerful: any business rule expressible as SQL can be a test.

Testing Strategy

Not every column needs every test. A practical testing strategy applies tests proportional to the importance of the model and the risk of the column:

**Tier 1 — Production models consumed by executives:** Full test coverage. Unique and not_null on all primary keys. Referential integrity on all dimension foreign keys. Custom tests on all critical business rules (non-negative revenue, ship-before-order-date validation).

**Tier 2 — Operational models used by data team:** Key structural tests. Primary keys unique and not_null. Referential integrity on join columns. Critical accepted_values tests.

**Tier 3 — Staging models (source layer):** Minimal testing. Primary key uniqueness to detect source deduplication issues. Not_null on required fields to catch ingestion failures.

Testing every column of every model is impractical and produces noise. Tests on columns unlikely to fail (a month_name column that has been stable for years) consume pipeline time without providing value. Apply tests where failures are most likely and most consequential.

dbt Test Configuration

Tests are defined in schema YAML files alongside model definitions. A test block specifies the model, the column, and the test type:

For the unique and not_null generic tests: one YAML entry per column per test. For accepted_values: include the values list. For relationships: specify the model and column to check against. For custom singular tests: a .sql file in the tests/ directory.

Tests run automatically as part of dbt test command, which runs after dbt run in production jobs. Failed tests can be configured to block downstream pipeline steps — if the customer dimension fails its referential integrity test, the fact table build that joins against it should not run.

Alerting on Test Failures

Tests that run without alerting on failure provide limited value — a failure that goes unnoticed until manual review defeats the purpose of automated testing. Production dbt jobs should emit alerts on test failures to Slack or email. dbt Cloud provides built-in alerting on job failures; self-hosted Airflow or Dagster orchestration should capture dbt test failure exits and trigger configured alert channels.

Our data engineering services practice implements dbt testing strategies — from initial generic test setup through custom business-rule test development and alerting integration. Contact us to discuss your data quality requirements.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →