How to build a comprehensive data quality testing framework with dbt — schema tests, custom singular tests, statistical anomaly detection, cross-environment validation, and the governance model that makes testing sustainable at scale.
Data warehouse testing is the discipline that determines whether your analytical data can actually be trusted. Without it, dashboards lie, reports mislead, and business decisions rest on corrupted or incorrectly transformed data. The problem is that data warehouse testing looks very different from application testing — there are no unit tests for SQL transformations, no integration test frameworks that understand slowly changing dimensions, and most engineering teams import their testing habits from software development without examining whether they apply.
This guide covers the testing strategies that work for analytical data: what to test, how to test it, the tools available, and the governance framework that makes testing sustainable rather than a one-time effort.
Why Data Warehouse Testing Is Different
Application testing validates that code behaves as specified. Data warehouse testing faces an additional challenge: validating that data from source systems was correctly understood, correctly transformed, and correctly modelled. The source of truth for application testing is a specification. The source of truth for data warehouse testing is often ambiguous business logic, tacit knowledge held by a data analyst, or documentation that has not been updated in three years.
Data warehouse testing also operates at different scales. A unit test runs in milliseconds; a test that checks referential integrity across a 10-billion-row fact table takes time. Test design must account for cost and duration, not just correctness.
And unlike application code, bad data does not cause an exception — it silently produces wrong answers. A null foreign key in an application might crash the system. A null foreign key in a data warehouse produces incorrect counts, incorrect joins, and business reports that are wrong without any indication that something failed.
Schema and Structural Tests
The most basic tests validate that the data structure is what you expect:
**Not-null tests** ensure that columns that should always have values do. A fact table's foreign keys should never be null. A date column in a transaction record should never be null. These tests catch upstream data quality failures that would otherwise manifest as silent undercounting.
**Unique tests** verify that columns expected to be unique identifiers actually are. The primary key of a dimension table should be unique. Discovering that your customer dimension has duplicate customer IDs means your sales aggregations are wrong every time a duplicated customer made a purchase.
**Accepted values tests** verify that categorical columns contain only expected values. An order_status column should contain known statuses, not free-text garbage from a poorly validated source system.
**Relationship tests** check referential integrity — that every foreign key in a fact table exists in the referenced dimension. Orphaned fact rows (fact table rows whose dimension keys do not exist in the dimension) produce incorrect joins and incorrect aggregations.
In dbt, these four test types are built in and require only a few lines of YAML configuration per column. The test suite runs with dbt test and fails the build if any test fails, preventing bad data from being promoted to production. See our dbt project structure guide for how to organise tests in dbt projects.
Data Quality Tests
Beyond structural integrity, data quality tests validate that the data values are plausible and consistent:
**Row count tests** check that tables contain the expected number of rows. A daily snapshot table should have approximately the same number of rows today as yesterday. A sudden drop or spike indicates a pipeline failure or an upstream data problem.
**Distribution tests** check that aggregate metrics are within expected ranges. Revenue this week should be within N standard deviations of the trailing 12-week average. Average order value should be between $10 and $10,000. Transactions per hour should not drop to zero during business hours.
**Freshness tests** verify that data has been refreshed recently enough to be trusted. A dashboard showing "yesterday's" data should fail its freshness test if the data is more than 26 hours old. Stale data is a data quality problem even if the data itself is correct.
**Completeness tests** check that expected records exist. Every active customer should have at least one row in the activity table. Every product in the product catalogue should appear in the sales fact at least once in the last year (if the product has been available). Missing records often indicate upstream system changes that broke incremental extraction.
**Cross-table consistency tests** check that aggregates are consistent across related tables. The total revenue in the order line item fact should match the total revenue in the order-level fact. The customer count in the customer dimension should be consistent with the distinct customer count in the transaction fact.
Transformation Logic Tests
Testing the correctness of transformation logic is the hardest part of data warehouse testing. The challenge is that the expected output is often defined by business rules that are not precisely specified.
**Side-by-side comparison tests** compare the output of a transformation against a known-good reference. If you have a legacy SQL query that has been producing the correct answer for years, you can validate a refactored version by running both and comparing the output. The dbt package audit_helper provides exactly this — it compares two queries and reports on rows that exist in one but not the other.
**Boundary condition tests** check that transformations handle edge cases correctly. What happens when the order quantity is zero? When the discount exceeds the price? When a customer has null in the acquisition channel field? Edge cases in business rules are where transformation logic most commonly fails silently.
**Historical reconstruction tests** verify that slowly changing dimension logic is correct. If you apply your SCD Type 2 logic to historical data, does it produce the correct dimension history? Does a query joining the fact table to the dimension at a historical date produce results that match what was true at that date?
Statistical Anomaly Detection
Row count and distribution tests with hard-coded thresholds work until the data changes. A business growing 30% month-over-month will fail a row count test that was calibrated six months ago. Statistical anomaly detection — using the historical distribution of a metric to detect anomalies dynamically — is more robust for production environments.
The approach: for each key metric, maintain a rolling window of historical values. Compute the mean and standard deviation. Flag values that fall outside a defined number of standard deviations as anomalies. The threshold adjusts automatically as the baseline shifts.
The dbt_expectations package provides statistical tests including those based on standard deviation ranges. For more sophisticated anomaly detection — seasonality adjustment, trend correction, holiday effects — you typically need custom Python or an external anomaly detection service.
Cross-Environment Validation
When promoting changes to production — schema changes, refactored transformations, new data sources — cross-environment validation compares the output between environments:
**Development vs production comparison**: run the changed transformation in the development environment and compare the output against production. Differences that cannot be explained by legitimate data differences indicate bugs in the new logic.
**Pre/post migration comparison**: when migrating from one platform to another (SQL Server to Snowflake, legacy warehouse to Lakehouse), compare aggregate outputs between the old and new systems before cutover. Unexplained differences must be investigated, not accepted.
**Rollback comparison**: if a production deployment is suspected of causing a data issue, compare the current output against a snapshot taken before the deployment to confirm the regression and identify the scope.
Testing Governance
Testing is only valuable if it is enforced. An optional test suite that engineers skip when they are in a hurry delivers no protection.
**Gate tests on promotion.** Tests must pass before data is promoted to production. In dbt, this is built into the workflow: dbt test runs before dbt run in production, and a test failure aborts the run. Any test failure should generate an alert to the data engineering team.
**Classify test severity.** Not all test failures require the same response. A not-null failure on a core key is a blocker; a row count variation of 0.01% on a secondary table may be acceptable. Classify tests as blocking (deployment halts) or warning (alert sent, deployment continues).
**Monitor test results over time.** Track which tests fail, how often, and what the root cause is. Tests that fail repeatedly with false positives should be calibrated; tests that never fail may have thresholds set too loosely.
**Document what each test validates.** Tests without documentation become tribal knowledge. The person who set up the test knows why it exists; everyone else doesn't. Document the business rule each test enforces and what a failure means operationally.
For data architecture that is built for testability from the start, our data architecture consulting team designs analytical data systems with testing integrated into the architecture — contact us to discuss your data quality requirements.
A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.
Book a Call →