Bad data is not a data problem — it is an architecture and governance problem. Here is how to diagnose the root causes of data quality failures, implement the technical controls that prevent them, and build the organisational structures that make quality sustainable.
The quick answer
Data quality is not primarily a tooling problem. Most organisations that buy a data quality tool without fixing the architecture and governance that are producing bad data find that the tool surfaces more problems than it fixes. Data quality failures have root causes: in how data is ingested, in how schemas evolve without governance, in how transformation logic is maintained (or not), and in the absence of ownership, which means nobody is accountable when quality degrades. Fixing quality sustainably requires addressing the root causes, not just monitoring the symptoms.
Why data quality fails
The most common causes of enterprise data quality failure are structural, not accidental:
**No ownership.** The most common root cause of sustained data quality failure is the absence of a named owner accountable for a dataset's quality. Without ownership, quality problems are everyone's problem — which means they are nobody's problem. A data source with no owner has no one to escalate to when a quality issue appears, no one to make the call about whether a value is valid or corrupt, and no one invested in preventing the problem from recurring. This is the problem that data governance directly addresses.
**Schema drift.** Source systems change — a field is renamed, a new nullable column is added, a code value changes meaning. Without schema monitoring at the ingestion layer, these changes propagate silently into the data platform. Downstream transformations that assumed a particular schema produce wrong results or fail — often without alerting the teams that depend on the output.
**Ingestion without validation.** Many data pipelines ingest data directly from source systems into the warehouse or lakehouse without any validation at the point of ingestion. Data that does not meet basic quality standards — null values in non-null fields, values outside expected ranges, referential integrity violations — enters the data platform and propagates through transformation layers before any quality check occurs. By the time an analyst notices a problem, the bad data may be in dozens of downstream tables.
**Business logic buried in BI tools.** When business logic — customer segmentation criteria, revenue attribution rules, product category hierarchies — lives in individual Tableau workbooks or Power BI reports rather than in the data platform, different reports calculate the same metric differently. This is not a data quality problem in the traditional sense (the underlying data may be accurate), but it produces what looks like bad data to the business: two reports with the same title showing different numbers.
**No monitoring between runs.** Batch pipelines run on a schedule. Between runs, no one is watching. A source system data quality issue that emerges at 9am may not be noticed until an analyst queries a report at 2pm. For five hours, decisions are being made on data that is wrong. Monitoring that detects quality anomalies within a pipeline run — not just pipeline failures — reduces the exposure window.
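As a minimal sketch of schema monitoring at the ingestion layer, the check below compares an observed schema against an expected one and reports the differences. The `EXPECTED_SCHEMA` dictionary, the column names, and the `detect_schema_drift` function are all hypothetical and chosen for illustration; a real pipeline would read both schemas from its own metadata.

```python
from typing import Dict, List

# Hypothetical expected schema for an illustrative "orders" feed:
# column name -> type name.
EXPECTED_SCHEMA: Dict[str, str] = {
    "order_id": "int",
    "customer_id": "int",
    "amount": "float",
    "status": "str",
}


def detect_schema_drift(expected: Dict[str, str], observed: Dict[str, str]) -> List[str]:
    """Return human-readable descriptions of differences between two schemas."""
    findings: List[str] = []
    # Columns the pipeline expects but the source no longer sends.
    for column in sorted(expected.keys() - observed.keys()):
        findings.append(f"missing column: {column}")
    # Columns the source sends that the pipeline has never seen.
    for column in sorted(observed.keys() - expected.keys()):
        findings.append(f"unexpected column: {column}")
    # Columns present on both sides whose declared type differs.
    for column in sorted(expected.keys() & observed.keys()):
        if expected[column] != observed[column]:
            findings.append(
                f"type changed: {column} ({expected[column]} -> {observed[column]})"
            )
    return findings


# A renamed field shows up as one missing column plus one unexpected column.
observed_schema = {
    "order_id": "int",
    "customer_id": "int",
    "order_amount": "float",  # "amount" was renamed upstream
    "status": "str",
    "discount_code": "str",   # new nullable column
}
drift = detect_schema_drift(EXPECTED_SCHEMA, observed_schema)
```

A structural comparison like this catches renames, additions, and type changes. It cannot catch a code value that changes meaning while keeping the same name and type; that still needs an owner who knows the source system.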
The data quality dimensions
Data quality is measured across multiple dimensions. Understanding which dimension is failing determines which fix to apply:
**Completeness** — are all expected records present? Are non-null fields populated? Completeness failures are among the most common: a source system export that missed a date range, a field that was nullable in the source but should be required in the data platform, or records that were filtered by a pipeline condition that should not have applied.
**Accuracy** — do values reflect the real-world state they represent? Accuracy problems are the hardest to detect technically — if a CRM user entered the wrong company size in a form, the data is complete and well-formed but inaccurate. Accuracy is governed through source system controls (input validation, required fields) and through cross-system reconciliation (comparing CRM revenue to ERP revenue for the same period).
**Consistency** — does the same concept mean the same thing across systems? An "active customer" in the CRM, the finance system, and the data warehouse may be defined differently. Consistency is a semantic layer problem — it is solved by canonical definitions, not by data quality tooling.
**Timeliness**: is data available when it is needed? A pipeline that runs at 3am but whose output is not available until 7am because of compute queue delays has a timeliness problem. So does a source system with a 48-hour export lag. Timeliness monitoring tracks the lag between when data should be available and when it actually is.
**Validity** — do values conform to the expected format, range, and reference list? Phone numbers in unexpected formats, dates with impossible values (February 30), amounts outside expected ranges, foreign keys pointing to non-existent parent records. These are the quality failures that automated validation catches most reliably.
**Uniqueness** — are records that should be unique actually unique? Duplicate customers, duplicate transactions, duplicate product records. Often introduced by merge and deduplication failures in ETL processes.
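Two of these dimensions, validity and timeliness, are simple to express as code. The sketch below is illustrative only: the function names are hypothetical, and the 0 to 150 range simply reuses the kind of bound a customer-age rule might apply.

```python
from datetime import date, datetime, timedelta
from typing import Optional


def parse_iso_date(text: str) -> Optional[date]:
    """Return a date for a valid ISO date string such as YYYY-MM-DD.

    Returns None if the string cannot be parsed, which includes
    impossible dates such as February 30.
    """
    try:
        return date.fromisoformat(text)
    except ValueError:
        return None


def in_range(value: float, low: float, high: float) -> bool:
    """Validity: is the value inside the inclusive expected range?"""
    return low <= value <= high


def availability_lag(expected_at: datetime, actual_at: datetime) -> timedelta:
    """Timeliness: how late the data arrived (zero if early or on time)."""
    return max(actual_at - expected_at, timedelta(0))


impossible = parse_iso_date("2024-02-30")   # None: February has no day 30
leap_day = parse_iso_date("2024-02-29")     # a real date in a leap year
age_ok = in_range(42, 0, 150)
lag = availability_lag(
    expected_at=datetime(2024, 1, 1, 3, 0),  # data expected at 3am
    actual_at=datetime(2024, 1, 1, 7, 0),    # data actually landed at 7am
)
```

Accuracy and consistency do not reduce to functions like these, which is why they depend on source system controls, reconciliation, and agreed definitions.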
The technical controls: where to implement quality checks
Data quality checks should be implemented at multiple layers of the data stack, not just at the reporting layer where problems become visible.
**At ingestion (Bronze layer).** Schema validation: does the incoming data match the expected schema? Volume validation: did we receive the expected number of records? Null checks on fields that should be non-null. Out-of-range checks on numerical fields. These checks at ingestion are the earliest warning — they catch source system problems before they propagate downstream.
**At transformation (Silver layer).** dbt tests are the standard tool for transformation-layer quality: not-null tests, unique tests, accepted-values tests for reference data, and relationship tests for referential integrity. Custom tests for business-logic validity (order amounts must be positive, customer age must be between 0 and 150). dbt tests run as part of the transformation pipeline and fail the run if quality standards are not met — preventing bad data from reaching the Gold layer.
**At the Gold layer.** Cross-table consistency checks: does the total revenue in the orders fact table reconcile with the revenue in the finance summary table? Volume checks: is the row count within expected bounds for the time period? These checks validate that the transformation logic produced correct outputs.
**In the BI layer.** Threshold alerts in BI tools (Tableau or Power BI) that flag when a metric moves outside expected bounds. Not a substitute for upstream quality checks, but a last-line detection mechanism for quality issues that reached the reporting layer.
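Two of the layered checks above can be sketched in a few lines: a volume check of the kind used at ingestion and at the Gold layer, and a cross-table reconciliation. The function names and the default tolerances (20% on volume, 0.01 on totals) are assumptions for illustration, not recommended values.

```python
import math
from typing import Iterable


def volume_within_bounds(row_count: int, expected: int, tolerance: float = 0.2) -> bool:
    """True if row_count is within +/- tolerance (a fraction) of the expected count."""
    return abs(row_count - expected) <= tolerance * expected


def totals_reconcile(
    fact_amounts: Iterable[float], summary_total: float, abs_tol: float = 0.01
) -> bool:
    """True if the summed fact rows match the summary total within abs_tol."""
    # fsum reduces floating-point error when adding many amounts.
    fact_total = math.fsum(fact_amounts)
    return math.isclose(fact_total, summary_total, rel_tol=0.0, abs_tol=abs_tol)


normal_load = volume_within_bounds(row_count=950, expected=1000)
short_load = volume_within_bounds(row_count=700, expected=1000)
revenue_matches = totals_reconcile([100.10, 200.20, 300.30], summary_total=600.60)
revenue_mismatch = totals_reconcile([100.0, 200.0], summary_total=350.0)
```

In practice the expected count would come from the source system's own control totals or from recent history rather than a hard-coded number.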
Data quality tooling
Several tools exist specifically for data quality management. They are valuable when the right architecture and governance are in place — they are not a substitute for fixing root causes.
**dbt tests**: The default for transformation-layer quality in dbt-based architectures. Four generic tests (`not_null`, `unique`, `accepted_values`, `relationships`) are built in, and older dbt documentation calls these schema tests. Custom tests are straightforward to write. For most organisations, dbt tests cover the majority of quality validation requirements without additional tooling.
**Great Expectations**: An open-source Python library for defining data quality "expectations" (assertions about data properties) and running them against datasets. Produces validation reports and integrates with data pipeline tools. More flexible than dbt tests for complex, programmatic quality checks. Useful when quality logic is too complex for dbt's SQL-based test model.
**Monte Carlo, Bigeye, Metaplane**: ML-based anomaly detection tools that learn normal data patterns and alert on anomalies — unexpected volume changes, statistical distribution shifts, unusual null rates. These tools catch quality issues that are difficult to specify as explicit rules (because you do not know in advance what normal looks like) and are well-suited to monitoring the quality of existing datasets with no known baseline.
**Microsoft Purview, Atlan, Alation**: Data catalogue and governance platforms that include quality monitoring, lineage tracking, and data stewardship workflows. Relevant for organisations that need quality management integrated with catalogue, lineage, and governance in a single platform.
The right tooling choice depends on your data stack. For dbt-based architectures, dbt tests plus a catalogue tool (Purview or Atlan) covers most quality requirements. For environments with large existing datasets and no established quality baseline, an anomaly detection tool (Monte Carlo, Bigeye) adds value by surfacing problems that explicit tests would miss.
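To make the idea of anomaly detection concrete, here is a deliberately simple statistical stand-in: flag a daily row count that sits more than a few standard deviations from recent history. This is not how Monte Carlo, Bigeye, or Metaplane work internally; the function name, the sample history, and the threshold of 3 are assumptions for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def is_volume_anomaly(history: Sequence[int], today: int, threshold: float = 3.0) -> bool:
    """Flag today's row count if it is more than `threshold` sample
    standard deviations away from the historical mean."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # History is perfectly flat, so any change at all is unusual.
        return today != mu
    return abs(today - mu) / sigma > threshold


# Hypothetical daily row counts for the previous week.
recent_counts = [1000, 1020, 980, 1010, 990, 1005, 995]
collapsed_load = is_volume_anomaly(recent_counts, today=400)
ordinary_load = is_volume_anomaly(recent_counts, today=1015)
```

The appeal of this family of checks is that nobody had to write a rule saying "expect about 1,000 rows"; the baseline comes from the data itself. Commercial tools extend the same idea to seasonality, distributions, and null rates.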
Building the governance that makes quality sustainable
Technical controls prevent quality problems at the point of ingestion and transformation. Governance makes quality sustainable by creating the organisational structures that maintain quality standards over time.
**Assign ownership.** Every critical dataset needs a named owner — a person or team accountable for quality. Ownership is not a monitoring role; it is accountability for the data meeting the quality standards that downstream consumers depend on. When a quality issue is detected, the owner is responsible for triage, root cause investigation, and resolution.
**Define quality SLAs.** For each critical dataset, define what "acceptable quality" means: a freshness SLA (data must be available by 7am), a completeness threshold (fewer than 0.1% nulls on key fields), and a validity threshold (all records must pass validation rules). These SLAs set the bar that quality monitoring checks against.
**Establish an escalation path.** When quality checks fail, who is notified? In what order? Within what response time? Without a defined escalation path, quality alerts go to a general inbox that nobody monitors. The escalation path should be specified by dataset criticality: critical datasets escalate to the data owner and their manager; non-critical datasets escalate to the data engineering queue.
**Conduct regular data audits.** Periodic cross-system reconciliation — comparing the data platform's representation of a metric against the source system — catches quality issues that automated checks miss. For financial data, monthly reconciliation against the source ERP is standard practice. For customer data, quarterly audits against CRM source data are common.
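Ownership, SLAs, and escalation become much easier to enforce once they are written down as data rather than held in people's heads. The sketch below shows one possible shape, assuming a hypothetical `QualitySLA` record; the dataset, owner names, and queue name are invented for illustration.

```python
from dataclasses import dataclass, replace
from datetime import time
from typing import List


@dataclass(frozen=True)
class QualitySLA:
    dataset: str
    owner: str            # named person or team accountable for quality
    owner_manager: str
    critical: bool
    available_by: time    # freshness SLA
    max_null_rate: float  # null rate on key fields must stay strictly below this


def sla_breaches(sla: QualitySLA, available_at: time, null_rate: float) -> List[str]:
    """Return the names of the SLA terms that were breached."""
    breaches: List[str] = []
    if available_at > sla.available_by:
        breaches.append("freshness")
    if null_rate >= sla.max_null_rate:
        breaches.append("completeness")
    return breaches


def escalation_recipients(sla: QualitySLA) -> List[str]:
    """Critical datasets go to the owner and their manager;
    everything else goes to the data engineering queue."""
    if sla.critical:
        return [sla.owner, sla.owner_manager]
    return ["data-engineering-queue"]  # hypothetical queue name


orders_sla = QualitySLA(
    dataset="orders",
    owner="orders-data-owner",
    owner_manager="head-of-sales-ops",
    critical=True,
    available_by=time(7, 0),
    max_null_rate=0.001,  # 0.1%
)
late_run = sla_breaches(orders_sla, available_at=time(7, 30), null_rate=0.0005)
notify = escalation_recipients(orders_sla)
notify_non_critical = escalation_recipients(replace(orders_sla, critical=False))
```

Response times and notification order could be added as further fields; the point is that the escalation path is explicit and can be checked by the same pipeline that runs the quality tests.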
Data quality and AI
The relationship between data quality and AI is unforgiving: AI systems amplify data quality problems rather than compensating for them. An analyst who encounters inconsistent data pauses and investigates. An AI system uses the inconsistent data to produce outputs that appear authoritative. When those outputs propagate through automated decision systems at machine speed, the impact of a data quality problem is orders of magnitude larger than in a human-analyst workflow.
McKinsey research from 2025 identifies data quality as the primary reason that 60% of enterprise AI deployments fail to reach production. The proof of concept ran on clean, curated data. Production connected to the actual enterprise data landscape — which had quality issues that were tolerated because human analysts could work around them. AI systems cannot work around them; they amplify them.
For organisations building AI-ready data infrastructure, quality gates at the ingestion and transformation layers are not optional. The data that AI systems consume must meet the same quality standards as the data that executives sign off on. For the full treatment of what AI-ready infrastructure requires, see our article on why your data architecture cannot support agentic AI.
Where to start
If your organisation has persistent data quality problems, the highest-leverage starting point is almost always ownership, not tooling:
1. **Identify your 10 most-used, most-disputed datasets.** These are the quality problems that are costing you the most in analyst time, reconciliation work, and business decisions made on wrong data.
2. **Assign an owner to each.** A named person or team accountable for quality — not "the data team" generically.
3. **Define the quality SLA for each.** Freshness, completeness, validity thresholds. Be specific.
4. **Implement dbt tests at the Silver layer for these 10 datasets.** Not-null, unique, accepted-values, relationships. These catch the most common failures.
5. **Set up alerting so the owner is notified when tests fail.** The alert is only valuable if it reaches the accountable owner within the response time defined in the SLA.
This sequence — ownership first, then technical controls, then monitoring — produces durable quality improvement. Tooling without ownership produces dashboards full of quality alerts that nobody acts on.
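For readers who have not used dbt, the four tests named in step 4 assert roughly the following. This is a plain-Python approximation of what the tests check, not dbt's implementation (dbt compiles them to SQL); the sample rows and function names are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List, Set

Row = Dict[str, Any]


def not_null_failures(rows: List[Row], column: str) -> List[Row]:
    """Rows where the column is missing or null."""
    return [r for r in rows if r.get(column) is None]


def unique_failures(rows: List[Row], column: str) -> List[Any]:
    """Non-null values that appear more than once in the column."""
    counts = Counter(r[column] for r in rows if r.get(column) is not None)
    return [value for value, n in counts.items() if n > 1]


def accepted_values_failures(rows: List[Row], column: str, accepted: Set[Any]) -> List[Row]:
    """Rows whose non-null value is outside the accepted reference list."""
    return [r for r in rows if r.get(column) is not None and r[column] not in accepted]


def relationships_failures(rows: List[Row], column: str, parent_keys: Set[Any]) -> List[Row]:
    """Rows whose non-null foreign key has no matching parent record."""
    return [r for r in rows if r.get(column) is not None and r[column] not in parent_keys]


orders = [
    {"order_id": 1, "customer_id": 10, "status": "shipped"},
    {"order_id": 2, "customer_id": 11, "status": "pending"},
    {"order_id": 2, "customer_id": None, "status": "unknown"},  # duplicate id, null key
    {"order_id": 3, "customer_id": 99, "status": "shipped"},    # no such customer
]
known_customers = {10, 11}

null_keys = not_null_failures(orders, "customer_id")
duplicate_ids = unique_failures(orders, "order_id")
bad_status = accepted_values_failures(orders, "status", {"pending", "shipped", "delivered"})
orphans = relationships_failures(orders, "customer_id", known_customers)
```

Each function returns the failing rows or values, so an empty result means the test passed. That mirrors the convention dbt uses, where a test fails when its query returns rows.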
FAQs
How do we measure our current data quality?
Start with a data profiling exercise: for each critical dataset, calculate completeness rates (% non-null for each field), uniqueness rates (% unique records for fields that should be unique), validity rates (% records passing defined rules), and freshness (time since last update vs SLA). Most query engines and data catalogue tools support basic profiling. The output tells you where the worst problems are and gives you a baseline to measure improvement against.
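A profiling pass of this kind can be sketched in plain Python. The `profile` function, its parameters, and the sample rows below are hypothetical; uniqueness is computed here as distinct key values divided by total rows, which is one common definition among several.

```python
from datetime import datetime
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def profile(
    rows: List[Row],
    key: str,
    required: str,
    is_valid: Callable[[Row], bool],
    last_updated: datetime,
    now: datetime,
) -> Dict[str, Any]:
    """Compute baseline quality metrics for one dataset."""
    if not rows:
        raise ValueError("cannot profile an empty dataset")
    total = len(rows)
    non_null = sum(1 for r in rows if r.get(required) is not None)
    distinct_keys = len({r[key] for r in rows})
    valid = sum(1 for r in rows if is_valid(r))
    return {
        "completeness": non_null / total,   # share of rows with the required field set
        "uniqueness": distinct_keys / total,  # 1.0 means no duplicate keys
        "validity": valid / total,          # share of rows passing the rule
        "staleness": now - last_updated,    # compare against the freshness SLA
    }


customers = [
    {"id": 1, "email": "a@example.com", "lifetime_value": 120.0},
    {"id": 2, "email": None, "lifetime_value": 80.0},            # incomplete
    {"id": 2, "email": "b@example.com", "lifetime_value": 45.0},  # duplicate key
    {"id": 3, "email": "c@example.com", "lifetime_value": -10.0},  # invalid value
]
baseline = profile(
    customers,
    key="id",
    required="email",
    is_valid=lambda r: r["lifetime_value"] >= 0,
    last_updated=datetime(2024, 1, 1, 2, 0),
    now=datetime(2024, 1, 1, 7, 0),
)
```

Storing the result of each run gives you the baseline the answer above refers to, so later improvements can be measured rather than asserted.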
What is the difference between data quality and data observability?
Data quality is about whether the data meets defined standards — is it complete, accurate, consistent, timely, valid, and unique? Data observability is about visibility into the health of the data system — do you know when something changes, fails, or behaves unexpectedly? Observability tools (Monte Carlo, Bigeye) monitor for unexpected changes in data patterns; quality tools (dbt tests) verify that data meets specified criteria. Both are part of a complete data quality programme — observability catches the unknown unknowns, quality tests verify the known requirements.
Can AI help with data quality?
ML-based anomaly detection (the approach Monte Carlo and similar tools use) is genuinely useful for identifying quality anomalies that would be impossible to specify as explicit rules. For large datasets with no established quality baseline, these tools surface patterns that manual inspection would miss. Where AI does not help is in fixing the root causes of quality failures — ownership structures, schema governance, ingestion validation. AI can detect the symptom; governance fixes the cause.
We have a data catalogue. Does that mean we have data quality management?
A data catalogue documents what data exists and where it comes from. Data quality management ensures the data meets defined standards. They are complementary: a catalogue with quality scores (freshness, completeness, profile statistics) gives consumers visibility into quality. But the catalogue does not enforce quality — it reports on it. Quality enforcement requires validation checks at ingestion and transformation, and ownership structures that respond when checks fail.
Our data architecture consulting practice designs data quality frameworks as part of broader data platform engagements, including ownership structures, dbt testing standards, and catalogue integration. If your organisation has data quality problems that are affecting business decisions or blocking AI initiatives, book a free 30-minute audit and we will tell you directly what is causing them and what it takes to fix them.