Data Quality Dimensions: A Framework for Measuring and Improving Data Quality

Data quality is not a single property — it is a collection of dimensions, each measuring a different aspect of how fit data is for its intended use. This guide covers the six core data quality dimensions — completeness, accuracy, consistency, timeliness, uniqueness, and validity — with SQL patterns for measuring each dimension and a framework for prioritising quality improvement.

What Data Quality Dimensions Are

Data quality is not a single property — it is a collection of dimensions, each measuring a different aspect of how fit data is for its intended use. A dataset can score well on one dimension and poorly on another: highly complete (no nulls) but deeply inaccurate (wrong values), or perfectly accurate but stale (last updated six months ago).

The six core data quality dimensions provide a structured framework for assessing and improving data quality:

1. Completeness

2. Accuracy

3. Consistency

4. Timeliness

5. Uniqueness

6. Validity

Dimension 1: Completeness

**Definition**: The extent to which required data is present and not null.

**Measuring completeness**:

SELECT

COUNT(*) AS total_rows,

COUNT(email) AS rows_with_email,

COUNT(phone) AS rows_with_phone,

ROUND(100.0 * COUNT(email) / COUNT(*), 1) AS email_completeness_pct,

ROUND(100.0 * COUNT(phone) / COUNT(*), 1) AS phone_completeness_pct

FROM customers

**Completeness thresholds**: Define acceptable completeness levels per field. Email at 99% completeness may be acceptable if some customers legitimately have no email. Invoice amount at 100% completeness is a strict requirement — a null invoice amount is a data entry failure.

**Populated but meaningless**: Completeness measures presence, not meaning. A required field populated with "N/A," "UNKNOWN," or placeholder values scores as complete but contains no useful data. Combine completeness checks with validity checks (see below) to catch this pattern.

Dimension 2: Accuracy

**Definition**: The extent to which data values correctly represent the real-world entity they describe.

Accuracy is the hardest dimension to measure in SQL because it requires an external reference to compare against — you cannot determine whether a customer's address is accurate solely from the database.

**Accuracy measurement approaches**:

- **Cross-system validation**: Compare field values between two authoritative sources. Customer names matching in CRM and billing system within 5% suggests accuracy; 30% mismatch indicates a data problem.

- **Business rule validation**: Certain relationships must hold. Order delivered date must be after ship date; ship date must be after order date. Violations indicate accuracy problems.

- **Sample-based ground truth comparison**: Manually verify a random sample of records against the source document or external record. Statistical accuracy estimate from the sample.

**SQL for business rule accuracy check**:

SELECT

COUNT(*) AS total_orders,

COUNT(CASE WHEN shipped_date < order_date THEN 1 END) AS ship_before_order_violations,

COUNT(CASE WHEN delivered_date < shipped_date THEN 1 END) AS deliver_before_ship_violations

FROM orders

Dimension 3: Consistency

**Definition**: The extent to which the same data is represented the same way across different systems, tables, or time periods.

**Cross-system consistency**: Customer status in the CRM should match customer status in the billing system. Revenue in the finance system should reconcile to revenue in the data warehouse within an acceptable tolerance.

**Cross-table consistency**:

-- Orders referencing customers that don't exist in the customers table

SELECT COUNT(DISTINCT o.customer_id) AS orphaned_customer_ids

FROM orders o

LEFT JOIN customers c ON o.customer_id = c.customer_id

WHERE c.customer_id IS NULL

-- Revenue by month: check for implausible jumps suggesting a pipeline issue

SELECT

month,

revenue,

LAG(revenue) OVER (ORDER BY month) AS prior_month,

ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY month)) /

NULLIF(LAG(revenue) OVER (ORDER BY month), 0), 1) AS mom_pct_change

FROM monthly_revenue

ORDER BY month

A 400% month-over-month change warrants investigation — either the business had an unusual event or data was loaded incorrectly.

**Temporal consistency**: The same metric calculated the same way should give the same result regardless of when the query runs. If yesterday's revenue changes in today's query, data is being retroactively modified — which may or may not be expected.

Dimension 4: Timeliness

**Definition**: The extent to which data is available when needed and reflects the current state of the real-world entity.

Timeliness has two components: freshness (how recent is the most recent data?) and latency (how long after an event does the data appear?).

**Measuring freshness**:

SELECT

MAX(updated_at) AS latest_record,

DATEDIFF('hour', MAX(updated_at), CURRENT_TIMESTAMP) AS hours_since_last_update

FROM orders

**Freshness SLAs**: Define acceptable freshness per dataset. An operational order status table must be current within 1 hour. A weekly revenue aggregation may be acceptable if refreshed within 24 hours of the week closing.

**Pipeline latency**: The time between an event occurring in the source system and the event appearing in the analytical warehouse. Measure via the difference between the event timestamp in the source and the load timestamp in the warehouse.

SELECT

PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY DATEDIFF('minute', event_occurred_at, loaded_at)) AS median_latency_minutes,

PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY DATEDIFF('minute', event_occurred_at, loaded_at)) AS p95_latency_minutes

FROM events

Dimension 5: Uniqueness

**Definition**: The extent to which records are not duplicated.

Duplicates silently inflate all aggregate metrics — COUNT, SUM, AVG all produce incorrect results when duplicate rows exist.

**Measuring uniqueness**:

-- Duplicate order IDs

SELECT order_id, COUNT(*) AS occurrences

FROM orders

GROUP BY order_id

HAVING COUNT(*) > 1

-- Overall uniqueness rate

SELECT

COUNT(*) AS total_rows,

COUNT(DISTINCT order_id) AS distinct_orders,

COUNT(*) - COUNT(DISTINCT order_id) AS duplicate_rows,

ROUND(100.0 * COUNT(DISTINCT order_id) / COUNT(*), 2) AS uniqueness_pct

FROM orders

**Sources of duplicates**: ETL pipeline re-runs loading the same batch twice; CDC events processed more than once (at-least-once delivery); UNION without UNION DISTINCT combining overlapping datasets; joins on non-unique keys creating fanout.

**Deduplication**: QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) = 1 in Snowflake/BigQuery — keeps the most recent row per order_id and discards earlier duplicates.

Dimension 6: Validity

**Definition**: The extent to which data values conform to the expected format, range, or set of allowed values.

Validity checks: allowed value sets (status must be in a defined list), format patterns (email must match email regex, phone must match phone pattern), range checks (order amount must be positive, age must be between 0 and 150).

**SQL validity checks**:

SELECT

COUNT(*) AS total,

COUNT(CASE WHEN status NOT IN ('pending', 'processing', 'shipped', 'delivered', 'cancelled') THEN 1 END) AS invalid_status,

COUNT(CASE WHEN order_amount < 0 THEN 1 END) AS negative_amounts,

COUNT(CASE WHEN email NOT LIKE '%@%.%' THEN 1 END) AS invalid_email_format,

COUNT(CASE WHEN LENGTH(postal_code) NOT IN (5, 10) THEN 1 END) AS invalid_postal_code

FROM orders

Building a Data Quality Score

Combine dimensions into a composite data quality score for each dataset:

SELECT

'orders' AS dataset,

ROUND(100.0 * COUNT(order_id) / COUNT(*), 1) AS completeness_order_id,

ROUND(100.0 * COUNT(CASE WHEN order_amount >= 0 THEN 1 END) / COUNT(*), 1) AS validity_amount_non_negative,

ROUND(100.0 * COUNT(DISTINCT order_id) / COUNT(*), 1) AS uniqueness_order_id,

DATEDIFF('hour', MAX(updated_at), CURRENT_TIMESTAMP) AS freshness_hours_since_update

FROM orders

Track these metrics over time in a data quality monitoring table — trend analysis reveals whether data quality is improving or degrading. Alert when dimensions drop below defined thresholds.

Our data architecture practice designs data quality frameworks including dimension measurement, monitoring, and alerting — contact us to discuss your data quality requirements.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →