
What Is Data Observability? Detecting Pipeline Failures Before Users Do

James Okafor
Lead Data Engineer
March 15, 2028 · 11 min read

Data observability is the ability to understand, monitor, and troubleshoot the health of data in your system. This guide explains the five pillars of data observability, how it differs from data testing, the leading tools in the space, and what an observable data pipeline actually looks like in practice.

The term borrows from software engineering's concept of system observability, the practice of inferring internal state from external signals, and applies the same principle to data pipelines and the data they produce.

The core problem: data pipelines fail silently. A pipeline may complete without errors while producing incorrect, stale, or incomplete data. A schema change in a source system breaks a downstream transformation without raising an exception. A filter condition changes, silently dropping 30% of records. The pipeline logs show success; the dashboard shows wrong numbers; users notice before the data team does.

The Five Pillars

The data observability framework articulated by Monte Carlo (which popularized the term) defines five dimensions of data health:

**Freshness** — is the data up to date? Freshness observability monitors when a table was last updated and alerts when it falls outside expected windows. A daily table that has not received new data in 36 hours is a signal worth investigating. Freshness monitoring detects pipeline failures, schedule drift, and source system outages before users see stale reports.
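A freshness check can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the function name `is_stale` and the 26-hour window (24 hours plus a grace period) are assumptions chosen for the example, and in practice the last-update timestamp would come from warehouse metadata.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_updated: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True when the newest load is older than the allowed window."""
    return (now - last_updated) > max_age


# Hypothetical daily table: allow 24 hours plus a 2-hour grace period.
now = datetime(2028, 3, 15, 12, 0, tzinfo=timezone.utc)
last_load = now - timedelta(hours=36)

if is_stale(last_load, now, max_age=timedelta(hours=26)):
    print("ALERT: table has not been updated within its freshness window")
```

Passing `now` in explicitly, rather than calling the clock inside the function, keeps the check easy to test.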

**Volume** — does the data have the expected row count? Volume observability monitors record counts and alerts on anomalies: a table receiving 10,000 rows per day that suddenly receives 100 (or 100,000) signals a problem. Volume anomalies catch extraction failures, double-loading, filter logic changes, and source system issues.
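One simple way to flag a volume anomaly is a z-score against recent daily row counts. The sketch below assumes a short history of counts is already available; the function name `volume_anomaly`, the sample history, and the threshold of 3 standard deviations are illustrative choices, and commercial tools generally use more sophisticated models that account for seasonality.

```python
from statistics import mean, stdev


def volume_anomaly(history: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's row count when it sits far outside the historical spread."""
    mu = mean(history)
    sigma = stdev(history)  # needs at least two historical points
    if sigma == 0:
        # A perfectly flat history: any change at all is unusual.
        return today != mu
    return abs(today - mu) / sigma > z_threshold


# Hypothetical table that normally receives about 10,000 rows per day.
history = [10_050, 9_900, 10_120, 9_980, 10_030, 9_940, 10_010]

for count in (100, 100_000, 10_060):
    print(count, volume_anomaly(history, count))
```

The check is two-sided, so both the collapse to 100 rows and the jump to 100,000 rows are flagged, while an ordinary day is not.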

**Distribution** — are the values within expected ranges? Distribution observability monitors the statistical properties of columns: minimum, maximum, mean, null rate, unique value counts. A column that previously had a null rate of 2% jumping to 40% signals a source change or transformation bug. A revenue column with a maximum value that is 100x the historical maximum signals either a genuine business event or a data corruption issue.
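The column statistics described above can be computed and compared with a small profile function. This is a sketch under stated assumptions: `profile_column`, `distribution_alerts`, and the two thresholds (a 10-point null-rate jump and a 10x maximum) are hypothetical, and the sample values are invented for illustration.

```python
from statistics import mean


def profile_column(values: list) -> dict:
    """Summarize a column. Assumes at least one non-null value is present."""
    non_null = [v for v in values if v is not None]
    return {
        "null_rate": (len(values) - len(non_null)) / len(values),
        "distinct": len(set(non_null)),
        "min": min(non_null),
        "max": max(non_null),
        "mean": mean(non_null),
    }


def distribution_alerts(baseline: dict, current: dict,
                        null_rate_jump: float = 0.10,
                        max_factor: float = 10.0) -> list[str]:
    """Return the names of the statistics that moved beyond the allowed drift."""
    alerts = []
    if current["null_rate"] - baseline["null_rate"] > null_rate_jump:
        alerts.append("null_rate")
    if current["max"] > baseline["max"] * max_factor:
        alerts.append("max")
    return alerts


baseline = profile_column([120.0, 80.0, 95.0, 110.0, 100.0] * 10)  # 50 values, no nulls
current = profile_column([None] * 20 + [90.0] * 29 + [50_000.0])   # 40% nulls, one outlier
print(distribution_alerts(baseline, current))
```

An alert on the maximum does not by itself say whether the value is a real business event or corruption; it only tells someone to look.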

**Schema** — has the structure changed? Schema observability monitors for column additions, removals, renames, and type changes. Schema changes are a primary cause of downstream pipeline failures; detecting them at the moment of change rather than when downstream queries break is the difference between proactive and reactive incident response.
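Schema monitoring can be approximated by diffing two snapshots of a table's columns. The sketch assumes each snapshot is a plain mapping of column name to type string (for example, captured from the warehouse's information schema on a schedule); `diff_schema` and the sample columns are hypothetical.

```python
def diff_schema(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {column: type} snapshots and report what changed."""
    shared = previous.keys() & current.keys()
    return {
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
        "type_changed": sorted(c for c in shared if previous[c] != current[c]),
    }


yesterday = {"id": "INTEGER", "amount": "NUMERIC", "created": "TIMESTAMP"}
today = {"id": "INTEGER", "amount": "VARCHAR", "created_at": "TIMESTAMP"}

print(diff_schema(yesterday, today))
```

A name-based diff cannot tell a rename from a drop plus an add, so the rename of `created` to `created_at` shows up as one removed and one added column.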

**Lineage** — which assets are affected? When a health issue is detected in a table, lineage answers which downstream tables, models, and dashboards are affected. This transforms an alert from "table X has a problem" to "table X has a problem, and these 12 dashboards and 3 downstream models are affected — these stakeholders need to be notified."
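Impact analysis over lineage is a graph traversal. The sketch below assumes lineage is available as a mapping from each asset to its direct consumers; the asset names and the function `downstream_assets` are made up for the example.

```python
from collections import deque


def downstream_assets(graph: dict[str, list[str]], start: str) -> set[str]:
    """Breadth-first walk returning every asset that depends on `start`."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen and child != start:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical lineage: asset -> direct downstream consumers.
lineage = {
    "raw.transactions": ["stg_transactions"],
    "stg_transactions": ["fct_revenue", "fct_refunds"],
    "fct_revenue": ["dash.exec_kpis", "dash.finance_weekly"],
    "fct_refunds": ["dash.finance_weekly"],
}

print(sorted(downstream_assets(lineage, "raw.transactions")))
```

The returned set is what an alert would attach as "affected assets", and it can be joined to an ownership table to decide who gets notified.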

Data Observability vs Data Testing

Data testing (as implemented in dbt tests) and data observability are complementary, not competing.

Data tests are assertions written before a pipeline runs: this column must not be null, these values must be unique, this column must be a valid foreign key. Tests are deterministic — you define the expected state, and the test passes or fails based on whether the data matches. Tests are excellent at catching known categories of failure: known-invalid values, referential integrity violations, known business rules.
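To make the deterministic nature of tests concrete, here are the three assertions named above written as plain Python functions. This is an illustration of the idea only, not how dbt implements its tests; the function names and sample rows are hypothetical.

```python
def check_not_null(rows: list[dict], column: str) -> bool:
    """Pass only if every row has a value in `column`."""
    return all(row[column] is not None for row in rows)


def check_unique(rows: list[dict], column: str) -> bool:
    """Pass only if no value in `column` appears twice."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def check_foreign_key(rows: list[dict], column: str, parent_keys: set) -> bool:
    """Pass only if every value in `column` exists in the parent table's keys."""
    return all(row[column] in parent_keys for row in rows)


orders = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": 11},
    {"order_id": 2, "customer_id": 99},  # duplicate id, unknown customer
]
customers = {10, 11}

print(check_not_null(orders, "order_id"))
print(check_unique(orders, "order_id"))
print(check_foreign_key(orders, "customer_id", customers))
```

Each function returns the same answer for the same data every time, with no dependence on history.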

Data observability is statistical and learned — it detects anomalies relative to historical behavior without requiring predefined thresholds. A sudden 40% drop in row count is an anomaly even if you did not write a test for "row count must not drop 40%." Observability catches the failures you did not anticipate; tests catch the failures you already knew to guard against.

Production analytics environments need both: tests for known constraints, observability for behavioral anomalies.

What an Observable Data Pipeline Looks Like

Observable pipelines emit signals at every stage that enable both immediate alerting and retrospective debugging.

At the extraction stage: the ingestion tool (Fivetran, Airbyte) logs row counts, schema changes, and sync durations. These are consumed by observability tooling or exposed via the tool's own monitoring APIs.

At the transformation stage: dbt model runs record row counts, test results, and execution duration per model. In dbt Cloud, these are accessible via the Discovery API (formerly the Metadata API). Monte Carlo, Atlan, and similar tools ingest dbt run results to correlate test failures with downstream impact.
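For teams running dbt Core, the same signals are written to the `run_results.json` artifact after each invocation. The sketch below parses a trimmed, hand-written sample rather than a real file. The field names follow the artifact's layout as commonly documented (`results`, `unique_id`, `status`, `execution_time`, `adapter_response`), but `rows_affected` is adapter-dependent and may be absent, and the project and node names are invented.

```python
import json

# Trimmed, hand-written sample shaped like a dbt run_results.json artifact.
SAMPLE_RUN_RESULTS = json.loads("""
{
  "results": [
    {"unique_id": "model.shop.stg_orders", "status": "success",
     "execution_time": 4.2, "adapter_response": {"rows_affected": 10012}},
    {"unique_id": "test.shop.not_null_orders_id", "status": "fail",
     "execution_time": 0.8, "adapter_response": {}}
  ]
}
""")


def summarize(run_results: dict) -> list[dict]:
    """Flatten each node's status, duration, and row count (when reported)."""
    summary = []
    for result in run_results.get("results", []):
        adapter = result.get("adapter_response") or {}
        summary.append({
            "node": result["unique_id"],
            "status": result["status"],
            "seconds": result["execution_time"],
            "rows": adapter.get("rows_affected"),  # None when the adapter omits it
        })
    return summary


failed = [r["node"] for r in summarize(SAMPLE_RUN_RESULTS)
          if r["status"] in {"error", "fail"}]
print(failed)
```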

At the warehouse layer: the data warehouse's query logs record execution counts, latency, and error rates for tables. Snowflake's Account Usage views expose this data. BigQuery exposes job and query history through its INFORMATION_SCHEMA.JOBS views. These signals enable detection of tables that have suddenly stopped being queried (suggesting a pipeline failure upstream) or tables experiencing unusual query volumes (suggesting unexpected consumer behavior).
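Detecting tables that have "gone quiet" comes down to comparing query counts across two adjacent time windows. The sketch assumes the query log has already been reduced to `(table, timestamp)` pairs, which is a simplification of what warehouse log views actually return; `tables_gone_quiet`, the 7-day window, and the minimum of 5 prior queries are illustrative.

```python
from collections import Counter
from datetime import datetime, timedelta


def tables_gone_quiet(query_log: list[tuple[str, datetime]], now: datetime,
                      window: timedelta = timedelta(days=7),
                      min_prior: int = 5) -> list[str]:
    """Tables queried regularly in the prior window but not at all in the recent one."""
    recent: Counter = Counter()
    prior: Counter = Counter()
    for table, ran_at in query_log:
        age = now - ran_at
        if timedelta(0) <= age < window:
            recent[table] += 1
        elif window <= age < 2 * window:
            prior[table] += 1
    return sorted(t for t, n in prior.items() if n >= min_prior and recent[t] == 0)


now = datetime(2028, 3, 15)
log = (
    [("orders", now - timedelta(days=d)) for d in range(8, 14)]        # only old queries
    + [("customers", now - timedelta(days=d)) for d in range(1, 14)]   # queried throughout
)

print(tables_gone_quiet(log, now))
```

The `min_prior` floor keeps rarely used tables from triggering alerts every time a week passes without a query.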

At the BI layer: Tableau and Looker both expose usage APIs. Dashboards that have not been viewed in 90 days are candidates for deprecation; dashboards with high view counts but recent underlying table failures are high-priority incidents.
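The two rules above can be combined into a small triage routine. This sketch assumes dashboard usage has already been pulled into simple records; the `Dashboard` fields, the `triage` function, and the threshold of 100 views in 30 days are hypothetical and do not correspond to any Tableau or Looker API shape.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Dashboard:
    name: str
    last_viewed: date
    views_last_30d: int
    source_tables: tuple[str, ...]


def triage(dashboards: list[Dashboard], failing_tables: set[str], today: date,
           stale_after_days: int = 90, high_traffic: int = 100) -> dict[str, list[str]]:
    """Sort dashboards into deprecation candidates and high-priority incidents."""
    report: dict[str, list[str]] = {"deprecation_candidates": [], "high_priority_incidents": []}
    for d in dashboards:
        if (today - d.last_viewed).days >= stale_after_days:
            report["deprecation_candidates"].append(d.name)
        elif d.views_last_30d >= high_traffic and failing_tables & set(d.source_tables):
            report["high_priority_incidents"].append(d.name)
    return report


today = date(2028, 3, 15)
dashboards = [
    Dashboard("Exec KPIs", date(2028, 3, 14), 450, ("fct_revenue",)),
    Dashboard("Legacy Churn", date(2027, 11, 1), 0, ("fct_churn",)),
    Dashboard("Ops Daily", date(2028, 3, 15), 300, ("fct_tickets",)),
]

print(triage(dashboards, failing_tables={"fct_revenue"}, today=today))
```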

Leading Data Observability Tools

**Monte Carlo** — the company that coined "data observability." Automated anomaly detection across freshness, volume, schema, and distribution. Connects to Snowflake, BigQuery, Redshift, dbt, Fivetran, Airflow, Tableau, and Looker. Commercial product with enterprise pricing.

**Bigeye** — column-level monitoring with threshold and anomaly detection. Strong on distribution monitoring with automated threshold recommendation.

**Soda** — open-source Python framework for defining data quality checks as YAML configuration, with a cloud platform for scheduling, notifications, and centralized results. Positioned between dbt tests (code-based assertions) and full observability platforms (automated anomaly detection).

**Great Expectations** — open-source Python library for data quality testing. Allows building "expectation suites" from existing data profiles and integrates into data pipelines. More of a testing framework than a full observability platform.

**dbt + Recce** — for teams on dbt, Recce provides observability-adjacent capabilities: comparing data profiles between development and production runs to validate that a model change produced the expected statistical difference.

**Metaplane** — simpler observability platform focused on small and mid-size data teams. Lower operational overhead than Monte Carlo; narrower connector set.

Incident Response With Observability

The operational value of observability is reducing mean time to detection (MTTD) and mean time to resolution (MTTR) for data incidents.
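Both metrics are simple averages over incident timestamps. Definitions vary between teams; the sketch below assumes MTTD is measured from when the problem started to when it was detected, and MTTR from detection to resolution. The incident records are invented to contrast a user-reported incident with an automatically detected one.

```python
from datetime import datetime, timedelta


def mean_duration(durations: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incidents: one found by an analyst, one caught by monitoring.
incidents = [
    {"started": datetime(2028, 3, 1, 1, 0), "detected": datetime(2028, 3, 1, 9, 0),
     "resolved": datetime(2028, 3, 1, 13, 0)},
    {"started": datetime(2028, 3, 8, 3, 0), "detected": datetime(2028, 3, 8, 3, 10),
     "resolved": datetime(2028, 3, 8, 3, 50)},
]

mttd = mean_duration([i["detected"] - i["started"] for i in incidents])
mttr = mean_duration([i["resolved"] - i["detected"] for i in incidents])

print("MTTD:", mttd)
print("MTTR:", mttr)
```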

Without observability: an analyst notices a metric looks wrong, files a ticket, a data engineer investigates without knowing where to start, manually checks pipeline logs, eventually finds the source of the problem — hours or days later.

With observability: the system detects a 35% drop in row count in the transactions table at 3am, identifies that the freshness SLA for a dependent model was missed, notifies the on-call data engineer with a lineage graph showing which dashboards are affected, and provides the anomaly context (the row count drop started at the last Fivetran sync, suggesting a source system issue). The engineer diagnoses and resolves within the hour; stakeholders receive a proactive notification before anyone notices in a dashboard.
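The alert in that scenario bundles three things: the size of the drop, the affected assets, and a readable message. A minimal sketch of assembling such a payload follows; `build_alert`, the 30% threshold, and the asset names are assumptions for illustration, and delivery to a paging or chat tool is left out.

```python
from typing import Optional


def percent_drop(expected: int, observed: int) -> float:
    """Drop in row count as a percentage of the expected count."""
    return (expected - observed) * 100 / expected


def build_alert(table: str, expected_rows: int, observed_rows: int,
                affected_assets: set[str], threshold_pct: float = 30.0) -> Optional[dict]:
    """Return an alert payload when the drop crosses the threshold, else None."""
    drop = percent_drop(expected_rows, observed_rows)
    if drop < threshold_pct:
        return None
    affected = sorted(affected_assets)
    return {
        "table": table,
        "drop_pct": round(drop, 1),
        "affected": affected,
        "message": (f"{table}: row count down {drop:.0f}% "
                    f"({observed_rows} vs {expected_rows} expected); "
                    f"{len(affected)} downstream assets affected"),
    }


alert = build_alert("transactions", 10_000, 6_500,
                    {"dash.exec_kpis", "fct_revenue"})
print(alert["message"] if alert else "no alert")
```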

Our data architecture practice designs observable data pipeline infrastructure with monitoring, alerting, and quality assurance frameworks — contact us to discuss data reliability requirements for your environment.
