
Great Expectations: Data Quality Testing for Analytics Pipelines

James Okafor
Lead Data Engineer
December 15, 2027 · 12 min read

Great Expectations is an open-source framework for defining, documenting, and enforcing data quality rules in analytics pipelines. This guide covers expectation suites, validation checkpoints, and Data Docs, and shows how to integrate Great Expectations into dbt and Airflow workflows for automated quality gates.

What Great Expectations Solves

Data quality failures are silent by default. A pipeline runs without error, loads rows into the warehouse, and BI reports display numbers that look plausible — but the underlying data has shifted. A source column changed meaning. A join produced unexpected duplicates. An upstream system started sending nulls in a field that was always populated. Without explicit quality checks, these failures surface weeks later as incorrect business decisions rather than pipeline alerts.

Great Expectations (GX) addresses this by letting you define explicit assertions about your data — expectations — and run those assertions at pipeline execution time. Failed expectations become failures in your pipeline, not silent data drift.

Core Concepts

**Expectations** are assertions about data. Each expectation is a named, parameterised rule:

- expect_column_to_exist

- expect_column_values_to_not_be_null

- expect_column_values_to_be_between

- expect_column_values_to_match_regex

- expect_column_pair_values_A_to_be_greater_than_B

- expect_table_row_count_to_be_between

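Between the core library and community-contributed packages, the Great Expectations gallery lists several hundred expectation types covering distribution, completeness, referential integrity, statistical properties, and custom SQL expressions. Only a subset of these ships in the core install.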

**Expectation Suites** are named collections of expectations for a specific dataset. One suite per data asset — orders, customers, events. Suites are serialised as JSON, checked into version control, and updated as data contracts evolve.

**Validators** run expectations against actual data. You point a validator at a dataset (Pandas DataFrame, Spark DataFrame, or database table via SQLAlchemy) and run the suite against it.

**Checkpoints** are the operational unit. A checkpoint pairs a validator with an expectation suite and defines what happens with the results: raise an exception, write validation results to a store, update Data Docs, or send a Slack notification. Checkpoints are what you embed in pipelines.
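To make the vocabulary concrete, here is a toy model of expectations, a suite, and a validation run in plain Python. It deliberately does not use the GX library: the `Expectation` dataclass, the `validate` function, and the sample `orders` rows are all illustrative assumptions, and only the expectation names are borrowed from GX.

```python
from dataclasses import dataclass
from typing import Any, Callable

Row = dict[str, Any]


@dataclass
class Expectation:
    """A named, parameterised rule evaluated against a whole dataset."""
    name: str
    kwargs: dict[str, Any]
    check: Callable[[list[Row]], bool]


def expect_column_values_to_not_be_null(column: str) -> Expectation:
    return Expectation(
        name="expect_column_values_to_not_be_null",
        kwargs={"column": column},
        check=lambda rows: all(row.get(column) is not None for row in rows),
    )


def expect_column_values_to_be_between(
    column: str, min_value: float, max_value: float
) -> Expectation:
    return Expectation(
        name="expect_column_values_to_be_between",
        kwargs={"column": column, "min_value": min_value, "max_value": max_value},
        # Nulls are skipped here; a separate not-null rule covers completeness.
        check=lambda rows: all(
            min_value <= row[column] <= max_value
            for row in rows
            if row.get(column) is not None
        ),
    )


def validate(rows: list[Row], suite: list[Expectation]) -> dict[str, Any]:
    """Play the validator role: run every expectation and collect results."""
    results = [
        {"expectation": e.name, "kwargs": e.kwargs, "success": e.check(rows)}
        for e in suite
    ]
    return {"success": all(r["success"] for r in results), "results": results}


# A suite is a named collection of expectations for one data asset.
orders_suite = [
    expect_column_values_to_not_be_null("order_date"),
    expect_column_values_to_be_between("order_amount", 0, 1_000_000),
]

orders = [
    {"order_date": "2027-12-01", "order_amount": 120.0},
    {"order_date": None, "order_amount": 75.5},  # violates the not-null rule
]

report = validate(orders, orders_suite)
```

The overall `success` flag is the conjunction of the per-expectation results, which is the property a checkpoint acts on.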

Setting Up Great Expectations

Install: pip install great_expectations

Initialise a GX project in your repository (this command-line workflow applies to pre-1.0 releases, which ship the CLI): great_expectations init

This creates a great_expectations/ directory with:

- great_expectations.yml: project configuration, data source definitions, stores

- expectations/: expectation suite JSON files

- checkpoints/: checkpoint configurations

- uncommitted/: validation results and data docs (gitignored)

Defining a Data Source

GX connects to data through data sources. For a PostgreSQL database:

In great_expectations.yml, define a datasource with a SQLAlchemy connection string. For Pandas DataFrames read from S3 or local files, define a PandasDatasource pointing at a file path.
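A small sketch of assembling the SQLAlchemy connection string for PostgreSQL outside the config file, so that credentials stay out of version control. The `postgresql+psycopg2://` URL form is standard SQLAlchemy; the environment variable names, the `postgres_url` helper, and the example values are assumptions made for illustration.

```python
import os
from collections.abc import Mapping
from urllib.parse import quote_plus


def postgres_url(env: Mapping[str, str] = os.environ) -> str:
    """Build a SQLAlchemy URL for PostgreSQL from environment-style settings."""
    user = env["PGUSER"]
    # Percent-encode the password so characters like '@' or '/' cannot
    # be mistaken for URL delimiters.
    password = quote_plus(env["PGPASSWORD"])
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env["PGDATABASE"]
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"


# Hypothetical settings, passed explicitly instead of read from os.environ.
example_env = {
    "PGUSER": "gx_reader",
    "PGPASSWORD": "p@ss/word",
    "PGDATABASE": "analytics",
}
url = postgres_url(example_env)
```

How the resulting string is referenced from great_expectations.yml depends on the GX release, so check the configuration documentation for the version you run.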

Creating Expectation Suites

Two approaches for building suites:

**Profiler-based (automated baseline)**: GX can profile a dataset and automatically generate a suite based on observed distributions. Useful for bootstrapping a suite from existing production data — generates 20-50 expectations covering nullability, value ranges, and cardinality. Review and prune the generated suite; profiler expectations are a starting point, not a finished contract.

**Manual definition**: Write expectations explicitly against what the data contract requires. More deliberate, more maintainable:

Define expectations in Python by calling methods on the validator:

- validator.expect_column_to_exist("customer_id")

- validator.expect_column_values_to_not_be_null("order_date")

- validator.expect_column_values_to_be_between("order_amount", min_value=0, max_value=1000000)

- validator.expect_column_values_to_be_in_set("status", ["pending", "completed", "cancelled", "refunded"])

Save the suite: validator.save_expectation_suite()
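For a sense of what ends up in version control, the sketch below builds an approximation of the saved suite as a Python dictionary and serialises it. The key names (`expectation_suite_name`, `expectation_type`, `kwargs`, `meta`) follow the layout used by pre-1.0 releases as far as I know, and the suite name `orders` is an assumption; field names differ between GX versions, so treat the file your own installation writes as authoritative.

```python
import json

# Approximate shape of a saved suite; key names vary by GX version.
suite = {
    "expectation_suite_name": "orders",
    "expectations": [
        {
            "expectation_type": "expect_column_to_exist",
            "kwargs": {"column": "customer_id"},
            "meta": {},
        },
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "order_date"},
            "meta": {},
        },
        {
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": "order_amount", "min_value": 0, "max_value": 1000000},
            "meta": {},
        },
        {
            "expectation_type": "expect_column_values_to_be_in_set",
            "kwargs": {
                "column": "status",
                "value_set": ["pending", "completed", "cancelled", "refunded"],
            },
            "meta": {},
        },
    ],
    "meta": {},
}

# Stable key order and indentation keep pull-request diffs readable.
serialised = json.dumps(suite, indent=2, sort_keys=True)
```

Because each expectation is a small record of a type plus keyword arguments, a change to the data contract shows up as a one- or two-line diff in review.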

Data Docs

Data Docs are auto-generated HTML documentation built from expectation suites and validation results. For each expectation suite, Data Docs render: the expected behaviour, the observed statistics from the last validation run, pass/fail status per expectation, and historical validation run results.

Data Docs serve as living documentation for your data assets. Teams looking at a dataset can see both what is expected and whether current data meets those expectations. Host Data Docs on S3, GCS, or an internal web server for team access.

Checkpoints: Running Validation in Pipelines

A checkpoint configuration specifies: which expectation suite to run, against which data source/asset, with which action list on validation results.

Actions include:

- StoreValidationResultAction: persist results to the validation store

- UpdateDataDocsAction: rebuild Data Docs HTML

- SlackNotificationAction: send pass/fail notification to Slack

- PagerdutyAlertAction: trigger PagerDuty on failure

Run a checkpoint from Python: context.run_checkpoint(checkpoint_name="orders_checkpoint")

If any expectations fail and the checkpoint is configured to raise an error, the Python call raises an exception — and any orchestration framework (Airflow, Dagster, Prefect) treats the task as failed.
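The failure path can be sketched in plain Python. This is not GX's implementation: `run_checkpoint`, `ValidationFailed`, and the shape of the result dictionary are assumptions chosen to show the order of operations, where actions run first and the exception is raised afterwards so that results are still stored when validation fails.

```python
from typing import Any, Callable

Result = dict[str, Any]


class ValidationFailed(Exception):
    """Raised so the calling task is marked as failed by the orchestrator."""


def run_checkpoint(
    validation_result: Result,
    actions: list[Callable[[Result], None]],
    raise_on_failure: bool = True,
) -> Result:
    # Run every action first (store results, rebuild docs, notify),
    # so evidence of the failure is kept even when we raise below.
    for action in actions:
        action(validation_result)

    if raise_on_failure and not validation_result["success"]:
        failed = [
            r["expectation"]
            for r in validation_result["results"]
            if not r["success"]
        ]
        raise ValidationFailed(f"failed expectations: {', '.join(failed)}")
    return validation_result


# Hypothetical result with one failing expectation.
failing = {
    "success": False,
    "results": [
        {"expectation": "expect_column_values_to_not_be_null", "success": False},
        {"expectation": "expect_column_to_exist", "success": True},
    ],
}

stored: list[Result] = []  # stand-in for a validation result store
message = ""
try:
    run_checkpoint(failing, actions=[stored.append])
except ValidationFailed as exc:
    message = str(exc)
```

An uncaught exception of this kind is what an orchestrator sees as a failed task, which in turn blocks downstream tasks.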

Integrating with dbt

Great Expectations and dbt address different layers of data quality:

**dbt tests** are best for: referential integrity (relationships test), uniqueness, not-null on specific columns, accepted values. They run in the warehouse and are native to the dbt workflow.

**Great Expectations** is better for: distribution assertions (value ranges, percentiles, row count bounds), cross-column rules, statistical properties, complex regex patterns, and validation of raw source data before dbt models run.

A common pattern: run a GX checkpoint in Airflow before triggering dbt to confirm that source data meets expectations. If the sources fail validation, skip the dbt run so that bad data does not propagate downstream.
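Stripped of the orchestrator, the gate reduces to a few lines. In this sketch `gate_then_transform` and its two callables are hypothetical stand-ins: in a real pipeline `validate_sources` would run a checkpoint and `run_dbt` would invoke dbt, for example through the `dbt_build` helper shown here.

```python
import subprocess
from typing import Callable


def gate_then_transform(
    validate_sources: Callable[[], bool],
    run_dbt: Callable[[], None],
) -> None:
    """Run dbt only when source validation passes; otherwise fail loudly."""
    if not validate_sources():
        raise RuntimeError("source validation failed; dbt run skipped")
    run_dbt()


def dbt_build() -> None:
    # One possible run_dbt implementation; assumes the dbt CLI is on PATH.
    subprocess.run(["dbt", "build"], check=True)


# Demonstration with stubs instead of a real checkpoint and a real dbt call.
calls: list[str] = []
gate_then_transform(lambda: True, lambda: calls.append("dbt"))
try:
    gate_then_transform(lambda: False, lambda: calls.append("dbt"))
except RuntimeError:
    calls.append("blocked")
```

Raising rather than returning a flag matters here: a silent early return would let the orchestrator mark the task as successful even though the transformation never ran.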

The dbt-expectations package ports many GX-style expectations to dbt as macro-based tests. It does not invoke GX itself, but it lets you keep execution inside the dbt workflow while borrowing the expectation vocabulary.

Integrating with Airflow

Add a Great Expectations Airflow operator to validate data at specific pipeline checkpoints:

Use the GreatExpectationsOperator from the great_expectations_provider package (installed as airflow-provider-great-expectations). Set conn_id to your data source connection, data_asset_name to the table or asset to validate, and expectation_suite_name to the suite to run. Set fail_task_on_validation_failure=True so that the Airflow task fails when any expectation fails.

Place validation operators between extraction and transformation tasks, between staging and production promotion, and before delivering data to downstream consumers.

Expectation Design Principles

**Test the data contract, not implementation details**. Expectations should reflect what downstream consumers require from the data, not arbitrary observations about current data distributions.

**Version control suites**. Expectation suite JSON files belong in your data repository. Changes to data contracts become pull requests with review.

**Separate source validation from transformation validation**. Validate raw source data against what the source system promises. Validate transformed data against what analytical consumers require. They are different contracts.

**Avoid over-tight expectations that break on legitimate data change**. An expectation that order amounts are always below 50,000 will start failing the day a large enterprise customer is onboarded. Use ranges with headroom, or use relative distribution assertions rather than absolute bounds.
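One way to build in headroom is to derive the bound from history instead of hard-coding it, and to tolerate a small share of outliers. The sketch below is plain Python rather than GX code. The 99th-percentile basis, the `headroom` factor of 2, and both helper names are assumptions; `passes_mostly` imitates the idea behind the `mostly` keyword that GX column expectations accept.

```python
import statistics


def bounds_with_headroom(history: list[float], headroom: float = 2.0) -> dict:
    """Upper bound = historical 99th percentile multiplied by a headroom factor."""
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    p99 = statistics.quantiles(history, n=100)[98]
    return {"min_value": 0, "max_value": p99 * headroom}


def passes_mostly(
    values: list[float], min_value: float, max_value: float, mostly: float = 0.99
) -> bool:
    """Pass when at least `mostly` of the values fall inside the range."""
    within = sum(min_value <= v <= max_value for v in values)
    return within / len(values) >= mostly


# Hypothetical history of order amounts, then a new batch with one very
# large enterprise order among 200 rows.
history = [float(amount) for amount in range(1, 101)]
bounds = bounds_with_headroom(history)

batch = [50.0] * 199 + [40_000.0]
tolerant = passes_mostly(batch, **bounds)            # allows 1% outliers
strict = passes_mostly(batch, **bounds, mostly=1.0)  # allows none
```

The tolerant check keeps passing when a single legitimate large order arrives, while the strict variant fails on it, which is the trade-off to decide per column.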

Alternatives and When to Use Them

**dbt tests**: Best for uniqueness, referential integrity, not-null, accepted values. Already in your dbt workflow. Start here.

**Soda**: Similar to GX but YAML-first configuration and hosted platform option. Lower setup friction, less Python flexibility.

**Monte Carlo / Acceldata**: Commercial data observability platforms with ML-based anomaly detection — automatic anomaly flagging without writing explicit expectations. More powerful for unknown unknowns; GX is better for explicit contract enforcement.

**SQL-based checks in Airflow**: For simple row count checks and existence queries, plain SQL tasks are simpler than a full GX setup. Reserve GX for complex multi-expectation suites.

Our data architecture practice designs data quality frameworks integrated with your pipeline and transformation stack — contact us to discuss your data quality requirements.
