
Data Quality Tools Compared: Great Expectations, Soda, dbt Tests, and Monte Carlo

James Okafor
Data & Cloud Engineer
August 15, 2026 · 10 min read


Every data team needs a data quality framework. The question is which tool — or combination of tools — to use. The options range from dbt's built-in test framework to dedicated data observability platforms. This guide compares the leading tools: dbt tests, Great Expectations, Soda Core, and Monte Carlo.

dbt tests: the baseline

dbt's built-in test framework is the starting point for most modern data teams. Tests are defined in schema.yml files alongside model definitions and run as part of dbt build or dbt test. Four built-in generic tests cover the most common quality requirements:

- **not_null**: Fails if any value in the column is null

- **unique**: Fails if any value appears more than once

- **accepted_values**: Fails if any value is not in the specified list

- **relationships**: Fails if any value does not exist in the referenced column of another model
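To make the four checks concrete, here is a minimal sketch of the kind of SQL each one boils down to, run against an in-memory SQLite database rather than through dbt. The `orders` and `customers` tables, the sample rows, and the `run_checks` helper are all invented for illustration, and the queries only approximate what dbt compiles; the shared idea is that a test passes when its query returns no rows.

```python
import sqlite3

# Hypothetical tables; in a dbt project these would be models.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, status TEXT);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES
        (100, 1, 'pending'),
        (101, 2, 'completed'),
        (101, 2, 'refunded'),
        (NULL, 9, 'cancelled');
""")

# Each query selects the rows that violate the rule; zero rows means "pass".
CHECKS = {
    "not_null(order_id)": "SELECT * FROM orders WHERE order_id IS NULL",
    "unique(order_id)": """
        SELECT order_id FROM orders
        WHERE order_id IS NOT NULL
        GROUP BY order_id HAVING COUNT(*) > 1""",
    "accepted_values(status)": """
        SELECT * FROM orders
        WHERE status NOT IN ('pending', 'completed', 'cancelled')""",
    "relationships(customer_id)": """
        SELECT o.customer_id FROM orders o
        LEFT JOIN customers c ON o.customer_id = c.customer_id
        WHERE o.customer_id IS NOT NULL AND c.customer_id IS NULL""",
}

def run_checks(connection):
    """Return {check name: number of violating rows}."""
    return {name: len(connection.execute(sql).fetchall())
            for name, sql in CHECKS.items()}

results = run_checks(conn)
for name, failures in results.items():
    status = "PASS" if failures == 0 else f"FAIL ({failures} rows)"
    print(f"{name}: {status}")
```

The sample data is deliberately dirty (a null `order_id`, a repeated `order_id`, an unlisted status, and a customer that does not exist), so each of the four checks reports one violating row.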

Custom generic tests and singular tests (SQL queries that return rows when quality rules are violated) extend this baseline for business-specific rules. dbt-expectations (a community package) adds over 30 additional test types inspired by Great Expectations: expect_column_values_to_be_between, expect_column_to_exist, expect_table_row_count_to_be_between, etc.
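As an illustration of a singular test, the sketch below runs one business-specific query against an in-memory SQLite table. The rule ("an order cannot ship before it was placed"), the `orders` table, and its columns are assumptions made up for this example; in a real project the SELECT statement would live in its own .sql file in the dbt tests directory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, ordered_at TEXT, shipped_at TEXT);
    INSERT INTO orders VALUES
        (1, '2026-01-05', '2026-01-07'),
        (2, '2026-01-06', '2026-01-04'),
        (3, '2026-01-08', NULL);
""")

# Body of a hypothetical singular test: select the rows that break the rule.
# Unshipped orders (NULL shipped_at) are not selected, so they are allowed.
SINGULAR_TEST_SQL = """
    SELECT order_id, ordered_at, shipped_at
    FROM orders
    WHERE shipped_at < ordered_at
"""

violations = conn.execute(SINGULAR_TEST_SQL).fetchall()
print("PASS" if not violations else f"FAIL: {violations}")
```

Only order 2 is returned, so the test fails and the offending row is available for debugging.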

**Strengths**: Zero additional tooling cost, integrated with the dbt transformation workflow, tests run as part of CI/CD on every code change, and failing rows can be stored in the warehouse for querying (with the --store-failures flag).

**Weaknesses**: Tests run only when dbt runs, so they catch quality issues after the data is already in the warehouse, not at ingestion time. Statistical anomaly detection is limited: dbt tests are rule-based assertions, not statistical models. There is no alerting infrastructure beyond CI/CD failure notifications.

**Best for**: Any team using dbt. The baseline for data quality regardless of what additional tools are added.

Great Expectations

Great Expectations (GX) is a Python library for defining, running, and documenting data quality assertions ("expectations") against DataFrames or database tables. The core concept is an expectation suite — a named collection of expectations for a dataset.

GX's distinguishing features:

**Expectation types**: Over 50 built-in expectations covering column values (expect_column_values_to_be_between, expect_column_values_to_match_regex), statistical properties (expect_column_mean_to_be_between, expect_column_quantile_values_to_be_between), and table-level assertions (expect_table_row_count_to_be_between, expect_table_column_count_to_equal).

**Data Docs**: GX generates HTML documentation from expectation suites — browsable reports showing all defined expectations, their current pass/fail status, and historical trends. The Data Docs are useful for sharing quality status with stakeholders.

**Checkpoint system**: A Checkpoint is a runnable configuration that connects expectation suites to data sources and action handlers (email alerts, Slack notifications, webhook calls on failure).
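The relationship between expectations, a suite, and a checkpoint can be sketched in plain Python. This is a conceptual toy and deliberately does not use the Great Expectations API: the `Expectation` class, the helper functions, and `run_checkpoint` are all hypothetical names, and a list stands in for the action handler that would send an email or Slack message.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass
class Expectation:
    name: str
    check: Callable[[list], bool]  # returns True when the rows satisfy the rule

def values_between(column, low, high):
    return Expectation(
        f"{column} values between {low} and {high}",
        lambda rows: all(low <= r[column] <= high for r in rows),
    )

def mean_between(column, low, high):
    return Expectation(
        f"{column} mean between {low} and {high}",
        lambda rows: low <= mean(r[column] for r in rows) <= high,
    )

def row_count_between(low, high):
    return Expectation(
        f"row count between {low} and {high}",
        lambda rows: low <= len(rows) <= high,
    )

def run_checkpoint(suite, rows, on_failure):
    """Validate rows against every expectation; call on_failure for each miss."""
    results = {e.name: e.check(rows) for e in suite}
    for name, ok in results.items():
        if not ok:
            on_failure(name)
    return results

# A "suite" is just a named collection of expectations for one dataset.
orders_suite = [
    values_between("amount", 0, 10_000),
    mean_between("amount", 50, 500),
    row_count_between(1, 1_000),
]
rows = [{"amount": 120.0}, {"amount": 80.0}, {"amount": -5.0}]

alerts = []  # stand-in for an email/Slack/webhook action
results = run_checkpoint(orders_suite, rows, on_failure=alerts.append)
print(results)
print("alerts:", alerts)
```

With this sample data the mean and row-count expectations pass, while the negative amount fails the value-range expectation and triggers one alert. GX layers data source connections, result stores, and Data Docs on top of this basic shape.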

**Strengths**: Extensive expectation library, data documentation generation, strong Python ecosystem integration, open-source.

**Weaknesses**: Steeper setup curve than dbt tests: it requires configuring data sources, expectation suites, checkpoints, and a store for results. The API has also changed significantly across releases (from the V2 to the V3 API within the 0.x series, and again with GX 1.0), so legacy documentation and examples can be confusing.

**Best for**: Teams that need rich statistical expectations beyond what dbt-expectations provides, or that want programmatic expectation management in Python.

Soda Core

Soda Core (the open-source library) and Soda Cloud (the managed platform) provide data quality checks defined in YAML (SodaCL — Soda Checks Language). SodaCL is designed to be readable by non-engineers — a business stakeholder can review a SodaCL check without Python knowledge.

SodaCL check types include: missing values, invalid values, duplicates, freshness, row counts, custom SQL assertions, schema change detection, and distribution checks. Soda Cloud adds: alerting, historical trend visualisation, anomaly detection, and team collaboration features.
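Of the check types above, schema change detection is the easiest to picture as code. The sketch below compares an expected column-to-type mapping with the one observed in the warehouse; it is not Soda's implementation or syntax (in Soda the rule would be written in SodaCL), and the `diff_schema` function and the example columns are hypothetical.

```python
def diff_schema(expected, actual):
    """Compare two {column: type} mappings and report drift."""
    return {
        "added": sorted(actual.keys() - expected.keys()),
        "removed": sorted(expected.keys() - actual.keys()),
        "type_changed": sorted(
            col for col in expected.keys() & actual.keys()
            if expected[col] != actual[col]
        ),
    }

# Hypothetical snapshots of the same table on two different days.
expected = {"order_id": "integer", "amount": "numeric", "status": "text"}
actual = {"order_id": "integer", "amount": "text", "created_at": "timestamp"}

drift = diff_schema(expected, actual)
print(drift)
```

A check built on this idea would fail or warn whenever any of the three lists is non-empty, depending on how strict the team wants to be about each kind of change.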

**Strengths**: SodaCL's readable YAML syntax is accessible to data stewards and analysts who are not Python developers. Soda Cloud provides a purpose-built UI for quality monitoring. Native integration with dbt (run Soda checks after dbt transformations in the same pipeline). Strong schema drift detection.

**Weaknesses**: Soda Core is free; Soda Cloud is a commercial product with per-dataset pricing. The managed platform is necessary for the most valuable features (alerting, anomaly detection, history).

**Best for**: Teams that want non-engineers to be able to define and review quality rules, or that need a managed quality monitoring UI without building their own.

Monte Carlo

Monte Carlo is a data observability platform that applies ML-based anomaly detection to your data without requiring manual rule definition. Rather than asking you to define that "revenue must be > 0", Monte Carlo learns the normal distribution and range of your data and alerts when the data deviates significantly from that baseline.

Monte Carlo's automatic monitoring covers: volume anomalies (row count changes), freshness anomalies (table not updated when expected), schema changes (columns added, removed, or renamed), distribution changes (value distributions shift), and query activity changes.
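The simplest way to see how a learned baseline differs from a hand-written rule is a volume check. The sketch below flags a daily row count that sits far from the historical mean using a z-score. This is a stand-in technique chosen for illustration, not Monte Carlo's actual models, and the function name, threshold, and row counts are assumptions.

```python
from statistics import mean, stdev

def is_volume_anomaly(history, today, threshold=3.0):
    """Flag today's row count if it is more than `threshold` standard
    deviations away from the mean of the historical counts."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # No historical variation: any change at all is unusual.
        return today != mu
    return abs(today - mu) / sigma > threshold

# Hypothetical row counts for the last seven daily loads of one table.
daily_row_counts = [10_120, 9_980, 10_050, 10_210, 9_890, 10_075, 10_005]

normal_day = is_volume_anomaly(daily_row_counts, 10_100)
broken_load = is_volume_anomaly(daily_row_counts, 4_200)
print(normal_day, broken_load)
```

Nobody had to write a rule saying "expect roughly 10,000 rows": the expected range comes from the history itself. A production system would also need to handle seasonality, trends, and sparse history, which a plain z-score does not.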

**Strengths**: Automatic detection without rule writing — works out of the box on any table. Catches the unknown unknowns that rule-based testing misses (data that changes in unexpected ways). Strong lineage integration — incidents surface which upstream sources or downstream consumers are affected.

**Weaknesses**: Enterprise pricing. Not a replacement for rule-based testing — catches statistical anomalies but does not validate business rules (e.g., "order status must be one of: pending, completed, cancelled"). Best used alongside, not instead of, dbt tests.

**Best for**: Mid-market and enterprise data teams that want to reduce monitoring blind spots without writing rules for every possible failure mode. The complement to dbt tests and GX for detecting changes that break downstream analytics without obvious rule violations.

Choosing the right combination

The practical recommendation for most teams:

**Start with dbt tests**: not_null, unique, relationships, and accepted_values on every critical model. Add dbt-expectations for statistical checks. This adds no tooling cost and fits into the existing workflow.

**Add Soda Core or Great Expectations when**: You need richer checks than dbt provides, you want non-engineers to own quality rules, or you need quality checks on source data before it reaches dbt models.

**Add Monte Carlo (or Anomalo, a similar tool) when**: You have significant analytical traffic against your data, the cost of undiscovered data quality issues is high, and you want automatic detection of anomalies across the full data platform.

The tiered approach — dbt tests as foundation, a rule-based tool for complex business rules, an observability platform for anomaly detection — provides defence in depth: known rules enforced explicitly, unexpected changes caught automatically.

For the DataOps context, see dataops guide and data quality framework. Our data architecture consulting practice designs and implements data quality frameworks — book a free review to assess your current quality coverage.
