BlogData Engineering

Soda Data Quality: YAML-Driven Data Checks for Analytics Pipelines

James Okafor
James Okafor
Lead Data Engineer
·January 12, 202811 min read

Soda is a data quality testing framework with a YAML-first configuration model — define checks in SodaCL (Soda Checks Language) without Python, run them against your warehouse, and integrate the results into Airflow, dbt, and CI pipelines. This guide covers SodaCL syntax, scan configuration, alert routing, Soda Cloud for centralised monitoring, and how Soda compares to Great Expectations.

What Soda Is

Soda is a data quality framework that takes a YAML-first approach to writing data checks — instead of Python code, you write checks in SodaCL (Soda Checks Language), a domain-specific language designed for data quality assertions. SodaCL checks are more readable than Python function calls, making them accessible to analytics engineers and data analysts without deep Python experience.

Soda connects directly to your data warehouse (Snowflake, BigQuery, Redshift, PostgreSQL, Databricks, and others) and executes checks as optimised SQL queries. Check results can be sent to Soda Cloud for centralised monitoring, routed to Slack or PagerDuty, or used as pipeline gates in Airflow and dbt workflows.

SodaCL Syntax

SodaCL checks are defined in YAML files, one file per dataset or check suite:

checks for orders:

- row_count > 0

- missing_count(customer_id) = 0

- duplicate_count(order_id) = 0

- min(order_amount) >= 0

- max(order_amount) < 1000000

- avg(order_amount) between 50 and 500

- invalid_count(status) = 0:

valid values: [pending, processing, shipped, delivered, cancelled, refunded]

- freshness(created_at) < 2h

**Check types**:

- row_count: minimum row count for table to be considered populated

- missing_count / missing_percent: null check on specific column

- duplicate_count / duplicate_percent: uniqueness check on one or more columns

- min / max / avg / sum: statistical range checks on numeric columns

- invalid_count / invalid_percent: value set checks — valid values, valid regex, valid format

- freshness: time elapsed since most recent value in a timestamp column

**Threshold syntax**:

- row_count > 0: simple comparison

- avg(order_amount) between 50 and 500: range

- missing_percent(email) < 5%: percentage threshold

- duplicate_count(order_id) = 0: exact value

**Multi-column uniqueness**:

checks for orders:

- duplicate_count(order_id, line_item_id) = 0

**Cross-dataset reference checks**:

checks for orders:

- values in (customer_id) must exist in customers (id)

Referential integrity check — every customer_id in orders must exist in the customers table.

Scan Configuration

A Soda scan runs checks against a data source. Configure the data source connection in a configuration YAML file:

soda_cloud:

host: cloud.soda.io

api_key_id: your_soda_api_key_id

api_key_secret: your_soda_api_key_secret

data_sources:

snowflake_prod:

type: snowflake

account: your_snowflake_account

username: your_snowflake_user

password: your_snowflake_password

database: ANALYTICS

schema: MARTS

warehouse: BI_WH

Run a scan: soda scan -d snowflake_prod -c configuration.yml checks/orders.yml

Soda executes the checks as SQL queries against the warehouse, returns pass/fail per check, and sends results to Soda Cloud if configured.

Soda vs Great Expectations

**Configuration style**: Soda uses YAML SodaCL — readable, concise, no Python required for standard checks. Great Expectations uses Python API or JSON expectation suite files. For teams where not everyone is comfortable with Python, Soda is more accessible.

**Flexibility**: Great Expectations offers more flexibility for complex custom assertions — custom expectations in Python, complex conditional logic, Pandas-based checks on DataFrames. SodaCL covers the common cases; edge cases require Soda's Python API extension.

**Integration**: Both integrate with Airflow, dbt, and CI. Great Expectations has broader connector support (Spark DataFrames, in-memory Pandas). Soda targets warehouse-native SQL execution — better for warehouse-first architectures.

**Soda Cloud**: Centralised monitoring dashboard — all scan results across all data sources, failed check history, trends over time, alert routing. Great Expectations Data Docs provide similar documentation but require self-hosting. Soda Cloud is managed and included in Soda's paid tier.

**Cost**: Soda has a free self-hosted tier and a paid Soda Cloud tier for centralised monitoring. Great Expectations is fully open-source.

Integrating Soda with Airflow

Run a Soda scan as an Airflow task using the BashOperator or the Soda Airflow provider:

from soda.core.scan import Scan

def run_soda_scan():

scan = Scan()

scan.set_data_source_name("snowflake_prod")

scan.add_configuration_yaml_file("config/soda_config.yml")

scan.add_sodacl_yaml_file("checks/orders.yml")

scan.execute()

if scan.get_error_logs():

raise Exception("Soda scan found errors")

if scan.has_check_failures():

raise Exception("Soda scan found check failures")

soda_task = PythonOperator(

task_id="soda_orders_check",

python_callable=run_soda_scan,

dag=dag

)

extract_task >> soda_task >> transform_task

Soda scan runs after extraction and before transformation. If checks fail, the Airflow task fails and downstream transformation tasks do not execute.

Integrating Soda with dbt

Two integration patterns:

**Pattern 1 — Soda checks after dbt run**: Run the dbt models, then run Soda checks against the output tables. Validates transformation output quality.

**Pattern 2 — dbt-soda package**: The dbt-soda package integrates Soda checks as dbt tests using a custom dbt macro. Run soda checks as part of dbt test. Centralises test execution in the dbt workflow.

Soda and dbt tests are complementary: dbt tests handle uniqueness, referential integrity, and not-null (built-in dbt tests). Soda handles distribution assertions, statistical range checks, freshness, and cross-dataset referential integrity that dbt test types don't cover directly.

Alert Routing

When checks fail, Soda can route alerts to:

**Soda Cloud notifications**: Email or Slack notification per failed check, with full scan results in the Soda Cloud UI.

**Custom webhooks**: HTTP POST with check results payload — route to PagerDuty, Opsgenie, or any webhook-receiving alerting system.

**Airflow pipeline gate**: Task failure propagates upstream failure in DAG.

Configure alert severity: error (blocks pipeline, triggers alert), warn (logs result, does not block), pass (no action). For checks with known distribution variability, set warn thresholds below the error threshold — warn when a metric drifts, error when it breaches a hard limit.

Check Design Principles

**Check the contract, not the history**: Write checks based on what downstream consumers require from the data, not based on historical data distribution. A check that average order amount must stay between $50 and $500 will fail the day a large enterprise client places their first order.

**Separate source freshness from content quality**: freshness(updated_at) checks validate that the source pipeline is delivering recent data. Value range checks validate content quality. These answer different questions and have different remediation paths.

**Start with critical tables**: Instrument the 5-10 tables that, if corrupted or stale, would cause the most business impact. Expand coverage incrementally.

**Review and update checks after schema changes**: Checks against column names that change will error, not fail gracefully. Version control check files alongside transformation code and review on schema migrations.

Our data architecture practice designs data quality frameworks integrated with dbt and Airflow pipeline stacks — contact us to discuss your data quality requirements.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →