Data Contracts: How to Implement Them and Why Your Pipeline Needs Them

Data contracts are explicit agreements between data producers and consumers about what data will be provided, in what format, and with what quality guarantees. Without them, schema changes silently break downstream pipelines, and the business discovers data quality issues before the data team does. This guide covers practical implementation.

A data contract is an explicit, machine-readable agreement between a data producer and a data consumer. The producer commits to delivering data in a specified format, with specified quality characteristics, on a specified schedule. The consumer depends on those commitments. When the contract is violated, the violation is detected and handled before it causes silent data quality issues downstream.

Most data quality problems in organisations without data contracts trace to the same root cause: one team changed something they did not know another team depended on. A column was renamed. A data type changed from integer to string. A field that was never null started containing nulls because an upstream application change removed a required field validation. None of these were malicious. None were communicated. All of them broke downstream pipelines in ways that produced incorrect analytical results without errors.

What a Data Contract Contains

A complete data contract specifies:

**Schema**: Column names, data types, and nullability. The consumer can rely on these columns existing with these types. The producer will not rename columns, change types, or remove columns without notifying the consumer and coordinating a change.

**Semantics**: What each column means. A 'status' column with possible values 'active', 'inactive', 'suspended' means something specific. The contract documents the business meaning, the allowed values, and any historical change in meaning.

**Freshness**: When data will be available. If the producer runs a daily job that loads data by 6 AM, the consumer can depend on this. If the job is delayed, the consumer is notified before depending on stale data.

**Quality thresholds**: Acceptable ranges for key quality metrics. The 'order_id' column is always unique. The 'revenue' column is never negative. The 'customer_id' foreign key always has a matching record. The null rate for 'email_address' is below 5%.

**Change management process**: How schema changes are proposed, reviewed, and communicated. Typically: producer notifies consumer with a defined lead time (two-sprint notice minimum); consumer reviews impact and approves or requests modification; change is deployed in a coordinated release.
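
The components above can be captured in a single machine-readable object. The following is a minimal sketch in Python, not a standard format: the names `ColumnSpec` and `DataContract`, the `orders` dataset, and the `checkout-team` owner are all hypothetical. Many teams store the same information as YAML or JSON under version control instead.

```python
from dataclasses import dataclass, field
from datetime import time


@dataclass(frozen=True)
class ColumnSpec:
    """One column of the contract: schema plus semantics."""
    name: str
    dtype: str                      # logical type, e.g. "integer", "string"
    nullable: bool = False
    description: str = ""           # business meaning of the column
    allowed_values: tuple = ()      # empty tuple means "no enumeration"


@dataclass(frozen=True)
class DataContract:
    """Hypothetical container for the components listed above."""
    dataset: str
    owner: str                      # producing team
    columns: tuple                  # tuple of ColumnSpec
    available_by: time              # freshness commitment
    max_null_rate: dict = field(default_factory=dict)  # column -> threshold
    change_notice_sprints: int = 2  # change-management lead time

    def column(self, name):
        """Look up a column spec by name; raises KeyError if absent."""
        for col in self.columns:
            if col.name == name:
                return col
        raise KeyError(name)


orders_contract = DataContract(
    dataset="orders",
    owner="checkout-team",
    columns=(
        ColumnSpec("order_id", "integer", description="Unique order key"),
        ColumnSpec("status", "string",
                   allowed_values=("active", "inactive", "suspended")),
        ColumnSpec("email_address", "string", nullable=True),
    ),
    available_by=time(6, 0),
    max_null_rate={"email_address": 0.05},
)
```

Keeping the contract as data rather than prose means the same object can drive validation, documentation, and change review.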

Lightweight Implementation: dbt Sources and Tests

For teams already using dbt, data contracts can be implemented using native dbt features without additional tooling:

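**Source definitions** in dbt's schema.yml declare the expected schema of upstream data sources. Declaring a column does not by itself make dbt verify that the column exists, but any test attached to that column errors when the column is renamed or removed, so the change surfaces visibly in the next run. Source freshness checks cover the freshness commitment separately. For models the team builds itself, dbt 1.5 and later also offer model contracts (`contract: enforced: true`), which fail the build when a model's column names or data types differ from those declared.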
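
The underlying idea is a comparison between the columns a consumer declared and the columns the source actually exposes. The sketch below shows that comparison in plain Python. It is not dbt's implementation, and `schema_drift` and the column names are hypothetical.

```python
def schema_drift(declared, actual):
    """Compare declared column types with the columns actually present.

    Both arguments map column name -> type name. Returns a list of
    human-readable problems; an empty list means no drift was found.
    """
    problems = []
    for name, dtype in declared.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != dtype:
            problems.append(
                f"type changed: {name} declared {dtype}, found {actual[name]}"
            )
    return problems


declared = {"order_id": "integer", "customer_id": "integer", "status": "string"}
# Hypothetical upstream change: customer_id became a string, status was renamed.
actual = {"order_id": "integer", "customer_id": "string", "state": "string"}

for problem in schema_drift(declared, actual):
    print(problem)
```

Extra columns in `actual` are deliberately ignored here, on the assumption that additive changes do not break existing consumers.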

**dbt tests** enforce quality constraints at each pipeline run:

- 'unique' test: column values are unique — enforces primary key constraints

- 'not_null' test: column contains no null values — enforces non-nullable contract

- 'accepted_values' test: column contains only specified values — enforces enumeration contracts

- 'relationships' test: foreign key values have matching records in the referenced table

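These tests run as part of the dbt job and fail the run if a contract violation is detected. The failure is specific: it names the test, the model, and the column, and reports how many records failed. A not_null test on the customer_id column of the orders model, for example, would report 247 failing records if 247 rows had a null customer_id.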
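
To make the meaning of the four tests concrete, here is a plain-Python sketch of what each one asserts. This is not how dbt implements them (dbt compiles each test to a SQL query that returns failing rows); the function names and sample rows are hypothetical.

```python
from collections import Counter


def unique(rows, column):
    """Return values that appear more than once (nulls are ignored)."""
    counts = Counter(r[column] for r in rows if r[column] is not None)
    return [value for value, n in counts.items() if n > 1]


def not_null(rows, column):
    """Return the rows whose value in `column` is null."""
    return [r for r in rows if r[column] is None]


def accepted_values(rows, column, values):
    """Return non-null values that fall outside the allowed set."""
    return [r[column] for r in rows
            if r[column] is not None and r[column] not in values]


def relationships(rows, column, parent_keys):
    """Return non-null foreign keys with no match in the parent table."""
    return [r[column] for r in rows
            if r[column] is not None and r[column] not in parent_keys]


orders = [
    {"order_id": 1, "customer_id": 10, "status": "active"},
    {"order_id": 2, "customer_id": None, "status": "inactive"},
    {"order_id": 2, "customer_id": 99, "status": "archived"},
]
customer_ids = {10, 11}

failures = {
    "unique(order_id)": unique(orders, "order_id"),
    "not_null(customer_id)": not_null(orders, "customer_id"),
    "accepted_values(status)": accepted_values(
        orders, "status", {"active", "inactive", "suspended"}),
    "relationships(customer_id)": relationships(
        orders, "customer_id", customer_ids),
}
for test_name, failing in failures.items():
    print(test_name, "FAILED" if failing else "PASSED", len(failing))
```

Each function returns the offending records, not a boolean, so the failure message can say exactly what broke and how often.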

For custom quality thresholds (null rate below 5%, revenue within expected range), custom generic tests or singular tests extend the framework.
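
The logic of such a threshold test is simple arithmetic over a batch of rows. A hedged sketch in Python follows; in dbt the same logic would be written as SQL, and `null_rate`, `check_thresholds`, and the sample data are hypothetical.

```python
def null_rate(rows, column):
    """Fraction of rows where `column` is null; 0.0 for an empty input."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r[column] is None) / len(rows)


def check_thresholds(rows, max_null_rate, non_negative=()):
    """Return a list of threshold violations for one batch of rows."""
    violations = []
    for column, limit in max_null_rate.items():
        rate = null_rate(rows, column)
        if rate >= limit:  # the contract requires the rate to stay below the limit
            violations.append(
                f"{column}: null rate {rate:.1%} is not below {limit:.1%}"
            )
    for column in non_negative:
        bad = sum(1 for r in rows if r[column] is not None and r[column] < 0)
        if bad:
            violations.append(f"{column}: {bad} negative value(s)")
    return violations


# 20 sample rows: two null emails (10%) and one negative revenue.
rows = [
    {
        "email_address": None if i < 2 else f"user{i}@example.com",
        "revenue": -5.0 if i == 0 else 10.0,
    }
    for i in range(20)
]

for violation in check_thresholds(
    rows, {"email_address": 0.05}, non_negative=("revenue",)
):
    print(violation)
```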

Schema Registry for Event Streams

For event-driven architectures where data flows through Kafka or other message queues, schema registries enforce contracts at the serialisation level. Apache Avro with Confluent Schema Registry is a widely used pattern.

The schema registry stores the Avro schema for each Kafka topic. Producers serialise events against the registered schema, and consumers deserialise using it. When a producer attempts to register a new schema version that breaks the configured compatibility rule (for example, changing a field's type incompatibly, or adding or removing a field that has no default, depending on the rule), the registry rejects the registration and the producer's serialiser fails before the event is published.

The registry supports configurable compatibility rules:

- **BACKWARD**: New schema can read data written with old schema (consumers upgrade first and can still read existing data)

- **FORWARD**: Old schema can read data written with new schema (safe for producers that are ahead of consumers)

- **FULL**: Both backward and forward compatible (safest, most restrictive)
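
The three rules reduce to one question asked in two directions: can a reader using schema A decode data written with schema B? The sketch below models that for flat record schemas in Python. It is a deliberate simplification of Avro schema resolution (no type promotion, unions, aliases, or nested records) and is not the registry's implementation; `can_read` and `is_compatible` are hypothetical names.

```python
def can_read(reader, writer):
    """True if `reader` can decode data written with `writer`.

    Schemas are simplified to {field_name: {"type": str, "default": ...}}.
    A reader field missing from the writer is only resolvable if the
    reader declares a default; shared fields must have the same type.
    """
    for name, spec in reader.items():
        if name not in writer:
            if "default" not in spec:
                return False
        elif writer[name]["type"] != spec["type"]:
            return False
    return True


def is_compatible(new, old, mode):
    """Check `new` against the previous schema `old` under one rule."""
    backward = can_read(reader=new, writer=old)
    forward = can_read(reader=old, writer=new)
    if mode == "BACKWARD":
        return backward
    if mode == "FORWARD":
        return forward
    if mode == "FULL":
        return backward and forward
    raise ValueError(f"unknown mode: {mode}")


old = {"order_id": {"type": "long"}, "status": {"type": "string"}}
added_no_default = {**old, "channel": {"type": "string"}}
added_with_default = {**old, "channel": {"type": "string", "default": "web"}}
removed_status = {"order_id": {"type": "long"}}

for label, new in [
    ("add field, no default", added_no_default),
    ("add field with default", added_with_default),
    ("remove field", removed_status),
]:
    results = {m: is_compatible(new, old, m) for m in ("BACKWARD", "FORWARD", "FULL")}
    print(label, results)
```

In this simplified model, adding a field without a default passes only FORWARD, removing a field without a default passes only BACKWARD, and adding a field with a default passes all three.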

Schema registry-enforced contracts catch breaking changes at the producer before they reach consumers — earlier in the pipeline than dbt tests, which run after data is loaded.

Contract-Driven Development Pattern

The most effective approach to data contracts is defining them before implementation rather than retroactively documenting existing schemas:

1. The consuming team defines what they need: specific columns, types, quality guarantees, freshness requirements

2. The producing team reviews, negotiates any constraints that are impractical to guarantee, and commits to what they can deliver

3. The contract is documented and committed to version control alongside the pipeline code

4. Tests enforcing the contract are written as part of pipeline development

5. Contract violations fail the build or the pipeline run — they do not reach production silently

This is contract-driven development applied to data: the interface is defined before implementation rather than inferred from it.
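
Steps 3 to 5 can be sketched as a contract file kept in the repository plus a validation entry point whose return code fails the build. The example below is an illustration under stated assumptions: the JSON layout, the `contracts/orders.json` path mentioned in the comment, and the `validate` and `main` functions are all hypothetical.

```python
import json

# In a real repository this would live in a file next to the pipeline code,
# e.g. contracts/orders.json (hypothetical path); inlined here to stay runnable.
CONTRACT_JSON = """
{
  "dataset": "orders",
  "columns": {
    "order_id": {"type": "int", "nullable": false},
    "revenue": {"type": "float", "nullable": false},
    "email_address": {"type": "str", "nullable": true}
  }
}
"""

PYTHON_TYPES = {"int": int, "float": float, "str": str}


def validate(rows, contract):
    """Return a list of violations of `contract` found in `rows`."""
    violations = []
    for i, row in enumerate(rows):
        for name, spec in contract["columns"].items():
            if name not in row:
                violations.append(f"row {i}: missing column {name}")
                continue
            value = row[name]
            if value is None:
                if not spec["nullable"]:
                    violations.append(f"row {i}: {name} is null")
            elif not isinstance(value, PYTHON_TYPES[spec["type"]]):
                violations.append(f"row {i}: {name} is not {spec['type']}")
    return violations


def main(rows):
    """Entry point for CI: a non-zero return value fails the build."""
    contract = json.loads(CONTRACT_JSON)
    violations = validate(rows, contract)
    for violation in violations:
        print(f"CONTRACT VIOLATION [{contract['dataset']}]: {violation}")
    return 1 if violations else 0
```

A CI job would call something like `sys.exit(main(rows))` on a sample of pipeline output, so a violation stops the build instead of reaching production.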

Handling Contract Violations

When a contract violation is detected:

**Fail loudly**: The pipeline run fails with a specific error message identifying the violated contract. Downstream dependencies are not run against data that fails contract validation.

**Alert immediately**: The producing team is notified that their data violated a contract downstream. This is information they need even if they were not the direct cause.

**Do not skip or suppress**: Tempting under deadline pressure, but suppressing contract violation alerts produces the exact problem contracts are designed to prevent. A skipped alert is an undetected data quality issue.

For violations where the consumer can continue with degraded data (a freshness delay where stale data is acceptable), the contract should specify the degraded mode explicitly. Do not handle degraded modes silently.
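
The three rules and the explicit degraded mode can be combined in one small check. The following Python sketch assumes a freshness contract with two limits, a normal maximum age and a longer degraded maximum; `ContractViolation`, `enforce_freshness`, and the `notify` callback are hypothetical names.

```python
from datetime import datetime, timedelta


class ContractViolation(Exception):
    """Raised to stop the run so downstream steps never see bad data."""


def enforce_freshness(loaded_at, now, max_age, degraded_max_age, notify):
    """Apply a freshness contract with an explicit degraded mode.

    Returns "ok" or "degraded"; raises ContractViolation otherwise.
    `notify` is a hypothetical callback that alerts the producing team.
    """
    age = now - loaded_at
    if age <= max_age:
        return "ok"
    message = f"data is {age} old, contract allows {max_age}"
    notify(message)                   # alert on every breach, degraded or not
    if age <= degraded_max_age:
        return "degraded"             # allowed only because the contract says so
    raise ContractViolation(message)  # fail loudly; downstream steps do not run


alerts = []
status = enforce_freshness(
    loaded_at=datetime(2024, 1, 2, 1, 0),
    now=datetime(2024, 1, 2, 9, 0),
    max_age=timedelta(hours=6),
    degraded_max_age=timedelta(hours=24),
    notify=alerts.append,
)
print(status, alerts)
```

The caller receives the degraded status as an explicit return value and the producer is alerted either way, so stale data is never consumed silently.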

Organisational Prerequisites

Data contracts require a minimum of organisational infrastructure to function:

**Ownership is clear**: Every data set has an identified producer team responsible for the contract. Without ownership, there is no one to notify when a contract is violated and no one accountable for maintaining it.

**Change management process exists**: Both the producing and consuming teams have a process for communicating and coordinating schema changes. The contract is only as good as the process for updating it.

**Tests run in CI/CD**: Contract validation tests must run automatically, not manually. If tests run only when someone remembers to run them, they do not function as contracts.

Our data architecture practice designs data contract frameworks and implements them in dbt, schema registry, and custom validation layers — contact us to discuss how to bring data contract discipline to your pipeline environment.
