Data Contracts: What They Are and How to Implement Them

Data contracts define the agreement between data producers and consumers — schema, SLAs, quality guarantees, and ownership. Here is the practical guide to implementing them in a production data platform.

The quick answer

A data contract is a formal agreement between a data producer (the team or system that generates and publishes a data asset) and data consumers (the teams or systems that use it). The contract specifies what the producer commits to: the schema (fields, types, constraints), the delivery SLA (freshness, availability), the data quality guarantees (null rate thresholds, value ranges), the ownership (who to contact when something breaks), and the versioning policy (how changes will be communicated). Data contracts are the mechanism that makes a data mesh or a data product model operational — without them, distributed ownership of data produces unreliable pipelines.

Why data contracts matter

In most data organisations, the implicit contract is that the upstream team will keep providing the data roughly as it is today, and the downstream team will notice when it changes. This works when there are two tables and three people. It breaks down at scale, when there are 200 source tables, 50 consumer pipelines, and upstream teams who do not know or care about downstream impacts.

Without formal contracts, a violation typically plays out like this: a source team renames a column, the downstream dbt model breaks, the dashboard shows nulls, and three days later the VP of Sales notices the daily report is wrong. This cycle of silent upstream change, downstream failure, and incident response is one of the most common failure modes in data pipelines at scale.

Data contracts prevent this by making commitments explicit, automating contract validation, and establishing clear accountability.

What a data contract includes

**Schema definition**: the column names, data types, nullable constraints, and descriptions for every field the producer commits to providing. Schema contracts prevent the most common breaking change — column renames and type changes. The schema definition is machine-readable so it can be validated automatically.
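As a minimal sketch of what machine-readable schema validation can look like, the snippet below compares a declared schema with the columns actually observed in a table. The `Column` class, the `ORDERS_CONTRACT` list, and the type names are hypothetical and chosen only for illustration; a real platform would read the observed schema from the warehouse's metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str
    nullable: bool = True


def schema_violations(contract: list[Column], observed: dict[str, str]) -> list[str]:
    """Compare the declared columns with an observed column -> type mapping."""
    problems = []
    for col in contract:
        if col.name not in observed:
            problems.append(f"missing column: {col.name}")
        elif observed[col.name] != col.dtype:
            problems.append(
                f"type mismatch on {col.name}: "
                f"expected {col.dtype}, got {observed[col.name]}"
            )
    return problems


# Hypothetical contract for an orders table.
ORDERS_CONTRACT = [
    Column("order_id", "bigint", nullable=False),
    Column("customer_id", "bigint", nullable=False),
    Column("amount", "decimal"),
]

# Simulated upstream change: customer_id was renamed to cust_id.
observed = {"order_id": "bigint", "cust_id": "bigint", "amount": "decimal"}
print(schema_violations(ORDERS_CONTRACT, observed))
```

The check only looks at columns the producer has committed to, so extra columns in the table do not count as violations in this sketch.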

**Freshness SLA**: how current will the data be? "Updated within 1 hour of the operational system" is a specific, testable commitment. "Refreshed nightly" is a vague commitment that produces confusion when the nightly job runs at 4 AM vs 11 PM. Define the expected delivery window and what constitutes a violation.
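A freshness commitment such as "updated within 1 hour" can be tested with very little code. The sketch below assumes the pipeline records a `last_loaded_at` timestamp for each delivery; that name and the example timestamps are illustrative, not part of any specific tool.

```python
from datetime import datetime, timedelta, timezone


def freshness_violated(last_loaded_at: datetime, now: datetime, max_lag: timedelta) -> bool:
    """Return True when the newest load is older than the agreed maximum lag."""
    return now - last_loaded_at > max_lag


MAX_LAG = timedelta(hours=1)  # the "within 1 hour" commitment from the text
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

on_time = datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc)  # 30 minutes old
late = datetime(2024, 1, 1, 10, 15, tzinfo=timezone.utc)     # 105 minutes old

print(freshness_violated(on_time, now, MAX_LAG))
print(freshness_violated(late, now, MAX_LAG))
```

Passing `now` explicitly, rather than calling the clock inside the function, keeps the check deterministic and easy to test.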

**Quality guarantees**: specific, measurable commitments about data quality. Null rate below 0.5% on customer_id. Row count between 10,000 and 100,000 daily. Referential integrity: every orders.customer_id must exist in customers.customer_id. These are the assertions that automated monitoring tests continuously.
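The three example guarantees can be expressed as plain assertions. The sketch below runs them over in-memory rows (lists of dicts) purely for illustration; the function names are hypothetical, and in practice the same checks would usually be pushed down to the warehouse as SQL or run by a data quality tool.

```python
def null_rate(rows, column):
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def orphan_keys(child_rows, child_key, parent_rows, parent_key):
    """Non-null child keys that have no matching parent row."""
    parent_ids = {row[parent_key] for row in parent_rows}
    return {
        row[child_key]
        for row in child_rows
        if row.get(child_key) is not None and row[child_key] not in parent_ids
    }


def check_quality(orders, customers, max_null_rate=0.005, min_rows=10_000, max_rows=100_000):
    """Return one message per violated guarantee; an empty list means all passed."""
    failures = []
    rate = null_rate(orders, "customer_id")
    if rate > max_null_rate:
        failures.append(f"null_rate: customer_id is {rate:.1%} null")
    if not min_rows <= len(orders) <= max_rows:
        failures.append(f"row_count: {len(orders)} rows outside [{min_rows}, {max_rows}]")
    orphans = orphan_keys(orders, "customer_id", customers, "customer_id")
    if orphans:
        failures.append(f"referential_integrity: unknown customer_id values {sorted(orphans)}")
    return failures


# Tiny illustrative dataset; row-count bounds are loosened to fit it.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": None},  # null key
    {"order_id": 12, "customer_id": 3},     # no such customer
]
for failure in check_quality(orders, customers, min_rows=1, max_rows=10):
    print(failure)
```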

**Ownership**: who is responsible for this data asset. An email address or Slack handle, not just a team name. Who gets paged when the contract is violated? Who approves schema changes? Ownership without a named person is not ownership.

**Versioning policy**: how will changes to the contract be communicated? For breaking changes (removing a column, changing a type), what advance notice will consumers receive? What is the deprecation policy? Semantic versioning (major.minor.patch) applied to data contracts enables consumers to understand the nature of changes.
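One way to make the versioning policy mechanical is to derive the required version bump from a schema diff. The sketch below treats removed or retyped columns as breaking (major), as described above, and assumes that added columns are a minor change and anything else is a patch. That minor/patch mapping is a convention chosen for illustration, not something mandated by any standard.

```python
def required_bump(old: dict[str, str], new: dict[str, str]) -> str:
    """Classify a schema change given column -> type mappings."""
    removed = old.keys() - new.keys()
    retyped = {col for col in old.keys() & new.keys() if old[col] != new[col]}
    added = new.keys() - old.keys()
    if removed or retyped:
        return "major"  # breaking: consumers may fail
    if added:
        return "minor"  # assumed non-breaking: additive change
    return "patch"      # no schema change, e.g. a description fix


def bump(version: str, level: str) -> str:
    """Apply a major/minor/patch bump to a 'major.minor.patch' string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


old_schema = {"order_id": "bigint", "customer_id": "bigint"}
renamed = {"order_id": "bigint", "cust_id": "bigint"}  # a rename is a removal plus an addition
print(bump("1.4.2", required_bump(old_schema, renamed)))
```

Because a rename shows up as one removed column plus one added column, it is classified as a major change without any special handling.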

**Access and classification**: the data classification (public, internal, confidential, restricted), the applicable governance policies (PII columns identified, GDPR/HIPAA status), and the access grant model.

Implementation approaches

**Schema-first contracts (Protobuf, Avro, JSON Schema)**: for streaming data (Kafka topics), the message schema is the contract. Define the schema in Protobuf or Avro, register it in Schema Registry (Confluent or AWS Glue), and enforce it at publish time — invalid messages are rejected. This is the most automated form of contract enforcement.
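The publish-time enforcement idea can be illustrated without a real broker. The sketch below is a stand-in only: `ORDER_EVENT_SCHEMA`, `validate_message`, and `publish` are hypothetical names, the schema format is a simplified dict rather than Avro or Protobuf, and a Python list plays the role of the topic. In a real deployment this check would typically be performed by a registry-aware serializer rather than hand-written code.

```python
class ContractViolation(Exception):
    """Raised when a message does not match the declared schema."""


# Hypothetical schema: field name -> (expected Python type, required?)
ORDER_EVENT_SCHEMA = {
    "order_id": (int, True),
    "customer_id": (int, True),
    "coupon": (str, False),
}


def validate_message(message: dict, schema: dict) -> None:
    """Raise ContractViolation on missing, mistyped, or unknown fields."""
    for field, (expected_type, required) in schema.items():
        if field not in message:
            if required:
                raise ContractViolation(f"missing required field: {field}")
            continue
        if not isinstance(message[field], expected_type):
            raise ContractViolation(f"wrong type for field: {field}")
    unknown = message.keys() - schema.keys()
    if unknown:
        raise ContractViolation(f"unknown fields: {sorted(unknown)}")


def publish(message: dict, schema: dict, sink: list) -> None:
    """Validate first; only valid messages reach the sink (the stand-in topic)."""
    validate_message(message, schema)
    sink.append(message)


topic = []
publish({"order_id": 1, "customer_id": 7}, ORDER_EVENT_SCHEMA, topic)
try:
    publish({"order_id": 2, "cust_id": 7}, ORDER_EVENT_SCHEMA, topic)
except ContractViolation as exc:
    print(f"rejected: {exc}")
print(len(topic))
```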

**dbt-based contracts**: dbt 1.5 introduced model contracts, a way to declare column names, data types, and constraints in a model's YAML properties file (commonly schema.yml) and have them enforced when the model is built. If a model's SQL does not produce the declared columns and types, the build fails. This enforces producer-side contract compliance. dbt tests (unique, not_null, accepted_values, relationships) then verify quality guarantees on the built data, and consumers can run the same tests on their side as a validation layer.

**OpenAPI/YAML data contracts**: for REST API-exposed data products, OpenAPI specifications document the contract. For file-based or warehouse-table-based data products, custom YAML schemas (following the Data Contract Specification open standard, or tools like Soda Data Contracts) document the contract and enable automated validation.

**Data Contract Specification (DCS)**: an emerging open standard (datacontract.com) for defining data contracts in a machine-readable YAML format. Supported by tools including Soda Core, OpenMetadata, and Atlan. The specification covers schema, quality expectations, service levels, and ownership. As the standard matures, more tooling will read DCS files for automated contract validation.

Contract lifecycle

**Creation**: when a new data product or table is published, the producer documents the contract before consumers build on it. For existing data assets migrating to contract-based governance, the contract is reverse-engineered from the current state and then formalised.

**Validation**: automated tests run on each data delivery to verify that the contract's quality commitments are met. Schema validation (column existence, type check) runs at ingest or transformation time. Quality assertions (null rates, row counts, referential integrity) run as part of the dbt test suite or a data quality tool.

**Monitoring**: dashboards and alerting surface contract violations in real time. Consumer teams are notified when a producer's data violates the contract. Producers are warned when a delivery is at risk of missing its SLA, so they can respond before a violation occurs.

**Change management**: when a producer needs to change the contract (add a column, rename a column, change a type), the change process is: propose the change and communicate to consumers, negotiate a deprecation timeline for breaking changes, update the contract definition, and execute the change. Version the contract (major.minor.patch) so consumers can track what changed.

**Deprecation**: when a data asset is deprecated, the contract specifies the deprecation timeline — how long consumers have to migrate before the asset is removed. This makes data asset lifecycle management explicit rather than the silent deletion that currently destroys downstream pipelines without warning.
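A deprecation timeline is easy to encode once the notice period is written into the contract. The sketch below assumes a notice period expressed in days and adds one extra rule for illustration: the asset is not removed while any consumer is still registered against it. Both the function names and that rule are assumptions, not part of any standard.

```python
from datetime import date, timedelta


def earliest_removal(announced_on: date, notice_days: int) -> date:
    """First day the asset may be removed under the contract's notice period."""
    return announced_on + timedelta(days=notice_days)


def can_remove(today: date, announced_on: date, notice_days: int, active_consumers: list[str]) -> bool:
    """Removal requires the notice period to have elapsed and no remaining consumers."""
    notice_elapsed = today >= earliest_removal(announced_on, notice_days)
    return notice_elapsed and not active_consumers


announced = date(2024, 1, 1)
print(earliest_removal(announced, 90))
print(can_remove(date(2024, 2, 15), announced, 90, []))
print(can_remove(date(2024, 4, 1), announced, 90, ["finance_reporting"]))
print(can_remove(date(2024, 4, 1), announced, 90, []))
```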

Tooling landscape

**Soda Data Contracts**: Soda Core's contract feature allows defining quality expectations as contracts (YAML format) and running automated scans to verify compliance. Well-suited for warehouse-based data quality contracts.

**Atlan Data Contracts**: Atlan's catalog integrates contract definitions with catalog metadata. Contracts are associated with catalog assets, so consumers can see the contract before consuming the data.

**Spectral / Vacuum / custom YAML validation**: for teams that want lightweight contract enforcement without a dedicated tool, custom YAML schemas validated by CI/CD scripts are a viable approach. The schema is stored in git, validated against data assets on each pipeline run, and failures block deployment.
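A lightweight CI check can start by linting the contract file itself before validating any data. The sketch below is one possible shape for such a script. It assumes contracts are stored as JSON so that only the standard library is needed (a YAML file would need a YAML parser), and the required keys and the rule that an owner must contain "@" (an email address or a Slack handle) are illustrative choices.

```python
import json
import re

REQUIRED_KEYS = ("name", "version", "owner", "schema")  # hypothetical minimum
SEMVER = re.compile(r"^\d+\.\d+\.\d+$")


def lint_contract(text: str) -> list[str]:
    """Return a list of problems found in a JSON contract document."""
    contract = json.loads(text)
    errors = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in contract]
    if "version" in contract and not SEMVER.match(str(contract["version"])):
        errors.append("version is not major.minor.patch")
    if "owner" in contract and "@" not in str(contract["owner"]):
        errors.append("owner must be an email address or @handle, not a team name")
    return errors


def main(text: str) -> int:
    """Exit code for CI: 0 when the contract is clean, 1 otherwise."""
    errors = lint_contract(text)
    for error in errors:
        print(error)
    return 1 if errors else 0


good = json.dumps({
    "name": "orders",
    "version": "1.2.0",
    "owner": "jane@example.com",
    "schema": {"order_id": "bigint"},
})
bad = json.dumps({"name": "orders", "version": "v1", "owner": "data-team"})

print(main(good))
print(main(bad))
```

In a pipeline, the return value of `main` would be passed to `sys.exit` so that a non-zero code blocks the deployment.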

Starting point for most teams

For organisations without formal data contracts, the practical starting point is not a comprehensive contract for every table — it is a contract for the highest-value, highest-dependency data assets:

1. Identify the 10 data assets that, if they broke, would cause the most downstream pain

2. Document the current state: schema, expected freshness, known quality characteristics, current owner

3. Add automated quality tests (dbt tests) for the core quality assertions on those assets

4. Set up alerting so the owner is notified when quality tests fail

5. Establish a change communication process — even a Slack message to a channel before deploying upstream schema changes counts as a start
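Step 1 can be approximated from lineage metadata if it is available. The sketch below ranks assets by how many downstream assets depend on them, directly or transitively. The `LINEAGE` mapping and asset names are made up for illustration, and a raw dependent count is only a rough proxy for "downstream pain"; business criticality still needs human judgement.

```python
from collections import deque


def downstream_count(lineage: dict[str, list[str]], asset: str) -> int:
    """Count distinct assets reachable downstream of the given asset."""
    seen = set()
    queue = deque(lineage.get(asset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(lineage.get(node, []))
    return len(seen)


def top_assets(lineage: dict[str, list[str]], n: int = 10) -> list[str]:
    """Assets with the most downstream dependents, ties broken by name."""
    assets = set(lineage) | {child for children in lineage.values() for child in children}
    return sorted(assets, key=lambda a: (-downstream_count(lineage, a), a))[:n]


# Hypothetical lineage: asset -> assets that read from it directly.
LINEAGE = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fct_sales", "fct_returns"],
    "fct_sales": ["sales_dashboard"],
    "raw_customers": ["stg_customers"],
    "stg_customers": ["fct_sales"],
}

print(top_assets(LINEAGE, n=3))
```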

For the data quality testing framework that enforces contract quality assertions, see data quality framework. For the data mesh context where contracts are the primary coordination mechanism, see data mesh architecture. For the dbt implementation of contracts, see dbt best practices.

Our data architecture consulting practice designs data governance frameworks — including data contract implementation, quality monitoring, and ownership models. If your pipelines break silently due to upstream changes, book a free 30-minute audit to discuss your governance approach.
