Data contracts make the implicit agreements between teams explicit — the schema, freshness SLA, quality guarantees, and ownership commitments that producers and consumers of a data asset need to operate reliably. This guide covers what data contracts contain, how to implement them, and how they reduce the pipeline fragility that plagues most data organisations.
Most data pipeline fragility is caused by implicit agreements that nobody documented. The upstream team changes a column name. The downstream team's pipeline breaks. Both teams are surprised — the upstream team did not know anyone depended on that column name; the downstream team did not know the column was not guaranteed to be stable. A data contract would have made the agreement explicit, triggered a review before the change, and prevented the breakage.
Data contracts formalise what was implicit. They are the mechanism for turning fragile, undocumented data dependencies into managed, explicit commitments.
## What a Data Contract Contains
A data contract is a documented agreement between a data producer (the team that creates and maintains a data asset) and data consumers (the teams that use it). The contract specifies:
**Schema:** The columns, types, and semantics of the data asset. Column names, data types, nullable/non-null status, descriptions, and the business meaning of each field. This is the most basic commitment: these are the fields you can depend on, with these types, with these semantics.
**Freshness SLA:** When will new data be available? An hourly fact table has a different freshness SLA than a daily dimension table. The freshness SLA tells consumers what lag to expect and allows them to detect when data is overdue.
**Quality guarantees:** What data quality can consumers rely on? Examples: primary key uniqueness is guaranteed; the status field will only contain values from the defined enumeration; referential integrity to the customers dimension is maintained. These guarantees are what consumers can build on: a violation of a guaranteed property is the producer's problem to fix, while properties outside the contract carry no such commitment.
**Owner and contact:** Who is responsible for this data asset? Who do consumers contact when there is a quality issue or when they need to understand the data? Named ownership is a prerequisite for accountability.
**Versioning policy:** How will breaking changes be handled? Will consumers receive advance notice? How much lead time? What constitutes a breaking change (column removal, type change, semantic change) versus a non-breaking change (adding a new column)?
**Deprecation policy:** How will consumers be notified when an asset is deprecated? What is the expected lifecycle?
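As a rough illustration, the elements above can be captured in a small data structure. This is a minimal sketch, not a standard format: the class names, field names, and example values (`orders-data-team`, the one-hour SLA, the 30-day notice period) are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Column:
    name: str
    data_type: str
    nullable: bool
    description: str


@dataclass(frozen=True)
class DataContract:
    asset: str
    owner: str                          # named team accountable for the asset
    columns: tuple[Column, ...]         # schema commitment
    freshness_sla: timedelta            # maximum acceptable lag since the last load
    breaking_change_notice: timedelta   # lead time promised before breaking changes

    def is_overdue(self, last_loaded_at: datetime, now: datetime) -> bool:
        """Return True when the asset has exceeded its freshness SLA."""
        return now - last_loaded_at > self.freshness_sla


# Hypothetical contract for an hourly fact table (illustrative values only).
contract = DataContract(
    asset="fact_orders",
    owner="orders-data-team",
    columns=(
        Column("order_id", "varchar", False, "Unique identifier for each order."),
        Column("customer_id", "varchar", False, "Foreign key to dim_customers."),
    ),
    freshness_sla=timedelta(hours=1),
    breaking_change_notice=timedelta(days=30),
)

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
on_time = contract.is_overdue(datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc), now)
late = contract.is_overdue(datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc), now)
print(on_time, late)  # False True
```

Keeping the freshness SLA as a typed duration rather than free text is what lets a consumer detect overdue data mechanically instead of by reading documentation.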
## Data Contracts vs Documentation
Data contracts are often confused with data documentation. The distinction is consequential:
**Documentation** is a one-way communication: the producer describes what they built. It may be read by consumers or not. There is no obligation to maintain it, no enforcement mechanism if it becomes inaccurate, and no process for consumers to signal that changes to the documented asset will break them.
**Data contracts** are bilateral agreements: the producer commits to specific guarantees, and consumers acknowledge the terms of the contract. The contract creates obligations on both sides — producers must honour the schema, freshness, and quality commitments; consumers must register their dependencies and participate in change review processes.
The difference matters when changes happen. Without a contract, a producer changes a column and discovers broken downstream pipelines after the fact. With a contract, a change to a contracted column triggers a review process: affected consumers are identified (because they registered their dependencies), a migration plan is agreed, and the change is coordinated.
## Implementing Data Contracts
### In dbt: Schema Tests as Contract Enforcement
A dbt `schema.yml` file is the closest thing most teams have to a data contract by default. Model descriptions, column descriptions, data types, constraints, and tests encode the contract:
```yaml
models:
  - name: fact_orders
    description: "Order-level fact table. Grain: one row per order."
    config:
      contract:
        enforced: true
    columns:
      - name: order_id
        description: "Unique identifier for each order."
        data_type: varchar
        constraints:
          - type: not_null
          - type: unique
      - name: customer_id
        description: "Foreign key to dim_customers."
        data_type: varchar
        constraints:
          - type: not_null
          - type: foreign_key
            to: ref('dim_customers')
            to_columns: [customer_id]
```
dbt's model contracts feature (available from dbt 1.5) enforces that the model's output schema matches the declared schema. If a column is removed from the model SQL or its type changes, the contract enforcement fails the build before production deployment.
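This is the minimum viable data contract for dbt-managed models: a declared schema with enforced column names and types, plus declared constraints, that fails the CI pipeline on contract-breaking changes. Whether a given constraint is actually enforced depends on the data platform; several cloud warehouses treat `unique` and `foreign_key` as metadata only, so back those guarantees with dbt tests where they matter.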
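Conceptually, schema enforcement is a comparison between the declared columns and what the model actually produces. The sketch below illustrates that idea in plain Python; it is not dbt's implementation, and the function name `contract_violations` and the example schemas are hypothetical.

```python
def contract_violations(declared: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Compare a declared schema against the schema a model actually produces.

    Both arguments map column name -> data type.
    """
    problems = []
    for name, declared_type in declared.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name].lower() != declared_type.lower():
            problems.append(
                f"type mismatch on {name}: declared {declared_type}, got {actual[name]}"
            )
    # Flagging undeclared columns is a design choice of this sketch.
    for name in actual:
        if name not in declared:
            problems.append(f"undeclared column: {name}")
    return problems


declared = {"order_id": "varchar", "customer_id": "varchar"}
actual = {"order_id": "varchar", "customer_id": "integer", "discount": "numeric"}

violations = contract_violations(declared, actual)
for problem in violations:
    print(problem)
```

A CI job built on a check like this would fail the build whenever the returned list is non-empty.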
### Data Contract Registries
Beyond dbt's built-in contract enforcement, some teams implement explicit contract registries — a centralised store of all contracts with tooling for:
- Registering new data assets and their contracts
- Registering consumer dependencies on each contract
- Automated notification to registered consumers when a contract owner proposes changes
- Version history for contracts
Tools in this space: Soda's data contracts framework, Monte Carlo's contract features, and custom implementations using a metadata store (a YAML registry in git or a metadata service like DataHub).
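To make the registry responsibilities concrete, here is a minimal in-memory sketch covering the four capabilities listed above. It does not reflect the API of any of the tools named; `ContractRegistry`, its methods, and the team names are assumptions for illustration, and a real registry would persist its state (for example, as YAML in git).

```python
from dataclasses import dataclass, field


@dataclass
class ContractEntry:
    owner: str
    version: int = 1
    consumers: set[str] = field(default_factory=set)
    history: list[str] = field(default_factory=list)  # one note per version


class ContractRegistry:
    def __init__(self) -> None:
        self._entries: dict[str, ContractEntry] = {}

    def register_asset(self, asset: str, owner: str) -> None:
        """Register a new data asset and its initial contract."""
        self._entries[asset] = ContractEntry(owner=owner, history=["v1: initial contract"])

    def register_consumer(self, asset: str, consumer: str) -> None:
        """Record that a consumer depends on this asset's contract."""
        self._entries[asset].consumers.add(consumer)

    def propose_change(self, asset: str, summary: str) -> list[str]:
        """Record a proposed change and return the consumers who must be notified."""
        entry = self._entries[asset]
        entry.version += 1
        entry.history.append(f"v{entry.version} (proposed): {summary}")
        return sorted(entry.consumers)

    def history(self, asset: str) -> list[str]:
        """Return the version history for an asset's contract."""
        return list(self._entries[asset].history)


registry = ContractRegistry()
registry.register_asset("fact_orders", owner="orders-data-team")
registry.register_consumer("fact_orders", "finance-reporting")
registry.register_consumer("fact_orders", "marketing-analytics")

to_notify = registry.propose_change("fact_orders", "remove column legacy_status")
print(to_notify)  # ['finance-reporting', 'marketing-analytics']
```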
The implementation complexity of a full contract registry is significant. For most mid-market teams, dbt contract enforcement plus a documented review process for schema changes covers most of the value.
### Change Management for Contracted Assets
The contract creates the obligation to manage changes. The process:
1. **Identify the change:** The producer team proposes a change to a contracted asset — removing a column, changing a type, restructuring the schema.
2. **Impact analysis:** Identify all registered consumers of the contract. In dbt, lineage analysis (for example, via the dbt Cloud Discovery API, formerly the Metadata API) identifies which downstream models reference the changed model. In a contract registry, consumers are explicitly registered.
3. **Consumer review:** Notify consumers of the proposed change and the proposed timeline. Allow a review period for consumers to assess impact and raise objections.
4. **Migration plan:** For breaking changes, agree on a migration timeline — the producer maintains the old schema alongside the new for a defined period, giving consumers time to migrate.
5. **Deployment:** Deploy the change after the review period, on the agreed timeline, with the agreed deprecation window.
This process prevents the "surprise breakage" pattern. It adds time to schema changes — which is the point. Schema changes to shared data assets affect multiple teams and should not happen unilaterally.
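The first two steps of the process above lend themselves to automation. The following sketch classifies a schema change and walks a lineage graph to find affected downstream assets. The function names, the lineage dictionary, and the model names are hypothetical; in practice the lineage would come from your own metadata source. Note that the classifier sees only removed or retyped columns, so a semantic change still needs human review.

```python
from collections import deque


def classify_change(old: dict[str, str], new: dict[str, str]) -> str:
    """Return "breaking" if any column was removed or retyped, else "non-breaking"."""
    for name, old_type in old.items():
        if name not in new or new[name] != old_type:
            return "breaking"
    return "non-breaking"


def downstream_of(asset: str, children: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk of a lineage graph (parent -> direct children)."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque([asset])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


old = {"order_id": "varchar", "status": "varchar"}
added_column = {"order_id": "varchar", "status": "varchar", "channel": "varchar"}
removed_column = {"order_id": "varchar"}

# Hypothetical lineage: fact_orders feeds two models, one of which feeds a dashboard.
lineage = {
    "fact_orders": ["rpt_daily_sales", "ml_churn_features"],
    "rpt_daily_sales": ["exec_dashboard"],
}

print(classify_change(old, added_column))     # non-breaking
print(classify_change(old, removed_column))   # breaking
impacted = downstream_of("fact_orders", lineage)
print(impacted)  # ['rpt_daily_sales', 'ml_churn_features', 'exec_dashboard']
```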
## What Data Contracts Solve (and What They Do Not)
Data contracts address the upstream-to-downstream communication problem: producers do not know who consumes their data, so they make changes without understanding the impact; consumers do not know what guarantees they can rely on, so they build brittle pipelines on undocumented assumptions.
They do not address:
**Data quality issues within the contract:** If the contract guarantees not-null on a column and the upstream system starts producing nulls, the contract specifies that this is a quality violation — but it does not automatically fix the upstream issue. The contract identifies the problem and the accountability; resolution still requires human action.
**Semantic drift:** Column names can stay the same while the underlying business meaning changes. "Revenue" might change from gross to net without a schema change. Contract enforcement catches structural changes such as removed columns or changed types; it cannot detect a change in meaning, which requires stronger documentation practices and explicit semantic definitions in the contract.
**Data quality outside the contracted dimensions:** A consumer who depends on properties of the data not covered by the contract (specific value distributions, relationships not explicitly contracted) is still taking an undocumented dependency.
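The first limitation can be illustrated with a small check: a contract lets you detect a violation and attribute it to an owner, but the output is only a report. This is a minimal sketch under assumed names (`find_violations`, the rule labels, the sample rows); properties the contract does not list are deliberately not checked, mirroring the third limitation.

```python
def find_violations(
    rows: list[dict],
    not_null: list[str],
    allowed_values: dict[str, set[str]],
    owner: str,
) -> list[dict]:
    """Check rows against contracted guarantees and attribute each violation to the owner."""
    violations = []
    for index, row in enumerate(rows):
        for column in not_null:
            if row.get(column) is None:
                violations.append(
                    {"row": index, "column": column, "rule": "not_null", "owner": owner}
                )
        for column, allowed in allowed_values.items():
            value = row.get(column)
            if value is not None and value not in allowed:
                violations.append(
                    {"row": index, "column": column, "rule": "accepted_values", "owner": owner}
                )
    return violations


# Illustrative sample data: one null key and one status outside the enumeration.
rows = [
    {"order_id": "A1", "status": "shipped"},
    {"order_id": None, "status": "shipped"},
    {"order_id": "A3", "status": "teleported"},
]

report = find_violations(
    rows,
    not_null=["order_id"],
    allowed_values={"status": {"placed", "shipped", "cancelled"}},
    owner="orders-data-team",
)
for violation in report:
    print(violation)
```

The report names the accountable owner for each violation; fixing the upstream system remains a human task.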
Data contracts are most valuable in organisations where multiple teams produce and consume shared data assets — the data mesh operating model being the archetype. In organisations with a single centralised data team, the coordination overhead of formal contracts is often higher than the benefit.