What data contracts are, why they solve the root cause of most data quality failures, how to define and enforce them technically, and the organisational change required to make data contracts work in a real engineering team.
Data contracts are the formalisation of the implicit agreements that exist between every team that produces data and every team that consumes it. When a data engineer builds a pipeline that reads the users table from the application database, there is an implicit contract: the schema will not change without warning, the user_id column will always be populated, the email field will contain valid email addresses. These assumptions are never written down. They are discovered when they break — usually at 2am when a downstream dashboard stops working because a developer renamed a column.
Data contracts make these implicit agreements explicit, versioned, and technically enforceable. They are the operational mechanism that addresses the root cause of most data quality failures: the gap between what consumers assume about the data and what producers actually commit to delivering.
Why Data Quality Problems Are Really Contract Problems
The standard response to data quality problems is more monitoring — alerting when a column goes null, when row counts drop, when values fall outside expected ranges. This is valuable but it treats symptoms. Monitoring tells you when the contract has been broken; it does not prevent the breakage.
The root cause of most data quality failures is not malicious intent or incompetence. It is missing knowledge: the application team that owns the users table does not know that three downstream pipelines depend on the email column being non-null. They make the column nullable to support optional email addresses in a new user registration flow, the change is completely reasonable from their system's perspective, and three BI reports break.
A data contract makes the dependency explicit. The application team knows, before making the change, that breaking the email column's non-null constraint will violate a contract that downstream systems depend on. They can either negotiate a change to the contract or implement the change in a backwards-compatible way.
The Components of a Data Contract
A data contract typically specifies:
**Schema**: the precise structure of the data — column names, data types, nullable constraints, and optionally column-level business definitions.
**Semantics**: what the data means. "active_users" means users who have logged in within the last 30 days, not users who have not been deleted. "revenue" means net revenue after returns, not gross revenue. Semantic definitions prevent situations where schema is technically correct but business meaning is violated.
**SLA and freshness**: when will this data be available? How often is it updated? What latency is acceptable? A batch pipeline that delivers data by 8am on business days is a different contract from a streaming pipeline with 60-second latency.
**Quality expectations**: the data quality guarantees the producer commits to. user_id is never null. email matches a valid email pattern. order_date is never in the future. These are testable assertions that the producer is responsible for meeting.
**Versioning and change policy**: how changes to the contract are managed. New columns are backwards-compatible additions. Renaming or removing columns is a breaking change that requires a version bump and a migration period.
**Owner and contact**: who is responsible for this contract, how to reach them, and what the escalation path is for violations.
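To make the components above concrete, here is a minimal sketch of a contract expressed as a Python data structure. The class names (`Column`, `DataContract`), the field layout, and the owner address are hypothetical and chosen for illustration only; they do not follow any particular contract standard.

```python
import re
from dataclasses import dataclass, field
from datetime import date
from typing import Callable

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


@dataclass(frozen=True)
class Column:
    name: str
    data_type: str
    nullable: bool
    description: str = ""  # semantics: what the column means to the business


@dataclass(frozen=True)
class DataContract:
    name: str
    version: str                # versioning and change policy
    owner: str                  # owner and contact
    freshness: str              # SLA, stated in plain words
    columns: tuple[Column, ...]  # schema
    # Quality expectations: rule description -> predicate over a single row.
    expectations: dict[str, Callable[[dict], bool]] = field(default_factory=dict)

    def violations(self, row: dict) -> list[str]:
        """Return the descriptions of all expectations this row fails."""
        return [rule for rule, check in self.expectations.items() if not check(row)]


# Hypothetical contract using the quality expectations named in the text.
orders_contract = DataContract(
    name="orders",
    version="1.0.0",
    owner="orders-team@example.com",
    freshness="delivered by 08:00 on business days",
    columns=(
        Column("user_id", "varchar", nullable=False),
        Column("email", "varchar", nullable=False),
        Column("order_date", "date", nullable=False),
    ),
    expectations={
        "user_id is never null": lambda r: r.get("user_id") is not None,
        "email matches a valid email pattern": lambda r: isinstance(r.get("email"), str)
        and EMAIL_PATTERN.match(r["email"]) is not None,
        "order_date is never in the future": lambda r: r["order_date"] <= date.today(),
    },
)
```

Keeping the expectations as named predicates means the same object can serve as documentation for consumers and as a test suite for the producer.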
Technical Implementation
Data contracts exist at multiple levels of the stack:
**Schema registries** for event-driven architectures. Confluent Schema Registry, AWS Glue Schema Registry, and similar tools enforce schema compatibility for Kafka topics and event streams. Producers must register their schema; a compatibility mode (BACKWARD, FORWARD, FULL) determines what changes are allowed. A consumer reading a topic can rely on schema guarantees enforced by the registry.
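The idea behind a BACKWARD compatibility check can be sketched in a few lines. This is a deliberately simplified illustration, not the registry's actual implementation: it assumes a flat schema represented as a dict of field specs, and it ignores type promotion, unions, and aliases. The function name and schema layout are hypothetical.

```python
def backward_compatible(old: dict, new: dict) -> list[str]:
    """Return the reasons why `new` cannot read data written with `old`.

    Simplified BACKWARD rule: a reader using the new schema must be able to
    decode records produced with the old schema. An empty list means compatible.
    """
    problems = []
    for name, spec in new.items():
        if name not in old:
            # Old records lack this field, so the reader needs a default value.
            if "default" not in spec:
                problems.append(f"new field '{name}' has no default")
        elif spec["type"] != old[name]["type"]:
            problems.append(
                f"field '{name}' changed type from {old[name]['type']} to {spec['type']}"
            )
    # Fields present only in `old` are fine: the new reader simply ignores them.
    return problems


old_schema = {"user_id": {"type": "string"}, "email": {"type": "string"}}

# Adding an optional field with a default is allowed.
with_default = {**old_schema, "country": {"type": "string", "default": "unknown"}}

# Adding a required field without a default is rejected.
without_default = {**old_schema, "country": {"type": "string"}}
```

Under this rule, `backward_compatible(old_schema, with_default)` returns an empty list, while the `without_default` variant returns one problem describing the missing default.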
**dbt tests and contracts** for the data warehouse layer. dbt's model contracts feature allows you to define the expected schema of a dbt model in the model's YAML config. If a model's output does not match the declared contract (wrong column types, missing columns), the dbt build fails. This enforces contracts at the transformation layer.
A dbt contract definition:
    models:
      - name: orders
        config:
          contract:
            enforced: true
        columns:
          - name: order_id
            data_type: varchar
            constraints:
              - type: not_null
              - type: unique
          - name: customer_id
            data_type: varchar
            constraints:
              - type: not_null
**Data contract frameworks** like DataContract CLI, Soda, and custom implementations provide tooling for defining contracts in YAML or JSON, generating tests from contract definitions, validating contract compliance at pipeline runtime, and publishing contracts to a data catalogue.
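As a rough illustration of "generating tests from contract definitions", the sketch below turns a JSON contract into per-column check functions and runs them against a batch of rows. The JSON layout and the function names are assumptions made for this example; they are not the format or API of DataContract CLI, Soda, or any other framework.

```python
import json

# Hypothetical contract definition; real frameworks use their own formats.
CONTRACT_JSON = """
{
  "name": "orders",
  "columns": [
    {"name": "order_id", "type": "str", "nullable": false},
    {"name": "customer_id", "type": "str", "nullable": false},
    {"name": "discount", "type": "float", "nullable": true}
  ]
}
"""

PY_TYPES = {"str": str, "int": int, "float": float}


def generate_tests(contract: dict) -> list:
    """Build one check function per declared column."""
    tests = []
    for col in contract["columns"]:
        # Default arguments bind the current column's values to each closure.
        def test(row, name=col["name"], py_type=PY_TYPES[col["type"]],
                 nullable=col["nullable"]):
            if name not in row:
                return f"{name}: missing column"
            value = row[name]
            if value is None:
                return None if nullable else f"{name}: null not allowed"
            if not isinstance(value, py_type):
                return f"{name}: expected {py_type.__name__}, got {type(value).__name__}"
            return None  # the check passed

        tests.append(test)
    return tests


def validate_batch(rows: list, tests: list) -> list:
    """Return (row_index, message) for every failed check."""
    failures = []
    for index, row in enumerate(rows):
        for test in tests:
            message = test(row)
            if message is not None:
                failures.append((index, message))
    return failures


contract_tests = generate_tests(json.loads(CONTRACT_JSON))
```

A pipeline step could call `validate_batch` before publishing a batch and fail the run when the returned list is non-empty.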
**Lineage tools** (dbt, OpenLineage, Marquez) make contract dependencies discoverable, showing which downstream models and pipelines depend on each source table and, where column-level lineage is available, on each source column. This turns the abstract concept of "downstream consumers" into specific, nameable things that can be notified when a contract change is proposed.
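Once lineage is available as a graph, finding the consumers affected by a change is a graph traversal. The sketch below assumes lineage has already been exported into a simple dict of edges; the column names are invented for illustration and the structure is not the output format of any specific lineage tool.

```python
from collections import deque

# Hypothetical edges: upstream column -> columns derived directly from it.
LINEAGE = {
    "app.users.email": ["staging.stg_users.email"],
    "staging.stg_users.email": [
        "marts.dim_customers.email",
        "marts.marketing_audience.email_domain",
    ],
    "marts.dim_customers.email": ["reports.churn_dashboard.contact_email"],
    "app.users.user_id": ["staging.stg_users.user_id"],
}


def downstream(column: str, lineage: dict) -> list:
    """Return every column that depends on `column`, in breadth-first order."""
    visited = {column}
    found = []
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in visited:  # guard against cycles and duplicates
                visited.add(child)
                found.append(child)
                queue.append(child)
    return found
```

Calling `downstream("app.users.email", LINEAGE)` lists the staging model, the two marts, and the dashboard column, which is the set of owners to notify before the email column changes.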
The Producer-Consumer Workflow
Implementing data contracts changes how changes to data assets are managed:
**For new data assets:** the producer documents the contract (schema, semantics, SLA, quality expectations) before the consumer depends on it. The contract is registered in the data catalogue. Consumers reference the contract, not just the table.
**For proposed changes to existing contracts:** backwards-compatible changes (adding columns, widening types) can be deployed with notification but without formal approval. Breaking changes (removing columns, renaming, changing semantics) require four steps: identifying all downstream consumers from lineage, negotiating the change with consumer owners, defining a migration timeline with a deprecation window, and versioning the contract.
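The split between compatible and breaking changes can be automated for the schema part of a contract. The following is a minimal sketch under stated assumptions: schemas are dicts of `column -> (type, nullable)`, the set of allowed type widenings is a small hypothetical list, and semantic changes are out of scope because they cannot be detected from the schema alone.

```python
# Hypothetical set of type changes treated as safe widenings.
WIDENINGS = {("int", "bigint"), ("float", "double")}


def classify_change(old: dict, new: dict) -> tuple:
    """Compare two schemas and return (breaking, compatible) change lists."""
    breaking, compatible = [], []
    for name, (old_type, old_nullable) in old.items():
        if name not in new:
            # A rename looks like a removal plus an addition to consumers.
            breaking.append(f"removed or renamed column {name}")
            continue
        new_type, new_nullable = new[name]
        if new_type != old_type:
            target = compatible if (old_type, new_type) in WIDENINGS else breaking
            target.append(f"{name}: {old_type} -> {new_type}")
        if new_nullable and not old_nullable:
            breaking.append(f"{name}: became nullable")
    for name in new:
        if name not in old:
            compatible.append(f"added column {name}")
    return breaking, compatible


def next_version(version: str, breaking: list, compatible: list) -> str:
    """Major bump for breaking changes, minor bump for compatible ones."""
    major, minor, _patch = (int(part) for part in version.split("."))
    if breaking:
        return f"{major + 1}.0.0"
    if compatible:
        return f"{major}.{minor + 1}.0"
    return version


current = {"user_id": ("varchar", False), "email": ("varchar", False), "age": ("int", True)}
```

A check like this can run in the producer's CI so that a breaking diff blocks the merge until the consumer negotiation and deprecation window have been agreed.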
**For contract violations:** when a producer violates a contract (a column goes null that should not, a table is delayed past its SLA), consumers are notified via the monitoring layer. The producer is responsible for the violation and for communicating a resolution timeline. The contract creates accountability; without it, the violation is just a data quality issue with no clear owner.
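A freshness SLA violation is one of the simplest violations to detect and route. The sketch below is an illustration only: the `Violation` record, the function name, and the owner address are hypothetical, and how the violation reaches consumers (email, chat, incident tool) is left to the monitoring layer.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Violation:
    contract: str
    owner: str    # who is accountable, taken from the contract
    message: str


def check_freshness(contract: str, owner: str, last_loaded_at: datetime,
                    max_age: timedelta, now: datetime) -> Optional[Violation]:
    """Return a Violation if the data is older than the SLA allows, else None."""
    age = now - last_loaded_at
    if age <= max_age:
        return None
    late_minutes = int((age - max_age).total_seconds() // 60)
    return Violation(
        contract=contract,
        owner=owner,
        message=f"{contract}: data is {late_minutes} minutes past its freshness SLA",
    )


# Example: a daily table last loaded 26 hours ago against a 24-hour SLA.
now = datetime(2024, 1, 2, 10, 0, tzinfo=timezone.utc)
violation = check_freshness(
    contract="orders",
    owner="orders-team@example.com",
    last_loaded_at=datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc),
    max_age=timedelta(hours=24),
    now=now,
)
```

Because the owner travels with the violation record, the alert can name the accountable team instead of landing in a shared channel with no clear recipient.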
Organisational Requirements
Data contracts are not a purely technical problem. They require organisational structures that support contract management:
**Data ownership** must be clear. Every data asset needs an owner who is accountable for maintaining the contract. See our data governance implementation guide for how to establish data ownership structures.
**Change management process.** Breaking changes to contracts need a review process — who approves, how much notice is required, what migration support is offered. Without process, contracts are aspirational documents, not enforceable agreements.
**Engineering culture.** Application engineers must treat breaking changes to data assets with the same care as breaking changes to external APIs. In many organisations this requires a culture shift: "it is just an internal database table" is replaced with "downstream data systems depend on this like a customer depends on an API."
**Tooling investment.** Manual contract management at scale does not work. Organisations with hundreds of data assets need tooling that makes contracts discoverable (data catalogue), enforceable (schema registry, dbt contracts), and monitorable (data quality platform). The investment typically pays back in reduced incident response time, because violations arrive with a named owner and a known set of affected consumers.
Data contracts are foundational to the data mesh architectural pattern — see our data mesh implementation guide for how contracts fit in the broader data mesh context. For data architecture design that includes contract frameworks as a first-class concern, our data architecture consulting team can help — contact us to discuss your requirements.