Data Contracts: How to Make Upstream Data Producers Accountable for What They Send

Austin Duncan
Managing Director & Principal Data Architect
May 2, 2027 · 11 min read

Data contracts are agreements between data producers and data consumers that define the schema, semantics, and quality guarantees of a data asset. They address the root cause of most data reliability problems — upstream changes that break downstream pipelines without notice. This guide covers what data contracts are, how to implement them, and how to make them stick organisationally.

The most common source of data reliability failures in mature data platforms is not bugs in pipelines; it is upstream changes that violate implicit agreements. The application team that owns a source system renames a column. An operational database adds a new status code that no one told the data team about. An API changes its response schema between versions. Each of these is a change to a data asset that breaks downstream consumers, sometimes immediately and sometimes subtly.

Data contracts make implicit agreements explicit. A contract defines what a data producer commits to providing: the schema (column names, types, nullability), the semantics (what each field means, what each value represents), the freshness (how often data arrives and what SLA it carries), and the quality expectations (uniqueness constraints, referential integrity, valid value ranges). The consumer depends on those commitments. The producer is accountable for maintaining them.

What a Data Contract Contains

A minimal data contract specifies:

**Schema definition**: field names, data types, and nullability for every column. Schema changes that are backward-compatible (adding a nullable column) are allowed without notice. Changes that break consumers (renaming or dropping a column, changing a type from string to integer) require advance notice and a migration period.

**Semantic definitions**: what does each field mean? What are the valid values for a status code field? When a timestamp field is named 'updated_at', does it reflect the time the record was last updated in the source system, or the time it was ingested? These answers cannot be derived from the schema alone; they have to be documented.

**Freshness SLA**: how often does the producer commit to providing data? "Daily by 6 AM" or "within 10 minutes of a transaction" are commitments. Without a stated SLA, consumers cannot distinguish between a normal operational state and a pipeline failure.

**Quality expectations**: the producer commits to certain quality characteristics. The order_id column is unique. The customer_id column references a valid customer. The revenue column is non-negative. These can be encoded as dbt tests, Great Expectations checks, or custom assertions in the pipeline.

**Change notification process**: what is the process when the producer needs to make a breaking change? How much advance notice does the consumer receive? Is there a parallel run period where both old and new schemas are available simultaneously?
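As a rough illustration, the five elements above could be captured in one machine-readable object. The sketch below uses Python dataclasses. The `orders` dataset, the team name, the field descriptions, and the 24-hour and 30-day values are hypothetical examples, and the structure is an assumption rather than any standard contract format.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Field:
    name: str
    dtype: str        # logical type, e.g. "string", "decimal", "timestamp"
    nullable: bool
    description: str  # semantic definition: what the value means


@dataclass(frozen=True)
class DataContract:
    dataset: str
    owner: str                        # producing team accountable for the contract
    fields: tuple[Field, ...]         # schema plus semantics
    freshness_sla: timedelta          # maximum acceptable age of the newest record
    quality_rules: tuple[str, ...]    # human-readable rules, enforced by separate checks
    breaking_change_notice_days: int  # advance notice promised to consumers


# Hypothetical contract for an "orders" table.
orders_contract = DataContract(
    dataset="orders",
    owner="checkout-team",
    fields=(
        Field("order_id", "string", False, "Unique identifier of the order"),
        Field("customer_id", "string", False, "Identifier of an existing customer"),
        Field("revenue", "decimal", False, "Order revenue, never negative"),
        Field("updated_at", "timestamp", False,
              "Time the record was last updated in the source system"),
    ),
    freshness_sla=timedelta(hours=24),
    quality_rules=(
        "order_id is unique",
        "customer_id references a valid customer",
        "revenue >= 0",
    ),
    breaking_change_notice_days=30,
)
```

Freezing the dataclasses means a contract cannot be altered in place at runtime; a change has to go through the definition itself, which is where review would happen if the file lives in version control.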

Why Data Contracts Fail Without Organisational Change

A data contract is a commitment. Commitments require incentives to maintain. If the application team that owns a source system has no accountability for breaking downstream data consumers, they will not prioritise communicating changes. They are shipping product features; data pipeline compatibility is not their problem.

Making data contracts stick requires:

**Contract violations become visible to the producer.** When a schema change breaks the downstream pipeline, the producer should know about it — not just the data team. Automated alerting that fires to the producing team (not just the consuming team) when a contract violation is detected creates the awareness that motivates care.

**Breaking changes require a review process.** Source system changes that modify data outputs should go through a review process that includes the data team. This does not mean the data team can veto application changes — it means they are included early enough to plan for migration rather than discovering breakage in production.

**Producer teams have data contract compliance as an explicit responsibility.** If no one is accountable, nothing changes. "The application team is responsible for maintaining their data contract" must be stated explicitly in team charters, not assumed.

Without these organisational elements, a data contract is documentation that no one reads and no one enforces.
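One way to make violations visible to the producer is to resolve alert recipients from an ownership record rather than hard-coding the data team. The following is a minimal sketch: the `OWNERSHIP` registry, the team names, and the message format are all hypothetical, and actually sending the message (chat, email, paging) is left to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Violation:
    dataset: str
    rule: str
    detail: str


# Hypothetical ownership registry: who produces and who consumes each dataset.
OWNERSHIP = {
    "orders": {
        "producer": "checkout-team",
        "consumers": ["analytics-team", "finance-reporting"],
    },
}


def recipients_for(violation: Violation, ownership: dict) -> list[str]:
    """Return every team to notify, with the producing team first."""
    entry = ownership[violation.dataset]
    # The producer is always notified, not only the teams that consume the data.
    return [entry["producer"], *entry["consumers"]]


def format_alerts(violation: Violation, ownership: dict) -> dict[str, str]:
    """Build one message per recipient; delivery is left to the caller."""
    producer = ownership[violation.dataset]["producer"]
    message = (
        f"Contract violation on '{violation.dataset}': {violation.rule} "
        f"({violation.detail}). Owner: {producer}."
    )
    return {team: message for team in recipients_for(violation, ownership)}


alerts = format_alerts(
    Violation("orders", "schema", "column 'order_status' is missing"),
    OWNERSHIP,
)
```

Because the owner is named in every message, consuming teams also see who is accountable for the source without having to look it up.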

Implementation Approaches

**Schema registry**: a schema registry (Confluent Schema Registry, AWS Glue Schema Registry) enforces schema evolution rules at the serialisation layer for event streaming use cases. Producers must register their schema before publishing; consumers validate incoming data against the registered schema. Incompatible schema changes are rejected at the point of publication, not discovered downstream. This is the most robust enforcement mechanism for streaming data.

**OpenAPI / AsyncAPI specifications**: for APIs that serve data to pipelines, a formal API specification (OpenAPI for synchronous HTTP/REST APIs, AsyncAPI for asynchronous, event-driven APIs) documents the schema and provides a contract that consumers can validate against. API versioning (v1, v2) allows breaking changes to be introduced alongside the existing contract, giving consumers time to migrate.

**dbt contracts**: dbt supports contract enforcement at the model level. When a model has an enforced contract (declared column names and data types), dbt validates at build time that the model's output matches the contract. If an upstream change alters the model's output in a way that violates the contract, the dbt build fails, so the violation is caught in the CI pipeline rather than in production.

**Data contract YAML specifications**: tools like Soda and custom implementations use YAML-defined contracts that specify schema, quality rules, and freshness SLAs. The contract file lives in version control alongside the pipeline code. A contract validation step in the pipeline checks the incoming data against the contract and fails the pipeline if violations are detected.
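A contract validation step of the kind described in the last approach might look roughly like the sketch below. It is written in plain Python and does not use the API of Soda, dbt, or any other tool. The contract is an inline dictionary standing in for a parsed YAML file, and the key names (`columns`, `unique`, `non_negative`, `max_age`) are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical contract; in practice this might be loaded from a YAML file.
CONTRACT = {
    "columns": {
        "order_id": {"type": str, "nullable": False},
        "customer_id": {"type": str, "nullable": False},
        "revenue": {"type": float, "nullable": False},
    },
    "unique": ["order_id"],
    "non_negative": ["revenue"],
    "max_age": timedelta(hours=24),
}


class ContractViolation(Exception):
    """Raised to fail the pipeline step when the contract is not met."""


def check_batch(rows, newest_record_at, contract, now=None):
    """Return a list of violation messages for a batch of dict rows."""
    now = now or datetime.now(timezone.utc)
    errors = []

    # Freshness: the newest record must be within the agreed SLA.
    if now - newest_record_at > contract["max_age"]:
        errors.append("freshness: newest record is older than the SLA")

    # Schema: every contracted column is present, typed, and non-null if required.
    for i, row in enumerate(rows):
        for name, spec in contract["columns"].items():
            if name not in row:
                errors.append(f"schema: row {i} is missing column '{name}'")
            elif row[name] is None:
                if not spec["nullable"]:
                    errors.append(f"schema: row {i} has null in '{name}'")
            elif not isinstance(row[name], spec["type"]):
                errors.append(f"schema: row {i} has wrong type in '{name}'")

    # Quality: uniqueness and value-range rules.
    for name in contract["unique"]:
        values = [row.get(name) for row in rows]
        if len(values) != len(set(values)):
            errors.append(f"quality: '{name}' is not unique")
    for name in contract["non_negative"]:
        for i, row in enumerate(rows):
            value = row.get(name)
            if isinstance(value, (int, float)) and value < 0:
                errors.append(f"quality: row {i} has negative '{name}'")

    return errors


def enforce(rows, newest_record_at, contract, now=None):
    """Fail the pipeline step if any violation is found."""
    errors = check_batch(rows, newest_record_at, contract, now)
    if errors:
        raise ContractViolation("; ".join(errors))
```

Separating `check_batch` from `enforce` lets the same checks feed an alert without stopping the pipeline, or stop it outright, depending on how strictly a given contract is enforced.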

Versioning and Migration

Breaking changes in data contracts follow a deprecation and migration process:

1. The producer announces the breaking change with a migration timeline (minimum 30 days for most contexts; 90 days or more for large organisations with many consumers).

2. Both the old and new schema are available simultaneously during the migration period. For databases, this typically means maintaining the old columns alongside new ones, or publishing to both old and new table names.

3. Consumers migrate to the new schema during the migration period, updating their pipeline code and tests.

4. The old schema is retired at the end of the migration period.

The migration period length should be calibrated to the number of consumers and their migration complexity. A table with one consumer can migrate in days. A table consumed by 20 pipelines and 15 dashboards may need a 90-day migration period.
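The decision of whether a change needs this process at all, and how long the window should be, can be sketched in a few lines. In the example below, schemas are dictionaries mapping column name to `(type, nullable)`. Adding a nullable column is treated as non-breaking, as described earlier; treating every other difference as breaking is a conservative assumption of this sketch. The threshold of 10 consumers for moving from 30 to 90 days is also hypothetical and should be tuned to the organisation.

```python
from datetime import date, timedelta


def is_breaking(old: dict, new: dict) -> bool:
    """Compare two schemas given as {column: (type, nullable)}."""
    # A dropped column, a changed type, or a changed nullability breaks consumers.
    # A rename shows up here as a dropped column plus an added one.
    for name, spec in old.items():
        if new.get(name) != spec:
            return True
    # Added columns are non-breaking only when they are nullable.
    return any(
        not nullable
        for name, (_dtype, nullable) in new.items()
        if name not in old
    )


def notice_days(consumer_count: int) -> int:
    """Hypothetical calibration: 30 days by default, 90 for widely used tables."""
    return 90 if consumer_count >= 10 else 30


def retirement_date(announced: date, consumer_count: int) -> date:
    """Earliest date the old schema can be retired after the announcement."""
    return announced + timedelta(days=notice_days(consumer_count))


old_schema = {"order_id": ("string", False), "status": ("string", False)}
additive = {**old_schema, "coupon_code": ("string", True)}
renamed = {"order_id": ("string", False), "order_status": ("string", False)}

needs_migration = is_breaking(old_schema, renamed)       # rename: breaking
no_notice_needed = not is_breaking(old_schema, additive)  # nullable add: allowed
old_schema_retired_on = retirement_date(date(2027, 5, 2), consumer_count=35)
```

A check like `is_breaking` could run in the producer's CI so that a breaking change is flagged before it ships, rather than being discovered by consumers afterwards.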

The Operational Benefit

Data contracts reduce the operational cost of data incidents. When a pipeline breaks because an upstream schema changed, the investigation traditionally involves: discovering the breakage from a user complaint, identifying which pipeline failed, identifying which source table changed, reaching out to the application team, finding the relevant engineer, and coordinating a fix.

With data contracts, the investigation is shorter: the monitoring system detects the schema violation and alerts both the data team and the producing team, the contract identifies who owns the source, the data team checks the contract for the expected schema, and the fix can be scoped immediately.

The reduction in mean time to resolution (MTTR) is measurable. So is the reduction in the frequency of undetected schema drift — changes that produce wrong results rather than pipeline failures, which are the harder class of problem to catch without explicit contract enforcement.

Our data engineering practice implements data contract programmes — contact us to discuss data reliability engineering for your data platform.
