Data Vault is a modeling methodology designed for enterprise data warehouses that need to handle large numbers of source systems, frequent schema changes, and full auditability. This guide covers the core constructs, when Data Vault is justified, and when Kimball is the better choice.
Data Vault is a data modeling methodology developed by Dan Linstedt, designed for enterprise data warehouses that need to integrate many source systems, handle frequent schema changes, and maintain a complete audit trail. It is an alternative to the Kimball dimensional modeling approach. This guide covers what Data Vault is, the core constructs, when it is justified, and when Kimball is the better choice.
The Data Vault premise
Kimball dimensional modeling (star schemas, fact tables, dimension tables) is optimised for query performance and user experience — analytics queries are fast and intuitive. Its weakness is rigidity: when source systems change their schemas, when new sources are added, or when business definitions shift, Kimball models require refactoring that can be disruptive to downstream consumers.
Data Vault was designed to solve this. It is optimised for:
- **Auditability**: Every row includes insert timestamps, record sources, and no deletes. The full history of every record, from every source, is always retained.
- **Flexibility**: New sources can be added without touching existing tables. Schema changes in source systems are handled with minimal disruption to the core model.
- **Parallelism**: The modular structure allows teams to work on different parts of the model simultaneously without coordination overhead.
The trade-off is query complexity: Data Vault models are not designed to be queried directly. They are designed to be transformed into dimensional models (or equivalent query-optimised structures) for reporting.
The three core constructs
**Hubs**: A hub stores the unique business keys for a business entity — customer, order, product, employee. One hub per core business entity. Columns: hash key (surrogate key derived from the business key), business key value(s), load datetime, record source. No foreign keys to other hubs. No descriptive attributes.
The hub is the anchor of the model. It answers: "What are all the unique instances of this entity, across all sources, and when were they first seen?"
**Links**: A link captures the relationship between two or more hubs — an order placed by a customer, a product sold on an order, an employee assigned to a department. Links are many-to-many by design. Columns: hash key (derived from the combination of linked hub keys), foreign keys to each linked hub, load datetime, record source.
Links never change — once a relationship is recorded, it is permanent. If an order-product relationship needs to change (the wrong product was associated), a new link row is inserted rather than the old row updated.
**Satellites**: A satellite stores the descriptive context for a hub or link — the customer's name, address, segment; the order's total amount, status, timestamp. One satellite per source system or per topic area within a source. Columns: parent hash key (FK to hub or link), load datetime, load end datetime, hash difference (for detecting changes), record source, all attribute columns.
Satellites implement Type 2 slowly changing dimension logic natively — a new row is inserted when any attribute changes, with a load end datetime set on the previous row. Full history is retained automatically.
Why the three-layer structure
The separation of hub (identity), link (relationship), and satellite (context) enables:
**Adding a new source**: A new satellite added to an existing hub captures the new source's attributes without touching the hub or existing satellites. No schema changes to existing tables.
**Schema changes in source**: If a source adds a new column, a new satellite (or a new satellite version) captures it. The hub, existing satellites, and all linked hubs are unchanged.
**Parallel development**: Hub tables, link tables, and satellite tables can be built by different team members independently. The modular structure scales development teams.
Business Vault
The Business Vault layer in Data Vault 2.0 sits above the Raw Vault (hubs, links, satellites from source systems) and contains calculated fields, business rules, and derived entities. For example: a satellite that contains the calculated order margin (unit_price minus cost), where the raw satellites contain only the source values. Business rules applied in the Business Vault are separated from the raw source data, which is preserved unchanged in the Raw Vault.
Information Marts
The Information Mart layer transforms the Data Vault model into query-optimised structures for specific business consumers — typically Kimball-style dimensional models (fact and dimension tables). The Raw Vault + Business Vault contain the full history and all source data; the Information Mart contains the user-friendly, query-optimised view of the data relevant to a specific business domain.
This three-layer structure (Raw Vault → Business Vault → Information Mart) is what makes Data Vault 2.0 operationally viable for analytics. Users do not query the raw vault; they query Information Marts that look like standard dimensional models.
Data Vault vs Kimball: when to use which
**Use Data Vault when**:
- You are integrating 10+ source systems that change frequently
- Regulatory or audit requirements demand complete data lineage from source to report, with every source record preserved
- Development teams are large and need to work in parallel without coordination overhead
- The enterprise data warehouse is expected to have a multi-decade lifespan with significant evolution
**Use Kimball when**:
- You have 1–5 source systems with stable schemas
- The analytics consumers are sophisticated and want intuitive models they can query directly
- Development team is small (2–10 engineers) — Data Vault's modular structure is an advantage at scale, overhead at small scale
- Query performance for end users is the primary requirement
**The hybrid approach**: Many mature warehouses use both. Data Vault as the integration and historical preservation layer; Kimball dimensional models as the query-optimised marts built on top of the vault. This gives the auditability and flexibility of Data Vault with the query performance and user experience of Kimball.
Tooling for Data Vault
dbt can implement Data Vault models, though the hub-link-satellite pattern requires discipline in how models are organised. The dbt_vault package (formerly dbtvault) provides macros for generating Data Vault SQL from YAML definitions. Wherescape, Vaultspeed, and Datavault4dbt are commercial and open-source tools specifically for automating Data Vault generation.
The hashing function for surrogate keys (MD5, SHA-256, or a custom hash) must be applied consistently across all hubs and links. Using dbt_utils.generate_surrogate_key or an equivalent hash function provides consistency.
For the dimensional modeling alternative, see kimball vs inmon and data warehouse design. Our data architecture consulting practice designs enterprise data warehouse architectures including Data Vault implementations — book a free design review.
A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.
Book a Call →