BlogData Architecture

What Is a Surrogate Key? Integer Keys for Data Warehouse Joins

Austin Duncan
Austin Duncan
Project Manager
·July 24, 20288 min read

A surrogate key is a system-generated integer used as the primary key in a data warehouse dimension table, replacing the natural business identifier from the source system. This guide explains why surrogate keys are used, how they enable slowly changing dimension history, and best practices for generating them.

A surrogate key is a system-generated integer used as the primary key of a dimension table in a data warehouse, in place of the natural business identifier from the source system. Where a CRM assigns customer "ACCT-00891" as its account ID, the data warehouse dimension assigns that customer a surrogate key of 12345 — a simple integer that the warehouse controls.

Surrogate keys are one of the most misunderstood conventions in dimensional modeling. Analysts accustomed to transactional databases ask: why not just use the business key directly as the dimension primary key? The answer is rooted in the specific challenges of data warehouse management: slowly changing dimensions, multiple source systems, and join performance.

Why Not Use Natural Business Keys?

**Natural keys change.** Source systems are not designed with data warehouse stability in mind. A company changes its ERP system and customer IDs are reassigned. A product is discontinued, its SKU retired, and a new product uses the same SKU with different attributes. If the natural key is the dimension primary key, every fact table row pointing to that key now points to the wrong entity.

Surrogate keys decouple the dimension record from the source identifier. The warehouse controls the surrogate key; it never changes regardless of what happens in the source system. Old facts still join to the dimension record that was current at the time they were created.

**Natural keys from different systems do not align.** An enterprise data warehouse integrates data from multiple source systems — Salesforce CRM, NetSuite ERP, a billing platform, a support system. The same customer exists in all four with different identifiers. The conformed customer dimension must resolve these to a single customer record with a single surrogate key, enabling any fact table to join to the customer dimension regardless of which source system generated the fact row.

**Natural keys are often not integers.** Join performance on large fact tables benefits significantly from integer keys. Joining a 500-million-row fact table to a customer dimension on an integer surrogate key is faster than joining on a VARCHAR account ID — integer comparisons are simpler and integer keys consume less storage in the fact table index.

Surrogate Keys and Slowly Changing Dimensions

The most important role of surrogate keys in dimensional modeling is enabling Type 2 slowly changing dimension (SCD) history. When a customer changes their segment from "Mid-Market" to "Enterprise," a Type 2 SCD response adds a new dimension row for the customer with the new segment value. The new row gets a new surrogate key; the old row's surrogate key is retained.

Fact table rows created before the customer changed segment point to the old surrogate key — the record showing "Mid-Market." Fact table rows created after the change point to the new surrogate key — the record showing "Enterprise." Queries that join fact to dimension retrieve the correct segment as of the time of the fact.

Without surrogate keys, this history preservation is impossible. If the natural business key is the primary key, there can only be one row per customer — you cannot add a second row with the same customer ID. Surrogate keys allow multiple rows per entity, one per historical state, each identified by its own unique integer.

Generating Surrogate Keys

In traditional on-premises data warehouses, surrogate keys were generated using database sequences or identity columns — auto-incrementing integers maintained by the database. Each new dimension row received the next available integer.

In modern dbt-based transformation pipelines, surrogate keys are typically generated using a hash of the natural business key plus the effective date for SCD Type 2 records. The dbt_utils.generate_surrogate_key macro produces a consistent hash from specified columns — the same combination of inputs always produces the same hash. This approach is:

**Deterministic:** The surrogate key for a given natural key and effective date is always the same, regardless of the order records are loaded. This enables idempotent loads — rerunning the dbt model produces the same surrogate keys.

**Not dependent on sequence management:** No database sequence state to maintain, no risk of sequence gap or collision when running pipeline backfills.

The tradeoff is that hash-based surrogate keys are VARCHAR (the hash) rather than INTEGER. For performance-sensitive joins on very large fact tables, integer sequences are still preferable; for most modern cloud warehouse workloads, the hash approach is acceptable and simpler to implement.

Surrogate Keys in the Fact Table

Fact tables contain only the surrogate key from each dimension — not the natural business key. The surrogate key is the join column. Analysts who need the natural key for reconciliation with source systems JOIN to the dimension table and retrieve the business key from there.

This is a common point of confusion for analysts new to dimensional modeling. The fact table does not contain customer_name or customer_segment; it contains customer_key. To get customer segment, you join the fact to the customer dimension on customer_key and retrieve the segment column from the dimension.

Our data architecture services practice designs dimensional models with correct surrogate key implementation — including SCD Type 2 history, multi-source entity resolution, and hash key generation via dbt. Contact us to discuss your data warehouse design.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →