The star schema is the most common dimensional model design — a central fact table surrounded by dimension tables, connected by foreign key relationships. This guide covers the practical decisions in star schema design: grain definition, fact table types, dimension table design patterns, slowly changing dimension handling, and the trade-offs between normalised and denormalised dimension approaches.
The star schema is the most widely implemented dimensional model pattern — a central fact table connected to surrounding dimension tables by foreign key relationships, forming a shape resembling a star when diagrammed. Understanding the practical decisions in star schema design — grain declaration, dimension table patterns, slowly changing dimension handling, and normalisation choices — is essential for data warehouse architects and analytics engineers who design the tables that BI tools query.
Fact Table Design
**Grain declaration** — the single most important design decision. The grain specifies what one row in the fact table represents. Common grains:
- One row per order line item
- One row per web session
- One row per daily snapshot of account balance
- One row per customer support ticket event
Declare the grain before adding any columns. Every column in the fact table must be consistent with the declared grain. If you find yourself adding a column that represents something other than a single event at the declared grain, the grain may be wrong, or you may need two separate fact tables.
Fact table columns:
- **Foreign keys** — one FK column per dimension table (customer_key, product_key, date_key, geography_key, etc.)
- **Degenerate dimensions** — dimension values with no corresponding dimension table (order number, invoice number, session ID). These are stored directly in the fact table as descriptive context.
- **Facts (measures)** — the numeric measurements (revenue, quantity, cost, etc.)
- **Metadata** — audit columns (loaded_at, source_system) if needed
**Surrogate keys vs natural keys** — dimension tables should use surrogate keys (system-generated integer identifiers) as primary keys, not natural keys (order_id, customer_number) from source systems. Surrogate keys:
- Enable Type 2 SCD — the same natural customer can have multiple dimension rows with different surrogate keys representing different time periods
- Protect against source system key changes (a customer_number change in the source doesn't cascade through the data warehouse)
- Provide consistent integer join performance regardless of source key format
Dimension Table Design
**Wide, denormalised dimensions** — dimension tables in a star schema are typically denormalised: all attributes of a dimension in a single wide table, rather than normalised into multiple related tables. A product dimension includes product name, SKU, brand, sub-category, category, department, and pack size in one table rather than linking to separate brand, category, and department tables.
The denormalised approach (called a "flat" dimension) simplifies queries (fewer joins), improves query performance, and makes the dimension more understandable to BI users and developers. The redundancy (brand name repeated once per product) is acceptable in analytical databases where updates are infrequent and storage is cheap.
The alternative — normalised dimension tables (called "snowflake schema") — reduces redundancy but requires more joins, increases query complexity, and provides no meaningful query performance benefit in modern column-store databases.
**Hierarchies** — most dimensions contain one or more hierarchies: product has SKU -> sub-category -> category -> department; date has day -> week -> month -> quarter -> year. Store all levels of every hierarchy as separate columns in the dimension table. "What is the category for this product?" is a single column lookup; no join required.
**Surrogate key generation** — add a numeric surrogate key as the primary key for every dimension table. In dbt, use dbt_utils.generate_surrogate_key() to create deterministic hash-based surrogate keys from the natural key. For Type 2 SCD dimensions, the natural key + valid_from combination generates a unique surrogate key for each row version.
Fact Table Types
**Transaction fact tables** — one row per discrete event at a specific point in time (one row per order line, one row per click, one row per transaction). The most common fact table type. Measurements are additive; grain is atomic.
**Periodic snapshot fact tables** — one row per entity per time period, capturing the state of a measurement at regular intervals (account balance each month-end, inventory level each day, customer status each week). Measurements may be semi-additive (account balance can be summed across accounts but not across time periods).
**Accumulating snapshot fact tables** — one row per entity lifecycle (one row per order, updated at each pipeline stage; one row per claim, updated at each claims processing step). The row is updated (not appended) as the entity progresses through its lifecycle. Measurements include milestone dates and elapsed time calculations.
Most enterprise data warehouses use predominantly transaction fact tables, supplemented by periodic snapshots for balance-type measurements and accumulating snapshots for pipeline/process tracking.
Bridge Tables
Some dimensional relationships are many-to-many and cannot be modelled with a single FK column in the fact table. Common examples:
- An order belongs to multiple promotions simultaneously
- A customer belongs to multiple market segments
- A transaction has multiple cost centres for allocation
Bridge tables handle these relationships: an intermediate table with one row per fact-dimension combination. An order-promotion bridge table has one row per (order_key, promotion_key) combination, allowing the fact table to join through the bridge to reach the promotion dimension.
Bridge tables complicate aggregate calculations (double-counting risk when rows fan out through the bridge). Use a weighting factor column in the bridge (promotion_share column that sums to 1.0 per order) to enable correctly weighted aggregations.
Handling Late-Arriving Data
In streaming and near-real-time pipelines, dimension rows for new entities may not exist in the dimension table when the first fact rows for those entities arrive. Without a dimension row, the FK in the fact table has no match and the row cannot be correctly attributed.
Strategies:
**Default/unknown dimension row** — maintain a dimension row with surrogate key -1 (or 0) representing "unknown." Fact rows for entities whose dimension row hasn't arrived yet use the unknown FK. A subsequent process updates fact rows to the correct FK when the dimension row is loaded. This requires tracking which fact rows reference the unknown dimension.
**Inferred dimension rows** — create a minimal dimension row with the natural key and null/default attribute values when the first fact row arrives. Update the dimension row with full attributes when the source provides them. This is the dbt snapshots approach for Type 2 SCD dimensions.
**Hold fact rows in a staging table** — delay loading fact rows until matching dimension rows exist. Suitable for lower-latency requirements; adds pipeline complexity.
Schema Documentation
Document every fact and dimension table with:
- Business description of what the table represents
- Grain statement for fact tables
- Column-level descriptions for every column
- Known limitations or quality issues
- Owner and review date
In dbt, this documentation goes in schema.yml files and is rendered in the dbt docs site. In other tools, maintain documentation in a data catalogue. Undocumented schemas create knowledge dependencies on individual engineers.
Our data architecture practice designs star schemas and dimensional models for enterprise data warehouse programmes — contact us to discuss your data warehouse design.
A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.
Book a Call →