Data Warehouse Design: Schemas, Grain, and the Decisions That Last

Data warehouse design decisions made at the beginning — schema type, fact grain, dimension strategy — are very difficult to change later. This guide covers the core design choices and the principles behind them.

A data warehouse is not a database you query. It is a database you design — once, carefully, for a specific set of questions. The schema type you choose, the grain you define for each fact table, and the dimension strategy you adopt at the start will shape every report, every query, and every downstream system for years. Getting these decisions right matters. Getting them wrong creates compounding technical debt that eventually requires a full rebuild.

This guide covers the fundamental design decisions and the principles behind them.

Schema design: star vs snowflake vs flat

A **star schema** places fact tables at the center, surrounded by denormalised dimension tables. The fact table contains foreign keys to each dimension and one or more measures. A single join from the fact table to each dimension retrieves all attributes needed for analysis. Star schemas are query-optimised — fewer joins, simpler SQL, better performance on columnar analytics engines.

A **snowflake schema** normalises dimension tables into sub-dimensions. A Customer dimension may join to a Geography dimension, which joins to a Country dimension. Snowflake schemas reduce data redundancy and storage, but each analysis requires more joins and the query complexity increases significantly. For modern columnar data warehouses (Snowflake, BigQuery, Redshift, Synapse), storage is cheap and normalisation provides no meaningful benefit while adding query complexity. Snowflake schemas made sense in the era of row-oriented OLTP-style warehouses. They do not make sense for modern analytical warehouses.

A **flat (one big table / OBT) schema** denormalises everything into a single wide table. No joins required. Maximum query simplicity. Appropriate for specific high-throughput, narrow use cases — a single business domain where all analysis requires the same combination of dimensions and measures. Not appropriate as a general warehouse design pattern because it creates massive data redundancy, makes updates to dimension attributes expensive, and produces tables that are difficult to maintain.

**The recommendation**: Star schema for most enterprise data warehouses. One fact table per business process at the appropriate grain. Dimension tables denormalised to the level that avoids duplication of fact-table-relevant attributes. Snowflake schemas as an intermediate layer (in dbt, the staging models may normalise source data) but the final reporting layer is star.

Fact table grain

The grain of a fact table is the definition of one row in that table. Every other design decision follows from the grain definition.

A fact table for "sales" could have:

- One row per transaction line item (order_id, line_item_id, product_id, customer_id, quantity, unit_price)

- One row per order (order_id, customer_id, order_total, item_count)

- One row per day per customer (customer_id, date, daily_revenue, order_count)

These are different tables answering different questions. The line-item grain supports product-level analysis, margin analysis, and promotion analysis. The order grain supports order-level analysis but cannot answer product-level questions. The daily-customer grain is a pre-aggregated summary suitable for customer analytics but cannot drill down to individual orders.

**Grain determines addivity**. A measure is fully additive if it can be summed across all dimensions. Revenue is fully additive — summing line-item revenue across all dimensions gives a meaningful total. A ratio (margin %) is not additive — you cannot sum margin percentages. A semi-additive measure (inventory balance) is additive across some dimensions (sum across locations) but not others (do not sum across time — balance at start of day is not additive with balance at end of day). Design fact tables to store additive measures at the lowest grain; compute ratios in the reporting layer.

**Choose the lowest practical grain**. A fact table at order grain cannot answer product-level questions. A fact table at line-item grain can answer both product-level and order-level questions (with aggregation). Design at the lowest grain that the business requires — you can always aggregate up, you cannot disaggregate down.

**Document the grain explicitly**. The grain definition should be written in the table's description field in your data catalog or dbt YAML. "One row per order line item" — two words that prevent months of confusion when someone queries the table and gets duplicate results because they did not know the grain.

Dimension design

Dimensions provide the context for fact table analysis. A well-designed dimension table has:

- A surrogate key (a generated integer or hash, not the source system natural key) as the primary key

- The natural key from the source system, for traceability

- Descriptive attributes that analysts use for filtering and grouping

- Proper handling of slowly changing attributes

**Surrogate keys** are necessary because natural keys from source systems are not stable. A customer number may be reused after an account is closed. A product code may change during a rebranding. A surrogate key generated by your warehouse decouples the dimensional model from source system key instability.

**Slowly changing dimensions (SCDs)** are the hardest part of dimension design. When a customer moves from New York to Los Angeles, what happens to historical sales records? Three options:

Type 1 (overwrite): Update the current row with the new attribute value. History is lost — all historical sales for this customer now show Los Angeles. Simple to implement; appropriate when historical accuracy does not matter (correcting a data entry error) or when the attribute is not used in historical analysis.

Type 2 (add row): Insert a new row with the new attribute value, mark the old row as expired with an end_date. Now every sale can be accurately attributed to the customer's location at the time of the sale. More complex to query (you must always join to the dimension at the appropriate effective date) but historically accurate. Appropriate for attributes that are used in historical analysis and where "what was true at the time" matters.

Type 3 (add column): Add a "previous value" column. Simple but limited — you can only retain one previous value per attribute, and you cannot track more than one change.

Most dimensions use Type 2 for attributes where historical accuracy matters (customer location, product category, sales territory) and Type 1 for corrections and attributes where history is irrelevant (formatting improvements, code fixes).

Conformed dimensions

A conformed dimension is a dimension table that has the same meaning, keys, and attributes across multiple fact tables. A Date dimension is the canonical example — every fact table joins to the same Date dimension on date_key, and a query joining the Sales fact table and the Inventory fact table can filter both by the same Date dimension attributes.

Conformed dimensions enable the cross-process analysis that makes a warehouse valuable. If the Customer dimension is conformed across Order, Support Ticket, and Churn fact tables, you can answer "do high-value customers have more support tickets before churning?" — a question that requires joining three fact tables on the same customer key.

Non-conformed dimensions make cross-process analysis impossible or require expensive reconciliation. Invest in conforming dimensions early; the maintenance overhead is justified by the analytical flexibility it enables.

Junk dimensions

Low-cardinality flag and indicator columns — order_is_rush, order_is_gift_wrapped, order_is_first_purchase — do not belong in the fact table as individual columns. Adding 10 flag columns to a billion-row fact table adds storage and query overhead for attributes that are rarely used in combination with each other.

A junk dimension bundles these low-cardinality flags into a single dimension table. The fact table has one foreign key to the junk dimension. The junk dimension has one row per combination of flag values that actually occurs in the data. For 10 boolean flags, the junk dimension has at most 1,024 rows (2^10 combinations). The fact table key replaces 10 columns with one.

Role-playing dimensions

A single dimension can be used multiple times in a fact table in different roles. An Order fact table may join to the Date dimension three times: for order_date, ship_date, and delivery_date. Each is a different foreign key to the same Date dimension.

In dbt, role-playing dimensions are implemented as aliases — three views on the same date dimension table with different names (dim_order_date, dim_ship_date, dim_delivery_date). The underlying data is the same; the alias is for query clarity.

For the dimensional modeling philosophy behind these patterns, see kimball vs inmon. For the dbt implementation, see dbt best practices. Our data architecture consulting practice designs data warehouse schemas from scratch and audits existing models for structural issues — book a free design review.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →