Data Warehouse Schema Design: Star, Snowflake, and When Each Is Right

The schema design decision — star vs snowflake vs wide table — is one of the most consequential in data warehouse architecture. Each trades off query performance, storage efficiency, model maintainability, and analytical flexibility differently. Here is how to think through the decision for your specific workload and query patterns.

Schema design is one of the most consequential early decisions in a data warehouse. It determines how fast queries run, how much storage the warehouse consumes, how difficult it is to maintain the data model, and what kinds of analytical questions the warehouse can answer efficiently. Get it right and the warehouse runs fast and stays maintainable for years. Get it wrong and you inherit a performance problem that is expensive to fix once it has data in it.

The three main schema patterns — star, snowflake, and wide tables — each make different trade-offs. The right choice depends on query patterns, team sophistication, tooling, and the size and complexity of the dimensions involved.

The Star Schema

The star schema is the classic dimensional model: one fact table in the centre, dimension tables surrounding it, connected by foreign keys. The fact table holds events — transactions, order lines, sessions, log entries — at the finest grain you want to analyse. The dimension tables describe the attributes of those events: customers, products, time, geography.

A simple e-commerce star schema:

fact_orders: order_id, customer_id, product_id, date_id, store_id, quantity, revenue, cost

dim_customer: customer_id, name, email, segment, acquisition_channel, country

dim_product: product_id, name, category, subcategory, brand, cost_price, list_price

dim_date: date_id, date, year, quarter, month, week, day_of_week, is_weekend, is_holiday

dim_store: store_id, name, city, region, country, store_type, open_date

A query that joins fact_orders to dim_customer, dim_product, and dim_date can answer virtually any standard analytical question: revenue by segment by month, category performance by region, customer cohort retention.

The star schema optimises for query performance. Most analytical queries touch the fact table and one or two dimension tables. The joins are single-level — fact to dimension — with no chains. Modern columnar warehouses (Snowflake, BigQuery, Redshift) handle star schema joins efficiently because the fact table's foreign key columns are narrow and compress well.

The downside of star schema is redundancy in dimension tables. If a customer's segment changes, the current value in dim_customer is updated, and historical orders no longer reflect the segment at the time of order. Slowly Changing Dimensions (SCD) handle this — but add complexity. Star schema dimensions are also denormalised: city, region, and country attributes all live in the same dimension row rather than being normalised into separate tables. This redundancy is intentional — it eliminates joins at query time.

The Snowflake Schema

Snowflake schema normalises the dimension tables. Instead of a flat dim_product with category and subcategory columns, snowflake schema extracts them into separate dimension tables:

dim_product: product_id, name, subcategory_id, brand, cost_price, list_price

dim_subcategory: subcategory_id, subcategory_name, category_id

dim_category: category_id, category_name, department_id

dim_department: department_id, department_name

Now the product dimension is a hierarchy of normalised tables rather than a flat table with redundant attribute values.

The benefits: reduced storage (category names stored once rather than on every product row), easier updates (renaming a category updates one row in dim_category, not all product rows), and cleaner modelling of complex hierarchies.

The costs: more joins at query time. A query needing product category must join fact_orders to dim_product, then dim_product to dim_subcategory, then dim_subcategory to dim_category. For simple queries, this is negligible. For complex queries across multiple snowflaked dimensions, it adds latency and planning complexity.

In practice, snowflake schema is often overkill. Modern columnar warehouses have abundant storage, and the query join cost often exceeds the storage savings. The main legitimate use for snowflaking is when dimension hierarchies are genuinely complex and change frequently — geographic hierarchies with irregular regional structures, organisational hierarchies with many levels, or product taxonomies with dozens of levels.

Most mature data teams default to star schema with selective snowflaking only where the hierarchy complexity justifies it.

Wide Tables (One Big Table)

Wide tables — sometimes called the "one big table" (OBT) pattern — denormalise everything into a single flat table with every attribute that analysis might need. The fact_orders example becomes:

orders_wide: order_id, order_date, year, quarter, month, week, customer_id, customer_name, customer_segment, customer_country, product_id, product_name, category, subcategory, brand, store_id, store_city, store_region, quantity, revenue, cost

No joins required. Any query scans a single table. This is fast for ad-hoc analytical queries — especially in query engines designed for single-table columnar scans.

The trade-offs are significant. Wide tables are large (every attribute duplicated on every fact row). They are expensive to maintain — adding a new attribute requires rebuilding or updating every row. Slowly changing dimension logic becomes tangled — which version of the customer segment does this row reflect? And they violate normal form, creating update anomalies if the same fact can appear in multiple rows.

Wide tables work well for specific use cases: flat event logs with limited dimensional complexity, temporary analytical tables for specific projects, BI semantic layers built on top of a normalised warehouse. They are common in dbt projects where a final 'mart' table denormalises a star schema into a single wide table for BI tool consumption — the star schema maintains the source of truth, the wide table serves the BI layer.

Choosing Between Them

The decision framework:

**Default to star schema** for most production data warehouses. It is analytically flexible, well-understood by BI tools, and optimises for the query patterns that matter (multi-dimensional slicing). It is the pattern Kimball documented in the data warehouse literature and the pattern most BI tools are designed around.

**Consider snowflake schema** when dimensions have deep hierarchies (5+ levels), when dimension attribute changes are frequent and need to be tracked cleanly, or when storage cost is genuinely a concern (it rarely is in modern cloud warehouses).

**Use wide tables** for the consumption layer — the mart tables that BI tools query directly. Build the star schema as the warehouse layer, then materialise wide tables (via dbt or equivalent) for BI tool performance. Wide tables as the source of truth for a warehouse are harder to maintain; wide tables as a presentation layer on top of a warehouse are practical.

Grain Declaration

Before designing any fact table, declare the grain explicitly. The grain is the precise description of what one row in the fact table represents. "One row per order" and "one row per order line" are different grains with different implications.

Get the grain wrong and the schema produces incorrect aggregations. A common mistake: fact table contains order-level attributes (ship date, order status) and order-line-level attributes (quantity, line revenue) mixed together. When you aggregate revenue, do you get correct totals? Only if the grain is declared correctly and all columns respect it.

Grain declaration also determines which dimensions can be joined. An order-line-level fact table can join to a product dimension (products belong to lines). An order-level fact table cannot — or rather, joining to a product dimension produces a fan-out that inflates revenue. Know your grain. Put columns at the right grain. Separate grains into separate fact tables.

Slowly Changing Dimensions

The Kimball SCD framework addresses dimension attributes that change over time: a customer changes segment, a product changes category, a salesperson changes region. Three main types:

**Type 1 (overwrite):** Just update the current value. No history. Use when the historical value is not analytically meaningful.

**Type 2 (add row):** Add a new dimension row for the changed attribute, with effective start/end dates and a current flag. Historical fact rows point to the old dimension row; new fact rows point to the new one. This is the full history approach — expensive to implement and query, but necessary when you need to know "what segment was this customer in when they placed this order?"

**Type 3 (add column):** Add a "prior value" column alongside the current value column. Only tracks one prior state, but simpler to implement and query than Type 2. Use when one level of history is sufficient.

Most warehouse implementations need a mix of SCD types. Customer segment usually needs Type 2 (reporting by historical segment is meaningful). Product price usually needs Type 2 or at least Type 3. Product name usually needs Type 1 (a name correction should apply retroactively). The right type depends on whether the historical value is analytically meaningful.

Handling Multiple Fact Tables

Complex domains require multiple fact tables at different grains: fact_orders at order line grain, fact_sessions at session grain, fact_support_tickets at ticket grain. The shared dimensions (customers, products, dates) connect these fact tables — a design called a conformed dimension architecture.

Conformed dimensions allow cross-process analysis. Because fact_orders and fact_sessions share dim_customer with the same customer keys, you can join them — relating purchasing behaviour to browsing behaviour through the shared customer dimension. This is one of the core analytical capabilities of a well-designed star schema.

A design antipattern: dimension tables that are not conformed — a "customer" dimension in the orders domain with different keys and attributes than a "customer" dimension in the CRM domain. Cross-domain analysis becomes impossible without a manual key reconciliation step.

Our data architecture practice designs dimensional models for complex analytical environments — contact us to discuss your warehouse schema requirements.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →