Data Modeling Best Practices: Principles That Survive Contact with Production

Data models that work in the prototype almost never survive production unchanged. Here are the principles that produce data models analysts can trust, engineers can maintain, and the business can evolve.

The quick answer

Good data models are not just technically correct — they are designed for the people and systems that will use them. A model that produces accurate numbers but is incomprehensible to analysts, requires full rebuilds for every minor change, or cannot be maintained by anyone except its original author is not a good model. The best practices in this guide address the full lifecycle: designing for analyst usability, building for maintainability, and structuring for evolvability. These apply whether you are designing a dimensional model for a data warehouse or a dbt project.

Principle 1: Name things for their consumers, not their origins

The most common data modeling mistake is naming columns and tables after their source system — customer_crm_external_id, order_salesforce_amount, product_erp_sku. This makes sense to the engineer who built the model; it is actively unhelpful to the analyst who needs to understand what the data means.

Name tables and columns for what they represent in the business domain, not where they came from:

**Tables**: customers (not salesforce_accounts), orders (not shopify_orders), revenue (not billing_mrr_snowflake_finance). The source system is an implementation detail.

**Columns**: customer_id (not sf_account_id or crm_customer_id), order_total (not order_subtotal_before_tax_usd), created_date (not created_at_utc_epoch).

**Boolean columns**: is_active, has_subscription, is_churned — not active_flag or churn_status with string values 'Y'/'N' or 'CHURNED'/'ACTIVE'.

The test: can an analyst who has never seen the model understand what every column means from its name alone? If not, rename.

Principle 2: Define one source of truth for each concept

A data model with two tables that both claim to be "the customer list" but produce different customer counts depending on which you query is a trust-destroying model. One concept, one table, one authoritative definition.

**Fact tables define metrics**: the fct_orders table is the authority for order count, order value, and order timing. Analysts building order-level analysis use this table. Any other table that claims to provide order totals is either derived from this one (and is clearly marked as such) or is wrong.

**Dimension tables define entities**: the dim_customers table is the authority for customer attributes — name, segment, acquisition date, status. Every table that needs customer attributes joins to this one; it does not embed customer attributes redundantly.

**Avoid redundant aggregates**: do not create three separate summary tables (monthly_revenue_by_region, monthly_revenue_by_product, monthly_revenue_by_sales_rep) that are each aggregations of the same fct_orders source. Create one fact table; let BI tools aggregate. If BI performance requires pre-aggregated tables, create one fct_orders_monthly with all possible dimensions and derive all summaries from it.

Principle 3: Model at the grain of the business question, not the grain of the source

The grain of a fact table is the level of detail it contains — one row per order, one row per order line item, one row per daily customer-product combination. Choosing the wrong grain is one of the most expensive mistakes in dimensional modeling.

**Too granular**: a fact table at transaction-line-item grain with 50 columns of dimensional attributes causes query problems — every question requires aggregation across many rows, BI tools generate complex SQL, and analysts constantly wonder whether they are double-counting.

**Too aggregated**: a fact table at monthly-customer grain answers monthly customer questions efficiently but cannot answer daily or transaction-level questions without a separate model.

**Choose the grain based on the most granular question you need to answer regularly**: for an e-commerce analytics use case, order-level grain (one row per order) handles most questions. Add an order_line table at line-item grain for product-level analysis. Do not combine the two in one table.

**Document the grain explicitly**: in dbt, the primary key test (unique + not_null on the PK column) both tests and documents the grain. The PK name itself documents the grain — if the PK is order_id, the grain is one row per order.

Principle 4: Use surrogate keys for dimensions

Natural keys (source system IDs) from operational systems are fragile as dimension keys:

- They can change (a CRM migrates customer IDs to a new system)

- They can collide across sources (customer_id 1234 in Salesforce is different from customer_id 1234 in NetSuite when you merge the two into one dim_customers)

- They may contain PII (email addresses used as natural keys in some systems)

Surrogate keys (hash-based or integer-based keys generated by the data model) decouple the dimension table from source system changes. When the source customer ID changes, the dimension's surrogate key stays stable; only the source key field updates.

In dbt, generate_surrogate_key (from the dbt_utils package) creates a consistent hash-based surrogate key from one or more natural key columns. The function produces a deterministic key — the same input always produces the same key, enabling stable joins across separate pipelines.

Principle 5: Handle slowly changing dimensions deliberately

Dimension attributes change over time — customers change their region, products change their category, sales reps change their territory. The question is: when history is queried, should it show the attribute as it was at the time of the event, or as it is now?

**Type 1 (overwrite)**: update the attribute in place. History queries show the current value, not the value at event time. Appropriate for attributes where historical accuracy is not needed (correcting a data error, updating a typo).

**Type 2 (add row)**: create a new row for each change, with effective date and end date columns. History queries can join to the dimension at the effective date to get the attribute value at the time of the event. Required for attributes where historical accuracy matters — sales rep territory at the time of the sale, customer segment at the time of a campaign.

**Type 3 (add column)**: add a "previous value" column alongside the current value column. Simple to implement; only tracks one level of history. Rarely the right choice for complex slowly changing dimension scenarios.

Most production dimensional models have a mix of Type 1 and Type 2 attributes in the same dimension. In dbt, the dbt-snapshot feature implements Type 2 SCD automatically — tracking changes to source data and maintaining a history of versions.

Principle 6: Separate mart layer from staging layer

The mart layer (fct_, dim_ tables in dbt) is what analysts query. The staging layer (stg_ tables) is the transformation buffer between raw source data and the mart layer. Keep them strictly separated:

**Staging layer**: one-to-one with source tables. Renames columns to business-friendly names. Casts types. No business logic. No joins.

**Mart layer**: joins staging models together. Applies business logic. Implements slowly changing dimensions. Aggregates to the right grain. This is where the model's complexity lives.

When source systems change, only the relevant staging model needs updating. The mart layer, which joins staging models, is insulated from source changes. This is the primary maintainability benefit of the layer pattern.

Principle 7: Test every primary key, every foreign key

Data models without tests erode trust the first time they produce incorrect results. Every primary key must be tested for uniqueness and non-null — a primary key with duplicates means every count of that entity double-counts. Every foreign key must be tested for referential integrity — an order with a customer_id that does not exist in dim_customers produces null joins and silent under-counting.

In dbt, these are two-line test definitions in schema.yml. The cost is minimal; the protection is significant. Make them non-negotiable.

For the dimensional modeling foundations these practices build on, see kimball vs inmon. For the dbt implementation of these principles, see dbt best practices. For the testing framework, see data warehouse testing.

Our data architecture consulting practice designs and audits data models — from grain and dimension design through dbt project structure and testing framework. If your data model is producing inconsistent numbers or is difficult to maintain, book a free 30-minute audit.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →