BlogData Architecture

Data Modelling Anti-Patterns: The Mistakes That Make Warehouses Slow and Unmaintainable

Obed Tsimi
Obed Tsimi
Founder & Senior Tableau Architect
·January 31, 202712 min read

The data modelling decisions that seem reasonable at first and cause serious problems at scale — over-normalised warehouses that require too many joins, wide tables that make everything slow, fanout joins that silently inflate metrics, EAV patterns that make analysis nearly impossible, and the governance failures that allow these patterns to accumulate.

Good data modelling produces warehouses that are fast to query, easy to understand, and maintainable as the business grows. Bad data modelling produces warehouses that are slow, confusing, and expensive to change. The anti-patterns in this guide are common enough that most data teams will recognise at least a few in their current environment.

Anti-Pattern 1: Over-Normalisation in the Warehouse

Relational database normalisation principles (3NF, Boyce-Codd NF) are correct for OLTP systems — applications that do many concurrent small reads and writes. They are wrong for analytical warehouses, for a simple reason: normalisation trades storage efficiency for query complexity. Analytical queries join many tables to assemble the information they need; each join is an opportunity for incorrect cardinality assumptions and a cost to query performance.

**The symptom:** Analysts need to write 8-table joins to get a simple report. Query performance is slow because of join overhead. dbt mart models consist entirely of joining things together.

**The correct approach:** Dimensional modelling — wide, pre-joined fact tables and dimension tables that reflect the business structure, not the application database structure. A fact_orders table should have customer_name, customer_region, product_name, and product_category already joined in — not require the analyst to join to dim_customers and dim_products to get them.

The warehouse's role is to pre-join and pre-aggregate. Storage is cheap; analyst time and query latency are expensive.

Anti-Pattern 2: The Excessively Wide Table

The opposite of over-normalisation: a single table that contains all fields from all related entities, producing a table with 200+ columns, most of which are null for most rows, and with unclear grain.

**The symptom:** There is a "main" table that contains customer attributes, order attributes, product attributes, payment attributes, and marketing attributes all in one place. Row count and column count are both enormous. Query cost is high even when selecting a few columns. Column names are duplicated with different suffixes (_customer, _product, _order).

**The correct approach:** Identify the grain — what does one row represent? A table can legitimately be wide if every column is relevant for that grain. But if the table tries to be multiple tables (one row represents an order AND a customer AND a product), the grain is wrong.

Split into fact and dimension tables at the appropriate grain. The fact_orders grain is one row per order line item. dim_customers grain is one row per customer. The analyst joins them when they need both.

Anti-Pattern 3: The Fanout Join

Fanout occurs when you join a table to another table with a one-to-many relationship and then aggregate — producing inflated counts and sums because the same fact row is duplicated for each matching row in the joined table.

**The example:** A customer has 3 email campaigns. An order has 1 customer. Joining orders to campaigns produces 3 rows per order. SUM(revenue) then triples the revenue. This is a silent bug — the query runs without error but produces numbers that are 3x too large.

**The symptom:** Revenue numbers look correct for some queries and wrong for others. Analysts disagree on metric values. The data "looks right" when queried alone but "breaks" when combined with other dimensions.

The correct approach:

1. Aggregate before joining: resolve the many-to-many by aggregating the right table to the join key first, then join the aggregated result.

2. Use COUNT DISTINCT instead of COUNT: when counting entities that may appear multiple times due to a fanout join.

3. Design the data model to avoid the join altogether: pre-join at the dbt model level where the fanout can be handled correctly.

The most dangerous property of fanout bugs is that they produce confident-looking numbers that are simply wrong. They are frequently discovered months or years into a data product's life, when someone does a sanity check and finds that total revenue in the warehouse is 2.7x what the finance system shows.

Anti-Pattern 4: Entity-Attribute-Value (EAV) Tables

EAV tables store flexible key-value pairs instead of fixed columns:

entity_id | attribute_name | attribute_value

123 | country | US

123 | plan_tier | enterprise

123 | signup_date | 2024-01-15

This pattern appears in systems that need to handle arbitrary user-defined attributes or highly variable schemas. In an analytical warehouse, it is almost always a mistake.

The problems:

- Pivoting EAV into a usable form requires CASE/PIVOT expressions that are slow, tedious to maintain, and break when new attribute names are added

- Column-level security and access control cannot be applied to individual attributes — all attributes are in one column

- Data types are lost — all attribute_values are strings; numeric attributes require casting

- Query performance is poor — filtering on an attribute requires a string comparison against all rows

**The correct approach:** Model attributes as columns. If the schema genuinely needs to be flexible (user-defined custom fields), use a JSON column for the flexible part and materialise the known-important attributes as explicit columns. Most EAV implementations reflect an uncertainty about which fields will be used analytically — resolve that uncertainty and build explicit columns.

Anti-Pattern 5: Implicit Grain

A table without a documented grain is a table where every downstream user makes their own assumption about what one row represents. Those assumptions are rarely identical. Aggregations on the table produce different results depending on what grain the aggregating query assumes.

**The symptom:** Two queries on the same table produce different totals for the same metric. Investigation reveals that the queries are aggregating at different levels because neither query author knew the correct grain.

**The correct approach:** Every fact table and mart table must have a documented grain. One row in fact_orders represents one order line item. One row in fact_subscriptions represents one subscription-month. The grain is documented in the dbt model description and enforced by a unique test on the grain columns.

When a table does not have a grain test, it is easy to accidentally add duplicate rows during transformations or loads. The duplicates are invisible until they cause downstream metric inflation.

Anti-Pattern 6: Accumulating Snapshot Tables Without Grain Discipline

An accumulating snapshot table tracks the lifecycle of a business process — one row per order, updated as the order progresses through stages (placed, confirmed, shipped, delivered). This is a useful pattern. The anti-pattern is implementing it without proper grain discipline or stage completion tracking.

**The failure mode:** The order row is updated by overwriting the previous state. Historical states are lost. Questions like "how long did orders sit in the confirmed stage in Q1 2024?" cannot be answered because the confirmed date was overwritten when the order was shipped.

**The correct approach:** For accumulating snapshots, use separate date columns for each milestone (order_placed_date, order_confirmed_date, order_shipped_date, order_delivered_date) that are filled in as the milestone is reached and never overwritten. If you also need the current status, add a current_status column. Do not overwrite milestone dates.

Anti-Pattern 7: Calculation Duplication Across Models

The same business logic — the definition of "active customer," the calculation of LTV, the classification of revenue into categories — implemented in multiple dbt models independently. When the definition needs to change, it requires updating in multiple places. When different stakeholders use models with different implementations of the same concept, their metrics diverge.

**The symptom:** Two reports both show "Active Customers" but disagree. Investigation reveals that one mart model counts active as "logged in within 30 days" and another counts it as "has any usage in 30 days." Both were written by different engineers without checking existing implementations.

**The correct approach:** Implement each canonical business logic definition once, in a shared intermediate model. All downstream models reference the shared definition. When the definition needs to change, it changes in one place. This is the analytics engineering principle: business logic in the transformation layer, not distributed across the BI layer.

Anti-Pattern 8: Staging Models That Do Too Much

Staging models are supposed to be thin: rename columns, standardise types, apply minimal cleaning. They are the raw-to-standard layer. The anti-pattern is staging models that implement business logic — joining to other tables, applying complex transformations, performing aggregations.

**The problem:** When staging models do too much, the lineage becomes opaque. A downstream model references a staging model expecting clean raw data, but the staging model has already applied business logic that affects the downstream model's outputs. Debugging is harder; changes to the staging model affect more downstream models than expected.

**The correct approach:** Keep staging models to standardisation only. Business logic lives in intermediate and mart models where it is explicit, named, and traceable. The staging layer is the boundary between raw and clean; the intermediate layer is where entities are assembled; the mart layer is where analytical models are constructed.

Our data architecture consulting practice audits and redesigns data warehouse models — contact us to discuss data modelling for your environment.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →