BlogData Architecture

Data Architecture Anti-Patterns: The Design Decisions That Come Back to Haunt You

Austin Duncan
Austin Duncan
Managing Director & Principal Data Architect
·May 23, 202713 min read

Most data architecture failures are not caused by choosing the wrong technology. They are caused by design decisions that seemed reasonable at the time but that scale poorly, accrete technical debt, or make the system progressively harder to maintain. This guide covers the most common anti-patterns and why they are so attractive before they become problems.

Anti-patterns in data architecture are design choices that have short-term appeal but create long-term problems. They are adopted because they solve an immediate problem, because the team does not yet have enough experience to see the downstream consequences, or because the organisational incentives reward speed over sustainability.

Understanding them is useful not because you will never make these choices — some of them are the right choice in specific circumstances — but because being able to recognise them allows you to make them deliberately and with awareness of the debt you are taking on.

The Warehouse as Application Database

**The pattern**: operational applications write directly to the data warehouse. A CRM application inserts records to Snowflake. An event tracking system writes session data to BigQuery. The application and the warehouse share a database.

**Why it is attractive**: it eliminates a data movement step. Operational data is immediately in the analytical system. No ETL, no sync lag.

**Why it fails**: data warehouses are not optimised for high-frequency, low-latency transactional writes. Write patterns from operational applications (many small inserts, frequent updates, point deletes) are antithetical to columnar warehouses optimised for analytical reads. Concurrent application writes and analytical reads compete for warehouse resources. Schema changes required for application evolution break analytical queries that depend on the schema.

The separation between operational (OLTP) systems and analytical (OLAP) systems exists for a reason. Use an operational database for operational writes; use a warehouse for analytical reads. The data movement step is the right cost.

The Monolithic Transform

**The pattern**: one giant SQL query — or one giant dbt model — that joins 15 tables, applies 40 calculated fields, and produces a wide denormalised table that serves all analytical use cases.

**Why it is attractive**: it is simple in concept. One place to go for all the data. No need to join tables in the BI layer.

**Why it fails**: monolithic transforms are impossible to test meaningfully, painful to debug when they produce wrong results, and fragile to any change in any upstream source. When the query fails, the failure is not attributable to a specific transformation step — the entire output is unavailable until the issue is identified and fixed. When a business logic change is needed, it requires modifying a query that affects all consumers simultaneously.

The modular alternative (dbt with staging, intermediate, and mart layers) takes longer to build initially but produces a system where: each transformation step is testable independently, failures are localised to specific models, and business logic changes are limited in scope.

The Schema-on-Query Anti-Pattern

**The pattern**: raw, semi-structured data (JSON, XML, API responses) is stored in the warehouse without schema definition. Queries are written to parse the structure at query time — SELECT json_field:customer_id, json_field:order_amount FROM raw_events.

**Why it is attractive**: it is fast to implement. Data arrives in the warehouse immediately without a schema definition step. No upfront schema design required.

**Why it fails**: every query must reparse the same semi-structured field, making queries slower than typed columns. Schema changes in the source silently change query results — if customer_id is renamed to cust_id in the JSON payload, queries using the old key return NULL without error. The schema is implicitly defined in dozens of separate queries rather than in one documented place.

Materialise typed columns from semi-structured sources in a staging layer. Apply the schema definition once, test that the fields are extracted correctly, and let downstream models use typed columns rather than parsing JSON on every query.

The Premature Aggregation

**The pattern**: data is aggregated at ingestion time before being stored in the analytical database. Transaction-level data is summarised to daily totals before loading; individual event records are aggregated to session-level records.

**Why it is attractive**: smaller tables are faster to query. Aggregation at ingestion time seems like an obvious optimisation.

**Why it fails**: once data is aggregated, the information it contained at the finer grain is gone. A future analytical requirement that needs transaction-level detail — individual order analysis, customer-level attribution, anomaly detection at the event level — cannot be served from pre-aggregated data. The question you do not think to ask today is often the question that becomes most important in six months.

Load raw data at the finest grain available. Aggregate in the transformation layer for consumption-layer tables. Store the aggregations, but retain the source. The raw grain is the irreplaceable asset.

The One-to-Many Join Blindspot

**The pattern**: a fact table is joined to a dimension table without verifying the cardinality of the join relationship. The dimension table turns out to have multiple rows per join key for some records, causing the fact table rows to be duplicated.

**Why it is attractive**: joins look correct in the data model diagram. The query runs without error.

**Why it fails**: the duplicated fact table rows produce inflated aggregations that are not obviously wrong — revenue appears to be 130% of actual, which is plausible enough that the error is not immediately detected. The inflated numbers flow into dashboards and decisions before anyone notices.

Always verify join cardinality before writing production queries. Count the distinct join keys in the dimension table and compare to the total rows — if they differ, the dimension has duplicates and the join will produce fan-out. Add a uniqueness test on join keys in dbt before joining.

The Undefined Grain Fact Table

**The pattern**: a fact table is built without an explicit declaration of what one row represents. Columns from multiple levels of granularity are mixed into the same table.

**Why it is attractive**: it is convenient to have all related data in one table.

**Why it fails**: columns at different grains produce nonsensical aggregations when the user does not understand the implicit structure. An order-line-level fact table that includes a ship_date column (which is an order-level attribute, not a line-level attribute) produces misleading results when order-level attributes are used as dimensions without the user understanding that they are duplicated across all lines of each order.

Declare the grain explicitly and enforce it. Every column in the fact table should be at the declared grain or be a measure that aggregates across that grain. Order-level attributes belong in an order dimension; line-level attributes belong in the fact.

Snowflaking Unnecessarily

**The pattern**: dimensions are normalised into multiple joined tables in the name of avoiding redundancy. A product dimension becomes product → subcategory → category → department, each as a separate table.

**Why it is attractive**: normalisation is good practice in OLTP systems. It eliminates update anomalies and reduces redundancy.

**Why it fails in OLAP**: analytical query engines are optimised for joining to denormalised dimension tables, not for navigating multi-level dimension hierarchies. Query complexity grows with every dimension level. BI tools generate complex SQL for queries that cross dimension hierarchy levels. The storage savings from normalisation are negligible in modern cloud storage; the query complexity costs are real.

Denormalise dimension tables unless the hierarchy has 5+ levels or changes so frequently that normalisation is warranted for update manageability.

The Vendor Lock-In Trap

**The pattern**: the data architecture is built around proprietary features of a specific vendor's platform — Snowflake-specific UDFs, BigQuery-specific ML functions, Redshift-specific data sharing features — with no consideration for portability.

**Why it is attractive**: vendor-specific features often provide capability that SQL standards do not. Using them accelerates development.

**Why it fails**: vendor pricing changes, vendor strategy shifts, and vendor capabilities evolve. An architecture that is deeply coupled to one vendor's proprietary features is expensive to migrate when any of these change. The flexibility premium of maintaining portability has a real cost, but so does the inflexibility cost when a migration becomes necessary.

Use standard SQL where it covers the requirements. Use vendor-specific features selectively for capabilities that genuinely cannot be served by standard SQL. Document vendor-specific dependencies explicitly so their cost is understood.

Our data architecture practice audits existing data architectures and designs forward-looking platforms — contact us to discuss your data architecture assessment.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →