BlogData Architecture

What Is Data Warehouse Architecture? Layers, Zones, and Design Patterns

Austin Duncan
Austin Duncan
Project Manager & Data Strategist
·June 2, 202811 min read

Data warehouse architecture defines how data flows from source systems through transformation layers to analytical consumption. This guide explains the canonical layers of a modern warehouse, common design patterns, and the decisions that shape warehouse architecture in practice.

Data warehouse architecture defines how data flows from source systems through transformation layers to analytical consumption. The architecture determines what the warehouse can do reliably, how quickly data becomes available for analysis, and how much engineering effort it takes to maintain.

Most organizations that struggle with their data warehouse do not have a data quality problem or a tooling problem. They have an architecture problem: layers are missing, responsibilities are unclear, and data flows in ways that make the warehouse brittle, slow, or untrustworthy. Understanding the canonical architecture clarifies what needs to exist and why.

The Three Core Layers

A well-designed data warehouse has three logical layers, each with a distinct purpose:

### Layer 1: Ingestion (Raw / Landing)

The ingestion layer receives data from source systems exactly as it arrives — without transformation. Its job is to get data from operational systems into the warehouse quickly and completely, not to make it correct or useful.

Common characteristics of the raw layer:

- Schema mirrors the source system schema (sometimes with additions for pipeline metadata like load timestamp and source name)

- No business logic applied — calculated fields, joins, and filters belong in downstream layers

- Append-only or snapshot-based — preserves history even when source systems overwrite or delete records

- High volume, low transformation cost — the engineering investment is in reliability and completeness, not modeling

Why the raw layer matters: transformation logic breaks. When a transformation assumption turns out to be wrong, the raw layer provides the immutable source of truth to re-process from. Without it, fixing a transformation error requires re-extracting from the source — which may not be possible for systems that do not preserve history.

### Layer 2: Transformation (Staging / Intermediate / Marts)

The transformation layer is where raw data is cleaned, joined, aggregated, and modeled into structures useful for analysis. In the modern data stack, this layer is typically implemented in dbt.

The transformation layer typically has internal sub-layers:

**Staging models** — one-to-one with source tables, applying basic cleaning: standardizing column names, casting data types, filtering obvious garbage records, renaming fields to business-friendly names. Staging models do not join tables.

**Intermediate models** — joining staging models to build unified entities. A customer model that joins CRM data with product usage data with payment history. An order model that joins orders with order lines with products with customers.

**Mart models** — business-domain-specific aggregated models optimized for specific analytical use cases. A finance mart with period-over-period revenue calculations. A customer success mart with health score and renewal probability. A marketing mart with attribution and campaign performance.

Each layer builds on the previous, and each layer can be tested independently. When a mart model produces wrong numbers, you can trace the error upstream through intermediate and staging models to identify where the logic breaks.

### Layer 3: Consumption (Semantic Layer / BI)

The consumption layer is where data is served to analytical consumers — BI tools, data science notebooks, operational systems via reverse ETL, APIs.

For BI consumption, the consumption layer often includes a semantic layer: a metadata layer that defines business metrics in a way that is tool-agnostic and enforced consistently. Revenue is defined once, in one place, with one formula, one currency conversion, one fiscal calendar mapping. Every BI tool, every dashboard, every query that asks "what was revenue in Q3?" gets the same answer.

Semantic layer tools include dbt Metrics, Looker LookML, Cube.dev, AtScale, and Tableau Prep calculated fields (more limited). Without a semantic layer, metric definitions scatter across individual BI workbooks, creating the inconsistency problem where different dashboards show different numbers for the same metric.

Common Architecture Patterns

### Star Schema

The star schema is the foundational dimensional modeling pattern: a central fact table containing quantitative measures and foreign keys, surrounded by dimension tables containing descriptive attributes.

A sales fact table might contain: order_id, customer_id, product_id, date_id, salesperson_id, quantity, revenue, discount. Each foreign key joins to a dimension table (customer dimension with name, segment, industry, region; product dimension with name, category, price; date dimension with fiscal periods, quarters, year-over-year comparators).

Star schemas optimize for BI query performance: simple joins, pre-aggregated fact tables, dimension tables small enough to hold in memory. They denormalize aggressively — the same customer attributes appear on every order row — trading storage efficiency for query simplicity.

### Snowflake Schema

A snowflake schema normalizes dimension tables, splitting them into hierarchies. Instead of a single customer dimension, there is a customer table, a city table, a region table, and a country table, all joined in a hierarchy.

Snowflake schemas save storage and eliminate redundancy. They also make queries more complex. For most BI workloads, the query complexity cost exceeds the storage savings benefit. Star schemas are preferred for BI-optimized warehouses; snowflake schemas are more common in normalized operational environments.

### Data Vault

Data Vault is a modeling methodology designed for auditability and historical tracking. Its core elements are hubs (business keys), links (relationships between hubs), and satellites (attributes and their history).

Data Vault is well-suited for regulatory environments where audit trails, data lineage, and historical accuracy are critical requirements. It is significantly more complex to implement and query than dimensional models and is generally not appropriate for organizations whose primary requirement is analytical agility rather than regulatory compliance.

### OBT (One Big Table)

The OBT pattern denormalizes everything into a single wide table with hundreds of columns. All joins have been pre-computed; every analytical query is a simple scan with filters and aggregations on a single table.

OBT trades storage for query simplicity and performance. At petabyte scale in columnar storage, the storage cost is often acceptable. OBT is common in product analytics use cases where the primary grain is a user event and all relevant attributes can be joined at load time.

Medallion Architecture (Bronze / Silver / Gold)

The medallion architecture — popularized by Databricks and the lakehouse pattern — maps onto the three layers described above:

- **Bronze** corresponds to raw ingestion: unmodified data as it arrived from sources

- **Silver** corresponds to staging and intermediate transformation: cleaned, deduplicated, joined data

- **Gold** corresponds to mart and consumption models: aggregated, business-domain-specific, optimized for analytical consumption

The naming convention is different but the logical structure is the same. Organizations using warehouse technologies (Snowflake, BigQuery) and those using lakehouse technologies (Delta Lake, Iceberg) are often implementing the same architectural pattern with different tooling.

Architecture Decisions That Shape Everything

**Full refresh vs. incremental models.** Full-refresh models reprocess the entire source table on every run. Incremental models process only new or changed records. For small tables, full refresh is simpler. For large tables, incremental processing is necessary for acceptable run times — but requires identifying changed records accurately (usually via updated_at timestamps or CDC).

**Batch vs. streaming ingestion.** Batch ingestion (daily, hourly) is simpler to implement and sufficient for most analytical use cases. Streaming ingestion (Kafka, Kinesis, Pub/Sub) is necessary when analytical latency requirements are measured in seconds or minutes rather than hours. Most organizations overestimate their real-time requirements and underestimate the operational complexity of streaming pipelines.

**Single warehouse vs. multiple zones.** Some organizations separate development, staging, and production environments within the same warehouse. Others use entirely separate warehouse instances. The governance choice affects who can read and write to which schemas, how code is promoted from development to production, and how costs are attributed.

Our data architecture practice designs and implements data warehouse architectures from initial assessment through schema design through production deployment. Contact us to discuss your data warehouse architecture requirements.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →