BlogData Architecture

What Is a Data Warehouse? Architecture, Modern Options, and When to Build One

Austin Duncan
Austin Duncan
Managing Director & Principal Data Architect
·May 20, 202610 min read

A data warehouse is a central repository for structured, integrated data built for analytical querying. Here is how modern cloud data warehouses work, how they differ from data lakes and lakehouses, and the decision framework for which architecture fits your organisation.

The quick answer

A data warehouse is a central repository for structured, integrated data built specifically for analytical queries — not for the transactional workloads that operational databases handle. Modern cloud data warehouses (Snowflake, BigQuery, Azure Synapse, Redshift) have dramatically expanded what "warehouse" means: they handle semi-structured data, scale compute elastically, and cost a fraction of their on-premise predecessors. For organisations that primarily need governed, structured analytics — financial reporting, operational dashboards, business intelligence — a modern cloud warehouse remains the fastest path to production analytics value.

How a data warehouse works

A data warehouse aggregates data from multiple source systems — CRM, ERP, operational databases, SaaS tools, event streams — transforms it into a consistent, integrated structure, and stores it in a format optimised for analytical queries.

The key design characteristics of a warehouse that distinguish it from an operational database:

**Read-optimised, not write-optimised.** Operational databases (OLTP) are designed for high-frequency, low-latency read/write operations: an e-commerce transaction, a CRM record update, a user authentication event. A data warehouse is designed for analytical read performance: queries that scan millions of rows, join across many tables, and aggregate to summary metrics. The storage format, indexing strategy, and query optimiser are all tuned for this read pattern.

**Integrated across source systems.** Operational databases contain data from one system. A warehouse integrates data across many systems into a unified model — so that a query can join customer data from the CRM with transaction data from the ERP and product data from the inventory system without the analyst needing to know where each dataset originally lived.

**Historical.** Operational systems typically maintain current state (what the record is now). A warehouse maintains history — the full time series of changes — so that questions like "what was our customer count six months ago?" or "what was the daily revenue trend over the last two years?" can be answered from a single system.

**Subject-oriented.** Warehouse data is organised around business subjects — Customer, Product, Order, Financial Account — rather than around operational processes. This subject-orientation is what allows cross-functional analysis.

Modern cloud data warehouse architecture

The on-premise data warehouse era — Oracle Exadata, Teradata, IBM Netezza — required significant hardware investment, fixed capacity, and dedicated DBA teams. Modern cloud warehouses have changed this fundamentally.

**Snowflake** separates storage and compute entirely. Data is stored in cloud object storage (S3, Azure Blob, GCS); Virtual Warehouses are independent compute clusters that query it. You pay for storage per TB and compute per second of active use. Warehouses auto-suspend when idle, eliminating the "running 24/7 at full capacity" cost model of on-premise.

**Google BigQuery** uses a serverless model — there are no warehouses to configure, no clusters to manage. You run a query; BigQuery allocates compute automatically. Pricing is per TB of data scanned (on-demand) or per slot reservation (fixed capacity). BigQuery's separation of storage from compute is even more extreme than Snowflake's — there is no infrastructure concept at all for the analyst.

**Azure Synapse Analytics** is Microsoft's integrated analytics service, combining a SQL warehouse (formerly SQL Data Warehouse) with Spark capabilities and deep integration with Azure Data Factory, Power BI, and Azure Data Lake Storage.

**Amazon Redshift** is AWS's cloud warehouse, with both a provisioned cluster model and a serverless option. Native integration with the AWS ecosystem (S3, Glue, SageMaker).

All four platforms support standard SQL, connect natively to major BI tools (Tableau, Power BI, Looker), and handle datasets that would have required dedicated hardware clusters in the on-premise era.

Data warehouse vs data lake vs data lakehouse

These three architectures are often confused. They solve related but distinct problems.

**Data warehouse**: Structured, integrated data optimised for SQL analytics. High query performance, strong governance, schema-on-write (data is structured before it enters the warehouse). Limited flexibility for unstructured data, ML workloads, and raw data preservation.

**Data lake**: Raw data storage at scale on cloud object storage (S3, ADLS, GCS). Any data format, any structure, no schema enforcement at ingestion. Low cost, high flexibility, poor query performance without additional compute, limited governance.

**Data lakehouse**: Combines data lake storage economics with warehouse query performance and governance. Uses open table formats (Delta Lake, Apache Iceberg) to add ACID transactions, schema enforcement, and query optimisation on top of object storage. Serves both SQL analytics and ML workloads from the same storage layer.

For a detailed comparison, what is a data lakehouse covers the architecture and the decision framework for when a lakehouse is the right choice versus a managed warehouse.

When a data warehouse is the right choice

**Your analytics workloads are primarily SQL-based.** If your team's primary work is structured analytics — financial reporting, operational dashboards, ad-hoc SQL analysis — a cloud warehouse provides the fastest path to production and the best SQL query performance.

**Governance and compliance are primary requirements.** A cloud warehouse's schema-on-write model enforces structure at ingestion. Data governance controls are mature and well-supported. For organisations in financial services, healthcare, or other regulated industries where data quality and access control are non-negotiable, a warehouse's structured approach reduces compliance risk.

**Your BI tools need fast, reliable query performance.** Tableau, Power BI, and Looker are designed to query structured data in warehouses. The query patterns BI tools generate — aggregations, joins, filters on specific dimensions — are exactly the workloads that warehouse query optimisers are tuned for. A warehouse purpose-built for SQL analytics consistently outperforms a general-purpose platform for BI workloads.

**You do not have significant ML/AI engineering requirements.** If your primary analytics use case is business intelligence reporting rather than ML model training, the additional capability of a lakehouse platform (Spark, MLflow, Delta Lake) is complexity you do not need. A managed cloud warehouse is simpler to operate and sufficiently capable.

When to consider a lakehouse instead

A data lakehouse becomes the stronger choice when:

- You have significant ML and AI engineering requirements alongside BI analytics

- You need to store and process semi-structured or unstructured data at scale

- You want to avoid proprietary storage formats and maintain portability

- Your transformation workloads include complex Python or Spark logic beyond SQL

For most mid-market organisations, a cloud warehouse is sufficient. For large enterprises with both significant analytics and ML workloads, a lakehouse architecture (often with a dedicated warehouse like Snowflake on top) better serves the full workload range.

The modern data warehouse stack

A production-ready cloud data warehouse environment typically includes:

**Ingestion layer** — Fivetran, Airbyte, or Azure Data Factory moving data from source systems to the warehouse staging area on a scheduled or event-driven cadence.

**Transformation layer** — dbt running inside the warehouse, transforming staged raw data into governed data models through Bronze/Silver/Gold (or staging/marts) layers. Tests validate data quality. Documentation is auto-generated. Lineage is tracked.

**Warehouse** — Snowflake, BigQuery, Synapse, or Redshift hosting the governed, analytics-ready data.

**Semantic layer** — dbt metrics, Power BI datasets, or a dedicated semantic layer tool providing canonical business metric definitions above the warehouse tables.

**BI layer** — Tableau, Power BI, or Looker connecting to the semantic layer for dashboards, reports, and ad-hoc analysis.

**Orchestration** — Apache Airflow, Prefect, or Azure Data Factory coordinating pipeline schedules and dependencies.

This stack is well-understood, well-documented, and supported by a large ecosystem of consultants and tooling vendors. Deviations from it require deliberate justification.

Data warehouse design: key decisions

**Dimensional modelling vs entity-centric modelling.** Traditional data warehouse design follows dimensional modelling (Ralph Kimball's methodology): fact tables containing measurable events, surrounded by dimension tables containing context (who, what, where, when). Modern approaches using dbt favour entity-centric models: one table per business entity (Customer, Order, Product), modelled as Type 2 slowly changing dimensions when history matters. Both approaches work; dimensional modelling is more familiar to analytics engineers trained in the Kimball tradition, entity-centric is more natural for dbt-based development.

**Incremental vs full refresh.** Most warehouse tables should load incrementally — only appending or updating rows that have changed since the last load, rather than replacing the full table on every run. Incremental loading reduces compute cost, reduces pipeline runtime, and is required for tables with billions of rows that cannot be fully rebuilt in a reasonable time window. dbt's incremental model pattern handles this cleanly. Full refresh is acceptable for small dimension tables but should not be the default for fact tables.

**Partitioning and clustering.** Warehouse query performance on large tables depends heavily on how data is partitioned and clustered. Tables that are commonly filtered by date should be partitioned by date — so queries that filter to a single month scan only that month's partition rather than the full table. Tables commonly filtered by a business dimension (region, product category) should be clustered on that dimension. These are performance decisions that should be made at design time, not retrofitted after the warehouse is built and slow.

FAQs

Do we still need a data warehouse if we have a data lake?

A data lake without a warehouse layer means analysts are querying raw, unstructured data with no schema enforcement, no governance, and poor performance. Most organisations that built data lakes without a warehouse layer added one later — either a traditional warehouse on top, or moved to a lakehouse pattern. The data lake's value is storage flexibility and cost; the warehouse's value is query performance and governance. Combining them (through a lakehouse or a warehouse-on-lake architecture) gives you both.

How much does a cloud data warehouse cost?

Cloud warehouse costs depend on data volume and query frequency. A mid-market organisation with 500GB–5TB of warehouse data and a team of 20–50 analysts typically spends $2,000–$8,000 per month on Snowflake or BigQuery, depending on query optimisation and warehouse configuration. Costs grow quickly without proper governance: warehouses that run continuously rather than auto-suspending, queries that scan full tables rather than partitions, and storage of data that should be archived are the most common cost drivers. Properly configured, cloud warehouses are materially cheaper than the on-premise infrastructure they replace.

Should we use Snowflake, BigQuery, Synapse, or Redshift?

For Azure-first organisations with a Microsoft stack, Synapse or Snowflake (which runs natively on Azure) are the natural choices. For GCP-first organisations, BigQuery is the obvious default. For AWS-first organisations, Redshift or Snowflake. For organisations without strong cloud platform commitments, Snowflake's portability (runs on Azure, AWS, and GCP) and its proven SQL performance make it the most flexible choice. See our detailed comparison in snowflake vs databricks for the Snowflake/Databricks decision specifically.

When should we migrate from an on-premise warehouse to the cloud?

The economic case for cloud migration is typically clear — cloud warehouses are cheaper than on-premise hardware at equivalent scale, especially when including hardware refresh cycles, DBA staffing, and data centre costs. The right time to migrate is when your current on-premise contract is coming up for renewal, when you are hitting capacity constraints, or when the analytics capabilities you need (elastic scaling, modern SQL features, BI tool integration) are not available on your current platform. See azure data architecture best practices for the migration approach.

Our data architecture consulting practice has designed and built data warehouses across Snowflake, BigQuery, and Synapse for mid-market and enterprise organisations. If you are evaluating a warehouse platform, planning a migration, or trying to understand why your current warehouse is costing more than expected, book a free 30-minute audit and we will give you a direct assessment.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →