Data Warehouse vs Data Lake: The Right Architecture for Your Organisation

A clear-eyed comparison of data warehouses and data lakes — what each is actually for, where the lakehouse fits, the workloads that belong in each, and the architecture decision framework that prevents organisations from buying the wrong platform for the wrong problem.

The data warehouse vs data lake debate has been running for a decade and generated more confusion than clarity. The confusion comes from two sources: marketing language that overstates each technology's capabilities, and genuine architectural evolution that has changed what each system is good for. This guide cuts through both.

What a Data Warehouse Actually Is

A data warehouse is a structured, governed repository for analytical data. It stores data in a schema-on-write model — the structure is defined when data is loaded, not when it is queried. Data in a warehouse has been cleaned, transformed, and organised according to a data model (typically dimensional: fact tables and dimension tables).

The defining characteristics:

**Structured data only.** Warehouses handle tabular data with defined columns and types. Semi-structured data (JSON, XML) can be stored but requires special handling.

**SQL query interface.** Business users and analysts interact with warehouses using SQL or BI tools that generate SQL. No specialised query interface is required.

**Performance-optimised for analytical queries.** Columnar storage, materialised views, zone maps, and parallel query execution make warehouses fast for aggregations, joins, and analytical scans across large datasets.

**Governed.** Access control, data quality rules, and certified metric definitions are maintained by the data team. The warehouse is the source of truth for business reporting.

Cloud warehouses — Snowflake, BigQuery, Redshift, Databricks SQL — are the modern implementations. They share the core characteristics of structured data, SQL interface, and analytical optimisation, while adding cloud-native scalability and separation of storage from compute.

What a Data Lake Actually Is

A data lake is an object storage system (S3, GCS, Azure Data Lake Storage) that holds data in raw or minimally processed form. The defining characteristics:

**All data types.** Data lakes store structured data (CSV, Parquet), semi-structured data (JSON, Avro), unstructured data (images, audio, documents), and binary data. No schema is required to store data.

**Schema-on-read.** Structure is defined when data is queried, not when it is stored. The same file can be queried with different schemas by different consumers.

**Cheap storage at scale.** Object storage costs $0.02–0.03/GB/month, compared to $0.02–0.025/GB for warehouse managed storage. For petabyte-scale data, this difference is substantial. Data that is rarely accessed but must be retained is cheap to store in a lake.

**Requires engineering to query.** Raw data in a lake requires schema definition, partition management, and query engine configuration (Athena, Spark, Trino) to be analytically useful. Business users cannot query a data lake directly without significant abstraction.

The promise of data lakes — store everything, query anything — frequently became "store everything, query nothing" in practice. Petabytes of raw data with no schema, no governance, and no analytical model is not an asset; it is a liability.

The Lakehouse: Convergence

The lakehouse architecture — pioneered by Databricks (Delta Lake) and later adopted by Apache Iceberg and Apache Hudi — attempts to combine the best properties of both:

**Open table formats (Delta Lake, Iceberg, Hudi)** add transactional guarantees, schema enforcement, and time travel to object storage. Delta Lake, for example, stores data as Parquet files in S3 but maintains a transaction log that provides ACID semantics, schema evolution, and time travel queries.

**Multiple engine access.** The same Delta Lake or Iceberg table can be queried by Spark (for ML and data engineering), Trino (for federated SQL), Snowflake (via external table), BigQuery (via BigLake), and Athena (via Iceberg table). One storage layer, multiple compute engines.

**BI-grade query performance.** With proper partitioning, Z-ordering (ClickHouse-style data clustering for Delta Lake), and statistics collection, lakehouse tables can approach warehouse-level query performance for analytical workloads.

The lakehouse does not eliminate the distinction between a data lake and a data warehouse — it narrows it. A well-managed lakehouse with governed tables, clean schemas, and consistent metric definitions is functionally equivalent to a data warehouse for analytical purposes. A poorly managed one is a data lake under a different name.

Which Workloads Belong Where

Data warehouse is the right choice for:

- Business reporting and dashboards (revenue, KPIs, operational metrics)

- Self-serve analytics for business users

- Governed, certified analytical content used for decisions

- Workloads where SQL is the primary interface and all data is structured

- Organisations without data engineering expertise to manage a lake

Data lake is the right choice for:

- Long-term storage of raw data at petabyte scale (cheaper than warehouse storage)

- Unstructured and semi-structured data that cannot go into a warehouse (logs, images, audio, documents)

- ML training data, feature stores, and model artefacts

- Archive storage for data that must be retained but is rarely queried

- Staging area before transformation and loading to a warehouse

Lakehouse is the right choice for:

- Organisations that need both ML/data science and business analytics on the same data

- Teams that want Spark for heavy transformations and SQL for analytics on the same tables

- Databricks-native environments where Delta Lake is the default table format

- Organisations with petabyte-scale data that is too large or expensive to copy to a warehouse

The Architecture Decision Framework

Rather than "warehouse or lake?", the right question is: "what does each type of data need to enable, and what is the cost and complexity of achieving it in each environment?"

**Raw operational data (logs, events, application records):** Start in a lake (cheap storage, no schema required). Process and model into a warehouse for analytics. Or use a lakehouse table format to skip the warehouse copy step.

**Structured business data (transactions, CRM records, financial data):** Warehouse natively. These are SQL-shaped, governance-required, business-critical datasets that belong in a structured, governed environment.

**Large-scale historical data with infrequent query needs:** Lake or lakehouse. Storing 5 years of event history in a data warehouse is expensive. Cold storage in S3 with occasional Athena or Trino queries is the right trade-off.

**ML training data and feature pipelines:** Lake or lakehouse. ML workflows require Python, Spark, and custom compute environments that are not warehouse-native. Feature stores read from lake storage.

**Most analytical BI workloads:** Warehouse. Period. The complexity of operating a lake for workloads that do not need it is rarely justified by the cost saving.

The majority of enterprise data teams need a cloud data warehouse, not a data lake. A lakehouse is the right architecture for organisations with both ML/engineering-heavy workloads and business analytics requirements. A pure data lake — without warehouse-quality governance — is the right answer for almost no one building a production analytics function.

Our data architecture consulting practice helps organisations choose the right platform architecture for their specific workload profile — contact us to discuss your data infrastructure requirements.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →