BlogData Architecture

Data Lakehouse vs Data Warehouse: Which Architecture Should You Use?

Austin Duncan
Austin Duncan
Managing Director & Principal Data Architect
·June 23, 202611 min read

The data lakehouse pattern is reshaping how organisations think about analytics infrastructure. Here is when a lakehouse is the right choice versus a traditional cloud data warehouse.

The quick answer

A data warehouse stores structured, processed data optimised for SQL analytics and BI. A data lakehouse stores structured and unstructured data in open file formats on cloud object storage, providing warehouse-like SQL analytics on top — combining the scale and flexibility of a data lake with the performance and governance of a data warehouse. The lakehouse pattern is increasingly dominant for new platform builds at organisations with diverse data types (structured tables, semi-structured JSON, unstructured files, ML feature sets). For organisations running standard SQL analytics on structured relational data, a managed cloud warehouse (Snowflake, BigQuery, Redshift) remains simpler and often less expensive.

What a data warehouse is

A cloud data warehouse (Snowflake, BigQuery, Redshift, Azure Synapse) is an optimised analytics database — columnar storage, MPP query execution, automatic query optimisation, a full SQL interface, and tight integration with BI tools. Data is ingested, cleaned, and stored in warehouse-native formats. The user interface is SQL. The governance model is native database access control. The scale is large (hundreds of terabytes to petabytes) but the data types are structured.

**Strengths**: mature SQL engine, excellent BI tool integration, simple governance model, low operational overhead (fully managed), strong ACID transaction support, predictable performance.

**Limitations**: storage cost is higher than object storage (2–10x depending on platform and compression); limited support for unstructured data; proprietary formats create vendor lock-in; ML/AI workloads that need Python access to raw data require exporting from the warehouse; managing large volumes of semi-structured data (JSON, Parquet) is less efficient than in a lakehouse.

What a data lakehouse is

A data lakehouse separates storage from compute and uses open table formats (Delta Lake, Apache Iceberg, Apache Hudi) on top of cloud object storage (S3, Azure Data Lake Storage, GCS). The table format provides the warehouse-like features — ACID transactions, schema enforcement, time travel, partitioning, statistics — on top of files that any compute engine can read.

**Delta Lake** (Databricks-origin, now open source via the Linux Foundation): the dominant open table format, especially in Databricks environments. Delta Lake adds ACID transactions, time travel (query as of a specific version or timestamp), schema enforcement, and optimised read performance (data skipping, compaction) to Parquet files on object storage.

**Apache Iceberg** (Netflix-origin, now Apache Foundation): a competing open table format with strong support from Snowflake, AWS, Google, and the broader ecosystem. Iceberg has different design trade-offs from Delta (better partition evolution, more vendor-neutral governance) and is gaining adoption as the multi-engine standard.

**Apache Hudi** (Uber-origin): optimised for incremental upsert workloads (streaming CDC ingestion). Less widely adopted than Delta or Iceberg for greenfield projects.

**Compute engines**: the lakehouse model separates storage from compute — any engine that can read the table format can query the data. Databricks SQL, Apache Spark, Presto/Trino, Snowflake External Tables, BigQuery Omni, DuckDB, and Apache Flink can all read Delta or Iceberg tables. This multi-engine flexibility is a core lakehouse advantage.

Architecture comparison

**Storage**: warehouse stores data in proprietary formats (Snowflake's micro-partitions, BigQuery's Capacitor). Lakehouse stores data as open Parquet files on S3/GCS/ADLS — accessible by any engine without export.

**Cost**: object storage (S3 Standard) costs ~$23/TB-month vs Snowflake storage at ~$23/TB-month (same for compressed data) or BigQuery at $20/TB-month. For active query data with frequent compute, total costs are comparable. For cold archival data that is queried occasionally, object storage is dramatically cheaper — the lakehouse model retains it all in one place.

**Data types**: warehouse SQL engines are optimised for structured tabular data. Lakehouses handle structured tables, semi-structured (JSON, XML), and can store unstructured files (ML models, raw text, images) in the same storage system. Data science and ML teams access the same storage without a separate data lake.

**Transactions**: both support ACID transactions. Warehouse ACID is native. Lakehouse ACID is provided by the table format (Delta, Iceberg) — you must use a format that supports it; raw object storage has no ACID guarantees.

**Governance**: warehouse governance uses native access control (role-based access, column-level security, row-level security). Lakehouse governance requires a catalogue layer — Apache Atlas, AWS Glue Data Catalog, Databricks Unity Catalog, or Snowflake Horizon Catalog — to enforce access control across files. Warehouse governance is simpler to implement; lakehouse governance is more complex but more portable.

The medallion architecture

Most lakehouse implementations use the medallion (or multi-hop) architecture to organise data quality tiers:

**Bronze layer**: raw ingest. Data lands in the bronze layer as-is from source systems — no transformation, no quality enforcement. This is the system of record for raw data.

**Silver layer**: cleaned, validated, conformed data. Duplicates removed, schemas enforced, business keys standardised, data quality rules applied. Silver is the trusted source for downstream consumption.

**Gold layer**: business-domain-specific aggregations and data models — equivalent to data warehouse mart models. Optimised for BI queries and ML feature engineering.

This is equivalent to the dbt staging/intermediate/mart layer pattern, applied at the storage tier. Many organisations use dbt on top of a lakehouse to implement the silver and gold layers.

When to choose a data warehouse

Choose a cloud data warehouse (Snowflake, BigQuery, Redshift) when: your data is primarily structured relational data; your primary consumers are BI analysts using SQL and dashboards; you want the simplest possible governance model; your team does not have data engineering capacity to manage lakehouse infrastructure; and your data volumes are not so large that warehouse storage costs are prohibitive.

When to choose a data lakehouse

Choose a lakehouse architecture when: you have significant semi-structured or unstructured data alongside structured tables; your ML and data science teams need direct access to raw data without warehouse export; you want vendor portability (open formats that are not locked into a specific compute engine); your data volumes are very large and you want to minimise storage costs for cold data; or you need to run multiple compute engines (SQL analytics, Spark for ML, streaming) on the same data.

The converging middle

The distinction between warehouse and lakehouse is narrowing. Snowflake and BigQuery both support external tables on Delta and Iceberg, allowing warehouse SQL engines to query lakehouse-format data. Databricks' lakehouse SQL engine increasingly matches traditional warehouse performance for BI workloads. Many organisations run a hybrid: a Databricks or Apache Iceberg lakehouse as the storage and engineering layer, with a Snowflake or BigQuery warehouse for BI consumption — using technologies like Snowflake Iceberg Tables or BigQuery Bigtable to bridge the two without data duplication.

For the specific table format decision, see delta lake guide. For the broader architectural patterns, see data architecture patterns. For the cloud data warehouse comparison, see snowflake vs redshift and bigquery vs snowflake.

Our data architecture consulting practice designs cloud data platforms — from warehouse-only architectures through full lakehouse implementations. If you are deciding between a warehouse and lakehouse approach, book a free 30-minute audit.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →