The lakehouse combines the storage flexibility of a data lake with the ACID guarantees and query performance of a data warehouse. This guide covers the architecture layers, key design decisions, and how to implement a lakehouse on Databricks, AWS, or Azure.
The data lakehouse is an architecture that stores data in open file formats on object storage (S3, ADLS, GCS) and adds a structured table layer on top — using a table format like Delta Lake, Apache Iceberg, or Apache Hudi to provide ACID transactions, schema enforcement, and query performance comparable to a traditional data warehouse. This guide covers the architecture layers, key design decisions, and how to implement a lakehouse on the major platforms.
The case for lakehouse over separate lake and warehouse
Traditional organisations ran two separate systems: a data lake (cheap object storage, raw files, supports any workload) and a data warehouse (structured, ACID, fast SQL queries, expensive). Data was ETL'd from the lake into the warehouse. The duplication created:
- Two copies of data to manage and keep in sync
- ETL jobs that could fail, creating data inconsistency between systems
- Data scientists working with raw lake data using different tools than analysts using the warehouse
- Two cost structures (lake storage + warehouse compute)
The lakehouse eliminates this duplication. One copy of data in object storage, structured with a table format, queryable by any engine that supports the format.
The lakehouse architecture layers
**Ingestion layer**: Data arrives from source systems via CDC (Debezium, Fivetran, Airbyte), event streams (Kafka, Kinesis), file drops (SFTP, S3 upload), or API pulls. The ingestion layer lands data in the bronze/raw layer of the storage tier with minimal transformation.
**Storage layer**: Object storage (S3, ADLS Gen2, GCS) with a table format providing ACID semantics. Delta Lake (Databricks, open-source), Apache Iceberg (cloud-agnostic, multi-engine), or Apache Hudi (CDC-optimised). The table format maintains metadata — schema definitions, snapshot history, partition maps, column statistics — alongside the Parquet data files.
The storage layer is typically organised into bronze (raw), silver (cleaned), and gold (business-ready) tiers following the medallion architecture. Each tier is a separate schema or directory within the same storage account.
**Transformation layer**: Spark (PySpark notebooks, Spark SQL, dbt-spark), dbt with a warehouse connector, or SQL-on-lakehouse engines (Trino, Presto, Spark SQL). Transformations move data from bronze through silver to gold following documented business logic.
**Serving layer**: Multiple query engines can read the same table format files simultaneously. Spark for large-scale processing. Trino or Athena for SQL analytics. A cloud data warehouse (Snowflake external tables on Iceberg, BigQuery Omni) for BI tool compatibility. The serving layer is polyglot — different consumers use different engines against the same data.
**BI and ML layer**: BI tools (Tableau, Power BI) connect to the serving layer. ML platforms (Databricks MLflow, SageMaker) train models against the gold-layer feature data.
Key design decisions
**Table format choice**: Delta Lake if you are primarily on Databricks or Spark. Iceberg if multi-engine is a requirement (Spark + Trino + Snowflake + Athena against the same data). Hudi if high-frequency CDC upserts are the primary workload. For new projects, Delta Lake and Iceberg are the dominant choices; Hudi is niche.
**Catalog**: The catalog is the metadata store that maps table names to storage locations and schema definitions. Hive Metastore (traditional, widely supported), AWS Glue Catalog (serverless, Athena-native), Databricks Unity Catalog (governance layer with column-level security and lineage), or Apache Polaris/Nessie for Iceberg (open REST catalog). The catalog choice determines which engines can access the data and what governance features are available.
**Compute separation**: One of the lakehouse's primary benefits is that storage and compute are decoupled. A Databricks cluster processes ETL; Athena handles ad-hoc SQL; Tableau connects via a Spark or Trino endpoint; each uses separate compute provisioned independently. Do not try to route all workloads through a single compute tier — use the right tool for each workload type.
**Partition strategy**: Partition tables on the columns used in the most frequent query filters — typically a date column (year, month, day). Over-partitioning (partitioning on too many columns or too fine a granularity) creates millions of small files that degrade query performance. Under-partitioning means full table scans for filtered queries. A table with 10M rows/day partitioned by DAY is appropriate; the same table partitioned by HOUR creates 24x the partitions with no benefit if queries span multiple hours.
**Small file problem**: Streaming ingest creates many small files. Small files hurt performance because each file requires a separate metadata lookup and read. Implement compaction — periodically rewriting small files into larger files. Delta Lake's OPTIMIZE command, Iceberg's REWRITE DATA FILES procedure, or scheduled Spark jobs compact small files. Compact on a frequency proportional to your streaming ingest rate.
Implementation on Databricks
Databricks is the canonical lakehouse implementation environment. Databricks provides:
- Delta Lake as the native table format
- Unity Catalog for governance (column-level security, lineage, data sharing)
- Databricks Workflows for orchestration (DAGs for Spark and Delta Live Tables)
- MLflow for model tracking and serving
- SQL Warehouse for BI tool connections
- Auto Loader for incremental file ingestion from cloud storage
The Databricks lakehouse architecture is: Auto Loader or Kafka → Delta bronze → Databricks Workflows → Delta silver/gold → SQL Warehouse → Tableau or Power BI.
Implementation on AWS
AWS lakehouse: Kinesis or MSK → S3 (Iceberg or Delta via Glue ETL) → Glue Catalog → Athena for SQL analytics + EMR for Spark + Redshift Spectrum for BI SQL access.
AWS Lake Formation provides row and column-level access control on Glue Catalog tables, applying permissions across Athena and EMR. For organisations without Databricks, the AWS-native lakehouse (Iceberg on S3 + Glue Catalog + Athena + Lake Formation) is a viable architecture at lower cost.
Implementation on Azure
Azure lakehouse: Event Hub → ADLS Gen2 (Delta Lake format) → Azure Data Factory or Databricks on Azure → Delta silver/gold → SQL Analytics endpoint (via Azure Synapse or Fabric) → Power BI.
Microsoft Fabric is Microsoft's managed lakehouse product — OneLake (managed ADLS), Lakehouse (Delta Lake), Spark Notebooks, and Power BI in a single governed workspace.
When not to build a lakehouse
The lakehouse architecture is not universally correct. It is appropriate when:
- Data volume exceeds what a traditional warehouse handles economically
- Multiple workload types (SQL analytics, Spark ML, streaming) must access the same data
- Open format and multi-engine are requirements
It is excessive when:
- Your entire analytics needs are served by a cloud data warehouse (Snowflake, BigQuery, Redshift) — the warehouse is effectively a lakehouse with a managed storage layer
- Your team does not have Spark expertise — the transformation layer requires Spark knowledge
For organisations where Snowflake or BigQuery handles the workload comfortably, those tools are the lakehouse — they just use managed storage rather than self-managed object storage. The term "lakehouse" is sometimes applied specifically to the self-managed architecture, but the conceptual separation from "warehouse" is smaller than the marketing suggests.
For the related architectural concepts, see what is a data lakehouse, medallion architecture, and apache iceberg guide. Our data architecture consulting practice designs and implements lakehouse architectures across AWS, Azure, and GCP — book a free architecture review.
A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.
Book a Call →