BlogData Architecture

Data Lakehouse Architecture: Design Patterns and Implementation Guide

Austin Duncan
Austin Duncan
Senior Data Architect
·July 31, 202611 min read

The lakehouse combines the storage flexibility of a data lake with the ACID guarantees and query performance of a data warehouse. This guide covers the architecture layers, key design decisions, and how to implement a lakehouse on Databricks, AWS, or Azure.

The data lakehouse is an architecture that stores data in open file formats on object storage (S3, ADLS, GCS) and adds a structured table layer on top — using a table format like Delta Lake, Apache Iceberg, or Apache Hudi to provide ACID transactions, schema enforcement, and query performance comparable to a traditional data warehouse. This guide covers the architecture layers, key design decisions, and how to implement a lakehouse on the major platforms.

The case for lakehouse over separate lake and warehouse

Traditional organisations ran two separate systems: a data lake (cheap object storage, raw files, supports any workload) and a data warehouse (structured, ACID, fast SQL queries, expensive). Data was ETL'd from the lake into the warehouse. The duplication created:

- Two copies of data to manage and keep in sync

- ETL jobs that could fail, creating data inconsistency between systems

- Data scientists working with raw lake data using different tools than analysts using the warehouse

- Two cost structures (lake storage + warehouse compute)

The lakehouse eliminates this duplication. One copy of data in object storage, structured with a table format, queryable by any engine that supports the format.

The lakehouse architecture layers

**Ingestion layer**: Data arrives from source systems via CDC (Debezium, Fivetran, Airbyte), event streams (Kafka, Kinesis), file drops (SFTP, S3 upload), or API pulls. The ingestion layer lands data in the bronze/raw layer of the storage tier with minimal transformation.

**Storage layer**: Object storage (S3, ADLS Gen2, GCS) with a table format providing ACID semantics. Delta Lake (Databricks, open-source), Apache Iceberg (cloud-agnostic, multi-engine), or Apache Hudi (CDC-optimised). The table format maintains metadata — schema definitions, snapshot history, partition maps, column statistics — alongside the Parquet data files.

The storage layer is typically organised into bronze (raw), silver (cleaned), and gold (business-ready) tiers following the medallion architecture. Each tier is a separate schema or directory within the same storage account.

**Transformation layer**: Spark (PySpark notebooks, Spark SQL, dbt-spark), dbt with a warehouse connector, or SQL-on-lakehouse engines (Trino, Presto, Spark SQL). Transformations move data from bronze through silver to gold following documented business logic.

**Serving layer**: Multiple query engines can read the same table format files simultaneously. Spark for large-scale processing. Trino or Athena for SQL analytics. A cloud data warehouse (Snowflake external tables on Iceberg, BigQuery Omni) for BI tool compatibility. The serving layer is polyglot — different consumers use different engines against the same data.

**BI and ML layer**: BI tools (Tableau, Power BI) connect to the serving layer. ML platforms (Databricks MLflow, SageMaker) train models against the gold-layer feature data.

Key design decisions

**Table format choice**: Delta Lake if you are primarily on Databricks or Spark. Iceberg if multi-engine is a requirement (Spark + Trino + Snowflake + Athena against the same data). Hudi if high-frequency CDC upserts are the primary workload. For new projects, Delta Lake and Iceberg are the dominant choices; Hudi is niche.

**Catalog**: The catalog is the metadata store that maps table names to storage locations and schema definitions. Hive Metastore (traditional, widely supported), AWS Glue Catalog (serverless, Athena-native), Databricks Unity Catalog (governance layer with column-level security and lineage), or Apache Polaris/Nessie for Iceberg (open REST catalog). The catalog choice determines which engines can access the data and what governance features are available.

**Compute separation**: One of the lakehouse's primary benefits is that storage and compute are decoupled. A Databricks cluster processes ETL; Athena handles ad-hoc SQL; Tableau connects via a Spark or Trino endpoint; each uses separate compute provisioned independently. Do not try to route all workloads through a single compute tier — use the right tool for each workload type.

**Partition strategy**: Partition tables on the columns used in the most frequent query filters — typically a date column (year, month, day). Over-partitioning (partitioning on too many columns or too fine a granularity) creates millions of small files that degrade query performance. Under-partitioning means full table scans for filtered queries. A table with 10M rows/day partitioned by DAY is appropriate; the same table partitioned by HOUR creates 24x the partitions with no benefit if queries span multiple hours.

**Small file problem**: Streaming ingest creates many small files. Small files hurt performance because each file requires a separate metadata lookup and read. Implement compaction — periodically rewriting small files into larger files. Delta Lake's OPTIMIZE command, Iceberg's REWRITE DATA FILES procedure, or scheduled Spark jobs compact small files. Compact on a frequency proportional to your streaming ingest rate.

Implementation on Databricks

Databricks is the canonical lakehouse implementation environment. Databricks provides:

- Delta Lake as the native table format

- Unity Catalog for governance (column-level security, lineage, data sharing)

- Databricks Workflows for orchestration (DAGs for Spark and Delta Live Tables)

- MLflow for model tracking and serving

- SQL Warehouse for BI tool connections

- Auto Loader for incremental file ingestion from cloud storage

The Databricks lakehouse architecture is: Auto Loader or Kafka → Delta bronze → Databricks Workflows → Delta silver/gold → SQL Warehouse → Tableau or Power BI.

Implementation on AWS

AWS lakehouse: Kinesis or MSK → S3 (Iceberg or Delta via Glue ETL) → Glue Catalog → Athena for SQL analytics + EMR for Spark + Redshift Spectrum for BI SQL access.

AWS Lake Formation provides row and column-level access control on Glue Catalog tables, applying permissions across Athena and EMR. For organisations without Databricks, the AWS-native lakehouse (Iceberg on S3 + Glue Catalog + Athena + Lake Formation) is a viable architecture at lower cost.

Implementation on Azure

Azure lakehouse: Event Hub → ADLS Gen2 (Delta Lake format) → Azure Data Factory or Databricks on Azure → Delta silver/gold → SQL Analytics endpoint (via Azure Synapse or Fabric) → Power BI.

Microsoft Fabric is Microsoft's managed lakehouse product — OneLake (managed ADLS), Lakehouse (Delta Lake), Spark Notebooks, and Power BI in a single governed workspace.

When not to build a lakehouse

The lakehouse architecture is not universally correct. It is appropriate when:

- Data volume exceeds what a traditional warehouse handles economically

- Multiple workload types (SQL analytics, Spark ML, streaming) must access the same data

- Open format and multi-engine are requirements

It is excessive when:

- Your entire analytics needs are served by a cloud data warehouse (Snowflake, BigQuery, Redshift) — the warehouse is effectively a lakehouse with a managed storage layer

- Your team does not have Spark expertise — the transformation layer requires Spark knowledge

For organisations where Snowflake or BigQuery handles the workload comfortably, those tools are the lakehouse — they just use managed storage rather than self-managed object storage. The term "lakehouse" is sometimes applied specifically to the self-managed architecture, but the conceptual separation from "warehouse" is smaller than the marketing suggests.

For the related architectural concepts, see what is a data lakehouse, medallion architecture, and apache iceberg guide. Our data architecture consulting practice designs and implements lakehouse architectures across AWS, Azure, and GCP — book a free architecture review.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →