What Is a Data Lakehouse? Architecture, Benefits, and When to Build One

The data lakehouse pattern combines the storage economics of a data lake with the query performance and governance of a data warehouse. Here is what the architecture actually looks like, when it is the right choice, and what it takes to build one at enterprise scale.

The quick answer

A data lakehouse is a data platform architecture that combines the low-cost, flexible storage of a data lake with the query performance, ACID transaction support, and governance capabilities of a data warehouse. Instead of maintaining separate systems — a lake for raw data and a warehouse for curated data — a lakehouse uses an open table format (Delta Lake, Apache Iceberg, or Apache Hudi) directly on cloud object storage to provide warehouse-grade query performance and reliability without the cost and rigidity of a traditional warehouse. For most enterprise organisations building a modern data platform today, the lakehouse pattern is the right architecture.

Why the lakehouse pattern emerged

To understand why the lakehouse matters, you need to understand the problem with the two architectures it replaces.

**The data warehouse problem.** Traditional data warehouses (Teradata, SQL Server DW, Redshift) are optimised for SQL query performance. They achieve that performance through rigid schema enforcement, proprietary storage formats, and expensive compute. The problems: they are expensive at scale, they do not handle unstructured or semi-structured data well, they require all data to be modelled before it can be stored (schema-on-write), and they create vendor lock-in through proprietary formats.

**The data lake problem.** Data lakes (typically HDFS or S3 with Hive-style metastore) solved the cost and flexibility problems of warehouses. You could store anything in raw format, ingest data fast, and process it with distributed compute frameworks like Spark. The problems: without schema enforcement, data quality degraded quickly. Without transaction support, concurrent reads and writes created consistency problems. Without good indexing and statistics, query performance was poor. The "data swamp" pattern — a lake full of low-quality, ungoverned data that nobody could query reliably — became the dominant failure mode.

The lakehouse emerged as the solution: keep the economics and flexibility of a lake, add the reliability and performance guarantees of a warehouse, using an open format that avoids vendor lock-in.

The key technologies

**Delta Lake.** Delta Lake is an open-source storage layer developed by Databricks that adds ACID transactions, schema enforcement, and data versioning to Parquet files on cloud object storage. It is the most widely deployed open table format in production enterprise environments. Delta Lake enables time travel (querying the state of a table as of any past point in time), schema evolution (adding columns without breaking downstream queries), and concurrent read/write operations without consistency issues. Delta Lake is the table format at the heart of the Databricks Lakehouse Platform.

**Apache Iceberg.** Apache Iceberg is an alternative open table format developed initially at Netflix. It has stronger multi-table transaction support than Delta Lake and better handling of very large tables (hundreds of millions of files). Snowflake, AWS (with its Iceberg Tables in S3), and Apple use Iceberg as their preferred table format. For organisations on Snowflake or AWS, Iceberg is often the more natural choice.

**Apache Hudi.** Apache Hudi (Hadoop Upserts Deletes and Incrementals) was developed at Uber and is particularly strong for streaming data ingestion and upsert-heavy workloads. It is widely used in streaming analytics applications where high-frequency record updates are common. Less widely adopted than Delta Lake or Iceberg for general analytics use cases.

**Unified cloud storage.** All three formats run on cloud object storage — Azure Data Lake Storage Gen2, Amazon S3, or Google Cloud Storage. The economics are fundamental: object storage costs 2–5 cents per GB per month, versus 8–40 cents per GB per month for warehouse storage. At petabyte scale, this difference is hundreds of thousands of dollars per year.

The medallion architecture

The most widely adopted pattern for organising data in a lakehouse is the medallion architecture — three logical layers with distinct quality levels:

**Bronze (Raw).** The Bronze layer contains raw data as ingested from source systems — no transformation, no schema enforcement, just the data in whatever format it arrived in. Bronze is append-only. It is the complete historical record of everything that has ever entered the platform. Bronze data is not directly queryable for analytics — it exists to provide a reliable source for reprocessing and for audit trails.

**Silver (Cleaned and Conformed).** The Silver layer contains cleaned, validated, and conformed data. Source schemas are applied, data types are enforced, duplicates are removed, and basic quality checks are run. Silver data represents the agreed-upon structure of each data domain. It is the foundation for most analytical workloads. Silver tables are still relatively granular — they represent facts and entities, not pre-aggregated metrics.

**Gold (Business-Ready).** The Gold layer contains aggregated, business-ready data products — the specific analytical views, aggregated tables, and metric definitions that power dashboards, reports, and analytical queries. Gold is where business logic lives. It is the layer that BI tools connect to. Gold tables are smaller, pre-aggregated, and optimised for query performance.

The medallion architecture creates a clear separation between data ingestion (Bronze), data quality (Silver), and business logic (Gold). It makes the platform maintainable — when business definitions change, you update the Gold layer, not the ingestion pipeline. And it provides an audit trail — if a Gold metric is questioned, you can trace it back through Silver to Bronze to the original source record.

When a lakehouse is the right architecture

**You have large data volumes at reasonable storage cost.** Organisations with terabytes or petabytes of data benefit from lake economics. If your data fits in a 100GB warehouse, a lakehouse is architectural overkill.

**You have diverse data types.** Lakehouses handle structured, semi-structured (JSON, XML), and unstructured (text, images) data on the same platform. If your analytics requirements include unstructured data for NLP, image analysis, or document processing, a pure warehouse cannot serve those workloads.

**You are building for AI and machine learning.** Feature stores, training datasets, and model artefacts need to live somewhere. A lakehouse provides the foundation for ML workloads — your features and training data are governed alongside your analytics data, not in a separate, disconnected ML infrastructure.

**You want to avoid warehouse vendor lock-in.** Open table formats (Delta Lake, Iceberg) mean your data is not locked into a proprietary format. If you decide to change your compute engine, your data moves with you. This is not theoretical — organisations that migrated from Teradata to Snowflake understand the cost of proprietary format lock-in.

**You need both streaming and batch analytics.** Modern lakehouses handle both workloads. Streaming ingestion via Apache Kafka or Azure Event Hubs writes directly to Delta Lake tables, and batch analytics queries the same tables. You do not need separate infrastructure for real-time and batch data.

When a lakehouse is not the right choice

**Small, simple analytics requirements.** If you have a small data team, limited data volumes, and straightforward reporting requirements, a managed cloud warehouse (Snowflake, BigQuery, Azure Synapse) is simpler to operate and does not require the engineering expertise that a lakehouse platform demands.

**No engineering capability.** A lakehouse requires competent data engineers to design and maintain it. The medallion architecture, pipeline design, Unity Catalog governance, and Spark optimisation are not simple. Organisations without senior data engineering capability will struggle to operate a lakehouse and should consider a fully managed warehouse instead.

**Pure SQL analytics teams.** If your entire analytics function operates in SQL and has no need for Python, Spark, or machine learning, a warehouse is simpler. Lakehouses are most powerful when combined with data science and ML workloads — without those workloads, the added complexity may not be worth it.

Lakehouse vs data warehouse vs data lake: the practical comparison

|---|---|---|---|

| Storage cost | High | Low | Low |

| ACID transactions | Yes | No | Yes |

| Unstructured data | No | Yes | Yes |

| ML/AI workloads | Limited | Yes | Yes |

| Vendor lock-in | High | Low | Low |

Building a lakehouse: what it actually takes

A production lakehouse deployment for a mid-market enterprise involves several workstreams:

**Storage design.** Choosing the table format (Delta Lake vs Iceberg), designing the folder hierarchy in cloud storage, setting up the Bronze/Silver/Gold layer structure, and configuring retention policies.

**Ingestion pipelines.** Building the pipelines that move data from source systems into Bronze — Azure Data Factory, Spark Structured Streaming, Fivetran, or Airbyte depending on the source types and latency requirements.

**Transformation layer.** Implementing the Silver and Gold transformations, typically with dbt (data build tool) for SQL-based transformations and PySpark for more complex processing. This is where business logic is codified and tested.

**Governance.** Setting up Unity Catalog (for Databricks) or the equivalent on your chosen platform — table access controls, column masking, data lineage, and auditing.

**Compute optimisation.** Tuning Spark cluster configurations, enabling Delta Lake's OPTIMIZE and ZORDER commands for query performance, and managing autoscaling to control costs.

**BI connection.** Connecting Tableau, Power BI, or other BI tools to the Gold layer — typically via JDBC/ODBC or native connectors.

Most enterprise lakehouse builds take 12–20 weeks for the initial production deployment of the first three to five data domains. Full enterprise coverage is a 12–18 month programme.

Frequently Asked Questions

Is Snowflake a lakehouse?

Snowflake is increasingly positioning itself as a lakehouse platform, and with its Iceberg Table support and the Snowflake Data Cloud, it does provide many of the capabilities of a lakehouse. However, Snowflake's primary storage and compute model is still warehouse-based, and its pricing model reflects that. Databricks and the Delta Lakehouse Platform are more commonly described as true lakehouse architectures. In practice, many organisations use Snowflake as the primary data warehouse and Databricks for ML and Spark workloads.

Do we need both Databricks and Snowflake?

Not necessarily, but the combination is common. Databricks is stronger for Spark-based processing, ML workflows, and streaming ingestion. Snowflake is stronger for pure SQL analytics, time-series functions, and the Snowflake ecosystem of data sharing and marketplace integrations. Organisations that need both ML capabilities and strong SQL analytics often run the two platforms with data shared between them via Delta Sharing or by writing Gold-layer Databricks tables that Snowflake queries via external tables.

What is the difference between a lakehouse and a data mesh?

They are not the same kind of thing. A lakehouse is a technical architecture — a specific way of building and organising a data platform. A data mesh is an organisational architecture — a model where data ownership is distributed to domain teams rather than centralised in a data engineering function. A data mesh can be implemented on top of a lakehouse. Many organisations implement a federated lakehouse architecture where individual domain teams own their Bronze and Silver pipelines, and the Gold layer is governed centrally.

What does it cost to build a lakehouse?

Infrastructure costs depend heavily on data volumes and compute intensity, but a representative mid-market production lakehouse runs $3,000–$15,000 per month in cloud infrastructure costs (Databricks or Synapse compute + Azure Data Lake Storage). Build costs for the initial delivery — designing the architecture, building the ingestion and transformation pipelines, setting up governance — typically run $80,000–$250,000 depending on scope and the number of data domains covered. Our data architecture consulting practice has delivered lakehouse builds across financial services, healthcare, and technology — if you want to understand what a build would look like for your organisation, book a free 30-minute audit.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →