The data lakehouse combines the storage economics and flexibility of a data lake with the query performance and governance of a data warehouse. For many organisations, the question is no longer which architecture to use but which use cases are better served by managed warehouse services (Snowflake, BigQuery, Redshift) and which by open lakehouse formats (Iceberg, Delta Lake, Hudi) on object storage.
The data lakehouse architecture has emerged as the practical middle ground between pure data lakes (cheap, flexible, hard to govern) and pure data warehouses (performant, governed, expensive). Understanding what a lakehouse is — and is not — helps organisations make clear decisions about which workloads belong on managed warehouse services and which belong on open lakehouse infrastructure.
What a Lakehouse Is
A data lakehouse combines:
**Data lake storage economics** — data stored in open formats (Parquet, ORC) on cloud object storage (S3, GCS, ADLS), with storage costs an order of magnitude lower than managed warehouse storage.
**Data warehouse semantics** — ACID transactions, schema enforcement, query optimisation, time travel, and governance — delivered by an open table format layer (Iceberg, Delta Lake, or Hudi) on top of the object storage.
**Multiple query engine support** — because data is stored in open formats with standardised table metadata, any compatible query engine can read the same tables: Spark for heavy transformation, Trino/Athena for ad-hoc SQL, Flink for streaming, BI tools via Iceberg/Delta connectors.
The term was coined by Databricks and is now used broadly. Operationally, "lakehouse" typically means one of:
- Databricks Delta Lake on object storage
- Apache Iceberg on object storage with a compatible query engine
- Snowflake with Iceberg external tables
Managed Warehouse vs Lakehouse: The Decision Framework
The choice is not binary — most mature data platforms use both, with the decision made per workload:
Choose managed warehouse (Snowflake, BigQuery, Redshift) for:
**SQL analytics workloads** — managed warehouses have highly optimised query engines tuned specifically for SQL analytics. Snowflake's virtual warehouses, BigQuery's serverless compute, and Redshift's MPP architecture deliver query performance on structured analytical data that open table formats on general compute cannot match without significant tuning.
**Small to medium data teams** — managed warehouses abstract infrastructure management. No Spark cluster sizing, no compaction scheduling, no cluster maintenance. Teams without dedicated data platform engineers benefit significantly from this abstraction.
**Complex SQL transformations** — dbt-on-Snowflake or dbt-on-BigQuery is a more mature, better-documented pattern than dbt-on-Spark-on-Iceberg. If the transformation layer is SQL-first, managed warehouses have better tooling.
**Tight SLA requirements** — managed warehouse query performance is more predictable. Cloud warehouses autoscale, provide dedicated compute per query, and have extensive performance tuning primitives. Lakehouse query performance requires careful tuning of table layout, partitioning, and compute cluster sizing.
Choose lakehouse (open format on object storage) for:
**ML and Python workloads** — machine learning feature engineering, model training, and Python-based processing require Spark or Python compute that managed SQL warehouses do not natively support at scale. A lakehouse enables the same data to be queried with SQL and processed with Spark or Python without data movement.
**Very large data volumes** — at petabyte scale, managed warehouse storage costs (Snowflake's per-TB compressed storage, BigQuery's storage pricing) can exceed open object storage (S3 Standard) by 10x. For organisations with petabytes of data, the storage economics strongly favour open formats.
**Multi-engine requirements** — if the same data needs to be processed by Spark (for transformation), Flink (for streaming), Trino (for SQL analytics), and fed to ML pipelines, open formats enable all engines to access the same physical data without copies.
**Long-term data retention** — object storage is cheap and durable. Historical data that is queried infrequently but must be retained for compliance (7-10 years) is significantly cheaper on S3 Glacier than in Snowflake storage.
The Practical Hybrid Architecture
Most enterprise data platforms land at a hybrid:
**Raw and intermediate layers on lakehouse** — raw data from CDC and streaming ingestion lands in S3/GCS in Iceberg or Hudi tables. Initial cleaning and staging transformations run in Spark. Storage is cheap; the volume is high.
**Analytical mart layer in managed warehouse** — business-layer tables that BI tools query directly live in Snowflake or BigQuery. Spark jobs write clean data from the lakehouse to the warehouse; dbt transforms it within the warehouse. Query performance for BI workloads is critical; the volume at the mart level is lower than raw.
**ML feature store on lakehouse** — ML features are computed in Spark from the lakehouse and stored back in the lakehouse. ML model training reads from the lakehouse directly. No managed warehouse involvement for ML workloads.
This architecture optimises each layer for its access pattern: cheap storage for high-volume raw data, fast SQL for BI-facing analytical data, and flexible compute for ML workloads.
Lakehouse Governance Challenges
The lakehouse's openness creates governance challenges that managed warehouses handle automatically:
**Access control** — managed warehouses have integrated access control via SQL GRANT/REVOKE. Lakehouse data on S3 requires either bucket-level IAM policies (coarse) or a governance layer like Apache Ranger, AWS Lake Formation, or Databricks Unity Catalog for table and column-level access control.
**Data catalogue** — managed warehouses have integrated data catalogues (Snowflake Information Schema, BigQuery Data Catalog). Lakehouse requires a separate catalogue (AWS Glue, Apache Atlas, or commercial alternatives) to maintain table metadata and support data discovery.
**Schema enforcement** — managed warehouses enforce schemas at write time. Lakehouse formats enforce schemas through the table format layer (Iceberg schema enforcement, Delta Lake schema enforcement), but this requires consistently using the table format API — direct Parquet writes bypass enforcement.
**Auditing** — managed warehouses log all query and access activity natively. Lakehouse auditing requires collecting logs from each compute engine separately, which creates gaps if any engine does not have audit logging configured.
Our data architecture practice designs hybrid lakehouse and warehouse architectures for enterprise data teams — contact us to discuss the right storage and compute architecture for your analytical workloads.
A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.
Book a Call →