
Apache Iceberg vs Delta Lake vs Apache Hudi: Open Table Format Comparison

James Okafor
Data & Cloud Engineer
November 3, 2026 · 10 min read

A practical comparison of the three major open table formats for data lakes — Apache Iceberg, Delta Lake, and Apache Hudi — covering architecture, use cases, cloud platform support, and how to choose for your specific workload.

The three major open table formats — Apache Iceberg, Delta Lake, and Apache Hudi — have each solved the same fundamental problem: adding ACID transactions, schema evolution, and time travel to files on object storage (S3, ADLS, GCS). But they solve it differently, with different strengths, different ecosystem integrations, and different operational characteristics.

Choosing between them is increasingly a real architectural decision as organisations build data lakehouses. The right choice depends on your cloud platform, your query engines, your write patterns, and whether you are already embedded in one ecosystem.

What All Three Provide

Before the differences, the common ground. All three open table formats provide:

**ACID transactions on object storage.** Concurrent writes without data corruption. Readers see consistent snapshots even when writers are active.

**Time travel.** Query historical versions of a table by timestamp or by version identifier (a snapshot ID in Iceberg, a version number in Delta Lake, a commit time in Hudi).

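**Schema evolution.** Add columns, rename columns, and widen column types without rewriting all data. The exact set of supported changes varies by format; in Delta Lake, for example, renaming a column requires column mapping to be enabled.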

**Partitioning.** Physical organisation of data by partition values for scan efficiency.

**Metadata management.** A manifest or transaction log that describes the current and historical state of the table, enabling efficient query planning.

The substantive differences are in implementation details that affect performance, ecosystem integration, and operational complexity.

Apache Iceberg

Iceberg was developed at Netflix and donated to the Apache Software Foundation. It is the most open and engine-agnostic of the three formats — designed from the ground up to be read by any query engine without dependency on a specific processing framework.

**Architecture.** Iceberg stores metadata in a hierarchy: a catalog (pointing to the current metadata file), metadata files (describing table schema, partitioning, and snapshots), manifest list files (listing the manifest files for a snapshot), and manifest files (listing data files with column-level statistics). Because each level summarises the one below it, a query planner can discard whole manifests and data files without listing the object store, which keeps planning efficient for very large tables.
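
To make the hierarchy concrete, here is a minimal in-memory sketch of how a planner might walk from catalog to data files and use column bounds to skip files. The class names, the `plan_scan` function, and the dictionary standing in for a catalog are illustrative assumptions, not Iceberg's actual API or file layout.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str
    # Per-column (lower_bound, upper_bound), as a manifest entry would record.
    bounds: dict


@dataclass
class Manifest:
    data_files: list


@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: list  # the manifests that make up this snapshot


@dataclass
class TableMetadata:
    schema: dict
    snapshots: list
    current_snapshot_id: int


def plan_scan(catalog, table_name, column, value):
    """Return paths of data files that might hold rows where column == value."""
    metadata = catalog[table_name]  # catalog -> current metadata
    snapshot = next(s for s in metadata.snapshots
                    if s.snapshot_id == metadata.current_snapshot_id)
    selected = []
    for manifest in snapshot.manifest_list:    # manifest list -> manifests
        for data_file in manifest.data_files:  # manifest -> data files
            lower, upper = data_file.bounds[column]
            if lower <= value <= upper:        # skip files whose range excludes value
                selected.append(data_file.path)
    return selected


# Hypothetical table with two snapshots; the second adds b.parquet.
catalog = {
    "events": TableMetadata(
        schema={"user_id": "long"},
        snapshots=[
            Snapshot(1, [Manifest([DataFile("a.parquet", {"user_id": (1, 100)})])]),
            Snapshot(2, [
                Manifest([DataFile("a.parquet", {"user_id": (1, 100)})]),
                Manifest([DataFile("b.parquet", {"user_id": (101, 200)})]),
            ]),
        ],
        current_snapshot_id=2,
    )
}

print(plan_scan(catalog, "events", "user_id", 150))  # ['b.parquet']
```

Only the metadata is consulted here; no data file is opened until the scan itself runs, which is the point of keeping statistics in the manifests.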

Key strengths:

- **Engine agnosticism.** Iceberg is supported by the broadest range of query engines: Spark, Flink, Trino, Presto, Hive, Dremio, Athena, BigQuery Omni, and Snowflake (which now offers native Iceberg tables in addition to external tables). Any engine that implements the Iceberg spec can read and write Iceberg tables, with no dependency on a specific processing framework.

- **Hidden partitioning.** Iceberg supports partition transforms that compute partition values from column values (year/month/day from a timestamp, bucket by user_id) without requiring users to manually manage partition columns. This eliminates a common source of query errors in Hive-style partitioned tables.

- **Column-level statistics.** Iceberg stores per-file, per-column statistics (such as lower and upper bounds and null counts) in manifest files, so an engine can skip data files during planning without opening them.

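- **Multi-engine writes.** Multiple engines can write to the same Iceberg table concurrently using optimistic concurrency control: each writer commits by atomically swapping the table's metadata pointer in the catalog, and retries if another commit landed first.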
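
Hidden partitioning is easier to see in code. The sketch below derives partition values from ordinary column values and then prunes partitions for a filter written against the raw timestamp. The day and month transforms follow the Iceberg convention of counting from 1970-01-01; the bucket transform uses CRC32 purely as a stand-in (the Iceberg spec uses a 32-bit Murmur3 hash), and `PARTITION_SPEC`, `partition_for`, and `prune` are hypothetical names.

```python
from datetime import datetime, timezone
import zlib

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts):
    """Days since 1970-01-01 (UTC)."""
    return (ts - EPOCH).days


def month_transform(ts):
    """Months since 1970-01."""
    return (ts.year - 1970) * 12 + (ts.month - 1)


def bucket_transform(value, num_buckets):
    # Stand-in hash for illustration only; Iceberg itself specifies Murmur3.
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


# Hypothetical spec: partition by day(event_ts) and bucket(16, user_id).
PARTITION_SPEC = [
    ("event_ts", day_transform),
    ("user_id", lambda v: bucket_transform(v, 16)),
]


def partition_for(row):
    """Compute the partition tuple from plain column values."""
    return tuple(transform(row[col]) for col, transform in PARTITION_SPEC)


def prune(partitions, min_ts):
    """Keep partitions that can contain rows with event_ts >= min_ts."""
    min_day = day_transform(min_ts)
    return [p for p in partitions if p[0] >= min_day]


rows = [
    {"event_ts": datetime(2026, 11, 2, 23, 59, tzinfo=timezone.utc), "user_id": 7},
    {"event_ts": datetime(2026, 11, 3, 0, 1, tzinfo=timezone.utc), "user_id": 7},
]
partitions = [partition_for(r) for r in rows]
kept = prune(partitions, datetime(2026, 11, 3, tzinfo=timezone.utc))
print(len(partitions), len(kept))  # 2 1
```

The writer and the reader both work with `event_ts` directly; neither has to know that a derived day value exists, which is what removes the usual Hive-style mistake of filtering on the wrong column.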

**Ecosystem:** strongest on AWS (Athena, Glue, and EMR all support Iceberg natively), strong on Snowflake (native Iceberg tables as well as external tables), good Databricks support. The dominant choice for multi-engine, multi-cloud environments.

Delta Lake

Delta Lake was developed at Databricks and is the native table format for the Databricks Lakehouse Platform. It is the incumbent choice for organisations heavily invested in Databricks.

**Architecture.** Delta Lake stores metadata in a _delta_log transaction log: a folder of JSON commit files and Parquet checkpoint files that together record every transaction. The transaction log is the single source of truth for table state. Readers build the current table state by replaying the log. Periodic checkpoints compact the log so that readers do not have to replay every commit from the beginning.
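
Log replay can be sketched in a few lines. The example below writes three simplified commits into a temporary `_delta_log` folder and rebuilds the set of live data files from them, optionally stopping at an earlier version to mimic time travel. It is a sketch under assumptions: real commits carry more actions and fields than the bare `add` and `remove` used here, checkpoints are left out, and `write_commit` and `live_files` are hypothetical helper names.

```python
import json
import tempfile
from pathlib import Path


def write_commit(log_dir, version, actions):
    """Write one commit as newline-delimited JSON, named by zero-padded version."""
    path = log_dir / f"{version:020d}.json"
    # Mode "x" fails if the file exists, so two writers cannot both claim a version.
    with path.open("x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def live_files(log_dir, as_of_version=None):
    """Replay commits in version order and return the table's current data files."""
    files = set()
    for commit in sorted(log_dir.glob("*.json")):
        version = int(commit.stem)
        if as_of_version is not None and version > as_of_version:
            break
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


log_dir = Path(tempfile.mkdtemp()) / "_delta_log"
log_dir.mkdir()
write_commit(log_dir, 0, [{"add": {"path": "part-0.parquet"}}])
write_commit(log_dir, 1, [{"add": {"path": "part-1.parquet"}}])
write_commit(log_dir, 2, [{"remove": {"path": "part-0.parquet"}},
                          {"add": {"path": "part-2.parquet"}}])

print(sorted(live_files(log_dir)))                   # current state
print(sorted(live_files(log_dir, as_of_version=1)))  # state as of version 1
```

A checkpoint would simply be a saved copy of the `files` set at some version, letting a reader start from there and replay only the commits that follow.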

Key strengths:

- **Databricks integration.** Delta Lake is deeply integrated with Databricks — Auto Optimize, Auto Compaction, Predictive I/O, Liquid Clustering, and Delta Live Tables are all Delta-specific features. If you are on Databricks, Delta Lake is the obvious choice.

- **MERGE performance.** Databricks has invested heavily in Delta MERGE performance, which matters most for CDC pipelines that apply a continuous stream of upserts.

- **Ecosystem breadth on Databricks.** Delta Lake plus Unity Catalog provides a complete governance layer with lineage, access control, and data sharing.

- **Delta Sharing.** The Delta Sharing protocol allows live data sharing across organisations without data copying — similar to Snowflake Data Sharing, built on Delta Lake.

**Ecosystem:** dominant on Databricks, good Spark support, improving support in other engines (Trino, Presto, and Flink have Delta readers). Snowflake does not support Delta Lake natively: it can read Delta tables through external tables, but with limitations.

Apache Hudi

Hudi (Hadoop Upserts Deletes and Incrementals) was developed at Uber for their data lake, focusing specifically on the use case that motivated it: efficient upserts and incremental processing on large data lakes.

**Architecture.** Hudi organises table data differently from Iceberg and Delta Lake. Each table uses one of two table types (called storage types in older documentation), Copy-on-Write or Merge-on-Read, and Hudi maintains a timeline (an ordered log of commits, compactions, and cleanings) rather than a transaction log. The timeline approach is optimised for the write-heavy, upsert-heavy use cases Hudi was designed for.
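
As a rough mental model, the timeline can be pictured as a list of instants, each with a time, an action, and a state. The sketch below is an in-memory illustration only: the `Instant` class and `latest_completed_write` are hypothetical names, and the real timeline lives as files in the table's metadata folder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instant:
    timestamp: str  # sortable instant time, e.g. "20261103101500"
    action: str     # "commit", "deltacommit", "compaction" or "clean"
    state: str      # "requested", "inflight" or "completed"


def latest_completed_write(timeline):
    """Return the newest completed instant that added data, or None."""
    writes = [i for i in timeline
              if i.state == "completed" and i.action in ("commit", "deltacommit")]
    return max(writes, key=lambda i: i.timestamp, default=None)


timeline = [
    Instant("20261103100000", "deltacommit", "completed"),
    Instant("20261103101500", "deltacommit", "completed"),
    Instant("20261103103000", "compaction", "requested"),
    Instant("20261103104500", "deltacommit", "inflight"),
]

visible = latest_completed_write(timeline)
print(visible.timestamp)  # 20261103101500
```

Readers in this model only see data up to the last completed write, so the in-flight delta commit and the pending compaction do not affect query results.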

Key strengths:

- **Upsert performance.** Hudi's primary design goal is efficient upserts. For CDC workloads where the write pattern is continuous upserts of changed records, Hudi's Merge-on-Read storage type provides very low write latency.

- **Merge-on-Read.** A MOR table writes changed records to log files, which keeps writes fast, and merges them with the base files at read time. A background compaction periodically folds the log files into new base files. This enables near-real-time data availability in the lake without the cost of rewriting full Parquet files on each update.

- **Incremental processing support.** Hudi's incremental query mode returns only the rows that changed since a given timestamp — enabling efficient incremental ETL without scanning the full table.
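
The read-time merge and the incremental query mode can both be sketched with plain dictionaries. This is a simplified model, not Hudi's reader: the record layout, the `deleted` flag, and the function names are assumptions, and the `commit_time` field stands in for the commit time Hudi records on each row (`_hoodie_commit_time`).

```python
def snapshot_read(base_rows, log_records):
    """Merge-on-Read: apply log records over base-file rows at query time."""
    merged = {row["key"]: row for row in base_rows}
    for rec in sorted(log_records, key=lambda r: r["commit_time"]):
        if rec.get("deleted"):
            merged.pop(rec["key"], None)   # a delete removes the row
        else:
            merged[rec["key"]] = rec       # an upsert replaces or adds the row
    return merged


def incremental_read(base_rows, log_records, since):
    """Return only rows written after the given commit time."""
    merged = snapshot_read(base_rows, log_records)
    return [r for r in merged.values() if r["commit_time"] > since]


def compact(base_rows, log_records):
    """Fold the log into a new set of base rows, leaving an empty log."""
    return list(snapshot_read(base_rows, log_records).values()), []


base = [
    {"key": 1, "value": "a", "commit_time": "001"},
    {"key": 2, "value": "b", "commit_time": "001"},
]
log = [
    {"key": 2, "value": "b2", "commit_time": "002"},
    {"key": 3, "value": "c", "commit_time": "003"},
    {"key": 1, "deleted": True, "commit_time": "003"},
]

print(sorted(snapshot_read(base, log)))                       # [2, 3]
print([r["key"] for r in incremental_read(base, log, "002")]) # [3]
```

The write path only appends to `log`, which is why upserts are cheap; the cost moves to `snapshot_read`, and compaction is what keeps that read-time cost from growing without bound.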

**Ecosystem:** primarily Spark; Flink support is improving. Less broad engine support than Iceberg. Strongest choice for AWS EMR or GCP Dataproc environments with Spark-heavy CDC workloads.

How to Choose

**If you are on Databricks:** Delta Lake is the clear choice. The native integration, performance optimisations, and Unity Catalog governance make it the default.

**If you need multi-engine compatibility (Athena + Snowflake + Spark + Trino):** Apache Iceberg. Its engine-agnostic design and broad support across cloud platforms make it the best choice for polyglot environments.

**If your primary use case is CDC and upsert-heavy streaming ingestion on Spark:** Hudi is worth evaluating, though Iceberg and Delta Lake have closed much of the gap in recent versions.

**If you are starting fresh on AWS:** Iceberg is the most future-proof choice, with native Athena, Glue, and EMR support, and growing Snowflake support through native Iceberg tables and external tables.

**If you are starting fresh on Azure:** Delta Lake has the strongest Microsoft support — Microsoft Fabric uses OneLake with Delta Lake as its table format.
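
The guidance above can be condensed into a small helper. Treat it as a sketch: the function name, its parameters, and especially the order in which the rules are checked are assumptions made for illustration, and a real decision would weigh more factors than these.

```python
def recommend_format(platform, engines=(), cdc_heavy_on_spark=False):
    """Map the rules of thumb above to a suggested table format.

    platform: lower-case name such as "databricks", "aws" or "azure".
    engines: query engines that must read or write the same tables.
    """
    if platform == "databricks":
        return "Delta Lake"
    if len(set(engines)) > 1:
        return "Apache Iceberg"
    if cdc_heavy_on_spark:
        return "Apache Hudi (evaluate against Iceberg and Delta Lake)"
    if platform == "aws":
        return "Apache Iceberg"
    if platform == "azure":
        return "Delta Lake"
    return None  # no rule of thumb applies; evaluate all three


print(recommend_format("aws", engines=["athena", "snowflake", "trino"]))
print(recommend_format("azure"))
```

Checking Databricks first reflects the advice that an existing platform commitment outweighs the other considerations.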

For data lakehouse architecture design including open table format selection, our data architecture consulting practice can advise on the right choice for your specific platform and workload — contact us to discuss your requirements.
