BlogData Architecture

Apache Iceberg: The Open Table Format for Analytics at Scale

Eric Chen
Eric Chen
BI Solutions Architect
·November 24, 202713 min read

Apache Iceberg is an open table format for storing large analytical datasets in cloud object storage. It adds ACID transactions, schema evolution, time travel, and partition evolution to Parquet and ORC files — giving data lakes the reliability and governance capabilities previously associated only with managed data warehouses. Iceberg has become the foundation of modern lakehouse architectures.

Apache Iceberg is an open table format for large analytical datasets stored in cloud object storage (S3, GCS, Azure Data Lake Storage). It specifies how data files (Parquet, ORC, Avro) should be organised, tracked, and queried — adding the reliability and governance features of a managed data warehouse (ACID transactions, schema evolution, time travel, partition evolution) to data stored in object storage that any compatible query engine can read.

Iceberg has become the foundation of modern lakehouse architectures, replacing ad-hoc Parquet/Hive structures that lacked transactional semantics and were difficult to govern at scale.

Why Iceberg Over Plain Parquet/Hive

Raw Parquet files on S3 with a Hive metastore have several fundamental limitations:

**No ACID transactions** — concurrent writes to the same Parquet partition produce inconsistent or corrupted data. Readers during a write see partial results. There is no rollback.

**Expensive schema evolution** — changing a column name or type requires rewriting all Parquet files in the table. Column additions are possible but tracked inconsistently across different query engines.

**Partition-level operations only** — the smallest unit of operation is a partition (a directory). Updating a single row requires rewriting the entire partition file. Deletes are impossible without rewriting.

**Full partition scan for query planning** — Hive lists all partitions at query time to determine which to include. On tables with thousands of partitions, this listing operation is slow.

**Hidden partitioning (reader burden)** — consumers querying a Hive table must know the partition scheme and write partition predicates explicitly to avoid full scans. Changing the partition scheme requires rewriting the table.

Iceberg addresses all of these.

Iceberg Architecture

Iceberg maintains metadata separately from data files:

**Data files** — the actual data in Parquet, ORC, or Avro format, stored in cloud object storage.

**Manifest files** — lists of data files with statistics: min/max values per column, null counts, record counts. Used for query planning (pruning files that cannot contain relevant data).

**Manifest list (snapshot file)** — a list of manifest files representing a complete snapshot of the table at a point in time.

**Metadata file** — the table's schema, partition spec, current snapshot, and snapshot history. Atomic pointer update to a new metadata file is the mechanism for ACID commits.

Every write produces a new snapshot (new manifest list, new metadata file). The current table state is defined by the current metadata file. Older snapshots are retained for time travel; expired snapshots can be garbage-collected.

ACID Transactions

Iceberg's transaction model uses optimistic concurrency:

1. Writer reads the current snapshot.

2. Writer computes changes (new data files, updated manifests).

3. Writer attempts to commit by atomically updating the metadata file pointer from the old metadata to the new metadata.

4. If another writer committed between step 1 and 3, the commit fails (optimistic locking conflict). The writer retries.

This model provides serialisable isolation for most operations. Concurrent appends (two writers adding new data simultaneously) are handled without conflict. Concurrent row-level updates to the same rows may conflict.

Schema Evolution

Iceberg supports schema changes that do not require rewriting data files:

**Column additions** — add columns with default values. Existing Parquet files without the new column return the default value for that column.

**Column drops** — mark a column as dropped. Queries no longer return the dropped column. The underlying Parquet files still contain the data (it is not deleted), but it is not visible to queries.

**Column renames** — rename a column. Iceberg tracks the mapping between column names and column IDs; queries using the new name access the same underlying data files.

**Type widening** — certain type promotions (integer to long, float to double) are supported without rewriting files.

These schema changes are instant — no data file rewriting required.

Partition Evolution

One of Iceberg's most practical advantages over Hive is partition evolution: you can change the partition scheme of a table without rewriting it.

A table initially partitioned by month can be repartitioned by day when data volume grows, without migrating existing data. Old files remain in their original monthly partitions; new files are written in daily partitions. Queries plan against both partitions schemes simultaneously, reading old monthly and new daily partitions as appropriate.

This eliminates the "big bang" partition migration that Hive tables require when partitioning needs to change.

Time Travel

Iceberg retains all historical snapshots until explicitly expired. Time travel queries use a snapshot ID or timestamp:

SELECT * FROM orders FOR SYSTEM_TIME AS OF '2024-01-01';

SELECT * FROM orders FOR VERSION AS OF 12345;

Time travel is valuable for:

- Debugging: query the table state before a bad write

- Audit: prove what data looked like at a specific time

- Reproducibility: ML model training against a point-in-time snapshot

The retention period for snapshots is configurable; expired snapshots and their data files can be garbage-collected with the expireSnapshots procedure.

Iceberg Query Engine Support

Iceberg is query-engine agnostic. Major query engines that support Iceberg natively:

**Apache Spark** — primary compute engine for Iceberg at scale; full read/write support including MERGE INTO.

**Trino / Presto** — SQL query engine used in many lakehouse deployments; full Iceberg support.

**AWS Athena** — serverless SQL on S3; native Iceberg support.

**Snowflake** — can query Iceberg tables in S3 directly via Snowflake Iceberg Table feature.

**Databricks** — supports Iceberg via the Iceberg REST catalog; also supports Delta Lake natively.

**Apache Flink** — streaming and batch processing with Iceberg sink for streaming writes.

Iceberg Catalog Options

Iceberg tables are tracked by a catalog, which stores the mapping from table names to metadata file paths:

**AWS Glue** — managed catalog in AWS. Native integration with Athena, Spark on EMR, and Glue ETL.

**Apache Hive Metastore** — the traditional catalog; Iceberg tables are tracked as Hive tables with Iceberg-specific properties.

**Nessie** — open-source catalog with Git-like branching for Iceberg tables. Enables branch-and-merge workflows for data.

**Iceberg REST Catalog** — the standard REST API for catalog operations; supported by most engines and hosted catalog services (Tabular, Snowflake Polaris Catalog, Databricks Unity Catalog).

When to Use Iceberg vs Delta Lake

Both Iceberg and Delta Lake are open table formats with similar capabilities. The choice is primarily ecosystem-driven:

**Use Iceberg when** the primary compute engine is Spark, Trino, or Athena; multi-engine environments where no single vendor controls the stack; or when Nessie branching capabilities are valuable.

**Use Delta Lake when** Databricks is the primary platform (Delta is Databricks-native); when the lakehouse is primarily managed by Databricks with other engines secondary.

Both formats are converging. Databricks supports Iceberg via UniForm (enabling Iceberg reads on Delta tables); Snowflake's Iceberg integration reads Delta tables via Iceberg REST. In practice, the formats are increasingly interoperable.

Our data architecture practice designs lakehouse architectures using Iceberg for enterprise data teams — contact us to discuss open table format strategy for your data platform.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →