Delta Lake: What It Is, How It Works, and When to Use It

Delta Lake is an open-source storage layer that adds ACID transactions, schema enforcement, and time travel to data lakes built on object storage. It is the foundation of the Databricks Lakehouse and can increasingly be read from Snowflake, BigQuery, and other platforms.

The quick answer

Delta Lake is an open-source storage layer that adds ACID transactions, schema enforcement, time travel, and efficient upserts to data lakes built on object storage (S3, ADLS Gen2, GCS). It converts a raw Parquet-based data lake — which lacks the transaction guarantees and reliability of a database — into a data lakehouse that provides database-like reliability while retaining the open format and low storage cost of a data lake.

Delta Lake is the foundation of Databricks's Lakehouse Platform and is supported natively by Apache Spark, Databricks, Flink, and increasingly by Snowflake (through Iceberg compatibility) and BigQuery (through Delta Sharing). Understanding Delta Lake is essential for any data engineer or architect working on Databricks or designing a lakehouse architecture.

The problem Delta Lake solves

A traditional data lake stores data as Parquet (or CSV/ORC) files in object storage. Object storage is cheap and scalable but lacks the concurrency control, consistency guarantees, and reliability features that analytical workloads require:

**No ACID transactions**: if a write job fails partway through, you are left with a partially written dataset. Readers may see incomplete data. There is no rollback mechanism.

**No schema enforcement**: any file can be written to any folder, regardless of schema. A column rename in a source system can silently break downstream queries that expect the original column name.

**No efficient upserts**: update and delete operations (CDC/upsert patterns) require reading and rewriting entire partitions rather than modifying specific rows. For large tables with CDC requirements, this is slow and expensive.

**No time travel**: once data is overwritten, the previous version is gone. Point-in-time queries (as of two days ago) are not possible without separate archival copies.

**Small file proliferation**: incremental streaming writes create many small files. Small files degrade query performance — each file requires a separate read operation, and many small files overwhelm the metadata management of query engines.

Delta Lake addresses all of these: ACID transactions via a transaction log, schema enforcement via schema validation on write, efficient upserts via MERGE, time travel via transaction log history, and file compaction (OPTIMIZE) to address small files.

How Delta Lake works: the transaction log

Delta Lake's core mechanism is the **Delta Log** — a transaction log stored alongside the data files in object storage. Every transaction (write, delete, schema change) creates a new JSON log entry in the transaction log. The transaction log is the authoritative record of what the Delta table contains and what operations have been performed.

When a reader queries a Delta table, it first reads the transaction log to reconstruct the current table state — which files are part of the table, what schema applies, what version is current. The transaction log enables:

**Serialisable isolation**: all transactions are serialised through the log. Concurrent writers are handled via optimistic concurrency control — a writer checks whether the files it intends to modify have been changed by another writer since it started. If a conflict is detected, one writer fails and must retry.

**Snapshot isolation**: readers always see a consistent snapshot of the table as of a specific transaction, even if concurrent writers are modifying the table simultaneously.

**Time travel**: the transaction log retains history. You can query a Delta table as it existed at a past version or timestamp, as long as the log entries for that version are still retained and the underlying data files have not been deleted (VACUUM removes files that are no longer referenced and are older than the retention period, 7 days by default).
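The log-replay idea described above can be sketched in a few lines of plain Python. This is a toy model, not Delta Lake's implementation: the `LOG` contents, the file names, and the `snapshot` helper are all hypothetical, and the action names are only loosely modelled on the log's add, remove, and metadata entries. It shows how replaying commits up to a chosen version yields both the current table state and a time-travel view.

```python
import json

# Each commit is a list of JSON actions. Paths and fields are made up
# for illustration; a real log carries much more information.
LOG = [
    # version 0: create the table with two data files
    ['{"metaData": {"schema": ["id", "amount"]}}',
     '{"add": {"path": "part-000.parquet"}}',
     '{"add": {"path": "part-001.parquet"}}'],
    # version 1: append one file
    ['{"add": {"path": "part-002.parquet"}}'],
    # version 2: rewrite part-000 (for example after an update)
    ['{"remove": {"path": "part-000.parquet"}}',
     '{"add": {"path": "part-003.parquet"}}'],
]


def snapshot(log, version=None):
    """Replay commits 0..version and return (schema, sorted live files)."""
    if version is None:
        version = len(log) - 1  # default: latest version
    if not 0 <= version < len(log):
        raise ValueError(f"version {version} is not in the log")
    schema, files = None, set()
    for commit in log[: version + 1]:
        for line in commit:
            action = json.loads(line)
            if "metaData" in action:
                schema = action["metaData"]["schema"]
            elif "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return schema, sorted(files)


latest = snapshot(LOG)                  # current table state
as_of_version_0 = snapshot(LOG, 0)      # time travel to version 0
```

Note that the removed file is only dropped from the snapshot, not from storage, which is why an older version stays readable until its files are physically deleted.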

Core Delta Lake features

### ACID transactions

Delta Lake guarantees atomicity (a write either fully completes or has no effect), consistency (the table is always in a valid state), isolation (concurrent reads and writes are correctly isolated), and durability (committed transactions persist even if the system fails).

This makes Delta Lake tables reliable for ETL workflows: if a dbt run or Spark job fails partway through a table write, the partial write does not corrupt the table. Only complete, committed transactions are visible to readers.
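One way to picture why a failed job leaves no trace is to treat a commit as the exclusive creation of the next numbered log file. The sketch below assumes a local directory standing in for the log and uses hypothetical helpers (`commit`, `visible_files`) with made-up file names. It is a simplification: a real store must make the file name and its content appear together atomically, and a losing writer has to check for logical conflicts before retrying.

```python
import json
import os
import tempfile


def commit(log_dir, version, actions):
    """Publish `actions` as commit `version`; return False if it is taken."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes creation fail when the file already exists,
        # which gives "put if absent" semantics for the version number.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return True


def visible_files(log_dir):
    """Data files a reader sees: only those named in committed log entries."""
    files = set()
    for name in sorted(os.listdir(log_dir)):  # zero-padded names sort by version
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"])
                if "remove" in action:
                    files.discard(action["remove"])
    return sorted(files)


with tempfile.TemporaryDirectory() as log_dir:
    commit(log_dir, 0, [{"add": "a.parquet"}])
    # A job wrote this data file and then crashed before committing,
    # so no log entry mentions it and readers never see it.
    orphan = "b.parquet"
    first = commit(log_dir, 1, [{"add": "c.parquet"}])    # writer 1 wins version 1
    second = commit(log_dir, 1, [{"add": "d.parquet"}])   # writer 2 loses the race
    retried = commit(log_dir, 2, [{"add": "d.parquet"}])  # writer 2 retries as version 2
    result = visible_files(log_dir)
```

Because visibility is decided by the log rather than by what happens to sit in storage, the orphaned file is harmless until a later clean-up removes it.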

### Schema enforcement and evolution

**Schema enforcement**: by default, Delta Lake rejects writes that do not match the table's schema. A Spark job that attempts to write a DataFrame with a column that does not exist in the Delta table, or with an incompatible data type, will fail at write time rather than corrupting the table with unexpected schema.

**Schema evolution**: when changes to the schema are intentional, Delta Lake supports schema evolution with explicit opt-in. Setting the mergeSchema option to true on a write allows new columns in the incoming data to be added to the table schema automatically. This keeps much of the flexibility of a data lake while leaving schema enforcement as the safe default.
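The difference between enforcement and opt-in evolution can be illustrated with a small validation function. This is a sketch under simplifying assumptions, not Delta Lake's actual rules: `check_write` and its `merge_schema` flag are hypothetical names, schemas are plain dictionaries, and any type mismatch is treated as incompatible.

```python
def check_write(table_schema, batch_schema, merge_schema=False):
    """Return the table schema after the write, or raise if it is rejected.

    Schemas are dicts mapping column name to a type name.
    """
    new_schema = dict(table_schema)
    for column, dtype in batch_schema.items():
        if column not in table_schema:
            if not merge_schema:
                # Enforcement: unknown columns are rejected by default.
                raise ValueError(f"unexpected column {column!r}")
            # Evolution: the new column is appended to the schema.
            new_schema[column] = dtype
        elif table_schema[column] != dtype:
            raise TypeError(
                f"column {column!r}: expected {table_schema[column]}, got {dtype}"
            )
    return new_schema


table = {"id": "long", "amount": "double"}
batch = {"id": "long", "amount": "double", "currency": "string"}

try:
    check_write(table, batch)
    rejected = False
except ValueError:
    rejected = True  # default behaviour: the write fails

evolved = check_write(table, batch, merge_schema=True)
```

The design point is that the permissive path has to be requested explicitly on each write, so an accidental upstream change fails loudly instead of silently altering the table.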

### Time travel

Time travel queries a Delta table as it existed at a previous point in time or version:

- SELECT * FROM my_table VERSION AS OF 42 — query the table at version 42 of the transaction log

- SELECT * FROM my_table TIMESTAMP AS OF '2026-06-01' — query the table as it existed on June 1, 2026

Time travel is useful for: debugging (what data existed before the problematic pipeline run?), regulatory compliance (point-in-time data state for audit requirements), model training (retrieving feature values as they existed at a specific training date), and data recovery (restoring accidentally deleted data by reading from a previous version).

Time travel is available as long as the old data files exist. The VACUUM command removes files older than the retention period (default 7 days). Set the retention period based on your requirements before running VACUUM.
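The interaction between VACUUM and time travel can be modelled as follows. Everything here is an illustrative assumption: the `VERSIONS` and `REMOVED_AT` data, the `vacuum` and `readable_versions` helpers, and the rule that only files with a recorded removal time are eligible for deletion. The sketch shows that an old version stops being readable once any file it references has been deleted.

```python
from datetime import datetime, timedelta

# Files referenced by each table version (hypothetical names).
VERSIONS = {
    0: {"a", "b"},
    1: {"a", "b", "c"},
    2: {"b", "c", "d"},  # "a" was rewritten into "d"
}
# When each file stopped being part of the current table.
REMOVED_AT = {"a": datetime(2026, 6, 1)}


def vacuum(storage, current_files, removed_at, now, retention=timedelta(days=7)):
    """Return the files left in storage after a toy VACUUM."""
    cutoff = now - retention
    kept = set()
    for name in storage:
        if name in current_files:
            kept.add(name)  # still part of the latest version
        elif name in removed_at and removed_at[name] <= cutoff:
            continue  # unreferenced and past the retention period: delete
        else:
            kept.add(name)  # removed too recently (or untracked in this toy)
    return kept


def readable_versions(versions, storage):
    """Versions whose data files are all still present in storage."""
    return [v for v, files in sorted(versions.items()) if files <= storage]


storage = {"a", "b", "c", "d"}
early = vacuum(storage, VERSIONS[2], REMOVED_AT, now=datetime(2026, 6, 3))
late = vacuum(storage, VERSIONS[2], REMOVED_AT, now=datetime(2026, 6, 10))

readable_early = readable_versions(VERSIONS, early)
readable_late = readable_versions(VERSIONS, late)
```

Running the clean-up two days after the rewrite leaves every version readable, while running it nine days later leaves only the latest one.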

### Efficient upserts with MERGE

Delta Lake's MERGE INTO statement provides SQL-style upsert (insert-or-update) operations:

MERGE INTO target USING source ON target.id = source.id WHEN MATCHED THEN UPDATE SET ... WHEN NOT MATCHED THEN INSERT ...

MERGE is efficient because Delta Lake uses the transaction log to identify only the affected data files (the files that contain records with matching keys) and rewrites only those files, rather than rewriting the entire table. For CDC workloads where a small percentage of records change per batch, MERGE is significantly more efficient than full table rewrites.

For CDC workloads where the source flags records as deleted, add a delete clause to the same statement: MERGE INTO target USING source ON ... WHEN MATCHED AND source.operation = 'DELETE' THEN DELETE.
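The file-level behaviour of MERGE can be sketched with dictionaries standing in for data files. This is a toy model with hypothetical names (`merge`, the `UPSERT` and `DELETE` operation labels, the file names); a real engine finds candidate files with a join and file statistics rather than key lookups. It shows that only files containing matching keys are rewritten, while the rest are carried over untouched.

```python
def merge(files, changes):
    """Apply CDC changes to a toy table.

    files:   dict of file name -> {row id: row value}
    changes: list of (operation, row id, new value)
    Returns (new files, names of files that had to be rewritten).
    """
    by_id = {row_id: (op, row) for op, row_id, row in changes}
    result, rewritten, matched = {}, [], set()
    for name, rows in files.items():
        hits = rows.keys() & by_id.keys()
        if not hits:
            result[name] = rows  # no matching keys: file is reused as-is
            continue
        matched |= hits
        new_rows = {}
        for row_id, row in rows.items():
            if row_id not in by_id:
                new_rows[row_id] = row                 # unchanged row
            elif by_id[row_id][0] == "UPSERT":
                new_rows[row_id] = by_id[row_id][1]    # WHEN MATCHED THEN UPDATE
            # "DELETE": the row is simply not copied over
        rewritten.append(name)
        if new_rows:
            result[f"new-{name}"] = new_rows
    # WHEN NOT MATCHED THEN INSERT: unmatched upserts go to a fresh file.
    inserts = {
        row_id: row
        for row_id, (op, row) in by_id.items()
        if row_id not in matched and op == "UPSERT"
    }
    if inserts:
        result["inserts"] = inserts
    return result, rewritten


files = {
    "part-0": {1: "a", 2: "b"},
    "part-1": {3: "c", 4: "d"},
    "part-2": {5: "e"},
}
changes = [("UPSERT", 2, "B"), ("DELETE", 5, None), ("UPSERT", 9, "z")]
merged, rewritten = merge(files, changes)
```

In this run two of the three files are rewritten and `part-1` is reused unchanged, which is the saving the paragraph above describes for batches that touch a small share of the table.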

### OPTIMIZE and ZORDER

**OPTIMIZE** compacts small files into larger, optimally sized files. Streaming writes, frequent MERGE operations, and small batch inserts create many small files that degrade query performance. Running OPTIMIZE periodically (or automatically via Databricks's Auto Optimize feature) consolidates them into fewer, larger files.
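Compaction is essentially a bin-packing problem, which a greedy sketch can illustrate. The `plan_compaction` helper, the file sizes, and the 128 MB target below are assumptions chosen for the example, not Delta Lake's defaults or its actual planner.

```python
def plan_compaction(file_sizes, target_size):
    """Group small files into bins whose total size stays within target_size.

    file_sizes: dict of file name -> size (any unit, here MB)
    Returns a list of bins; each bin would be rewritten as one larger file.
    """
    bins, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1]):
        if size >= target_size:
            continue  # already large enough, leave it alone
        if current and current_size + size > target_size:
            bins.append(current)  # bin is full, start a new one
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        bins.append(current)
    # Rewriting a single file on its own gains nothing, so drop such bins.
    return [b for b in bins if len(b) > 1]


sizes_mb = {"a": 10, "b": 20, "c": 30, "d": 50, "e": 128, "f": 60}
plan = plan_compaction(sizes_mb, target_size=128)
```

Here four small files are merged into one of 110 MB, while the file already at the target size and the leftover 60 MB file are left as they are.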

**ZORDER** is a multi-dimensional data clustering technique applied during OPTIMIZE. OPTIMIZE my_table ZORDER BY (column1, column2) co-locates data with similar values in the specified columns within the same files, improving data-skipping effectiveness for queries that filter on those columns. ZORDER was originally a Databricks-only feature; open-source Delta Lake has supported OPTIMIZE with ZORDER BY since version 2.0.
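The effect of Z-ordering on data skipping can be demonstrated with a Morton code, which interleaves the bits of two column values. This is a generic sketch of the technique, not the engine's implementation: `z_value`, `files_touched`, the 4x4 grid of points, and the four-rows-per-file layout are all assumptions made for the example.

```python
def z_value(x, y, bits=8):
    """Interleave the bits of x and y into a single Morton code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return z


def files_touched(sorted_points, per_file, x_max, y_max):
    """Count files that cannot be skipped for x <= x_max AND y <= y_max.

    Each consecutive chunk of `per_file` points is one file; a file can be
    skipped when its minimum x or minimum y already exceeds the filter.
    """
    count = 0
    for start in range(0, len(sorted_points), per_file):
        chunk = sorted_points[start:start + per_file]
        min_x = min(p[0] for p in chunk)
        min_y = min(p[1] for p in chunk)
        if min_x <= x_max and min_y <= y_max:
            count += 1
    return count


points = [(x, y) for x in range(4) for y in range(4)]
linear = sorted(points)                               # sorted by x, then y
zorder = sorted(points, key=lambda p: z_value(*p))    # sorted by Morton code

# Filter on both columns: x <= 1 AND y <= 1
both_linear = files_touched(linear, 4, x_max=1, y_max=1)
both_zorder = files_touched(zorder, 4, x_max=1, y_max=1)

# Filter on y only: y <= 0 (x_max=3 matches every x)
y_only_linear = files_touched(linear, 4, x_max=3, y_max=0)
y_only_zorder = files_touched(zorder, 4, x_max=3, y_max=0)
```

With a plain sort on x, a filter on y cannot skip any file, whereas the Z-ordered layout lets both filters skip at least half of the files. The trade-off is that a filter on x alone skips slightly less than it would under the plain sort.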

Delta Lake vs Apache Iceberg vs Apache Hudi

Delta Lake, Apache Iceberg, and Apache Hudi are all open table formats for data lakehouses. They solve the same core problems (ACID transactions, schema evolution, time travel) with different design choices.

**Delta Lake**: strongest Databricks integration (native), mature Spark support, Databricks-led development. The standard choice for Databricks environments. Iceberg interoperability (UniForm, which exposes Delta tables to Iceberg readers) was added recently.

**Apache Iceberg**: strongest cross-platform support (Snowflake, BigQuery, Athena, Flink, Spark). Governed by the Apache Software Foundation rather than a single vendor. Superior hidden partitioning (partitions are not exposed in query syntax) and better cross-engine compatibility. The choice for multi-engine environments where Databricks is not the primary compute.

**Apache Hudi**: strongest streaming ingest and CDC capabilities. Hudi was designed specifically for high-frequency incremental writes from CDC sources. Weaker SQL analytics support than Delta or Iceberg. Commonly used in Kafka-to-data-lake streaming pipelines.

For Databricks-first architectures: Delta Lake. For multi-cloud or multi-engine architectures where vendor neutrality is important: Iceberg. For very high-frequency streaming ingest: Hudi.

Delta Lake in the lakehouse architecture

Delta Lake is the foundation of the **Databricks Lakehouse** — the pattern that combines data lake economics (cheap object storage, open format) with data warehouse reliability (ACID transactions, schema enforcement, SQL performance).

In the medallion architecture:

- **Bronze**: raw data loaded as Delta tables from ingestion pipelines. Schema enforcement prevents corrupted writes. Time travel enables point-in-time recovery if a bad load corrupts Bronze data.

- **Silver**: cleaned and standardised Delta tables. MERGE handles CDC from Bronze to Silver for tables with updates and deletes. Schema evolution handles source system schema changes.

- **Gold**: dimensional Delta tables consumed by BI tools via Databricks SQL warehouses, or via JDBC/ODBC from Tableau and Power BI.

For the medallion architecture context, see what is a data lakehouse. For how Delta Lake fits in Apache Spark workloads, see what is Apache Spark. For the Databricks vs Snowflake decision that determines whether Delta Lake or Snowflake's native format is the right choice, see Snowflake vs Databricks.

Our cloud engineering and data architecture consulting practices design Delta Lake-based lakehouse architectures on Databricks for mid-market and enterprise organisations. If you are designing a lakehouse or evaluating Delta Lake for your environment, book a free 30-minute audit.
