dbt Incremental Models: How They Work and When to Use Them

dbt incremental models process only new or changed rows on each run, rather than rebuilding the entire table. This guide covers the incremental strategies, merge key configuration, and the trade-offs between incremental and full-refresh models.

Full-refresh models in dbt drop and recreate the entire table on every run. For small tables, this is fine. For large tables — hundreds of millions of rows, multi-hour build times — rebuilding from scratch on every run is prohibitively slow and expensive. Incremental models solve this by processing only new or changed rows, appending or merging them into the existing table rather than replacing it.

This guide covers how incremental models work, the available strategies, the configuration required, and the situations where a full-refresh model is actually the right answer.

How incremental models work

An incremental dbt model includes a Jinja conditional that changes the SQL depending on whether the model is running for the first time or incrementally:

On the first run (or after a full-refresh), the is_incremental() block evaluates to false and the query runs without the incremental filter — processing the full source data and building the table from scratch.

On subsequent runs, is_incremental() evaluates to true and the WHERE clause is added — filtering source data to only rows newer than the most recent record already in the target table. Only those new rows are processed and written to the table.

The is_incremental() pattern is simple but requires a reliable way to identify new rows. The most common approach is filtering on an updated_at timestamp column, or on an event date column for append-only event data.

Incremental strategies

dbt supports several strategies for how new rows are written to the incremental table. The available strategies depend on the warehouse adapter.

**append** (all warehouses): Insert new rows into the table. No duplicate checking. Appropriate for append-only event streams where duplicates are impossible (e.g., clickstream events with unique event IDs and no updates). If the source can produce duplicates, append will produce duplicate rows in the target.

**merge** (Snowflake, BigQuery, Databricks, Spark): A SQL MERGE (UPSERT) — matched rows are updated, unmatched rows are inserted. Requires a unique_key configuration specifying which column(s) identify a unique record. Merge handles source data that includes both new records and updates to existing records. The unique_key prevents duplicates.

**delete+insert** (Snowflake, BigQuery, Databricks): Deletes all rows in the target table that match on the unique_key, then inserts the new batch. Functionally equivalent to merge for correctness, but some warehouses execute this more efficiently than a MERGE statement for large batches.

**insert_overwrite** (BigQuery, Spark, Databricks): Replaces entire partitions rather than individual rows. Appropriate for partition-based data where you want to replace everything in a date partition with the latest version of that day's data. More efficient than row-level merge for daily data with many rows per day.

The right strategy depends on your source data semantics. Append-only event data → append. Updatable records with a primary key → merge. Date-partitioned data with potential intra-day corrections → insert_overwrite on date partition.

Configuring unique_key

For merge and delete+insert strategies, the unique_key identifies which rows in the target table correspond to which rows in the incoming batch. For example: unique_key = 'order_id' means "when the incoming batch contains order_id 12345, update the existing row with order_id 12345 rather than inserting a duplicate."

unique_key can be a single column or a list of columns (composite key). When the natural key for a record is composite — for example, customer_id + date for a daily customer metrics table — pass the list.

For merge strategies, be explicit about which columns to update when a match is found. By default, all columns are updated. If you want to preserve certain column values on match (for example, created_at should not be overwritten on upsert), use the merge_update_columns configuration to specify which columns the MERGE should update.

The incremental filter pattern

The standard incremental filter uses a subquery against the target table to find the maximum value of the timestamp column, then filters the source to only rows newer than that maximum. A lookback window (subtracting some hours or days) is critical for handling late-arriving data — if events from yesterday can still arrive today, a filter of max(updated_at) without lookback would miss them.

The lookback window size depends on your data's late-arrival characteristics. For streaming data with guaranteed delivery within minutes, a 1-hour lookback is sufficient. For batch pipelines with day-level SLAs, a 3-day lookback ensures late arrivals are captured.

The trade-off: a larger lookback window processes more data per run. For most incremental models, processing the last 3–7 days on each run is an acceptable cost for correctness.

When to use full-refresh instead

Incremental models are not universally better than full-refresh models. Use full-refresh when:

**The table is small enough that full-refresh is fast**: A 10-million-row table in Snowflake with a well-designed query rebuilds in under 2 minutes. The complexity of managing incremental logic is not worth saving 2 minutes.

**The upstream data can have historical corrections**: If source systems can update records arbitrarily far back in time, an incremental model with a fixed lookback window will miss corrections outside that window. If your source says an order placed 6 months ago was subsequently cancelled, and your lookback is 7 days, the cancellation never reaches the incremental model. Full-refresh handles arbitrary historical corrections correctly; incremental does not.

**The transformation logic changed significantly**: When you refactor an incremental model's SQL, the existing rows in the table were built with old logic and the new logic only applies to new rows going forward. The table is now inconsistent — different rows were computed with different logic. Run a full-refresh after any significant logic change to ensure the entire table is consistent.

For the last case, dbt's --full-refresh flag triggers a full-refresh of incremental models on demand. Run it explicitly after major model changes.

Practical tips for production incremental models

**Always test with a full-refresh before relying on incremental**: Build the table from scratch, verify the row count and key metrics, then run incrementally and verify that subsequent runs produce correct increments.

**Monitor incremental run row counts**: An incremental run that processes 0 rows may indicate the incremental filter is too restrictive (the lookback window is too short, or the updated_at column is not being updated by the source). An incremental run that processes unexpectedly many rows may indicate the filter is too broad.

**Use dbt's on_schema_change configuration**: When the source schema changes (new column added), specify on_schema_change = 'append_new_columns' to add new columns to the target table automatically rather than failing the run.

**Document the unique_key and filter column in the model's YAML**: Future maintainers need to understand why the model is incremental, what identifies a unique record, and what the lookback window is set to.

For the broader dbt context, see dbt best practices and data modeling best practices. Our data architecture consulting practice designs and audits dbt project architectures — book a free review if your dbt models are running too slowly or producing inconsistent results.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →