Incremental loading is the practice of processing only new or changed data in each pipeline run, rather than reprocessing the full dataset. This guide explains incremental loading patterns — append, upsert, CDC-based — how they compare to full refresh, and when each approach is appropriate.
Incremental loading is the practice of processing only the data that has changed since the last pipeline run, rather than reprocessing the complete dataset from scratch on every execution. A full refresh of a 500-million-row transaction table takes hours and consumes significant compute resources; an incremental load that processes only the past hour's transactions takes minutes.
The choice between full refresh and incremental loading is one of the most consequential decisions in data pipeline design. It affects pipeline latency, compute cost, storage requirements, and the complexity of handling edge cases like deletes and schema changes.
Full Refresh vs Incremental Loading
**Full refresh** drops and recreates the target table from scratch on every pipeline run. It is simple, correct by construction (the target always matches the source query exactly), and easy to debug — if data is wrong, rerun the pipeline. Its cost is runtime and compute. For small or medium tables, full refresh is often the right answer: the simplicity benefit outweighs the runtime cost.
**Incremental loading** processes only new or changed rows, appending or updating the target. Runtime scales with change volume rather than total data volume. As datasets grow, incremental loading becomes necessary — a table that takes 5 minutes to full-refresh at 100 million rows takes 50 minutes at 1 billion rows, making daily full refresh prohibitive.
The decision rule: use full refresh until runtime or cost make it impractical. Introduce incremental loading when you need it, not preemptively.
Incremental Patterns
**Append-only (insert new rows):** The simplest incremental pattern. Filter the source to rows created after the last successful run timestamp; insert them into the target. This works when source data is append-only — new rows are added but existing rows are never modified or deleted. Event tables, log tables, and immutable transaction records are append-only by design.
Implementation: maintain a last_loaded_at timestamp (stored in a pipeline metadata table or derived from MAX(created_at) in the target). Each run queries WHERE created_at > last_loaded_at.
**Upsert (insert new rows, update changed rows):** Handles tables where rows can be created and modified but not deleted. The pipeline identifies new and changed rows, inserts new ones, and updates existing ones. Requires a reliable modification timestamp in the source — an updated_at column populated by source application triggers or CDC.
Implementation in cloud warehouses typically uses a MERGE statement: match source rows to target rows on the primary key; when matched and source is newer, update; when not matched, insert.
**CDC-based loading:** Change data capture reads the source database's transaction log (binary log in MySQL/PostgreSQL, redo log in Oracle) to capture every insert, update, and delete event. The CDC stream contains the full row before and after the change, enabling exact replication of source table state including deletes.
CDC is the most accurate approach for full source replication but also the most complex. Tools like Debezium (open-source), Fivetran, and Stitch implement CDC and deliver change events to Kafka topics or directly to a destination table. CDC enables true real-time replication with sub-minute latency; it also requires managing CDC offset tracking and handling schema change events in the source.
**Snapshot loading:** Instead of tracking changes, snapshot loading captures the full state of the source table at each run and stores it as a new partition. No change detection is required — every run is a full extraction, but previous snapshots are retained rather than overwritten. Enables full historical analysis of slowly changing state — "what were all customer plan assignments on March 15 2023?" — at the cost of storage growth proportional to table size times snapshot frequency.
Incremental Loading in dbt
dbt's incremental materialization implements the upsert pattern. A dbt incremental model includes a unique_key configuration (the column or columns that uniquely identify a row) and an is_incremental() macro-gated filter that applies only during incremental runs:
When running for the first time (full refresh), the model executes without the incremental filter and populates the full table. On subsequent runs, the incremental filter applies — only rows modified after the current max timestamp in the target are included in the incremental result set. dbt then merges this result set into the target table using the unique key.
The key consideration with dbt incremental models: the incremental filter must be correct. If the modified timestamp in the source is not reliably populated (application bugs that do not update the timestamp on every record change, ETL processes that load data without preserving the source timestamp), the incremental filter will miss changed rows.
Handling Late-Arriving Data
Late-arriving data — records with past event timestamps that arrive in the pipeline after the window they belong to has been processed — is the primary edge case in incremental loading.
Strategies:
**Lookback window:** Instead of processing only since the last run, include a buffer period — "process all records with event timestamps in the last 48 hours." Late-arriving records within the lookback window are captured; records arriving later than the window are missed. Appropriate when late arrival is predictable and bounded.
**Full reprocess on schedule:** Combine incremental loading for recent data with periodic (weekly, monthly) full refresh to clean up any records that were missed. Correct but requires managing the full refresh schedule and its higher runtime cost.
**Event time vs load time:** Track both the event timestamp (when the event occurred) and the load timestamp (when the record arrived in the source system). Use the load timestamp for incremental filtering — this ensures every record is captured regardless of how far in the past its event timestamp is, as long as it has been loaded into the source within the lookback window.
Our data engineering services practice designs and implements incremental pipeline patterns for cloud data warehouses — from simple append-only Fivetran connections to CDC-based replication with full historical fidelity. Contact us to discuss your data pipeline requirements.
A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.
Book a Call →