BlogData Engineering

What Is Change Data Capture? Real-Time Data Sync From Source Systems

James Okafor
James Okafor
Senior Data Engineer
·June 3, 20289 min read

Change data capture (CDC) is a technique for identifying and capturing data changes in source systems and delivering them to downstream targets in near real-time. This guide explains how CDC works, the implementation approaches, and when it is the right choice over batch extraction.

Change data capture (CDC) is a technique for identifying data changes in a source system — inserts, updates, and deletes — and delivering those changes to downstream targets in near real-time. Rather than extracting the full dataset on every run, CDC extracts only what has changed since the last extraction, enabling continuous synchronization between a source system and its targets.

CDC is the foundation of real-time data pipelines: it is what allows a data warehouse to reflect changes made in a production application within seconds or minutes, rather than waiting for a nightly batch load.

Why Batch Extraction Has Limits

The traditional alternative to CDC is batch extraction: running a full-table or incremental SQL query against the source database on a schedule, then loading the results into the warehouse. For many use cases, batch extraction is perfectly adequate.

Batch extraction breaks down when:

**Source systems cannot support query load.** Running heavy analytical queries against a production OLTP database degrades performance for application users. For high-traffic systems, a nightly batch extract may be impractical.

**Deletes cannot be detected.** A query that extracts rows where updated_at is greater than the last run time will capture inserts and updates. It will not capture deletes — rows that have been removed from the source table leave no trace. If the warehouse needs to reflect deletions, batch extraction requires a full-table comparison.

**Latency requirements are tight.** If stakeholders need data that is current to within minutes, hourly or daily batch extraction does not meet the requirement. CDC can deliver changes with latency measured in seconds.

**Change volume is high.** Full-table extraction of very large tables is expensive in time and compute. CDC extracts only the changed records, which may be a tiny fraction of the full dataset.

How CDC Works: Three Approaches

### Log-Based CDC

Log-based CDC is the most reliable and widely deployed approach. It reads the database transaction log — a record of every write operation that occurs — rather than querying the source tables directly.

Every major relational database maintains a transaction log for recovery purposes. PostgreSQL's Write-Ahead Log (WAL), MySQL's binary log (binlog), Oracle's LogMiner, and SQL Server's Transaction Log all expose this change stream.

A log-based CDC tool (Debezium, Fivetran, Airbyte, Striim) taps into this log as a consumer, parses change events, and publishes them to a message stream (Kafka, Kinesis) or writes them directly to a target.

**Advantages:** Zero query load on the source database. Captures every change including deletes. Sub-second latency is achievable. Does not require application changes.

**Disadvantages:** Requires database-level access (often DBA involvement). Some database configurations require adjustments to enable log retention. Changes to source schema can break log parsing if not handled carefully.

### Query-Based CDC

Query-based CDC extracts changes by periodically querying the source table for rows where an updated_at or modified_at timestamp is greater than the last extraction time.

**Advantages:** No special database access required. Works with any SQL-queryable source. Easier to implement.

**Disadvantages:** Does not capture deletes (rows with no updated_at change when deleted). Requires every source table to have a reliable updated_at column maintained by the application — not always the case. Adds query load to the source database. Latency is bounded by the polling frequency.

### Trigger-Based CDC

Trigger-based CDC uses database triggers to write change events to a staging table whenever a row is inserted, updated, or deleted. The CDC system reads from the staging table rather than the source tables.

**Advantages:** Captures deletes reliably. Works without transaction log access.

**Disadvantages:** Triggers add write overhead to every source table operation. Maintaining triggers is a DBA responsibility. Can cause performance degradation at high write volumes.

CDC in the Modern Data Stack

CDC is typically paired with a message streaming platform:

The source database sends change events to **Apache Kafka** (or a managed equivalent — Confluent Cloud, Amazon MSK, Redpanda). Kafka durably buffers the events and decouples the producer (source database) from the consumers (warehouse ingestion, search index, application event stream).

A stream processing layer (Apache Flink, Kafka Streams, Spark Structured Streaming) may transform, filter, or join events before they reach the target.

The target — a data warehouse (Snowflake, BigQuery, Databricks), a search index (Elasticsearch), a feature store, or a cache — receives the change events and applies them to its representation of the data.

CDC Tools

**Debezium** — the leading open-source CDC connector. Supports PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, and others. Publishes change events to Kafka. Deployed as Kafka Connect connectors. Requires more operational investment than managed alternatives.

**Fivetran** — managed data movement platform with CDC support for major databases. Handles log-based CDC without requiring configuration of the database-level log settings. Fully managed, easy to set up, higher cost.

**Airbyte** — open-source data integration platform. Supports CDC via log-based and query-based approaches depending on the connector. Can be self-hosted or used as a managed service.

**AWS DMS (Database Migration Service)** — AWS-managed CDC service. Particularly well-integrated with the AWS ecosystem. Used for both one-time migrations and ongoing replication.

**Stitch** — managed platform (now part of Talend) with CDC connectors. Similar positioning to Fivetran.

When CDC Is and Is Not the Right Choice

Use CDC when:

- Analytical data freshness requirements are measured in minutes

- Source tables are too large for efficient full-table extraction

- Deletes in source systems need to be reflected in the warehouse

- Source database query load is a constraint

Use batch extraction when:

- Daily or hourly data freshness is sufficient

- Source tables are small and extract queries are fast

- The source system does not support log access

- Simplicity is more important than latency

Our data architecture and cloud engineering practices design and implement real-time data pipelines using CDC where the requirements justify it. Contact us to discuss your data ingestion architecture.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →