Change Data Capture (CDC) reads the transaction log of a source database to capture every insert, update, and delete as it happens — delivering a stream of changes to downstream systems without polling or timestamp-based queries. For analytics teams that need near-real-time data from operational databases without placing query load on production systems, CDC is the most efficient ingestion architecture.
Change Data Capture (CDC) is an ingestion pattern that reads the transaction log of a source database to capture database changes — inserts, updates, and deletes — as they occur, and delivers them to downstream systems as a stream of change events. Unlike polling-based incremental extraction (querying the source for rows updated since the last run), CDC reads the database's internal log, which records every committed transaction regardless of whether a timestamp column exists.
Why CDC Over Polling
Polling-based incremental extraction has three limitations that CDC eliminates:
**Hard-delete invisibility** — polling queries for rows with updated_at > last_run. When a row is deleted from the source, it disappears from future queries. The downstream analytics system never learns about the deletion; the deleted row persists in the warehouse indefinitely. CDC captures DELETE events explicitly, enabling downstream systems to reflect deletions correctly.
**Source database load** — polling fires a query against the source database on every extraction run. For high-frequency extraction (every 5–10 minutes), this generates continuous read load on a production operational database. Transaction log reading is a different access pattern — it reads the log, not the tables, and generates minimal load on the production database.
**Timestamp dependency** — polling requires a reliable updated_at column that updates on every row modification. Many operational tables lack this column, or it updates unreliably (not on all code paths, or with timezone inconsistencies). CDC does not require a timestamp column; it captures changes from the log regardless of whether the changed row has a timestamp field.
How CDC Works
Every major relational database maintains a transaction log — a sequential record of committed database operations. CDC reads this log:
**PostgreSQL** — CDC reads the Write-Ahead Log (WAL) using logical replication slots. A logical decoding plugin (pgoutput, wal2json, decoderbufs) decodes WAL records into structured change events.
**MySQL/MariaDB** — CDC reads the binary log (binlog). MySQL must be configured with binlog_format=ROW to record row-level changes (rather than the SQL statements), enabling CDC to capture the before and after state of each changed row.
**SQL Server** — CDC uses SQL Server's Change Data Capture feature (a built-in database feature) or the transaction log via SQL Server's log reader agent. CDC must be explicitly enabled on each table.
**Oracle** — CDC reads the Oracle LogMiner, a built-in log reading facility.
CDC Change Event Structure
A CDC change event contains:
- **Operation type** — insert, update, delete (some connectors also emit a "create" event for snapshot rows).
- **Before state** — the row's column values before the change (for updates and deletes).
- **After state** — the row's column values after the change (for inserts and updates).
- **Metadata** — transaction timestamp, database name, table name, log position (offset).
For updates, the before/after pair enables the analytics system to apply the update precisely — knowing not just what a row currently looks like, but what changed between states.
CDC in Analytics Architectures
CDC events are typically delivered to a downstream system one of two ways:
**Kafka as the streaming layer** — CDC events are published to Kafka topics (one topic per source table). Downstream consumers — warehouse loaders, stream processors, other applications — consume from Kafka at their own pace. This is the standard architecture for enterprise CDC because Kafka provides durable storage, replay, and fan-out.
**Direct-to-warehouse** — CDC tools like Debezium with Kafka Connect, or managed CDC services (Fivetran's database connectors in CDC mode, AWS DMS) write directly to a warehouse landing table. This is simpler but less flexible — there is no durable intermediate storage for replay and no fan-out capability.
Applying CDC Events in the Warehouse
CDC change events require a different warehouse ingestion approach than batch loading:
**Streaming merge** — incoming CDC events are merged into the destination table using the event's operation type. INSERT events add rows; UPDATE events update existing rows by primary key; DELETE events remove rows. Most warehouses support MERGE or UPSERT operations that handle this.
**Staging table approach** — CDC events land in a staging table with operation type as a column. A scheduled merge job applies the staged events to the target table using MERGE INTO syntax. This decouples event landing (continuous) from event application (scheduled) and simplifies the warehouse ingestion layer.
**Event sourcing / append-only** — instead of merging events, append all events to an immutable history table. The current state is derived by querying the latest event per entity. This approach retains full history but requires a materialised view or mart model to reconstruct the current state for BI query performance.
Operational Considerations
**Replication slot management (PostgreSQL)** — PostgreSQL replication slots retain WAL records until the consumer acknowledges them. If the consumer stops reading (connector outage, maintenance), WAL accumulates and can exhaust disk space. Monitor replication slot lag and have automated alerting for replication slot growth.
**Initial snapshot** — before CDC streaming begins, an initial snapshot captures the current state of all rows. The snapshot is typically a full table scan that runs before the streaming replication starts. For large tables, this snapshot can take hours. During the snapshot, changes are buffered in the log and applied after the snapshot completes.
**Schema changes** — CDC events are typed according to the source table's schema. When the source schema changes (a column added, a column type changed), the CDC connector must handle the schema evolution. Most CDC tools handle column additions gracefully; column type changes and column removals require more careful handling.
**Exactly-once delivery** — CDC systems typically guarantee at-least-once delivery (events may be delivered more than once but never dropped). Downstream consumers must be idempotent — applying the same event multiple times should produce the same result. Using the event's log offset as a deduplication key enables idempotent processing.
Our data architecture practice designs CDC architectures for enterprise data teams — contact us to discuss database change capture for your analytics stack.
A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.
Book a Call →