
Kafka Connect and Debezium: Change Data Capture in Production

James Okafor
Data & Cloud Engineer
October 13, 2026 · 10 min read

How to implement change data capture with Debezium and Kafka Connect — capturing every insert, update, and delete from operational databases and streaming them to analytics systems with low latency and high reliability.

Debezium is the open-source change data capture (CDC) platform that has become the standard approach for streaming database changes to Kafka. Where traditional ETL copies data periodically, Debezium reads the database transaction log — the same log that provides replication — and streams every insert, update, and delete as an event in near-real time. The result is a continuous feed of database changes that downstream systems can consume with low latency.

Kafka Connect is the integration framework that manages connectors — including Debezium — as a distributed, scalable, fault-tolerant service. Together, Debezium connectors running in Kafka Connect provide the reliable CDC infrastructure that powers real-time data pipelines.

What Change Data Capture Solves

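Traditional batch ETL for analytics has a fundamental limitation: an incremental extract can detect new rows, and updated rows if the table carries a last_modified timestamp, but not hard deletes. If a customer record is physically deleted from the operational database, the row simply stops appearing in query results, and the deletion does not propagate to the analytical warehouse unless the ETL is specifically designed to detect it (for example, by comparing full key sets). Soft deletes are captured only if every application sets the deletion flag and the timestamp consistently. For businesses where accurate deletion propagation matters (GDPR compliance, fraud detection, inventory systems), this is a significant gap.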

CDC solves this by reading from the database's transaction log rather than querying the data. The transaction log records every operation: not just which rows exist now, but what changed, when, and in what order. Debezium converts this log stream into Kafka events that downstream systems can consume.
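The gap can be illustrated with a minimal Python sketch. The in-memory `source` and `warehouse` dictionaries and the `incremental_extract` helper are hypothetical stand-ins for a real table and a timestamp-based extract query, not part of any library.

```python
from datetime import datetime

# Hypothetical in-memory stand-in for a source table: id -> row.
source = {
    1: {"id": 1, "name": "Ada", "last_modified": datetime(2026, 1, 1, 9, 0)},
    2: {"id": 2, "name": "Grace", "last_modified": datetime(2026, 1, 1, 9, 5)},
}

def incremental_extract(table, since):
    """Timestamp-based batch extract: rows modified after the last run."""
    return [row for row in table.values() if row["last_modified"] > since]

# First batch run copies both rows to the warehouse.
warehouse = {row["id"]: row for row in incremental_extract(source, datetime.min)}
last_run = datetime(2026, 1, 1, 10, 0)

# Row 2 is hard-deleted in the source after the first run.
del source[2]

# The next run finds nothing: a deleted row has no timestamp left to query.
changed = incremental_extract(source, last_run)
for row in changed:
    warehouse[row["id"]] = row

stale_ids = sorted(set(warehouse) - set(source))
print(changed)    # []
print(stale_ids)  # [2]
```

The warehouse still holds row 2 after the second run. A log-based feed would instead deliver an explicit delete event for that row.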

Beyond deletes, CDC provides:

- **Lower latency**: changes appear in Kafka seconds after they are committed, not hours after the next batch run

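- **Accurate ordering**: events are emitted in the order the changes were recorded in the log, so downstream systems see the changes to any given row in the correct sequence (Kafka preserves this order within a partition, not across topics)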

- **Minimal source query load**: reading the transaction log avoids the repeated extract queries and table scans that can impact production performance, although log decoding and log retention still place some overhead on the source

Debezium Architecture

Debezium runs as a Kafka Connect connector. Each Debezium connector monitors one database server, reads the transaction log for specified tables, and publishes events to Kafka topics. By default, each monitored table gets its own Kafka topic, named topicPrefix.schemaName.tableName for PostgreSQL or topicPrefix.databaseName.tableName for MySQL, where the prefix comes from the connector's topic.prefix setting (database.server.name before Debezium 2.0).

**Supported sources.** Debezium has production-ready connectors for: MySQL/MariaDB (reads the binary log), PostgreSQL (reads the WAL through logical decoding, using the pgoutput or decoderbufs plugin; wal2json support was removed in Debezium 2.0), SQL Server (reads the change tables populated by SQL Server's CDC feature), Oracle (mines the redo logs via LogMiner), MongoDB (uses change streams, which are backed by the oplog), Db2, and Vitess. Each database has different prerequisites for enabling CDC at the database level.

**Event format.** Each Debezium event is a JSON (or Avro, or Protobuf) message containing:

- **before**: the row's values before the change (null for inserts)

- **after**: the row's values after the change (null for deletes)

- **op**: the operation type — c (create/insert), u (update), d (delete), r (read — initial snapshot)

- **ts_ms**: the time at which the connector processed the event; the time of the change in the source database is recorded in source.ts_ms

- **source**: metadata about the source (server, database, table, binlog position or LSN)

This structure enables downstream systems to reconstruct the full history of any row, apply changes in order, and implement SCD Type 2 logic.
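As a minimal sketch of how a consumer might use this envelope, the Python below replays a handful of hand-written events into a dictionary keyed by primary key. The `apply_change` helper and the `id` key field are illustrative assumptions. The sketch also assumes that tombstone records (null values that follow deletes) have been filtered out and that delete events carry the key in `before`.

```python
import json

def apply_change(state, raw_event, key_field="id"):
    """Apply one Debezium-style change event to a dict keyed by primary key."""
    event = json.loads(raw_event)
    # With schemas enabled in the JSON converter, the envelope sits under "payload".
    payload = event.get("payload", event)
    op = payload["op"]
    if op in ("c", "r", "u"):
        row = payload["after"]
        state[row[key_field]] = row
    elif op == "d":
        row = payload["before"]
        state.pop(row[key_field], None)
    return state

# Hand-written sample events, in log order.
events = [
    '{"op": "r", "before": null, "after": {"id": 1, "status": "new"}}',
    '{"op": "c", "before": null, "after": {"id": 2, "status": "new"}}',
    '{"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}}',
    '{"op": "d", "before": {"id": 2, "status": "new"}, "after": null}',
]

state = {}
for raw in events:
    apply_change(state, raw)

print(state)  # {1: {'id': 1, 'status': 'paid'}}
```

Keeping the superseded `before` images with their timestamps, rather than discarding them, is the starting point for SCD Type 2 history.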

Initial Snapshot

When a Debezium connector starts for the first time, it usually cannot rebuild table contents from the transaction log alone, because logs are typically retained for only a few days. It performs an initial snapshot: it reads the current state of each monitored table and publishes read events (op: r) for every existing row. Once the snapshot is complete, it switches to streaming changes from the log position recorded when the snapshot began.

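The initial snapshot is a full table scan and can be slow for large tables. Locking behaviour depends on the connector and its snapshot settings: the MySQL connector, for example, briefly takes a global read lock (or table-level locks) to establish a consistent starting point, while the PostgreSQL connector typically reads inside a single transaction without blocking writes. In either case the scan adds load and may impact operational queries on high-traffic tables. Plan the initial snapshot carefully and run it during low-traffic periods for large, heavily used tables.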

For tables that are too large to snapshot in a reasonable time, Debezium supports incremental snapshots, which use a signalling mechanism to snapshot tables in chunks, concurrently with streaming and without a prolonged lock. The feature is available from Debezium 1.6 onwards.
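An incremental snapshot is typically triggered by inserting a row into a signalling table that the connector watches. The sketch below only builds that insert statement. The table name `public.debezium_signal` is a hypothetical choice that would have to match the connector's signal.data.collection setting, and the `%s` placeholders assume a psycopg-style database driver.

```python
import json
import uuid

def build_snapshot_signal(tables, signal_table="public.debezium_signal"):
    """Return (sql, params) for an ad hoc incremental snapshot signal row."""
    data = json.dumps({"data-collections": list(tables), "type": "incremental"})
    # signal_table is a hypothetical name; it is interpolated, so never pass user input.
    sql = f"INSERT INTO {signal_table} (id, type, data) VALUES (%s, %s, %s)"
    params = (str(uuid.uuid4()), "execute-snapshot", data)
    return sql, params

sql, params = build_snapshot_signal(["public.orders"])
print(sql)
print(params[1], params[2])
```

Executing the statement with `cursor.execute(sql, params)` on the source database is what asks the connector to start snapshotting the listed tables in chunks.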

Kafka Connect Configuration and Operations

**Connector configuration.** A Debezium connector is configured via a JSON document submitted to Kafka Connect's REST API. Key configuration parameters:

```json
{
  "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
  "database.hostname": "production-db.internal",
  "database.port": "5432",
  "database.user": "debezium",
  "database.password": "${file:/secrets/db-credentials.properties:db.password}",
  "database.dbname": "application_db",
  "topic.prefix": "prod_db",
  "table.include.list": "public.orders,public.customers,public.products",
  "plugin.name": "pgoutput",
  "slot.name": "debezium_slot"
}
```
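One way to submit such a configuration is Kafka Connect's `PUT /connectors/{name}/config` endpoint, which creates the connector if it does not exist and updates it otherwise. The sketch below builds the request with the Python standard library. The worker address `http://localhost:8083` and the connector name `orders-cdc` are assumptions for illustration, and the config is abbreviated.

```python
import json
import urllib.request

def build_connector_request(connect_url, name, config):
    """Build a PUT /connectors/{name}/config request (create or update)."""
    return urllib.request.Request(
        url=f"{connect_url.rstrip('/')}/connectors/{name}/config",
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

# Abbreviated config; a real submission would carry the full set of properties.
config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "production-db.internal",
    "topic.prefix": "prod_db",
}

# Hypothetical Connect worker address and connector name.
request = build_connector_request("http://localhost:8083", "orders-cdc", config)
print(request.get_method(), request.full_url)

# To submit against a running worker:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Using PUT rather than `POST /connectors` makes the call idempotent, which suits deployment scripts that run repeatedly.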

**Offset storage.** Debezium tracks its position in the transaction log using Kafka Connect's offset storage (stored in a Kafka topic by default). If the connector stops and restarts, it resumes from where it left off. Offset storage must be on a reliable Kafka cluster — losing offsets means the connector cannot resume cleanly and may require a new snapshot.

**Replication slots (PostgreSQL).** PostgreSQL CDC requires a replication slot — a server-side bookmark in the WAL that ensures the database retains WAL segments until Debezium has consumed them. An inactive replication slot causes WAL buildup on the PostgreSQL server, potentially filling disk. Monitor replication slot lag and alert if a connector falls behind.
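A minimal monitoring sketch, assuming PostgreSQL 10 or later: the query reads `pg_replication_slots` and measures how far each logical slot trails the current WAL position. The 1 GB threshold and the `slots_needing_attention` helper are illustrative choices, and the sample rows are made up.

```python
# Query to run against the primary; returns (slot_name, active, lag_bytes) rows.
SLOT_LAG_SQL = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical'
"""

def slots_needing_attention(rows, max_lag_bytes=1_000_000_000):
    """Return alert messages for inactive slots or slots over the lag threshold."""
    alerts = []
    for slot_name, active, lag_bytes in rows:
        if not active:
            alerts.append(f"{slot_name}: inactive, retaining {lag_bytes} bytes of WAL")
        elif lag_bytes > max_lag_bytes:
            alerts.append(f"{slot_name}: lag {lag_bytes} bytes exceeds threshold")
    return alerts

# Sample rows shaped like the query result (made-up values).
sample = [("debezium_slot", True, 12_000), ("old_slot", False, 5_000_000_000)]
print(slots_needing_attention(sample))
```

Inactive slots are flagged regardless of lag, because a slot with no consumer retains WAL indefinitely.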

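**Dead letter queue.** For records that fail conversion or transformation (malformed messages, serialization errors), Kafka Connect can route the failed record to a dead letter queue topic rather than halting the connector. Dead letter queue routing applies to sink connectors; for a source connector such as Debezium, the errors.tolerance setting controls whether such failures are skipped and logged or stop the task. A dead letter queue requires downstream handling: monitor it and investigate or reprocess what lands there rather than ignoring it.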
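A minimal sketch of the Kafka Connect error-handling properties involved, shown for a hypothetical sink connector (dead letter queue routing is a sink-side feature in Kafka Connect). The topic name `dlq.orders-sink`, the replication factor, and the `with_error_handling` helper are assumptions.

```python
# Error-handling properties for a hypothetical sink connector.
sink_error_handling = {
    "errors.tolerance": "all",                        # keep running on bad records
    "errors.deadletterqueue.topic.name": "dlq.orders-sink",
    "errors.deadletterqueue.topic.replication.factor": "3",
    "errors.deadletterqueue.context.headers.enable": "true",  # record failure context
    "errors.log.enable": "true",
    "errors.log.include.messages": "false",           # keep record contents out of logs
}

def with_error_handling(base_config, error_config=sink_error_handling):
    """Return a copy of base_config with the error-handling properties applied."""
    merged = dict(base_config)
    merged.update(error_config)
    return merged

sink_config = with_error_handling({"topics": "prod_db.public.orders"})
print(sink_config["errors.deadletterqueue.topic.name"])  # dlq.orders-sink
```

With the context headers enabled, each dead-lettered record carries the failure reason, which makes later triage of the queue far easier.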

Consuming Debezium Events Downstream

Debezium events land in Kafka topics. Common downstream consumers:

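**Kafka Connect sink connectors** write events to target systems: JDBC sink connector for relational databases, Snowflake Kafka Connector for Snowflake, BigQuery Kafka Connector, S3 Sink for data lake landing. Sinks that support upserts and deletes, such as the JDBC sink, can map Debezium events to INSERT, UPDATE, and DELETE operations in the target, usually after the change envelope has been flattened with Debezium's ExtractNewRecordState transformation. Append-only sinks such as S3 land the raw events and leave merging to a later step.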

**Kafka Streams or ksqlDB** for stream processing: filtering, transforming, aggregating, and enriching events before they reach the target system. A Kafka Streams application can join the orders CDC stream with the customers CDC stream and publish a denormalised order-with-customer stream to a new topic.
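The join logic can be sketched in plain Python. This is not the Kafka Streams API, only an illustration of the stream-table join it would perform: customer events maintain a lookup table, and each order event is enriched from it. The topic names and the `customer_id` and `name` fields are assumptions.

```python
def join_orders_with_customers(events):
    """Join order change events against a table built from customer change events."""
    customers = {}  # latest customer row by id (the "table" side of the join)
    output = []
    for topic, payload in events:
        if topic.endswith(".customers"):
            if payload["op"] == "d":
                customers.pop(payload["before"]["id"], None)
            else:
                customers[payload["after"]["id"]] = payload["after"]
        elif topic.endswith(".orders") and payload["op"] in ("c", "u", "r"):
            order = payload["after"]
            customer = customers.get(order["customer_id"])
            if customer is not None:  # inner join: skip orders with no known customer
                output.append({**order, "customer_name": customer["name"]})
    return output

# Hand-written sample events in arrival order.
events = [
    ("prod_db.public.customers", {"op": "c", "before": None, "after": {"id": 10, "name": "Ada"}}),
    ("prod_db.public.orders", {"op": "c", "before": None, "after": {"id": 1, "customer_id": 10}}),
    ("prod_db.public.orders", {"op": "c", "before": None, "after": {"id": 2, "customer_id": 99}}),
]
print(join_orders_with_customers(events))
# [{'id': 1, 'customer_id': 10, 'customer_name': 'Ada'}]
```

In a real topology the two topics are consumed independently, so an order can arrive before its customer. A production join has to decide how to handle that case.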

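**Flink** for complex event processing: stateful stream processing with exactly-once semantics. Flink SQL can consume Debezium topics directly by declaring a Kafka source table with the debezium-json or debezium-avro-confluent format, which interprets each event as an insert, update, or delete on the table.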

**Custom consumers** (Spark Structured Streaming, Python consumer applications) read from the Kafka topic and write to targets using bespoke logic.

Handling Schema Changes

One of the more operationally challenging aspects of CDC is handling source schema changes. If a developer adds a column to the monitored table, existing consumers of the Kafka topic may not handle the new column gracefully.

Debezium integrates with a schema registry (Confluent Schema Registry or AWS Glue Schema Registry) through the Avro converter, which registers the schema of each event version; the registry enforces compatibility checks. When a schema change occurs, the registry validates the new schema against the configured compatibility mode (backward compatibility is the default in Confluent Schema Registry) before allowing it to be registered.
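To make the compatibility rule concrete, here is a deliberately simplified Python sketch of a backward-compatibility check. It is not the registry's implementation: real Avro resolution also handles type promotions, aliases, and unions, all of which this ignores. The field dictionaries are a hypothetical representation.

```python
def is_backward_compatible(old_fields, new_fields):
    """Simplified BACKWARD check: can a reader on the new schema read old records?

    Fields are dicts of name -> {"type": ..., optional "default": ...}.
    """
    for name, spec in new_fields.items():
        if name not in old_fields:
            if "default" not in spec:
                return False  # the reader has no value to fill in for old records
        elif spec["type"] != old_fields[name]["type"]:
            return False      # type changes treated as incompatible (ignores promotions)
    return True               # removed fields are fine: the reader ignores them

old = {"id": {"type": "long"}, "email": {"type": "string"}}
added_with_default = {**old, "tier": {"type": "string", "default": "basic"}}
added_without_default = {**old, "tier": {"type": "string"}}

print(is_backward_compatible(old, added_with_default))     # True
print(is_backward_compatible(old, added_without_default))  # False
```

In practice this is why a new nullable column with a default tends to pass through a CDC pipeline smoothly, while a new required column needs coordination with consumers.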

For teams that want to use CDC without managing full schema evolution, configuring schema change handling to stop the connector on schema changes (and triggering an alert for human review) is a safer operational posture than automatic schema evolution propagation.

For real-time data architecture design including CDC pipeline implementation on AWS, Azure, or GCP, our data architecture consulting and cloud engineering practices cover end-to-end streaming infrastructure — contact us to discuss your requirements.
