Debezium CDC: Open-Source Change Data Capture for Production Databases

Debezium is the most widely deployed open-source CDC platform — it connects to PostgreSQL, MySQL, SQL Server, Oracle, and MongoDB transaction logs and publishes change events to Kafka. This guide covers Debezium architecture, connector configuration for each supported database, and the operational practices for running Debezium reliably in production.

Debezium is the de facto standard for teams running self-managed CDC pipelines. It is most commonly deployed as a set of Kafka Connect source connector plugins, which lets it rely on Kafka Connect's distributed task management, offset tracking, and connector lifecycle management.

Debezium Architecture

Debezium connectors run inside Kafka Connect workers. Kafka Connect is a distributed integration framework that manages the lifecycle of connector tasks, tracks offsets (which log positions have been processed), and provides a REST API for connector configuration and management.

The deployment architecture:

1. Kafka cluster — receives and stores change events

2. Kafka Connect cluster — runs Debezium connectors as tasks

3. Source databases — read by Debezium via transaction log protocols

4. Downstream consumers — read change events from Kafka topics

Each source database requires a separate Debezium connector instance. A single Kafka Connect cluster can run multiple Debezium connectors simultaneously.

PostgreSQL Connector Configuration

PostgreSQL requires logical decoding to be enabled (wal_level = logical) before Debezium can connect. Changing wal_level requires a server restart. Configuration steps:

Database configuration (postgresql.conf):

    wal_level = logical
    max_replication_slots = 10   # at least 1 per Debezium connector
    max_wal_senders = 10

Create the publication and, optionally, the replication slot:

    CREATE PUBLICATION debezium_pub FOR ALL TABLES;
    -- Debezium creates the replication slot on first connection; to pre-create it instead:
    SELECT pg_create_logical_replication_slot('debezium_slot', 'pgoutput');

Debezium PostgreSQL connector configuration:

    {
      "name": "postgresql-connector",
      "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "your-db-host",
        "database.port": "5432",
        "database.user": "debezium_user",
        "database.password": "password",
        "database.dbname": "production",
        "plugin.name": "pgoutput",
        "publication.name": "debezium_pub",
        "slot.name": "debezium_slot",
        "table.include.list": "public.orders,public.customers",
        "topic.prefix": "prod"
      }
    }

The database user needs the REPLICATION privilege and SELECT access on the captured tables. In Debezium 2.0 and later, topic.prefix replaces the older database.server.name property, so only topic.prefix is set here.
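One way to submit a configuration like the one above is through the Kafka Connect REST API. The sketch below builds the POST /connectors request with Python's standard library. The worker address http://localhost:8083 and the helper name build_register_request are assumptions for illustration, and the network call itself is left commented out so the example runs without a live cluster.

```python
import json
import urllib.request

# Assumption: a Kafka Connect worker is reachable here (8083 is the default REST port).
CONNECT_URL = "http://localhost:8083"


def build_register_request(connector: dict, connect_url: str = CONNECT_URL) -> urllib.request.Request:
    """Build a POST /connectors request that creates a new connector."""
    body = json.dumps(connector).encode("utf-8")
    return urllib.request.Request(
        url=f"{connect_url}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same shape as the connector configuration shown above: a name plus a config map.
connector = {
    "name": "postgresql-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "your-db-host",
        "database.port": "5432",
        "database.user": "debezium_user",
        "database.password": "password",
        "database.dbname": "production",
        "plugin.name": "pgoutput",
        "publication.name": "debezium_pub",
        "slot.name": "debezium_slot",
        "table.include.list": "public.orders,public.customers",
        "topic.prefix": "prod",
    },
}

request = build_register_request(connector)

# To submit it against a running Kafka Connect cluster:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

In practice the password would come from a secret store or a Kafka Connect config provider rather than being embedded in the request body.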

MySQL Connector Configuration

MySQL requires binary logging with ROW format:

MySQL configuration (my.cnf):

    [mysqld]
    server-id = 1
    log_bin = mysql-bin
    binlog_format = ROW
    binlog_row_image = FULL

MySQL user permissions:

GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'debezium'@'%';
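Before registering the connector, it can be useful to confirm that the running server actually reports the binlog settings listed above. The following is a minimal sketch: it assumes the caller has already fetched the output of SHOW VARIABLES into a dictionary, and the function name find_binlog_problems is hypothetical.

```python
# Values as MySQL reports them in SHOW VARIABLES when binary logging is
# enabled in ROW format with full row images.
REQUIRED_BINLOG_SETTINGS = {
    "log_bin": "ON",
    "binlog_format": "ROW",
    "binlog_row_image": "FULL",
}


def find_binlog_problems(variables: dict) -> list:
    """Return a human-readable problem for each setting that does not match."""
    problems = []
    for name, expected in REQUIRED_BINLOG_SETTINGS.items():
        actual = variables.get(name)
        if actual is None:
            problems.append(f"{name} is not set (expected {expected})")
        elif str(actual).upper() != expected:
            problems.append(f"{name} is {actual} (expected {expected})")
    return problems


# Example input: a server still using STATEMENT-based logging.
reported = {"log_bin": "ON", "binlog_format": "STATEMENT", "binlog_row_image": "FULL"}
for problem in find_binlog_problems(reported):
    print(problem)
```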

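Debezium MySQL connector configuration (the connector name, password, Kafka bootstrap address, and schema history topic name are placeholders; the two schema.history.internal.* properties are required by the MySQL connector in Debezium 2.x):

    {
      "name": "mysql-connector",
      "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "your-mysql-host",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "password",
        "database.server.id": "184054",
        "database.include.list": "production_db",
        "table.include.list": "production_db.orders",
        "topic.prefix": "mysql",
        "schema.history.internal.kafka.bootstrap.servers": "your-kafka-host:9092",
        "schema.history.internal.kafka.topic": "schema-history.mysql"
      }
    }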

Kafka Topic Structure

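Debezium publishes change events to one Kafka topic per captured table. The naming convention is {topic.prefix}.{schema}.{table} for PostgreSQL and {topic.prefix}.{database}.{table} for MySQL. For the PostgreSQL configuration above, changes to the orders table appear in the topic prod.public.orders; for the MySQL configuration, they appear in mysql.production_db.orders.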
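The naming convention can be expressed as a pair of small helpers, which is handy when a consumer needs to route messages by table. This is an illustrative sketch: the function names are hypothetical, and the parser assumes that schema, database, and table names contain no dots.

```python
def topic_for(topic_prefix: str, namespace: str, table: str) -> str:
    """Build the topic name for a captured table.

    `namespace` is the schema name for PostgreSQL or the database name for MySQL.
    """
    return f"{topic_prefix}.{namespace}.{table}"


def split_topic(topic: str) -> tuple:
    """Split a topic name back into (prefix, namespace, table).

    Splitting from the right lets the prefix itself contain dots.
    """
    prefix, namespace, table = topic.rsplit(".", 2)
    return prefix, namespace, table


print(topic_for("prod", "public", "orders"))   # prod.public.orders
print(split_topic("prod.public.orders"))       # ('prod', 'public', 'orders')
```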

Each event in the topic is a JSON (or Avro) message with:

- **before** — row state before the change (null for inserts)

- **after** — row state after the change (null for deletes)

- **op** — operation: "c" (create/insert), "u" (update), "d" (delete), "r" (read, during initial snapshot)

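- **source** — metadata: connector name, database, table, transaction timestamp, and log position (the LSN for PostgreSQL, the binlog file and position for MySQL)

- **ts_ms** — time at which the connector processed the event, in milliseconds since the Unix epoch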
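To make the envelope fields concrete, here is a minimal sketch of a consumer that replays decoded events into an in-memory copy of a table. It assumes events have already been deserialised into dictionaries and that rows are keyed by a primary key column named id, which is a hypothetical choice. It also accepts the wrapped form that the JSON converter produces when schemas are enabled, where the envelope sits under a payload field.

```python
def unwrap(event: dict) -> dict:
    """Return the change envelope, with or without the JSON converter's schema wrapper."""
    if "schema" in event and "payload" in event:
        return event["payload"]
    return event


def apply_change(rows: dict, event: dict, key_field: str = "id") -> None:
    """Apply one change event to `rows`, a dict keyed by primary key value."""
    envelope = unwrap(event)
    op = envelope["op"]
    if op in ("c", "r", "u"):
        # Inserts, snapshot reads, and updates all carry the new row in `after`.
        row = envelope["after"]
        rows[row[key_field]] = row
    elif op == "d":
        # Deletes carry the old row (at least its key columns) in `before`.
        row = envelope["before"]
        rows.pop(row[key_field], None)
    # Any other operation code is skipped in this sketch.


orders = {}
apply_change(orders, {"op": "c", "before": None, "after": {"id": 1, "status": "new"}})
apply_change(orders, {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}})
print(orders)  # {1: {'id': 1, 'status': 'paid'}}
```

A production consumer would also need to handle the null-valued tombstone records that Debezium emits after deletes by default, which this sketch leaves out.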

Schema Registry Integration

For production deployments, use Avro serialisation with a Confluent Schema Registry (or compatible alternative). Avro provides:

- **Schema enforcement** — only events conforming to the registered schema are accepted

- **Schema evolution** — backward and forward compatible schema changes (column additions) are handled automatically

- **Compact encoding** — Avro is significantly more compact than JSON, reducing Kafka storage and network overhead

Configure Debezium to use Avro serialisation by setting the key and value converters to the Avro converter and providing the schema registry URL.
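As a sketch of what that looks like, the helper below adds the converter settings to an existing connector config map. It assumes Confluent's Avro converter (io.confluent.connect.avro.AvroConverter) is installed on the Kafka Connect plugin path; the helper name with_avro_converters and the registry URL are placeholders.

```python
AVRO_CONVERTER = "io.confluent.connect.avro.AvroConverter"


def with_avro_converters(config: dict, schema_registry_url: str) -> dict:
    """Return a copy of a connector config with Avro key and value converters set."""
    return {
        **config,
        "key.converter": AVRO_CONVERTER,
        "key.converter.schema.registry.url": schema_registry_url,
        "value.converter": AVRO_CONVERTER,
        "value.converter.schema.registry.url": schema_registry_url,
    }


base_config = {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "topic.prefix": "prod",
}
# Assumption: the schema registry is reachable at this address.
avro_config = with_avro_converters(base_config, "http://schema-registry:8081")
for key in sorted(avro_config):
    print(key, "=", avro_config[key])
```

Setting the converters in the connector config overrides the worker-level defaults for that connector only, so other connectors on the same cluster can keep using JSON.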

Handling Schema Changes

When a source table schema changes:

**Column additions** — Debezium detects the column addition from the WAL record and includes the new column in subsequent events. The Avro schema is updated in the Schema Registry. Downstream consumers using schema evolution-compatible configurations handle the change automatically.

**Column removals** — Debezium handles column removal by publishing events without the removed column. Downstream consumers must tolerate missing fields.

**Table additions** — new tables matching the table.include.list pattern are captured automatically. Tables not matching are ignored.

**Table structure changes** — for MySQL, DDL changes (ALTER TABLE) are captured in a separate schema history topic. Debezium replays this history on restart to reconstruct the schema state at any log position.
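On the consumer side, tolerating added and removed columns usually comes down to reading only the fields the consumer knows about and supplying defaults for anything missing. The following is a minimal sketch of that idea; the column names and defaults in EXPECTED_COLUMNS are hypothetical.

```python
# Hypothetical columns this consumer cares about, with the default to use
# when a column is absent from an event (for example after a column removal).
EXPECTED_COLUMNS = {"id": None, "status": "unknown", "total": 0}


def project_row(row: dict, expected: dict = EXPECTED_COLUMNS) -> dict:
    """Keep only expected columns, filling in defaults for missing ones.

    Columns added upstream are ignored until the consumer opts in to them.
    """
    return {name: row.get(name, default) for name, default in expected.items()}


# A newly added column (coupon_code) is ignored; a removed one (total) is defaulted.
print(project_row({"id": 7, "status": "paid", "coupon_code": "SPRING"}))
# {'id': 7, 'status': 'paid', 'total': 0}
```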

Operational Monitoring

Key Debezium metrics exposed via JMX:

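**MilliSecondsBehindSource** — the number of milliseconds between the last change event's timestamp in the source database and the connector processing it. Persistently high or growing values indicate the connector is not keeping up with source write volume. The value includes any clock difference between the database host and the connector host.

**TotalNumberOfEventsSeen** — the total number of events the connector has seen since it last started or its metrics were reset. Track the rate of change for throughput monitoring.

**MilliSecondsSinceLastEvent** — the number of milliseconds since the connector read and processed its most recent event. Alert if this grows well beyond the expected gap between writes on the source.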

**Kafka Connect task status** — the Kafka Connect REST API exposes connector and task status. Automated monitoring should alert on task failures and restart failed tasks.
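The task-status check can be sketched as follows. The example works on the JSON body returned by GET /connectors/{name}/status and derives the restart endpoints for failed tasks. The function names are hypothetical, and the HTTP calls are left to the caller so the logic can be exercised without a running cluster.

```python
from urllib.parse import quote


def failed_task_ids(status: dict) -> list:
    """Return the ids of tasks whose state is FAILED in a connector status response."""
    return [task["id"] for task in status.get("tasks", []) if task.get("state") == "FAILED"]


def restart_paths(connector_name: str, status: dict) -> list:
    """Return the REST paths to POST to in order to restart each failed task."""
    name = quote(connector_name, safe="")
    return [f"/connectors/{name}/tasks/{task_id}/restart" for task_id in failed_task_ids(status)]


# Example status body in the shape returned by the Kafka Connect REST API.
sample_status = {
    "name": "postgresql-connector",
    "connector": {"state": "RUNNING", "worker_id": "10.0.0.5:8083"},
    "tasks": [
        {"id": 0, "state": "FAILED", "worker_id": "10.0.0.5:8083", "trace": "..."},
    ],
}

print(restart_paths("postgresql-connector", sample_status))
# ['/connectors/postgresql-connector/tasks/0/restart']
```

An automated restart loop built on this should cap its retries and raise an alert when a task keeps failing, since repeated failures usually point to a problem a restart will not fix.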

Debezium Cloud Alternatives

For teams that want CDC without managing Debezium infrastructure:

**Fivetran** — Fivetran's database connectors support CDC mode for PostgreSQL, MySQL, and SQL Server. Managed CDC without infrastructure overhead.

**AWS DMS (Database Migration Service)** — supports ongoing replication mode (CDC) from multiple source databases to AWS targets.

**Confluent Cloud CDC connectors** — managed Debezium connectors hosted by Confluent, with full Confluent Platform integration.

Self-hosted Debezium makes sense when: Kafka is already in the infrastructure, cost at volume favours self-hosting, or specific connector configurations are required that managed services don't support.

Our data architecture practice designs and implements CDC pipelines including Debezium deployments for enterprise data teams — contact us to discuss database change capture architecture.
