Data replication is the process of copying data from one system to another and keeping those copies synchronized. This guide explains the key replication patterns, technologies, and trade-offs — including logical vs physical replication, CDC-based sync, and when replication is preferable to ETL.
Data replication is the process of copying data from one system to another and keeping those copies synchronized as the source changes. It is a foundational concept in data infrastructure — used to move operational data to analytics environments, synchronize data across geographically distributed systems, maintain disaster recovery copies, and enable read scaling by distributing query load across replicas.
Why Replication Matters
The simplest reason to replicate data: you need it in two places at once. An operational database serves application transactions; an analytics environment serves reporting and analysis. These two workloads have conflicting requirements — one needs low-latency writes, the other needs large-scale reads. Replication moves data from the source to the target without disrupting the source's operational function.
Less obvious reasons: disaster recovery (a replica in a second region restores service if the primary fails), read scaling (multiple read replicas distribute SELECT query load across nodes), and geographic distribution (putting data closer to users who read it).
Physical vs Logical Replication
**Physical replication** copies the binary representation of database files — the exact bytes on disk. The replica is a byte-for-byte copy of the primary. Physical replication is fast and low-overhead because it operates at the storage level without parsing SQL. The limitation: replicas must be the same database version and typically the same hardware architecture as the primary. Physical replicas are read-only and cannot serve as independent databases.
PostgreSQL streaming replication is physical: the primary writes WAL (Write-Ahead Log) records, replicas apply them continuously, and the replica stays within milliseconds of the primary. Standard high-availability PostgreSQL setups use physical streaming replication.
**Logical replication** decodes the WAL records into logical change events (INSERT, UPDATE, DELETE operations with column values) and applies them to a target database. The target can be a different database version, a different database engine entirely, or a system with a different schema. Logical replication is more flexible than physical but has higher overhead and more edge cases.
PostgreSQL logical replication uses publication/subscription: the primary publishes changes for specified tables, and subscribers receive those changes. This enables selective replication of specific tables, replication between different PostgreSQL versions, and replication to non-PostgreSQL targets via tools like Debezium.
Change Data Capture (CDC)
CDC is the practice of capturing row-level changes (inserts, updates, deletes) as they occur in a database and propagating them to downstream systems. CDC-based replication is how most modern data integration tools — Fivetran, Airbyte, Debezium — achieve low-latency, low-overhead incremental data movement.
**Log-based CDC** reads the database's transaction log:
- PostgreSQL: WAL (Write-Ahead Log)
- MySQL: binlog (binary log)
- SQL Server: CDC tables or transaction log
- Oracle: redo logs (requires LogMiner or GoldenGate)
Log-based CDC captures every change as it occurs without adding load to the source database through query-based extraction. It detects hard deletes (which timestamp-based incremental polling cannot detect). And it can operate with near-zero latency — changes propagate within seconds of being committed.
**Query-based CDC** polls source tables using a "high watermark" — a timestamp or sequence ID column. On each poll, it selects rows where the updated timestamp is greater than the last successful run. Simpler to implement than log-based CDC, but: requires a reliable updated_at column, cannot detect hard deletes, adds query load to the source, and operates at batch intervals rather than continuously.
**Trigger-based CDC** uses database triggers to capture changes into a side table that the replication tool reads. Higher overhead than log-based CDC; not recommended for high-throughput sources.
Replication Tools
**Debezium** — the dominant open-source CDC platform. Connectors for PostgreSQL, MySQL, SQL Server, Oracle, MongoDB, and others. Publishes change events to Kafka topics. Used as the CDC layer in many streaming architectures. Operates as Kafka Connect connectors.
**Fivetran** — managed data integration platform. Most connectors use some form of CDC or incremental sync. For PostgreSQL and MySQL, Fivetran can use log-based CDC (requiring WAL or binlog access). Fully managed: no infrastructure to operate, automatic schema change detection.
**Airbyte** — open-source alternative to Fivetran. Supports CDC for major databases. Self-hosted or cloud (Airbyte Cloud). More flexibility and customization; more operational overhead than Fivetran.
**AWS Database Migration Service (DMS)** — managed migration and ongoing replication service for databases. Supports one-time migration and continuous CDC replication to RDS, Redshift, S3, and other targets. Commonly used for migrating on-premise databases to AWS.
**Google Datastream** — fully managed CDC service on GCP. Sources: Oracle, MySQL, PostgreSQL, SQL Server. Destinations: BigQuery, GCS, Spanner. Designed for real-time analytics on operational data with minimal source impact.
**Striim** — enterprise streaming integration platform with CDC sources and warehouse/streaming sinks. Common in large enterprise environments with Oracle, SAP, and mainframe sources.
Replication Lag and Consistency
Replication is never perfectly synchronous (except in some synchronous replication configurations that accept write latency). There is always some lag between when a change is committed on the primary and when it is visible on the replica.
For most analytics use cases, a few seconds to a few minutes of lag is acceptable. For operational use cases — like a read replica serving application queries that need to see the latest writes — replication lag must be monitored and kept within bounds. PostgreSQL exposes replication lag metrics via pg_stat_replication; monitoring systems should alert when lag exceeds thresholds.
**Read-your-writes consistency** is a specific consistency requirement: if a user writes a record, their subsequent read should see that write. With a replication lag, a user who writes to the primary and immediately reads from a replica may not see their write. Applications that use replicas for reads must either route reads that require immediate consistency to the primary, or accept the read-your-writes inconsistency.
When Replication vs ETL
Replication is appropriate when: you need a near-real-time copy of source data for analytics, you need a read replica for operational query scaling, you are migrating a database with minimal downtime requirements, or you need a disaster recovery copy.
ETL/ELT is appropriate when: data needs substantial transformation before it is analytically useful, multiple source systems need to be integrated and reconciled, historical snapshots are needed (replication maintains current state, not history), or business logic needs to be applied before data reaches the analytics layer.
In practice, CDC-based replication (via Fivetran or Airbyte) and ELT (via dbt) are complementary: the replication layer moves raw data to the warehouse with minimal latency, and the ELT layer transforms it into analytical models. The two are designed to work together.
Our data architecture practice designs data replication and integration architectures for operational and analytical systems — contact us to discuss your data movement and synchronization requirements.
A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.
Book a Call →