BlogData Engineering

Apache Kafka for Data Engineers: Core Concepts and Real-World Patterns

James Okafor
James Okafor
Data & Cloud Engineer
·June 23, 202612 min read

Apache Kafka is the dominant event streaming platform in enterprise data. Here is what data engineers need to know: architecture, partitioning, consumer groups, exactly-once semantics, and integration with the modern data stack.

The quick answer

Apache Kafka is a distributed event streaming platform designed for high-throughput, fault-tolerant, persistent message passing between systems. For data engineers, Kafka serves two primary purposes: as the backbone of real-time data pipelines (moving data from source systems into data warehouses, lakes, or other consumers in near-real-time) and as the event bus for event-driven architectures (microservices publishing and consuming events). Understanding Kafka's architecture — topics, partitions, consumer groups, offsets, and retention — is increasingly required for senior data engineering roles at organisations with real-time data requirements.

Core architecture

**Topics**: the fundamental unit of organisation in Kafka. A topic is a named, ordered, immutable log of messages. Producers write messages to topics; consumers read from them. A topic called user_events might receive every user interaction from a web application; a topic called orders might receive every order placed in a commerce system.

**Partitions**: topics are divided into partitions — each partition is an ordered, immutable sequence of messages stored on disk. Partitions enable parallelism: a topic with 12 partitions can be consumed by up to 12 consumers simultaneously (within the same consumer group). Partitions are distributed across Kafka brokers in the cluster. The partition count is set at topic creation and affects both throughput and consumer parallelism.

**Partition keys**: when a producer sends a message, it can specify a partition key. Kafka hashes the key and routes the message to a consistent partition. This guarantees ordering for all messages with the same key — if every order for customer_id 12345 goes to the same partition, those orders are always processed in sequence. Messages without a key are distributed round-robin across partitions.

**Brokers**: Kafka runs as a cluster of broker nodes. Each broker stores a subset of partitions. Kafka uses replication — each partition has a leader (one broker handles all reads and writes) and a configurable number of followers (replicas on other brokers). If the leader fails, a follower is elected leader. Replication factor of 3 means the cluster can tolerate 2 broker failures without data loss.

**Offsets**: every message in a partition has an offset — a sequential integer that identifies its position. Consumer groups track their current offset per partition, enabling them to resume from where they left off after failure or restart. This is how Kafka achieves at-least-once and exactly-once delivery semantics.

**Retention**: Kafka retains messages for a configurable duration (time-based) or size (size-based). The default is 7 days. Unlike traditional message queues, messages are not deleted after consumption — they persist until the retention policy expires. This means multiple consumer groups can read the same messages independently, and you can replay messages from the beginning of the retention window.

Producers and consumers

**Producer**: any application that writes messages to a Kafka topic. Producers are responsible for partitioning (choosing which partition a message goes to), batching (buffering messages and sending in batches for throughput), and acknowledging delivery. Producer acknowledgement modes: acks=0 (fire-and-forget, fastest, some data loss risk), acks=1 (leader acknowledged, no follower sync), acks=all (all in-sync replicas acknowledged, strongest durability guarantee).

**Consumer**: any application that reads messages from Kafka topics. Consumers subscribe to one or more topics and process messages. Each consumer tracks its offset — where it last read — and commits the offset after processing. If a consumer fails and restarts, it resumes from the last committed offset.

**Consumer groups**: consumers are organised into consumer groups identified by a group ID. Kafka distributes topic partitions among consumers in the same group — each partition is consumed by exactly one consumer in the group at a time. This is how Kafka provides parallel consumption: a topic with 12 partitions consumed by a group of 4 consumers gives each consumer 3 partitions. If a consumer fails, its partitions are rebalanced to other consumers in the group.

**Lag**: consumer lag is the difference between the latest offset (newest message) in a partition and the consumer group's current offset. Lag measures how far behind a consumer is. High lag means the consumer is not keeping up with production rate — a common alerting metric.

Delivery semantics

**At-least-once** (default): messages are delivered to consumers at least once. If a consumer processes a message but fails before committing the offset, it will re-process the message after restart. The consumer application must be idempotent — processing the same message twice must produce the same result.

**At-most-once**: messages are consumed at most once. The offset is committed before processing, so if the consumer fails during processing, the message is skipped. Suitable only when some data loss is acceptable.

**Exactly-once**: messages are processed exactly once end-to-end. Achieved using Kafka transactions (producer transactions + transactional consumer reads) and idempotent producers. Exactly-once is harder to implement and has throughput implications, but is required for financial transactions and other use cases where duplicates are unacceptable. Kafka Streams and ksqlDB provide built-in exactly-once semantics.

Kafka Connect

Kafka Connect is a framework for streaming data between Kafka and external systems — databases, cloud storage, data warehouses, APIs — without writing custom producer/consumer code. A connector handles the integration logic; you configure it via JSON.

**Source connectors**: move data from external systems into Kafka. Debezium (CDC from PostgreSQL, MySQL, MongoDB, Oracle), JDBC Source (polling relational databases), S3 Source, HTTP Source. These are the connectors that feed data into the Kafka event backbone from source systems.

**Sink connectors**: move data from Kafka into external systems. S3 Sink (for data lake landing), Snowflake Sink, BigQuery Sink, Elasticsearch Sink. These connectors take events from Kafka topics and land them in downstream stores.

**Schema Registry**: when using Kafka Connect at scale, pair it with Confluent Schema Registry (or AWS Glue Schema Registry, or Apicurio). Schema Registry enforces message schemas (Avro, JSON Schema, Protobuf) and prevents schema changes that would break consumers. Schema evolution policies (backward, forward, full compatibility) allow controlled schema changes without breaking existing consumers.

Kafka Streams and ksqlDB

**Kafka Streams**: a Java library for building stateful stream processing applications that consume from Kafka topics, perform transformations (filter, map, join, aggregate), and produce results back to Kafka topics. No separate cluster required — the processing runs inside your application. Suitable for custom stream processing logic in Java/Kotlin.

**ksqlDB**: a SQL engine for stream processing on Kafka. You write SQL queries against Kafka topics — SELECT, JOIN, GROUP BY — and ksqlDB continuously processes the stream, materialising results into new Kafka topics or queryable tables. Lower engineering overhead than Kafka Streams for teams comfortable with SQL.

Kafka in the modern data stack

**CDC (Change Data Capture)**: the most common data engineering use case. Debezium reads the database transaction log (WAL in PostgreSQL, binlog in MySQL), converts row-level changes (INSERT, UPDATE, DELETE) into Kafka events, and publishes them to Kafka topics. Downstream consumers can build materialised views of source tables, replicate data to a warehouse in near-real-time, or trigger workflows on data changes.

**Event ingestion to data warehouse**: Kafka Sink connectors (Snowflake Connector for Kafka, BigQuery Connector) continuously consume from Kafka topics and write to warehouse tables. This enables near-real-time data availability in the warehouse without batch pipeline latency.

**Decoupling producers and consumers**: Kafka decouples source systems from analytics consumers. A source system publishes events to Kafka once; multiple downstream consumers (a data warehouse loader, a real-time dashboard, an ML feature store, an alert system) all read independently from the same topic without coupling to the source.

Managed Kafka options

Self-hosting Kafka requires managing ZooKeeper (or KRaft in Kafka 3.x), broker infrastructure, replication, and upgrades. For most data engineering teams, managed Kafka removes this overhead:

**Confluent Cloud**: the fully managed Kafka service from Confluent (the company founded by Kafka's original authors). Includes Schema Registry, Kafka Connect managed connectors, ksqlDB, and enterprise governance. Most feature-complete but most expensive.

**AWS MSK (Managed Streaming for Apache Kafka)**: AWS-managed Kafka with native AWS integration (IAM auth, VPC, CloudWatch). Serverless tier available. Good choice for AWS-native architectures.

**Redpanda**: a Kafka-compatible event streaming platform that replaces Kafka with a C++ implementation, eliminating ZooKeeper and claiming 10x lower latency. Drop-in compatible with Kafka clients and Kafka Connect. Growing adoption in latency-sensitive use cases.

Monitoring

Key metrics to monitor: consumer group lag (per partition and topic), broker disk utilisation, network throughput, under-replicated partitions (indicates replication health), request latency. Kafka exposes metrics via JMX; ship them to Prometheus + Grafana or Datadog for dashboards and alerting.

For the orchestration layer that coordinates Kafka consumers with batch pipelines, see apache airflow guide. For the real-time architecture context, see real-time data architecture. For the data ingestion tools that complement Kafka for batch sources, see fivetran vs airbyte.

Our data architecture consulting practice designs and implements event-driven data platforms — from Kafka cluster design and CDC implementation through real-time warehouse loading. If you are building or scaling a streaming data architecture, book a free 30-minute audit.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →