
What Is Apache Kafka? Event Streaming Explained

James Okafor
Lead Data Engineer
February 23, 2028 · 11 min read

Apache Kafka is a distributed event streaming platform — a high-throughput, fault-tolerant, durable log for publishing and subscribing to streams of events in real time. This guide explains how Kafka works, the core concepts (topics, partitions, producers, consumers, consumer groups), why it has become the backbone of real-time data architectures, and the trade-offs organisations face when adopting it.

What Apache Kafka Is

Apache Kafka is an open-source distributed event streaming platform: a high-throughput, fault-tolerant, persistent log for publishing, subscribing to, and processing streams of events in real time. Originally developed at LinkedIn to handle internal activity stream data at scale and open-sourced in 2011, Kafka has become the backbone of real-time data architectures across industries.

The core concept: events (messages) are written to Kafka topics by producers and read from those topics by consumers. Kafka retains events for a configurable period, so consumers can read from any point in the retained history, not just the current "live" stream.

Core Concepts

**Event**: The fundamental unit in Kafka. An event has an optional key, a value (the payload), a timestamp, and optional headers. An event is immutable once written. Examples: a user clicked a button (event_type: page_click, user_id: 12345, page: /checkout), an order was placed (order_id: 9876, amount: 149.99, customer_id: 4567).
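As a minimal sketch of that shape, the order example above could be modelled like this. The `Event` class and its field names are illustrative assumptions, not a Kafka client API; real clients expose their own record types.

```python
import time
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class Event:
    """Toy model of a Kafka record (hypothetical names, for illustration only)."""
    key: Optional[str]                  # drives partition choice; may be None
    value: dict                         # the payload
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))
    headers: Tuple[Tuple[str, bytes], ...] = ()  # optional (name, value) pairs


# The "order was placed" example, keyed by customer so that one
# customer's events stay in order relative to each other.
order_placed = Event(
    key="4567",
    value={"order_id": 9876, "amount": 149.99, "customer_id": 4567},
)
```

Note that `frozen=True` only blocks attribute reassignment; the payload dict itself is still mutable in this sketch.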

**Topic**: A named, durable log of events. Topics are the channels to which producers write and from which consumers read. Ordering is guaranteed within each partition of a topic, not across the topic as a whole. A common convention is one topic per event type or per data domain, for example orders, page_events, user_signups.

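**Partition**: Topics are divided into partitions: ordered, immutable sequences of events. Each partition has a leader replica on one Kafka broker, with follower copies on other brokers. Partitions enable horizontal parallelism: a topic with 8 partitions can be consumed by up to 8 consumer instances in the same consumer group in parallel. Events with a key are assigned to a partition by hashing the key (same key → same partition, preserving per-key order as long as the partition count does not change). Events without a key are spread across partitions (round-robin in older clients, sticky batching in the Java client since Kafka 2.4).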
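The key-to-partition mapping can be sketched in a few lines. This is an illustration only: `choose_partition` is a hypothetical helper, and `zlib.crc32` stands in for the murmur2 hash that Kafka's Java client applies to the key bytes.

```python
import zlib
from typing import Optional


def choose_partition(key: Optional[bytes], num_partitions: int, fallback: int = 0) -> int:
    """Map an event key to a partition number.

    crc32 is a stand-in hash chosen for this sketch; it is NOT what Kafka uses.
    Keyless events go to a caller-supplied fallback partition.
    """
    if key is None:
        return fallback % num_partitions
    return zlib.crc32(key) % num_partitions


# The same key always maps to the same partition for a fixed partition count.
first = choose_partition(b"user-12345", 8)
second = choose_partition(b"user-12345", 8)
```

Because the result depends on `num_partitions`, adding partitions to a topic later can move a key to a different partition, which is why per-key ordering only holds while the partition count is unchanged.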

**Producer**: A client application that writes events to a Kafka topic. Producers specify which topic to write to and optionally specify a partition key.

**Consumer**: A client application that reads events from a Kafka topic. Consumers track their position (offset) within each partition independently, enabling replay from any point.

**Consumer group**: A named group of consumer instances that collectively read a topic. Kafka assigns each partition to one consumer in the group at a time — distributing load across the group. Multiple consumer groups can independently read the same topic, each maintaining its own offset. This fan-out architecture is a key Kafka advantage: one order event can be consumed simultaneously by an order processing service, a shipping notification service, and an analytics pipeline.
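A simplified picture of how partitions might be shared out within one group is shown below. `assign_partitions` is a hypothetical round-robin helper written for this article; Kafka's real assignors (range, round-robin, sticky, cooperative) are more involved and run as part of group coordination.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Give each partition to exactly one consumer in the group, round-robin."""
    members = sorted(consumers)
    assignment: Dict[str, List[int]] = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        assignment[members[index % len(members)]].append(partition)
    return assignment


group = assign_partitions(list(range(8)), ["c1", "c2", "c3"])
# c1 -> [0, 3, 6], c2 -> [1, 4, 7], c3 -> [2, 5]
```

With more consumers than partitions, the extra consumers in this sketch receive an empty list, which mirrors the fact that a partition is never split between two members of the same group.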

**Broker**: A Kafka server that stores partitions and handles producer/consumer requests. A Kafka cluster has multiple brokers for replication and fault tolerance.

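**Replication**: Each partition is replicated across multiple brokers (the replication factor, typically 3). If the broker hosting a partition's leader fails, an in-sync replica on another broker becomes the new leader. Kafka is designed to survive individual broker failures; avoiding data loss also depends on configuration, such as producers using acks=all together with a suitable min.insync.replicas setting.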
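The failover idea can be sketched as a small function. This is a deliberately simplified model under the assumption that the new leader is the first assigned replica that is still alive and in sync; `elect_leader` and the broker IDs are hypothetical, and the real election is handled by the cluster controller.

```python
from typing import List, Optional, Set


def elect_leader(replicas: List[int], in_sync: Set[int], failed: Set[int]) -> Optional[int]:
    """Pick the first assigned replica that is both in sync and still alive."""
    for broker_id in replicas:
        if broker_id in in_sync and broker_id not in failed:
            return broker_id
    return None  # no eligible replica: the partition is unavailable


replicas = [1, 2, 3]   # replication factor 3
in_sync = {1, 2, 3}    # all replicas are caught up
leader_before = elect_leader(replicas, in_sync, failed=set())  # broker 1
leader_after = elect_leader(replicas, in_sync, failed={1})     # broker 2 takes over
```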

**Offset**: A sequential integer identifying each event's position within a partition. Consumers typically commit their offset after processing, so the committed offset records how far the group has read. If a consumer crashes and restarts, it resumes from the last committed offset rather than reprocessing all events from the beginning. Events processed after the last commit may be delivered again, so processing should tolerate duplicates.
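The commit-and-resume behaviour can be illustrated with an in-memory stand-in for one partition. `PartitionLog` and `consume` are hypothetical names for this sketch and do not correspond to any Kafka client API; the committed value is modelled as the next offset to read.

```python
from typing import Callable, Dict, List, Optional


class PartitionLog:
    """In-memory stand-in for one partition: list index == offset."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def append(self, event: str) -> int:
        self.events.append(event)
        return len(self.events) - 1  # offset assigned to the new event


def consume(log: PartitionLog, committed: Dict[str, int], group: str,
            handler: Callable[[str], None], max_events: Optional[int] = None) -> int:
    """Process events from the group's committed offset, committing after each one."""
    position = committed.get(group, 0)
    processed = 0
    while position < len(log.events):
        if max_events is not None and processed >= max_events:
            break
        handler(log.events[position])
        position += 1
        committed[group] = position  # commit only after the handler succeeded
        processed += 1
    return processed


log = PartitionLog()
for name in ["a", "b", "c", "d"]:
    log.append(name)

committed: Dict[str, int] = {}
seen: List[str] = []
consume(log, committed, "analytics", seen.append, max_events=2)  # then "crash"
consume(log, committed, "analytics", seen.append)                # restart resumes at offset 2
```

Committing after the handler runs gives at-least-once behaviour: if the process died between handling and committing, the same event would be handled again on restart.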

**Retention**: Kafka retains events for a configurable time (7 days by default) or up to a configurable size. Unlike a message queue that deletes messages after consumption, Kafka retains events regardless of whether consumers have read them, so multiple consumer groups can read the same events at their own pace.
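A toy version of time-based retention makes the point that age, not consumption, decides what is kept. `apply_time_retention` is a hypothetical helper; real brokers delete whole log segments rather than individual events, so treat this as a conceptual sketch only.

```python
from typing import List, Tuple

DAY_MS = 24 * 60 * 60 * 1000
SEVEN_DAYS_MS = 7 * DAY_MS  # 604,800,000 ms, matching the 7-day default


def apply_time_retention(events: List[Tuple[int, str]], now_ms: int,
                         retention_ms: int = SEVEN_DAYS_MS) -> List[Tuple[int, str]]:
    """Keep (timestamp_ms, payload) pairs that are no older than retention_ms.

    Consumer offsets are not consulted: read and unread events are treated alike.
    """
    return [(ts, payload) for ts, payload in events if now_ms - ts <= retention_ms]


events = [(1 * DAY_MS, "old"), (5 * DAY_MS, "recent"), (9 * DAY_MS, "new")]
kept = apply_time_retention(events, now_ms=10 * DAY_MS)
# "old" is 9 days old and is dropped; the other two are kept.
```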

Why Kafka Is Different from a Message Queue

Traditional message queues (RabbitMQ, Amazon SQS) delete messages after a consumer acknowledges them. Kafka retains messages for the configured retention period regardless of consumption. This fundamental difference enables:

**Replay**: A new consumer group can read from the beginning of a topic and process historical events — catching up on hours or days of missed events, or processing historical data for a new analytics use case.

**Fan-out**: Multiple independent consumer groups can read the same topic simultaneously. An order event can be consumed by multiple services without each service needing a separate copy of the event.

**Decoupled production and consumption**: Producers write events without knowing which consumers will read them. Consumers can fall behind and catch up without affecting producers. The system tolerates consumer downtime without losing events.
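These three properties can be seen together in a small in-memory model. `ToyTopic` is a hypothetical single-partition stand-in written for illustration, not a Kafka client: the producer never references a consumer, and each group keeps its own position.

```python
from collections import defaultdict
from typing import DefaultDict, List


class ToyTopic:
    """Single-partition stand-in for a topic with per-group offsets."""

    def __init__(self) -> None:
        self._events: List[dict] = []
        self._next_offset: DefaultDict[str, int] = defaultdict(int)  # group -> next offset

    def produce(self, event: dict) -> None:
        self._events.append(event)  # producer knows nothing about consumers

    def poll(self, group: str) -> List[dict]:
        """Return this group's unread events and advance only its own offset."""
        start = self._next_offset[group]
        batch = self._events[start:]
        self._next_offset[group] = len(self._events)
        return batch


orders = ToyTopic()
orders.produce({"order_id": 1})
orders.produce({"order_id": 2})

shipping_first = orders.poll("shipping")    # fan-out: shipping reads both events
orders.produce({"order_id": 3})
shipping_second = orders.poll("shipping")   # only the new event
analytics_all = orders.poll("analytics")    # replay: a new group starts from offset 0
```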

Kafka in Analytics Architectures

**Event streaming to data warehouse**: Kafka Connect with a BigQuery, Snowflake, or Redshift sink connector writes events from Kafka topics to analytical tables. Depending on the connector and its batching settings, new events can appear in the warehouse within seconds to minutes of being produced. This can replace batch extraction pipelines for event data.

**CDC (Change Data Capture)**: Debezium, a set of source connectors that run on Kafka Connect, reads database transaction logs (PostgreSQL WAL, MySQL binlog) and produces change events to Kafka topics. Downstream consumers receive every insert, update, and delete as a Kafka event, enabling real-time data replication without polling.
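A downstream consumer that keeps a replica in sync might look roughly like this. The sketch assumes change events already unwrapped to Debezium's `op`, `before`, and `after` fields (`c` create, `u` update, `d` delete, `r` snapshot read) and a primary-key column named `id`; the `apply_change` helper and the sample rows are hypothetical.

```python
from typing import Any, Dict

Row = Dict[str, Any]


def apply_change(table: Dict[Any, Row], change: Dict[str, Any], pk: str = "id") -> None:
    """Apply one change event to an in-memory replica keyed by primary key."""
    op = change["op"]
    if op in ("c", "r", "u"):       # create, snapshot read, update: upsert the new row
        row = change["after"]
        table[row[pk]] = row
    elif op == "d":                 # delete: the old row image carries the key
        table.pop(change["before"][pk], None)
    else:
        raise ValueError(f"unhandled op: {op!r}")


replica: Dict[Any, Row] = {}
changes = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]
for change in changes:
    apply_change(replica, change)
# replica now holds only row 1, with status "paid"
```

Depending on converter settings, real events may nest these fields under a `payload` key and carry additional metadata, so a production consumer would need to unwrap them first.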

**Real-time feature pipelines**: ML feature computation engines (Flink, Spark Structured Streaming) consume Kafka events and compute aggregated features (user activity in last 5 minutes, rolling 24-hour order count) that are written to a feature store for model serving.
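The "activity in the last 5 minutes" style of feature can be sketched as a rolling count over event timestamps. `RollingCount` is a hypothetical class for illustration and assumes events arrive in timestamp order; engines such as Flink add watermarks, state backends, and fault tolerance on top of this basic idea.

```python
from collections import deque
from typing import Deque


class RollingCount:
    """Count events whose timestamp falls within the last window_seconds."""

    def __init__(self, window_seconds: int) -> None:
        self.window = window_seconds
        self._times: Deque[int] = deque()

    def add(self, ts: int) -> None:
        self._times.append(ts)  # assumes non-decreasing timestamps
        self._evict(ts)

    def count(self, now: int) -> int:
        self._evict(now)
        return len(self._times)

    def _evict(self, now: int) -> None:
        # Drop events that have aged out of the window (now - window, now].
        while self._times and self._times[0] <= now - self.window:
            self._times.popleft()


clicks = RollingCount(window_seconds=300)  # 5-minute window
for ts in (0, 100, 290):
    clicks.add(ts)

count_at_299 = clicks.count(299)  # 3: all events are inside the window
count_at_301 = clicks.count(301)  # 2: the event at t=0 has aged out
count_at_400 = clicks.count(400)  # 1: only the event at t=290 remains
```

In a real pipeline one such counter would typically be kept per key (for example per user_id), and the resulting value written to the feature store.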

**Event-driven microservices**: Services communicate by publishing and subscribing to Kafka topics rather than calling APIs directly. An order placed event triggers shipping, notification, inventory, and analytics consumers independently.

Managed Kafka Options

**Confluent Cloud**: The commercial managed Kafka platform from Confluent, the company founded by Kafka's original creators. Adds Schema Registry, ksqlDB (streaming SQL over Kafka), fully managed connectors, and enterprise support. Typically costs more than self-managed Kafka but eliminates cluster operations.

**Amazon MSK (Managed Streaming for Apache Kafka)**: AWS-managed Kafka. Handles cluster provisioning, patching, and replication. Integrates with IAM, VPC, CloudWatch. Less feature-rich than Confluent but tightly integrated with AWS ecosystem.

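**Azure Event Hubs**: Azure's event streaming service, which exposes a Kafka-compatible endpoint. Existing Kafka producers and consumers can usually connect with configuration changes only, although not every Kafka feature is supported. Tightly integrated with Azure services.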

When Kafka Is and Is Not the Right Choice

**Kafka is appropriate when**:

- Event throughput is high (millions of events per minute) and batch latency is unacceptable

- Multiple independent consumers need to read the same event stream

- Event replay is needed for new consumers or recovery

- Decoupling producers from consumers is architecturally important

**Kafka may be overkill when**:

- Data volumes are moderate and batch pipelines (hourly Fivetran syncs, daily dbt runs) meet freshness requirements

- Only one consumer reads each event stream — a simpler queue suffices

- The team lacks Kafka operational experience: cluster operations, replication tuning, consumer group management, and Schema Registry are all non-trivial

For many analytics teams, the right answer is managed ingestion (Fivetran, Airbyte) for batch pipelines plus Kinesis Firehose or Pub/Sub for streaming — reserving full Kafka infrastructure for use cases that genuinely require its capabilities.

Our data architecture practice designs real-time and streaming data architectures using Kafka and managed alternatives — contact us to discuss your streaming data requirements.
