Apache Kafka has become the default infrastructure for event streaming in data-intensive organisations. Its use as a data pipeline backbone — connecting source systems, streaming processors, and downstream consumers — requires architectural decisions that determine whether the system delivers its intended reliability and scalability.
Apache Kafka has become the default infrastructure for event streaming in data-intensive organisations. As a data pipeline backbone — connecting source systems, streaming processors, and downstream consumers — Kafka handles use cases from microservice communication to data lake ingestion to real-time analytics. Its architecture decisions: topic design, partition configuration, replication, consumer group design, and retention policy, determine whether the system delivers its intended reliability and scalability in production.
Kafka as Data Pipeline Infrastructure
The original use case for Kafka was inter-service communication at LinkedIn. Its adoption as data pipeline infrastructure came from a recognition that its properties — durable, ordered, high-throughput, subscriber-based — are valuable beyond service-to-service messaging.
As data pipeline infrastructure, Kafka serves as the backbone connecting:
**Source systems to the data lake or warehouse** — source systems produce events (database change events via CDC, application events from instrumentation) that are published to Kafka topics. A consumer reads from the topic and writes to the data warehouse or data lake, decoupled from the source. The decoupling is the key value: the source system writes once to Kafka; multiple consumers can read independently at their own pace.
**Microservices to analytics** — in microservice architectures where each service produces domain events, Kafka aggregates those events for analytics consumers without requiring each analytics pipeline to integrate directly with each service.
**Real-time processing to downstream systems** — stream processors (Flink, Spark Streaming, Kafka Streams) read from input topics, process events, and write results to output topics or directly to downstream systems. The output topic can then be consumed by multiple downstream systems independently.
Topic and Partition Design
Topics are the logical channels in Kafka. Partitions are the physical units of parallelism within a topic. Partition count determines maximum consumer parallelism: a topic with 12 partitions can be consumed by at most 12 consumers in a consumer group simultaneously.
Topic design decisions:
**Topic granularity** — one topic per entity type (orders, customers, events) or one topic per source system. Finer granularity provides better consumer targeting but increases topic count management overhead. The practical recommendation is one topic per logical event type, where a logical event type corresponds to a stable business domain concept.
**Partition count** — choose partition count based on the target throughput requirement and the desired consumer parallelism. Changing partition count after topic creation is disruptive (it changes the key-to-partition mapping and requires consumer rebalancing). Over-partition rather than under-partition initially if future throughput growth is expected.
**Keyed partitioning** — messages with the same key are always written to the same partition. For events that need to be processed in order per entity (all events for a specific customer ID in order), keyed partitioning ensures ordering. Without key-based partitioning, ordering is only guaranteed within a partition, not across partitions.
**Replication factor** — the number of broker replicas that hold a copy of each partition. A replication factor of 3 is standard for production — tolerates loss of up to 2 of the 3 replicas before data loss. Higher replication factors increase durability at the cost of storage and write latency.
Consumer Group Architecture
A consumer group is a set of consumers that share the work of reading from a topic. Each partition is assigned to exactly one consumer in the group; adding consumers to the group redistributes partitions.
Consumer group design decisions:
**One consumer group per independent downstream use case** — a topic consumed by both a data warehouse loader and a fraud detection system should have two separate consumer groups. Each group maintains its own offset (its position in the topic), so each consumer independently tracks how far it has processed. Sharing a consumer group between independent use cases would cause them to compete for partitions.
**Offset management** — Kafka tracks consumer offsets automatically in an internal topic. The at-least-once delivery guarantee means a consumer may process the same message more than once if it crashes after processing but before committing the offset. Downstream systems that consume from Kafka need to be idempotent — processing the same message twice should produce the same result as processing it once. The warehouse load process, for example, should upsert rather than insert, so a duplicate message does not create a duplicate row.
**Consumer lag monitoring** — consumer lag is the number of messages in the topic that have not yet been consumed by a consumer group. Rising lag indicates that the consumer is not keeping up with producer throughput. Consumer lag is the primary operational metric for Kafka data pipelines; sustained high lag means events are being processed with increasing delay.
Retention and Storage Architecture
Kafka retains events for a configurable period (time-based) or up to a configurable size (size-based). After retention expires, old events are deleted.
Retention decisions:
**Retention period** — the minimum period during which events are available for consumption. Retention should be long enough to allow consumers to recover from outages without losing events. For most data pipeline use cases, 7 days of retention is sufficient; for use cases that require event replay for historical reprocessing, longer retention or external archival to object storage (using Kafka Connect's S3 sink or Confluent Tiered Storage) is required.
**Log compaction** — an alternative to time-based retention for topics that represent the current state of entities. Log compaction retains the most recent event for each key and deletes earlier events. This is appropriate for CDC topics where each event is an update to an entity's state — the most recent event contains the current state, and prior events are superseded.
**Tiered storage** — Confluent Platform and Amazon MSK support tiered storage, which offloads older segments to object storage (S3, GCS) while keeping recent data on local broker disks. Tiered storage enables very long retention periods (months or years) at low cost, making Kafka viable as a long-term event log in addition to a real-time streaming backbone.
Schema Management
Without schema management, producers and consumers need to agree on message format through informal coordination. When a producer changes the message schema, consumers break if they cannot parse the new format. For any Kafka deployment where multiple teams produce and consume from shared topics, a schema registry is essential infrastructure.
The Confluent Schema Registry is the standard solution: producers register message schemas before publishing; consumers validate incoming messages against the registered schema. Schema evolution rules (backward compatibility, forward compatibility, full compatibility) can be enforced at the registry level — preventing producers from publishing schemas that would break existing consumers without coordination.
dbt and warehouse transformation teams integrating with Kafka should treat schema registry as a dependency: the upstream schema is the contract that the transformation logic is written against. Changes to the upstream schema should trigger a review of dependent transformations before the schema change is deployed.
Our data architecture and cloud engineering practice designs Kafka-based data pipeline infrastructure for organisations building event-driven data platforms — contact us to discuss your Kafka architecture.
A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.
Book a Call →