Apache Kafka has become the default event streaming platform for real-time data pipelines. But Kafka is complex, operationally demanding, and often chosen for use cases that do not actually require streaming. This guide covers what Kafka does well, what it does not, and how to design a streaming analytics architecture that serves real-time needs without unnecessary complexity.
Apache Kafka has become ubiquitous in modern data infrastructure. It appears in architecture diagrams for everything from microservice communication to real-time analytics to data lake ingestion. This ubiquity has created a problem: Kafka is chosen for use cases that do not need it, because it is now the obvious answer to "we need real-time data movement," regardless of whether real-time is actually required.
This guide covers what Kafka does well, what it costs, what simpler alternatives handle adequately, and how to design a streaming architecture that is appropriate for the actual requirements.
What Kafka Is
Kafka is a distributed log — a durable, ordered, replayable record of events. Producers write events to Kafka topics. Consumers read from topics, maintaining their own offset (position in the log). Multiple consumers can read the same topic independently, each at their own pace, each maintaining their own offset. Events are retained in Kafka for a configurable period (default 7 days; production configurations often use longer retention for replayability).
This design has several important properties:
**Decoupling**: producers do not know about consumers. They write events to a topic. Any number of consumers can read those events without the producer being aware of or affected by them.
**Durability**: events are persisted to disk and replicated across brokers. A consumer failure does not lose events — the consumer restarts and picks up from its last committed offset.
**Replayability**: because events are retained, a consumer can reprocess historical events. A new consumer can read from the beginning of the topic and catch up to the current position. A buggy consumer can be fixed and the topic replayed from any point.
**High throughput**: Kafka is designed for high write throughput. Millions of events per second across a cluster is achievable. This is why it became the standard for high-volume event streaming.
When Kafka Is the Right Tool
**High-volume event ingestion**: if your system produces millions of events per hour that need to be reliably captured and made available to multiple consumers, Kafka is designed for this. Website clickstreams, IoT sensor data, financial transaction logs, and application event streams at scale are canonical Kafka use cases.
**Event-driven microservice architectures**: services that communicate through events rather than synchronous API calls use Kafka as the event bus. Service A produces an event to a Kafka topic; Service B, C, and D all consume it and react independently. This decouples the services and allows each to scale independently.
**Stream processing at scale**: when you need to process a continuous stream of events — aggregating, enriching, filtering, joining with reference data — and produce output in real time, Kafka combined with a stream processor (Apache Flink, Kafka Streams, or Spark Structured Streaming) is the standard architecture.
**Audit logging and change data capture (CDC)**: Kafka topics as durable audit logs of all changes to a system. Debezium, a CDC tool, reads database binary logs and writes every insert, update, and delete to Kafka topics. Downstream consumers can reconstruct the full state of a table from the CDC topic.
When Kafka Is Overkill
**Sub-100k events per hour into a single consumer**: at this scale, a database table, an S3 bucket, or even a Redis queue provides the same reliability properties with a fraction of the operational complexity.
**Internal microservice communication where latency is critical**: if Service A calls Service B and needs a response within 50ms, an asynchronous Kafka event is not the right communication pattern. Use a synchronous API. Kafka is asynchronous by design; synchronous request-response patterns do not fit its model.
**Analytics use cases where "real-time" means "within a few minutes"**: a batch pipeline that runs every 5 minutes achieves near-real-time latency for most analytical use cases. Building a Kafka streaming pipeline to achieve 5-minute latency is expensive and complex when a scheduled batch with a 5-minute cadence achieves the same result.
**When the team does not have streaming operations experience**: Kafka clusters require expertise to operate — partition management, consumer group lag monitoring, broker configuration, retention policy management, consumer offset management. A team that has not operated Kafka before will spend significant time on operational issues rather than data product delivery.
The Stream Processing Layer
Kafka is a message broker, not a compute engine. It stores and delivers events but does not process them. Processing requires a stream processor.
**Apache Flink**: the most capable open-source stream processor. It supports exactly-once processing semantics, complex stateful operations (windowed aggregations, stream joins, pattern detection), and both batch and stream processing in a single runtime. It is complex to configure and operate. Appropriate for complex stream processing requirements.
**Kafka Streams**: a Java library (not a separate cluster) for stream processing that runs as part of the consuming application. Lower operational overhead than Flink because there is no separate cluster to manage. Limited to Java/Kotlin applications and to simpler processing topologies.
**Spark Structured Streaming**: if your organisation already uses Apache Spark for batch processing, Structured Streaming extends it to stream processing with the same API. Integration with the existing Spark ecosystem is smooth. Latency is higher than Flink or Kafka Streams — micro-batch rather than true streaming — but often acceptable for near-real-time use cases.
**Managed services**: Confluent Cloud (managed Kafka), Amazon MSK (managed Kafka on AWS), and cloud-native alternatives like Amazon Kinesis and Google Pub/Sub eliminate the infrastructure management burden. For organisations that cannot justify dedicated streaming infrastructure engineering, managed services make Kafka operationally tractable.
The Serving Layer
Stream processing computes results from the event stream. Those results need to be served to the operational application or analytical dashboard that consumes them.
Common serving patterns:
**OLAP stores**: Apache Druid and ClickHouse are columnar databases optimised for high-concurrency analytical queries over time-series and event data. They ingest from Kafka directly and serve dashboard queries with sub-second latency at scale. Appropriate for real-time analytics dashboards where thousands of users query simultaneously.
**Key-value stores**: Redis and DynamoDB serve pre-computed signals at very low latency (milliseconds) for operational use cases. A stream processor computes a fraud score per device ID and writes it to Redis keyed by device ID; the transaction approval system reads the score at approval time.
**Analytical warehouses**: Snowflake, BigQuery, and Redshift all have streaming ingestion capabilities (Snowpipe for Snowflake, streaming inserts for BigQuery). Data arrives in the warehouse within minutes rather than hours. This is often sufficient for near-real-time analytics use cases that do not require true streaming latency.
Architecture Anti-Patterns
**Kafka as a database**: Kafka is a log, not a database. Using Kafka topics as the primary store for application state — where consumers query Kafka to read current values — is a misuse. Use a database for current state; use Kafka for the stream of events that led to that state.
**One topic per entity type**: a common over-generalisation. Kafka topics should reflect message categories that have meaningful consumer relationships. "All customer events" on one topic is fine if all consumers want all customer events. Separate topics for each event type add consumer coordination complexity without proportional benefit.
**Ignoring consumer lag**: consumer lag — the gap between the producer's latest offset and the consumer's current offset — is the primary health metric for a Kafka consumer. Unmonitored consumer lag grows silently until it manifests as severely delayed data or consumer group rebalancing that takes hours to recover from. Monitor lag continuously.
Our data architecture and data engineering practice designs streaming data architectures — contact us to discuss your real-time data requirements.
A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.
Book a Call →