Streaming Data Architecture: Patterns for Real-Time Analytics

Streaming architecture processes events as they arrive rather than accumulating them in batches. This guide covers the core patterns — Lambda, Kappa, streaming lakehouse — the trade-offs between them, and when streaming is actually the right choice.

Most data architecture is batch-oriented: collect events, accumulate them in a staging area, process them periodically, load results into a warehouse. Batch is simpler, cheaper, and sufficient for most analytical use cases. Streaming architecture processes events as they arrive — latency measured in seconds or milliseconds rather than hours. This guide covers the core streaming patterns, when streaming is actually justified, and the common mistakes that lead organisations to over-engineer with streaming when batch would have been fine.

When streaming is justified

Before choosing a streaming architecture, answer this question: what breaks if the data is one hour old? If the answer is "nothing critical", batch is the correct architecture and streaming is premature complexity.

Streaming is justified when:

**Operational decisions require real-time data**: Fraud detection (a payment must be evaluated before it is approved), inventory reservation (stock must be decremented before the next order is placed), personalisation (recommendations must reflect a user's activity in the current session). These use cases have a business consequence if the data is stale — not just a slightly less accurate report, but a wrong decision with a material impact.

**Alerting and monitoring require sub-minute latency**: Infrastructure monitoring, application error rate tracking, financial market data. The value of the data degrades rapidly — knowing that your checkout was down 5 minutes ago is too late to prevent the loss.

**Downstream systems require events as they happen**: A customer data platform that must enrich user profiles for personalised email within seconds of a behavioural event. An inventory system that must update availability displays in near real-time.

If none of these apply and your analytics team needs "yesterday's data by 6am", batch is sufficient.

The Lambda architecture

Lambda architecture, proposed by Nathan Marz, splits processing into two paths running in parallel:

**Batch layer**: Processes the complete historical dataset periodically, producing accurate but latency-limited views.

**Speed layer**: Processes the stream in near-real-time, producing approximate views of recent data.

**Serving layer**: Merges results from both layers to serve queries — exact historical data from the batch layer, recent approximate data from the speed layer.

The problem with Lambda is maintaining two codebases doing essentially the same computation — one in batch (Spark, dbt) and one in streaming (Flink, Spark Structured Streaming). When the business logic changes, it must be updated in both paths. Discrepancies between the two paths are common and hard to debug. Lambda architecture was a pragmatic solution to a real constraint (streaming engines could not handle full historical reprocessing) that is now largely obsolete.

The Kappa architecture

Kappa architecture, proposed by Jay Kreps, eliminates the batch layer. All processing is done on the stream. Historical reprocessing is achieved by replaying the event log from the beginning — possible if your streaming platform (Apache Kafka, AWS Kinesis, Apache Pulsar) retains events for a sufficient retention window.

Kappa is simpler than Lambda because there is only one processing codebase and one execution environment. The trade-off is that reprocessing the full event history from the beginning is slow for very large histories, and long event retention in Kafka increases storage cost.

Kappa is appropriate when the business requires consistent real-time processing and historical reprocessing from the event log is feasible given the data volume.

The streaming lakehouse

The streaming lakehouse pattern combines streaming event processing with a lakehouse storage layer (Delta Lake, Iceberg, Hudi). Events are consumed from Kafka and written to Delta Lake tables in near-real-time using Spark Structured Streaming or Flink. The Delta tables are queryable by analytics engines (Spark, Trino, SQL analytics tools) immediately as new events are appended.

This pattern effectively collapses the batch and speed layers of Lambda architecture into a single path: streaming writes to Delta, batch analytics query the Delta table. The streaming write handles the latency requirement; the Delta ACID guarantees handle the consistency requirement; the standard analytics tooling handles the query requirement.

The streaming lakehouse is the dominant pattern for organisations using Databricks or Apache Spark with Delta Lake, and is increasingly applicable with Iceberg on AWS (Flink → Iceberg on S3, queried by Athena and Snowflake external tables).

Event streaming platforms

**Apache Kafka**: The dominant event streaming platform. Distributed, fault-tolerant, high-throughput (millions of events per second). Kafka stores events as ordered, immutable logs partitioned across brokers. Consumers can replay from any offset. Kafka is operationally complex — Kafka clusters require expertise to run, and ZooKeeper (now KRaft) adds configuration overhead. Managed Kafka via Confluent Cloud, AWS MSK, or Azure Event Hubs reduces operational burden.

**Apache Pulsar**: Multi-tenant, geo-replicated streaming platform. Similar to Kafka but with tiered storage (older data moves to object storage automatically), built-in multi-tenancy, and a more flexible messaging model (queuing and streaming in one system). Gaining adoption but smaller ecosystem than Kafka.

**AWS Kinesis**: Fully managed, serverless streaming on AWS. Lower operational overhead than Kafka. Throughput limits per shard require capacity planning. Integrates natively with AWS analytics services (Lambda, Glue, Redshift Streaming Ingestion).

**Google Pub/Sub**: Managed messaging and streaming on GCP. Integrates natively with Dataflow, BigQuery Streaming Inserts.

Stream processing engines

**Apache Flink**: The leading stream processing engine for stateful computation — window functions, joins across streams, complex event processing. Lower latency than Spark Structured Streaming. More operationally complex. Managed via Confluent Cloud for Flink, Amazon Managed Service for Apache Flink, or self-hosted on Kubernetes.

**Spark Structured Streaming**: Streaming execution on top of Spark's batch processing engine. Uses micro-batch execution (configurable trigger interval) rather than true event-at-a-time processing — minimum latency is bounded by the trigger interval (typically seconds). Simpler to adopt for teams already on Spark; slightly higher latency than Flink.

**ksqlDB**: SQL interface on top of Kafka Streams. Enables stream processing via SQL queries without writing Java or Python. Appropriate for filtering, aggregation, and enrichment use cases that do not require the full power of Flink.

Common streaming architecture mistakes

**Streaming everything when most use cases are batch**: The most expensive mistake. Streaming infrastructure (Kafka + Flink cluster) costs $3,000–$10,000/month on managed services. For a company where the only "real-time" requirement is a dashboard that refreshes hourly, this is waste.

**Ignoring late-arriving events**: Events arrive out of order. A mobile event recorded at 11:58pm may not reach your stream processor until 12:05am the following day. Streaming aggregations that close windows without accounting for late arrivals will under-count. Define your watermark tolerance (how long to wait for late events before closing a window) based on your observed late-arrival distribution.

**Not handling exactly-once semantics**: Most streaming pipelines offer at-least-once delivery by default — events may be reprocessed on failure, producing duplicate records in the sink. Exactly-once requires transactional writes at the sink (Kafka transactions, Delta Lake ACID, or idempotent writes). Design for idempotency from the start.

**Building streaming without an event schema registry**: Producers change their event schema; consumers break. A schema registry (Confluent Schema Registry, AWS Glue Schema Registry) enforces backward and forward compatibility for event schemas, preventing breaking changes from reaching consumers without notice.

For the batch processing foundations that streaming complements, see data pipeline best practices. For the lakehouse storage layer, see apache-iceberg-guide and delta lake guide. Our data architecture consulting practice designs streaming architectures and audits whether streaming is actually warranted for your use case — book a free architecture review.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →