Real-Time Data Architecture: Streaming vs Batch and When Each Fits

Most enterprise data architectures run on batch — hourly or daily pipeline refreshes. Real-time streaming solves a different problem and costs more to build and operate. Here is the honest assessment of when real-time is worth it and when batch is the right answer.

The quick answer

Most enterprise analytics questions do not require real-time data. Batch processing — running pipelines on hourly, 4-hourly, or daily schedules — delivers the analytics value that most organisations need at a fraction of the cost and complexity of real-time streaming. Real-time architecture is the right investment when the business decision it supports genuinely cannot wait for a batch refresh: fraud detection, operational alerting, real-time customer personalisation at scale, or industrial monitoring. Build for real-time only when the latency reduction produces measurable business value that justifies the engineering and operational overhead.

What "real-time" actually means in data architecture

"Real-time" is used loosely — most use cases that claim to need real-time data actually need near-real-time (seconds to minutes) rather than true real-time (milliseconds). The distinction matters for architecture, because the engineering requirements for these latency targets are very different.

**True real-time (sub-second latency)**: Data is processed and available for query within milliseconds of being generated. Required for high-frequency trading, real-time fraud scoring on individual transactions, and industrial process control. Requires event streaming infrastructure (Apache Kafka, Azure Event Hubs), stream processing engines (Apache Flink, Kafka Streams), and purpose-built low-latency data stores. This is the most expensive and complex tier.

**Near-real-time (seconds to minutes)**: Data is available within seconds to a few minutes of generation, typically under five minutes. Sufficient for operational dashboards, customer service tooling, marketing automation, and most of the use cases organisations actually mean when they say they need "real-time analytics". Achievable with micro-batch processing or continuous ingestion tools without a full streaming architecture.

**Hourly batch**: Data is refreshed every 1–4 hours. Sufficient for most operational reporting, sales dashboards, inventory monitoring, and daily business management. The default architecture for most BI environments.

**Daily batch**: Data is refreshed once per day. Sufficient for strategic reporting, financial reporting, and analytics that inform weekly or monthly business decisions. The simplest and cheapest architecture.

Before investing in real-time infrastructure, be precise about which latency tier your use case actually requires. Most organisations that believe they need sub-second real-time actually need hourly batch. Some need near-real-time. Relatively few need true real-time.
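That triage can be sketched in a few lines. This is only an illustration: the function name `latency_tier`, the cut-offs (1 second, 5 minutes, 4 hours), and the example use cases are assumptions drawn from the tier descriptions above, not an industry standard.

```python
def latency_tier(max_staleness_seconds: float) -> str:
    """Map the oldest data a decision can tolerate to a latency tier.

    The thresholds are illustrative assumptions based on the tier
    descriptions in this article.
    """
    if max_staleness_seconds <= 0:
        raise ValueError("max_staleness_seconds must be positive")
    if max_staleness_seconds < 1:
        return "true real-time"
    if max_staleness_seconds <= 5 * 60:
        return "near-real-time"
    if max_staleness_seconds <= 4 * 60 * 60:
        return "hourly batch"
    return "daily batch"


# Hypothetical use cases and the staleness each can tolerate, in seconds.
use_cases = {
    "card fraud scoring": 0.2,
    "customer service dashboard": 120,
    "sales dashboard": 3600,
    "monthly finance pack": 86400,
}
tiers = {name: latency_tier(seconds) for name, seconds in use_cases.items()}
```

The useful part is the input rather than the function: it forces a number for "how stale can this data be before the decision changes?" for each use case.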

Batch processing: how it works and when it fits

Batch processing runs data pipelines on a schedule. At the scheduled time — hourly, 4-hourly, daily — the pipeline extracts changed data from source systems, transforms it, and loads it into the destination warehouse or lakehouse. Between runs, the destination data is static.
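A minimal sketch of one scheduled run, assuming the source exposes an `updated_at` change-tracking column and the destination can be upserted by key. The table contents, the column names, and the `run_batch` function are hypothetical; a real pipeline would read from a database and write to a warehouse rather than use in-memory structures.

```python
from datetime import datetime

# Hypothetical source rows; `updated_at` is an assumed change-tracking column.
SOURCE_ROWS = [
    {"order_id": 1, "amount": 120.0, "updated_at": datetime(2024, 1, 1, 8, 15)},
    {"order_id": 2, "amount": 75.5, "updated_at": datetime(2024, 1, 1, 9, 40)},
    {"order_id": 3, "amount": 310.0, "updated_at": datetime(2024, 1, 1, 10, 5)},
]


def run_batch(source, warehouse, last_watermark):
    """Run one scheduled batch: extract changes, transform, load.

    Returns the new watermark to store for the next run.
    """
    # Extract: only rows changed since the previous successful run.
    changed = [row for row in source if row["updated_at"] > last_watermark]

    # Transform: a trivial example that adds a derived column.
    transformed = [
        {**row, "amount_cents": round(row["amount"] * 100)} for row in changed
    ]

    # Load: upsert by key, so rerunning a failed batch is safe.
    for row in transformed:
        warehouse[row["order_id"]] = row

    # Advance the watermark only as far as the data actually loaded.
    return max((row["updated_at"] for row in changed), default=last_watermark)


warehouse = {}
watermark = run_batch(SOURCE_ROWS, warehouse, datetime(2024, 1, 1, 9, 0))
```

Because the load is an upsert and the watermark only moves forward after a successful run, repeating the same run leaves the destination unchanged.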

The strengths of batch processing:

**Simplicity.** Batch pipelines are significantly simpler to build, test, and maintain than streaming pipelines. The pipeline runs, completes, and either succeeds or fails clearly. Debugging is straightforward. Testing is reliable.

**Reliability.** Batch pipelines have well-understood failure modes and recovery patterns. If a pipeline fails, it can be rerun from the last successful checkpoint. There is no continuous state to manage, no stream offset to track.

**Cost.** Batch compute runs only when the pipeline executes — typically minutes per hour. Streaming infrastructure runs continuously, consuming compute and memory at all hours even when event volume is low.

**Data quality.** Batch pipelines can apply comprehensive data quality checks before loading — checking referential integrity, validating aggregates against source systems, running cross-table consistency checks. These checks are harder to implement in a streaming context where data arrives record by record.
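As an illustration of the data quality point, here is a sketch of a pre-load gate that runs over a whole staged batch. The function `quality_gate`, the field names, and the tolerance are assumptions for the example; the checks themselves (referential integrity and an aggregate comparison against the source) are the ones described above.

```python
def quality_gate(orders, customer_ids, source_total, tolerance=0.01):
    """Return a list of failures; an empty list means the batch may load."""
    failures = []

    # Referential integrity: every order must point at a known customer.
    orphans = [o["order_id"] for o in orders if o["customer_id"] not in customer_ids]
    if orphans:
        failures.append(f"orphan orders: {orphans}")

    # Aggregate validation: the staged total must match the source system total.
    staged_total = sum(o["amount"] for o in orders)
    if abs(staged_total - source_total) > tolerance:
        failures.append(
            f"total mismatch: staged {staged_total:.2f}, source {source_total:.2f}"
        )

    return failures


staged = [
    {"order_id": 1, "customer_id": "C1", "amount": 100.0},
    {"order_id": 2, "customer_id": "C9", "amount": 50.0},  # C9 is not a known customer
]
failures = quality_gate(staged, customer_ids={"C1", "C2"}, source_total=150.0)
```

Both checks need the complete batch in hand before anything is loaded, which is exactly what a record-by-record stream does not give you.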

For the vast majority of enterprise BI workloads — dashboards, reports, financial analytics, operational monitoring — hourly or 4-hourly batch is sufficient. The question to ask is not "do we have real-time data?" but "does the business decision we are making with this data change faster than our refresh interval?"

Streaming architecture: how it works

Streaming data architecture processes data continuously as it is generated, rather than accumulating it and processing it in batches. The canonical streaming architecture has three components:

**Event producer**: The source system publishes events to a messaging system (Apache Kafka, Azure Event Hubs, Amazon Kinesis) as they occur — a transaction processed, a sensor reading taken, a user action logged. Events are published to named topics that consumers can subscribe to.

**Stream processor**: A stream processing engine (Apache Flink, Databricks Structured Streaming, Kafka Streams) subscribes to event topics, applies transformations, aggregations, and business logic to the event stream, and writes results to the destination. The processor maintains state between events — for windowed aggregations, for joining streams, for detecting patterns across sequences of events.

**Stream store / serving layer**: Processed stream results are written to a destination optimised for the query pattern — a low-latency key-value store (Redis, Cassandra) for real-time lookups, a time-series database (InfluxDB, TimescaleDB) for sensor and IoT data, or a lakehouse table (Delta Lake, Iceberg) for analytical queries over streaming data.
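The three components can be sketched in-process to show how they relate. This is a toy model, not a deployment pattern: the `deque` stands in for a broker topic, the dictionary stands in for a key-value serving store, and the names `produce` and `process_available` are hypothetical.

```python
from collections import defaultdict, deque

# Stand-in for a named topic on a message broker; a real system would use
# Kafka, Event Hubs, or Kinesis rather than an in-process deque.
topic = deque()

# Stand-in for a low-latency key-value serving store.
serving_store = {}


def produce(event):
    """Event producer: publish an event to the topic as it occurs."""
    topic.append(event)


def process_available(state):
    """Stream processor: consume pending events and update running totals.

    `state` persists between calls, which is what distinguishes a stateful
    stream processor from a stateless per-record transform.
    """
    while topic:
        event = topic.popleft()
        state[event["account"]] += event["amount"]
        # Write the latest result to the serving layer for fast lookups.
        serving_store[event["account"]] = state[event["account"]]


state = defaultdict(float)
produce({"account": "A", "amount": 20.0})
produce({"account": "B", "amount": 5.0})
process_available(state)
produce({"account": "A", "amount": 30.0})
process_available(state)
```

After the second call the serving store holds the running total per account, computed incrementally rather than by rescanning history.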

The streaming architecture adds significant complexity at every layer: the messaging system requires capacity planning and retention configuration; the stream processor requires cluster management and exactly-once semantics handling; the state management for windowed aggregations requires careful design to avoid unbounded memory growth; and the end-to-end latency monitoring requires instrumentation that batch pipelines do not need.

Lambda architecture: combining batch and streaming

Many real-world data architectures need both streaming for low-latency views and batch for reliable historical analysis. The Lambda architecture addresses this by maintaining two parallel processing paths:

**Speed layer**: A streaming pipeline processes events as they arrive, producing low-latency views of recent data. The speed layer is fast but may produce slightly inconsistent results: late-arriving events may not be reflected, and reprocessing for corrections is complex.

**Batch layer**: A batch pipeline processes the full event history on a schedule, producing accurate, complete views of historical data. The batch layer is authoritative but delayed.

**Serving layer**: Queries are served from a combination of the batch layer (for historical accuracy) and the speed layer (for the most recent window), merged at query time.
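A sketch of the query-time merge, under the assumption that the batch view is complete up to a known cut-off and the speed layer holds individual recent events. The function `serve_count`, the field names, and the integer timestamps are illustrative.

```python
def serve_count(key, batch_view, speed_events, batch_cutoff):
    """Combine the authoritative batch count with events newer than the cut-off.

    Filtering on `batch_cutoff` avoids double-counting events that the batch
    layer has already absorbed.
    """
    recent = sum(
        1 for e in speed_events if e["key"] == key and e["ts"] > batch_cutoff
    )
    return batch_view.get(key, 0) + recent


# Hypothetical page-view data. Timestamps are plain integers for brevity.
batch_view = {"home": 100}   # complete up to and including ts = 1000
batch_cutoff = 1000
speed_events = [
    {"key": "home", "ts": 990},     # already included in the batch view
    {"key": "home", "ts": 1005},
    {"key": "home", "ts": 1010},
    {"key": "pricing", "ts": 1007},
]

home_total = serve_count("home", batch_view, speed_events, batch_cutoff)
pricing_total = serve_count("pricing", batch_view, speed_events, batch_cutoff)
```

Even in this toy form, the merge depends on both layers agreeing on where the cut-off sits, which is one source of the reconciliation work Lambda demands.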

Lambda architecture adds significant complexity: you maintain two processing paths, two codebases, and merging logic that must reconcile discrepancies between them. For many use cases, this complexity is justified. For others, particularly where near-real-time (minutes) rather than true real-time (milliseconds) is sufficient, a simplified approach works: run a streaming pipeline that writes to a lakehouse table using micro-batches, and serve queries directly from that table without a separate batch path.

Databricks Structured Streaming and Delta Lake make this unified streaming+batch approach practical: the same table serves both streaming writes (with ACID transactions preventing read inconsistency) and batch queries from BI tools.

Common real-time use cases that justify the investment

**Fraud and anomaly detection**: Scoring each transaction against a fraud model as it occurs, before it is processed. The business value of catching fraud before a payment clears is unambiguous, and the latency requirement (sub-second) is genuine. This is one of the clearest cases for real-time infrastructure.

**Operational alerting**: Triggering alerts when a business metric crosses a threshold — inventory below reorder level, error rate above 1%, a critical pipeline failure. Near-real-time (minutes) is usually sufficient. Can often be implemented with micro-batch rather than full streaming.

**Real-time customer personalisation**: Updating product recommendations, pricing, or content based on a user's current session activity. Requires session-level state management and sub-minute latency. Justified when personalisation demonstrably increases conversion.

**Industrial and IoT monitoring**: Processing sensor data from manufacturing equipment, energy infrastructure, or building management systems to detect anomalies, predict failures, or optimise operations. Often high-frequency (thousands of events per second per device) and requires aggregation and anomaly detection at the edge or near-edge before centralisation. We cover this in the cloud engineering context in our article on Azure data architecture best practices.

**Operational dashboards for customer-facing teams**: Customer service agents who need to see order status, recent interactions, and account state in real time — not as of last night's batch. Near-real-time (minutes) is usually sufficient.
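For the operational alerting case above, a micro-batch check can be as small as the following sketch. It assumes a scheduler hands the function the records collected in the last few minutes; the name `check_error_rate`, the `status` field, and the rule that a status of 500 or above counts as an error are all assumptions for the example.

```python
def check_error_rate(batch, threshold=0.01):
    """Return an alert message if the batch error rate exceeds the threshold."""
    if not batch:
        return None
    errors = sum(1 for record in batch if record["status"] >= 500)
    rate = errors / len(batch)
    if rate > threshold:
        return f"error rate {rate:.1%} exceeds {threshold:.1%}"
    return None


# Two hypothetical five-minute batches of request logs.
healthy = [{"status": 200}] * 199 + [{"status": 500}]      # 0.5% errors
degraded = [{"status": 200}] * 95 + [{"status": 503}] * 5  # 5% errors

healthy_alert = check_error_rate(healthy)
degraded_alert = check_error_rate(degraded)
```

Run every few minutes, a check like this covers many alerting needs without a broker or a stream processor.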

What most "real-time" projects get wrong

**Building streaming before validating the latency requirement.** The most common mistake is building streaming infrastructure because "real-time sounds better" rather than because the business use case genuinely requires it. Before committing to a streaming architecture, validate the actual latency requirement with the business stakeholders who will use the data. Most teams discover that hourly batch or near-real-time micro-batch is sufficient.

**Underestimating operational complexity.** Streaming infrastructure requires ongoing operational expertise: Kafka cluster management, stream processor cluster sizing, offset management, dead-letter queue monitoring, late-data handling, and exactly-once semantics verification. The engineering cost does not end at deployment — it scales with the operational complexity of the running system.

**Ignoring data quality in streaming contexts.** Batch pipelines can apply comprehensive data quality checks before loading. Streaming pipelines process records as they arrive, which means bad data propagates to the serving layer before quality checks can catch it. Designing data quality into a streaming pipeline — schema validation at ingestion, dead-letter queues for records that fail validation, downstream anomaly detection — requires deliberate engineering investment that is often skipped in initial builds.

**Not planning for late-arriving data.** In streaming architectures, events do not always arrive in order. A sensor reading taken at 9:59am may arrive at 10:03am, after the 10-minute window covering 9:50am to 10:00am has already closed. Handling late-arriving data correctly, either by including it in a corrected window or by explicitly marking it as late, requires deliberate design. Ignoring it produces incorrect aggregations that appear correct.
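The two options for late data, a corrected window or an explicit late marker, can be sketched as follows. This is a simplification: it decides lateness from each event's arrival time, whereas engines such as Flink and Spark Structured Streaming track a watermark derived from event time. The constants, the `assign` function, and the tuple layout are assumptions for the example.

```python
WINDOW = 600            # 10-minute tumbling windows, in seconds
ALLOWED_LATENESS = 300  # accept events up to 5 minutes after the window closes


def window_start(event_time):
    """Return the start of the tumbling window that contains event_time."""
    return event_time - (event_time % WINDOW)


def assign(events):
    """Route each event to its window, a correction, or a too-late bucket.

    Each event is (event_time, arrival_time, value), with times in seconds
    since midnight.
    """
    windows, corrections, too_late = {}, {}, []
    for event_time, arrival_time, value in events:
        start = window_start(event_time)
        close = start + WINDOW
        if arrival_time < close:
            windows[start] = windows.get(start, 0) + value
        elif arrival_time < close + ALLOWED_LATENESS:
            # Late but within tolerance: record a correction for that window.
            corrections[start] = corrections.get(start, 0) + value
        else:
            # Too late to correct: keep it, explicitly marked, rather than drop it.
            too_late.append((event_time, arrival_time, value))
    return windows, corrections, too_late


events = [
    (35700, 35705, 1.0),  # 9:55 reading, arrives on time
    (35940, 36180, 2.0),  # 9:59 reading, arrives 10:03 (within allowed lateness)
    (35940, 37200, 4.0),  # 9:59 reading, arrives 10:20 (too late)
    (36060, 36061, 8.0),  # 10:01 reading, arrives on time
]
windows, corrections, too_late = assign(events)
```

The 9:59 reading that arrives at 10:03 lands in the corrections for the 9:50 window instead of disappearing, and the one that arrives at 10:20 is retained and flagged.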

FAQs

How do we know if we need streaming or batch?

Ask: "If this data is 4 hours old when an analyst looks at it, does it change the decision they make?" If no — batch is sufficient. If yes — ask "Does it need to be 5 minutes old, or 30 seconds old?" The answer determines whether micro-batch or full streaming is required. For most enterprise BI workloads, 4-hour batch is sufficient. For operational alerting, 5-minute micro-batch is usually sufficient. For fraud scoring, sub-second streaming is necessary.

What is Apache Kafka and do we need it?

Apache Kafka is a distributed event streaming platform: a highly durable, high-throughput message broker that allows event producers to publish events and stream consumers to subscribe to them. For true real-time architectures processing high event volumes, Kafka (or Azure Event Hubs, which exposes a Kafka-compatible endpoint) is the standard infrastructure choice. For near-real-time use cases where micro-batch is sufficient, Kafka is often unnecessary, because simpler ingestion tools (Fivetran, Azure Data Factory) can handle the workload. Do not deploy Kafka because it is a recognised technology in the streaming space; deploy it because your use case requires the throughput and durability guarantees it provides.

What is Apache Flink and how does it differ from Spark Structured Streaming?

Both Apache Flink and Apache Spark Structured Streaming are distributed stream processing engines. Flink is a native stream processor: it processes events one at a time with true continuous processing semantics. Spark Structured Streaming is, by default, a micro-batch processor: it processes small batches of events on a configurable trigger interval (as low as a few hundred milliseconds). For most near-real-time use cases, Spark Structured Streaming on Databricks is the practical choice because it integrates with Delta Lake, the existing Databricks engineering environment, and the established Spark ecosystem. Flink is preferred for use cases that require truly continuous processing semantics, complex event processing, or sub-hundred-millisecond latency.
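The latency consequence of micro-batching can be shown with simple arithmetic. The sketch below assumes triggers fire at exact multiples of the interval and that processing takes no time, so it gives a lower bound rather than a measurement; the function names are hypothetical and this is not either engine's API.

```python
def micro_batch_visible_at(arrival_ms: int, trigger_ms: int) -> int:
    """Return when an event is picked up under fixed-interval micro-batching.

    Simplifying assumption: triggers fire at exact multiples of trigger_ms
    and processing time is zero.
    """
    # Integer ceiling division: round the arrival time up to the next trigger.
    return -(-arrival_ms // trigger_ms) * trigger_ms


def added_latency(arrival_ms: int, trigger_ms: int) -> int:
    """Waiting time introduced by the trigger interval alone."""
    return micro_batch_visible_at(arrival_ms, trigger_ms) - arrival_ms


# An event arriving 1,234 ms into the stream, under two trigger intervals.
wait_500ms_trigger = added_latency(1234, 500)
wait_1min_trigger = added_latency(1234, 60_000)
```

An event-at-a-time engine has no trigger wait, which is why it suits the sub-hundred-millisecond cases, while a one-minute trigger is comfortably within the near-real-time tier.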

How does real-time data architecture relate to AI and agentic AI?

Real-time infrastructure is a prerequisite for agentic AI that takes actions based on current state. An AI agent that monitors operations and triggers interventions needs current data, not data that is 24 hours old from last night's batch. The architectural gap between batch analytics infrastructure and agentic AI requirements is precisely this latency problem. We cover this in detail in our article on why your data architecture cannot support agentic AI.

Our data architecture consulting and cloud engineering practices design real-time and near-real-time data architectures across Azure Event Hubs, Kafka, Databricks Structured Streaming, and Delta Lake. If your organisation is evaluating streaming infrastructure or trying to understand what "real-time" actually requires, book a free 30-minute audit and we will give you a direct assessment of what latency tier your use cases actually need.
