A streaming data pipeline processes data continuously as events occur rather than in scheduled batches. This guide explains how streaming pipelines work, when real-time processing is actually necessary, the tools used to build them, and the operational complexity they introduce.
A streaming data pipeline processes data continuously as events occur, rather than accumulating records and processing them in scheduled batches. Where a batch pipeline runs at intervals (hourly, daily), a streaming pipeline operates on every event as it arrives — milliseconds to seconds after it is generated.
Streaming pipelines enable use cases that batch pipelines cannot: fraud detection at transaction time, real-time dashboards showing current inventory or live order status, personalization systems that adapt to user behavior in the current session, and operational alerting that detects anomalies as they happen rather than hours after the batch runs.
When Real-Time Processing Is Actually Necessary
Before investing in streaming infrastructure, the first question should be: does this use case actually require real-time, or would a 15-minute or hourly batch suffice?
Genuine streaming requirements:
- **Fraud and risk detection** — a fraudulent transaction must be flagged before it is approved. No batch latency is acceptable.
- **Real-time operational control** — a logistics system adjusting routing in response to live vehicle telemetry. A trading system responding to market prices.
- **User-facing real-time features** — a ride-sharing app showing the current location of a vehicle. An e-commerce site showing current inventory count.
- **Event-driven application logic** — microservices reacting to business events (order placed, payment received, shipment dispatched) that must trigger immediate downstream actions.
Use cases that feel real-time but are often adequately served by batch:
- **Analytics dashboards showing "current" data** — for most business intelligence dashboards, data that is 15–60 minutes old is indistinguishable from real-time from the user's perspective. Near-real-time batch (15-minute increments) is dramatically simpler than streaming and serves most analytics use cases adequately.
- **Alerting on data quality anomalies** — most data quality alerting tolerates a lag of minutes to tens of minutes. Batch-triggered monitoring is simpler.
- **Reporting for end-of-day or weekly reviews** — these are not streaming use cases regardless of how they are sometimes presented.
Core Streaming Pipeline Components
**Event source** — where events originate. Application databases emitting change events via CDC (Debezium reading PostgreSQL WAL), application services writing to a message queue, IoT devices publishing sensor readings, clickstream events from web/mobile applications.
**Message broker / event streaming platform** — the durable, ordered log that decouples producers from consumers. Apache Kafka is the dominant platform: events are written to topics, retained for a configurable period, and consumed by multiple independent consumer groups at their own pace. Managed alternatives: Amazon Kinesis Firehose (simpler, lower throughput), Google Pub/Sub (GCP native), Azure Event Hubs (Kafka-compatible). The broker provides backpressure handling, replay capability, and the decoupling that prevents producers from being blocked by slow consumers.
**Stream processing engine** — the computation layer that transforms, aggregates, filters, and enriches events as they flow through the pipeline. Options:
- **Apache Flink** — the current state-of-the-art for stateful stream processing. Supports exactly-once semantics, sophisticated windowing (tumbling windows, sliding windows, session windows), stateful operations (maintaining running aggregations across events), and event-time processing (handling out-of-order events correctly). More complex to operate than simpler alternatives.
- **Apache Spark Structured Streaming** — streaming extension of Spark. If a team already runs Spark for batch processing, Structured Streaming extends those skills to streaming workloads. Micro-batch execution (processes small batches continuously) rather than true event-at-a-time processing.
- **ksqlDB** — SQL interface for stream processing on Kafka. For teams comfortable with SQL, ksqlDB provides a declarative way to write streaming transformations and aggregations without Java or Python code.
- **Flink SQL** — SQL interface for Apache Flink. Combines Flink's processing capabilities with SQL familiarity.
- **Databricks Structured Streaming** — Spark Structured Streaming integrated with Delta Lake, enabling streaming writes to Delta tables with ACID guarantees.
**Sink** — where processed events land. A data warehouse (streaming inserts to Snowflake, BigQuery, or Redshift for near-real-time analytics), a database (operational results stored in PostgreSQL or Cassandra), another Kafka topic (for downstream consumers), or a serving layer (Redis for low-latency feature serving, Elasticsearch for real-time search).
Windowing: Aggregating Across Time
Most streaming analytics is not about individual events but about aggregations over time windows: how many orders were placed in the last 5 minutes, what is the average session duration in the current hour, which products have had a spike in views in the last 15 minutes.
**Tumbling windows** — fixed, non-overlapping time periods. A 5-minute tumbling window aggregates events from 10:00–10:05, then 10:05–10:10, etc. No overlap between windows.
**Sliding windows** — fixed size but overlapping. A 5-minute window sliding every 1 minute: 10:00–10:05, 10:01–10:06, 10:02–10:07. Higher resolution but more computation.
**Session windows** — grouping events by inactivity gaps. A session ends when there is no activity for 30 minutes. Session windows have variable duration and are useful for user behavior analysis.
**Event time vs processing time** — events do not always arrive in the order they occurred. A mobile device that was offline for 10 minutes delivers queued events when it reconnects — those events are timestamped 10 minutes in the past but arrive now. Stream processors that use event time correctly handle out-of-order events using watermarks — the processor waits a defined interval for late-arriving events before finalizing a window's aggregation.
Exactly-Once vs At-Least-Once Semantics
Distributed streaming systems must handle failures: network interruptions, consumer restarts, broker failures. The failure-handling semantics determine whether events are processed zero, one, or more than one time:
**At-least-once** — events are guaranteed to be processed, but may be processed more than once on failure. Duplicates are possible. The application must be idempotent (processing the same event twice produces the same result) or include deduplication logic.
**Exactly-once** — events are processed exactly once, even on failure. Achieved through transactions, two-phase commit between the broker and sink, or idempotent producers with consumer offset tracking. More complex to implement; Kafka's exactly-once semantics (introduced in Kafka 0.11) and Apache Flink's exactly-once guarantee are the most reliable implementations.
Most streaming pipelines for analytics use at-least-once with deduplication at the sink — simpler than full exactly-once while providing effectively-once behavior for most analytical aggregations.
The Operational Complexity Tax
Streaming pipelines are fundamentally harder to operate than batch pipelines. The failure modes are more varied (message lag buildup, consumer group rebalancing, partition imbalance), the debugging is harder (distributed event processing with no natural checkpoint to inspect), and the state management requirements add infrastructure complexity.
Teams considering streaming should explicitly evaluate:
- Whether their use case actually requires sub-minute freshness, or whether 15-minute batch adequately serves the need
- Whether the team has the operational experience to run Kafka and Flink or Spark Structured Streaming in production
- Whether the on-call burden of streaming pipeline failures is acceptable relative to the business value delivered
For most analytics teams, the right answer is a combination: batch pipelines (Fivetran + dbt) for the primary analytical workload, with streaming reserved for the specific use cases that genuinely require real-time processing.
Our data architecture practice designs streaming and real-time data pipeline architectures — contact us to discuss whether and how streaming fits your data platform requirements.
A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.
Book a Call →