Real-Time Analytics Architecture: Streaming Pipelines, OLAP Engines, and When You Actually Need It

Obed Tsimi
Founder & Senior Tableau Architect
November 26, 2026 · 12 min read

The architecture of real-time analytics — streaming ingestion with Kafka, stream processing with Flink, real-time OLAP databases (ClickHouse, Apache Druid, Apache Pinot), and the honest assessment of when real-time adds business value versus when near-real-time or batch analytics is the right answer.

Real-time analytics is one of the most over-specced requirements in enterprise data work. Every organisation wants it; few actually need it. The ones that do need it face genuinely hard engineering problems. This guide covers how real-time analytics systems are built and how to tell whether real-time, near-real-time, or batch analytics fits a given requirement.

The Real-Time Analytics Spectrum

"Real-time" is not binary. The practical spectrum is:

**Batch analytics (hourly to daily latency):** Data is processed on a schedule. Queries run against a snapshot of data from the last batch run. Appropriate for most business reporting — revenue dashboards, cohort analysis, product metrics.

**Near-real-time analytics (latency of minutes):** Data is ingested continuously or in short intervals (5–15 minutes) into a cloud warehouse and is available for query shortly after. This is achievable with streaming ingestion (Kafka, for example on Confluent Cloud, or AWS Kinesis) into Snowflake or BigQuery. Most "real-time" requirements are actually near-real-time requirements in disguise.

**Real-time analytics (seconds to sub-second latency):** Data is available for query within seconds of being generated. Requires a specialised real-time OLAP engine. Appropriate for operational dashboards, fraud detection, live monitoring, and use cases where sub-minute latency changes a business decision.

The most expensive engineering mistake in analytics is building a real-time system for a near-real-time problem.
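One way to make the spectrum concrete is to translate the staleness a decision can tolerate into a tier. The sketch below is purely illustrative: the function name `choose_tier` and the cut-offs (60 seconds and one hour) are assumptions chosen to match the ranges described above, not industry standards.

```python
from enum import Enum


class Tier(Enum):
    REAL_TIME = "real-time OLAP engine"
    NEAR_REAL_TIME = "streaming ingestion into a cloud warehouse"
    BATCH = "scheduled batch processing"


def choose_tier(max_staleness_seconds: float) -> Tier:
    """Map the staleness a decision can tolerate to an architecture tier.

    The thresholds are assumptions for illustration only.
    """
    if max_staleness_seconds <= 0:
        raise ValueError("staleness tolerance must be positive")
    if max_staleness_seconds < 60:
        return Tier.REAL_TIME          # seconds to sub-second
    if max_staleness_seconds < 3600:
        return Tier.NEAR_REAL_TIME     # minutes
    return Tier.BATCH                  # hourly to daily


# A fraud check that must see events within 2 seconds vs. a daily revenue report.
fraud_tier = choose_tier(2)
report_tier = choose_tier(24 * 3600)
```

If the tolerated staleness for every dashboard in scope lands in the second or third tier, the specialised infrastructure described later in this guide is probably unnecessary.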

Streaming Ingestion: Apache Kafka

Apache Kafka is the dominant event streaming platform. It provides durable, ordered, partitioned event logs with configurable retention. Producers write events to topics; consumers read from topics at their own pace.

For analytics use cases, Kafka serves as the event bus between operational systems and analytics infrastructure:

**Producers:** Application services writing events (user actions, transactions, system events) to Kafka topics. Events are typically JSON or Avro-serialised. Schema Registry (Confluent or AWS Glue Schema Registry) enforces schema compatibility between producers and consumers.

**Topics and partitioning:** A topic is divided into partitions. Events with the same key (e.g., user_id) are written to the same partition, preserving per-key ordering. Partition count determines maximum consumer parallelism — more partitions allow more consumers to read in parallel.

**Consumers:** Analytics consumers reading from Kafka include stream processors (Flink, Kafka Streams), cloud warehouse connectors (the Snowflake Kafka Connector, or Google Cloud Dataflow writing into BigQuery), and object storage sinks (S3, GCS via Kafka Connect). Each consumer group tracks its own offsets, so multiple independent consumers can read the same topic at their own pace.

**Retention:** Kafka retains events for a configurable period (commonly 7 days to indefinite). For analytics use cases, a replay-capable retention window is valuable — it allows recovering from downstream failures by replaying from a historical offset.

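**Managed Kafka options:** Confluent Cloud and AWS MSK (Managed Streaming for Apache Kafka). Google Cloud Pub/Sub is a comparable managed messaging service, but it is not Kafka and is generally bridged to Kafka through a connector rather than a Kafka-compatible API. Managed services remove most of the burden of Kafka cluster management, which is genuinely complex, in exchange for a higher per-event cost.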
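The partitioning, consumer-group, and replay behaviour described above can be sketched with a toy in-memory log. This is not a Kafka client: `PartitionedLog` and its methods are hypothetical names invented for illustration, and CRC32 stands in for the murmur2 hash that Kafka's default partitioner applies to keyed records.

```python
import zlib
from collections import defaultdict


class PartitionedLog:
    """Toy in-memory stand-in for a single Kafka topic (illustration only)."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions = [[] for _ in range(num_partitions)]
        # Offset per (consumer group, partition); a new group starts at 0.
        self.offsets = defaultdict(int)

    def partition_for(self, key: str) -> int:
        # Stable hash of the key modulo the partition count, so the same key
        # always maps to the same partition. CRC32 is an assumption here.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key: str, value: str) -> int:
        partition = self.partition_for(key)
        self.partitions[partition].append((key, value))
        return partition

    def poll(self, group: str, partition: int) -> list:
        """Return unread events for this group and advance its offset."""
        start = self.offsets[(group, partition)]
        events = self.partitions[partition][start:]
        self.offsets[(group, partition)] = len(self.partitions[partition])
        return events

    def seek(self, group: str, partition: int, offset: int) -> None:
        """Move a group's offset, for example back to 0 to replay events."""
        self.offsets[(group, partition)] = offset


log = PartitionedLog(num_partitions=4)
for action in ["login", "view", "purchase"]:
    log.append("user-42", action)

p = log.partition_for("user-42")
first_read = log.poll("warehouse-sink", p)  # per-key order is preserved
log.seek("warehouse-sink", p, 0)            # simulate recovery after a failure
replayed = log.poll("warehouse-sink", p)    # same events, same order
other_group = log.poll("flink-job", p)      # independent offsets per group
```

Because every event for `user-42` lands in one partition, a consumer sees that user's events in the order they were produced, and rewinding the offset replays them unchanged as long as they are still within the retention window.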

Stream Processing: Apache Flink

Apache Flink is the leading stream processing engine for complex analytics transformations on event streams. It processes events with low latency (milliseconds), handles out-of-order events via watermarks and event-time windows, and provides exactly-once state consistency through checkpointing. End-to-end exactly-once delivery additionally depends on replayable sources and transactional or idempotent sinks.

Key Flink concepts for analytics:

**Event time vs processing time:** Event time is when the event occurred (embedded in the event). Processing time is when Flink sees it. For analytics correctness, always use event time — it produces consistent results even when events arrive late or out of order.

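**Watermarks:** A watermark is Flink's estimate of how far event time has progressed; it tells Flink when it can stop waiting for late-arriving events and close a time window. With a bounded out-of-orderness of 30 seconds, the watermark trails the highest event timestamp seen so far by 30 seconds. A window therefore emits its result only once an event at least 30 seconds past the window end has arrived, which allows events up to 30 seconds late to be included.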

**Windows:** Tumbling windows (non-overlapping, fixed-size intervals), sliding windows (overlapping), and session windows (gaps between events define session boundaries). Window aggregations (count, sum, avg, distinct count) are the core of real-time analytics with Flink.

**Flink SQL:** Apache Flink exposes a SQL interface that allows defining streaming transformations using SQL syntax. For analytics engineers familiar with SQL, Flink SQL significantly reduces the learning curve for stream processing:

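    -- Page views per URL in one-minute tumbling windows.
    -- Assumes page_view_events declares event_time as an event-time
    -- attribute with a watermark; depending on the catalog, CREATE TABLE AS
    -- may also need a WITH (...) clause that configures the sink connector.
    CREATE TABLE page_views_per_minute AS
    SELECT
      TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
      page_url,
      COUNT(*) AS view_count
    FROM page_view_events
    GROUP BY TUMBLE(event_time, INTERVAL '1' MINUTE), page_url;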

**Managed Flink options:** Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics), Confluent Cloud for Apache Flink, and Ververica Platform for self-hosted deployments. Running Flink at scale requires operational expertise; managed options reduce this burden significantly.
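To see how event time, watermarks, and tumbling windows interact, here is a plain-Python sketch of the same per-minute page-view count. It is a simplified model, not Flink's implementation: the function name, the 30-second lateness bound, and the integer-second timestamps are all assumptions made for illustration, and windows still open when the input ends are not flushed.

```python
from collections import defaultdict

WINDOW_SECONDS = 60
MAX_LATENESS_SECONDS = 30  # assumed bounded out-of-orderness


def tumbling_counts(events):
    """Count events per (window_start, page_url) using event time.

    `events` is an iterable of (event_time_seconds, page_url) in arrival
    order. Returns (emitted, dropped): emitted holds
    (window_start, page_url, count) tuples in firing order; dropped holds
    events that arrived after their window had already fired.
    """
    open_windows = defaultdict(int)  # (window_start, page_url) -> count
    emitted, dropped = [], []
    watermark = float("-inf")

    for event_time, page_url in events:
        window_start = event_time - (event_time % WINDOW_SECONDS)
        if window_start + WINDOW_SECONDS <= watermark:
            dropped.append((event_time, page_url))  # window already closed
        else:
            open_windows[(window_start, page_url)] += 1

        # The watermark trails the largest event time seen so far.
        watermark = max(watermark, event_time - MAX_LATENESS_SECONDS)

        # Fire every window whose end the watermark has reached.
        ready = sorted(k for k in open_windows
                       if k[0] + WINDOW_SECONDS <= watermark)
        for key in ready:
            emitted.append((key[0], key[1], open_windows.pop(key)))

    return emitted, dropped


arrivals = [
    (5, "/home"), (50, "/home"), (65, "/pricing"),
    (40, "/home"),   # out of order, but within the 30 s bound: counted
    (95, "/home"),   # pushes the watermark to 65, closing window [0, 60)
    (20, "/home"),   # arrives after its window fired: dropped
]
emitted, dropped = tumbling_counts(arrivals)
# emitted == [(0, "/home", 3)], dropped == [(20, "/home")]
```

The out-of-order event at second 40 is still counted because the watermark had only reached 35 when it arrived, while the event at second 20 is discarded because window [0, 60) had already emitted its result.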

Real-Time OLAP Engines

Cloud data warehouses (Snowflake, BigQuery) are optimised for complex analytical queries on large historical datasets. They are not optimised for sub-second query latency on continuously updated data. For real-time analytics with sub-second query performance, specialised real-time OLAP engines are used.

**ClickHouse** is an open-source column-oriented DBMS designed for fast analytical queries on large datasets. ClickHouse ingests data continuously (via Kafka engine tables or batched inserts) and executes aggregation queries at very low latency, often well under a second for aggregations over billions of rows, depending on query shape and hardware. Its MergeTree family of storage engines stores data sorted by a sorting key, which makes range scans fast; variants such as ReplacingMergeTree deduplicate rows with the same key during background merges.

ClickHouse is among the most widely deployed real-time OLAP engines. It is operationally more complex than managed cloud services but fast and cost-efficient. ClickHouse Cloud provides a managed version.
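The benefit of storing data sorted by key can be shown with a small Python sketch. It is a deliberate simplification and not how ClickHouse is implemented (ClickHouse stores columns separately and uses a sparse index over blocks of rows); the table contents and the function name `views_for_site` are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Rows kept sorted by the key (site_id, ts), as a sorted data part would be.
rows = sorted([
    (2, 100, 7), (1, 130, 3), (1, 100, 5),
    (3, 110, 1), (1, 160, 4), (2, 140, 2),
])  # (site_id, ts, views)
keys = [(site_id, ts) for site_id, ts, _ in rows]


def views_for_site(site_id: int, ts_from: int, ts_to: int) -> int:
    """Sum views for one site over [ts_from, ts_to].

    Binary search finds the start and end of the matching range, so only a
    contiguous slice is read instead of the whole table.
    """
    lo = bisect_left(keys, (site_id, ts_from))
    hi = bisect_right(keys, (site_id, ts_to))
    return sum(views for _, _, views in rows[lo:hi])


total = views_for_site(1, 100, 130)  # reads two rows, returns 8
```

A query that filters on a prefix of the sorting key touches only the matching slice, which is why choosing a key that matches the dominant query filters matters so much for latency.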

**Apache Druid** is an open-source real-time analytics database designed for interactive queries on event data at scale. Druid ingests from Kafka natively, pre-aggregates data during ingestion (configurable roll-up), and serves queries with sub-second latency. It keeps persisted segments in deep storage (S3 or HDFS) and serves them from Historical processes, while streaming ingestion tasks serve the most recent data. This separation makes Druid architecturally complex but scalable.

Druid is used by organisations with extremely high event volumes (billions of events per day) and complex real-time segmentation requirements — ad tech, user behaviour analytics at scale, IoT telemetry.
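Ingestion-time roll-up, mentioned above, is easy to picture in code. The following is a minimal Python sketch of the idea rather than Druid's ingestion spec or API; the event fields (`country`, `device`, byte counts) and the function name `roll_up` are assumptions made for the example.

```python
from collections import defaultdict


def roll_up(events, granularity_seconds=60):
    """Pre-aggregate raw events into one row per (time bucket, dimensions).

    `events` is an iterable of (ts_seconds, country, device, num_bytes).
    """
    rolled = defaultdict(lambda: {"count": 0, "bytes_sum": 0})
    for ts, country, device, num_bytes in events:
        bucket = ts - (ts % granularity_seconds)  # truncate the timestamp
        row = rolled[(bucket, country, device)]
        row["count"] += 1
        row["bytes_sum"] += num_bytes
    return dict(rolled)


raw = [
    (3, "GB", "ios", 100),
    (42, "GB", "ios", 250),
    (59, "GB", "android", 80),
    (61, "GB", "ios", 40),
]
stored = roll_up(raw)
# Four raw events become three stored rows, for example:
# stored[(0, "GB", "ios")] == {"count": 2, "bytes_sum": 350}
```

The trade-off is that individual events can no longer be retrieved once rolled up; only the chosen dimensions and metrics at the chosen time granularity survive.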

**Apache Pinot** was developed at LinkedIn for real-time analytics on user-facing applications. It handles both offline (batch-loaded) and real-time (Kafka-ingested) data behind a single logical table, provides upsert support (important for event correction), and achieves sub-second query latency through its indexes, including star-tree indexes that pre-aggregate data.

Pinot is well-suited for user-facing analytics features embedded in product UIs — analytics dashboards visible to customers, real-time leaderboards, personalised metrics.
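Upsert support matters because corrections arrive as new events for an existing key. The Python sketch below shows the general "latest record per primary key wins" idea; it is not Pinot's API, and the field names (`order_id`, `updated_at`, `status`) are hypothetical.

```python
def apply_upserts(events):
    """Return the latest record per primary key.

    `events` are dicts with assumed fields: order_id (primary key),
    updated_at (comparison value), and status.
    """
    latest = {}
    for event in events:
        current = latest.get(event["order_id"])
        # A record replaces the stored one only if it is at least as new,
        # so a late-arriving older event cannot overwrite a correction.
        if current is None or event["updated_at"] >= current["updated_at"]:
            latest[event["order_id"]] = event
    return latest


stream = [
    {"order_id": "A", "updated_at": 1, "status": "placed"},
    {"order_id": "B", "updated_at": 2, "status": "placed"},
    {"order_id": "A", "updated_at": 5, "status": "refunded"},
    {"order_id": "A", "updated_at": 3, "status": "shipped"},  # arrives late
]
state = apply_upserts(stream)
# state["A"]["status"] == "refunded"; state["B"]["status"] == "placed"
```

Without upserts, a query over the raw events would count order A three times, and the dashboard would need its own deduplication logic at query time.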

Architecture Patterns

**Lambda architecture:** Parallel batch and speed layers. Batch layer rebuilds complete, accurate data on a schedule; speed layer provides low-latency approximate results for recent data. Query layer merges both. Operationally complex; the two layers frequently diverge and require reconciliation.

**Kappa architecture:** Single stream processing layer handles both real-time and historical processing by replaying Kafka logs. Simpler operationally; requires Kafka to retain enough history for full reprocessing.

**Streaming into the warehouse:** For near-real-time (not true real-time) use cases, continuous ingestion into Snowflake or BigQuery via Kafka Connect or Dataflow is often sufficient. End-to-end latency is typically 2–10 minutes; warehouse queries return results in seconds to minutes. This approach avoids operating a specialised OLAP engine.
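The query-time merge in a Lambda architecture is where much of its complexity lives. A minimal Python sketch, assuming the batch layer produces per-page totals up to a cutoff timestamp and the speed layer supplies raw events; the function name and data shapes are illustrative assumptions:

```python
def merged_view(batch_counts, speed_events, batch_cutoff):
    """Combine batch-layer totals with speed-layer events after the cutoff.

    batch_counts: {page_url: count} covering events with ts < batch_cutoff.
    speed_events: iterable of (ts, page_url) from the streaming layer.
    """
    merged = dict(batch_counts)
    for ts, page_url in speed_events:
        if ts >= batch_cutoff:  # older events are already in the batch view
            merged[page_url] = merged.get(page_url, 0) + 1
    return merged


batch = {"/home": 100, "/pricing": 40}
recent = [(990, "/home"), (1000, "/home"), (1005, "/docs")]
view = merged_view(batch, recent, batch_cutoff=1000)
# view == {"/home": 101, "/pricing": 40, "/docs": 1}
```

The event at timestamp 990 is skipped because the batch layer already counted it. If the two layers disagree about the cutoff or compute the metric differently, the merged number is wrong, which is the divergence problem noted above and the main argument for Kappa.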

When You Actually Need Real-Time

The genuine cases for sub-minute latency analytics:

**Fraud detection and risk scoring:** A payment decision that must be made in under 200ms requires real-time feature computation. Batch cannot help here.

**Operational monitoring with alerting:** A dashboard showing "orders failing checkout in the last 60 seconds" needs real-time data. An hourly report does not catch the incident.

**Live auction, trading, or bidding systems:** Latency directly affects revenue. Real-time is required.

**User-facing analytics in product UIs:** If your customers see an analytics dashboard in your product that shows their activity in "real-time," sub-minute latency is a product requirement.

**Most other cases:** Revenue reporting, product analytics, cohort analysis, data science exploration — these are batch or near-real-time problems. 5-minute latency is fine. Hourly is often fine. Building a Kafka + Flink + ClickHouse stack for a BI dashboard that analytics teams use internally is almost always over-engineering.

If you are considering a real-time analytics build, start by asking what decision changes in the next 60 seconds that cannot wait for the next 15-minute batch. If the honest answer is none, start with near-real-time streaming into your existing data warehouse before adding specialised infrastructure.

For data teams evaluating streaming and real-time analytics architecture, our data architecture consulting practice designs systems matched to actual business latency requirements — contact us to discuss your analytics infrastructure requirements.
