
Google Cloud Dataflow: Unified Stream and Batch Processing

James Okafor
Lead Data Engineer
December 22, 2027 · 12 min read

Google Cloud Dataflow is the fully managed runner for Apache Beam on Google Cloud. Beam provides a unified programming model for both stream and batch processing pipelines. This guide covers the Apache Beam model, Dataflow execution, windowing for stream processing, integration with Pub/Sub and BigQuery, and when Dataflow is the right choice for your data pipeline requirements.

What Google Cloud Dataflow Is

Google Cloud Dataflow is a fully managed service for executing Apache Beam pipelines on Google infrastructure. Beam is a unified programming model where the same pipeline code runs in both batch and streaming modes — the execution environment (runner) handles the operational differences.

For data engineers on Google Cloud, Dataflow provides: horizontal autoscaling that adds and removes workers as your pipeline's data volume changes, streaming-native execution with exactly-once processing semantics, and deep integration with Pub/Sub, BigQuery, Cloud Storage, and Bigtable.

The Apache Beam Model

Apache Beam's core abstraction is the PCollection — an immutable, distributed dataset. Pipelines are sequences of transforms applied to PCollections.

**Pipeline construction** (Python SDK):

    import apache_beam as beam

    # parse_row, table_ref, and schema are assumed to be defined elsewhere.
    with beam.Pipeline() as p:
        result = (
            p
            | 'Read' >> beam.io.ReadFromText("gs://bucket/input.txt")
            | 'Parse' >> beam.Map(parse_row)
            | 'Filter' >> beam.Filter(lambda x: x['amount'] > 0)
            | 'Write' >> beam.io.WriteToBigQuery(table_ref, schema=schema)
        )

The pipeline definition is lazy — it builds a directed acyclic graph of transforms. Execution happens when the pipeline runs, either locally (for testing) or on a runner like Dataflow.
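
The pipeline above refers to a `parse_row` function without defining it. A minimal sketch of what it might look like follows, assuming each input line is a two-column CSV record of the form `id,amount`; that layout and the `is_positive` helper are illustrative assumptions, not part of the original pipeline.

```python
import csv


def parse_row(line: str) -> dict:
    """Parse one CSV line of the assumed form 'id,amount' into a dict."""
    row_id, amount = next(csv.reader([line]))
    return {"id": row_id.strip(), "amount": float(amount)}


def is_positive(row: dict) -> bool:
    """Same predicate as the 'Filter' step: keep rows with amount > 0."""
    return row["amount"] > 0
```

Because both helpers are plain functions, they can be unit-tested without a runner and then passed to `beam.Map` and `beam.Filter` unchanged.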

**Core transforms**:

- beam.Map: one-to-one element transformation

- beam.FlatMap: one-to-many transformation in which each input yields zero or more outputs (like explode)

- beam.Filter: keep only the elements that match a predicate

- beam.GroupByKey: group a key-value PCollection by key

- beam.CombinePerKey: aggregate within each key group

- beam.ParDo: general-purpose transform using a DoFn class for complex logic
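
To make the semantics of these transforms concrete, here is a plain-Python analogue on a tiny in-memory list. It is not Beam code and runs on a single machine; the sample sentences are made up, and the comments name the Beam transform each step corresponds to.

```python
from collections import defaultdict

lines = ["to be or", "not to be"]

# FlatMap: each line produces zero or more words.
words = [word for line in lines for word in line.split()]

# Filter: keep only the elements that satisfy the predicate.
kept = [word for word in words if word != "or"]

# Map: exactly one output per input, here a (key, value) pair.
pairs = [(word, 1) for word in kept]

# GroupByKey: collect all values that share a key.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# CombinePerKey(sum): reduce the values of each key to a single result.
counts = {key: sum(values) for key, values in grouped.items()}
```

In a real pipeline, CombinePerKey is usually preferable to GroupByKey followed by a manual reduction, because the runner can pre-aggregate values on each worker before shuffling them.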

**DoFn for transforms that hold per-instance resources**:

    class EnrichFn(beam.DoFn):
        def setup(self):
            # create_api_client() is a placeholder for your own client factory.
            self.client = create_api_client()

        def process(self, element):
            enriched = self.client.lookup(element['id'])
            yield {**element, 'metadata': enriched}

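setup() runs once per DoFn instance, before any elements are processed, which makes it the right place to initialise connections; a worker may create more than one instance. process() runs once per element and yields zero or more output elements.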
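
The order of lifecycle calls can be illustrated with a small simulation. This is a simplified model of how a runner drives a single DoFn instance, not Beam's actual implementation: `LoggingFn` is a hypothetical stand-in that only borrows the method names of `beam.DoFn`, and `drive` is an invented helper.

```python
class LoggingFn:
    """Stand-in with the same lifecycle method names as beam.DoFn."""

    def __init__(self):
        self.calls = []

    def setup(self):
        self.calls.append("setup")

    def start_bundle(self):
        self.calls.append("start_bundle")

    def process(self, element):
        self.calls.append(f"process({element})")
        yield element

    def finish_bundle(self):
        self.calls.append("finish_bundle")

    def teardown(self):
        self.calls.append("teardown")


def drive(fn, bundles):
    """Simplified model: setup once, then each bundle, then teardown."""
    outputs = []
    fn.setup()
    for bundle in bundles:
        fn.start_bundle()
        for element in bundle:
            outputs.extend(fn.process(element))
        fn.finish_bundle()
    fn.teardown()
    return outputs
```

Calling `drive(LoggingFn(), [[1, 2], [3]])` records one setup call, one process call per element, and one teardown call, which is why expensive resources such as API clients belong in setup() rather than process().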

Streaming with Beam and Dataflow

Beam's streaming model handles three core challenges:

**Event time vs processing time**: Event time is when an event occurred; processing time is when the pipeline observes it. Network latency, batch uploads, and mobile connectivity gaps cause the two to diverge. Beam assigns elements to windows by event time rather than processing time, so aggregates reflect when events actually happened.

**Watermarks**: Beam estimates progress in event time using watermarks. A watermark at time T is the runner's estimate that all events with timestamps before T have been received. The watermark advances as new events arrive. Events that arrive after the watermark has passed their timestamp are late data.
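
A toy model helps show how a watermark separates on-time from late events. The heuristic used here (largest event time seen so far minus a fixed delay) is an assumption made for illustration; real runners derive watermarks from source-specific signals, and `classify_late` is a hypothetical helper.

```python
def classify_late(event_times, max_delay):
    """Label each event 'on_time' or 'late' under a toy watermark.

    event_times: event timestamps in seconds, listed in ARRIVAL order.
    The toy watermark is (largest event time seen so far) - max_delay.
    """
    watermark = float("-inf")
    labels = []
    for ts in event_times:
        # An event is late if the watermark has already passed its timestamp.
        labels.append("late" if ts < watermark else "on_time")
        # The watermark only ever moves forward.
        watermark = max(watermark, ts - max_delay)
    return labels
```

With `classify_late([100, 105, 98, 130, 110], max_delay=10)`, the event stamped 98 is still on time because the watermark has only reached 95, while the event stamped 110 is late because the event at 130 has already pushed the watermark to 120.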

**Windows**: Group streaming events into finite time buckets for aggregation:

- Fixed windows: non-overlapping buckets (hourly, daily)

- Sliding windows: overlapping windows (last 30 minutes, updated every 5 minutes)

- Session windows: activity-based — group events with gaps smaller than a threshold into one session per user

Example: count events per key in 1-hour fixed windows:

    with_windows = (
        events
        | 'Window' >> beam.WindowInto(beam.window.FixedWindows(3600))  # 3600 s = 1 hour
        | 'Count' >> beam.combiners.Count.PerKey()  # expects (key, value) pairs
    )
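
The three window types differ in how a timestamp maps to one or more windows. The following plain-Python sketch shows that arithmetic for integer timestamps in seconds. The function names are illustrative and this is not how Beam implements windowing internally, although the half-open `[start, end)` intervals follow Beam's convention.

```python
def fixed_window(ts, size):
    """The single [start, end) fixed window that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts, size, period):
    """Every [start, end) window of length size, starting each period, containing ts."""
    last_start = ts - ts % period
    return [(s, s + size) for s in range(last_start, ts - size, -period)]


def session_windows(timestamps, gap):
    """Merge one key's timestamps into sessions split by gaps of at least `gap`."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # Overlaps the current session: extend its end.
            sessions[-1] = (sessions[-1][0], ts + gap)
        else:
            sessions.append((ts, ts + gap))
    return sessions
```

For example, with 30-minute windows sliding every 5 minutes (`size=1800`, `period=300`), every event belongs to six windows, which is worth remembering when estimating the cost of downstream aggregations.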

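**Triggers and late data**: Triggers control when window results are emitted. The default trigger fires once when the watermark passes the end of the window. For dashboards that need results before the window closes, add early firings that fire periodically. The accumulation mode controls what each firing contains: in accumulating mode (AccumulationMode.ACCUMULATING in the Python SDK) each pane includes everything seen so far in the window, while in discarding mode (AccumulationMode.DISCARDING) each pane contains only the elements that arrived since the previous firing. Late data is dropped by default; set allowed_lateness on WindowInto to keep a window open for late elements.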
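
The difference between the two accumulation modes is easiest to see on a per-window sum that fires several times. In the sketch below, `pane_values` is a hypothetical toy that models what each firing emits, and `early_firing_windowing` shows one possible Beam configuration; the 60-second early firing interval and the 10-minute allowed lateness are assumed values, not recommendations.

```python
def pane_values(increments, accumulating):
    """Value emitted at each trigger firing for a per-window sum.

    increments[i] is the sum of the elements that arrived between
    firing i-1 and firing i.
    """
    total, emitted = 0, []
    for inc in increments:
        # Accumulating panes carry the running total; discarding panes reset.
        total = total + inc if accumulating else inc
        emitted.append(total)
    return emitted


def early_firing_windowing():
    """Beam configuration sketch; imported lazily so the toy runs without Beam."""
    import apache_beam as beam
    from apache_beam.transforms.trigger import (
        AccumulationMode, AfterCount, AfterProcessingTime, AfterWatermark)

    return beam.WindowInto(
        beam.window.FixedWindows(3600),
        # Fire early every 60 s of processing time, then once per late element.
        trigger=AfterWatermark(early=AfterProcessingTime(60), late=AfterCount(1)),
        accumulation_mode=AccumulationMode.ACCUMULATING,
        allowed_lateness=600,  # seconds
    )
```

Accumulating mode suits sinks that overwrite the previous value for a window, while discarding mode suits sinks that add each pane to an existing total.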

Reading from and Writing to GCP Services

**Pub/Sub source** (streaming):

    messages = p | beam.io.ReadFromPubSub(topic="projects/project/topics/events")

Each Pub/Sub message arrives as bytes, so decode it and parse the JSON in a subsequent Map. Reading from Pub/Sub requires running the pipeline in streaming mode (--streaming).
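
A minimal sketch of that decoding step follows, assuming the payload is a UTF-8 encoded JSON object. `parse_message` is an illustrative name; it is written as a generator so that it can be used with `beam.FlatMap` and emit nothing for malformed input.

```python
import json


def parse_message(data: bytes):
    """Yield the decoded JSON object, or nothing if the payload is malformed."""
    try:
        event = json.loads(data.decode("utf-8"))
    except ValueError:
        # Covers both invalid UTF-8 and invalid JSON.
        return
    if isinstance(event, dict):
        yield event
```

It would be applied as `messages | beam.FlatMap(parse_message)`. Silently dropping bad messages keeps the sketch short; a production pipeline would more likely route them to a dead-letter output for inspection.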

**BigQuery sink**:

    records | beam.io.WriteToBigQuery(
        'project:dataset.table',
        schema={'fields': [{'name': 'id', 'type': 'STRING'}, {'name': 'ts', 'type': 'TIMESTAMP'}]},
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED
    )

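By default, WriteToBigQuery uses streaming inserts (the BigQuery streaming API) for unbounded, streaming PCollections and load jobs for bounded, batch ones. Streaming inserts make rows queryable within seconds but are billed for the data inserted; load jobs that use the shared slot pool carry no ingestion charge, but rows become visible only after the job completes.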
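
The write method can also be chosen explicitly through the `method` argument of `WriteToBigQuery`. The sketch below pairs that with a small row-shaping helper for the two-column schema shown earlier. The `epoch_seconds` input field and both function names are assumptions made for the example.

```python
from datetime import datetime, timezone


def to_bq_row(event: dict) -> dict:
    """Shape an event into the (id STRING, ts TIMESTAMP) schema."""
    ts = datetime.fromtimestamp(event["epoch_seconds"], tz=timezone.utc)
    return {"id": str(event["id"]), "ts": ts.isoformat()}


def write_rows(rows, table, use_load_jobs):
    """Attach a BigQuery sink to a PCollection of row dicts.

    Beam is imported lazily so to_bq_row can be tested without it installed.
    With an unbounded input, FILE_LOADS also needs a triggering_frequency.
    """
    import apache_beam as beam

    methods = beam.io.WriteToBigQuery.Method
    return rows | "WriteBQ" >> beam.io.WriteToBigQuery(
        table,
        schema="id:STRING,ts:TIMESTAMP",
        method=methods.FILE_LOADS if use_load_jobs else methods.STREAMING_INSERTS,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
    )
```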

**Cloud Storage source** (batch):

    lines = p | beam.io.ReadFromText("gs://bucket/prefix/*.csv")

The file pattern supports wildcards, and Dataflow distributes the file reads across workers.

**Bigtable for per-element lookups**: To look up each element against a Bigtable table, use a Bigtable client inside a DoFn following the setup/process pattern: initialise the client once in setup(), then call it per element in process().

Dataflow Runner Features

**Autoscaling**: Dataflow monitors pipeline throughput and adjusts the worker count automatically, scaling up when backlog grows and scaling down when throughput decreases. Set --max_num_workers (maxNumWorkers in the Java SDK) to cap cost.

**Dataflow Flex Templates**: Package pipeline code and dependencies into a container image. Flex Templates enable launching parameterised pipelines from Cloud Scheduler, Cloud Run, or API calls without managing runner dependencies.

**Dataflow SQL**: Write streaming SQL pipelines that Dataflow executes as Beam under the hood, aimed at teams more comfortable with SQL than Python. Google has deprecated Dataflow SQL, so check its current availability before building on it.

**Streaming Engine**: Offloads state and timer processing to a managed service outside of worker VMs — reduces memory pressure on workers and enables faster scaling. Enable with --enable_streaming_engine.
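
The runner features above are switched on through pipeline options. As a rough sketch, the helper below assembles the command-line flags for a streaming job; the function name and its parameters are hypothetical, while the flags themselves are standard Python SDK options for Dataflow.

```python
def dataflow_flags(project, region, temp_location, max_workers, streaming=True):
    """Build the option flags for a Dataflow job (illustrative helper)."""
    flags = [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        f"--max_num_workers={max_workers}",  # upper bound for autoscaling
    ]
    if streaming:
        flags += ["--streaming", "--enable_streaming_engine"]
    return flags
```

The resulting list can be passed to `apache_beam.options.pipeline_options.PipelineOptions(flags)` and then to `beam.Pipeline(options=...)`. Keeping the flags in one function makes it easy to review the cost cap alongside the rest of the job configuration.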

Dataflow vs Alternatives

**Dataflow vs Spark (Dataproc)**: Beam/Dataflow is unified batch and stream with autoscaling and no cluster management. Spark on Dataproc gives more control, broader ecosystem, and PySpark familiarity. Choose Dataflow for streaming-first and unified batch/stream pipelines. Choose Dataproc for large-scale batch jobs with complex PySpark transformations or existing Spark expertise.

**Dataflow vs Cloud Functions/Run**: Functions and Run suit event-driven processing of individual records. Dataflow suits high-throughput pipelines processing millions of records with windowed aggregations and stateful operations.

**Dataflow vs BigQuery scheduled queries**: For simple periodic aggregations that run entirely within BigQuery, scheduled queries are simpler and cost nothing beyond the queries themselves. Dataflow is warranted when you need Python logic, external API calls, joins across sources, or stateful processing not expressible in SQL.

**Dataflow vs Pub/Sub + Dataproc Streaming**: Dataflow reads from Pub/Sub and handles the stream processing in one managed service, with no cluster to run alongside it. For multi-stage stream processing with BigQuery sinks, Dataflow is the natural choice on GCP.

Beam Portability and Multi-Cloud

Beam pipelines are not tied to Dataflow. The same pipeline code runs on:

- Direct Runner: local execution for development and testing

- Dataflow Runner: managed GCP execution

- Spark Runner: Spark cluster execution

- Flink Runner: Flink cluster execution

This portability means Beam pipelines are not vendor-locked. Teams with multi-cloud requirements can run the same pipeline on Dataflow in GCP and Flink in a self-managed environment.
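
In practice, switching runners is mostly a matter of changing the `--runner` option while the pipeline code stays the same. The mapping below is a small illustrative sketch: the target names on the left are invented for the example, and every runner other than the Direct Runner needs additional options of its own (cluster addresses, project settings) that are not shown.

```python
# Deployment target (hypothetical labels) -> Beam runner name.
RUNNERS = {
    "local": "DirectRunner",
    "gcp": "DataflowRunner",
    "flink": "FlinkRunner",
    "spark": "SparkRunner",
}


def runner_flag(target: str) -> str:
    """Return the --runner flag for a deployment target."""
    if target not in RUNNERS:
        raise ValueError(f"unknown target: {target!r}")
    return f"--runner={RUNNERS[target]}"
```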

Our data architecture practice designs and implements Google Cloud data pipelines using Dataflow, Pub/Sub, and BigQuery — contact us to discuss your Google Cloud data requirements.
