
GCP Data Architecture: Building a Modern Data Platform on Google Cloud

Austin Duncan
Managing Director & Principal Data Architect
July 2, 2026 · 12 min read

Google Cloud has the strongest integrated data platform of any cloud provider. Here is the reference architecture for building on GCP — BigQuery, Dataflow, Pub/Sub, and the native tooling that makes GCP compelling.

The quick answer

Google Cloud Platform has the strongest integrated data platform of any cloud provider — BigQuery is the most capable cloud data warehouse, Pub/Sub and Dataflow provide best-in-class streaming, and the native integration between GCP services makes BigQuery-centric architectures particularly clean. The reference GCP data stack: Cloud Storage (object storage), Pub/Sub or Datastream (ingestion), BigQuery (warehouse + analytics), Dataflow or Spark on Dataproc (heavy processing), dbt or Dataform (transformation), Cloud Composer (Airflow for orchestration), and Looker or Looker Studio for BI. GCP is the strongest choice for BigQuery-first architectures and for organisations already using Google Workspace.

The GCP data service landscape

**Storage**: Google Cloud Storage (GCS) — equivalent to S3, the primary object storage layer for raw and intermediate data. Standard, Nearline, Coldline, and Archive tiers for different access frequency profiles.

**Ingestion**: Cloud Pub/Sub (fully managed event streaming with strong Google ecosystem integration; it exposes its own API rather than the Kafka protocol, though Kafka Connect connectors are available), Datastream (near-real-time CDC from Oracle, MySQL, PostgreSQL, and AlloyDB to BigQuery or GCS), Storage Transfer Service (batch transfer from AWS S3 or other cloud storage into GCS), and Fivetran or Airbyte (third-party ingestion tools that support BigQuery as a destination).

**Storage / Warehouse**: BigQuery is the unified data warehouse and data lake on GCP. It stores data in its native columnar format (Capacitor) and, through BigLake, can also query open table formats (Delta, Iceberg, Hudi) on GCS. Bigtable is a separate service: a wide-column NoSQL store for operational use cases. For most analytical workloads, BigQuery is the primary storage and compute layer.

**Processing**: Cloud Dataflow (managed Apache Beam for both batch and streaming processing — the native GCP stream processing service), Dataproc (managed Spark, Flink, Hadoop on GCP), BigQuery native processing (SQL, ML, and Python via BigQuery Remote Functions).

**Transformation**: Dataform (Google's native dbt alternative, now part of BigQuery Studio — defines SQL transformation workflows with dependency management, testing, and version control), dbt (third-party, works with BigQuery via dbt-bigquery adapter, more feature-complete than Dataform for complex projects), BigQuery scheduled queries (simple scheduled SQL for lighter transformation workloads).

**Orchestration**: Cloud Composer (managed Apache Airflow on GCP — the equivalent of MWAA on AWS), Workflows (serverless workflow orchestration for GCP service-centric pipelines), Cloud Scheduler (simple cron-style job scheduling for lightweight orchestration needs).

**BI**: Looker (Google Cloud-native enterprise BI with the LookML semantic layer, covered in Looker vs Power BI), Looker Studio (free, browser-based BI tool formerly known as Google Data Studio, good for simple dashboards), and third-party BI tools (Tableau and Power BI both connect to BigQuery natively).

**ML and AI**: Vertex AI (managed ML platform for training, deployment, and serving), BigQuery ML (train and run ML models directly in SQL inside BigQuery — logistic regression, XGBoost, deep neural networks), Gemini integration in BigQuery and Vertex.

The GCP-native data stack

**Ingestion**: Pub/Sub for streaming event data (application events, clickstream, IoT). Datastream for CDC from operational databases. Fivetran for SaaS source connectors (Salesforce, HubSpot, NetSuite) with BigQuery as destination.

**Storage**: GCS as the landing zone for raw files. BigQuery as the primary analytical store.

**Transformation**: Dataform for lighter SQL transformation needs (it is deeply integrated into BigQuery and requires no external hosting). dbt with dbt-bigquery adapter for complex transformation projects that need dbt's full feature set (testing library, documentation, CI/CD integration).

**Orchestration**: Cloud Composer (managed Airflow) for multi-step pipelines with complex dependencies. BigQuery scheduled queries for simple time-based SQL execution.

**BI**: Looker for governed, semantic-layer-based enterprise BI. Looker Studio for lightweight, exploratory dashboards.

BigQuery: the centre of gravity

BigQuery's architecture is fundamentally different from other cloud warehouses. Key characteristics:

**Serverless**: BigQuery has no cluster to provision or size. You submit a query; Google allocates compute automatically. On-demand queries are charged per TB scanned; capacity reservations (slot-based) are charged by committed slot-hours.

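**Separation of storage and compute**: BigQuery storage is billed separately from compute. Storage is cheap: roughly $20/TB/month for active logical storage, and about half that for long-term storage (tables or partitions left unmodified for 90 consecutive days). Compute is billed per query (on-demand) or by reserved slot capacity.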

**Native BigLake / open format support**: BigQuery can query Delta Lake and Iceberg files on GCS directly using BigLake External Tables, enabling a unified query surface across BigQuery-native tables and open format lake data without copying data.

**BigQuery ML**: execute ML model training and inference using SQL. BQML reduces the need to export data for ML training — models trained in BQML run in BigQuery compute without an external ML platform.

**Streaming inserts**: BigQuery accepts row-level streaming writes through the legacy streaming API (tabledata.insertAll) and the Storage Write API, which offers higher throughput and is the recommended path for new pipelines. Combined with Pub/Sub and Dataflow, BigQuery can serve as a near-real-time analytics target.
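As a rough illustration of the BigQuery ML characteristic above, the sketch below trains and applies a logistic regression model from Python. It assumes the `google-cloud-bigquery` client library and default credentials; every dataset, table, model, and column name (`analytics.customers`, `churned`, and so on) is a hypothetical placeholder.

```python
"""Sketch: train and score a BigQuery ML model from Python.

All dataset, table, model, and column names are hypothetical placeholders.
"""

# Hypothetical training statement: logistic regression on a churn label.
TRAIN_SQL = """
CREATE OR REPLACE MODEL `analytics.churn_model`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['churned']
) AS
SELECT tenure_months, monthly_spend, support_tickets, churned
FROM `analytics.customers`
"""

# Hypothetical scoring statement: ML.PREDICT runs inference inside BigQuery.
PREDICT_SQL = """
SELECT customer_id, predicted_churned
FROM ML.PREDICT(
  MODEL `analytics.churn_model`,
  (SELECT customer_id, tenure_months, monthly_spend, support_tickets
   FROM `analytics.new_customers`)
)
"""


def train_and_predict(project_id: str) -> list[dict]:
    """Run both statements and return the predictions as plain dicts."""
    # Imported lazily so this module loads without the client library installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    client.query(TRAIN_SQL).result()  # blocks until training finishes
    rows = client.query(PREDICT_SQL).result()
    return [dict(row.items()) for row in rows]
```

Both statements run on BigQuery compute, so the training data never leaves the warehouse. Training queries are billed like other queries, so it is worth checking the cost of a large training set before scheduling this.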

For BigQuery performance optimisation, see BigQuery Performance Optimization. For the BigQuery vs Snowflake comparison, see BigQuery vs Snowflake.

Dataflow and streaming architecture

Dataflow (managed Apache Beam) is GCP's primary stream processing service. Beam unifies batch and streaming processing in a single SDK — the same pipeline code runs in batch mode (processing a bounded dataset) or streaming mode (processing an unbounded event stream). Dataflow manages the underlying compute cluster automatically.

**Pub/Sub → Dataflow → BigQuery** is the canonical GCP streaming pattern: application events publish to Pub/Sub topics, Dataflow consumes them, applies transformations (windowing, aggregation, enrichment), and writes results to BigQuery. The streaming pipeline achieves sub-minute latency from event publication to query availability in BigQuery.
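The canonical pattern above might look roughly like the following Apache Beam pipeline, which counts events per type in one-minute fixed windows. It is a minimal sketch, assuming the `apache-beam[gcp]` package, JSON message payloads with an `event_type` field, and a caller-supplied subscription path and table name; all of those are illustrative assumptions rather than details from this article.

```python
"""Sketch: Pub/Sub -> Dataflow (Apache Beam) -> BigQuery.

Subscription, table, and field names are hypothetical placeholders.
"""
import json


def parse_event(message: bytes) -> tuple[str, int]:
    """Decode a JSON Pub/Sub payload into an (event_type, 1) pair for counting."""
    event = json.loads(message.decode("utf-8"))
    return event.get("event_type", "unknown"), 1


def to_row(pair: tuple[str, int]) -> dict:
    """Shape an (event_type, count) pair as a BigQuery row."""
    event_type, count = pair
    return {"event_type": event_type, "event_count": count}


def run(subscription: str, table: str) -> None:
    """Count events per type in one-minute windows and append them to BigQuery."""
    # Imported lazily so the helpers above work without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Runner, project, and region options are also needed to run on Dataflow.
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromPubSub(subscription=subscription)
            | "Parse" >> beam.Map(parse_event)
            | "Window" >> beam.WindowInto(beam.window.FixedWindows(60))
            | "Count" >> beam.CombinePerKey(sum)
            | "ToRow" >> beam.Map(to_row)
            | "Write" >> beam.io.WriteToBigQuery(
                table,
                schema="event_type:STRING,event_count:INTEGER",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```

Keeping the parsing and row-shaping logic in plain functions means they can be unit-tested without starting a pipeline, and the same functions could be reused if the job were later run in batch mode over files in GCS.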

For less complex streaming use cases, a Pub/Sub BigQuery subscription (a subscription type that writes messages directly to a BigQuery table) is simpler than Dataflow and appropriate when no transformation is needed.
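For the direct-ingestion route, a subscription of this kind can be created with the Pub/Sub Python client. The sketch below assumes the `google-cloud-pubsub` library, an existing topic, and an existing BigQuery table whose schema is compatible with the messages; the project, topic, subscription, and table identifiers are hypothetical.

```python
"""Sketch: create a Pub/Sub subscription that writes straight to BigQuery.

Project, topic, subscription, and table names are hypothetical placeholders.
"""


def build_subscription_request(
    project_id: str, topic_id: str, subscription_id: str, table: str
) -> dict:
    """Build the create request for a BigQuery subscription.

    `table` uses the `project.dataset.table` form.
    """
    return {
        "name": f"projects/{project_id}/subscriptions/{subscription_id}",
        "topic": f"projects/{project_id}/topics/{topic_id}",
        # write_metadata adds message metadata (such as publish time) to each row.
        "bigquery_config": {"table": table, "write_metadata": True},
    }


def create_bigquery_subscription(request: dict) -> str:
    """Create the subscription and return its resource name."""
    # Imported lazily so the request builder works without the client installed.
    from google.cloud import pubsub_v1

    with pubsub_v1.SubscriberClient() as subscriber:
        subscription = subscriber.create_subscription(request=request)
    return subscription.name
```

With `write_metadata` enabled, the target table needs columns for the metadata fields as well as the payload, and the Pub/Sub service account needs permission to write to the table. Both are worth confirming against the current Pub/Sub documentation before relying on this.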

Dataform vs dbt

Dataform is Google's SQL-based transformation framework, deeply integrated into BigQuery Studio. The choice between Dataform and dbt for GCP deployments:

**Dataform**: tighter BigQuery integration (runs inside BigQuery Studio, no external hosting needed), simpler setup, good for smaller transformation projects. Limited by a smaller feature set than dbt: a less mature testing framework, fewer community packages, and less established CI/CD tooling around its CLI.

**dbt with dbt-bigquery**: more feature-complete (dbt docs, richer testing, established community packages, mature CI/CD tooling). Requires hosting dbt Core (on Cloud Run, GKE, or another compute service) or dbt Cloud subscription. Better choice for larger, more complex transformation projects.

For most GCP deployments, dbt is the stronger choice for production transformation — the feature set advantage over Dataform is material for complex projects. Dataform is appropriate for lighter transformation needs or teams without existing dbt expertise.

Cost considerations for GCP data

**BigQuery on-demand vs reservations**: on-demand pricing ($6.25/TB scanned) is excellent for low-to-moderate query volumes. At roughly 50 TB/day of scan volume, capacity reservations (slot-based pricing, sold through BigQuery editions since flat-rate plans were retired) typically become cheaper. Monitor bytes billed per query via INFORMATION_SCHEMA.JOBS to assess the reservation break-even.
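The break-even arithmetic can be sketched in a few lines of Python. The on-demand rate is the one quoted above; the slot count and slot-hour price are inputs you supply from your own contract or the current price list, and the `region-us` qualifier in the query is an assumption about where the jobs run.

```python
"""Sketch: compare BigQuery on-demand and reservation costs.

The on-demand rate comes from the article; slot count and slot-hour price
are caller-supplied assumptions, not quoted list prices.
"""

ON_DEMAND_USD_PER_TB = 6.25  # rate quoted in this article
HOURS_PER_MONTH = 730  # average number of hours in a month

# Daily billed scan volume for the last 30 days (region qualifier is assumed).
DAILY_SCAN_SQL = """
SELECT
  DATE(creation_time) AS day,
  SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY day
ORDER BY day
"""


def on_demand_monthly_usd(tb_scanned_per_day: float, days: int = 30) -> float:
    """Monthly on-demand cost for a steady daily scan volume."""
    return tb_scanned_per_day * days * ON_DEMAND_USD_PER_TB


def reservation_monthly_usd(slots: int, usd_per_slot_hour: float) -> float:
    """Monthly cost of a reservation that stays on for the whole month."""
    return slots * usd_per_slot_hour * HOURS_PER_MONTH


def break_even_tb_per_day(
    slots: int, usd_per_slot_hour: float, days: int = 30
) -> float:
    """Daily scan volume at which on-demand cost equals the reservation cost."""
    monthly = reservation_monthly_usd(slots, usd_per_slot_hour)
    return monthly / (days * ON_DEMAND_USD_PER_TB)
```

Feeding the average `tib_billed` from the query into `on_demand_monthly_usd` gives the figure to compare against a candidate reservation. The comparison only holds if the chosen slot count can actually run the workload at acceptable latency, so treat the result as a starting point rather than a decision.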

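**GCS storage tiers**: use Standard storage for frequently accessed data, Nearline ($0.01/GB/month vs Standard $0.02/GB/month) for data accessed less than once a month, Coldline for data accessed less than once a quarter, and Archive for data accessed less than once a year. The colder tiers carry retrieval fees and minimum storage durations, so they suit data that is genuinely rarely read. Lifecycle policies can automatically move objects between tiers based on age.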
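A lifecycle policy of the kind described above is a small JSON document. The sketch below builds one in Python; the age thresholds (30, 90, and 365 days) are illustrative assumptions, not recommendations from this article.

```python
"""Sketch: build a GCS lifecycle configuration that tiers objects down by age.

The age thresholds are illustrative assumptions.
"""
import json


def build_lifecycle_config(
    nearline_after_days: int = 30,
    coldline_after_days: int = 90,
    archive_after_days: int = 365,
) -> dict:
    """Return a lifecycle configuration with one SetStorageClass rule per tier."""

    def rule(storage_class: str, age_days: int) -> dict:
        # Each rule moves objects older than age_days to the given class.
        return {
            "action": {"type": "SetStorageClass", "storageClass": storage_class},
            "condition": {"age": age_days},
        }

    return {
        "rule": [
            rule("NEARLINE", nearline_after_days),
            rule("COLDLINE", coldline_after_days),
            rule("ARCHIVE", archive_after_days),
        ]
    }


if __name__ == "__main__":
    # Print the JSON so it can be saved to a file and applied to a bucket.
    print(json.dumps(build_lifecycle_config(), indent=2))
```

The printed JSON can be saved to a file and applied to a bucket, for example with `gsutil lifecycle set lifecycle.json gs://your-bucket` (the bucket name is a placeholder). Generating the policy from code keeps the thresholds reviewable alongside the rest of the platform configuration.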

**Dataflow compute**: Dataflow bills for the worker resources a job consumes (vCPU, memory, and disk, metered over the time workers run), plus data processed by Dataflow Shuffle for batch jobs or Streaming Engine for streaming jobs when those features are in use. Horizontal autoscaling adjusts worker counts based on throughput; monitor autoscaling behaviour to avoid over-provisioned streaming jobs.

For the transformation layer specifics, see What Is dbt and dbt Best Practices. For the Looker BI layer on GCP, see Looker vs Power BI and Looker vs Tableau.

Our data architecture consulting and cloud engineering practices design and implement GCP data platforms — from BigQuery architecture and Dataflow pipeline design through dbt implementation and cost optimisation. Book a free 30-minute audit to discuss your GCP data architecture.
