BlogData Architecture

BigQuery Cost Optimisation: Controlling Cloud Data Warehouse Spend

James Okafor
James Okafor
Lead Data Engineer
·December 8, 202712 min read

BigQuery's on-demand billing model charges per byte scanned. Without careful query and table design, costs can grow unexpectedly — a single unpartitioned query on a multi-terabyte table generates significant spend. This guide covers the BigQuery cost optimisation patterns that teams use to control spend: partitioning, clustering, materialised views, query governance, and the evaluation framework for reserved slots versus on-demand billing.

BigQuery's on-demand pricing charges $5 per TB scanned (at list price). A query that scans 200TB costs $1,000 — and it is easy to write such a query accidentally on a large, unpartitioned table. Cost control in BigQuery requires design decisions at the table level (partitioning, clustering) and query governance at the team level. This guide covers the patterns that produce material cost reduction in production BigQuery environments.

Understanding Bytes Scanned

BigQuery calculates billing based on bytes scanned during query execution, not bytes returned. A query that reads 10TB and returns 1000 rows is billed for 10TB. This means the primary levers for cost reduction are reducing the bytes a query needs to scan — not reducing the query result size.

BigQuery provides a query validator in the UI and --dry_run in the CLI that estimates bytes scanned before running a query. Use this consistently during development: run --dry_run before running an expensive query for the first time.

Partitioning: The Most Impactful Single Optimisation

Partition pruning is the most significant cost reduction available. A query that filters on the partition column scans only the matching partitions, not the full table.

For a 10TB events table partitioned daily:

- A query for yesterday's events scans 1/365 of the table — roughly 27GB instead of 10TB

- Cost reduction: ~99.7% for point-day queries

Partition types:

- **Ingestion time** — rows automatically partitioned by load timestamp. Simple but prevents backfilling into past partitions.

- **Column-based** — partitioned by a DATE/TIMESTAMP/DATETIME column value. Preferred for tables with a natural event time dimension.

- **Integer range** — for tables without a time dimension but with a numeric column suitable for range partitioning.

**Partition expiration** — configure partition expiration to automatically delete partitions older than a specified number of days. For transactional or operational tables where historical data beyond a rolling window is not needed, partition expiration eliminates the storage accumulation problem without manual cleanup.

**Verify partition pruning** — a partition predicate that BigQuery cannot evaluate at partition selection time (involving subqueries, non-deterministic functions, or JOINs to other tables) will scan all partitions. Use INFORMATION_SCHEMA.JOBS to verify that queries on partitioned tables show partition pruning in the jobs metadata.

Clustering: Secondary Cost Reduction

Clustering co-locates related rows within partitions. For queries that filter on clustered columns, BigQuery reads only the blocks containing relevant data — reducing bytes scanned within the partition.

For a partitioned events table, clustering by user_id and event_type means queries filtering on specific users or event types scan far fewer blocks within each date partition.

Clustering is free to apply and provides cost reduction proportional to the selectivity of queries on the clustered columns. Apply clustering to the 2-4 columns most commonly used in WHERE clause predicates beyond the partition column.

**Automatic reclustering** — BigQuery automatically reclusters tables in the background as data is inserted. No manual maintenance is required.

Materialised Views

BigQuery materialised views precompute and cache query results. When a query matches the materialised view's definition, BigQuery answers it from the cached result rather than scanning the base tables.

For common aggregate queries (daily summary by region and product, for example) that run hundreds of times per day, a materialised view caches the aggregate result and eliminates the per-execution cost:

CREATE MATERIALIZED VIEW analytics.daily_revenue_mv AS

SELECT

DATE(order_timestamp) as order_date,

region,

SUM(revenue) as total_revenue

FROM analytics.orders

GROUP BY 1, 2;

If a BI tool runs "daily revenue by region" 500 times per day against a 1TB orders table, the cost difference is:

- Without materialised view: 500 queries x 1TB scan = 500TB scanned per day

- With materialised view: incremental update cost only

Materialised view refresh is automatic and incremental; only new data since the last refresh is processed.

Column Selection and Table Design

BigQuery is columnar — it scans only the columns referenced in a query. Unnecessary column reads add cost:

- **Never SELECT **** in production queries — explicitly select needed columns. SELECT * on a 100-column table with 500 text columns scans far more data than SELECT id, event_type, created_at.

- **Nested/repeated fields** — BigQuery natively supports STRUCT and ARRAY types. Storing related data as nested structures within a row (rather than in a separate joined table) reduces join overhead and can reduce bytes scanned for queries accessing only the nested field.

- **STRUCT columns** — if only some fields of a STRUCT are needed, querying the specific field (SELECT event.user_id) scans only that nested column, not the full STRUCT.

Query Governance and Cost Attribution

Organisational cost control requires visibility into which queries, users, and projects are generating cost:

**INFORMATION_SCHEMA.JOBS_BY_PROJECT** — the primary query log. Contains bytes processed, billing tier, user, and query text for every completed query:

SELECT

user_email,

SUM(total_bytes_processed) / POW(1024,4) as tb_processed,

COUNT(*) as query_count

FROM region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT

WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)

AND statement_type = 'SELECT'

GROUP BY 1

ORDER BY tb_processed DESC

LIMIT 20;

Review this weekly to identify high-spend users and queries.

**Maximum bytes billed** — set maximum_bytes_billed on queries or at the project level to prevent runaway queries from generating unexpected large bills. A query that exceeds the maximum bytes billed limit is cancelled and returns an error rather than completing and generating the cost.

**Cost-based query approval** — for queries above a bytes threshold (100GB+), require dry-run approval before execution. Implement this as a pre-execution hook in your query tooling.

Reserved Slots vs On-Demand

BigQuery offers two billing models:

**On-demand** — pay per byte scanned. Suitable when query volume is irregular or unpredictable. No commitment.

**Capacity reservations (slots)** — purchase a fixed number of slots (unit of compute capacity). Queries run against a fixed compute pool at no per-query cost. Suitable when query volume is high and predictable; economically attractive when monthly on-demand spend exceeds the slot reservation cost.

The evaluation: calculate your monthly on-demand spend. If it consistently exceeds the cost of equivalent slot reservations (BigQuery pricing calculator provides this comparison), slots provide better economics. Slot reservations are available as 1-year or 3-year commitments; pay-as-you-go slots are available at a premium over commitment pricing.

Our data architecture and cloud engineering practice optimises BigQuery cost and performance for enterprise analytics teams — contact us to discuss BigQuery cost management for your environment.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →