Migrating to BigQuery: Moving Your Data Warehouse to Google Cloud

Migrating a data warehouse to BigQuery involves more than copying tables — it requires SQL dialect conversion, ETL pipeline rewiring, IAM and access control design, BI reconnection, and cost model adjustment. This guide covers the full BigQuery migration lifecycle, covering migrations from Redshift, Snowflake, and on-premises warehouses, with practical guidance on common blockers and validation patterns.

Why Organisations Move to BigQuery

BigQuery's serverless architecture requires no cluster management, scales automatically, and charges per byte scanned rather than per active cluster. For organisations already on Google Cloud, BigQuery eliminates warehouse infrastructure entirely. For organisations migrating from AWS or on-premises, BigQuery's analytics capabilities, native ML integration via BigQuery ML, and Looker/Looker Studio ecosystem make it an attractive target.

Migration drivers: eliminate managed warehouse infrastructure; leverage Google Cloud data ecosystem (Dataflow, Pub/Sub, Vertex AI, Looker); reduce operational overhead; and in some cases, cost optimisation — particularly when migrating from Redshift where reserved instance pricing has become expensive.

Assessment: What You Are Migrating

Before moving data, document the full scope:

**Schema and table inventory**: Query INFORMATION_SCHEMA in the source warehouse to enumerate tables, row counts, sizes, and last-modified dates. Identify tables with no activity in the last 12 months — candidates for archival rather than active migration.

**SQL dialect gap analysis — BigQuery specifics**:

BigQuery SQL has several differences from standard ANSI SQL and other dialects:

- **Backtick-quoted identifiers**: BigQuery uses project.dataset.table notation with optional backtick quoting. Legacy SQL used square brackets; Standard SQL uses backticks.

- **EXCEPT and REPLACE in SELECT**: SELECT * EXCEPT (column_name) excludes a column; SELECT * REPLACE (expression AS column_name) replaces a column's value without listing all other columns explicitly.

- **Date functions**: DATE_ADD, DATE_SUB, DATE_DIFF, DATE_TRUNC — BigQuery uses INTERVAL keyword: DATE_ADD(date, INTERVAL 7 DAY). DATEDIFF becomes DATE_DIFF(date1, date2, DAY).

- **Approximate aggregation**: APPROX_COUNT_DISTINCT, APPROX_QUANTILES, APPROX_TOP_COUNT — highly efficient for large-scale analytics.

- **ARRAY and STRUCT types**: BigQuery natively supports nested repeated records. Migrating flat denormalised tables is straightforward; migrating schemas that use JSON strings to represent nested data may benefit from conversion to native ARRAY/STRUCT.

- **Partitioning syntax**: PARTITION BY DATE(_PARTITIONTIME) or PARTITION BY DATE(event_date) — different from Redshift DISTSTYLE and SORTKEY or Snowflake CLUSTER BY syntax.

**Stored procedure and UDF inventory**: BigQuery supports SQL procedures and JavaScript UDFs. Teradata BTEQ, Oracle PL/SQL, and Redshift Python UDFs all require rewrite. Audit procedure complexity before estimating migration effort.

BigQuery Project and Dataset Structure

Before migrating data, design the target structure:

**Projects**: BigQuery is organised by Google Cloud project. Billing is per project. Most enterprises use separate projects for environments (dev, staging, production) and for billing isolation between teams. Cross-project queries are supported.

**Datasets**: Equivalent to schemas — the container for tables. Design datasets by: data layer (raw, staging, marts) and by domain (finance, sales, operations). Data location (US, EU, specific region) is set at the dataset level and cannot be changed after creation.

**IAM roles**: BigQuery access is controlled via Google Cloud IAM. Key roles:

- bigquery.dataViewer: read table data

- bigquery.dataEditor: read and write data

- bigquery.jobUser: run jobs (required for queries)

- bigquery.user: create datasets and jobs in a project

Avoid project-level roles — grant dataset-level access to specific service accounts and user groups.

Data Migration to BigQuery

**From Redshift**: Export to S3 using Redshift UNLOAD command as Parquet. Transfer S3 files to GCS using Storage Transfer Service. Load from GCS to BigQuery using LOAD DATA or bq load.

**From Snowflake**: Export using COPY INTO to GCS directly (Snowflake supports GCS stage) as Parquet. Load from GCS to BigQuery.

**From on-premises**: Stream via Datastream (Google's CDC service) for operational databases — Datastream reads PostgreSQL/MySQL/Oracle WAL/redo logs and delivers changes to BigQuery via direct integration. For bulk export, export to GCS then load to BigQuery.

**BigQuery Data Transfer Service**: Managed transfers from Google Ads, Google Analytics 4, YouTube, Salesforce, and other SaaS sources directly to BigQuery — no Fivetran or Airbyte needed for Google ecosystem sources.

**Load options**:

- bq load from command line: bq load --source_format=PARQUET dataset.table gs://bucket/path/*.parquet

- LOAD DATA SQL: LOAD DATA INTO dataset.table FROM FILES (format='PARQUET', uris=['gs://bucket/path/*'])

- Storage Write API: streaming inserts via API for real-time pipeline ingestion

Schema Design for BigQuery

BigQuery is columnar with serverless autoscaling — different design optimisations from row-oriented or managed-MPP warehouses:

**Partitioning**: Partition large tables by date, timestamp, or integer range. Queries with partition filters in WHERE clause scan only matching partitions. Always partition tables expected to grow beyond 1-2GB. Use require_partition_filter=true to enforce partition pruning in queries.

**Clustering**: CLUSTER BY 1-4 columns within partitions. BigQuery physically co-locates rows with similar cluster column values, enabling block skipping. Cluster on columns frequently used in WHERE and GROUP BY after the partition column.

**Nested and repeated fields**: BigQuery ARRAY and STRUCT types store related data denormalised in one row. A user row with an ARRAY of purchase events avoids separate fact-dimension joins. Effective for event-level data; requires UNNEST in queries. Avoid if analysts are not comfortable with UNNEST syntax.

**Avoid small tables with high query frequency**: BigQuery minimum billing is 10MB per query per table. For tiny lookup tables queried millions of times daily, denormalise into the querying table.

ETL and Pipeline Rewiring

**Fivetran and Airbyte**: Reconfigure destination to BigQuery. Specify project, dataset, and service account JSON credentials.

**dbt**: Update profiles.yml to BigQuery connection. BigQuery dbt profile uses service account JSON or OAuth. Validate model SQL for BigQuery dialect — particularly date functions and identifier syntax.

**Airflow and Dataflow**: Update connection targets from source warehouse to BigQuery. Use BigQueryInsertJobOperator in Airflow for BigQuery job execution. Use beam.io.WriteToBigQuery in Dataflow pipelines.

**Pub/Sub to BigQuery**: For streaming pipelines, Pub/Sub BigQuery subscriptions write messages directly to a BigQuery table — no Dataflow required for simple stream-to-table landing.

BI Tool Reconnection

**Looker**: BigQuery is Looker's native platform — connection via service account with bigquery.jobUser and dataset-level viewer roles. LookML SQL dialect: StandardSQL. This is the lowest-friction BI reconnection for BigQuery migrations.

**Tableau**: Use the Tableau BigQuery native connector. Requires Google Cloud service account JSON key or OAuth via Google account. Test calculated fields for dialect differences.

**Power BI**: BigQuery connector in Power Query. On-premises gateway requires Google BigQuery ODBC driver. Service account authentication recommended for scheduled refresh.

**Looker Studio**: Native BigQuery connector — free dashboards with direct BigQuery execution. Appropriate for operational reporting and sharing across the organisation without per-user licence cost.

Cost Considerations After Migration

**On-demand queries**: Charged per byte scanned — partition and cluster all large tables; select specific columns rather than SELECT *; use INFORMATION_SCHEMA.JOBS to track per-query costs.

**Storage**: BigQuery storage is significantly cheaper than most managed warehouses. Active storage: $0.02/GB. Long-term storage (tables not modified in 90 days): $0.01/GB automatically.

**Slot reservations**: For consistently high query loads, flat-rate reserved slots may be cheaper than on-demand. Evaluate with 30 days of INFORMATION_SCHEMA.JOBS usage data before committing.

**Streaming inserts**: BigQuery Streaming API charges per row — expensive for high-volume streaming. Prefer Storage Write API (significantly cheaper) for new streaming implementations.

Our data architecture practice manages BigQuery migrations from Redshift, Snowflake, and legacy warehouses — contact us to discuss your Google Cloud data requirements.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →