Cloud data infrastructure costs grow faster than most organisations expect. Compute waste, unmanaged storage, over-provisioned warehouses, and unoptimised queries are the primary culprits. Here is the diagnostic framework and the specific controls that reduce costs.
## The quick answer
The primary culprits are compute that runs when no one is using it, storage that accumulates without governance, queries that scan more data than necessary, and environments (dev, staging, test) that are sized like production. The good news: most cloud data cost problems are fixable with configuration changes, not architectural rewrites. A systematic cost review typically identifies savings of 20–40% in environments that have not been actively governed.
This guide covers the diagnostic framework for identifying waste and the specific controls that make cloud data costs predictable.
## The four buckets of cloud data cost
Cloud data infrastructure cost breaks down into four categories:
**Compute** — the cost of running compute clusters: data warehouse virtual warehouses (Snowflake), query slots (BigQuery), Spark clusters (Databricks), ETL pipeline compute (ADF integration runtime, Fivetran consumption). Compute is typically 50–70% of total cloud data cost and has the highest optimisation potential because it is directly tied to usage patterns.
**Storage** — the cost of data at rest: data warehouse tables, data lake object storage (S3, ADLS Gen2, GCS), database backups, snapshot retention, Time Travel data (Snowflake). Storage is typically 15–25% of total cost and grows predictably with data volume. Optimisation is through compression, retention policy enforcement, and removing orphaned data.
**Data transfer** — the cost of moving data between regions, across availability zones, or from cloud to on-premise. Often overlooked in initial architecture design. Cross-region analytics architectures and on-premise-to-cloud pipelines generate significant transfer cost at scale. Typically 5–15% of total cost.
**Tooling and licensing** — the cost of managed services and SaaS tooling that sit on top of the cloud infrastructure: Fivetran connector licensing, dbt Cloud seats, Tableau Cloud licences, Databricks DBU charges above the infrastructure cost. Often not attributed to the data platform cost centre despite being direct costs.
## Compute optimisation
### Data warehouse compute (Snowflake, BigQuery, Databricks SQL)
**Auto-suspend settings** are the single highest-impact Snowflake cost control. A virtual warehouse that does not auto-suspend consumes credits continuously, even when idle. Set auto-suspend to 1–5 minutes for analytics workloads (where cold start latency is acceptable) and to 10–15 minutes for development environments where frequent queries make repeated cold starts costly. Check current auto-suspend settings with the Snowflake SHOW WAREHOUSES command or via the Snowflake account usage views.
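As an illustration of the auto-suspend audit, the following Python sketch takes rows shaped like SHOW WAREHOUSES output and produces the ALTER WAREHOUSE statements that would bring each warehouse under a chosen limit. The function name, the sample warehouse names, and the 300-second limit are assumptions made for the example; it only builds SQL text and does not connect to Snowflake.

```python
from typing import Iterable


def auto_suspend_fixes(warehouses: Iterable[dict], max_seconds: int = 300) -> list[str]:
    """Return ALTER WAREHOUSE statements for warehouses whose auto-suspend
    is disabled (None or 0) or longer than max_seconds."""
    statements = []
    for wh in warehouses:
        current = wh.get("auto_suspend")  # seconds; None or 0 means "never suspend"
        if not current or current > max_seconds:
            statements.append(
                f'ALTER WAREHOUSE "{wh["name"]}" SET AUTO_SUSPEND = {max_seconds};'
            )
    return statements


# Hypothetical rows containing only the two SHOW WAREHOUSES columns used here.
sample = [
    {"name": "ANALYTICS_WH", "auto_suspend": 600},
    {"name": "DBT_WH", "auto_suspend": 60},
    {"name": "LEGACY_WH", "auto_suspend": None},
]

for statement in auto_suspend_fixes(sample):
    print(statement)
```

Reviewing the generated statements before running them keeps a human in the loop for warehouses that are deliberately kept warm.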
**Warehouse sizing** should match workload. Most organisations over-provision warehouse size because "bigger is faster." Bigger warehouses consume more credits per hour — they complete individual queries faster, but for light or concurrent-friendly workloads, the extra credits do not produce proportional value. For interactive analytics queries by small groups of analysts, X-Small or Small warehouses are often adequate. Resize down and validate that query performance remains acceptable before committing.
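The sizing trade-off can be made concrete with a little arithmetic. The sketch below uses Snowflake's standard credits-per-hour figures, where each size step doubles consumption. The runtimes are hypothetical, and the calculation ignores per-second billing details such as the 60-second minimum charge on resume.

```python
# Credits per hour for standard Snowflake warehouses; each size step doubles.
CREDITS_PER_HOUR = {"XSMALL": 1, "SMALL": 2, "MEDIUM": 4, "LARGE": 8, "XLARGE": 16}


def workload_credits(size: str, runtime_minutes: float) -> float:
    """Credits consumed by a workload that keeps the warehouse busy for runtime_minutes."""
    return CREDITS_PER_HOUR[size] * runtime_minutes / 60


# Hypothetical job: 60 minutes on SMALL, 20 minutes on LARGE.
# LARGE costs 4x per hour but here is only 3x faster, so it costs more overall.
small = workload_credits("SMALL", 60)
large = workload_credits("LARGE", 20)
print(f"SMALL: {small:.2f} credits, LARGE: {large:.2f} credits")
```

A larger warehouse only pays for itself when the speed-up is at least as large as the increase in the hourly credit rate, or when the faster runtime has business value of its own.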
**Separate workload warehouses** allow different cost control policies per workload. A warehouse for scheduled dbt transformation jobs can auto-suspend immediately after job completion (suspend timeout 1 minute); an analytics warehouse used by interactive users can stay warm for 5 minutes. Without separation, you cannot apply different policies.
**BigQuery slot reservations vs on-demand**: for high-volume BigQuery environments, slot reservations (pre-purchased query compute) are often significantly more economical than on-demand pricing, which is billed per TB scanned (historically $5/TB; check the current rate for your region). If your BigQuery on-demand spend exceeds $3,000–$5,000/month, a slot reservation is worth evaluating. Conversely, for low-volume or highly variable usage, on-demand is more economical than paying for reservation capacity that sits unused.
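The reservation decision is a break-even comparison. The sketch below treats every price as an input rather than a fact: the per-TB rate and the monthly reservation cost are placeholders to be replaced with the current figures for your region and commitment type.

```python
def on_demand_cost(tb_scanned: float, price_per_tb: float) -> float:
    """Monthly on-demand cost for a given scan volume."""
    return tb_scanned * price_per_tb


def cheaper_option(tb_scanned: float, price_per_tb: float, reservation_monthly_cost: float) -> str:
    """Name the cheaper pricing model for one month of usage."""
    if on_demand_cost(tb_scanned, price_per_tb) > reservation_monthly_cost:
        return "reservation"
    return "on-demand"


# Placeholder figures: 5.0 per TB and a 3,000/month reservation are assumptions.
for volume_tb in (400, 1000):
    choice = cheaper_option(volume_tb, price_per_tb=5.0, reservation_monthly_cost=3000.0)
    print(f"{volume_tb} TB/month: {choice}")
```

Running the comparison on several months of actual scan volume, rather than a single month, guards against committing to a reservation on the basis of a temporary peak.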
**Databricks cluster configuration**: Databricks all-purpose clusters (used for notebooks and development) are more expensive per DBU than jobs clusters (used for automated workflows). Use jobs clusters for production pipeline runs. Enable auto-termination on all clusters. Use instance pool warm instances for interactive workloads with frequent starts — pool instances start faster and at lower cost than cold-start cluster provisioning.
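A simple check for the auto-termination rule might look like the sketch below. It assumes cluster descriptions shaped like the dictionaries returned by the Databricks Clusters API, using the `cluster_name`, `cluster_source`, and `autotermination_minutes` fields (where 0 means auto-termination is disabled). The sample clusters are hypothetical, and the field semantics should be confirmed against your workspace's API version.

```python
def cluster_findings(clusters: list[dict]) -> list[str]:
    """Flag all-purpose clusters that have auto-termination disabled.

    Job clusters are skipped because they terminate when their run finishes.
    """
    findings = []
    for cluster in clusters:
        name = cluster.get("cluster_name", "<unnamed>")
        is_job_cluster = cluster.get("cluster_source") == "JOB"
        if not is_job_cluster and cluster.get("autotermination_minutes", 0) == 0:
            findings.append(f"{name}: all-purpose cluster without auto-termination")
    return findings


# Hypothetical cluster inventory.
sample_clusters = [
    {"cluster_name": "team-notebooks", "cluster_source": "UI", "autotermination_minutes": 0},
    {"cluster_name": "adhoc-analysis", "cluster_source": "UI", "autotermination_minutes": 30},
    {"cluster_name": "nightly-etl-run", "cluster_source": "JOB", "autotermination_minutes": 0},
]

for finding in cluster_findings(sample_clusters):
    print(finding)
```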
### Pipeline compute
Fivetran pricing is consumption-based (monthly active rows). Large historical syncs or full refresh configurations for high-volume connectors drive cost. Review Fivetran connector configurations: switch full-refresh connectors to incremental where the source supports CDC. Archive or pause connectors for data sources that are no longer used.
Azure Data Factory charges per pipeline activity run, integration runtime usage, and data movement. Pipeline runs that execute more frequently than data changes produce unnecessary cost. Review trigger frequency against actual data change cadence.
## Storage optimisation
### Data warehouse storage
Snowflake storage is billed on compressed data size, including Time Travel and Fail-safe. For large tables, Time Travel retention is the biggest controllable cost driver. Review Time Travel settings: the default retention is 1 day on all editions, and Enterprise edition and above allow up to 90 days. Tables with long retention set (14, 30, or 90 days) accumulate significant historical data storage, particularly when they are updated frequently. Reduce Time Travel retention on large tables that are not subject to frequent accidental modification.
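One way to act on this review is to generate the retention changes from an inventory of tables. In the sketch below the inventory rows, the 100 GB threshold, and the function name are assumptions for illustration; the generated statement uses Snowflake's DATA_RETENTION_TIME_IN_DAYS table parameter.

```python
def retention_changes(tables: list[dict], min_time_travel_bytes: int, target_days: int = 1) -> list[str]:
    """Return ALTER TABLE statements for tables with large Time Travel storage
    and a retention period longer than target_days."""
    statements = []
    for table in tables:
        if (
            table["time_travel_bytes"] >= min_time_travel_bytes
            and table["retention_days"] > target_days
        ):
            statements.append(
                f'ALTER TABLE {table["name"]} SET DATA_RETENTION_TIME_IN_DAYS = {target_days};'
            )
    return statements


# Hypothetical inventory; names and sizes are made up.
inventory = [
    {"name": "ANALYTICS.RAW.EVENTS", "retention_days": 90, "time_travel_bytes": 800 * 10**9},
    {"name": "ANALYTICS.CORE.ORDERS", "retention_days": 1, "time_travel_bytes": 300 * 10**9},
    {"name": "ANALYTICS.REF.COUNTRIES", "retention_days": 30, "time_travel_bytes": 10**6},
]

for statement in retention_changes(inventory, min_time_travel_bytes=100 * 10**9):
    print(statement)
```

Small tables with long retention are deliberately left alone here, since the saving is negligible and the recovery window may be useful.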
Identify orphaned tables — tables that exist but are not referenced by any pipeline, transformation, or BI tool. Use the SNOWFLAKE.ACCOUNT_USAGE.TABLE_STORAGE_METRICS view to identify large tables and compare against your data catalogue (or run a query to identify tables with zero recent query activity from QUERY_HISTORY). Orphaned tables in development or testing schemas accumulate without governance.
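The comparison step can be sketched as a set difference. The example assumes two inputs gathered separately: rows with the TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, and ACTIVE_BYTES columns of TABLE_STORAGE_METRICS, and a set of fully qualified table names known to be in use (from a catalogue or query history). The sample data is hypothetical.

```python
def orphan_candidates(storage_rows: list[dict], tables_in_use: set[str], min_bytes: int = 0) -> list[tuple[str, int]]:
    """Return (table, active_bytes) pairs for tables not known to be in use,
    largest first."""
    candidates = []
    for row in storage_rows:
        fqn = f'{row["TABLE_CATALOG"]}.{row["TABLE_SCHEMA"]}.{row["TABLE_NAME"]}'
        if fqn not in tables_in_use and row["ACTIVE_BYTES"] >= min_bytes:
            candidates.append((fqn, row["ACTIVE_BYTES"]))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Hypothetical storage rows and usage set.
storage = [
    {"TABLE_CATALOG": "DEV", "TABLE_SCHEMA": "SCRATCH", "TABLE_NAME": "TMP_JOIN", "ACTIVE_BYTES": 50 * 10**9},
    {"TABLE_CATALOG": "PROD", "TABLE_SCHEMA": "CORE", "TABLE_NAME": "ORDERS", "ACTIVE_BYTES": 900 * 10**9},
    {"TABLE_CATALOG": "DEV", "TABLE_SCHEMA": "SCRATCH", "TABLE_NAME": "OLD_EXPORT", "ACTIVE_BYTES": 200 * 10**9},
]
in_use = {"PROD.CORE.ORDERS"}

for name, size in orphan_candidates(storage, in_use, min_bytes=10**9):
    print(f"{name}: {size / 10**9:.0f} GB")
```

The output is a list of candidates rather than a deletion list: a table with no recent queries may still be required for audit or recovery purposes.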
Zero-copy clones are storage-free when created but accumulate storage as they diverge from the source. Review clone inventory and delete clones that are no longer needed.
### Data lake object storage
S3, ADLS Gen2, and GCS storage costs grow with data volume and retention policies. The primary optimisation levers:
**Storage tiering**: infrequently accessed data should be in cheaper storage tiers (S3 Infrequent Access, Azure Cool Blob Storage, GCS Nearline) rather than standard tier. Enable lifecycle policies that automatically transition data to cheaper tiers after a configurable inactivity period (30, 60, or 90 days). For archive data (data that must be retained for compliance but is rarely accessed), move to the lowest-cost archival tier (Glacier Deep Archive, Azure Archive Storage).
**Compression**: Parquet files with Snappy or ZSTD compression are typically 5–10× smaller than uncompressed CSV for the same data. If your data lake stores raw CSV files, recompressing them as Parquet significantly reduces storage cost. Many medallion architectures perform this compression as part of the Bronze-to-Silver transformation.
**Partition pruning**: data partitioned by date (year/month/day folder structure) allows queries to skip irrelevant partitions. Without date partitioning, queries scan the full data lake even when filtering on a date range — producing both higher query cost and higher data transfer. Partition by the most common filter dimension (usually date for time-series analytics data).
**Retention policy enforcement**: raw data (Bronze layer) that is retained indefinitely accumulates cost. For most organisations, raw data beyond a certain age (2–5 years, depending on regulatory requirements) has marginal analytical value relative to the storage cost. Define and enforce retention policies for Bronze-layer raw data. Regulatory and compliance requirements govern minimum retention periods; beyond those minima, data should be tiered or deleted.
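Tiering and retention can be expressed together in a single lifecycle rule. The sketch below builds a rule in the dictionary shape that S3 lifecycle configuration uses (the same shape boto3's `put_bucket_lifecycle_configuration` accepts inside `LifecycleConfiguration={"Rules": [...]}`). The prefix, rule ID, and day counts are assumptions. Note that standard S3 lifecycle transitions are driven by object age rather than last access, so age is used here as a proxy for inactivity.

```python
from typing import Optional


def lifecycle_rule(prefix: str, ia_after_days: int, archive_after_days: int,
                   expire_after_days: Optional[int] = None) -> dict:
    """Build one S3 lifecycle rule: Standard -> Standard-IA -> Deep Archive,
    with optional expiration (deletion) at the end of the retention period."""
    if ia_after_days < 30:
        raise ValueError("S3 requires objects to be at least 30 days old before moving to STANDARD_IA")
    if archive_after_days <= ia_after_days:
        raise ValueError("archive transition must come after the infrequent-access transition")
    if expire_after_days is not None and expire_after_days <= archive_after_days:
        raise ValueError("expiration must come after the last transition")

    rule = {
        "ID": f"tier-{prefix.strip('/') or 'all'}",  # hypothetical naming scheme
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": archive_after_days, "StorageClass": "DEEP_ARCHIVE"},
        ],
    }
    if expire_after_days is not None:
        rule["Expiration"] = {"Days": expire_after_days}
    return rule


# Hypothetical Bronze-layer policy: tier at 30 and 180 days, delete after 5 years.
bronze_rule = lifecycle_rule("bronze/", 30, 180, expire_after_days=1825)
print(bronze_rule)
```

Setting the expiration from the regulatory minimum, rather than from a guess, keeps the rule aligned with the retention requirements described above.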
## Query cost optimisation
Poorly optimised queries are a significant compute cost driver — they run longer, scan more data, and consume more warehouse credits. Key patterns:
**Avoid SELECT \***: on columnar storage (Snowflake, BigQuery, Parquet in Delta Lake), reading only the required columns is fundamentally faster and cheaper than SELECT \*. Require column-specificity in production queries and dbt models.
**Partition filter push-down**: queries that include filter predicates on the partition column (date, region, etc.) prune partitions and reduce data scanned. Queries without such filters scan the full table. For large tables, the cost difference is significant.
**Materialise intermediate results**: if multiple queries repeatedly perform the same expensive join or aggregation, materialise the result once (as a dbt model or a Snowflake dynamic table) and query the materialisation. The storage cost of the materialised table is typically much lower than the repeated compute cost of re-executing the query.
**Query profiling**: Snowflake's Query Profile (in the Snowflake web interface or via QUERY_HISTORY) shows the execution plan and time breakdown for each query. BigQuery's INFORMATION_SCHEMA.JOBS_BY_PROJECT provides slot consumption by query. Use these to identify high-cost queries and target optimisation effort.
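To make the effect of partition filters concrete, the following sketch models a date-partitioned table as a mapping from partition date to size and compares the bytes read with and without a date filter. The one-year table of 1 GB daily partitions is hypothetical, and real engines add metadata and caching effects that this model ignores.

```python
from datetime import date, timedelta


def bytes_scanned(partitions: dict[date, int], start: date, end: date) -> int:
    """Bytes read when only partitions dated within [start, end] are scanned."""
    return sum(size for day, size in partitions.items() if start <= day <= end)


# Hypothetical table: one 1 GB partition per day for 2023.
table = {date(2023, 1, 1) + timedelta(days=i): 10**9 for i in range(365)}

full_scan = sum(table.values())  # no partition filter: every partition is read
one_week = bytes_scanned(table, date(2023, 6, 1), date(2023, 6, 7))

print(f"Full scan: {full_scan / 10**9:.0f} GB, filtered week: {one_week / 10**9:.0f} GB")
```

On platforms that bill by data scanned, the ratio between the two figures translates directly into query cost.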
## Cost visibility and governance
Cost optimisation without cost visibility is guesswork. The minimum governance infrastructure for cloud data cost:
**Cost attribution by team/workload**: create separate Snowflake warehouses or BigQuery projects per team or workload type. Attribute cost by resource (warehouse or project) to the team that owns it. Without attribution, there is no accountability for cost growth.
**Budget alerts**: configure budget alerts at the cloud provider level (AWS Cost Anomaly Detection, Azure Cost Alerts, Google Cloud Budgets) and at the warehouse level (Snowflake Resource Monitors, BigQuery budget alerts). Alerts should fire before costs exceed budget, not after month-end billing.
**Monthly cost review**: a monthly review of cloud infrastructure costs — by service, by team, and compared to previous months — keeps cost growth visible and creates an accountability mechanism. Without a regular review cadence, cost growth is invisible until the quarterly budget review when it is too late to course-correct.
**Tagging**: apply consistent resource tags to all cloud infrastructure (environment: prod/dev/staging, team: data-engineering/analytics/ml, project: name). Tags enable cost breakdown by tag in cloud billing reports and make attribution feasible without maintaining a separate mapping.
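Attribution and alerting can be prototyped on exported billing data before investing in tooling. The sketch below assumes each billing line item is a dictionary with a `cost` and a `tags` mapping; that shape, the tag key, the budgets, and the 80% alert threshold are all assumptions for the example.

```python
from collections import defaultdict


def cost_by_tag(line_items: list[dict], tag_key: str) -> dict[str, float]:
    """Sum cost per tag value; items missing the tag are grouped as 'untagged'."""
    totals: dict[str, float] = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)


def budget_alerts(costs: dict[str, float], budgets: dict[str, float], threshold: float = 0.8) -> list[str]:
    """Return owners whose spend has reached the given fraction of their budget."""
    return sorted(
        owner for owner, spent in costs.items()
        if owner in budgets and spent >= threshold * budgets[owner]
    )


# Hypothetical month-to-date billing export.
items = [
    {"cost": 1200.0, "tags": {"team": "analytics"}},
    {"cost": 800.0, "tags": {"team": "data-engineering"}},
    {"cost": 300.0, "tags": {}},
    {"cost": 500.0, "tags": {"team": "analytics"}},
]
costs = cost_by_tag(items, "team")
alerts = budget_alerts(costs, {"analytics": 2000.0, "data-engineering": 2000.0})
print(costs, alerts)
```

The size of the `untagged` bucket is itself a useful metric: it shows how much spend currently has no accountable owner.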
## Environment sizing
Development and staging environments are often sized to match production unnecessarily. Production requires high availability, performance SLAs, and capacity for peak load. Development and staging need sufficient capacity to test, not to serve production workloads.
Practical actions: reduce Snowflake virtual warehouse sizes in dev/staging environments (X-Small is usually sufficient for development queries); lower auto-suspend timeouts in non-production environments; use serverless or on-demand compute for development workloads rather than always-on clusters; and suspend dev environments during off-hours with a scheduled Snowflake task, Lambda, or Azure Function that suspends warehouses outside working hours. Snowflake Resource Monitors complement this by capping credit consumption, but they act on credit quotas rather than on a schedule.
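The logic inside such a scheduled function could be sketched as follows. The working hours (08:00 to 19:00, Monday to Friday), the warehouse names, and the function names are assumptions; the code only builds the ALTER WAREHOUSE statements and leaves execution and error handling (for example, a warehouse that is already suspended) to the caller.

```python
from datetime import datetime, time


def is_off_hours(now: datetime, start: time = time(8, 0), end: time = time(19, 0)) -> bool:
    """True on weekends and outside [start, end) on weekdays."""
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (start <= now.time() < end)


def suspend_statements(warehouses: list[str], now: datetime) -> list[str]:
    """Return suspend statements for the given warehouses if it is off-hours."""
    if not is_off_hours(now):
        return []
    return [f"ALTER WAREHOUSE IF EXISTS {name} SUSPEND;" for name in warehouses]


# Hypothetical dev warehouses, evaluated on a Monday evening.
dev_warehouses = ["DEV_WH", "STAGING_WH"]
for statement in suspend_statements(dev_warehouses, datetime(2024, 1, 8, 22, 0)):
    print(statement)
```

Because auto-resume restarts a warehouse on the next query, a scheduled suspend of this kind mainly catches warehouses that are being kept awake by forgotten sessions or polling tools.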
For the cloud infrastructure design context that prevents cost problems from being built into the architecture, see Azure data architecture best practices and how to build a modern data stack. For platform-specific cost management, see the Snowflake pricing guide.
Our cloud engineering practice conducts cloud data cost reviews as a standalone engagement — identifying the specific waste items and quantifying the savings available in your environment. If your cloud data infrastructure costs are growing faster than expected or you want an external benchmark, book a free 30-minute audit.