What Is Snowflake? The Cloud Data Platform Explained

Austin Duncan
Project Manager & Data Strategist
February 23, 2028 · 11 min read

Snowflake is a cloud-native data platform built on a multi-cluster shared data architecture — separating compute from storage, enabling elastic scaling, and supporting multiple independent compute clusters querying the same data simultaneously. This guide explains how Snowflake works, what makes it different from traditional data warehouses, and the use cases where its architecture is a strong fit.

What Snowflake Is

Unlike traditional data warehouses, where compute and storage are tightly coupled on the same physical hardware, Snowflake separates them completely: data lives in cloud object storage (S3, GCS, or Azure Blob), and compute clusters (virtual warehouses) are provisioned and released independently.

This separation enables what traditional data warehouse architectures could not: elastic compute scaling in seconds, multiple independent compute clusters querying the same data without contention, and storage costs that grow linearly with data volume rather than with hardware capacity.

Architecture: How Snowflake Works

**Storage layer**: All data in Snowflake is stored in compressed columnar format in cloud object storage managed by Snowflake. Data is immutable and stored in micro-partitions — small compressed units (50-500 MB uncompressed) that contain the data for a range of rows. Metadata about each micro-partition (min/max values per column, row count, size) enables efficient query pruning.
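The pruning idea can be illustrated with a small Python sketch. This is a simplified model, not Snowflake's implementation: the `MicroPartition` class, the partition names, and the integer-encoded dates are all hypothetical, and only a single column's min/max metadata is modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroPartition:
    """Hypothetical stand-in for the metadata kept for one column of one partition."""
    name: str
    min_value: int
    max_value: int
    row_count: int


def prune(partitions, low, high):
    """Keep only partitions whose [min, max] range overlaps the filter range [low, high]."""
    return [p for p in partitions if p.max_value >= low and p.min_value <= high]


# Illustrative metadata: order dates encoded as YYYYMMDD integers.
partitions = [
    MicroPartition("mp_001", 20240101, 20240131, 1_000_000),
    MicroPartition("mp_002", 20240201, 20240229, 1_200_000),
    MicroPartition("mp_003", 20240301, 20240331, 900_000),
]

# Roughly: WHERE order_date BETWEEN 20240210 AND 20240220
to_scan = prune(partitions, 20240210, 20240220)
print([p.name for p in to_scan])  # ['mp_002']
```

Only the metadata is consulted to decide which partitions to read, which is why the check is cheap compared with scanning the data itself.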

**Virtual warehouses (compute)**: A virtual warehouse is a cluster of compute resources — CPU, memory, local SSD cache — that executes queries. Warehouses are sized from X-Small to 6X-Large (1 to 512 nodes). They start in seconds and suspend automatically when idle. You pay only for the time a warehouse is active.

Multiple virtual warehouses can run simultaneously against the same data — a data loading warehouse, a BI reporting warehouse, and a data science warehouse can all operate independently without competing for the same resources.

**Cloud services layer**: The coordination layer, responsible for query compilation and optimisation, metadata management, authentication, transaction management, and result caching. This layer runs continuously; its usage is billed only when it exceeds 10% of daily warehouse compute consumption.

**Multi-cloud**: Snowflake runs on AWS, Azure, and GCP. The underlying cloud is selected when the account is created. Snowflake manages the infrastructure; customers see only the SQL interface and billing.

Snowflake-Specific Capabilities

**Automatic clustering**: Snowflake's query optimiser uses micro-partition metadata to skip partitions that cannot match query predicates. For large tables, defining a CLUSTER BY key causes Snowflake to reorganise micro-partitions so that rows with similar key values are stored together, which maximises partition pruning. Automatic clustering runs in the background on Snowflake-managed compute, and the credits it consumes are billed to the account.
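To see why clustering helps pruning, the following Python sketch compares how many partitions a range filter touches when the same values are stored in arrival order versus sorted by key. It is a toy model under stated assumptions: the helper names are hypothetical, partitions are fixed-size chunks of 1,000 values, and real clustering is incremental rather than a full sort.

```python
import random


def build_partitions(values, rows_per_partition):
    """Split values into fixed-size chunks and record (min, max) for each chunk."""
    chunks = [
        values[i:i + rows_per_partition]
        for i in range(0, len(values), rows_per_partition)
    ]
    return [(min(chunk), max(chunk)) for chunk in chunks]


def partitions_scanned(metadata, low, high):
    """Count chunks whose (min, max) range overlaps the filter range [low, high]."""
    return sum(1 for lo, hi in metadata if hi >= low and lo <= high)


random.seed(0)
values = list(range(10_000))
random.shuffle(values)  # simulates rows arriving in no particular key order

unclustered = build_partitions(values, 1_000)
clustered = build_partitions(sorted(values), 1_000)

# Roughly: WHERE key BETWEEN 2500 AND 2600
print("unclustered:", partitions_scanned(unclustered, 2_500, 2_600))
print("clustered:  ", partitions_scanned(clustered, 2_500, 2_600))
```

With shuffled data, nearly every chunk spans almost the whole key range, so the filter cannot rule any of them out. With sorted data the key ranges do not overlap, and a narrow filter touches a single chunk.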

**Zero-copy cloning**: CREATE TABLE ... CLONE creates an instant copy of a table, schema, or database without duplicating the underlying data. The clone initially shares the same micro-partitions as the source; only new or modified data is separately stored. Clones are useful for test environments, development snapshots, and data science feature engineering without incurring storage costs for the full dataset copy.
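The storage behaviour of a clone can be sketched in Python as two tables holding references to the same immutable partitions. The `Partition` and `Table` classes below are hypothetical illustrations of the idea, not Snowflake internals.

```python
class Partition:
    """Immutable block of rows; object identity stands in for one stored file."""

    def __init__(self, rows):
        self.rows = tuple(rows)


class Table:
    """Toy table: an ordered list of references to partitions."""

    def __init__(self, partitions=()):
        self.partitions = list(partitions)  # copies references, not row data

    def clone(self):
        """Return a new table pointing at the same partitions as this one."""
        return Table(self.partitions)

    def insert(self, rows):
        """New data always lands in a new partition owned by this table only."""
        self.partitions.append(Partition(rows))


def stored_partitions(*tables):
    """Count distinct partitions across tables, i.e. what is physically stored."""
    return len({p for t in tables for p in t.partitions})


prod = Table([Partition([1, 2]), Partition([3, 4]), Partition([5, 6])])
dev = prod.clone()
after_clone = stored_partitions(prod, dev)

dev.insert([7, 8])
after_write = stored_partitions(prod, dev)

print(after_clone, after_write)  # 3 4
```

The clone adds no stored partitions at creation time. Only the write to `dev` adds one, and `prod` is unaffected by it.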

**Time travel**: Query historical data as of any point within the retention period (1 day by default, up to 90 days on Enterprise Edition and above). For example: SELECT * FROM my_table AT(TIMESTAMP => '2024-01-15 14:00:00'::timestamp_ltz). Note that the string must be cast to a timestamp type. Useful for data recovery, auditing, and debugging pipelines that modified data incorrectly.
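Conceptually, a point-in-time query looks up the most recent table version committed at or before the requested timestamp. The Python sketch below models that lookup with a hypothetical `VersionedTable` class. It assumes commits arrive in timestamp order and ignores retention limits.

```python
import bisect
from datetime import datetime


class VersionedTable:
    """Toy table that keeps every committed snapshot, ordered by commit time."""

    def __init__(self):
        self._timestamps = []
        self._snapshots = []

    def commit(self, ts, rows):
        """Record a new version. Assumes commits arrive in increasing timestamp order."""
        self._timestamps.append(ts)
        self._snapshots.append(tuple(rows))

    def at(self, ts):
        """Return the snapshot that was current at time ts."""
        i = bisect.bisect_right(self._timestamps, ts)
        if i == 0:
            raise LookupError("no version exists at or before that time")
        return self._snapshots[i - 1]


orders = VersionedTable()
orders.commit(datetime(2024, 1, 15, 13, 0), ["order-1", "order-2"])
orders.commit(datetime(2024, 1, 15, 15, 0), ["order-1"])  # a bad pipeline run dropped a row

# State of the table before the bad run
recovered = orders.at(datetime(2024, 1, 15, 14, 0))
print(recovered)  # ('order-1', 'order-2')
```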

**Data sharing**: Share live data between Snowflake accounts without copying or moving data. The data provider creates a share pointing at tables; the consumer queries those tables via their own compute. Data providers can share with external accounts (customers, partners) or internally between business units with separate accounts.

**Snowpark**: Python, Java, and Scala execution within Snowflake — write DataFrames that execute in Snowflake's compute rather than in an external Spark cluster. Enables Python ML libraries and custom transformations without moving data out of Snowflake.

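**Semi-structured data**: The VARIANT type stores JSON, XML, and Avro natively. Query first-level fields with colon notation and nested fields with dot notation: SELECT json_column:field_name.nested_field FROM my_table. The FLATTEN function expands nested arrays and objects into rows, and Snowflake can infer a schema from staged semi-structured files.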
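For readers more familiar with Python than with VARIANT paths, the sketch below mimics the two operations on a parsed JSON document: following a path into nested fields, and expanding an array into one row per element. The `get_path` and `flatten` helpers and the sample event are hypothetical, and the row shape is only loosely modelled on what FLATTEN returns.

```python
import json


def get_path(value, path):
    """Follow an 'a.b.c' style path through nested dicts; None stands in for NULL."""
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def flatten(record, array_key):
    """Emit one row per array element, similar in spirit to FLATTEN."""
    return [
        {"index": i, "value": item}
        for i, item in enumerate(record.get(array_key, []))
    ]


raw = '{"user": {"id": 42, "plan": "pro"}, "items": [{"sku": "A1"}, {"sku": "B2"}]}'
event = json.loads(raw)

print(get_path(event, "user.plan"))     # pro
print(get_path(event, "user.missing"))  # None
print(flatten(event, "items"))
```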

Virtual Warehouse Sizing and Cost

Snowflake pricing has two components: compute (credits billed per second of warehouse runtime, with a 60-second minimum each time a warehouse starts) and storage (per GB per month).

Credit prices vary by edition and cloud region, but are approximately $2-4 per credit at on-demand rates. Credit consumption scales with warehouse size: X-Small consumes 1 credit/hour, Small 2, Medium 4, Large 8, X-Large 16, and so on, doubling at each size.
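The doubling rule makes compute cost easy to estimate. The Python sketch below is a back-of-the-envelope calculator. The $3 credit price is an assumed value from the middle of the range quoted above, and the function names are illustrative.

```python
# Warehouse sizes in ascending order; each step doubles credit consumption.
SIZES = [
    "X-Small", "Small", "Medium", "Large", "X-Large",
    "2X-Large", "3X-Large", "4X-Large", "5X-Large", "6X-Large",
]


def credits_per_hour(size):
    """X-Small is 1 credit/hour; every size above it doubles the previous one."""
    return 2 ** SIZES.index(size)


def compute_cost(size, hours, price_per_credit):
    """Estimated spend for a warehouse of the given size running for `hours`."""
    return credits_per_hour(size) * hours * price_per_credit


print(credits_per_hour("X-Large"))   # 16
print(credits_per_hour("6X-Large"))  # 512

# A Medium warehouse active 4 hours a day for 30 days at an assumed $3 per credit
print(compute_cost("Medium", hours=4 * 30, price_per_credit=3.0))  # 1440.0
```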

**Sizing strategy**: Size the warehouse for the workload, not the data volume. A larger warehouse runs queries faster but does not change the query result. For interactive BI queries with many concurrent users, scale out (add multi-cluster warehouses to handle concurrency) rather than scaling up (larger warehouse). For long ETL jobs or complex data science queries, scale up.
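The two scaling directions answer different questions, which a rough Python sketch can make concrete. Both functions are idealised: the 8 queries per cluster is an assumed figure to tune per workload, and the perfectly linear speed-up per size step is a simplification that real queries rarely achieve.

```python
import math


def clusters_for_concurrency(concurrent_queries, slots_per_cluster=8):
    """Scale out: clusters needed so that no query has to queue.

    slots_per_cluster is an assumption for illustration only.
    """
    return max(1, math.ceil(concurrent_queries / slots_per_cluster))


def scaled_up_runtime(base_minutes, size_steps):
    """Scale up: runtime after moving up `size_steps` sizes.

    Assumes each size doubling halves the runtime.
    """
    return base_minutes / (2 ** size_steps)


# 40 simultaneous dashboard queries: add clusters rather than a bigger warehouse
print(clusters_for_concurrency(40))  # 5

# A 120-minute ETL job moved from Small to Large (two size steps)
print(scaled_up_runtime(120, 2))  # 30.0
```

Under the idealised model, the scaled-up job costs the same in credits: four times the hourly rate for a quarter of the time. In practice the speed-up is usually less than linear, so it is worth measuring.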

**Auto-suspend and auto-resume**: Set warehouses to suspend after 1-10 minutes of inactivity and resume automatically when a query arrives. A BI warehouse that is active 4 hours per day and suspended for the other 20 consumes about 83% fewer credits than one running continuously.
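The saving from auto-suspend follows directly from the fraction of the day the warehouse is running. A minimal Python check of the arithmetic, ignoring the short idle period before each suspend takes effect:

```python
def savings_fraction(active_hours_per_day):
    """Share of credits saved versus a warehouse that runs 24 hours a day."""
    return 1 - active_hours_per_day / 24


# Active 4 hours, suspended 20 hours
print(f"{savings_fraction(4):.1%}")  # 83.3%
```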

**Result cache**: Snowflake caches query results for 24 hours. Identical queries (same SQL text, a role with the required privileges, unchanged underlying data) return cached results without consuming compute credits, regardless of which warehouse issued the original query. Design BI tools to use consistent query text to maximise cache utilisation.
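The reason consistent query text matters can be seen in a toy cache keyed on the exact SQL string and a data version. The `ResultCache` class below is a hypothetical sketch of that idea, not Snowflake's implementation. It leaves out role checks and uses an integer `data_version` as a stand-in for "the underlying data has not changed".

```python
import hashlib


class ResultCache:
    """Toy result cache: exact SQL text plus data version, with a 24-hour lifetime."""

    TTL_SECONDS = 24 * 3600

    def __init__(self):
        self._entries = {}

    def _key(self, sql, data_version):
        # The text is hashed as-is, so any difference in case or spacing is a different key.
        return hashlib.sha256(f"{data_version}\n{sql}".encode()).hexdigest()

    def put(self, sql, data_version, result, now):
        self._entries[self._key(sql, data_version)] = (result, now)

    def get(self, sql, data_version, now):
        entry = self._entries.get(self._key(sql, data_version))
        if entry is None:
            return None
        result, stored_at = entry
        if now - stored_at > self.TTL_SECONDS:
            return None
        return result


cache = ResultCache()
query = "SELECT COUNT(*) FROM orders"
cache.put(query, data_version=1, result=1234, now=0)

print(cache.get(query, 1, now=3600))                          # 1234 (hit)
print(cache.get("select count(*) from orders", 1, now=3600))  # None (text differs)
print(cache.get(query, 2, now=3600))                          # None (data changed)
print(cache.get(query, 1, now=25 * 3600))                     # None (older than 24 hours)
```

A BI tool that generates the same statement with different casing, aliases, or whitespace on each refresh would miss a cache like this every time.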

When Snowflake Is a Strong Fit

**Multiple workloads competing for resources**: Separate virtual warehouses for ETL, BI, and data science eliminate resource contention. An overnight ETL batch does not slow down morning BI queries.

**Bursty analytical demand**: Warehouses scale up in seconds. A quarterly planning process that requires heavy analytical computation for three days can run at XL for those days and drop back to S otherwise.

**Collaborative analytics with data sharing**: Organisations that need to share data between business units, with external partners, or with customers without copying data.

**Semi-structured data in analytics**: JSON event logs, API response payloads, Salesforce object exports — Snowflake's VARIANT type and FLATTEN function handle these without a separate pre-processing step.

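**Cross-cloud data consumers**: Snowflake runs on AWS, GCP, and Azure, so an organisation can serve teams on different clouds from the same platform. Because each account lives on a single cloud and region, this typically means running an account per cloud and replicating or sharing data between them.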

Our data architecture practice designs and implements Snowflake architectures for enterprise analytics teams — contact us to discuss your Snowflake data platform requirements.
