BlogData Architecture

What Is Data Partitioning? Splitting Tables for Faster Queries and Easier Management

Austin Duncan
Austin Duncan
Project Manager
·September 11, 20289 min read

Data partitioning divides a large table into smaller, physically or logically separated segments based on a column value — typically a date. This guide explains how partitioning works, the difference between horizontal and vertical partitioning, and how cloud data warehouses implement it.

Data partitioning is the division of a large table into smaller, physically or logically separated segments based on the value of one or more columns. Queries that filter on the partition column can skip entire partitions of data that do not match the filter — avoiding the cost of scanning rows that would be filtered out anyway.

The most common partitioning pattern is time-based: a sales transaction table partitioned by order_date stores each day's transactions in a separate partition. A query filtering to last quarter's transactions reads only the partitions for those three months, rather than scanning the entire multi-year table.

Partitioning is one of the highest-leverage performance optimizations available for large tables. A well-partitioned 5-billion-row table can answer a query for last-week's data in seconds; the same query without partitioning might take minutes.

Horizontal vs Vertical Partitioning

**Horizontal partitioning (sharding)** divides rows across partitions — each partition contains a subset of the full row set. A date-partitioned table divides rows by date; a geography-partitioned table divides rows by region. Each partition has the same column structure; rows are distributed based on their partition key value. This is the standard meaning of "partitioning" in data warehousing.

**Vertical partitioning** divides columns across tables — commonly seen in column-oriented (columnar) storage, where each column is stored physically separately from others. Columnar storage is the architectural foundation of cloud data warehouses (BigQuery, Snowflake, Redshift): analytical queries that aggregate only a few columns of a wide table read only those columns' data, not the full row width. Vertical partitioning is built into columnar storage engines; it is not typically configured manually.

Partition Types

**Range partitioning:** Rows assigned to partitions based on a value range. DATE(created_at) BETWEEN '2024-01-01' AND '2024-01-31' defines the January partition. Range partitioning on dates is the canonical pattern for time-series data — every time-bounded query benefits automatically.

**List partitioning:** Rows assigned based on membership in a defined list of values. status IN ('active', 'trial') goes to one partition; status IN ('churned', 'cancelled') goes to another. Useful when queries frequently filter on a low-cardinality categorical column.

**Hash partitioning:** Rows assigned to partitions based on the hash of a column value. Distributes rows evenly across a fixed number of partitions. Used in distributed systems to balance storage and compute across nodes.

Partitioning in Cloud Data Warehouses

**BigQuery:** Table partitioning on a DATE or TIMESTAMP column, integer range, or using ingestion time. A query filtering on the partition column is "partition aware" and reads only matching partitions. Unpartitioned queries on large tables are expensive — BigQuery charges by bytes scanned, and a full table scan on a multi-terabyte table is both slow and costly. Partition decoration in INSERT queries (INSERT INTO table$20240315) writes directly to a specific partition.

**Snowflake:** Snowflake uses automatic micro-partitioning rather than explicit user-defined partitions. Every Snowflake table is automatically divided into compressed micro-partitions of 50-500MB each. Snowflake records the min/max values of each column in each micro-partition's metadata. When a query filters on a column, Snowflake consults this metadata to skip micro-partitions where the column's values are outside the filter range — without the user defining explicit partition keys.

Snowflake's CLUSTER BY definition specifies which column(s) Snowflake should optimize clustering for. When a CLUSTER BY is applied, Snowflake continuously re-orders micro-partitions to improve pruning on the specified columns. Effective clustering on date columns means that a query for last month's data prunes all micro-partitions where the date range does not overlap the filter.

**Redshift:** SORTKEY determines the physical sort order of rows within each distribution node. Queries filtering on sortkey columns can efficiently skip blocks of rows. DISTKEY determines how rows are distributed across nodes — queries joining tables on the distkey column avoid expensive cross-node data movement.

Partition Granularity

Choosing the right partition granularity for time-based partitioning depends on query patterns and data volume:

**Daily partitioning:** The most common choice for most event-based tables. Queries filtering to any date range benefit; partition management is simple. Daily partitioning becomes impractical when daily volumes are so small that the overhead of thousands of tiny partitions exceeds the pruning benefit.

**Monthly partitioning:** Appropriate when query filters are almost always at the month level or larger, or when daily data volumes are very small.

**Hourly partitioning:** Appropriate for high-volume streaming data tables where queries frequently filter to sub-daily windows. Creates more partitions but provides finer-grained pruning.

The practical guideline: partition at the granularity that matches your most common filter range. If 90% of queries filter by month, monthly partitions provide the pruning benefit with fewer partitions to manage.

Our data architecture services practice designs partitioning strategies for cloud data warehouse tables — BigQuery partitioning, Snowflake clustering, and Redshift sort/dist key selection — that optimize query performance for your specific workload. Contact us to discuss your data architecture.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →