BlogAnalytics

SQL Aggregation: GROUP BY, HAVING, and Aggregate Functions Explained

Eric Chen
Eric Chen
BI Solutions Architect
·February 2, 202810 min read

SQL aggregation collapses multiple rows into a single summary value — COUNT, SUM, AVG, MIN, MAX. This guide covers GROUP BY fundamentals, multi-column grouping, HAVING for filtering aggregated results, GROUPING SETS and ROLLUP for multi-level summaries, and common aggregation mistakes that produce incorrect results.

What Aggregation Does

SQL aggregation collapses multiple rows into a single summary value using aggregate functions. Without aggregation, a query on an orders table returns one row per order. With aggregation, it returns one row per dimension combination — one row per region, one row per month, one row per customer.

Aggregation is the engine of analytical SQL. Understanding GROUP BY, HAVING, and the behaviour of aggregate functions is fundamental to writing correct analytical queries.

Aggregate Functions

**COUNT**: Counts rows.

- COUNT(*): counts all rows including nulls

- COUNT(column): counts rows where column is not null

- COUNT(DISTINCT column): counts unique non-null values

SELECT

COUNT(*) AS total_orders,

COUNT(DISTINCT customer_id) AS unique_customers,

COUNT(discount_code) AS orders_with_discount -- nulls excluded

FROM orders

**SUM**: Adds all non-null values in a column. Returns null if all values are null.

SELECT SUM(order_amount) AS total_revenue FROM orders

**AVG**: Average of non-null values. Does not count nulls in the denominator — AVG over a column with 8 values and 2 nulls divides by 8, not 10.

**MIN / MAX**: Minimum and maximum non-null value. Works on numeric, date, and string columns (string MIN/MAX is lexicographic).

**STRING_AGG / LISTAGG**: Concatenates string values from multiple rows into one string:

- PostgreSQL/BigQuery: STRING_AGG(value, ', ' ORDER BY value)

- Snowflake: LISTAGG(value, ', ') WITHIN GROUP (ORDER BY value)

Useful for collapsing multiple category labels or tag values into a single column per group.

GROUP BY

GROUP BY collapses rows with the same values in the specified columns into one output row. Aggregate functions are then applied within each group.

SELECT

region,

product_category,

COUNT(*) AS order_count,

SUM(order_amount) AS revenue

FROM orders

GROUP BY region, product_category

ORDER BY revenue DESC

Rules:

- Every column in SELECT that is not an aggregate function must appear in GROUP BY

- The query returns one row per unique combination of GROUP BY columns

- NULL values are grouped together — two rows with NULL region are in the same group

**Grouping by expressions**:

SELECT

DATE_TRUNC('month', order_date) AS order_month,

SUM(order_amount) AS monthly_revenue

FROM orders

GROUP BY DATE_TRUNC('month', order_date)

You can group by expressions, not just column names. Some databases allow GROUP BY column alias; others require repeating the expression or using a positional reference (GROUP BY 1).

HAVING

WHERE filters rows before aggregation. HAVING filters groups after aggregation. The difference is critical: WHERE cannot reference aggregate functions; HAVING can.

-- Wrong: WHERE cannot reference aggregate functions

SELECT customer_id, COUNT(*) FROM orders WHERE COUNT(*) > 5 GROUP BY customer_id;

-- Correct: HAVING filters after aggregation

SELECT customer_id, COUNT(*) AS order_count

FROM orders

GROUP BY customer_id

HAVING COUNT(*) > 5

ORDER BY order_count DESC

HAVING conditions: any boolean expression involving aggregate functions, column names, or constants.

-- Customers with revenue between $1,000 and $50,000

SELECT

customer_id,

SUM(order_amount) AS total_revenue,

COUNT(*) AS order_count

FROM orders

GROUP BY customer_id

HAVING SUM(order_amount) BETWEEN 1000 AND 50000

ORDER BY total_revenue DESC

Performance note: WHERE filters happen before GROUP BY and reduce the rows that need to be aggregated — pushing filters to WHERE rather than HAVING is more efficient where possible.

GROUPING SETS, ROLLUP, and CUBE

Standard GROUP BY produces one level of aggregation. GROUPING SETS, ROLLUP, and CUBE extend this to produce multiple aggregation levels in one query.

**GROUPING SETS**: Specify multiple GROUP BY combinations explicitly. Each set produces separate output rows:

SELECT region, product_category, SUM(revenue) AS revenue

FROM orders

GROUP BY GROUPING SETS (

(region, product_category), -- by region and category

(region), -- by region only

() -- grand total

)

Equivalent to three separate GROUP BY queries UNION ALL'd together.

**ROLLUP**: Hierarchical aggregation from most to least granular. ROLLUP(year, quarter, month) produces year+quarter+month, year+quarter, year, and grand total groups.

SELECT

EXTRACT(YEAR FROM order_date) AS year,

EXTRACT(QUARTER FROM order_date) AS quarter,

SUM(revenue) AS revenue

FROM orders

GROUP BY ROLLUP(EXTRACT(YEAR FROM order_date), EXTRACT(QUARTER FROM order_date))

**CUBE**: All possible combinations of the specified columns. CUBE(a, b, c) produces groups for (a,b,c), (a,b), (a,c), (b,c), (a), (b), (c), and the grand total.

**GROUPING() function**: When NULL appears in a ROLLUP or CUBE output, it could mean either "a row where the actual value was NULL" or "an aggregate subtotal row." GROUPING(column) returns 1 for subtotal rows and 0 for actual-value rows.

Common Aggregation Mistakes

**Counting distinct when you mean count**: COUNT(*) on a table with joins that produce fanout over-counts. COUNT(DISTINCT order_id) after joining order_line_items back to orders gives the correct order count. Always think about whether the JOIN produces one-to-many cardinality before counting.

**AVG of averages**: Computing AVG of a per-group average is mathematically incorrect when groups have different sizes. The average of (10-person group average $50) and (100-person group average $30) is not ($50 + $30) / 2. Compute SUM/COUNT directly against the underlying rows.

**SUM of NULLs**: SUM over a column with all-null values returns NULL, not zero. Use COALESCE(SUM(column), 0) when you need zero instead of null for missing groups.

**NULL behaviour in COUNT**: COUNT(column) excludes nulls; COUNT(*) does not. Comparing COUNT(email) to COUNT(*) reveals the null rate in the email column — a useful data quality check pattern.

Our data architecture practice designs SQL-based transformation layers for enterprise analytics — contact us to discuss your analytics engineering requirements.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →