BlogData Engineering

SQL for Business Analysts: The Queries You Actually Use Every Day

Austin Duncan
Austin Duncan
Managing Director & Principal Data Architect
·March 28, 202712 min read

Most business analysts use 10% of SQL's capability 90% of the time. This guide covers the patterns that matter most for day-to-day analytical work — not just syntax, but the reasoning behind query construction and the common mistakes that produce subtly wrong results.

Most business analysts who use SQL daily know five or six patterns by heart and approximate everything else by trial and error. That works — but it leaves a class of common analytical problems that require awkward workarounds when the right SQL construct would be clean and fast.

This guide covers the patterns that come up constantly in analytical work, with the reasoning behind them rather than just the syntax. The goal is not SQL comprehensiveness — that is what documentation is for. The goal is the 20% of SQL that does 80% of analytical work.

Aggregation: Getting the Numbers You Want

GROUP BY is the foundation of analytical SQL. The rule: every column in the SELECT that is not inside an aggregate function (SUM, COUNT, AVG, MAX, MIN) must appear in the GROUP BY.

The most common beginner mistake is using COUNT(*) when you mean COUNT(DISTINCT column). COUNT(*) counts rows. COUNT(DISTINCT customer_id) counts unique customers. These are different numbers and both are useful in different contexts.

A frequently needed pattern: counts at different levels of aggregation in the same query. How many orders per customer, and what is the average across all customers? A subquery handles this:

SELECT

AVG(order_count) AS avg_orders_per_customer

FROM (

SELECT

customer_id,

COUNT(*) AS order_count

FROM orders

GROUP BY customer_id

) customer_orders

The inner query computes orders per customer. The outer query aggregates across those counts. This two-level aggregation pattern appears constantly in analytical work.

Filtering: WHERE vs HAVING

WHERE filters rows before aggregation. HAVING filters the aggregated results. The practical implication: you cannot use an alias defined in the SELECT clause in a WHERE filter (the alias does not exist yet when WHERE is evaluated), but you can use it in a HAVING clause.

Common mistake — filtering on an aggregate in WHERE:

-- This fails: WHERE cannot reference an aggregate

SELECT customer_id, COUNT(*) AS order_count

FROM orders

WHERE order_count > 5 -- Error

GROUP BY customer_id

Correct:

SELECT customer_id, COUNT(*) AS order_count

FROM orders

GROUP BY customer_id

HAVING COUNT(*) > 5 -- Correct: filters after aggregation

The execution order in SQL is: FROM → JOIN → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT. Understanding this order explains which clauses can reference which.

Joins: Getting the Right Records

The most important join decision is the cardinality of the relationship. Before writing any join, ask: for each row in the left table, how many rows in the right table can match? One? Many? Zero or one?

**One-to-one joins**: one customer in the customers table matches one row in the customer_attributes table. The result has the same number of rows as the left table (assuming no missing right-table records). If the row count changes after the join, the relationship is not one-to-one.

**One-to-many joins**: one customer matches many orders. Joining customers to orders multiplies the customer row for each matching order. Revenue from the customers table gets duplicated across all order rows — summing it after the join will overcount. This is called a fan-out, and it is the most common source of incorrect aggregations.

Detecting a fan-out: compare the row count before and after the join. If it increased beyond what you expected, you have a one-to-many join. Fix it by aggregating the "many" side before joining, or by doing the aggregation in a subquery and joining the result.

**Left join vs inner join**: inner join returns only rows that match in both tables. Left join returns all rows from the left table, with NULLs for right-table columns where no match exists. Use left join when you want to preserve all left-table records regardless of matching right-table records. Use inner join when you only want records that have a match.

A subtle trap: filtering on a right-table column in the WHERE clause of a left join implicitly converts it to an inner join. To filter the right table while preserving the left join semantics, put the filter in the ON clause:

-- This works like an inner join (null rows from the left join are excluded by WHERE):

SELECT o.*, p.payment_method

FROM orders o

LEFT JOIN payments p ON o.order_id = p.order_id

WHERE p.payment_method = 'credit_card' -- Excludes orders with no payment

-- This correctly keeps all orders, and filters which payments are joined:

SELECT o.*, p.payment_method

FROM orders o

LEFT JOIN payments p ON o.order_id = p.order_id AND p.payment_method = 'credit_card'

CTEs: Making Complex Queries Readable

Common Table Expressions (WITH clauses) allow you to name an intermediate query result and reference it multiple times. They do not change performance — the query engine usually materialises a CTE once (though this is implementation-dependent). Their value is in readability and organisation.

The pattern: break a complex query into named steps, each building on the previous.

WITH

active_customers AS (

SELECT customer_id

FROM customers

WHERE status = 'active' AND last_order_date >= CURRENT_DATE - INTERVAL '90 days'

),

customer_revenue AS (

SELECT

o.customer_id,

SUM(o.revenue) AS total_revenue,

COUNT(*) AS order_count

FROM orders o

INNER JOIN active_customers ac ON o.customer_id = ac.customer_id

WHERE o.order_date >= CURRENT_DATE - INTERVAL '365 days'

GROUP BY o.customer_id

)

SELECT

customer_id,

total_revenue,

order_count,

total_revenue / order_count AS avg_order_value

FROM customer_revenue

ORDER BY total_revenue DESC

Each CTE is a named step. The final SELECT is clean. A reader can follow the logic without holding the entire query in their head simultaneously.

Date Arithmetic

Date manipulation is an area where SQL syntax varies more across databases than almost anything else. The core concepts are consistent even when the syntax differs.

Truncating a date to a period (getting the first day of the month):

-- Snowflake, PostgreSQL:

DATE_TRUNC('month', order_date)

-- BigQuery:

DATE_TRUNC(order_date, MONTH)

Adding and subtracting intervals:

-- Most databases:

order_date + INTERVAL '7 days'

order_date - INTERVAL '1 year'

-- Snowflake:

DATEADD('day', 7, order_date)

Getting the current date:

CURRENT_DATE -- ANSI standard, works in most databases

GETDATE() -- SQL Server

NOW() -- PostgreSQL, MySQL

Computing days between two dates:

-- PostgreSQL:

order_date - first_order_date -- Returns integer (days)

-- Snowflake:

DATEDIFF('day', first_order_date, order_date)

-- BigQuery:

DATE_DIFF(order_date, first_order_date, DAY)

Always check your database's specific date function syntax. The concepts are universal; the function names are not.

CASE Statements: Conditional Columns

CASE is SQL's if-then-else. It appears constantly in analytical queries for categorisation, custom groupings, and traffic-light status fields.

Simple CASE (value comparison):

CASE status

WHEN 'shipped' THEN 'Fulfilled'

WHEN 'delivered' THEN 'Fulfilled'

WHEN 'cancelled' THEN 'Cancelled'

ELSE 'In Progress'

END AS fulfillment_status

Searched CASE (expression evaluation):

CASE

WHEN revenue >= 10000 THEN 'High Value'

WHEN revenue >= 1000 THEN 'Mid Value'

ELSE 'Low Value'

END AS customer_tier

CASE inside aggregates is common for conditional counting:

COUNT(CASE WHEN status = 'cancelled' THEN 1 END) AS cancellation_count

This counts only rows where status is 'cancelled'. Rows where the CASE returns NULL (the ELSE default) are not counted by COUNT.

NULL Handling

NULL is SQL's representation of missing or unknown values. It behaves differently from zero or empty string. NULL is not equal to anything — not even another NULL. NULL compared to any value with = returns NULL (which is treated as false in WHERE clauses).

Practical implications:

- WHERE column = NULL matches nothing. Use WHERE column IS NULL.

- In aggregates, NULL values are ignored. COUNT(column) counts non-NULL values; COUNT(*) counts rows.

- COALESCE(column, default_value) returns the first non-NULL value in its arguments. Use it to replace NULLs with a default: COALESCE(discount_amount, 0).

- String concatenation with NULL produces NULL in most databases. 'Order: ' || NULL evaluates to NULL, not 'Order: '.

Subqueries vs Joins

Subqueries and joins often produce the same result through different mechanisms. The choice affects readability and sometimes performance.

For filtering against a set of values, both work:

-- Subquery approach:

WHERE customer_id IN (SELECT customer_id FROM high_value_customers)

-- Join approach:

INNER JOIN high_value_customers USING (customer_id)

For modern query optimisers, these usually produce identical execution plans. Prefer whichever is more readable for your specific case.

Correlated subqueries (where the subquery references the outer query) are the exception — they can be slow because they execute once per row of the outer query. When a correlated subquery can be rewritten as a join or a window function, prefer the alternative.

Our data engineering practice supports teams building analytical SQL at scale — contact us to discuss your analytics infrastructure requirements.

Get your data architecture audit in 30 minutes.

A former Microsoft data architect audits your data foundation, identifies your top priorities, and sends you a written plan. Free. No pitch.

Book a Call →