Skip to main content

During my formation on Snowflake at Nimbus Intelligence our professor showed us a query that appeared to be nesting Group By and Partitioning in order to calculate a percentage. Morever, the query was using a total of three aggregate functions in the same line. In this post, we will analyze and understand how SQL handles these queries that may seem daunting at first sight.

Let’s first introduce some context. We will work on the Snowflake Sample Data TPC-H using schema TPCH_SF1. More specifically, we will be focusing on the following orders table.

orders(o_)type
orderkeyINT
custkeyINT
orderstatusVARCHAR
totalpriceDECIMAL
orderdateDATE
orderpriorityVARCHAR
clerkVARCHAR
shippriorityVARCHAR
commentVARCHAR
Columns and datatypes of the orders table in TPCH_SF1

Here is an example of the attributes that we will be looking at

USE SCHEMA snowflake_sample_data.tpch_sf1;

SELECT o_orderkey, o_clerk, o_orderpriority FROM orders LIMIT 6;
o_orderkeyo_clerko_orderpriority
1,800,001Clerk#0000003061-URGENT
1,800,002Clerk#0000008973-MEDIUM
1,800,003Clerk#0000001374-NOT SPECIFIED
1,800,004Clerk#0000004812-HIGH
1,800,005Clerk#0000004713-MEDIUM
1,800,006Clerk#0000006484-NOT SPECIFIED

Our aim is to find the percentage of each of the different order priorities ordered by each clerk. In essence, a table that looks like this

Clerk (Simplified)Order_Priority (Simplified)Percent
Clerk_1Urgent25
Clerk_1Normal40
Clerk_1Low35
Clerk_2Urgent10
Clerk_2Normal70
Clerk_2Low20

Prerequisites

  • Aggregation in SQL both with GROUP BY and PARTITION BY
  • Common Table Expressions

Standard/Legible approach

A standard approach would probably be to first obtain the total amount of orders of each clerk, so we need to group by clerk, save this in a CTE, and then find the amount of orders for each clerk and priority, so we need to group by clerk and order priority. Notice that we require two GROUP BY that use different attributes. This forces us (not really, as we will see in the next section) to use more than a single query to obtain the results.

The first query looks like this

SELECT DISTINCT
    o_clerk,
    COUNT(*) AS total_orders
FROM orders
GROUP BY o_clerk
ORDER BY o_clerk;  -- for legibility

and provides the following results

clerktotal_orders
Clerk#0000000011,467
Clerk#0000000021,494
Clerk#0000000031,538
Clerk#0000000041,502
Clerk#0000000051,503

Then, we store this table on a CTE and use it in a subsequent query where we GROUP BY o_clerk and o_orderpriority to count the amount of orders of each priority of each clerk

WITH clerk_orders AS (
    -- our previous query
    SELECT DISTINCT
        o_clerk,
        COUNT(*) AS total_orders
    FROM orders
    GROUP BY o_clerk
)
-- the new query to find the percentage of orders for each clerk and order priority
SELECT
    orders.o_clerk,
    orders.o_orderpriority,
    100 * COUNT(*) / clerk_orders.total_orders AS pctg_of_orders
FROM orders
INNER JOIN clerk_orders
    ON orders.o_clerk = clerk_orders.o_clerk
GROUP BY orders.o_clerk, orders.o_orderpriority, clerk_orders.total_orders
ORDER BY orders.o_clerk, orders.o_orderpriority; -- for legibility

Note that, besides grouping by orders.o_clerk and orders.o_orderpriority, we must add clerk_orders.total_orders so that SQL understands what to do with this column. The result, as expected, is the following

o_clerko_orderprioritypctg_of_orders
Clerk#0000000011-URGENT19.90
Clerk#0000000012-HIGH19.90
Clerk#0000000013-MEDIUM19.70
Clerk#0000000014-NOT SPECIFIED20.52
Clerk#0000000015-LOW19.97
Clerk#0000000021-URGENT19.75
Clerk#0000000022-HIGH19.54
Clerk#0000000023-MEDIUM20.21

Nesting Group By and Partitioning

The solution provided by our teacher was the following

SELECT
    o_clerk,
    o_orderpriority,
    100 * COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY o_clerk) AS pctg_of_orders
FROM orders
GROUP BY o_clerk, o_orderpriority;

As you can see, it is quite more compact than our previous solution. Note, however, that it is very hard to understand what is happening at first sight. There are three aggregate functions, two COUNT and a SUM, which are grouped using a PARTITION BY o_clerk and a GROUP BY o_clerk, o_orderpriority.

How does SQL know which functions to use for which groupings? In the nested SUM(COUNT(*)), is the COUNT referring to the PARTITION BY and the SUM to the GROUP BY, or the other way around? Which grouping function is taking care of the COUNT in the numerator?

Let’s dissect it and perform a couple of tests to find out!

Making sense of Nesting Group By and Partitioning

First of all, let’s see what partitioning over o_clerk produces

SELECT DISTINCT
    o_clerk,
    o_orderpriority,
    COUNT(*) OVER (PARTITION BY o_clerk) AS orders_by_clerk
FROM orders
ORDER BY o_clerk, o_orderpriority; -- for legibility
o_clerko_orderpriorityorders_by_clerk
Clerk#0000000011-URGENT1,467
Clerk#0000000012-HIGH1,467
Clerk#0000000013-MEDIUM1,467
Clerk#0000000014-NOT SPECIFIED1,467
Clerk#0000000015-LOW1,467
Clerk#0000000021-URGENT1,494
Clerk#0000000022-HIGH1,494

As expected, for every clerk it provides the total amount of orders. Therefore, we can already discard that the partition is taking care of either of the two COUNTs in the original query. In said COUNTs, we need the grouping by o_clerk and o_orderpriority in order to provide the correct results.

Just to confirm, grouping by o_clerk and o_orderpriority

SELECT
    o_clerk,
    o_orderpriority,
    COUNT(*) AS orders_by_clerk_and_priority
FROM orders
GROUP BY o_clerk, o_orderpriority
ORDER BY o_clerk, o_orderpriority;

provides

o_clerko_orderpriorityorders_by_clerk_and_priority
Clerk#0000000011-URGENT292
Clerk#0000000012-HIGH292
Clerk#0000000013-MEDIUM289
Clerk#0000000014-NOT SPECIFIED301
Clerk#0000000015-LOW293
Clerk#0000000021-URGENT295
Clerk#0000000022-HIGH292
Clerk#0000000023-MEDIUM302

Therefore, we can infer that the COUNTs are being taken care by the GROUP BY and the SUM by the partitioning.

At first it may seem weird and arbitrary, but it actually makes kind of sense. First, we must remember the order of operations of SQL (see table below). Knowing that window functions and partitions are taken care in the SELECT statement, we can see that GROUP BY occurs first, and, therefore, it must take care of the “first level” of aggregate functions.

OrderClause
1FROM
2WHERE
3GROUP BY
4HAVING
5SELECT
6ORDER BY
7LIMIT
Order of operations in SQL queries

In the numerator, there is only a COUNT, so it calculates it by the specified grouping (o_clerk and o_orderpriority). In the denominator there are two nested functions, SUM(COUNT(*)). Evidently, the COUNT must be calculated before performing the SUM (meaning that COUNT is the “first level” function here) and, indeed, the GROUP BY takes care of it.

Then, once we reach the SELECT statement, the PARTITION BY receives a column that can be thought of something similar to

orders_by_clerk_and_priority divided by the sum of orders_by_clerk_and_priority over clerks.
$$
\frac{orders_by_clerk_and_priority}{\textrm{SUM}(orders_by_clerk_and_priority)}
$$

where orders_by_clerk_and_priority are the values previously calculated by the GROUP BY. Therefore, the PARTION BY simply sees a SUM that must be calculated for every clerk, providing in the denominator the total amount of orders of each clerk.

Summary

In this blog, we have made sense of the nesting Group By and Partitioning query that initially seemed complicated and intimidating by approaching the problem with a “divide and conquer” mindset. As a result, we have added to our toolset a new method for performing queries that require several GROUP BYs with different aggregating keys. This method of nesting Group By and Partitioning lets us write much more compact queries that avoid using CTEs.

Auteur