Skip to main content

Ever wonder how to easily look-up a corresponding column of a maximum value in Snowflake? This blog will introduce you to the Snowflake MAX_BY function. I originally stumbled across the function while trying to answer “Who is the clerk with the most orders?” from the Snowflake sample dataset.

Finding a maximum or minimum value

A large part of the learning process at Nimbus Academy is dedicated to Snowflake, and one of the best ways to learn a new tool is by playing around with it. Snowflake offers a sample dataset that is accessible to every account. Many business questions can be answered by querying this data, such as “Who is the clerk with the most orders?”. As practice we worked individually for two days to answer these questions, and hone our newly acquired skills in Snowflake. 

These business questions are often related to a maximum or minimum value. We were, of course, aware of the MAX function that is a basic built-in function of Snowflake. However, I only found out about about the new Snowflake MAX_BY function around lunchtime the second day! Snowflake introduced this function as part of the 7.1 release in January 2023. Instead of returning the maximum value of the expression, it can return the corresponding value of a different column. If I want to answer the question “Who is the clerk with the most orders?”, I am not actually looking for the value of most orders. I just want the identifier of the clerk. The MAX_BY function makes it possible to very easily look-up this value. 

Code with MAX function

Let’s look at the way I originally queried the data to get the answer for this question.

SELECT      clerk_id AS clerk_mvp
FROM     	orders_per_clerk
WHERE    	1=1
   	 	    AND num_orders = (SELECT MAX(num_orders) FROM orders_per_clerk);

We make use of a table called orders_per_clerk that has the count of orders for each clerk_id. We use a select statement to identify that we are looking for the clerk_id. Since we specify in the WHERE statement that we only want to select the rows in which the count is equal to the maximum value, we indeed get the result that we want. Lastly, we rename this particular clerk to our most valuable player (MVP).

Result of select max function
Result of SELECT MAX function

Seasoned Snowflake users have probably seen queries like these frequently, and can easily deduct what the result tells us. For more novice users like myself, however, a subquery like this can look complicated. I found myself longing for a query that is more elegant and legible.

Code with the Snowflake MAX_BY function

When we use the new MAX_BY function, the query is more straightforward.

SELECT	    MAX_BY(clerk_id, num_orders) AS clerk_mvp
FROM     	orders_per_clerk
WHERE    	1=1;

This looks way better! We use the same table orders_per_clerk as in the original query. We then select the clerk_id by the maximum number of orders and rename it to clerk_mvp. The result of the query is the same. Clerk 542 is still our MVP in terms of most orders sold.

Result of the max_by function
Result SELECT MAX_BY function

We definitely know the syntax has changed, so how about the semantics? Are both queries logically equivalent?

Lifesaver or booby trap?

Unfortunately, this new query is not a one-on-one replacement for the original query, thus not logically equivalent. What happens if there is a tie between two clerks? The default setting of this function will only return one value. At the moment there is still a bug that prevents you from specifying the number of results this function should return (see this blog for more details). Even if Snowflake fixes this error, we would have to know before we run the query that there is a shared rank and we are therefore looking for multiple returned rows. But how many rows am I looking for exactly? What if there are ten employees with the same number of orders? In contrast, the original query will just give me every clerk ID that is selected for the maximum value – even if we have a hundred employees that all have the maximum value of orders made.

Summary

My love of making queries as short and readable as possible gave me a small eureka moment when I first found the Snowflake MAX_BY function. After some investigation, I am a little wary because the result of the function does not always tell the whole story. The function will nonetheless go into my Snowflake toolbox, but I will be using it with caution. Accuracy and completeness are more important in my opinion than shorter and more readable queries. 

Auteur