Window functions allow users to perform calculations across a set of rows that are somehow related to the current row. This can include calculations like running totals, moving averages, and ranking without the need to group the entire query into one aggregate result.
Despite their flexibility, window functions are sometimes underutilised, either because users are unaware of them or because they’re considered too complex for everyday tasks. Learning how to effectively use window functions can improve the efficiency and readability of SQL queries, particularly for reporting and data analysis purposes. This article will explore several use cases.
Numbering Rows
The simplest application area for window functions is the numbering of rows. The ROW_NUMBER()
function assigns a unique number to each row within the partition of a result set. The numbering is sequential and starts at 1. It’s useful for creating a unique identifier for rows within a partition, even when the rows are identical in terms of data.
Consider the following database table of library checkouts:
book | checkout_date | member_id |
The Great Adventure | 2024-02-15 | 102 |
The Great Adventure | 2024-01-10 | 105 |
Mystery of the Seas | 2024-01-20 | 103 |
Mystery of the Seas | 2024-03-01 | 101 |
Journey Through Time | 2024-02-01 | 104 |
Journey Through Time | 2024-02-18 | 102 |
We want to assign a unique row number to each checkout instance for every book, ordered by the checkout date to analyze the circulation trend:
SELECT
book,
checkout_date,
member_id,
ROW_NUMBER() OVER (PARTITION BY book ORDER BY checkout_date) AS checkout_order
FROM library_checkouts;
The result:
book | checkout_date | member_id | checkout_order |
The Great Adventure | 2024-01-10 | 105 | 1 |
The Great Adventure | 2024-02-15 | 102 | 2 |
Mystery of the Seas | 2024-01-20 | 103 | 1 |
Mystery of the Seas | 2024-03-01 | 101 | 2 |
Journey Through Time | 2024-02-01 | 104 | 1 |
Journey Through Time | 2024-02-18 | 102 | 2 |
Ranking
In the context of SQL and specifically regarding window functions, “ranking” refers to the process of assigning a unique position or rank to each row within a partition of a result set based on a specified ordering.
The RANK()
function provides a ranking for each row within a partition, with gaps in the ranking sequence when there are ties. It’s useful for ranking items that have the same value.
Consider the following database table of scores in a game tournament:
player | game | score |
Alice | Space Invaders | 4200 |
Bob | Space Invaders | 5700 |
Charlie | Space Invaders | 5700 |
Dana | Donkey Kong | 6000 |
Eve | Donkey Kong | 4800 |
Frank | Donkey Kong | 6000 |
Alice | Asteroids | 8500 |
Bob | Asteroids | 9300 |
Charlie | Asteroids | 7600 |
We want to rank the players within each game based on their score, with gaps in rank for ties:
SELECT
player,
game,
score,
RANK() OVER (PARTITION BY game ORDER BY score DESC) AS rank
FROM scores;
The result looks like this:
player | game | score | rank |
Bob | Space Invaders | 5700 | 1 |
Charlie | Space Invaders | 5700 | 1 |
Alice | Space Invaders | 4200 | 3 |
Dana | Donkey Kong | 6000 | 1 |
Frank | Donkey Kong | 6000 | 1 |
Eve | Donkey Kong | 4800 | 3 |
Bob | Asteroids | 9300 | 1 |
Alice | Asteroids | 8500 | 2 |
Charlie | Asteroids | 7600 | 3 |
If you don’t want to have gaps in the ranking sequence when there are ties, you can substitute DENSE_RANK()
for RANK()
.
Cumulative Sum
The SUM()
function can be used as a window function to calculate the cumulative sum of a column over a partition of rows.
Example: We are tracking our garden’s vegetable harvest in a database table, and we want to calculate the cumulative yield for each type of vegetable over the harvesting season.
vegetable | harvest_date | yield_kg |
Carrots | 2024-06-18 | 10 |
Carrots | 2024-07-10 | 15 |
Tomatos | 2024-06-15 | 20 |
Tomatos | 2024-07-01 | 30 |
Tomatos | 2024-07-20 | 25 |
Zucchini | 2024-06-20 | 15 |
Zucchini | 2024-07-05 | 20 |
We calculate the running total (cumulative yield) for each vegetable type as the season progresses, using the SUM()
function:
SELECT
vegetable,
harvest_date,
yield_kg,
SUM(yield_kg) OVER (PARTITION BY vegetable ORDER BY harvest_date ASC) AS cumulative_yield
FROM garden_harvest;
Now we can see which vegetables are most productive and how yield accumulates throughout the season:
vegetable | harvest_date | yield_kg | cumulative_yield |
Carrots | 2024-06-18 | 10 | 10 |
Carrots | 2024-07-10 | 15 | 25 |
Tomatos | 2024-06-15 | 20 | 20 |
Tomatos | 2024-07-01 | 30 | 50 |
Tomatos | 2024-07-20 | 25 | 75 |
Zucchini | 2024-06-20 | 15 | 15 |
Zucchini | 2024-07-05 | 20 | 35 |