SQL windows functions, with examples
window functions helps see data in beautiful ways
What are window functions?
SQL has many functional capabilities to analyze data. One of them is Window functions.Window functions can be used to create new columns — partitioned over an existing column i.e. over a group, or in other words, over a Window. The new columns are usually created to do calculations across series of records inside the Window — this window could be over a team, a person, a product type, a geographical location — something that has many records under it. These calculation set includes — sum, avg, count, row_number, rank, dense_rank, lag, lead. We will only cover few of them here to avoid too much information to digest.
Can you showcase a few examples?
Let’s use an example, pick a data set, and go through different things we can do with windows. We have a table purchases.
table name: purchasesid, puchaseDate, purchaseAmountDollars12, 2022–01–01, 512, 2022–01–02, 1012, 2022–01–05, 40
The table purchases contains id of a person, purchase date for transactions, and the purchase amount.
category 1: Ranking functions
As a simple first step, we can assign a row number (an index) to each transaction for each person. Below is the code for it.
SELECT ID, puchaseDate, purchaseAmount, row_number() over(partition by id order by purchaseDate asc) AS purchaseNumber FROM purchases
Look how row_number() and over( partition by ..) are used here. over( partition by ..) is what you will use for partitioning data, in other words, on deciding a goup — a Window, over which, you want to have your calculation done. In this case, that calculation is row_number() — you want to rank records within each Window i.e. within each person (id column).
The Result:id, puchaseDate, purchaseAmountDollars, purchaseNumber12, 2022–01–01, 5, 112, 2022–01–02, 10, 212, 2022–01–05, 40, 3
In addition to row_number() function, SQL offers rank() and dense_rank() function — both of those assign a rank based on the value of the column stated in the order by clause. row_number() did not do that — it just assigned a rank — based on row position — it did not look at any column value.
There is also percent_rank() function that generates the percentile rank i.e. in terms of percentage. This is helpful when you want to answer a question like — give me top top 10% purchases. There is tile() rank where you can specify the number of buckets you want to rank, example: ntile(10) will give you 10 buckets in rank.
category 2: get the previous or next value
Now let’s change gears a little bit and try something a little more complicated.
Let’s say we want to find the largest increase in purchase amount, between two consecutive shopping sessions, for across all customers. We want to find out what the largest increase in purchase amount was over consecutive transactions?
In order to get that, we need to find the difference between all consecutive transactions and then pick the maximum one. First we use over(partition by ..) to first create a window for each person (id column). Then notice how we use order by to sort all records within each window.
SELECT ID, purchaseAmount, LAG(purchaseAmount, 1, Null) OVER (PARTITION BY ID
ORDER BY date ASC) AS previousPurchaseAmount
FROM purchases
Now to get the previous purchase amount for each purchase, we use what’s the lag() function. This function gets the previous value of a column for each row in the window. LAG(purchaseAmount, 1, Null) — says — get me previous purchaseAmount column value, if it does not exist, give me Null. Notice here the seconds argument is “1” — which means — give me the record 1 position behind. If this was 2, it would give the record 2 positions behind each row.
SELECT MAX(increaseInPruchaseAmount) as maxIncrease FROM(
SELECT (purchaseAmount - previousPurchaseAmount) as increaseInPruchaseAmount FROM
(SELECT ID, purchaseAmount, LAG(purchaseAmount, 1, Null) OVER (PARTITION BY ID
ORDER BY date ASC) AS previousPurchaseAmount
FROM transactions) withPreviousAmount
) withIncreaseAmountRESULT:maxIncrease
40
Similarly, you could do lead() instead of lag(). lead() gets you the next row value. This is useful when you want get the next value in a series of rows.
category 3: running Aggregates
Let’s use one of the aggregate functions — sum(). sum() can be used to calculate rolling sum of values, of a column, over a Window. Let’s find the sum of all purchases for a person.
SELECT ID, SUM(purchaseAmount) OVER (PARTITION BY ID
ORDER BY date ASC) AS runningSumPurchaseAmount
FROM purchasesThe Result:id, puchaseDate, runningSumPurchaseAmount12, 2022–01–01, 512, 2022–01–02, 1512, 2022–01–05, 40, 3, Y
Similarly you could do running average — avg() or running count — count().
Conclusion and parting thoughts
Above are some ways in which you can use the concept of Windows and it’s functions in SQL. There will be situtation where you might need them. A good way to remember them is the visualize them and remember some example use cases.