Window functions are a group of functions that perform calculations across a set of rows related to the current row. They are considered advanced SQL and come up often in data science interviews. They're also used a lot at work to solve many different types of problems. Let's summarize the 4 types of window functions and cover why and when you'd use them.
4 Types of Window Functions
1. Regular aggregate functions
o These are aggregates like AVG, MIN/MAX, COUNT, SUM
o You'll want to use these to aggregate your data and group it by another column like month or year
2. Ranking functions
o ROW_NUMBER, RANK, DENSE_RANK
o These are functions that help you rank your data. You can either rank your entire dataset or rank it within groups, such as by month or country
o Extremely useful for generating ranking indexes within groups
3. Generating statistics
o Functions like NTILE are great if you need to generate simple statistics (percentiles, quartiles, medians)
o You can use this for your entire dataset or by group
4. Handling time series data
o A very common use of window functions, especially if you need to calculate trends like month-over-month growth or a rolling average
o LAG and LEAD are the two functions that allow you to do this.
1. Regular aggregate function
Regular aggregate functions are functions like average, count, sum, and min/max that are applied to columns. Use them as window functions when you want to apply an aggregation to different groups in the dataset, such as each month.
This is similar to the type of calculation that can be done with an aggregate function in the SELECT clause. But unlike regular aggregate functions, window functions do not group several rows into a single output row. Each row retains its own identity, and the aggregate value is attached to every row in its group.
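To make the contrast concrete, here is a minimal sketch that runs both styles of query side by side. It uses Python's built-in sqlite3 module (assuming the bundled SQLite is 3.25 or newer, which is when window functions were added). The requests table and its three rows are made up purely for illustration; only the column names are borrowed from the example discussed in this section.

```python
import sqlite3

# Hypothetical table and values, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (request_mnth TEXT, dist_to_cost REAL)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?)",
    [("2020-01", 2.0), ("2020-01", 4.0), ("2020-02", 10.0)],
)

# GROUP BY collapses each month into a single output row.
grouped = conn.execute("""
    SELECT request_mnth, AVG(dist_to_cost)
    FROM requests
    GROUP BY request_mnth
    ORDER BY request_mnth
""").fetchall()

# The window version keeps every input row and attaches the month average to it.
windowed = conn.execute("""
    SELECT request_mnth,
           dist_to_cost,
           AVG(dist_to_cost) OVER (PARTITION BY request_mnth) AS avg_dist_to_cost
    FROM requests
    ORDER BY request_mnth, dist_to_cost
""").fetchall()

print(grouped)   # one row per month
print(windowed)  # one row per request, each carrying its month's average
```

The grouped query returns two rows for three inputs, while the windowed query returns all three rows, which is what lets you compare each row against its group average afterwards.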
Let's take a look at one example of an avg() window function implemented to answer a data analytics question. You can view the question and write code in the link below:
This is a perfect example of applying avg() as a window function over a month group. Here we're trying to calculate the average distance per dollar by month. This is hard to do in SQL without a window function. We've applied the avg() window function to produce the 3rd column, which holds the month-year average on every row that belongs to that month-year. We can then use this metric to calculate the difference between the month average and the value for each request date in the table.
The code to implement the window function would look like this:
SELECT a.request_date,
       a.dist_to_cost,
       AVG(a.dist_to_cost) OVER(PARTITION BY a.request_mnth) AS avg_dist_to_cost
FROM
  (SELECT *,
          to_char(request_date::date, 'YYYY-MM') AS request_mnth,
          (distance_to_travel/monetary_cost) AS dist_to_cost
   FROM uber_request_logs) a
ORDER BY request_date
2. Ranking Functions
Ranking functions are an important utility for a data scientist. You're always ranking and indexing your data to better understand which rows are the best in your dataset. SQL window functions give you 3 ranking utilities -- RANK(), DENSE_RANK(), ROW_NUMBER() -- depending on your exact use case. These functions will help you list your data in order and in groups based on what you desire.
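The three functions only differ in how they treat ties, which is easiest to see on a tiny dataset. The sketch below uses Python's built-in sqlite3 module (assuming SQLite 3.25 or newer) and a hypothetical salaries table with made-up values, where two employees share the same salary.

```python
import sqlite3

# Hypothetical table and values, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (employee TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO salaries VALUES (?, ?)",
    [("a", 300), ("b", 200), ("c", 200), ("d", 100)],
)

ranked = conn.execute("""
    SELECT employee,
           salary,
           -- ROW_NUMBER always counts 1, 2, 3, ...; employee breaks the tie deterministically
           ROW_NUMBER() OVER (ORDER BY salary DESC, employee) AS row_num,
           -- RANK gives ties the same rank and then skips the next rank
           RANK()       OVER (ORDER BY salary DESC) AS rnk,
           -- DENSE_RANK gives ties the same rank without leaving a gap
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk
    FROM salaries
    ORDER BY salary DESC, employee
""").fetchall()

for row in ranked:
    print(row)
```

With the tie at 200, ROW_NUMBER yields 1, 2, 3, 4, RANK yields 1, 2, 2, 4, and DENSE_RANK yields 1, 2, 2, 3. Which one you want depends on whether tied rows should share a position and whether gaps in the ranking matter for your filter.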
Let's take a look at one ranking window function example to see how we can rank data within groups using SQL window functions. Follow along interactively with this link: platform.stratascratch.com/coding-question?id=9898&python=
Here we want to find the top 3 salaries in each department. We can't just take the top 3 salaries without a window function, because that would give us the top 3 salaries across all departments combined. Instead, we need to rank the salaries within each department individually. This is done with rank(), partitioned by department. From there, it's really easy to filter for the top 3 in every department.
Here's the code to output this table. You can copy and paste in the SQL editor in the link above and see the same output.
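SELECT department, salary
FROM
  (SELECT department, salary,
          RANK() OVER (PARTITION BY a.department
                       ORDER BY a.salary DESC) AS rank_id
   FROM
     (SELECT department, salary
      FROM employee  -- placeholder: the table name is missing here, use the table from the linked question
      GROUP BY department, salary
      ORDER BY department, salary) a
   ORDER BY department, salary DESC) b
WHERE rank_id < 4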
3. Generating statistics
NTILE is a very useful function for those in data analytics, business analytics, and data science. Often, when dealing with statistical data, you need to create robust statistics such as quartiles, quintiles, medians, and deciles in your daily job, and NTILE makes it easy to generate these outputs.
NTILE takes the number of bins (basically, how many buckets you want to split your data into) as its argument and divides your ordered rows into that many roughly equal-sized bins. You set how the data is ordered and, if you want additional groupings, how it is partitioned.
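As a small illustration of the bucketing, the sketch below splits each state's rows into two bins and keeps only the top bin. It uses Python's built-in sqlite3 module (assuming SQLite 3.25 or newer). The claims table and its scores are hypothetical, and NTILE(2) is used instead of a larger bin count only so the result is easy to check by eye.

```python
import sqlite3

# Hypothetical table and values, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (state TEXT, fraud_score INTEGER)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?)",
    [("CA", 10), ("CA", 20), ("CA", 30), ("CA", 40),
     ("NY", 5), ("NY", 15), ("NY", 25), ("NY", 35)],
)

# The window function is computed in a subquery, because its result
# cannot be referenced in the WHERE clause of the same SELECT.
top_half = conn.execute("""
    SELECT state, fraud_score, bucket
    FROM (
        SELECT state,
               fraud_score,
               NTILE(2) OVER (PARTITION BY state
                              ORDER BY fraud_score DESC) AS bucket
        FROM claims
    ) AS a
    WHERE bucket = 1
    ORDER BY state, fraud_score DESC
""").fetchall()

for row in top_half:
    print(row)
```

Each state is bucketed independently, so the two highest scores from CA and the two highest from NY survive the filter, even though every NY score is lower than its CA counterpart.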
In this example, we'll learn how to use NTILE to categorize our data into percentiles. You can follow along interactively in the link here: platform.stratascratch.com/coding-question?id=10303&python=
What you're trying to do here is identify the top 5 percent of claims based on a score an algorithm outputs. But you can't just order by the score and take the top 5%, because you want the top 5% within each state. One way to do this is to use the NTILE() window function and PARTITION BY state. You can then apply a filter in the WHERE clause of an outer query to keep the top 5%.
Here's the code to output the entire table above. You can copy and paste it in the link above.
SELECT *
FROM
  (SELECT *,
          NTILE(100) OVER(PARTITION BY state
                          ORDER BY fraud_score DESC) AS percentile
   FROM fraud_score) a
WHERE percentile <= 5
4. Handling time series data
LAG and LEAD are two window functions that are useful for dealing with time series data. The only difference between LAG and LEAD is whether you want to grab from previous rows or following rows, almost like sampling from previous data or future data.
You can use LAG and LEAD to calculate month-over-month growth or rolling averages. As a data scientist and business analyst, you're always dealing with time series data and creating those time metrics.
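Here is a minimal sketch of both functions on a three-row time series, with a growth percentage derived from the lagged value. It uses Python's built-in sqlite3 module (assuming SQLite 3.25 or newer). The yearly_hosts table and its numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import sqlite3

# Hypothetical table and values, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yearly_hosts (year INTEGER, hosts INTEGER)")
conn.executemany(
    "INSERT INTO yearly_hosts VALUES (?, ?)",
    [(2018, 100), (2019, 150), (2020, 120)],
)

series = conn.execute("""
    SELECT year, hosts, prev_hosts, next_hosts,
           -- multiply by 100.0 to force floating-point division
           ROUND((hosts - prev_hosts) * 100.0 / prev_hosts) AS pct_growth
    FROM (
        SELECT year,
               hosts,
               LAG(hosts, 1)  OVER (ORDER BY year) AS prev_hosts,  -- value from the previous row
               LEAD(hosts, 1) OVER (ORDER BY year) AS next_hosts   -- value from the following row
        FROM yearly_hosts
    ) AS t
    ORDER BY year
""").fetchall()

for row in series:
    print(row)
```

The first year has no previous row, so LAG returns NULL (None in Python) and the growth for that year is NULL as well. The same happens with LEAD on the last year.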
In this example, we want to find the percentage growth year-over-year, which is a very common question that data scientists and business analysts answer on a daily basis. The problem statement, data, and SQL editor are in the following link if you want to try to code the solution on your own: platform.stratascratch.com/coding-question?id=9637&python=
What's hard about this problem is how the data is set up: you need to use the previous row's value in your metric. But a plain SQL expression can only combine values that sit on the same row. So we can use the lag() or lead() window function, which takes a value from a previous or subsequent row and puts it on your current row, which is exactly what this question needs.
Here's the code to output the entire table above. You can copy and paste the code in the SQL editor in the link above:
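SELECT year, current_year_host, prev_year_host,
       round(((current_year_host - prev_year_host)/(cast(prev_year_host AS numeric)))*100) estimated_growth
FROM
  (SELECT year, current_year_host,
          LAG(current_year_host, 1) OVER (ORDER BY year) AS prev_year_host
   FROM
     (SELECT extract(year FROM host_since::date) AS year,
             count(*) AS current_year_host  -- assumed: the counting expression is missing from the original snippet
      FROM hosts  -- placeholder: the table name is missing here, use the table from the linked question
      WHERE host_since IS NOT NULL
      GROUP BY extract(year FROM host_since::date)
      ORDER BY year) t1) t2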