Ranking functions are one of the most commonly used Window functions in Data analysis these functions provide you with the tools to assign ranks, orders, and positions to your data, transforming an undifferentiated collection of rows into a structured and intelligible narrative.
Whether you're dealing with sales figures, exam scores, employee performance, or any other dataset, ranking functions enable you to answer crucial questions such as: "Who are the top performers?", "What's the distribution of scores?", and "How does each element compare to others?
In this exploration of ranking functions in SQL, we'll dive into the enchanting world of ROW_NUMBER(), RANK(), DENSE_RANK(), and NTILE(). By the time we're done, you'll not only understand how to apply these functions practically but also appreciate how they add depth and structure to the narratives hidden within your datasets. Let's Start.
Must have Parameter:
If you have read my previous blog on window framing (If not here), we saw that we used partition by
and order by
in Over()
function to create the windows. In some window functions partition by
and order by
in Over()
are optional but that is not the case with ranking functions.
One of the key things to remember when using any ranking function is that we must always have a order by
in Over()
function. It's not optional in the ranking function. Think through it how can you rank something when you don't know what order they follow?
so yes order by
in Over()
function is mandatory.
The key difference.
ROW_NUMBER(), RANK(), and DENSE_RANK() are all ranking window functions in SQL, but they have distinct behaviors when assigning ranks to rows in a result set. Here's a breakdown of their key differences:
ROW_NUMBER():
Assigns a unique sequential integer to each row in the result set based on the specified ordering.
If two rows have the same values, they will receive different row numbers.
There are no gaps or ties in row numbers, and each row gets a unique number.
ROW_NUMBER() does not account for duplicate values; each row is uniquely numbered.
RANK():
Assigns a unique rank to each distinct row based on the specified ordering.
If two rows have the same values, they will receive the same rank, and the next rank will be skipped.
RANK() may result in gaps in rank values, especially when there are tied values.
For example, if two rows tie for second place, the next row will be assigned a rank of 4 (skipping 3).
DENSE_RANK():
Assigns a unique rank to each distinct row based on the specified ordering.
If two rows have the same values, they will receive the same rank, and the next rank will not be skipped.
DENSE_RANK() does not result in gaps in rank values, even when there are tied values.
For example, if two rows tie for second place, the next row will be assigned a rank of 3 (not skipping any ranks).
Example of ranking function in SQL
without partition by
Let's consider the below dataset that showcases 15 fictional employee records, each with an assigned EmployeeID, name, department, and salary.
Let's try to understand the salaries of the employees, will order the table by EmployeeSalary
in descending order and will see how the different ranking functions work.
SELECT * ,
ROW_NUMBER() OVER(ORDER BY EmployeeSalary DESC) AS row_number,
RANK() OVER(ORDER BY EmployeeSalary DESC) AS rank,
DENSE_RANK() OVER(ORDER BY EmployeeSalary DESC) AS dense_rank
FROM EmployeeData;
Output:
Let's break down the output columns:
row_number:
Assigns unique sequential numbers to each row.
No skipping or handling of ties; each row gets a distinct number.
rank:
Assigns the same rank to tied rows and skips the next rank.
Creates gaps in rank values for tied rows.
dense_rank:
Assigns the same rank to tied rows without skipping ranks.
Maintains a continuous sequence of rank values.
with partition by
Let's try to understand the salaries of the employees with respect to the department wise, will create a window framing with the help of partition by
on the Department
column and will order the table by EmployeeSalary
in descending order and will see how the different ranking functions work.
SELECT * ,
ROW_NUMBER() OVER(PARTITION BY Department ORDER BY EmployeeSalary DESC) AS row_number,
RANK() OVER(PARTITION BY Department ORDER BY EmployeeSalary DESC) AS rank,
DENSE_RANK() OVER(PARTITION BY Department ORDER BY EmployeeSalary DESC) AS dense_rank
FROM EmployeeData;
output:
The PARTITION BY
clause divides the data into departments, and ORDER BY
Orders the employee based on their salaries in descending order in each department group.
Pay closer attention to doted boxes in each partition group you will see a distinct change in the rank
and dense_rank
columns.
NTILE function.
The NTILE()
function divides the dataset into a specified number of roughly equal parts (or "tiles") and assigns a tile number to each row based on the specified ordering. This function is particularly useful for distributing data evenly and creating percentiles or quartiles within the result set.
Before we understand the NTILE()
function we need to understand how tiles are created, for that let's try to understand this small scene.
Tales of Tiles: Dividing Data Fairly and Clearly
Imagine you have a basket of apples, and you want to divide them into small baskets. These small baskets are like the "tiles" we're talking about in your data.
Equal Number of Rows in Each Tile: When you have a certain number of apples that can be easily divided by the number of small baskets you want, each small basket will have the same number of apples. It's like if you have 12 apples and want 3 small baskets, you can put 4 apples in each basket. Everything fits perfectly, and each basket has the same amount.
Different Number of Rows in Each Tile: Now, imagine you have a different number of apples. Let's say you have 15 apples and still want 2 small baskets. You can't divide them equally anymore. But what you can do is try to make the baskets as even as possible. Some baskets might have a bit more, and some might have a bit less, but you still try to distribute the apples fairly across the baskets.
In your data, the same thing happens. Sometimes you have a nice, even distribution of rows for each "tile," but other times the numbers don't match up perfectly. In both cases, the goal is to create these "tiles" in a way that's organized and helps you see patterns in your data.
So, when using functions like NTILE()
, you'll either have tiles with an equal number of rows, just like the apples evenly divided, or you'll have tiles where the rows are distributed as evenly as possible, even if they're not the same in every tile.
NTILE Function with Equal Number of Rows in Each Tile:
SELECT *,
NTILE(3) OVER(ORDER BY EmployeeSalary DESC) AS tiles
FROM EmployeeData;
output:
Here's how the NTILE()
the function behaves for the given dataset:
Rows are ordered by
EmployeeSalary
in descending order.The dataset is divided into 3 tiles based on the order of salaries.
Each row is assigned a tile number (1, 2, or 3) indicating its position within the divided dataset.
NTILE Function with Odd Number of Rows in Each Tile:
SELECT *,
NTILE(4) OVER(ORDER BY EmployeeSalary DESC) AS tiles
FROM EmployeeData;
output:
Here's how the NTILE()
function behaves when you can't divide the rows into equal tiles:
Rows are ordered by
EmployeeSalary
in descending order.The dataset is divided into 4 tiles based on the order of salaries.
The function aims to distribute the rows as evenly as possible across the tiles, considering the total number of rows, and then equally distributes the remaining rows from top most tiles.
Conclusion :
In the world of data, ranking functions are like storytellers. They give order to chaos and let us see who's on top, where we stand, and how things are spread out. Whether it's numbers or people, these functions turn data into understandable tales. Remember, ranking needs order, and different ranking ways give different stories. So, embrace these tools, explore your data's adventures, and let rankings guide you to the insights you've been looking for. In an upcoming blog post will learn more about Aggregate window functions. I hope you enjoyed reading this blog.
The entire code of the blog can be found here.
THANK YOU !!!!!!
Happy Analysis !!!!