«

»

Apr 04

Window Functions! RANK(), DENSE_RANK(), ROW_NUMBER(), NTILE(n)

Microsoft introduced ranking functions (also known as window functions) in SQL Server 2005. What are window functions exactly? First, they are awesome. Second, they are functions that can be applied to a group of rows as specified by the window descriptor (I. Ben-Gan, 2011). Itzik Ben-Gan has a great series on how to use these and the the new functions in SQL Server 2012. He also wrote an entire book on these! That’s how valuable they are…

So, in layman’s terms… We can apply a function to a group of rows as if they were the only results returned. This is done using, as mentioned above, a window descriptor. A sample from MSDN using AdventureWorks2008R2 database is below:

USE AdventureWorks2008R2;
GO SELECT p.FirstName, p.LastName
  ,ROW_NUMBER() OVER (ORDER BY a.PostalCode) AS 'Row Number'
  ,RANK() OVER (ORDER BY a.PostalCode) AS 'Rank'
  ,DENSE_RANK() OVER (ORDER BY a.PostalCode) AS 'Dense Rank'
  ,NTILE(4) OVER (ORDER BY a.PostalCode) AS 'Quartile'
  ,s.SalesYTD, a.PostalCode
FROM Sales.SalesPerson s
INNER JOIN Person.Person p
  ON s.BusinessEntityID = p.BusinessEntityID
INNER JOIN Person.Address a
  ON a.AddressID = p.BusinessEntityID
WHERE s.TerritoryID IS NOT NULL
  AND s.SalesYTD <> 0;

Ok, so what does this do? Let’s talk about the common clause of each function above: OVER(ORDER BY a.PostalCode)  This is the window descriptor, it is describing the set of rows to apply the function to and how to apply it. In this example it is applying the function to all rows, but in the order of the PostalCode column. In addition to the ORDER BY, a PARTITION BY can precede it to tell the function which group of rows to apply the function to. In the case of sales, you may want to use the functions to rank ALL of the sales reps within each territory.  To do this, the above code can be modified to include a PARTITION BY clause within each window descriptor:

USE AdventureWorks2008R2;
GO SELECT s.TerritoryID --added to show partition by TerritoryID
 , p.FirstName, p.LastName
 ,ROW_NUMBER() OVER (PARTITION BY s.TerritoryID ORDER BY a.PostalCode) AS 'Row Number'
 ,RANK() OVER (PARTITION BY s.TerritoryID ORDER BY a.PostalCode) AS 'Rank'
 ,DENSE_RANK() OVER (PARTITION BY s.TerritoryID ORDER BY a.PostalCode) AS 'Dense Rank'
 ,NTILE(4) OVER (PARTITION BY s.TerritoryID ORDER BY a.PostalCode) AS 'Quartile'
 ,s.SalesYTD, a.PostalCode
FROM Sales.SalesPerson s
INNER JOIN Person.Person p
 ON s.BusinessEntityID = p.BusinessEntityID
INNER JOIN Person.Address a
 ON a.AddressID = p.BusinessEntityID
WHERE s.TerritoryID IS NOT NULL
 AND s.SalesYTD <> 0;

So what is the difference between each of the ranking functions?

RANK()

According to BOL: RANK() “returns the rank of each row within the partition of a result set. The rank of a row is one plus the number of ranks that come before the row in question.”  This means that rank numbers will be assigned to each row in the window descriptor based on the value of the ORDER BY value.  Equal values will receive the same rank number, but rank number will still increase for each row in the set.  This is why you will see only 1 and 6 in the first example.

DENSE_RANK()

According to BOL: DENSE_RANK() “returns the rank of rows within the partition of a result set, without any gaps in the ranking. The rank of a row is one plus the number of distinct ranks that come before the row in question.” This means that rank numbers will be assigned to each row in the window descriptor based on the value of the ORDER BY value (just as in RANK).  Equal values will receive the same rank number (just as in RANK), but rank number will NOT increase for each row in the set.  This is why you will see only 1 and 2 in the first example.

ROW_NUMBER()

According to BOL: ROW_NUMBER() “returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition.” So, this is simple numbering from 1 to n of the rows defined in the window descriptor. If there is a partition clause, it will start the ordering of the next partition at 1 and resume incrementing until the end of the partition in the order defined.

NTILE(n)

According to BOL:NTILE(n) “Distributes the rows in an ordered partition into a specified number of groups. The groups are numbered, starting at one. For each row, NTILE returns the number of the group to which the row belongs.” In the above example, we told the function to split the partition into 4 groups.  The function will then assign a number 1-4 to the members of the group based on the ordering in the window descriptor. If there are not enough groups to meet n, it will increase until there are no more rows leaving n-totalGroups not included in the values of the result set.

A Great Utility of ROW_NUMBER()

Deleting true duplicates is something we have all experienced at one point or another.  The problem is that we have no easy way to tell SQL Server which of 1 or more duplicate rows to delete. Typical solutions have involved a cursor or many nested cursors to achieve this. Personally, if I can find a way to avoid a cursor I will use it.  Below is the script that I use to identify and delete duplicate rows:

WITH RowNumbers AS
(
SELECT columnA, columnB, columnC, columnD
  ,ROW_NUMBER()
    OVER (PARTITION BY  columnA, columnB, columnC, columnD ORDER BY columnA, columnB, columnC, columnD) AS rn
FROM tableA
)
DELETE FROM RowNumbers WHERE rn > 1

This is setting up my window descriptor to partition the groups based on all columns in this table so that true duplicates are in their own grouping. It then uses the ROW_NUMBER() function to assign a row number to each row in the group. If there are duplicates, there will be rows with a row number greater than 1.  If there are no duplicates of a row, it will be assigned a row number of 1.  Finally, delete all rows with a row number greater than 1! This script also uses a CTE or Common Table Expression to produce the result set that we will use for the delete.  CTEs will be discussed in a future post.

Leave a Reply