Tuesday 5 December 2017

Analytic Functions in Impala - III

We will now take a look at more analytic functions continuing in the series. We will follow the post that we wrote before. For all queries in this post, we will use the Cloudera sandbox, Cloudera QuickStart VM 5.12.

RANK as the name suggests ranks any measure in a query using sequential numbers starting from 1. If in the ranking process, a tie is encountered, then, the same rank is repeated for all values that tie and the next rank in sequence is used after skipping as many rank values as was repeated in the tie . Let us see that by means of an example. For starters, let us use the same hr.employees table used in the previous posts. We are interested only in the results of the below query:

select employee_id, last_name, department_id, salary from employees where employee_id <= 109 order by salary desc

The results returned are shown below:  















The query returns ten records. Let us now run the first query using RANK shown below:

select employee_id, last_name, department_id, salary, rank() over (order by salary desc) as SALARY_RANK from employees where employee_id <= 109

The results returned are shown below:  

















The results are as written above: a tie is seen at 17000, so there is a repeat of ranks for 17000 and at the next value of 12008, rank value 3 is skipped for one repeat of 17000 and a rank of 4 is returned. Same is true for the salary value 9000 where rank of 6 is skipped.

DENSE_RANK is similar to RANK except that in case a of tie, the rank values are not skipped. Let us replace RANK with DENSE_RANK in above query to see this contrast.

select employee_id, last_name, department_id, salary, dense_rank() over (order by salary desc) as SALARY_DENSE_RANK from employees where employee_id <= 109

The results returned are shown below:  

















Results show that DENSE_RANK does not skip any values at all and ranks are repeated at ties. 

Let us add the partition keyword and see the results. The query using partition is show below:

select department_id, last_name, salary, rank() over (partition by department_id order by salary desc) as SALARY_RANK from employees where employee_id <= 109

The results returned are shown below:

















The results show that the RANK function is applied department wise. The same can be tried using DENSE_RANK as shown below:

select department_id, last_name, salary, dense_rank() over (partition by department_id order by salary desc)  as SALARY_DENSE_RANK from employees where employee_id <= 109

The results returned are shown below:

















The results are clearly in line with our understanding of DENSE_RANK as is seen in the ranking of Lorentz.

The query for top 5 salaries as shown below:

select employee_id, last_name, department_id, salary from employees where employee_id <= 109 order by salary desc limit 5

The results returned are shown below:














We conclude the discussion on RANK with a query on PERCENT_RANK. PERCENT_RANK is defined as (RANK-1)/(ROWS_IN_GROUP-1). The query below validates the formula for PERCENT_RANK as well.

select  department_id, last_name, salary,
percent_rank() over (partition by department_id order by salary desc) as percent_rank, rank, rows_in_group,
(rank-1)/(rows_in_group -1) as calculated_percent_renk
from (select department_id, last_name, salary,
count(*) over (partition by department_id)  as rows_in_group,
rank() over (partition by department_id order by salary desc)  as rank
from employees where employee_id <= 109) x


The results returned are shown below:

 















The results look good.