Wednesday, 6 December 2017

Analytic Functions in Impala - VI

In the last segment of the series, we will be looking at analytic functions ROW_NUMBER, NTILE and CUME_DIST in Impala. For all queries in this post, we will use the Cloudera sandbox, Cloudera QuickStart VM 5.12.

ROW_NUMBER returns an ascending sequence of integers, starting with 1. If there is a partition by clause, then, it returns the sequence over for each group produced by the PARTITIONED BY clause. The output sequence includes different values irrespective duplicate input values.

As a first example, let us run below below query:

select employee_id, last_name, hire_date, department_id, salary, row_number() over (order by salary desc) from employees where employee_id <= 109 

The results returned are shown below:















We can see that ROW_NUMBER starts from 1 and ends in 10. Introducing the partition by clause returns makes the ROW_NUMBER start from 1 for each new department_id as seen in below query:

select employee_id, last_name, hire_date, department_id, salary,  row_number() over (partition by department_id order by salary desc) from employees where employee_id <= 109

The results returned are shown below:

















The results are as expected

Next, we check out NTILE. NTILE  returns the "bucket number" associated with each row, between 1 and the value of the expression that follows NTILE keyword. Let us run below query:

select employee_id, last_name, hire_date, department_id, salary,  ntile(4) over (order by salary desc)
from employees where employee_id <= 109


The results returned are shown below:















Like in the case of ROW_NUMBER,  we can add the partition by clause as shown below:

select employee_id, last_name, hire_date, department_id, salary,  ntile(4) over (partition by department_id order by salary desc) from employees where employee_id <= 109

The results returned are shown below:















The behavior is similar to results given by ROW_NUMBER

Lastly, we look at CUME_DIST. CUME_DIST returns the cumulative distribution of a value. The value for each row in the result set is greater than 0 and less than or equal to 1. The below query incorporates CUME_DIST:

select employee_id, last_name, hire_date, department_id, salary,  cume_dist() over (order by salary desc) from employees where employee_id <= 109

The results returned are shown below:

















In the last query in this series, we add the partition by clause as we have done earlier:

select employee_id, last_name, hire_date, department_id, salary, cume_dist() over (partition by department_id order by salary desc) from employees where employee_id <= 109

The results returned are shown below:
















We see that cume_dist() values are by department_id. This concludes the series Analytic Functions in Impala.