In the last segment of the series, we will be looking at analytic functions ROW_NUMBER, NTILE and CUME_DIST in
Impala. For all queries in this post, we will use the Cloudera sandbox,
Cloudera QuickStart VM 5.12.
ROW_NUMBER returns an ascending sequence of integers, starting with 1. If there is a partition by clause, then, it returns the sequence over for each group produced by the PARTITIONED BY clause. The output sequence includes different values irrespective duplicate input values.
As a first example, let us run below below query:
select employee_id, last_name, hire_date, department_id, salary, row_number() over (order by salary desc) from employees where employee_id <= 109
The results returned are shown below:
We can see that ROW_NUMBER starts from 1 and ends in 10. Introducing the partition by clause returns makes the ROW_NUMBER start from 1 for each new department_id as seen in below query:
select employee_id, last_name, hire_date, department_id, salary, row_number() over (partition by department_id order by salary desc) from employees where employee_id <= 109
The results returned are shown below:
The results are as expected
Next, we check out NTILE. NTILE returns the "bucket number" associated with each row, between 1 and the value of the expression that follows NTILE keyword. Let us run below query:
select employee_id, last_name, hire_date, department_id, salary, ntile(4) over (order by salary desc)
from employees where employee_id <= 109
The results returned are shown below:
Like in the case of ROW_NUMBER, we can add the partition by clause as shown below:
select employee_id, last_name, hire_date, department_id, salary, ntile(4) over (partition by department_id order by salary desc) from employees where employee_id <= 109
The results returned are shown below:
The behavior is similar to results given by ROW_NUMBER
Lastly, we look at CUME_DIST. CUME_DIST returns the cumulative distribution of a value. The value for each row in the result set is greater than 0 and less than or equal to 1. The below query incorporates CUME_DIST:
select employee_id, last_name, hire_date, department_id, salary, cume_dist() over (order by salary desc) from employees where employee_id <= 109
The results returned are shown below:
In the last query in this series, we add the partition by clause as we have done earlier:
select employee_id, last_name, hire_date, department_id, salary, cume_dist() over (partition by department_id order by salary desc) from employees where employee_id <= 109
The results returned are shown below:
We see that cume_dist() values are by department_id. This concludes the series Analytic Functions in Impala.
ROW_NUMBER returns an ascending sequence of integers, starting with 1. If there is a partition by clause, then, it returns the sequence over for each group produced by the PARTITIONED BY clause. The output sequence includes different values irrespective duplicate input values.
As a first example, let us run below below query:
select employee_id, last_name, hire_date, department_id, salary, row_number() over (order by salary desc) from employees where employee_id <= 109
The results returned are shown below:
We can see that ROW_NUMBER starts from 1 and ends in 10. Introducing the partition by clause returns makes the ROW_NUMBER start from 1 for each new department_id as seen in below query:
select employee_id, last_name, hire_date, department_id, salary, row_number() over (partition by department_id order by salary desc) from employees where employee_id <= 109
The results returned are shown below:
The results are as expected
Next, we check out NTILE. NTILE returns the "bucket number" associated with each row, between 1 and the value of the expression that follows NTILE keyword. Let us run below query:
select employee_id, last_name, hire_date, department_id, salary, ntile(4) over (order by salary desc)
from employees where employee_id <= 109
The results returned are shown below:
Like in the case of ROW_NUMBER, we can add the partition by clause as shown below:
select employee_id, last_name, hire_date, department_id, salary, ntile(4) over (partition by department_id order by salary desc) from employees where employee_id <= 109
The results returned are shown below:
The behavior is similar to results given by ROW_NUMBER
Lastly, we look at CUME_DIST. CUME_DIST returns the cumulative distribution of a value. The value for each row in the result set is greater than 0 and less than or equal to 1. The below query incorporates CUME_DIST:
select employee_id, last_name, hire_date, department_id, salary, cume_dist() over (order by salary desc) from employees where employee_id <= 109
The results returned are shown below:
In the last query in this series, we add the partition by clause as we have done earlier:
select employee_id, last_name, hire_date, department_id, salary, cume_dist() over (partition by department_id order by salary desc) from employees where employee_id <= 109
The results returned are shown below:
We see that cume_dist() values are by department_id. This concludes the series Analytic Functions in Impala.