In the second part of the series on Analytic Functions in Hive, we will follow the blog that was written earlier for Oracle. For all our work, we will use the MapR sandbox, MapR-Sandbox-For-Hadoop-5.2.0.
For starters, let us run the following query:
select count(*) from employees where department_id < 40;
The results returned are shown below:
The result is the same as the one we got earlier.
Let us add the OVER keyword to this as shown below:
select count(*) over() from employees where department_id < 40;
The results returned are shown below:
The results match those from the earlier Oracle blog: the same count is now returned nine times. Since nothing is specified in the parentheses following OVER, COUNT is computed over the entire result set, and its value is repeated once for each of the nine rows that satisfy the WHERE clause.
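To make this behaviour easier to see, here is a small sketch (not from the original post) that pairs the analytic count with each row's own columns, so the repeated value appears alongside the employee it is attached to:
select employee_id, department_id, count(*) over() as total_rows from employees where department_id < 40;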
The same behaviour can be seen when we introduce a partition clause, as shown below:
select department_id, count(*) over(partition by department_id) from employees where department_id < 40;
The results returned are shown below:
The results again match those we got in Oracle: the count is now computed per department and repeated on every row of that department.
A more meaningful query is shown below:
select distinct department_id, count(*) over(partition by department_id) from employees where department_id < 40;
The results returned are shown below:
The results returned are not in line with the results obtained in the earlier blog for Oracle. While the results were the same for all the previous queries, the query gives incorrect results in Hive once we introduce the distinct keyword. After logging into the MapR virtual machine as mapr/mapr, we can run the commands shown below to get the version of Hive:
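A typical way to check the version from the shell (the exact commands in the original screenshot may differ) is:
hive --version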
Fortunately, we have a workaround in the query below to get the correct result:
select department_id, count(*) from employees where department_id < 40 group by department_id;
The results returned are shown below:
Now the results look good.
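Another possible workaround, sketched here on the assumption that the bug is limited to distinct and a window function appearing in the same query block, is to compute the windowed count in a subquery and de-duplicate the outer result:
select distinct department_id, dept_count from (select department_id, count(*) over (partition by department_id) as dept_count from employees where department_id < 40) t;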
Now that we have seen how OVER() works, both on its own and with a PARTITION clause, let us look at the next variant, which can be used for moving calculations. For all the variants shown hereafter, the calculations depend on the order of the rows in the result. The first query calculates the moving average of the salary in the previous record and the salary in the current record:
select employee_id, first_name, last_name, salary, avg(salary) over (order by employee_id rows between 1 preceding and current row) as moving_avg_of_sal from employees where employee_id <= 105;
The results returned are shown below:
The results returned look the same as the ones we got before.
The moving average in the first row is the same as the salary of that row, as there is no previous row. The remaining rows are calculated correctly according to this logic: for example, if the first two salaries were 10000 and 8000, the second row's moving average would be (10000 + 8000) / 2 = 9000.
The second query is similar to the first one, the only difference being that the average is calculated over three values: the salary in the previous row, the salary in the current row and the salary in the next row.
select employee_id, first_name, last_name, salary, avg(salary) over (order by employee_id rows between 1 preceding and 1 following) as moving_avg_of_sal from employees where employee_id <= 105;
The results returned are shown below:
The results returned are the same as before.
It is instructive to note that for the first and the last rows in the result set, the frame contains only two records (there is no preceding row for the first and no following row for the last), so the average is computed over just two salaries.
The third query calculates the running average of salaries over all records from the start of the result set up to and including the current record.
select employee_id, first_name, last_name, salary, avg(salary) over (order by employee_id rows between unbounded preceding and current row) as moving_avg_of_sal from employees where employee_id <= 105;
The results returned are shown below:
The results returned are the same as before.
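As an aside, assuming this Hive version accepts the standard shorthand, the same frame can be written more compactly: rows unbounded preceding is equivalent to rows between unbounded preceding and current row:
select employee_id, first_name, last_name, salary, avg(salary) over (order by employee_id rows unbounded preceding) as moving_avg_of_sal from employees where employee_id <= 105;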
We will next look at FIRST_VALUE. In the first query, we will fetch the lowest salary in the departments with department_ids 20 and 30. The query is shown below:
select employee_id, department_id, salary, first_value(salary) over (partition by department_id order by salary) as first_sal_in_dept from employees where department_id in (20,30);
The results returned are shown below:
As expected, we see that first_sal_in_dept gives the lowest salary in each of the respective departments with department_ids 20 and 30.
We can fetch the highest salary by reversing the sort order on salary, as shown below:
select employee_id, department_id, salary, first_value(salary) over (partition by department_id order by salary desc) as first_sal_in_dept from employees where department_id in (20,30);
The results returned are shown below:
The results are as expected. In the case of LAST_VALUE, we get the same funny results that we got in Oracle, as shown below:
select employee_id, department_id, salary, last_value(salary) over (partition by department_id order by salary) as last_sal_in_dept from employees where department_id in (20,30);
The value in each row is considered the last value because, when an ORDER BY is specified without an explicit frame, the default window frame is range between unbounded preceding and current row, so the current row is always the last row of its own frame. To correct this, we have to add an additional clause indicating the extent of the frame over which LAST_VALUE applies, namely "rows between unbounded preceding and unbounded following". This fetches the last value correctly, as shown below where the highest salary is obtained:
select employee_id, department_id, salary, last_value(salary) over (partition by department_id order by salary rows between unbounded preceding and unbounded following) as last_sal_in_dept from employees where department_id in (20,30);
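As a side note, once the frame is extended over the whole partition, the LAST_VALUE of a salary-ordered partition is simply the partition maximum, so a sketch using MAX as an analytic function should return the same column without any frame clause:
select employee_id, department_id, salary, max(salary) over (partition by department_id) as last_sal_in_dept from employees where department_id in (20,30);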
In the same way, we can add the desc keyword to the query, as shown below, to get the minimum salary in the two departments:
select employee_id, department_id, salary, last_value(salary) over (partition by department_id order by salary desc rows between unbounded preceding and unbounded following) as last_sal_in_dept from employees where department_id in (20,30);
With this treatment of FIRST_VALUE and LAST_VALUE, we come to the close of the second part in this series.