Sunday, 19 November 2017

Analytic Functions in Hive - IV

In the following post, we will take a look at a new set of analytic functions continuing in the series. We will follow the post that we wrote before. For all our work, we will use the Mapr sandbox, MapR-Sandbox-For-Hadoop-5.2.0.

We will look at LAG and LEAD functions. We will use the results of the below query to understand these two functions:

select * from employees where employee_id <= 109 order by salary desc;

The results returned are shown below:  











The two functions are dependent on the order of the rows. If the order of the rows are changed, then, the values returned by these functions also will change as we will see in the queries shortly. LAG and LEAD functions return values from previous of following rows. The offset determines the row that is selected from which the value will be fetched. If no offset is specified, then, value from the immediate previous or immediate following row depending on whether LAG or LEAD function is used, respectively. If no value is available because previous row or next row is non-existent, then NULL, is returned. To avoid NULL values being returned, we can specify default value so that a meaningful result is returned.

As an example, let us run below query:

select employee_id, last_name, hire_date, department_id, salary, lag(salary) over (order by salary desc) as PREVIOUS_SALARY from employees where employee_id <= 109;

The results returned are shown below:  



The second row and further rows correctly carry the salary from the previous row. Two things are to be noted. The first row has NULL for previous salary as there there is no previous row from which salary can be fetched. NULL value is returned as we have not set a default value to LAG function.  Below query corrects both the issues where 0 is returned instead of NULL:

select employee_id, last_name, hire_date, department_id, salary, lag(salary,1,0) over (order by salary desc) as PREVIOUS_SALARY from employees where employee_id <= 109;

The results returned are shown below:











If, however, instead of 1 as offset, we use 0 with no default for NULL value as shown below, the PREVIOUS SALARY column shows the current salary. This is only to be expected.

select employee_id, last_name, hire_date, department_id, salary, lag(salary,0) over (order by salary desc) as PREVIOUS_SALARY from employees where employee_id <= 109;

The results returned are shown below: 












 The next query shows a offset value of 2 using LEAD function but with a different sorting on order.

select employee_id, last_name, hire_date, department_id, salary, lead(salary,2,0) over (order by salary) as FOLLOWING_SALARY from employees where employee_id <= 109;


The results returned are shown below:












 Finally, we introduce the PARTITION keyword in the clause following OVER as shown below:

select employee_id, last_name, hire_date, department_id, salary, lag(salary) over (partition by department_id order by salary desc) as PREVIOUS_SALARY from employees where employee_id <= 109;

The results returned are shown below:










The results are neatly partitioned by department_id.