Saturday 11 November 2017

Analytic Functions in Hive - I


Now that we have written many articles on data-related areas, it is time to look at Big Data. We will begin with an introduction to Hive.


Hive has much in common with SQL, whose analytical side we have already covered in an earlier series. In this post, we will rerun most of the queries that we attempted there, which will be useful for those looking to take a crack at Big Data. We will use the sandbox provided by MapR for all the Hive queries.

Before we run any queries, we will set up the tables that we will use to demonstrate the analytic functions in Hive. Let us first get an extract of the employees and departments tables from the Oracle HR schema by performing an export in SQL Developer in the form of .csv data files. How these files are imported into the MapR sandbox will be shown shortly. As a first step, set up the sandbox and navigate to http://127.0.0.1:8443/.

Click on Launch Hue. On the "Quick Start Wizard - Hue™ 3.9.0 - The Hadoop UI" screen, click on File Browser to upload the data files exported earlier from Oracle.

On the File Browser screen, we can add the files using the Upload button. Once the files are uploaded, we can see the uploaded files at /user/mapr/ as shown below:

Now that the files are uploaded, we can create tables based on the data. Navigate to the "Metastore Manager" as shown below:

On the "Metastore Manager" screen, click on "Create a new table from a file". On the right-hand side, enter the details as shown and click on the Next button.

Examine the columns and data on the next screen and click the Next button.

On the last screen of the process, ensure that the "Use first row as column names" box is checked. Note that, based on the data, a data type has already been suggested for each column.

In the above screen, set the Column Type of the DEPARTMENT_ID and COLUMN_ID columns to smallint. After this change, the screen should look as shown below:

Make any other modifications, if necessary. Otherwise, accept the defaults and click on the Create Table button to create the table in Hive.

Once the table creation is complete, we see the results below.

Check out the Sample and Properties tabs to see more details about the departments table.

In the same manner, let us create the employees table as well. When we come to the last step of the table creation, we see that COMMISSION_PCT is typed as a string, and HIRE_DATE is not a date but a string.

To correct this issue, we will create a staging table based on the employee data file and then create the table in Hive that we will use for our queries. The intermediate staging table gives us a place to clean up and transform the data, so that the final table used for querying has the right schema and the correct data.

The staging table will be an external table, while the final table will be an internal table. The concept of external and internal tables in Hive is much the same as in a relational database. Briefly, Hive does not manage the storage for an external table; only the table's definition is maintained in Hive. So, if an external table is dropped, only the definition is deleted, not the data in the external file. On the other hand, Hive maintains both the definition and the data for internal tables. So, if an internal table is dropped, both the definition and the data are deleted.
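
To make this concrete, below is a minimal sketch of the drop behavior (the demo table names and the path are purely illustrative and not part of this setup):

-- External table: Hive keeps only the definition; dropping it leaves the files at LOCATION untouched
CREATE EXTERNAL TABLE ext_demo (id INT) LOCATION '/user/mapr/ext_demo';
DROP TABLE ext_demo;   -- definition removed, data at /user/mapr/ext_demo remains

-- Internal (managed) table: Hive owns the storage under its warehouse directory
CREATE TABLE int_demo (id INT);
DROP TABLE int_demo;   -- definition and data are both removed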

Create a folder called employees to hold employees.csv by clicking on the New button and then the Directory menu item, as shown below:

Then, drag and drop the employees.csv file into the employees folder. The final directory structure should be as shown below:

Click on Hive under the Query Editor menu to navigate to the Hive Query Editor.

On the Query Editor, click in the empty space below the query on the right to enter new queries.

Enter the query below and click on the Execute button:

CREATE EXTERNAL TABLE IF NOT EXISTS staging_employees (
  EMPLOYEE_ID SMALLINT, FIRST_NAME STRING, LAST_NAME STRING, EMAIL STRING,
  PHONE_NUMBER STRING, HIRE_DATE STRING, JOB_ID STRING, SALARY SMALLINT,
  COMMISSION_PCT DOUBLE, MANAGER_ID SMALLINT, DEPARTMENT_ID TINYINT)
COMMENT 'Staging table for employee data'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/mapr/employees'
TBLPROPERTIES ("skip.header.line.count"="1");

Once executed, we get the result shown below:

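Optionally, to confirm that Hive has registered staging_employees as an external table, we can inspect its metadata with the statement below; in its output, Table Type should read EXTERNAL_TABLE:

DESCRIBE FORMATTED staging_employees;
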
Validate the data in the staging table by running the query below:

SELECT COUNT(*) FROM STAGING_EMPLOYEES;
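
We can also eyeball a few rows to verify that the header line was skipped and that the columns line up with the data:

SELECT * FROM STAGING_EMPLOYEES LIMIT 5;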

Once we have the staging table, we need to transform the HIRE_DATE column from a string to the DATE datatype. So, we create the final table, called employees, with the correct datatypes and then populate it with data from the staging_employees table.

CREATE TABLE IF NOT EXISTS employees (
  EMPLOYEE_ID SMALLINT, FIRST_NAME STRING, LAST_NAME STRING, EMAIL STRING,
  PHONE_NUMBER STRING, HIRE_DATE DATE, JOB_ID STRING, SALARY SMALLINT,
  COMMISSION_PCT DOUBLE, MANAGER_ID SMALLINT, DEPARTMENT_ID TINYINT)
COMMENT 'Table holding employee details';

Then, add the data using the query below:

INSERT INTO employees
SELECT EMPLOYEE_ID, FIRST_NAME, LAST_NAME, EMAIL, PHONE_NUMBER,
       TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(HIRE_DATE, 'dd-MMM-yy'))),
       JOB_ID, SALARY, COMMISSION_PCT, MANAGER_ID, DEPARTMENT_ID
FROM staging_employees;
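
The nested conversion works in three steps: UNIX_TIMESTAMP parses the string against the 'dd-MMM-yy' pattern into epoch seconds, FROM_UNIXTIME renders those seconds back as a 'yyyy-MM-dd HH:mm:ss' string, and TO_DATE keeps just the date portion. As a standalone illustration (the literal here is only an example of the export format):

SELECT TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP('17-JUN-03', 'dd-MMM-yy')));  -- 2003-06-17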

Now that we have data in both the employees and departments tables, we can go ahead with the analytic functions in Hive. We will largely draw from the queries that we wrote earlier for Oracle.

The first one will be the group by clause:

select job_id, sum(salary) from employees where department_id < 40 group by (job_id);

The results are shown below; they are the same as those we got earlier.

Let us try to run the query below with the ROLLUP keyword:

select job_id, sum(salary) from employees where department_id < 40 group by rollup(job_id);

We get the error below:

We will write a workaround, as shown below:

select job_id, sum(salary) from employees where department_id < 40 group by (job_id) 
union 
select null, sum(salary) from employees where department_id < 40;


The results are shown below; they are the same as those we got earlier.

For CUBE too, we have to implement a workaround.

The grouping sets query below will not run; an error is thrown, as happened earlier with ROLLUP.
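
For reference, the query in question, written in the Oracle form used in the earlier series, would presumably be:

select department_id, job_id, sum(salary) from employees where department_id < 40 group by grouping sets (department_id, job_id);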

The query that mimics the grouping sets query is:

select null, job_id, sum(salary) from employees where department_id < 40 group by job_id
union
select department_id, null, sum(salary) from employees where department_id < 40 group by department_id; 


The results are shown below; they are the same as those we got earlier.

Likewise, the equivalent of the query below

select department_id, job_id, sum(salary) from employees where department_id < 40 group by grouping sets (department_id,job_id,());

in Hive will be:

select null, job_id, sum(salary) from employees where department_id < 40 group by job_id
union
select department_id, null, sum(salary) from employees where department_id < 40 group by department_id
union
select null, null, sum(salary) from employees where department_id < 40;
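
As an aside, Hive does have native support for grouping sets (from version 0.10 onwards), but with a syntax that differs from Oracle's: the columns used in the grouping sets must also be listed in the GROUP BY clause. Assuming the sandbox's Hive version supports it, the native form of the above would look like:

select department_id, job_id, sum(salary) from employees where department_id < 40 group by department_id, job_id grouping sets (department_id, job_id, ());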


The results are shown below; they are the same as those we got earlier.

In Hive, ROLLUP and CUBE are used in a different way, as shown in the next queries. For example, a query with ROLLUP is shown below:

select department_id, job_id, sum(salary) from employees where department_id < 40 group by department_id, job_id with rollup;

Running this gives the result below:

The same results are returned if we run the query below:

select department_id, job_id, sum(salary) from employees where department_id < 40 group by department_id,job_id
union
select department_id, null, sum(salary) from employees where department_id < 40 group by department_id
union
select null, null, sum(salary) from employees where department_id < 40;

A query using CUBE is shown below:

select department_id, job_id, sum(salary) from employees where department_id < 40 group by department_id, job_id with cube;

The results are shown below:

The same results are returned if we run the query below:

select department_id, job_id, sum(salary) from employees where department_id < 40 group by department_id,job_id
union
select null, job_id, sum(salary) from employees where department_id < 40 group by job_id
union
select department_id, null, sum(salary) from employees where department_id < 40 group by department_id
union
select null, null, sum(salary) from employees where department_id < 40;

We will continue with this topic in the next part of this series.