Tuesday 21 November 2017

Analytic Functions in Hive - VI

In the following post, we will take a look at a case study similar to word count in Hive. For all our work, we will use the Mapr sandbox, MapR-Sandbox-For-Hadoop-5.2.0.

Before we proceed to  the case study, let us take a look at EXPLODE and POSEXPLODE.

We will run below query to understand EXPLODE:

select explode(array('Hello','Big','Data'));

The results returned are shown below:














As the name suggests, EXPLODE decomposes the elements in an array into elements in multiple rows. POSEXPLODE is the same as EXPLODE but returns another column with positions as well starting from 0.

select posexplode(array('Hello','Big','Data')); 

The results returned are shown below:











For the case study, the data is described by the query below:

select last_name, phone_number from employees where rlike(last_name, 'K');

The results returned are shown below: 













The goal is to count the number of elements in the phone numbers after they are split into different elements using '.' as a delimiter. As an example, '515' should have a count of 3 while '615' should have a count of 1 only.

As a first step, we remove the double quotes in the numbers with below query:

select regexp_replace(phone_number, '([/"])', '') as part_number from employees where rlike(last_name, 'K');

The results returned are shown below: 













We then split the result into an array having different phone number parts based on delimiter "." as shown below:

select split(regexp_replace(phone_number, '([/"])', ''),'\\.') as part_number from employees where rlike(last_name, 'K');

The results returned are shown below: 













We then use EXPLODE to move the different parts of the array to multiple rows as shown below: 

select explode(split(regexp_replace(phone_number, '([/"])', ''),'\\.')) as part_number from employees where rlike(last_name, 'K');

The results returned are shown below:


























Now, the part numbers are into different rows. We just need to do a group by to complete the part number count as shown below:

select part_number, count(*) as count_number from (select explode(split(regexp_replace(phone_number, '([/"])', ''),'\\.')) as part_number from employees where rlike(last_name, 'K')) w group by part_number;

The results returned are shown below: 















This shows the count of the different part numbers. To get the total number of part numbers, use below query:

select sum(count_number) from (select part_number, count(*) as count_number from (select explode(split(regexp_replace(phone_number, '([/"])', ''),'\\.')) as part_number from employees where rlike(last_name, 'K')) w group by part_number) y;

The results returned are shown below:











This concludes the Analytic Functions series on Hive.