Saturday 25 November 2017

Word Count in Hive

In this post, we will take a look at word count in Hive. This post is the culmination of the posts that we have written earlier on Analytic Functions in Hive. For all queries in this post, we will use the Cloudera sandbox, Cloudera QuickStart VM 5.12.

We begin by creating an external table that will have the text data on which we wish to do a word count. The table will have just one column of data type, STRING. The command to create the table is shown below:

create external table text (line string);

The screenshot is shown below:












Once the table is created, we can import data into this table. We will import a .txt file that has the first few paragraphs of the short story, The Adventure of the Blue Carbuncle, by Sir Arthur Conan Doyle.

To port data into this external table, right click on table name on left as shown below:
















In the next screen, click on Import Data as shown below:












In the dialog box, click Upload a file as shown below:
























Select the file once uploaded and click on Submit button:















Click on Sample tab to see the data loaded as shown below:




















Navigate back to Hive Editor as shown below:

























Run the below query to see the position of the word on each line and the word itself:

select posexplode(split(line, '\\s')) as (position, word) from text; 

The results returned are shown below:
























Running below query returns the words in rows:

select explode(split(line,'\\s')) as word from text; 

The results returned are shown below:

























Adding a count as shown below returns the word count:

select word, count(*) from(select explode(split(line,'\\s')) as word from text) z group by word;

The results returned are shown below:



















Counting only those words that begin with a capital letter:

select word, count(*) from(select explode(split(line,'\\s')) as word from text) z where regexp(word, '^[A-Z]') group by word; 

The results returned are shown below:
















The last query returns the top three words that begin with a capital letter occurring in the text:

select word, count(*) as word_count from(select explode(split(line,'\\s')) as word from text) z where regexp(word, '^[A-Z]') group by word order by word_count desc limit 3;

The results returned are shown below: