Sunday 26 November 2017

Word Count in Pig

After the word count in Hive, we will proceed to word count in Pig. For all queries in this post, we will use the Cloudera sandbox, Cloudera QuickStart VM 5.12.

We will use the same file that we used in the earlier word count exercise. The path to the file is /user/hive/warehouse/text/blue_carbuncle.txt

Navigate to the Pig Editor in Hue.

The first step is to load the file with the command below:

data = LOAD '/user/hive/warehouse/text/blue_carbuncle.txt' as (text:CHARARRAY);

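The above command loads each line of the file into a field called text of type CHARARRAY. Because no loader is named, Pig uses its default loader, PigStorage, which splits on tabs, so text receives each line up to its first tab. Once the data is loaded, we can run the command below to describe the schema.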

DESCRIBE data;

The command returns the schema of the relation data: a single field named text of type chararray.

The next line is shown below:

words = FOREACH data GENERATE (TOKENIZE(text)) AS word;

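FOREACH applies the operation that follows to every record of its input, data in our case.

GENERATE produces one output record for each input record from the expressions that follow it.

TOKENIZE splits a string into a bag of words based on word separators. By default the separators are space, double quote, comma, parentheses and asterisk; in our case the space is the one that matters most.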
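The separator can also be stated explicitly. The following is a minimal sketch, assuming the optional second argument of TOKENIZE (a delimiter string) is available in the Pig version shipped with the VM; the relation names words_by_space and first_rows are hypothetical and not part of the original script.

```pig
-- Load each line of the file used in this post into a single field.
data = LOAD '/user/hive/warehouse/text/blue_carbuncle.txt' AS (text:CHARARRAY);

-- Split on spaces only; punctuation such as commas stays attached to the words.
words_by_space = FOREACH data GENERATE TOKENIZE(text, ' ') AS word;

-- Print a handful of records to inspect the resulting bags.
first_rows = LIMIT words_by_space 5;
DUMP first_rows;
```

Splitting on spaces alone keeps tokens such as "carbuncle," with the trailing comma, so the default separators are usually the better choice for a word count.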

After the TOKENIZE step, each record holds a single bag that contains all the words of one line.

We can see that the words are separated at the spaces. However, we need each word on its own row so that the words can be counted, as we did in Hive. So we use the FLATTEN operator as shown below:

words = FOREACH data GENERATE FLATTEN(TOKENIZE(text)) AS word;
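The difference between the two versions can be seen side by side. This is only a sketch: the relation names bagged and flat are hypothetical, and the sample line in the comments is made up for illustration rather than taken from the file.

```pig
data = LOAD '/user/hive/warehouse/text/blue_carbuncle.txt' AS (text:CHARARRAY);

-- Without FLATTEN: one output record per input line, holding a bag of tuples.
-- For a hypothetical line "the blue blue stone" the record would look like
--   ({(the),(blue),(blue),(stone)})
bagged = FOREACH data GENERATE TOKENIZE(text) AS word;

-- With FLATTEN: the bag is un-nested into one record per word:
--   (the)
--   (blue)
--   (blue)
--   (stone)
flat = FOREACH data GENERATE FLATTEN(TOKENIZE(text)) AS word;

-- Compare the two schemas.
DESCRIBE bagged;
DESCRIBE flat;
```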

After the FLATTEN step, each word is on its own row.

The next task is to group the words using the command below:

grpd  = GROUP words BY word;
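It helps to look at what GROUP produces before counting. A minimal sketch that repeats the earlier statements so it can be run on its own; the record shown in the comment assumes a hypothetical word that occurs twice.

```pig
data  = LOAD '/user/hive/warehouse/text/blue_carbuncle.txt' AS (text:CHARARRAY);
words = FOREACH data GENERATE FLATTEN(TOKENIZE(text)) AS word;
grpd  = GROUP words BY word;

-- grpd has two fields:
--   group : the value that was grouped on (here, the word)
--   words : a bag holding every row of the relation words with that value
DESCRIBE grpd;

-- Hypothetical illustration: if "blue" occurred twice, its record would be
--   (blue,{(blue),(blue)})
```

The second field takes its name from the relation that was grouped, which is why COUNT is later applied to words.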

After the GROUP step, each record holds a word together with a bag of all its occurrences.

Lastly, we need to add the count as shown below:

cntd  = FOREACH grpd GENERATE group, COUNT(words);
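The two output fields can also be given names, which makes later statements easier to read. This is a sketch of a variant, not the statement used in this post; the field name cnt is an assumption chosen for illustration.

```pig
data  = LOAD '/user/hive/warehouse/text/blue_carbuncle.txt' AS (text:CHARARRAY);
words = FOREACH data GENERATE FLATTEN(TOKENIZE(text)) AS word;
grpd  = GROUP words BY word;

-- Rename group to word and name the count cnt.
cntd  = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;

-- The schema now shows the two named fields.
DESCRIBE cntd;
```

COUNT returns the number of tuples in the bag and skips tuples whose first field is null; COUNT_STAR counts those as well.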

After the COUNT step, each record holds a word and the number of times it occurs.

The overall code is:

data = LOAD '/user/hive/warehouse/text/blue_carbuncle.txt' as (text:CHARARRAY);

words = FOREACH data GENERATE FLATTEN(TOKENIZE(text)) AS word;

grpd  = GROUP words BY word;

cntd  = FOREACH grpd GENERATE group, COUNT(words);

DUMP cntd;


The DUMP command runs the script and prints the result to the screen.
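For a larger file it is often more useful to sort the counts and write them to HDFS than to print every row. The following is a minimal sketch under stated assumptions: the field name cnt, the relation names srtd and top10, and the output directory /user/cloudera/wordcount_out are all hypothetical.

```pig
data  = LOAD '/user/hive/warehouse/text/blue_carbuncle.txt' AS (text:CHARARRAY);
words = FOREACH data GENERATE FLATTEN(TOKENIZE(text)) AS word;
grpd  = GROUP words BY word;

-- Name the output fields so the ORDER statement can refer to the count.
cntd  = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;

-- Sort by descending count and print only the 10 most frequent words.
srtd  = ORDER cntd BY cnt DESC;
top10 = LIMIT srtd 10;
DUMP top10;

-- Write the full sorted result to HDFS as tab-separated text.
-- The output directory is a made-up path and must not exist beforehand.
STORE srtd INTO '/user/cloudera/wordcount_out' USING PigStorage('\t');
```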