Tuesday 29 October 2019

Cloud - VII

We did a word count using Apache Hive on the Cloudera QuickStart VM 5.12 here. In this post, we will repeat the same word count with Apache Hive, but on AWS EMR. Amazon Elastic MapReduce, or AWS EMR, is a managed Hadoop framework that can be deployed swiftly to process large amounts of data across dynamically scalable Amazon EC2 instances. You can also run open source tools and popular distributed frameworks such as Apache Hive, Apache Spark, Apache HBase, Presto, and Apache Flink on Amazon EMR, and interact with data in other AWS data stores such as Amazon S3, Amazon DynamoDB, and Amazon Redshift.

We will use Amazon S3 to store our Hive queries, the input data, and the output of the Hive queries run on Amazon EMR. In S3, we have created a bucket called emr-example--bucket with three folders: code, input and output. The code folder will house the Hive queries. The input folder will contain the data file, blue_carbuncle.txt, holding the text on which we will attempt the word count. The results of the Hive queries will be written to the output folder. The bucket and folders are shown below:


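If you prefer to create the same bucket and folder layout with a script rather than the S3 console, here is a minimal boto3 sketch. The bucket name matches the one above; the local file names, the default region and the use of boto3 itself are assumptions for illustration, not part of the original setup.

import boto3

s3 = boto3.client("s3")
bucket = "emr-example--bucket"

# Create the bucket; outside us-east-1 a CreateBucketConfiguration with a
# LocationConstraint would also be needed.
s3.create_bucket(Bucket=bucket)

# S3 has no real folders; the "code/" and "input/" key prefixes act as the
# folders shown above.
s3.upload_file("hive1.q", bucket, "code/hive1.q")                         # Hive script, created later in this post
s3.upload_file("blue_carbuncle.txt", bucket, "input/blue_carbuncle.txt")  # input text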
The first paragraph in the data file, blue_carbuncle.txt, is shown below:

The data file, blue_carbuncle.txt, is placed in the input folder:

The Hive queries are shown below:

create external table text (line string) location '${INPUT}/input';

insert overwrite directory '${OUTPUT}/result1/' select word, count(*) from(select explode(split(line,'\\s')) as word from text) z group by word;

They are the same as in the previous post except for the "location" part of the first query and the "insert overwrite directory" part of the second query; the ${INPUT} and ${OUTPUT} variables will be filled in from the Input and Output S3 locations we supply when adding the step on EMR. The queries sit in a single file called hive1.q in the code folder:

The output folder is empty as no Hive queries have been run so far. Now, we can go ahead and run the Hive queries on the data file by spinning up an Amazon EMR cluster on the fly. Frankly, I enjoyed the experience, as I never thought setting up a Hadoop cluster would be such a breeze. So, now onto Amazon EMR:

Click on the Create cluster button. On the next screen, add an EC2 key pair only if you already have one; otherwise, accept the defaults. Click on Create cluster to launch a cluster comprising 1 m5.xlarge master node and 2 m5.xlarge core nodes:


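For reference, the same cluster can be launched programmatically; below is a hedged boto3 sketch of 1 master and 2 core m5.xlarge nodes with Hive installed. The cluster name, the release label and the default EMR roles are assumptions here, so adjust them to whatever your console run used.

import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="emr-example-cluster",          # assumed name
    ReleaseLabel="emr-5.27.0",           # assumed; use the release the console offered
    Applications=[{"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,              # 1 master node + 2 core nodes
        "KeepJobFlowAliveWhenNoSteps": True,
        # "Ec2KeyName": "my-key-pair",   # only if you already have an EC2 key pair
    },
    JobFlowRole="EMR_EC2_DefaultRole",   # default EMR roles, assumed to exist already
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])             # cluster id, something like j-XXXXXXXXXXXX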
The cluster will take a few minutes to launch. Once it is up, click on the Steps tab and then the Add step button to add the details for running the Hive query:

In the Add step window, add the values as follows:


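The same step can also be added with boto3; the sketch below mirrors the Script, Input and Output S3 locations entered in the window above. The cluster id is a placeholder, and running the script through command-runner.jar with the hive-script wrapper is my understanding of what the console does behind the scenes for a Hive program step.

import boto3

emr = boto3.client("emr")
bucket = "s3://emr-example--bucket"

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXX",            # id of the running cluster (placeholder)
    Steps=[{
        "Name": "Hive word count",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hive-script", "--run-hive-script", "--args",
                "-f", bucket + "/code/hive1.q",        # Script S3 location
                "-d", "INPUT=" + bucket,               # fills ${INPUT} in hive1.q
                "-d", "OUTPUT=" + bucket + "/output",  # fills ${OUTPUT} in hive1.q
            ],
        },
    }],
)
print(response["StepIds"])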
After setting the values for the Script, Input and Output S3 locations, click the Add button to kick off the Hive query on the cluster. Once the Hive query has run, the Status column shows Completed:


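If you are scripting the run instead of watching the console, the step can be polled until it reaches COMPLETED. A small sketch, assuming the cluster id and the step id returned by add_job_flow_steps:

import time
import boto3

emr = boto3.client("emr")

def wait_for_step(cluster_id, step_id):
    # Poll the step state every 30 seconds until it reaches a terminal state.
    while True:
        state = emr.describe_step(ClusterId=cluster_id, StepId=step_id)["Step"]["Status"]["State"]
        print(state)
        if state in ("COMPLETED", "FAILED", "CANCELLED"):
            return state
        time.sleep(30)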
Navigate to S3 to see the result:

Download the file, 000000_0, and view its contents:


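The result file can also be fetched without the console; a minimal boto3 sketch, assuming the output landed under output/result1/ as specified by the '${OUTPUT}/result1/' path in hive1.q:

import boto3

s3 = boto3.client("s3")

# Download the single reducer output file and print it.
s3.download_file("emr-example--bucket", "output/result1/000000_0", "000000_0")
with open("000000_0") as f:
    print(f.read())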
The contents are the same as in the last post. This concludes the post on Amazon EMR.