Tuesday 29 October 2019

Cloud - VII

We did a word count using Apache Hive on the Cloudera QuickStart VM 5.12 here. In this post, we will repeat the same word count using Apache Hive, but on AWS EMR. Amazon Elastic MapReduce, or AWS EMR, is a managed Hadoop framework that makes it easy to quickly process large amounts of data across dynamically scalable Amazon EC2 instances. You can also run open source tools and popular distributed frameworks such as Apache Hive, Apache Spark, Apache HBase, Presto, and Apache Flink on Amazon EMR, and interact with data in other AWS data stores such as Amazon S3, Amazon DynamoDB, and Amazon Redshift.

We will use Amazon S3 to store our Hive queries, the input data, and the output from the Hive queries run on Amazon EMR. In S3, we have created a bucket called emr-example--bucket with three folders: code, input and output. The code folder will house the Hive queries. The input folder will contain a data file called blue_carbuncle.txt holding the text on which we will attempt the word count. Results of the Hive queries will be written to the output folder. The bucket and folders are shown below:


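For readers who prefer to script this setup, the same bucket layout can be created with boto3, the AWS SDK for Python. This is a minimal sketch, assuming credentials are already configured and that the region defaults are acceptable; the local file names are placeholders:

import boto3

s3 = boto3.client('s3')

# create the bucket (in us-east-1 no LocationConstraint is needed)
s3.create_bucket(Bucket='emr-example--bucket')

# S3 has no real folders; zero-byte keys ending in '/' show up as folders in the console
for folder in ('code/', 'input/', 'output/'):
    s3.put_object(Bucket='emr-example--bucket', Key=folder)

# upload the Hive script and the data file into their folders
s3.upload_file('hive1.q', 'emr-example--bucket', 'code/hive1.q')
s3.upload_file('blue_carbuncle.txt', 'emr-example--bucket', 'input/blue_carbuncle.txt')
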
The first paragraph in the data file, blue_carbuncle.txt, is shown below:

The data file, blue_carbuncle.txt, is placed in the input folder:

The Hive queries are shown below:

-- point an external table at the input folder; each line of the file becomes one row
create external table text (line string) location '${INPUT}/input';

-- split each line on whitespace, explode the resulting array into one word per row,
-- then group and count; results are written to the result1 folder
insert overwrite directory '${OUTPUT}/result1/' select word, count(*) from (select explode(split(line, '\\s')) as word from text) z group by word;

They are the same as in the previous post except for the "location" part in the first query and the "insert overwrite directory" part in the second query. They are in a single file called hive1.q under the code folder:


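Just to make the Hive logic concrete, here is a rough Python equivalent of the word count (the file name assumes a local copy of the data file; splitting on single whitespace characters mirrors split(line, '\\s') in the query, empty tokens and all):

import re
from collections import Counter

with open('blue_carbuncle.txt') as f:
    text = f.read()

# split on single whitespace characters, like the Hive query does
words = re.split(r'\s', text)

for word, count in Counter(words).items():
    print(word, count)
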
The output folder is empty as no Hive queries have been run so far. Now, we can go ahead and run the Hive queries on the data file by spinning up an Amazon EMR cluster on the fly. Frankly, I enjoyed the experience, as I never thought setting up a Hadoop cluster would be such a breeze. So, now onto Amazon EMR:

Click on the Create cluster button. In the next screen, only add an EC2 key pair if you already have one; otherwise, take the defaults. Click on Create cluster to launch a cluster comprising 1 m5.xlarge master node and 2 m5.xlarge core nodes:


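The console does all of this with a few clicks, but for reference, a similar cluster can be requested through boto3. This is a sketch, not a drop-in script; the release label, region, and default role names are assumptions that may need adjusting for your account:

import boto3

emr = boto3.client('emr', region_name='us-east-1')

response = emr.run_job_flow(
    Name='emr-example-cluster',
    ReleaseLabel='emr-5.27.0',               # assumed release; pick a current one
    Applications=[{'Name': 'Hive'}],
    Instances={
        'MasterInstanceType': 'm5.xlarge',   # 1 master node
        'SlaveInstanceType': 'm5.xlarge',    # core nodes
        'InstanceCount': 3,                  # 1 master + 2 core
        'KeepJobFlowAliveWhenNoSteps': True, # stay up so steps can be added later
    },
    JobFlowRole='EMR_EC2_DefaultRole',       # default roles created by the console
    ServiceRole='EMR_DefaultRole',
)
print(response['JobFlowId'])
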
The cluster will take a few minutes to launch. Once launched, click on the Steps tab and then the Add step button to add the details needed to run the Hive query:

In the Add step window, add the values as follows:


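The console's Hive step boils down to passing the script, input, and output locations to the hive-script runner. A hedged boto3 sketch of the same step, assuming the cluster id from earlier and the bucket layout above:

import boto3

emr = boto3.client('emr', region_name='us-east-1')

emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',  # placeholder cluster id
    Steps=[{
        'Name': 'Hive word count',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'hive-script', '--run-hive-script', '--args',
                '-f', 's3://emr-example--bucket/code/hive1.q',
                '-d', 'INPUT=s3://emr-example--bucket',
                '-d', 'OUTPUT=s3://emr-example--bucket/output',
            ],
        },
    }],
)

The -d flags are what give the ${INPUT} and ${OUTPUT} variables in hive1.q their values.
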
After setting the values for the Script, Input and Output locations, click the Add button to kick off the Hive query on the cluster. Once the Hive query has run, the step shows Completed under Status:

Navigate to S3 to see the result:

Download the file, 000000_0, and view its contents:


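The same file can be pulled down with boto3. Note that insert overwrite directory writes Hive's default field delimiter, the Ctrl-A character (\x01), between each word and its count. A sketch, assuming the result key seen above:

import boto3

s3 = boto3.client('s3')
s3.download_file('emr-example--bucket', 'output/result1/000000_0', '000000_0')

with open('000000_0') as f:
    for line in f:
        # Hive separates columns with the Ctrl-A character by default
        word, count = line.rstrip('\n').split('\x01')
        print(word, count)
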
The contents are the same as in the last post. This concludes the post on Amazon EMR.

Monday 28 October 2019

Cloud - VI

In this post, we take a look at AWS Lambda, a good example of serverless compute. As with any serverless offering, there is no need to manage servers or related activities like OS installation, patching, etc. Scaling is handled automatically: AWS Lambda code is triggered in response to an event, so if more events occur, the code is executed once for each event, and if fewer events occur, correspondingly fewer executions take place. If no events occur, no Lambda code is executed at all. Billing is based on the number of times the code is executed and on the code execution time, metered in multiples of 100 milliseconds. We will see a few simple examples of AWS Lambda below.

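To make the billing model concrete, here is a rough cost calculation in Python. The per-request and per-GB-second prices below are assumptions based on the published Lambda pricing at the time of writing and may differ by region:

# assumed prices; check the AWS pricing page for current values
PRICE_PER_REQUEST = 0.20 / 1_000_000   # USD per invocation
PRICE_PER_GB_SECOND = 0.00001667       # USD per GB-second

invocations = 3_000_000                # invocations per month
memory_gb = 128 / 1024                 # a 128 MB function
billed_seconds = 0.2                   # 160 ms rounds up to 200 ms (100 ms multiples)

# ignoring the monthly free tier for simplicity
cost = (invocations * PRICE_PER_REQUEST
        + invocations * memory_gb * billed_seconds * PRICE_PER_GB_SECOND)
print(f'Estimated monthly cost: ${cost:.2f}')
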
After logging into the Management Console, bring up AWS Lambda and click on Create function:

We will not write any code from scratch. Instead, we will borrow code from an existing blueprint. In the next window, after clicking Use a blueprint, use the filter under Blueprints to bring up the hello-world-python blueprint. Select it and click Configure:

Under Function name, enter FirstLambda. Select Create a new role with basic Lambda permissions under Execution role. Observe the code and click Create function at the bottom of the page:


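For reference, the hello-world-python blueprint code at the time looked roughly like this (reproduced approximately, so minor details may differ); it simply prints the incoming event's values and echoes key1 back:

import json

print('Loading function')

def lambda_handler(event, context):
    # print("Received event: " + json.dumps(event, indent=2))
    print("value1 = " + event['key1'])
    print("value2 = " + event['key2'])
    print("value3 = " + event['key3'])
    return event['key1']  # echo back the first key value
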
Click on the Save button. Then, click on the Test button to the left of the Save button. In the Configure test event window, replace value1 with Hello, world!, enter LambdaEvent under Event name, and click the Create button at the bottom:

Once the test event details are saved, click on the Test button:

The results can be seen below:


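The console's Test button is just a convenience; the same invocation can be made from code. A sketch using boto3, assuming the function name and test event used above:

import json
import boto3

lambda_client = boto3.client('lambda')

event = {'key1': 'Hello, world!', 'key2': 'value2', 'key3': 'value3'}
response = lambda_client.invoke(
    FunctionName='FirstLambda',
    Payload=json.dumps(event),
)
# the function's return value, JSON-encoded: "Hello, world!"
print(response['Payload'].read().decode())
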
Click on Details to expand the execution result:

The output is in line with the code. Click on Monitoring to see the CloudWatch metric details:

Then, click on View logs in CloudWatch:


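The logs can also be fetched programmatically; Lambda writes to a log group named after the function. A sketch with boto3:

import boto3

logs = boto3.client('logs')

group = '/aws/lambda/FirstLambda'
streams = logs.describe_log_streams(logGroupName=group,
                                    orderBy='LastEventTime', descending=True)

for stream in streams['logStreams']:
    events = logs.get_log_events(logGroupName=group,
                                 logStreamName=stream['logStreamName'])
    for event in events['events']:
        print(event['message'], end='')
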
Click on the only record under Log Streams to see more details:

Create a second function using an existing role as shown below:

Then, click Create function at the bottom of the page. The default generated code is shown below:


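At the time, the default code generated for a Python function looked like this (reproduced approximately):

import json

def lambda_handler(event, context):
    # TODO implement
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
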
Modify the code, including the handler name, to print the context properties, as shown below:


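The exact code from the screenshot is not reproduced here, but a sketch along these lines prints the commonly used context properties; whatever handler name you choose must match the function's Handler setting:

def lambda_handler(event, context):
    # properties of the context object passed to every invocation
    print('Function name:    ' + context.function_name)
    print('Function version: ' + context.function_version)
    print('Invoked ARN:      ' + context.invoked_function_arn)
    print('Memory limit:     ' + str(context.memory_limit_in_mb))
    print('Request id:       ' + context.aws_request_id)
    print('Log group name:   ' + context.log_group_name)
    print('Log stream name:  ' + context.log_stream_name)
    print('Time remaining:   ' + str(context.get_remaining_time_in_millis()) + ' ms')
    return 'done'
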
Then, click Test to configure the test event details:

Then, save and test the function. Results are shown below:

Execution results are shown below:

This concludes the introduction to AWS Lambda.