Let's Refresh: April 2018

In the earlier post on Hadoop, we did a Top N analysis. In this post, we will take a look at word count using Hadoop Streaming. More details on Hadoop Streaming are here. Using Hadoop Streaming, we can create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. As in the earlier posts, we will be using the Cloudera sandbox, Cloudera QuickStart VM 5.12.

We will use Python to write the mapper and reducer code. The mapper code is shown below:
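The listing itself is not reproduced in this copy of the post, so what follows is only a minimal sketch of a streaming mapper, not the original code. It assumes that words are runs of ASCII letters matched by the regular expression [A-Za-z]+, that case is preserved, and that each word is emitted as a tab-separated "word, 1" record. The name map_line is made up for illustration.

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming mapper (assumed, not the original listing).
import re
import sys

# Assumption: a "word" is a run of ASCII letters; case is left untouched.
WORD_RE = re.compile(r"[A-Za-z]+")


def map_line(line):
    """Return one 'word<TAB>1' record for every word found in the line."""
    return ["%s\t1" % word for word in WORD_RE.findall(line)]


if __name__ == "__main__":
    # Hadoop Streaming feeds the input split to the mapper on standard input.
    for input_line in sys.stdin:
        for record in map_line(input_line):
            print(record)
```

To be launched directly by its path, a script like this would need the shebang line shown above and execute permission (for example, chmod +x mapper.py).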

The reducer code is shown below:
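As with the mapper, the listing is not reproduced in this copy of the post, so the following is a hedged sketch rather than the original code. It assumes the reducer receives tab-separated "word, count" lines on standard input and relies on Hadoop Streaming delivering them sorted by key, so that all lines for one word arrive together. The name reduce_sorted is made up for illustration.

```python
#!/usr/bin/env python
# Sketch of a Hadoop Streaming reducer (assumed, not the original listing).
import sys


def reduce_sorted(lines):
    """Yield (word, total) pairs from key-sorted 'word<TAB>count' lines."""
    current, total = None, 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, _, count = line.partition("\t")
        if word != current:
            # The key changed: emit the finished word and start a new total.
            if current is not None:
                yield current, total
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield current, total  # emit the last word


if __name__ == "__main__":
    # Hadoop Streaming feeds the sorted mapper output on standard input.
    for word, total in reduce_sorted(sys.stdin):
        print("%s\t%d" % (word, total))
```

Because the totals are accumulated one key at a time, the reducer never has to hold the whole vocabulary in memory. It does, however, give wrong totals if its input is not sorted by word.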

The above code is very similar to the MapReduce program for word count that we have seen before. Note how the words are extracted from the input using Regular Expressions. We will use the same blue_carbuncle.txt file that we used for the word count exercise in earlier posts. We can create a directory called input to hold this file as shown below:

hdfs dfs -mkdir /user/cloudera/input

The command to push the file into HDFS is shown below:

hdfs dfs -put /home/cloudera/Desktop/blue_carbuncle.txt /user/cloudera/input

We can then verify that the file has been loaded into HDFS using the command below:

hdfs dfs -ls /user/cloudera/input

The output is shown below:

[cloudera@quickstart ~]$ hdfs dfs -mkdir /user/cloudera/input

[cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/Desktop/blue_carbuncle.txt /user/cloudera/input

[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/input

Found 1 items

-rw-r--r-- 1 cloudera cloudera 2062 2018-04-12 09:17 /user/cloudera/input/blue_carbuncle.txt

We can run the command below to invoke the mapper.py and reducer.py scripts that we wrote earlier:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/cloudera/input \
-output /user/cloudera/output_streaming_wordcount \
-mapper /home/cloudera/Desktop/mapper.py \
-reducer /home/cloudera/Desktop/reducer.py

The output is shown below:

[cloudera@quickstart Desktop]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \

> -input /user/cloudera/input \

> -output /user/cloudera/output_streaming_wordcount \

> -mapper /home/cloudera/Desktop/mapper.py \

> -reducer /home/cloudera/Desktop/reducer.py

packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.12.0.jar] /tmp/streamjob7911282178172462208.jar tmpDir=null

18/04/12 10:31:28 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032

18/04/12 10:31:29 INFO mapred.FileInputFormat: Total input paths to process : 1

18/04/12 10:31:29 INFO mapreduce.JobSubmitter: number of splits:2

18/04/12 10:31:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1523543152535_0002

18/04/12 10:31:30 INFO impl.YarnClientImpl: Submitted application application_1523543152535_0002

18/04/12 10:31:30 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1523543152535_0002/

18/04/12 10:31:30 INFO mapreduce.Job: Running job: job_1523543152535_0002

18/04/12 10:31:38 INFO mapreduce.Job: Job job_1523543152535_0002 running in uber mode : false

18/04/12 10:31:38 INFO mapreduce.Job: map 0% reduce 0%

18/04/12 10:31:50 INFO mapreduce.Job: map 50% reduce 0%

18/04/12 10:31:51 INFO mapreduce.Job: map 100% reduce 0%

18/04/12 10:31:57 INFO mapreduce.Job: map 100% reduce 100%

18/04/12 10:31:58 INFO mapreduce.Job: Job job_1523543152535_0002 completed successfully

18/04/12 10:31:59 INFO mapreduce.Job: Counters: 49

File System Counters

FILE: Number of bytes read=3430

FILE: Number of bytes written=388306

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=3337

HDFS: Number of bytes written=1801

HDFS: Number of read operations=9

HDFS: Number of large read operations=0

HDFS: Number of write operations=2

Job Counters

Launched map tasks=2

Launched reduce tasks=1

Data-local map tasks=2

Total time spent by all maps in occupied slots (ms)=18617

Total time spent by all reduces in occupied slots (ms)=5081

Total time spent by all map tasks (ms)=18617

Total time spent by all reduce tasks (ms)=5081

Total vcore-milliseconds taken by all map tasks=18617

Total vcore-milliseconds taken by all reduce tasks=5081

Total megabyte-milliseconds taken by all map tasks=19063808

Total megabyte-milliseconds taken by all reduce tasks=5202944

Map-Reduce Framework

Map input records=41

Map output records=368

Map output bytes=2688

Map output materialized bytes=3436

Input split bytes=244

Combine input records=0

Combine output records=0

Reduce input groups=217

Reduce shuffle bytes=3436

Reduce input records=368

Reduce output records=217

Spilled Records=736

Shuffled Maps =2

Failed Shuffles=0

Merged Map outputs=2

GC time elapsed (ms)=437

CPU time spent (ms)=1980

Physical memory (bytes) snapshot=701603840

Virtual memory (bytes) snapshot=4523958272

Total committed heap usage (bytes)=668999680

Shuffle Errors

BAD_ID=0

CONNECTION=0

IO_ERROR=0

WRONG_LENGTH=0

WRONG_MAP=0

WRONG_REDUCE=0

File Input Format Counters

Bytes Read=3093

File Output Format Counters

Bytes Written=1801

18/04/12 10:31:59 INFO streaming.StreamJob: Output directory: /user/cloudera/output_streaming_wordcount

[cloudera@quickstart Desktop]$

Let us see the contents of the output directory with the command below:

hdfs dfs -ls /user/cloudera/output_streaming_wordcount

The output is shown below:

[cloudera@quickstart Desktop]$ hdfs dfs -ls /user/cloudera/output_streaming_wordcount

Found 2 items

-rw-r--r-- 1 cloudera cloudera 0 2018-04-12 10:31 /user/cloudera/output_streaming_wordcount/_SUCCESS

-rw-r--r-- 1 cloudera cloudera 1801 2018-04-12 10:31 /user/cloudera/output_streaming_wordcount/part-00000

Let us now see the first 25 lines of the part-00000 file with the command below:

hdfs dfs -cat /user/cloudera/output_streaming_wordcount/part-00000 | head -25

The output is shown below:

[cloudera@quickstart Desktop]$ hdfs dfs -cat /user/cloudera/output_streaming_wordcount/part-00000 | head -25

A 1

ADVENTURE 1

Amid 1

BLUE 1

Beside 1

CARBUNCLE 1

Christmas 1

He 1

Holmes 2

I 10

No 2

Not 1

OF 1

Only 1

Sherlock 2

So 1

THE 2

The 1

We 1

You 1

a 12

action 1

added 1

after 1

all 2

[cloudera@quickstart Desktop]$

This completes the post on Hadoop Streaming.

In an earlier post, we took a look at Regular Expressions. We extend the conversation in this post by covering a few more topics such as inline modifiers, capturing groups, non-capturing groups and lookarounds. Some of these topics may seem a bit involved at first glance. To demonstrate examples for the different topics, we will use the test harness mentioned here. The Java program is saved to a folder called RegularExpressions. Let us use it to check out a few simple examples before we take up the topics mentioned earlier. The commands to compile and run the test harness are below:

F:\>cd RegularExpressions

F:\RegularExpressions>javac RegexTestHarness.java

F:\RegularExpressions>java -classpath . RegexTestHarness

The results are shown below:

We first enter the regular expression that we intend to search for. We then get a prompt where we can enter the text to be searched. The program returns the results of the search, and the process repeats until we exit the program using CTRL+C.

Let us look for vowels in foobar. The results are shown below:

Enter your regex: [aeiou]

Enter input string to search: foobar

I found the text "o" starting at index 1 and ending at index 2.

I found the text "o" starting at index 2 and ending at index 3.

I found the text "a" starting at index 4 and ending at index 5.

Note that three results are returned, one for each vowel in the input. We can restrict the search to two vowels appearing together as shown below:

Enter your regex: [aeiou]{2}

Enter input string to search: foobar

I found the text "oo" starting at index 1 and ending at index 3.
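For readers who do not have the Java harness at hand, the two searches above can also be reproduced with Python's re module. This is only a sketch: find_matches is a hypothetical helper, and its message format simply mimics the harness output. Java and Python regular expressions differ in some features, but simple character classes and the {2} quantifier used here behave the same way in both.

```python
import re


def find_matches(pattern, text):
    """Return one harness-style message per non-overlapping match."""
    return [
        'I found the text "%s" starting at index %d and ending at index %d.'
        % (match.group(), match.start(), match.end())
        for match in re.finditer(pattern, text)
    ]


# Single vowels, then two vowels appearing together.
for message in find_matches(r"[aeiou]", "foobar"):
    print(message)
for message in find_matches(r"[aeiou]{2}", "foobar"):
    print(message)
```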

Inline modifiers have the syntax (?z), where z is a letter such as i or s. The i modifier makes matching case-insensitive. An example of its usage and the result is shown below:

Enter your regex: (?i)ms\.