Thursday, 12 April 2018

Hadoop Map Reduce V

In the earlier post on Hadoop, we did a Top N analysis using Hadoop. In this post, we will take a look at word count using Hadoop Streaming. More details on Hadoop Streaming are available here. Using Hadoop Streaming, we can create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. As in the earlier posts, we will be using the Cloudera sandbox, Cloudera QuickStart VM 5.12.

We will use Python to write the mapper code and reducer code. A minimal sketch of the mapper is shown below, assuming a simple letters-and-apostrophes regular expression for what counts as a word; it reads lines from standard input and emits a tab-separated (word, 1) pair for each word it finds:
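
#!/usr/bin/env python
# mapper.py - emit a (word, 1) pair for every word read from standard input
import re
import sys

for line in sys.stdin:
    # The word pattern is an assumption; any regular expression
    # that captures words will do
    for word in re.findall(r"[A-Za-z']+", line):
        print('%s\t%d' % (word, 1))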

The reducer is sketched below. Hadoop Streaming sorts the mapper output by key before it reaches the reducer, so the reducer only needs to sum the counts of consecutive identical words:
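
#!/usr/bin/env python
# reducer.py - sum the counts for each word; input arrives sorted by key
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    # Each input line is a tab-separated (word, count) pair from the mapper
    word, count = line.strip().split('\t', 1)
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        # A new key has started; flush the total for the previous word
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word = word
        current_count = count

# Flush the last word
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))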

The above code is very similar to the MapReduce word count program we have seen before. Note how the words are extracted from the input using regular expressions. We will use the same blue_carbuncle.txt file that we used for the word count exercise in earlier posts. We can create a directory called input to hold this file as shown below:

hdfs dfs -mkdir /user/cloudera/input

The command to push the file into HDFS is shown below:

hdfs dfs -put /home/cloudera/Desktop/blue_carbuncle.txt /user/cloudera/input 

We can then verify that the file has been loaded into HDFS using the below command:

hdfs dfs -ls /user/cloudera/input

The output is shown below:

[cloudera@quickstart ~]$ hdfs dfs -mkdir /user/cloudera/input
[cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/Desktop/blue_carbuncle.txt /user/cloudera/input 
[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/input
Found 1 items
-rw-r--r--   1 cloudera cloudera       2062 2018-04-12 09:17 /user/cloudera/input/blue_carbuncle.txt
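
Since the mapper and reducer are plain scripts that read standard input and write standard output, we can sanity-check them locally with a shell pipeline before submitting the job; the sort command stands in for the shuffle-and-sort phase. Both scripts need to be made executable first (chmod +x mapper.py reducer.py), since Hadoop Streaming will also invoke them directly:

cat /home/cloudera/Desktop/blue_carbuncle.txt | /home/cloudera/Desktop/mapper.py | sort | /home/cloudera/Desktop/reducer.py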

We can run the below command to invoke the mapper.py and reducer.py scripts that we wrote earlier:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
   -input /user/cloudera/input \
   -output /user/cloudera/output_streaming_wordcount \
   -mapper /home/cloudera/Desktop/mapper.py \
   -reducer /home/cloudera/Desktop/reducer.py

The output is shown below:

[cloudera@quickstart Desktop]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
>    -input /user/cloudera/input \
>    -output /user/cloudera/output_streaming_wordcount \
>    -mapper /home/cloudera/Desktop/mapper.py \
>    -reducer /home/cloudera/Desktop/reducer.py
packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.12.0.jar] /tmp/streamjob7911282178172462208.jar tmpDir=null
18/04/12 10:31:28 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/04/12 10:31:28 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/04/12 10:31:29 INFO mapred.FileInputFormat: Total input paths to process : 1
18/04/12 10:31:29 INFO mapreduce.JobSubmitter: number of splits:2
18/04/12 10:31:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1523543152535_0002
18/04/12 10:31:30 INFO impl.YarnClientImpl: Submitted application application_1523543152535_0002
18/04/12 10:31:30 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1523543152535_0002/
18/04/12 10:31:30 INFO mapreduce.Job: Running job: job_1523543152535_0002
18/04/12 10:31:38 INFO mapreduce.Job: Job job_1523543152535_0002 running in uber mode : false
18/04/12 10:31:38 INFO mapreduce.Job:  map 0% reduce 0%
18/04/12 10:31:50 INFO mapreduce.Job:  map 50% reduce 0%
18/04/12 10:31:51 INFO mapreduce.Job:  map 100% reduce 0%
18/04/12 10:31:57 INFO mapreduce.Job:  map 100% reduce 100%
18/04/12 10:31:58 INFO mapreduce.Job: Job job_1523543152535_0002 completed successfully
18/04/12 10:31:59 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=3430
FILE: Number of bytes written=388306
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=3337
HDFS: Number of bytes written=1801
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters 
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=18617
Total time spent by all reduces in occupied slots (ms)=5081
Total time spent by all map tasks (ms)=18617
Total time spent by all reduce tasks (ms)=5081
Total vcore-milliseconds taken by all map tasks=18617
Total vcore-milliseconds taken by all reduce tasks=5081
Total megabyte-milliseconds taken by all map tasks=19063808
Total megabyte-milliseconds taken by all reduce tasks=5202944
Map-Reduce Framework
Map input records=41
Map output records=368
Map output bytes=2688
Map output materialized bytes=3436
Input split bytes=244
Combine input records=0
Combine output records=0
Reduce input groups=217
Reduce shuffle bytes=3436
Reduce input records=368
Reduce output records=217
Spilled Records=736
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=437
CPU time spent (ms)=1980
Physical memory (bytes) snapshot=701603840
Virtual memory (bytes) snapshot=4523958272
Total committed heap usage (bytes)=668999680
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters 
Bytes Read=3093
File Output Format Counters 
Bytes Written=1801
18/04/12 10:31:59 INFO streaming.StreamJob: Output directory: /user/cloudera/output_streaming_wordcount
[cloudera@quickstart Desktop]$ 

Let us see the output directory contents with the below command:

hdfs dfs -ls /user/cloudera/output_streaming_wordcount

The output is shown below:

[cloudera@quickstart Desktop]$ hdfs dfs -ls /user/cloudera/output_streaming_wordcount
Found 2 items
-rw-r--r--   1 cloudera cloudera          0 2018-04-12 10:31 /user/cloudera/output_streaming_wordcount/_SUCCESS
-rw-r--r--   1 cloudera cloudera       1801 2018-04-12 10:31 /user/cloudera/output_streaming_wordcount/part-00000

Let us now see the first 25 lines of the part-00000 file with the below command:

hdfs dfs -cat /user/cloudera/output_streaming_wordcount/part-00000 | head -25

The output is shown below:

[cloudera@quickstart Desktop]$ hdfs dfs -cat /user/cloudera/output_streaming_wordcount/part-00000 | head -25
A 1
ADVENTURE 1
Amid 1
BLUE 1
Beside 1
CARBUNCLE 1
Christmas 1
He 1
Holmes 2
I 10
No 2
Not 1
OF 1
Only 1
Sherlock 2
So 1
THE 2
The 1
We 1
You 1
a 12
action 1
added 1
after 1
all 2
[cloudera@quickstart Desktop]$ 

Note that the words appear in sorted order (uppercase before lowercase, following ASCII ordering) because the framework sorts the mapper output by key before the reduce phase. This completes the post on Hadoop Streaming.