Thursday, 12 April 2018

Hadoop Map Reduce V

In the earlier post on Hadoop, we did a Top N analysis. In this post, we will take a look at word count using Hadoop Streaming. More details on Hadoop Streaming are here. With Hadoop Streaming, we can create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. As in the earlier posts, we will be using the Cloudera sandbox, Cloudera QuickStart VM 5.12.

We will use Python to write the mapper and reducer code. The mapper code is shown below:
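
A minimal sketch of what such a mapper can look like, offered as an illustration rather than the original listing. It assumes that a word is any run of word characters matched by the regular expression \w+, that each record is emitted as a tab-separated word and count, and that the names WORD_PATTERN and map_words are placeholders:

```python
#!/usr/bin/env python
import re
import sys

# Assumption: a "word" is a run of letters, digits or underscores.
WORD_PATTERN = re.compile(r"\w+")


def map_words(lines):
    """Yield one 'word<TAB>1' record for every word found in the lines."""
    for line in lines:
        for word in WORD_PATTERN.findall(line):
            yield "%s\t1" % word


if __name__ == "__main__":
    # Hadoop Streaming passes the input split to the mapper on standard input
    # and collects the records that the mapper writes to standard output.
    for record in map_words(sys.stdin):
        sys.stdout.write(record + "\n")
```

The words are not lowercased in this sketch, so THE and The would be counted separately.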



The reducer code is shown below:
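
A minimal sketch of a matching reducer, again an illustration rather than the original listing. It assumes tab-separated word and count records on standard input and relies on Hadoop delivering the records to the reducer sorted by key, so that all counts for a word arrive together. The names parse_records and reduce_counts are placeholders:

```python
#!/usr/bin/env python
import sys
from itertools import groupby


def parse_records(lines):
    """Split each 'word<TAB>count' line into a (word, count) pair."""
    for line in lines:
        word, _, count = line.rstrip("\n").partition("\t")
        if word and count:  # skip blank or malformed lines
            yield word, int(count)


def reduce_counts(lines):
    """Sum the counts of consecutive records that share the same word."""
    for word, group in groupby(parse_records(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    # The reducer input arrives sorted by key, so groupby sees each word once.
    for word, total in reduce_counts(sys.stdin):
        sys.stdout.write("%s\t%d\n" % (word, total))
```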


























The above code is very similar to the MapReduce program for word count that we have seen before. Note how the words are extracted from the input using Regular Expressions. We will use the same blue_carbuncle.txt file that we used for the word count exercise in earlier posts. We can create a directory called input to hold this file as shown below:

hdfs dfs -mkdir /user/cloudera/input

The command to push the file into HDFS is shown below:

hdfs dfs -put /home/cloudera/Desktop/blue_carbuncle.txt /user/cloudera/input 

We can then verify that the file has been loaded into HDFS using the command below:

hdfs dfs -ls /user/cloudera/input

The output is shown below:

[cloudera@quickstart ~]$ hdfs dfs -mkdir /user/cloudera/input
[cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/Desktop/blue_carbuncle.txt /user/cloudera/input 
[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/input
Found 1 items
-rw-r--r--   1 cloudera cloudera       2062 2018-04-12 09:17 /user/cloudera/input/blue_carbuncle.txt

We can run the command below to invoke the mapper.py and reducer.py scripts that we wrote earlier:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
   -input /user/cloudera/input \
   -output /user/cloudera/output_streaming_wordcount \
   -mapper /home/cloudera/Desktop/mapper.py \
   -reducer /home/cloudera/Desktop/reducer.py

The output is shown below:

[cloudera@quickstart Desktop]$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
>    -input /user/cloudera/input \
>    -output /user/cloudera/output_streaming_wordcount \
>    -mapper /home/cloudera/Desktop/mapper.py \
>    -reducer /home/cloudera/Desktop/reducer.py
packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.12.0.jar] /tmp/streamjob7911282178172462208.jar tmpDir=null
18/04/12 10:31:28 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/04/12 10:31:28 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/04/12 10:31:29 INFO mapred.FileInputFormat: Total input paths to process : 1
18/04/12 10:31:29 INFO mapreduce.JobSubmitter: number of splits:2
18/04/12 10:31:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1523543152535_0002
18/04/12 10:31:30 INFO impl.YarnClientImpl: Submitted application application_1523543152535_0002
18/04/12 10:31:30 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1523543152535_0002/
18/04/12 10:31:30 INFO mapreduce.Job: Running job: job_1523543152535_0002
18/04/12 10:31:38 INFO mapreduce.Job: Job job_1523543152535_0002 running in uber mode : false
18/04/12 10:31:38 INFO mapreduce.Job:  map 0% reduce 0%
18/04/12 10:31:50 INFO mapreduce.Job:  map 50% reduce 0%
18/04/12 10:31:51 INFO mapreduce.Job:  map 100% reduce 0%
18/04/12 10:31:57 INFO mapreduce.Job:  map 100% reduce 100%
18/04/12 10:31:58 INFO mapreduce.Job: Job job_1523543152535_0002 completed successfully
18/04/12 10:31:59 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=3430
FILE: Number of bytes written=388306
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=3337
HDFS: Number of bytes written=1801
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters 
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=18617
Total time spent by all reduces in occupied slots (ms)=5081
Total time spent by all map tasks (ms)=18617
Total time spent by all reduce tasks (ms)=5081
Total vcore-milliseconds taken by all map tasks=18617
Total vcore-milliseconds taken by all reduce tasks=5081
Total megabyte-milliseconds taken by all map tasks=19063808
Total megabyte-milliseconds taken by all reduce tasks=5202944
Map-Reduce Framework
Map input records=41
Map output records=368
Map output bytes=2688
Map output materialized bytes=3436
Input split bytes=244
Combine input records=0
Combine output records=0
Reduce input groups=217
Reduce shuffle bytes=3436
Reduce input records=368
Reduce output records=217
Spilled Records=736
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=437
CPU time spent (ms)=1980
Physical memory (bytes) snapshot=701603840
Virtual memory (bytes) snapshot=4523958272
Total committed heap usage (bytes)=668999680
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters 
Bytes Read=3093
File Output Format Counters 
Bytes Written=1801
18/04/12 10:31:59 INFO streaming.StreamJob: Output directory: /user/cloudera/output_streaming_wordcount
[cloudera@quickstart Desktop]$ 

Let us see the contents of the output directory with the command below:

hdfs dfs -ls /user/cloudera/output_streaming_wordcount

The output is shown below:

[cloudera@quickstart Desktop]$ hdfs dfs -ls /user/cloudera/output_streaming_wordcount
Found 2 items
-rw-r--r--   1 cloudera cloudera          0 2018-04-12 10:31 /user/cloudera/output_streaming_wordcount/_SUCCESS
-rw-r--r--   1 cloudera cloudera       1801 2018-04-12 10:31 /user/cloudera/output_streaming_wordcount/part-00000

Let us now see the first 25 lines of the part-00000 file with the command below:

hdfs dfs -cat /user/cloudera/output_streaming_wordcount/part-00000 | head -25

The output is shown below:

[cloudera@quickstart Desktop]$ hdfs dfs -cat /user/cloudera/output_streaming_wordcount/part-00000 | head -25
A 1
ADVENTURE 1
Amid 1
BLUE 1
Beside 1
CARBUNCLE 1
Christmas 1
He 1
Holmes 2
I 10
No 2
Not 1
OF 1
Only 1
Sherlock 2
So 1
THE 2
The 1
We 1
You 1
a 12
action 1
added 1
after 1
all 2
[cloudera@quickstart Desktop]$ 

This completes the post on Hadoop Streaming.

Tuesday, 10 April 2018

More Regular Expressions

In an earlier post, we took a look at Regular Expressions. We extend the conversation in this post by covering a few more topics such as inline modifiers, capturing groups, non-capturing groups and look arounds. Some of these topics may seem a bit involved at first glance. To demonstrate examples for the different topics, we will be using the test harness mentioned here. The Java program is saved to a folder called RegularExpressions. Let us use it to check out a few simple examples in Regular Expressions before we take up the topics mentioned earlier. The commands to run the test harness are below:

F:\>cd RegularExpressions

F:\RegularExpressions>javac RegexTestHarness.java

F:\RegularExpressions>java -classpath . RegexTestHarness

The results are shown below:









We can enter a regular expression that we intend to search for. Once we enter the pattern, we get a prompt where we can enter the text that will be searched for that pattern. The program then returns the results of the search. This process repeats until we exit the program using CTRL+C.

Let us look for vowels in foobar. The results are shown below:

Enter your regex: [aeiou]
Enter input string to search: foobar
I found the text "o" starting at index 1 and ending at index 2.
I found the text "o" starting at index 2 and ending at index 3.
I found the text "a" starting at index 4 and ending at index 5.

Note that three results are returned and in each case a single vowel is picked. We can restrict the search to two vowels appearing together as shown below:

Enter your regex: [aeiou]{2}
Enter input string to search: foobar
I found the text "oo" starting at index 1 and ending at index 3.
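
The same two searches can be tried outside the Java test harness. The sketch below uses Python's re module instead, so it is an analogue and not the harness itself. The helper name find_all is an assumption for illustration, and it reports each match as a (text, start, end) tuple in the same spirit as the harness output:

```python
import re


def find_all(pattern, text):
    """Return (matched text, start index, end index) for every match."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]


# A character class matches one vowel at a time.
single_vowels = find_all(r"[aeiou]", "foobar")

# The {2} quantifier requires two vowels in a row.
vowel_pairs = find_all(r"[aeiou]{2}", "foobar")

print(single_vowels)  # [('o', 1, 2), ('o', 2, 3), ('a', 4, 5)]
print(vowel_pairs)    # [('oo', 1, 3)]
```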

Inline modifiers have the syntax (?z), where z is a letter such as i or s. The modifier i makes the match case insensitive. An example of its usage and the result are shown below:

Enter your regex: (?i)ms\.
Enter input string to search: Ms. Jones, MS. Parker and ms. White were at the party.
I found the text "Ms." starting at index 0 and ending at index 3.
I found the text "MS." starting at index 11 and ending at index 14.
I found the text "ms." starting at index 26 and ending at index 29.

If we want case insensitivity to apply to M only, we can limit the modifier to a group as shown below:

Enter your regex: (?i:M)s\.
Enter input string to search: Ms. Jones, MS. Parker and ms. White were at the party.
I found the text "Ms." starting at index 0 and ending at index 3.
I found the text "ms." starting at index 26 and ending at index 29.
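
The difference between the two forms of the modifier can also be checked with a short Python sketch. It assumes Python's re module rather than the Java harness. The scoped form (?i:...) is available in Python 3.6 and later:

```python
import re

text = "Ms. Jones, MS. Parker and ms. White were at the party."

# (?i) at the start makes the whole pattern case insensitive.
whole_pattern = [m.group() for m in re.finditer(r"(?i)ms\.", text)]

# (?i:M) limits case insensitivity to the letter M, so the s that
# follows must still be lowercase.
scoped = [m.group() for m in re.finditer(r"(?i:M)s\.", text)]

print(whole_pattern)  # ['Ms.', 'MS.', 'ms.']
print(scoped)         # ['Ms.', 'ms.']
```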

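(?s) enables single-line mode, in which the metacharacter . matches every character, including line terminators, so that the whole input is treated as a single line. Note that .* can also match the empty string, so an additional zero-length match is reported at the end of the input unless the pattern is anchored with ^. Three examples are shown below: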

Enter your regex: (?s).*
Enter input string to search: Ms. Jones, MS. Parker and ms. White were at the party.
I found the text "Ms. Jones, MS. Parker and ms. White were at the party." starting at index 0 and ending at index 54.
I found the text "" starting at index 54 and ending at index 54.

Enter your regex: (?s)^.*
Enter input string to search: Ms. Jones, MS. Parker and ms. White were at the party.
I found the text "Ms. Jones, MS. Parker and ms. White were at the party." starting at index 0 and ending at index 54.

Enter your regex: (?s).*$
Enter input string to search: Ms. Jones, MS. Parker and ms. White were at the party.
I found the text "Ms. Jones, MS. Parker and ms. White were at the party." starting at index 0 and ending at index 54.
I found the text "" starting at index 54 and ending at index 54.
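
The input used above has no line terminator, so the effect of (?s) on the dot is easier to see with a two-line string. The sketch below assumes Python's re module, where (?s) has the same meaning, and the variable names are illustrative only:

```python
import re

two_lines = "first line\nsecond line"

# By default the dot stops at a line terminator.
default_match = re.match(r".*", two_lines).group()

# With (?s) the dot also matches the newline, so .* spans both lines.
dotall_match = re.match(r"(?s).*", two_lines).group()

# Unanchored .* also reports a zero-length match at the end of the input;
# anchoring the pattern with ^ leaves only the first match.
unanchored_spans = [m.span() for m in re.finditer(r"(?s).*", two_lines)]
anchored_spans = [m.span() for m in re.finditer(r"(?s)^.*", two_lines)]

print(repr(default_match))  # 'first line'
print(repr(dotall_match))   # 'first line\nsecond line'
print(unanchored_spans)     # [(0, 22), (22, 22)]
print(anchored_spans)       # [(0, 22)]
```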

Capturing groups are useful because they let us selectively pick out the parts of the input that match a sub-pattern and keep them for later processing. Any part of a search pattern enclosed in parentheses forms a capturing group, as shown below:

Enter your regex: (\w{3})
Enter input string to search: a bc def gh ijk
I found the text "def" starting at index 5 and ending at index 8.
I found the text "ijk" starting at index 12 and ending at index 15.

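Every run of three consecutive word characters (letters, digits or underscores) is returned. A back reference such as \1 or \2 matches the same text that the correspondingly numbered group captured. In the next example, we use back references in conjunction with two capturing groups: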

Enter your regex: (\w{1})(\w{1})\2\1\2
Enter input string to search: abcde fghij abbab kllkl lmnop
I found the text "abbab" starting at index 12 and ending at index 17.
I found the text "kllkl" starting at index 18 and ending at index 23.
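
To see what each group captured, the match can be inspected group by group. The sketch below assumes Python's re module, which uses the same back reference syntax, and writes \w in place of \w{1} since the two are equivalent:

```python
import re

text = "abcde fghij abbab kllkl lmnop"

# Groups 1 and 2 each capture one word character; \2\1\2 then requires
# the captured characters to repeat in that order.
pattern = re.compile(r"(\w)(\w)\2\1\2")

results = [(m.group(0), m.group(1), m.group(2)) for m in pattern.finditer(text)]

print(results)  # [('abbab', 'a', 'b'), ('kllkl', 'k', 'l')]
```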

Non-capturing groups are very similar to capturing groups, but the text they match is not captured and so cannot be back referenced. The syntax is also very similar to that of capturing groups, but we use ?: immediately after the opening parenthesis as shown below:

Enter your regex: (?:\w{1})(\w{1})(\w{1})\1\1\2
Enter input string to search: abcbbc defghi klmllm
I found the text "abcbbc" starting at index 0 and ending at index 6.
I found the text "klmllm" starting at index 14 and ending at index 20.

Note that there are three groups: the first one is non-capturing and the next two are capturing groups. Since only capturing groups are numbered, \1 and \2 refer to the second and third groups, and only two back references are available.
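
The group numbering can be confirmed programmatically. The sketch below assumes Python's re module, where a compiled pattern exposes the number of capturing groups through its groups attribute, and again writes \w in place of \w{1}:

```python
import re

text = "abcbbc defghi klmllm"

# The first group is non-capturing, so numbering starts at the second group.
pattern = re.compile(r"(?:\w)(\w)(\w)\1\1\2")

group_count = pattern.groups
results = [(m.group(0), m.groups()) for m in pattern.finditer(text)]

print(group_count)  # 2
print(results)      # [('abcbbc', ('b', 'c')), ('klmllm', ('l', 'm'))]
```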

Next we take a look at look arounds. A look around checks whether a search pattern is present either ahead of or behind the current position, but the text that it checks is not included in the match. If the check is made in the forward direction, it is called a Lookahead, and if it is made in the backward direction, it is called a Lookbehind. There are two types of Lookaheads: Positive Lookahead and Negative Lookahead. The syntax for Positive Lookahead is (?=X), where X is the pattern to look for. An example of Positive Lookahead is shown below:

Enter your regex: foo(?=bar)
Enter input string to search: One can often see foobar used in software code.
I found the text "foo" starting at index 18 and ending at index 21.

The check is made for foo followed by bar, but only foo is returned as the match and bar is not. This is evident in the next example:

Enter your regex: foo(?=bar)
Enter input string to search: foo in foobar is used separately also.
I found the text "foo" starting at index 7 and ending at index 10.

There are two occurrences of foo in the input string, but only the foo that is followed by bar is picked. Negative Lookahead has the syntax (?!X) and works like Positive Lookahead, except that it matches only when the lookahead pattern is not found, as shown in the example below:

Enter your regex: foo(?!bar)
Enter input string to search: foo in foobar is used separately also.
I found the text "foo" starting at index 0 and ending at index 3.

There are two occurrences of foo in the input string, but only the foo that is not followed by bar is picked by Negative Lookahead. The next look around is Lookbehind. The syntax for Positive Lookbehind is (?<=X). This is similar to Positive Lookahead except that the search pattern must appear immediately before the matched text. An example is shown below:

Enter your regex: (?<=foo)bar
Enter input string to search: foo, bar and foobar are often used in software examples
I found the text "bar" starting at index 16 and ending at index 19.

In the above example, of the standalone bar and the bar that is part of foobar, only the bar that is preceded by foo is picked because only that one satisfies the Lookbehind search pattern. Lastly, we take a look at Negative Lookbehind, which has the syntax (?<!X). Negative Lookbehind is similar to Lookbehind, but the search pattern must not be found before the matched text, as seen in the example below:

Enter your regex: (?<!foo)bar
Enter input string to search: foo, bar and foobar are often used in software examples
I found the text "bar" starting at index 5 and ending at index 8.

We have used the same input text as in the Lookbehind example. Note that the bar that is not preceded by foo is picked and the bar that is preceded by foo is skipped.
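
The four look arounds can be compared side by side with a short sketch. It assumes Python's re module, which uses the same look around syntax, and the helper name spans is an assumption for illustration. Each result is the (start, end) position of a match, and in every case the text checked by the look around is left out of the match:

```python
import re


def spans(pattern, text):
    """Return the (start, end) position of every match."""
    return [m.span() for m in re.finditer(pattern, text)]


lookahead_text = "foo in foobar is used separately also."
lookbehind_text = "foo, bar and foobar are often used in software examples"

positive_lookahead = spans(r"foo(?=bar)", lookahead_text)
negative_lookahead = spans(r"foo(?!bar)", lookahead_text)
positive_lookbehind = spans(r"(?<=foo)bar", lookbehind_text)
negative_lookbehind = spans(r"(?<!foo)bar", lookbehind_text)

print(positive_lookahead)   # [(7, 10)]
print(negative_lookahead)   # [(0, 3)]
print(positive_lookbehind)  # [(16, 19)]
print(negative_lookbehind)  # [(5, 8)]
```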

This concludes the discussion on inline modifiers, capturing groups, non-capturing groups and look arounds in Regular Expressions.