Monday 19 March 2018

Hadoop Map Reduce I

In a series of posts, we plan to dive into Map Reduce programs in the Hadoop framework. For all the work in this post, we will be using the Cloudera sandbox, Cloudera QuickStart VM 5.12.

The Map Reduce programming paradigm has become commonplace in the last decade. In brief, the word count program over text input is widely seen as the classic hello world program of big data. So, we too will start with the word count program.

We are mostly interested in understanding the logic behind the Map Reduce programming model. So, we will simplify the exercise to running the programs from the Eclipse that ships with the sandbox, in a very standalone mode that uses the local file system and not HDFS. Though this setup uses only a single JVM, the ease of use is to be appreciated: we read the files directly and do not have to issue any HDFS commands. Besides, we are not using a build tool such as Maven in Eclipse; we will add the necessary jar files manually. So, running Map Reduce cannot really get any simpler than this!
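
Since we are not pointing Eclipse at any cluster configuration, Hadoop falls back to its local job runner and the local file system by default. If you prefer to be explicit about it, a minimal sketch of such a configuration is shown below; the property names are standard Hadoop settings, and the class name LocalModeConfig is just an illustrative helper, not part of the programs that follow.

import org.apache.hadoop.conf.Configuration;

public class LocalModeConfig {

    // Returns a Configuration that forces local, single-JVM execution.
    public static Configuration localConf() {
        Configuration conf = new Configuration();
        // Use the local file system instead of HDFS.
        conf.set("fs.defaultFS", "file:///");
        // Run map and reduce tasks in the local JVM instead of on a cluster.
        // MR1 clients read mapred.job.tracker, MR2/YARN clients read
        // mapreduce.framework.name; setting both is harmless.
        conf.set("mapred.job.tracker", "local");
        conf.set("mapreduce.framework.name", "local");
        return conf;
    }
}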

There are three programs in all. First, the Mapper program is shown below:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FirstWordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The key is the byte offset of the line; the value is the line itself.
        String line = value.toString();
        // Split the line on spaces and emit (word, 1) for every non-empty token.
        for (String word : line.split(" ")) {
            if (word.length() > 0) {
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }
}


In brief, the above program reads the text input line by line, splits each line with a space as the delimiter, and then emits a (key, value) pair for each word as (word, 1). For example, the line "a rose is a rose" yields (a, 1), (rose, 1), (is, 1), (a, 1), (rose, 1). Then, we have the Reducer program shown below:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FirstWordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s emitted by the mapper for this word.
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        // Emit (word, total count).
        context.write(key, new IntWritable(count));
    }
}


The above program receives the (word, 1) pairs emitted by the mapper, grouped by word, and reduces them to (word, count), where count is the number of occurrences of that word in the text input. Continuing the example above, the reducer is handed (rose, [1, 1]) and emits (rose, 2).

Lastly, to call the Mapper and Reducer programs, shown below is the Driver program:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FirstWordCountDriver {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Enter 'input path' and 'output path' arguments");
            System.exit(-1); // non-zero exit code signals the error
        }
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "First Word Count");
        job.setJarByClass(FirstWordCountDriver.class);
        // Wire up the input/output paths and the mapper/reducer classes.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(FirstWordCountMapper.class);
        job.setReducerClass(FirstWordCountReducer.class);
        // Types of the final (word, count) output.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
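
Two optional tweaks to the driver are worth mentioning, though the walkthrough below does not need them: deleting a leftover output directory so that repeated runs from Eclipse do not fail with an "output directory already exists" error, and registering the reducer as a combiner, which is safe for word count because summing is associative and commutative. The helper below is only a sketch; applyTweaks is an illustrative name, and it would be called from main() before job.waitForCompletion().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class DriverTweaks {

    // Illustrative helper: call from main() after the job is configured.
    static void applyTweaks(Configuration conf, Job job, Path outputPath) throws Exception {
        // A job fails if its output directory already exists, so remove
        // any directory left over from a previous run.
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true); // true = delete recursively
        }
        // Pre-aggregate counts on the map side to cut down shuffle traffic.
        job.setCombinerClass(FirstWordCountReducer.class);
    }
}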


While these three programs are among the simplest in the Map Reduce scheme of things, if you wish to understand the finer details of the programs, please refer to any site that may suit your taste. The internet has no shortage of sites eager to help developers get up to speed on Map Reduce.
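
As a small aside before we move to the sandbox, the stock Hadoop word count example usually reuses its writable objects instead of allocating new ones for every word. The sketch below is a functionally identical variant of our mapper with that tweak; the class name ReusingWordCountMapper is just illustrative and is not used anywhere else in this post.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReusingWordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The framework serializes the key/value on write, so reuse is safe.
        for (String token : value.toString().split(" ")) {
            if (token.length() > 0) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}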

Once the Cloudera Sandbox comes up, click on the Eclipse icon on the desktop. This will bring up Eclipse Luna Service Release 2 (4.4.2). Notice that there is already a project as shown below:

The existing project is called training; we will ignore it for now. Let us create a new project from scratch by right clicking in the space below the training project, as shown below:



In the window (shown below) that comes up, enter hadoop as the Project name and click on the Finish button:

Note that the JavaSE-1.7 JRE will serve as the execution environment. Notice that a new project called hadoop has been created. Select it and right click it to select Properties at the bottom, as shown below:

In the new window, select Java Build Path as shown below:

Click on Add External JARs ...

Now, navigate to the /usr/lib/hadoop/ directory, select the jars indicated below under that directory, and add them:

hadoop-annotations.jar
hadoop-auth.jar
hadoop-common.jar

In the same way, add commons-httpclient-3.1.jar from under /usr/lib/hadoop/lib/. Lastly, add the jars under /usr/lib/hadoop/client-0.20 as shown below:

Once all the jars are added, they look like this:

Click on OK. Select src under the hadoop root folder and add a new class as shown below:

In the window that comes up, clear the package name and enter FirstWordCountMapper as the class name and click on Finish:

Now, plug the FirstWordCountMapper code from the start of this post into the editor on the right, then right click and click Save to save the program:

In the same manner, add the FirstWordCountReducer and FirstWordCountDriver programs:

Select hadoop, then right click and select Folder to create a new folder called input, as shown below:

Enter input as the Folder name and click on Finish:

Right click on the newly created input folder and click on Import ...:

Click on Next >:

In the next window, select the directory of interest by clicking Browse... and choose a text file that will serve as the text input. In our case it is called blue_carbuncle.txt, the same file used in an earlier post on word count in Hive.

You can see the file added under the input folder. We now have all the parts needed to run the Map Reduce program. Right click on the Driver program and click on Run As and then Run Configurations... to set the arguments for the Driver program:

In the new window, enter input and output as the Program arguments, as shown below:

Then, click on Run to run the program:

We can see the output in the Console, in particular the line below:

18/03/19 10:46:38 INFO mapred.JobClient:  map 100% reduce 100%

Now, right click on hadoop and click on Refresh as shown below:



We can see a new folder called output containing two files. Double click on the part-r-00000 file to see its contents:

The results look similar to those in the earlier post. This concludes the run of our first Map Reduce program.