
Map Reduce Basics

This handout is an introduction to Map Reduce in Hadoop. No background is assumed. I couldn’t find any of this material in any books or on the web.

Map Reduce programming consists of three parts: writing the setup (driver) code, writing the mapper, and writing the reducer.

We are going to start with the canonical word count program. Because data moves around between mappers and reducers, and it isn't obvious how this works, we break the tutorial into a series of exercises.

There are two Map Reduce interfaces, which is confusing when you search the web for sample programs and examples. The older Hadoop interface uses "import org.apache.hadoop.mapred"; the newer interface, which we use here, is "import org.apache.hadoop.mapreduce". The code for a map reduce program is different under the old mapred interface than under the newer mapreduce interface. This is especially error prone because of IDE autocomplete: class names in the two packages are similar, so the IDE often inserts the wrong package, which leads to errors that are difficult for a beginner to debug.

For those of you used to the old Hadoop mapred package, there are some differences between the mapred and mapreduce packages.

The two Mapper/Reducer interfaces come from different packages. Keep this clear: mixing classes with the same name from different packages, whether through autocomplete or by specifying the wrong package by hand, is a frequent source of errors that beginners find difficult to diagnose.

Old/New Mapper interfaces
The older one: org.apache.hadoop.mapred.Mapper and org.apache.hadoop.mapred.Reducer
The newer one: org.apache.hadoop.mapreduce.Mapper and org.apache.hadoop.mapreduce.Reducer

Old/New FileInputPath are the same
To set the path where the input data lives and the path where results go, the newer API uses FileInputFormat.addInputPath() and FileOutputFormat.setOutputPath(). Make sure to get a FileSystem object and delete the output path first so the M/R program won't give you a "directory already exists" error message.
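In the driver this typically looks something like the sketch below; it assumes a Job named job has already been created, and the paths are just examples.

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/tmp/wordcount-output");      // example output path
    if (fs.exists(out)) {
        fs.delete(out, true);                          // recursive delete so reruns don't fail
    }
    FileInputFormat.addInputPath(job, new Path("/tmp/wordcount-input"));  // example input path
    FileOutputFormat.setOutputPath(job, out);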

Old/New sending data from Mapper to Reducer
The old API uses output.collect(K, V) to send K/V pairs from the mapper to the reducer. The new API uses context.write(K, V).
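For example, a word count style map() looks roughly like this in each API. These are sketches of the method bodies only: the old-API version lives in a class that extends MapReduceBase and implements org.apache.hadoop.mapred.Mapper, while the new-API version lives in a class that extends org.apache.hadoop.mapreduce.Mapper.

    // old API: K/V pairs go through an OutputCollector
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        output.collect(new Text(value.toString()), new IntWritable(1));
    }

    // new API: K/V pairs go through the Context
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(value.toString()), new IntWritable(1));
    }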


Each K,V pair output from the mapper is sent to a reducer, and pairs with the same key are sent to the same reducer. We can show this in our test program below. You can also write extra debug records to the output using context.write(Text, IntWritable), as we will see in the demo program.

Old/New Job
The old API uses JobConf to configure the M/R parameters and JobClient to talk to the JobTracker to start the job. The new API uses Configuration to set the M/R parameters and Job to run the M/R job.

Old/New To run the Job
New API: System.exit(job.waitForCompletion(true) ? 0 : 1);
Old API: JobClient.runJob(conf);
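Side by side, the two driver styles look roughly like this. This is a sketch: the mapper/reducer setup calls are omitted and WordCount is just a placeholder class.

    // old API driver
    JobConf oldConf = new JobConf(WordCount.class);
    oldConf.setJobName("word count");
    oldConf.setOutputKeyClass(Text.class);
    oldConf.setOutputValueClass(IntWritable.class);
    JobClient.runJob(oldConf);

    // new API driver
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);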

Lab #1 Getting Started Lab, Hadoop Eclipse Plugin and WordCount Example:
Goal: to make sure the Map/Reduce Eclipse tools are installed correctly. Download the Map Reduce Eclipse plugin jar from https://issues.apache.org/jira/browse/MAPREDUCE-1280 and copy it into your Eclipse plugin directory where all plugins are installed. Restart Eclipse and you should see a MapReduce perspective in the upper right hand corner of Eclipse.

Create a new Map Reduce Project and a new WordCount Class

Enter the following modified WordCount program from the Hadoop examples. The program below has two modifications:

1. It does not use the command line; the input directory where the input files reside and the output directory where the program writes results are hardcoded into the program. This makes for faster iterations.

2. It deletes the output directory before each run. The Hadoop default configuration is set up so that if the output directory already exists an exception is thrown; this prevents the user from accidentally overwriting past results.


As a warning, not all the example programs in the Hadoop source distribution use the mapreduce package; some use the mapred package. We will use the new mapreduce interface (org.apache.hadoop.mapreduce) rather than the old mapred package.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // delete the output directory so reruns don't fail with "directory already exists"
        FileSystem fs = FileSystem.get(conf);
        fs.delete(new Path("/Users/dc/ncdcdata/testoutput"), true);
        FileInputFormat.addInputPath(job, new Path("/Users/dc/ncdcdata/testdata"));
        FileOutputFormat.setOutputPath(job, new Path("/Users/dc/ncdcdata/testoutput"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Create the directory ~/ncdcdata/testdata and add one test file containing the character "1" on each line. I used eleven lines, each with a single 1, followed by one blank line with no character.

Run the program using Run As >> Run on Hadoop.


Go to the output directory, ~/ncdcdata/testoutput. You should see a file named part-r-00000. Cat this file:

The file has two columns and shows there are 11 tokens named "1". Note the blank line is not counted as a null token, which is consistent with what the word count program is supposed to do.

Lab Part 2a Print out the K/V pairs in the map() function in the Mapper:
Input data is divided into input splits by the InputFormat, and each split is turned into K/V pairs that are fed directly into the mapper. The set of K/V pairs handed to one mapper is called an input split, and the map() function is called once per K/V pair (record) in the split. To print out the Key/Value pairs in each call to the mapper, add the following code to your map() function:

Object k = context.getCurrentKey();
Text v = context.getCurrentValue();
System.out.println("current key:" + k.toString() + " currentValue:" + v.toString());

Q) When you run the program, what are the keys and what are the values? Do we use the keys?

Most programs don't use the file offset. Some declare the key as Object, some as LongWritable, etc.

Lab Part 2b Print out the InputFormat class used by the Mapper/InputSplit:
Print out the InputFormat class used in the mapper. Hadoop takes the files in a directory and computes input splits, dividing up the input data and sending each input split to a different mapper in a distributed cluster. While running in a single node configuration we won't see the input data being distributed. However, we will need to create our own InputFormat classes in future exercises to control how the input data is divided up. Enter the following code:

try {
    System.out.println("input format class:"
            + context.getInputFormatClass().toString());
} catch (Exception e) {
    e.printStackTrace();
}

Q) What is the name of the InputFormat class? Describe the function of the org.apache.hadoop.mapreduce.lib.input.TextInputFormat class. When writing the program, we specified the input path for the input data using:

FileInputFormat.addInputPath(job, new Path("/Users/dc/ncdcdata/testdata"));

Lab Part 2c Print out the input splits from the TextInputFormat class you retrieved above:
Use Java reflection to call the getSplits() method of the TextInputFormat class retrieved above. Java reflection example: http://java.sun.com/developer/technicalArticles/ALT/Reflection/

// Start Lab2c num input splits
// (needs imports: java.lang.reflect.Method, java.util.List,
//  org.apache.hadoop.mapreduce.InputFormat, org.apache.hadoop.mapreduce.InputSplit,
//  org.apache.hadoop.mapreduce.lib.input.TextInputFormat)
try {
    Class<? extends InputFormat<?, ?>> tif = context.getInputFormatClass();
    // the statement below is equivalent to the one above
    // Class tif = context.getInputFormatClass();
    System.out.println("tif name:" + tif.getCanonicalName());
    Method meth = tif.getMethod("getSplits",
            org.apache.hadoop.mapreduce.JobContext.class);
    System.out.println("meth:" + meth.getName());
    Object argList[] = new Object[1];
    argList[0] = new org.apache.hadoop.mapreduce.JobContext(
            context.getConfiguration(), context.getJobID());
    Object returnObject = meth.invoke(new TextInputFormat(), argList);
    List<InputSplit> listInputSplit = (java.util.List) returnObject;
    System.out.println("num inputsplits in list:" + listInputSplit.size());
} catch (Exception e) {
    e.printStackTrace();
}
// End Lab2c

Lab Part 2d: What is the difference between the list of splits above and the call to context.getInputSplit() below?

// Start Lab2d
// how does the above list of splits correlate to:
InputSplit is = context.getInputSplit();
// the input split length is the length of the file
System.out.println("input split:" + is.toString());
System.out.println("input split len:" + is.getLength());
// End Lab2d

Lab Part 2e: Add more files to the input directory. Does this cause the number of input splits to increase? What is the difference between Parts 2c and 2d?

Note the differences: context.getInputSplit() returns the input split the current map() call is working on, and its length is the size of the file.

listInputSplit.size() is the number of splits, which here is the number of files in the input directory.

The K/V pair printed in each map() call is the current record, i.e. the line of the file being processed: the key is the file offset and the value is the file line.

In short, input data is split into input splits, one per mapper, and each split is fed to map() one record at a time.

Lab Part 3 Mapper context.write() experiments.
There are two different Context classes, Mapper.Context and Reducer.Context. They share a name but behave differently.

Mapper.Context: the mapper context gives access to the RecordReader and the input split the mapper is working on. The map() function is called once per record in the split.

Add the following to the map() function in the Mapper:

// Start Lab3
context.write(new Text("test token"), new IntWritable(100));
// End Lab3

This will send the K,V pair ‘test token’, 100 to the reducer. How many of these do you expect to show up at the reducer? What is the final output in the output file?

The final output should be "test token", sum, where sum is 100 * the total number of text lines across all your input files, because the map() function is called once per file line. The extra context.write() statement adds one more record per map() call, which doubles the number of records sent to the reducer for our one-token-per-line test file.

Lab Part 4 Reducer context.write() experiments.


Remove the context.write() from the Mapper. The behavior of Mappers and Reducers is different: the K/V pairs sent from the mapper are grouped by key before they appear inside the reducer, and all the K/V pairs with the same key are sent to the same reducer. This is clear in the interface, where the value is an Iterable rather than a single value.

The behavior of the reducer context.write() is different from the Mapper context.write(). Add:

// Start Lab4
context.write(new Text(" reducer test token"), new IntWritable(1000));
// End Lab4

How many times is the reducer context.write() called? It is called once per reduce() invocation, i.e. once per key; with our single-key test file that is just one call.

Lab Part 5 Combiner Class Calculating an Average
One way to reduce traffic across the network from the mapper to the reducer is to use a combiner class. Hadoop clusters can show abnormally poor performance when sending too much data from the mappers to the reducers saturates the network.
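For the earlier WordCount job, for example, the summing reducer can double as the combiner because its input and output types match; wiring it in is one line in the driver:

    // run IntSumReducer on each mapper's local output before it goes over the network
    job.setCombinerClass(IntSumReducer.class);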

Aside: the old M/R package, mapred, had an OutputCollector class. This class is not part of the mapreduce package; its job is handled internally through the Mapper's Context.

From the experiments above we can see that the K/V output from the mapper is grouped by key on its way into the reducer.

If no combiner is specified, there is supposedly a default output collector in the mapper which combines the K/V pairs. Is this true?

One of the basic tasks in data mining/machine learning is to compute a summation over a set of data. https://docs.google.com/a/hackerdojo.com/viewer?a=v&q=cache:SJNfWzTMlG8J:citeseerx.ist.psu.edu/viewdoc/download?doi%3D10.1.1.71.4156%26rep%3Drep1%26type%3Dpdf+andrew+ng+map+reduce&hl=en&gl=us&pid=bl&srcid=ADGEESj9vNhO1cw0wb84sMu6eMBcf6rvbCuJqm0pEqk3eeLoVVVw6zPZkDxZD8pgu4Gqbomqj0aIbutAcj-7hnkGk2yif_BrhdMTltJ-MWYkhJPyaN31ZUKN8ftjkpnY8DGYbv9c3Ah7&sig=AHIEtbT7mqwJ-Zyg5hLagoSltFcDWU1Saw&pli=1

Using the material covered in Lin's Data-Intensive Text Processing with MapReduce, we use a combiner to compute the average following his pseudocode in Figure 3.6.


Note the material from section 3.1.2, which covers the correctness of a Map Reduce implementation that uses a combiner. The Hadoop runtime gives no guarantee that the combiner will be run; it may run zero, one, or multiple times. Your code must produce correct results whether the combiner runs or not, and whether it runs once or many times; in other words, your algorithm must be insensitive to how many times the combiner is applied.

The Hadoop combiner runs as the map output is spilled and merged; during the merge it is only applied if the number of spill files reaches the min.num.spills.for.combine property, which defaults to 3. Each combiner call requires serializing and deserializing its input and output data, so any performance gained from reducing network traffic between the mapper and the reducer can be wiped out by the extra disk work. Performance here is data dependent: you can end up with worse performance using a combiner than without one. Yahoo has a rough guideline for when to use a combiner: when it reduces traffic by at least 20-30%: http://developer.yahoo.com/blogs/hadoop/posts/2010/08/apache_hadoop_best_practices_a/

There are two averages in this lab; it is very important to keep these two concepts separate in M/R programming:

1. A per-key average, computed locally in the reducer for each key.
2. A global average over all keys, accumulated here with static variables in the class.


Here is my solution for computing the average:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// this implementation is different from Lin's pseudocode in 3.6
// compute average per key
// compute average for all keys (added)
public class HadoopAvg {

    // note: static totals only give a correct global average in a
    // single-JVM (local) run; on a cluster each reducer has its own statics
    static private int totalSum;
    static private int totalCount;

    static class HadoopAvgMapper extends Mapper<LongWritable, Text, Text, AvgPair> {

        @Override
        public void map(LongWritable fileOffset, Text fileLine, Context context) {
            StringTokenizer st = new StringTokenizer(fileLine.toString());
            while (st.hasMoreTokens()) {
                String emitMe = st.nextToken();
                System.out.println("emitMe:" + emitMe);
                try {
                    AvgPair av = new AvgPair();
                    av.setKey(emitMe);
                    av.setCount(1);
                    av.setSum(Integer.parseInt(emitMe));
                    context.write(new Text(emitMe), av);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }

    // the combiner's input and output value types must both be AvgPair
    static class HadoopAvgCombiner extends Reducer<Text, AvgPair, Text, AvgPair> {

        @Override
        public void reduce(Text key, Iterable<AvgPair> countItr, Context context) {
            int sum = 0;
            int count = 0;
            System.out.println("combiner key:" + key.toString());
            // sum the partial sums and counts for this key
            for (AvgPair ap : countItr) {
                sum = sum + ap.getSum();
                count = count + ap.getCount();
            }
            AvgPair av = new AvgPair();
            av.setCount(count);
            av.setSum(sum);
            try {
                context.write(key, av);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    static class HadoopAvgReducer extends Reducer<Text, AvgPair, Text, FloatWritable> {

        @Override
        public void reduce(Text key, Iterable<AvgPair> pairItr, Context context) {
            float avg = 0;
            int sum = 0;
            int count = 0;
            System.out.println("reducer key:" + key.toString());
            for (AvgPair ap : pairItr) {
                sum = sum + ap.getSum();
                totalSum = totalSum + ap.getSum();
                totalCount = totalCount + ap.getCount();
                count = count + ap.getCount();
                System.out.println("ap.getSum:" + ap.getSum()
                        + " ap.getCount():" + ap.getCount());
            }
            System.out.println("sum:" + sum + " , count:" + count);
            avg = (float) sum / count;   // cast to avoid integer division
            System.out.println("avg:" + avg);
            System.out.println("totalSum:" + totalSum + " totalCount:" + totalCount);
            try {
                context.write(key, new FloatWritable(avg));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "HadoopAvg");
        job.setJarByClass(HadoopAvg.class);
        job.setMapperClass(HadoopAvgMapper.class);
        job.setReducerClass(HadoopAvgReducer.class);
        job.setCombinerClass(HadoopAvgCombiner.class);
        // very important: these have to match the map()/reduce() signatures.
        // The output types from each stage are set here, not automatically
        // through the interface.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(AvgPair.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);
        FileInputFormat.addInputPath(job, new Path("/Users/dc/avgdata/data"));
        FileSystem fs = FileSystem.get(conf);
        fs.delete(new Path("/Users/dc/avgdata/output"), true);
        FileOutputFormat.setOutputPath(job, new Path("/Users/dc/avgdata/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
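The AvgPair class used above is not included in the handout. A minimal sketch of a Writable that would satisfy the calls above (setKey/setSum/setCount and the getters) is shown below; the field names and serialization order are my own choices.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// sketch of the AvgPair value class; anything used as a map output value
// must implement Writable so Hadoop can serialize it between stages
public class AvgPair implements Writable {

    private String key = "";
    private int sum;
    private int count;

    public void setKey(String key) { this.key = key; }
    public void setSum(int sum) { this.sum = sum; }
    public void setCount(int count) { this.count = count; }
    public int getSum() { return sum; }
    public int getCount() { return count; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(key);
        out.writeInt(sum);
        out.writeInt(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        key = in.readUTF();
        sum = in.readInt();
        count = in.readInt();
    }

    @Override
    public String toString() {
        return key + " sum:" + sum + " count:" + count;
    }
}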

Lab Part 6: Split the above Lab 5 into two labs, one for each kind of average.
Lab Part xx: Pairs and Stripes for large sparse matrices.

Lab Part xx: Create Hadoop programs from pseudocode (Tom White's).
Lab Part xx: Joins from Tom White's book and Lin's book.
Lab Part 7: Running programs in a cluster. Use ToolRunner, upload the jar and run from the command line.

Lab Part 8 Controlling the input splits. Create an InputFormat called WholeFileInputFormat which creates 1 input split for each file.

One of the most common tasks is parsing data into HDFS. This consists of defining classes to parse custom file formats into domain objects. We will cover two simplified examples here, which should give you practice with defining a custom InputFormat and a custom RecordReader.

The FileInputFormat class has two methods to override, isSplitable() and createRecordReader(). The createRecordReader() method requires you to create a RecordReader class. There are two RecordReader types, one in each package, and they are different.

The older interface is RecordReader with methods createKey(), createValue(), getPos(), getProgress(), next() and close().

The newer Hadoop mapreduce package replaces the mapred interface with an abstract class and different method names: close(), getCurrentKey(), getCurrentValue(), getProgress(), initialize(), nextKeyValue().

Part A: create one record per file. The data in each file makes up exactly one record. Test this by printing out the value in the mapper; you should see the entire file contents.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// given an input directory, print the entire contents of each input file from the mapper
public class HadoopReadWholeFile {

    static class HadoopReadWholeFileMapper extends
            Mapper<NullWritable, BytesWritable, Text, Text> {

        @Override
        public void map(NullWritable key, BytesWritable value, Context context) {
            // key is null, value is the entire contents of one file
            System.out.println("value:" + new String(value.getBytes(), 0, value.getLength()));
        }
    }

    static class HadoopReadWholeFileReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) {
        }
    }

    static class WholeFileInputFormat extends
            FileInputFormat<NullWritable, BytesWritable> {

        // note the framework's spelling: isSplitable, with one "t"
        @Override
        protected boolean isSplitable(JobContext context, Path fileName) {
            return false;
        }

        @Override
        public RecordReader<NullWritable, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            WholeFileRecordReader wr = new WholeFileRecordReader();
            wr.initialize(split, context);
            return wr;
        }
    }

    static class WholeFileRecordReader extends
            RecordReader<NullWritable, BytesWritable> {

        private FileSplit fileSplit;
        private Configuration conf;
        private boolean processed = false;
        private NullWritable key;
        private BytesWritable value;

        @Override
        // called before the input split is used
        public void initialize(InputSplit inputSplit, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) inputSplit;
            this.conf = context.getConfiguration();
            this.key = NullWritable.get();
            this.value = new BytesWritable();
        }

        @Override
        public NullWritable getCurrentKey() {
            return this.key;
        }

        @Override
        public BytesWritable getCurrentValue() {
            return this.value;
        }

        @Override
        public float getProgress() {
            // either the single record has been delivered or it hasn't
            return processed ? 1.0f : 0.0f;
        }

        @Override
        public void close() {
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            // the whole file is one record: read it once, then report no more records
            if (!processed) {
                byte[] contents = new byte[(int) fileSplit.getLength()];
                Path file = fileSplit.getPath();
                FileSystem fs = file.getFileSystem(conf);
                FSDataInputStream in = null;
                try {
                    in = fs.open(file);
                    IOUtils.readFully(in, contents, 0, contents.length);
                    value.set(contents, 0, contents.length);
                } catch (Exception e) {
                    e.printStackTrace();
                } finally {
                    IOUtils.closeStream(in);
                }
                processed = true;
                return true;
            }
            return false;
        }
    }

    public static void main(String args[]) {
        try {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "HadoopReadWholeFile");
            job.setJarByClass(HadoopReadWholeFile.class);
            job.setMapperClass(HadoopReadWholeFileMapper.class);
            job.setReducerClass(HadoopReadWholeFileReducer.class);
            // tell the job to use our custom input format
            job.setInputFormatClass(WholeFileInputFormat.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path("/Users/dc/hadoopwholefile/data"));
            FileSystem fs = FileSystem.get(conf);
            fs.delete(new Path("/Users/dc/hadoopwholefile/output"), true);
            FileOutputFormat.setOutputPath(job, new Path("/Users/dc/hadoopwholefile/output"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Part 9 More Practice with Input Records, NCDC Data Lab. This lab is loosely based on Tom White's book examples that use the NCDC data in Hadoop programs.

Read the NCDC data from http://www.infochimps.com/link_frame?dataset=11860: download the tar.gz files and build a Hadoop program which can read them into a mapper and find the max temperature. Test on a single file first and compare against a non-MapReduce program. Then expand to a cluster and test.
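As a starting point, here is a sketch that assumes each input line has already been reduced to a simple "stationId temperature" pair; the real NCDC records need the fixed width parsing described in Tom White's book, and the driver has the same shape as the WordCount main() above.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// sketch: max temperature per station, assuming "stationId temperature" lines
public class MaxTempSketch {

    static class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().trim().split("\\s+");
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1])));
            }
        }
    }

    static class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text station, Iterable<IntWritable> temps, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable t : temps) {
                max = Math.max(max, t.get());
            }
            context.write(station, new IntWritable(max));
        }
    }
}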


You are free to follow the examples in Tom White’s book which don’t create domain objects. For a modification from the book, practice using WritableComparables and Writables.

Part 1: Create an NCDC object as the key into the mapper and implement Writable and WritableComparable for the NCDC object. Creating domain objects is a large part of writing Hadoop programs and lets you sort your data and pack it into sequence files and map files for later access. Another large part of Hadoop programming is reformatting received data into a form that map reduce can sort and order.
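The exact fields of the NCDC domain object are up to you. A minimal sketch of a WritableComparable key is shown below; the field names and sort order are my own choices.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// sketch of a composite key: sort by station id, then by year
public class NcdcKey implements WritableComparable<NcdcKey> {

    private String stationId = "";
    private int year;

    public NcdcKey() {}                       // required no-arg constructor

    public NcdcKey(String stationId, int year) {
        this.stationId = stationId;
        this.year = year;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(stationId);
        out.writeInt(year);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        stationId = in.readUTF();
        year = in.readInt();
    }

    @Override
    public int compareTo(NcdcKey other) {
        int cmp = stationId.compareTo(other.stationId);
        if (cmp != 0) {
            return cmp;
        }
        return year < other.year ? -1 : (year == other.year ? 0 : 1);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof NcdcKey)) return false;
        NcdcKey k = (NcdcKey) o;
        return stationId.equals(k.stationId) && year == k.year;
    }

    @Override
    public int hashCode() {
        return stationId.hashCode() * 31 + year;   // used by the default HashPartitioner
    }
}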

Part 2: Parse the .gz files and create NCDC objects, store them in SequenceFiles, BloomMapFiles, MapFiles. Test sequential access for fetching all the temperature records for a particular station, test random access for station data and test computational access times for calculating temperature averages.

Part 3: Graph the performance above and make a statement about which to use when. Write a program to calculate the average temps per station or max temp per station and compare performance.

Part 4: Repeat using Pig, graph performance.

Part 10: More practice with Input Records. Enron email data. Same as above but with a different data set.

Part 11: More practice with Input Records. Data mining URLs.


Part 9 Partitioner Lab: For single node Hadoop configurations all the mapper outputs are sent to one reducer. When there are multiple reducers, the time to complete the job is determined by the slowest reducer. Partitioners tell the framework which reducer to send each mapper output to; a well designed partitioner balances the work load. This is especially important if your development code is used in production.
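A minimal sketch of a custom partitioner, assuming Text keys and IntWritable values as in WordCount (the class name is my own):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// decides which reducer each mapper output record goes to
public class SimpleHashPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // hash the key, masked to keep the result non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

In the driver, register it with job.setPartitionerClass(SimpleHashPartitioner.class) and set the number of reducers with job.setNumReduceTasks().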

Part 10: Chaining Map Reduce jobs. Read the Pig and Hive source code and see how they do it. See also ChainMapper and ChainReducer.
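Before digging into the Pig and Hive sources, the simplest way to chain two jobs is to run them back to back, with the second job reading the first job's output directory. A sketch of such a driver is below; the paths are placeholders and identity mappers/reducers are used just to keep the sketch compilable.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// run two jobs back to back; job two reads job one's output directory
public class ChainDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // first stage: raw input -> intermediate directory
        Job first = new Job(conf, "first stage");
        first.setJarByClass(ChainDriver.class);
        first.setMapperClass(Mapper.class);       // substitute your own mapper
        first.setReducerClass(Reducer.class);     // substitute your own reducer
        first.setOutputKeyClass(LongWritable.class);
        first.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(first, new Path("/data/input"));
        FileOutputFormat.setOutputPath(first, new Path("/data/intermediate"));
        if (!first.waitForCompletion(true)) {
            System.exit(1);                       // stop the chain if stage one fails
        }

        // second stage: intermediate directory -> final output
        Job second = new Job(conf, "second stage");
        second.setJarByClass(ChainDriver.class);
        second.setMapperClass(Mapper.class);
        second.setReducerClass(Reducer.class);
        second.setOutputKeyClass(LongWritable.class);
        second.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(second, new Path("/data/intermediate"));
        FileOutputFormat.setOutputPath(second, new Path("/data/output"));
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}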

Part 11: Matrix Multiplication using Chained M/R jobs as per Ullman.

Part 11: Multiple output files, then use a final MR job to combine the multiple output files. Alternatively: hadoop fs -getmerge hdfs-output-dir local-file

Part 12: More serialization. Writing classes conforming to Writable and WritableComparable is tedious and time consuming. You can replace this process with the Google Protocol Buffers, Thrift, or Avro packages.

Using Google Protocol Buffers:

message Book {
    required int32 id = 1;
    required string title = 2;
    repeated string author = 3;
    message Author {
        required string name = 1;
    }
}

Code generated from the Google Protocol Buffer specification above:

Library.Book book = Library.Book.newBuilder()
        .setId(1234)
        .setTitle("My Stories")
        .addAuthor("Me")
        .build();
FileOutputStream fos = new FileOutputStream(new File("book.ser"));
book.writeTo(fos);
fos.close();
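To read the record back, a sketch assuming the same generated Library.Book class (and a java.io.FileInputStream import):

// read the serialized Book back from disk
Library.Book parsed = Library.Book.parseFrom(new FileInputStream("book.ser"));
System.out.println(parsed.getTitle());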

Plug this into your Hadoop project.


Using AVRO: Example: