07 MapReduce Features

  • Copyright 2012 Cloudwick Technologies 1

  • Copyright 2012 Cloudwick Technologies 2

  • Copyright 2012 Cloudwick Technologies 3

    Counters are a useful channel for gathering statistics about a job, whether for quality control or for application-level statistics.

    They are also useful for problem diagnosis.

    Built-in Counters:

  • Copyright 2012 Cloudwick Technologies 4

    User Defined Java Counters:

    MapReduce allows user code to define a set of counters, which are then incremented as desired in the mapper or reducer. Counters are defined by a Java enum, which serves to group related counters.

    The name of the enum is the group name, and the enum's fields are the counter names.

    Counters are global: the MapReduce framework aggregates them across all maps and reduces to produce a grand total at the end of the job.

    public class MaxTemperatureWithCounters extends Configured implements Tool {
      enum Temperature { MISSING, MALFORMED }
      // ...
    }

    Counters can be set and incremented via the method Reporter.incrCounter(group, name, amount) or, when using an enum, Reporter.incrCounter(enum, amount).

    Sample output:
    09/04/20 12:33:36 INFO mapred.JobClient: Air Temperature Records
    09/04/20 12:33:36 INFO mapred.JobClient: Malformed=3
    09/04/20 12:33:36 INFO mapred.JobClient: Missing=66136856
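
    As a sketch (using the old org.apache.hadoop.mapred API shown elsewhere in these slides; the class name is illustrative and the enum mirrors the Temperature enum defined above), a mapper could increment these counters like this:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Illustrative mapper that increments user-defined counters.
    public class TemperatureCounterMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      // Mirrors the Temperature enum defined in the driver class on this slide.
      enum Temperature { MISSING, MALFORMED }

      @Override
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        String line = value.toString().trim();
        if (line.isEmpty()) {
          reporter.incrCounter(Temperature.MISSING, 1);    // record had no temperature
        } else if (!line.matches("[+-]?\\d+")) {
          reporter.incrCounter(Temperature.MALFORMED, 1);  // value is not an integer
        } else {
          output.collect(new Text("temperature"), new IntWritable(Integer.parseInt(line)));
        }
      }
    }

    The framework aggregates the per-task counts and reports the grand totals, as in the sample output above.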

  • Copyright 2012 Cloudwick Technologies 5

    Mappers often produce large amounts of intermediate data, and all of it must be passed to the reducers, which can result in a lot of network traffic.
    It is often possible to specify a combiner, which acts like a mini-reducer: it runs locally on a single mapper's output, and its output is sent to the reducers.
    Combiner and reducer code are often identical. Technically, this is possible only if the operation performed is commutative and associative; in that case, the input and output data types of the combiner/reducer must be identical.
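
    A minimal sketch in the old API (class names are illustrative): a sum reducer can double as a combiner because addition is commutative and associative.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Sum reducer: used as the combiner it produces partial sums on the map side,
    // used as the reducer it produces the final sums.
    public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      @Override
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

    // In the driver (illustrative):
    //   conf.setCombinerClass(SumReducer.class);  // mini-reduce on each mapper's output
    //   conf.setReducerClass(SumReducer.class);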

  • Copyright 2012 Cloudwick Technologies 6

    The Partitioner divides up the key space: it controls which reducer each intermediate key (and its associated values) goes to. Often the default behavior is fine; the default is the HashPartitioner.

    public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {

      public void configure(JobConf job) {}

      public int getPartition(K2 key, V2 value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }

    Implement a custom partitioner to send particular keys to a particular reducer.

    Demo on Custom Partitioner:
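
    A minimal sketch of a custom partitioner in the old API (the routing rule and class name are illustrative):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Keys starting with A-M go to reducer 0; all other keys are hashed across
    // the remaining reducers.
    public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {

      @Override
      public void configure(JobConf job) { }

      @Override
      public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String k = key.toString();
        if (k.isEmpty() || numReduceTasks == 1) {
          return 0;
        }
        char first = Character.toUpperCase(k.charAt(0));
        if (first >= 'A' && first <= 'M') {
          return 0;                                  // this key range goes to reducer 0
        }
        return 1 + (k.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1);
      }
    }

    // Driver (illustrative): conf.setPartitionerClass(FirstLetterPartitioner.class);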

  • Copyright 2012 Cloudwick Technologies 7

    Although sorting is done during the shuffle and sort phase, there are several ways to achieve and control it.

    Partial Sort:

    By default, a MapReduce job sorts records by key, so each reducer's output is sorted. If there are 30 reducers, 30 sorted files are generated, but they cannot simply be concatenated to produce a globally sorted file.

    Total Sort:

    Use only one reducer; however, this is very inefficient for large files.

    Use a partitioner that respects the total order of the output. For example, with four partitions, we could put keys for temperatures less than -10°C in the first partition, those between -10°C and 0°C in the second, those between 0°C and 10°C in the third, and those over 10°C in the fourth.

    Secondary Sort:

    For any particular key, the values are not sorted. To sort them, use a composite key made up of the key and the value, and partition by the natural-key part of the composite key.
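
    A minimal sketch of the partitioning half of a secondary sort, assuming the mapper emits a composite Text key of the form "naturalKey#value" (the key format and class name are illustrative):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Partitions only on the natural-key part, so all values for one natural key
    // land on the same reducer while the shuffle sorts the full composite key.
    public class NaturalKeyPartitioner implements Partitioner<Text, IntWritable> {

      @Override
      public void configure(JobConf job) { }

      @Override
      public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String naturalKey = key.toString().split("#", 2)[0];   // drop the value part
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }

    // A grouping comparator that also compares only the natural key is still needed
    // so that one reduce() call sees all composite keys for that natural key.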

  • Copyright 2012 Cloudwick Technologies 8

    MapReduce can perform joins between large datasets, but writing the code to do joins from scratch is fairly involved. Rather than writing MapReduce programs, you might consider using a higher-level framework such as Pig, Hive, or Cascading, in which join operations are a core part of the implementation.

    If the join is performed by the mapper, it is called a map-side join, whereas if it is performed by the reducer it is called a reduce-side join.

    A map-side join between large inputs works by performing the join before the data reaches the map function. For this to work, though, the inputs to each map must be partitioned and sorted in a particular way. Each input dataset must be divided into the same number of partitions, and it must be sorted by the same key (the join key) in each source. All the records for a particular key must reside in the same partition.

    Use the CompositeInputFormat from the org.apache.hadoop.mapred.join package (org.apache.hadoop.mapreduce.lib.join in the new API) to run a map-side join.

    A reduce-side join is less efficient than a map-side join because both datasets have to go through the MapReduce shuffle. The basic idea is that the mapper tags each record with its source and uses the join key as the map output key, so that the records with the same key are brought together in the reducer.
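
    A minimal sketch of the mapper for one side of a reduce-side join (the tab-separated layout, the "A" tag, and the class name are assumptions):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Tags each record of dataset A with "A" and emits the join key as the map
    // output key, so matching records from both datasets meet in the reducer.
    public class DatasetAJoinMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      @Override
      public void map(LongWritable offset, Text record,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        String[] fields = record.toString().split("\t", 2);
        if (fields.length == 2) {
          output.collect(new Text(fields[0]), new Text("A\t" + fields[1]));
        }
      }
    }

    // A second mapper tags the other dataset with "B"; the reducer groups by join
    // key and combines the "A" and "B" records it receives.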

  • Copyright 2012 Cloudwick Technologies 9

    Side data can be defined as extra read-only data needed by a job (map or reduce tasks) to process the main dataset. The challenge is to make side data available to all the map or reduce tasks (which are spread across the cluster) in a convenient and efficient fashion. Examples of side data:

    Lookup tables
    Dictionaries
    Standard configuration values

    It is possible to cache side data in memory in a static field, so that tasks of the same job that run in succession on the same tasktracker can share the data.

    You can set arbitrary key-value pairs in the job configuration using the various setter methods on Configuration (or JobConf in the old MapReduce API). This is very useful if you need to pass a small piece of metadata to your tasks, but it does not scale to larger amounts of side data.
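
    A minimal sketch of passing one small value through the job configuration in the old API (the property name, field layout, and class name are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // The driver would set the value with:
    //   JobConf conf = new JobConf(MyDriver.class);
    //   conf.set("myjob.country.filter", "US");   // arbitrary key-value pair
    public class CountryFilterMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private String countryFilter;

      @Override
      public void configure(JobConf job) {
        // Values set in the driver are visible to every task of the job.
        countryFilter = job.get("myjob.country.filter", "US");
      }

      @Override
      public void map(LongWritable offset, Text record,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // Emit only records whose first field matches the configured country.
        String[] fields = record.toString().split("\t");
        if (fields.length > 0 && fields[0].equals(countryFilter)) {
          output.collect(new Text(fields[0]), record);
        }
      }
    }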

  • Copyright 2012 Cloudwick Technologies 10

    Rather than serializing side data in the job configuration, it is preferable to distribute datasets using Hadoop's distributed cache mechanism. This provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run. To save network bandwidth, files are normally copied to any particular node once per job.

    Transfer happens behind the scenes before any task is executed.
    Note: the DistributedCache is read-only.
    Files in the DistributedCache are automatically deleted from slave nodes when the job finishes.

    Implementation: place the files into HDFS, then configure the DistributedCache in your driver code:

    JobConf job = new JobConf();
    DistributedCache.addCacheFile(new URI("/tmp/lookup.txt"), job);
    DistributedCache.addFileToClassPath(new Path("/tmp/abc.jar"), job);
    DistributedCache.addCacheArchive(new URI("/tmp/xyz.zip"), job);

    or, on the command line (the -files generic option requires the driver to use ToolRunner/GenericOptionsParser):
    $ hadoop jar myjar.jar MyDriver -files file1,file2,file3,...
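
    On the task side, a sketch (old API; the tab-separated lookup-file format and class name are assumptions) of loading the cached file once in configure():

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Loads the cached lookup file into memory before any records are processed.
    public class LookupMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private final Map<String, String> lookup = new HashMap<String, String>();

      @Override
      public void configure(JobConf job) {
        try {
          Path[] cached = DistributedCache.getLocalCacheFiles(job);
          if (cached != null && cached.length > 0) {
            BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
            String line;
            while ((line = reader.readLine()) != null) {
              String[] parts = line.split("\t", 2);   // assumed tab-separated lookup file
              if (parts.length == 2) {
                lookup.put(parts[0], parts[1]);
              }
            }
            reader.close();
          }
        } catch (IOException e) {
          throw new RuntimeException("Failed to load cached lookup file", e);
        }
      }

      @Override
      public void map(LongWritable offset, Text record,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // Enrich each record with its lookup value (empty string if not found).
        String key = record.toString().split("\t", 2)[0];
        String extra = lookup.containsKey(key) ? lookup.get(key) : "";
        output.collect(new Text(key), new Text(extra));
      }
    }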

  • Copyright 2012 Cloudwick Technologies 11

    Retrieve a FileSystem instance in order to use the API:

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    A file in HDFS is represented by a Path:

    Path p = new Path("/path/to/my/file");

    Some useful API methods:
    FSDataOutputStream create(...): provides methods for writing primitives, raw bytes, etc.
    FSDataInputStream open(...): provides methods for reading primitives, raw bytes, etc.
    boolean delete(...)
    boolean mkdirs(...)
    void copyFromLocalFile(...)
    void copyToLocalFile(...)
    FileStatus[] listStatus(...)
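
    A small sketch exercising a few of these methods (the paths and class name are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FsExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/tmp/fs-example");
        fs.mkdirs(dir);                                   // create a directory

        FSDataOutputStream out = fs.create(new Path(dir, "hello.txt"));
        out.writeUTF("hello hdfs");                       // write primitives/raw bytes
        out.close();

        for (FileStatus status : fs.listStatus(dir)) {    // list the directory
          System.out.println(status.getPath() + " " + status.getLen());
        }

        fs.delete(dir, true);                             // recursive delete
      }
    }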
