MapReduce: Design Patterns
A.A. 2018/19
Fabiana Rossi
Laurea Magistrale in Ingegneria Informatica - II anno
Macroarea di Ingegneria, Dipartimento di Ingegneria Civile e Ingegneria Informatica
The reference Big Data stack
[Figure: the reference Big Data stack, with layers High-level Interfaces, Data Processing, Data Storage, Resource Management, and a cross-cutting Support / Integration column]
Main reference for this lecture
D. Miner and A. Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O'Reilly Media, 2012.
MapReduce is a Framework
• Fit your solution into the framework of map and reduce
• In some situations this might be challenging
  – MapReduce can be a constraint
  – but it provides clear boundaries for what you can and cannot do
• Figuring out how to solve a problem with constraints requires
  – cleverness
  – a change in thinking!
MapReduce Design Pattern
What is a MapReduce design pattern?
• It is a template for solving a common and general data manipulation problem with MapReduce.
• Inspired by "Design Patterns: Elements of Reusable Object-Oriented Software" by the Gang of Four.

A pattern:
• is a general approach for solving a problem
• is not specific to a domain (e.g., text processing, graph analysis)

A design pattern allows you:
• to use tried and true design principles
• to build better software
MapReduce Design Pattern
• MapReduce is a framework
  – Fit your solution into the framework of map and reduce
  – Can be challenging in some situations
• Need to take the algorithm and break it into filter/aggregate steps
  – Filter becomes part of the map function
  – Aggregate becomes part of the reduce function
• Sometimes we need multiple MapReduce stages
• MapReduce is not a solution to every problem, not even every parallel problem
• It makes sense when:
  – Files are very large and are rarely updated
  – We need to iterate over all the files to generate some interesting property of the data in those files
Hands-on Hadoop
(our pre-configured Docker image)
Hadoop with Docker
• Create a small network named hadoop_network with one namenode (master) and 3 datanodes (slaves).
• We will interact with the master node, exchanging files through the volume mounted in /data
$ docker network create --driver bridge hadoop_network
$ docker run -t -i -p 9864:9864 -d --network=hadoop_network --name=slave1 effeerre/hadoop
$ docker run -t -i -p 9863:9864 -d --network=hadoop_network --name=slave2 effeerre/hadoop
$ docker run -t -i -p 9862:9864 -d --network=hadoop_network --name=slave3 effeerre/hadoop
$ docker run -t -i -p 9870:9870 -p 8088:8088 --network=hadoop_network --name=master -v $PWD/hddata:/data effeerre/hadoop
Hadoop with Docker
• Before we start, we need to initialize our environment
• On the master node:

$ hdfs namenode -format
$ $HADOOP_HOME/sbin/start-dfs.sh
$ $HADOOP_HOME/sbin/start-yarn.sh

• The WebUI tells us if everything is working properly:
  – HDFS: http://localhost:9870/
  – MapReduce Master: http://localhost:8088/
Hadoop with Docker
How to remove the containers
• stop and delete the namenode and datanodes

$ docker kill master slave1 slave2 slave3
$ docker rm master slave1 slave2 slave3

• remove the network

$ docker network rm hadoop_network
A simplified view of MapReduce
• Mappers are applied to all input key-value pairs, to generate an arbitrary number of intermediate pairs
• Reducers are applied to all intermediate values associated with the same intermediate key
• Between the map and reduce phase lies a barrier that involves a large distributed sort and group by
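To make the data flow concrete, here is a toy, single-process Java sketch (plain JDK, no Hadoop; all names are illustrative) of the map step, the group-by-key barrier, and the reduce step for word count:

import java.util.*;
import java.util.stream.*;

/* Toy, single-process illustration of map -> shuffle (group by key) -> reduce;
   real Hadoop distributes these steps across many nodes. */
public class MapReduceFlow {
    public static void main(String[] args) {
        List<String> input = List.of("hello world", "hello mapreduce");

        // Map: each input record yields an arbitrary number of (key, value) pairs
        Stream<Map.Entry<String, Integer>> intermediate = input.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .map(word -> Map.entry(word, 1));

        // "Barrier": group all intermediate values by key (the distributed sort/group-by)
        Map<String, List<Integer>> grouped = intermediate.collect(
                Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce: one call per key, over all the values associated with that key
        grouped.forEach((word, counts) ->
                System.out.println(word + "\t" + counts.stream().mapToInt(Integer::intValue).sum()));
    }
}

Running it prints hello 2, world 1, mapreduce 1 (order may vary).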
A more detailed view of MapReduce
• Combiner: an optimization that anticipates, on the map node, part of the work of the reduce function
  – Hadoop does not provide any guarantee on how many times it will call the combiner (zero, one, or many times)
• Partitioner: when there are multiple reducers, it divides the key space into partitions, one assigned to each reducer
  – A custom partitioner can be used to control how keys are passed to the reducers, e.g., to balance load or to guarantee properties such as total ordering (see the sketch below)
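As a hedged sketch of a custom partitioner (the class name and the first-letter scheme are illustrative, not part of the course material), the following routes keys by their initial letter so that each reducer receives a contiguous alphabetical range:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/* Illustrative partitioner: assumes keys start with a letter a-z */
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        char first = Character.toLowerCase(key.toString().charAt(0));
        // Map 'a'..'z' onto 0..numPartitions-1, keeping alphabetical ranges together
        return (first - 'a') * numPartitions / 26;
    }
}

The driver registers it with job.setPartitionerClass(FirstLetterPartitioner.class); a combiner, when the operation allows one, is registered with job.setCombinerClass(...).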
Job in MapReduce
• A MapReduce (i.e., Java) program, referred to as a job, consists of:
  – Code for Map and Reduce packaged together
  – Configuration parameters (where the input lies, where the output should be stored)
  – The input data set, stored on the underlying distributed file system
• Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods. They form the core of a MapReduce job.
Job MapReduce: Input
• InputFormat describes the input specification for a MapReduce job.
• The default behavior of file-based InputFormat implementations (typically sub-classes of FileInputFormat) is to split the input into logical InputSplit instances, based on the total size (in bytes) of the input files.
• The FileSystem block size of the input files is treated as an upper bound for input splits.
• The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
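In the driver this typically reduces to a couple of excerpt-style lines (the path is illustrative):

/* Hedged driver sketch: configure the input of the job (path is illustrative) */
job.setInputFormatClass(TextInputFormat.class);             // default: one line per record
FileInputFormat.addInputPath(job, new Path("/data/input")); // may be called multiple times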
Job MapReduce: Output
• OutputFormat describes the output-specification for a MapReduce job.
• Output files are stored in a FileSystem.
• TextOutputFormat is the default OutputFormat.
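The corresponding excerpt-style driver sketch (again, the path is illustrative):

/* Hedged driver sketch: configure the output of the job (path is illustrative) */
job.setOutputFormatClass(TextOutputFormat.class);              // key TAB value, one pair per line
FileOutputFormat.setOutputPath(job, new Path("/data/output")); // directory must not already exist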
Mapper and Reducer

public class Map extends Mapper<Object, Text, Text, IntWritable> {
    public void map(Object key, Text value, Context context) {
        ...
    }
}

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) {
        ...
    }
}

Context object: allows the Mapper/Reducer to interact with the Hadoop system. It includes configuration data for the job as well as interfaces which allow it to emit output.
Job MapReduce: Example

/* Create and configure a new MapReduce Job */
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
/* Map function */
job.setMapperClass(Mapper.class);
/* Reduce function */
job.setReducerClass(Reducer.class);
job.setNumReduceTasks(2);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
...

This is only an excerpt of WordCount.java
Design Pattern: Numerical Summarizations
• Goal: compute some numerical aggregate value (count, maximum, average, ...) over a set of values
• Structure:
  – Mapper: it outputs keys that consist of the fields to group by, and values consisting of any pertinent numerical items
  – Combiner: (optional) it can greatly reduce the number of intermediate key/value pairs to be sent across the network, but it works well only with associative and commutative operations (see the sketch after this list)
  – Partitioner: (optional) it can better distribute key/value pairs across the reduce tasks
  – Reducer: it receives a set of numerical values and applies the aggregation function
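Note that the reducer cannot always be reused as the combiner: averaging, for instance, is not associative, so a combiner must not emit partial averages. A common workaround, sketched below under an assumed "sum:count" Text encoding (illustrative, not the course code), is to pre-aggregate (sum, count) pairs and divide only in the reducer:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/* Hypothetical combiner for an average: merges (sum, count) pairs encoded
   as "sum:count" strings. The reducer performs the same merge and only
   then divides sum by count. */
public class SumCountCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0, count = 0;
        for (Text v : values) {
            String[] parts = v.toString().split(":");
            sum += Long.parseLong(parts[0]);
            count += Long.parseLong(parts[1]);
        }
        // Emit a partial (sum, count), NOT a partial average: Hadoop may run
        // the combiner zero, one, or many times, so output must stay mergeable
        context.write(key, new Text(sum + ":" + count));
    }
}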
Design Pattern: Numerical Summarizations
[Figure: structure of the Numerical Summarizations pattern]
Design Pattern: Numerical Summarizations
Examples:
• Word count, record count
  – Count the number of occurrences of each word
• Min/Max
  – Compute the max temperature per region
• Average/Median/Standard Deviation
  – Average the number of requests per page per Web site
• Inverted Index Summarization
  – The inverted index pattern is commonly used to generate an index from a data set to allow for faster searches or data enrichment capabilities.
WordCount: Example

Input:
hello world goodbye
hello fabiana
hello john
hello mike
hello mapreduce

Output:
fabiana 1
goodbye 1
hello 5
john 1
mapreduce 1
mike 1
world 1
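A hedged sketch of a Mapper and Reducer pair that would produce this output, in the style of the course's WordCount.java (class and field names are illustrative):

/* Excerpt-style sketch: nested in the driver class, as in the other listings */
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);       // emit (word, 1) for each token
        }
    }
}

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) sum += val.get();  // sum all the 1s for this word
        result.set(sum);
        context.write(key, result);         // emit (word, total count)
    }
}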
Summarization: Example
• Goal: compute the average word length by initial letter

Input:
hello world goodbye
hello fabiana
hello john
hello mike
hello mapreduce

Output (two reducers, one file each):
  reducer 1: g 7.0, m 6.5, w 5.0
  reducer 2: f 7.0, h 5.0, j 4.0
Summarization: Example
• Goal: compute the average word length by initial letter
• Check: AverageWordLengthByInitialLetter.java

public void map(Object key, Text value, Context context) {
    String line = value.toString().toLowerCase();
    /* Emit length by initial letter */
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
        String word = itr.nextToken();
        initialLetter.set(word.substring(0, 1));
        length.set(word.length());
        context.write(initialLetter, length);
    }
}

This is only an excerpt
Summarization: Example
• Goal: compute the average word length by initial letter
• Check: AverageWordLengthByInitialLetter.java

public void reduce(Text key, Iterable<IntWritable> values, Context context) {
    int sum = 0;
    int count = 0;
    for (IntWritable val : values) {
        sum += val.get();
        count++;
    }
    average.set(((float) sum / (float) count));
    context.write(key, average);
}

This is only an excerpt
Design Pattern: Filtering
• Goal: filter out records that are not of interest and keep the others.
• An application of filtering is sampling
  – Sampling can be used to get a smaller, yet representative, data set
• Structure:
  – Mapper: filters data (it does most of the work)
  – Reducer: may simply be the identity, if the job does not produce an aggregation on filtered data
Design Pattern: Filtering
[Figure: structure of the Filtering pattern]
Design Pattern: Filtering
Use cases:
• Closer view of data: to extract records that have something in common or something of interest (e.g., same event date, same user id)
• Tracking a thread of events: extract a thread of consecutive events as a case study from a larger data set
• Distributed grep
• Simple random sampling: take a simple random sample of the data set
  – use a filter with an evaluation function that randomly returns true or false (see the sketch below)
• Remove low scoring data
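A random-sampling filter can be as small as the following hedged sketch (the 1% rate and the class name are illustrative):

/* Excerpt-style sketch: simple random sampling keeping ~1% of the records;
   there is no aggregation, so reducers can be skipped with
   job.setNumReduceTasks(0) */
public static class SampleMapper extends Mapper<Object, Text, NullWritable, Text> {
    private final Random rand = new Random();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        if (rand.nextDouble() < 0.01) {      // evaluation function: random true/false
            context.write(NullWritable.get(), value);
        }
    }
}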
Filtering: Example
• Goal: implement a distributed version of grep
• grep is a command-line utility for searching plain-text data sets for lines that match a regular expression

Input:
hello world goodbye
hello fabiana
hello john
hello mike
hello mapreduce

Pattern: good

Output:
hello world goodbye
Filtering: Example
• Goal: implement a distributed version of grep
• Check: DistributedGrep.java

public static class GrepMapper extends Mapper<Object, Text, NullWritable, Text> {

    private Pattern pattern = null;

    public void setup(Context context) throws ... {
        pattern = Pattern.compile( ... );
    }

    public void map(Object key, Text value, Context context) ... {
        Matcher matcher = pattern.matcher(value.toString());
        if (matcher.find()) {
            context.write(NullWritable.get(), value);
        }
    }
}

This is only an excerpt
Design Pattern: Distinct
• Special case of the filter pattern
• Goal: filter out records that look like another record in the data set
• Structure:
  – Mapper: it takes each record and extracts the data fields for which we want unique values. The mapper outputs the record as the key, and null as the value
  – Reducer: it groups the nulls together by key; we then simply output the key. Because each key is grouped together, the output data set is guaranteed to be unique
• Examples:
  – Retrieve the list of words, with no repetitions, in a document
Distinct: Example
• Goal: retrieve the list of words, with no repetitions, in a document
• Check: DistinctWords.java

public void map(Object key, Text value, Context context) ... {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, NullWritable.get());
    }
}
...
public void reduce(Text key, Iterable<NullWritable> values, Context context) ... {
    context.write(key, NullWritable.get());
}

This is only an excerpt
Design Pattern: Data Organization
• Goal: combine and organize data in a more complex data structure.
• This pattern includes several pattern sub-categories:
  – structure-to-hierarchical pattern (e.g., denormalization)
  – partitioning and binning patterns
  – total order sorting patterns
  – shuffling patterns
Design Pattern: Structure to Hierarchical
• Goal: create new records from data stored in very different structures
  – This pattern follows the denormalization principles of big data stores
• Structure:
  – We might need to combine data from multiple data sources (use MultipleInputs; see the sketch below)
  – Map: it associates the data to be aggregated with the same key (e.g., the root of the hierarchical record). Each record can be enriched with a label identifying its source
  – Reduce: it creates the hierarchical structure from the list of received data items
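The MultipleInputs wiring in the driver might look like the following hedged sketch (paths and mapper names are illustrative; the actual code is in TopicItemsHierarchy.java):

/* Hedged driver sketch: one mapper per source, both emitting the topic id as
   the key and a source-labelled value, so the reducer can tell topics and
   items apart */
MultipleInputs.addInputPath(job, new Path("/data/topics"),
        TextInputFormat.class, TopicMapper.class);
MultipleInputs.addInputPath(job, new Path("/data/items"),
        TextInputFormat.class, ItemMapper.class);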
Design Pattern: Structure to Hierarchical
[Figure: structure of the Structure to Hierarchical pattern]
Structure to Hierarchical: Example

Input (topics):
1::Movies
2::Football teams
3::Software

Input (items):
1::Star Wars
1::Mad Max
1::Creed
2::Roma
2::Juventus
3::Autocad
3::Eclipse
3::IntelliJ
3::Microsoft Office
3::Linux
3::Google Chrome

Output (one JSON record per topic), e.g.:
{"topic":"Software","items":["Autocad","Eclipse","IntelliJ","Microsoft Office","Linux","Google Chrome"]}
Structure to Hierarchical: Example
• Goal: create a JSON structure for each topic, containing the list of its items
  – Two inputs are provided: the list of topics and the list of items
• Check: TopicItemsHierarchy.java

public void map(Object key, Text value, Context context) ... {
    String line = value.toString();
    String[] parts = line.split("::");
    if (parts.length != 2)
        return;
    String id = parts[0];
    String content = parts[1];
    outKey.set(id);
    outValue.set(valuePrefix + content);
    context.write(outKey, outValue);
}

This is only an excerpt
Structure to Hierarchical: Example
• Check: TopicItemsHierarchy.java

public void reduce(Text key, Iterable<Text> values, Context context) ... {
    Topic topic = new Topic();
    for (Text t : values) {
        String value = t.toString();
        if (ValueType.TOPIC.equals(discriminate(value))) {
            topic.setTopic(getContent(value));
        } else if (ValueType.ITEM.equals(discriminate(value))) {
            topic.addItem(getContent(value));
        }
    }
    /* Serialize topic */
    String serializedTopic = gson.toJson(topic);
    context.write(new Text(serializedTopic), NullWritable.get());
}

This is only an excerpt
Design Pattern: Partitioning
• Goal: move the records into categories (i.e., shards, partitions, or bins) without caring about the order of records
• Structure:
  – Map: in most cases, the identity mapper can be used
  – Partitioner: it determines which reducer each record is sent to; each reducer corresponds to a particular partition
  – Reduce: in most cases, the identity reducer can be used
• All you have to define is the function that determines which partition a record goes to
Design Pattern: Partitioning
[Figure: structure of the Partitioning pattern]
Partitioning: Example
• Goal: group dates by year. In this case a year represents a partition
• Check: PartitionDatesByYear.java

public static class DatePartitioner extends Partitioner<IntWritable, Text> {
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        // each year, offset by the initial year, is mapped to one reducer
        return (key.get() - CONFIG_INITIAL_YEAR) % numPartitions;
    }
}

This is only an excerpt
Two-stage MapReduce
• As map-reduce calculations get more complex, break them down into stages
  – Output of one stage = input to the next stage
• Intermediate output may be useful for different outputs too, so you can get some reuse
  – Intermediate records can be saved in the data store, forming a materialized view
• Early stages of map-reduce operations often represent the heaviest amount of data access, so building and saving them once as a basis for many downstream uses saves a lot of work
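An excerpt-style sketch of chaining two jobs in a driver (paths, job names, and the error handling are illustrative):

/* Hedged sketch: the first job's output directory becomes the second job's input */
Job stage1 = Job.getInstance(conf, "stage-1");
FileInputFormat.addInputPath(stage1, new Path("/data/input"));
FileOutputFormat.setOutputPath(stage1, new Path("/data/intermediate"));
if (!stage1.waitForCompletion(true))
    System.exit(1);                     // stop the pipeline if stage 1 fails

Job stage2 = Job.getInstance(conf, "stage-2");
FileInputFormat.addInputPath(stage2, new Path("/data/intermediate")); // reusable materialized view
FileOutputFormat.setOutputPath(stage2, new Path("/data/output"));
System.exit(stage2.waitForCompletion(true) ? 0 : 1);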
Design Pattern: Total Order Sorting
• Sort all the records of the data set
  – Sorting in a parallel manner is not easy
• Observe:
  – each individual reducer will sort its data by key, but unfortunately, this sorting is not global across all data
• Goal: we want a total order sorting where, if you concatenate the output files, the records are sorted
• Sorted data has a number of useful properties:
  – Sorted by time, it can provide a timeline view on the data
  – Finding things in a sorted data set can be done with binary search
  – Some databases can bulk load data faster if the data is sorted on the primary key or index column
Design Pattern: Total Order Sorting
• This pattern has two phases (jobs):
  – an analyze phase that determines the ranges, and an order phase that actually sorts the data

Analyze phase: identify the data set slices
• Map: it does a simple random sampling
• Reduce: only one reducer is used; it collects the sort keys and slices them into the data range boundaries

Order phase: order the data set
• Map: similar to the mapper of the analyze phase, but the record itself is stored as the value
• Partition: it loads the partition file and routes data according to the partitions
  – Hadoop provides an implementation: TotalOrderPartitioner
• Reduce: it is the identity function; the number of reducers needs to be equal to the number of partitions
Total Order Sorting: Example
• Goal: order the dataset
  – We rely on the TotalOrderPartitioner class
  – Slightly different implementation of the Analyze and Order phases
• Check: TotalOrdering.java
• Observe the driver, which defines the chain of MapReduce jobs

/* **** Job #1: Analyze phase **** */
Job sampleJob = Job.getInstance(conf, "TotalOrdering");
/* Map: samples data; Reduce: identity function */
sampleJob.setMapperClass(AnalyzePhaseMapper.class);
sampleJob.setNumReduceTasks(0);
sampleJob.setOutputFormatClass(SequenceFileOutputFormat.class);
...
if (isCompletedCorrecty(sampleJob)) {

This is only an excerpt
Total Order Sorting: Example

/* **** Job #2: Ordering phase **** */
Job orderJob = Job.getInstance(conf, "TotalOrderSortingStage");
/* Map: identity function; Reduce: emits only the key */
orderJob.setMapperClass(Mapper.class);
orderJob.setReducerClass(OrderingPhaseReducer.class);
orderJob.setNumReduceTasks(10);
/* Partitioner */
orderJob.setPartitionerClass(TotalOrderPartitioner.class);
/* Define the dataset sampling strategy to identify partition bounds */
InputSampler.writePartitionFile(orderJob, new InputSampler.RandomSampler(.3, 10));
}

This is only an excerpt
Order Phase (1)

/* **** Job #2: Ordering phase **** */
Job orderJob = Job.getInstance(conf, "TotalOrderSortingStage");
orderJob.setJarByClass(TotalOrdering.class);
/* Map: identity function, outputs the key/value pairs in the SequenceFile */
orderJob.setMapperClass(Mapper.class);
/* Reduce: identity function */
orderJob.setReducerClass(OrderingPhaseReducer.class);
orderJob.setNumReduceTasks(10);

This is only an excerpt of main in TotalOrdering.java
Order Phase (2)

/* Set input and output files: the input is the previous job's output */
orderJob.setInputFormatClass(SequenceFileInputFormat.class);
orderJob.setPartitionerClass(TotalOrderPartitioner.class);
TotalOrderPartitioner.setPartitionFile(orderJob.getConfiguration(), partitionFile);
InputSampler.writePartitionFile(orderJob, new InputSampler.RandomSampler(.3, 10));

This is only an excerpt of main in TotalOrdering.java
Analyze Phase (1)

public static class AnalyzePhaseMapper extends Mapper<Object, Text, Text, Text> {
    ...
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        outkey.set(value.toString());
        context.write(outkey, value);
    }
}

/* Map: samples data; Reduce: identity function */
sampleJob.setMapperClass(AnalyzePhaseMapper.class);
sampleJob.setNumReduceTasks(0);
/* Set input and output files */
sampleJob.setOutputFormatClass(SequenceFileOutputFormat.class);

This is only an excerpt of main in TotalOrdering.java
Design Pattern: Shuffling
• Goal: we want to shuffle our dataset, i.e., randomize the order of our records (e.g., to improve anonymity); see the sketch below
• Structure:
  – Map: it emits the record as the value, along with a random key
  – Reduce: the reducer sorts the random keys, further randomizing the data
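A hedged sketch of the two functions (class names are illustrative):

/* Excerpt-style sketch: tag each record with a random key, let the framework
   sort/group by that key, then drop the key on output. */
public static class ShuffleMapper extends Mapper<Object, Text, IntWritable, Text> {
    private final Random rand = new Random();
    private final IntWritable randomKey = new IntWritable();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        randomKey.set(rand.nextInt());      // the random key decides the record's position
        context.write(randomKey, value);
    }
}

public static class ShuffleReducer extends Reducer<IntWritable, Text, Text, NullWritable> {
    public void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values)           // emit records in random-key order, key dropped
            context.write(value, NullWritable.get());
    }
}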