MapReduce: Design Patterns
A.A. 2018/19
Fabiana Rossi
Laurea Magistrale in Ingegneria Informatica - II anno
Macroarea di Ingegneria, Dipartimento di Ingegneria Civile e Ingegneria Informatica
The reference Big Data stack
[Figure: the reference Big Data stack, with layers High-level Interfaces, Data Processing, Data Storage, Resource Management, and a cross-cutting Support / Integration column]
Main reference for this lecture
D. Miner and A. Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O'Reilly Media, 2012.
MapReduce is a Framework
• Fit your solution into the framework of map and reduce
• In some situations this might be challenging
  – MapReduce can be a constraint
  – but it provides clear boundaries for what you can and cannot do
• Figuring out how to solve a problem with constraints requires
  – cleverness
  – a change in thinking!
MapReduce Design Pattern
What is a MapReduce design pattern?
• It is a template for solving a common and general data manipulation problem with MapReduce.
• Inspired by "Design Patterns: Elements of Reusable Object-Oriented Software" by the Gang of Four.

A pattern:
• is a general approach for solving a problem
• is not specific to a domain (e.g., text processing, graph analysis)

A design pattern allows you:
• to use tried and true design principles
• to build better software
MapReduce Design Pattern
• MapReduce is a framework
  – Fit your solution into the framework of map and reduce
  – Can be challenging in some situations
• Need to take the algorithm and break it into filter/aggregate steps
  – Filter becomes part of the map function
  – Aggregate becomes part of the reduce function
• Sometimes we need multiple MapReduce stages
• MapReduce is not a solution to every problem, not even every parallel problem
• It makes sense when:
  – Files are very large and are rarely updated
  – We need to iterate over all the files to generate some interesting property of the data in those files
Hands-on Hadoop
(our pre-configured Docker image)
Hadoop with Docker
• Create a small network named hadoop_network with one namenode (master) and 3 datanodes (slaves).
• We will interact with the master node, exchanging files through the volume mounted in /data
$ docker network create --driver bridge hadoop_network
$ docker run -t -i -p 9864:9864 -d --network=hadoop_network --name=slave1 effeerre/hadoop
$ docker run -t -i -p 9863:9864 -d --network=hadoop_network --name=slave2 effeerre/hadoop
$ docker run -t -i -p 9862:9864 -d --network=hadoop_network --name=slave3 effeerre/hadoop
$ docker run -t -i -p 9870:9870 -p 8088:8088 --network=hadoop_network --name=master -v $PWD/hddata:/data effeerre/hadoop
Hadoop with Docker
• Before we start, we need to initialize our environment
• On the master node:

$ hdfs namenode -format
$ $HADOOP_HOME/sbin/start-dfs.sh
$ $HADOOP_HOME/sbin/start-yarn.sh

• The WebUI tells us if everything is working properly:
  – HDFS: http://localhost:9870/
  – MapReduce Master: http://localhost:8088/
Hadoop with Docker
How to remove the containers
• stop and delete the namenode and datanodes

$ docker kill master slave1 slave2 slave3
$ docker rm master slave1 slave2 slave3

• remove the network

$ docker network rm hadoop_network
A simplified view of MapReduce
• Mappers are applied to all input key-value pairs, to generate an arbitrary number of intermediate pairs
• Reducers are applied to all intermediate values associated with the same intermediate key
• Between the map and reduce phase lies a barrier that involves a large distributed sort and group by
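To make the data flow concrete, here is a toy, single-process Java sketch (plain JDK, no Hadoop; all names are illustrative) of the map step, the group-by-key barrier, and the reduce step for word count:

import java.util.*;
import java.util.stream.*;

/* Toy, single-process illustration of map -> shuffle (group by key) -> reduce;
   real Hadoop distributes these steps across many nodes. */
public class MapReduceFlow {
    public static void main(String[] args) {
        List<String> input = List.of("hello world", "hello mapreduce");

        // Map: each input record yields an arbitrary number of (key, value) pairs
        Stream<Map.Entry<String, Integer>> intermediate = input.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .map(word -> Map.entry(word, 1));

        // "Barrier": group all intermediate values by key (the distributed sort/group-by)
        Map<String, List<Integer>> grouped = intermediate.collect(
                Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce: one call per key, over all the values associated with that key
        grouped.forEach((word, counts) ->
                System.out.println(word + "\t" + counts.stream().mapToInt(Integer::intValue).sum()));
    }
}

Running it prints hello 2, world 1, mapreduce 1 (order may vary).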
A more detailed view of MapReduce
• Combiner: an optimization that anticipates, on the map node, part of the work of the reduce function
  – Hadoop does not provide any guarantee on how many times it will call the combiner (zero, one, or many times)
• Partitioner: when there are multiple reducers, it divides the key space into partitions, one assigned to each reducer
  – A custom partitioner can be used to control how keys are passed to the reducers, e.g., to balance load or to guarantee properties such as total ordering (see the sketch below)
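As a hedged sketch of a custom partitioner (the class name and the first-letter scheme are illustrative, not part of the course material), the following routes keys by their initial letter so that each reducer receives a contiguous alphabetical range:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/* Illustrative partitioner: assumes keys start with a letter a-z */
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        char first = Character.toLowerCase(key.toString().charAt(0));
        // Map 'a'..'z' onto 0..numPartitions-1, keeping alphabetical ranges together
        return (first - 'a') * numPartitions / 26;
    }
}

The driver registers it with job.setPartitionerClass(FirstLetterPartitioner.class); a combiner, when the operation allows one, is registered with job.setCombinerClass(...).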
Job in MapReduce
• A MapReduce (i.e., Java) program, referred to as a job, consists of:
  – Code for Map and Reduce packaged together
  – Configuration parameters (where the input lies, where the output should be stored)
  – The input data set, stored on the underlying distributed file system
• Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods. They form the core of a MapReduce job.
Job MapReduce: Input
• InputFormat describes the input specification for a MapReduce job.
• The default behavior of file-based InputFormat implementations (typically sub-classes of FileInputFormat) is to split the input into logical InputSplit instances, based on the total size (in bytes) of the input files.
• The FileSystem block size of the input files is treated as an upper bound for input splits.
• The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
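In the driver this typically reduces to a couple of excerpt-style lines (the path is illustrative):

/* Hedged driver sketch: configure the input of the job (path is illustrative) */
job.setInputFormatClass(TextInputFormat.class);             // default: one line per record
FileInputFormat.addInputPath(job, new Path("/data/input")); // may be called multiple times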
Job MapReduce: Output
• OutputFormat describes the output-specification for a MapReduce job.
• Output files are stored in a FileSystem.
• TextOutputFormat is the default OutputFormat.
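The corresponding excerpt-style driver sketch (again, the path is illustrative):

/* Hedged driver sketch: configure the output of the job (path is illustrative) */
job.setOutputFormatClass(TextOutputFormat.class);              // key TAB value, one pair per line
FileOutputFormat.setOutputPath(job, new Path("/data/output")); // directory must not already exist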
Mapper and Reducer

public class Map extends Mapper<Object, Text, Text, IntWritable> {
    public void map(Object key, Text value, Context context) {
        ...
    }
}

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) {
        ...
    }
}

Context object: allows the Mapper/Reducer to interact with the Hadoop system. It includes configuration data for the job as well as interfaces which allow it to emit output.
Job MapReduce: Example

/* Create and configure a new MapReduce Job */
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
/* Map function */
job.setMapperClass(Mapper.class);
/* Reduce function */
job.setReducerClass(Reducer.class);
job.setNumReduceTasks(2);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
...

This is only an excerpt of WordCount.java
Design Pattern: Numerical Summarizations
• Goal: compute some numerical aggregate value (count, maximum, average, ...) over a set of values
• Structure:
  – Mapper: it outputs keys that consist of the fields to group by, and values consisting of any pertinent numerical items
  – Combiner: (optional) it can greatly reduce the number of intermediate key/value pairs to be sent across the network, but it works well only with associative and commutative operations (see the sketch after this list)
  – Partitioner: (optional) it can better distribute key/value pairs across the reduce tasks
  – Reducer: it receives a set of numerical values and applies the aggregation function
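Note that the reducer cannot always be reused as the combiner: averaging, for instance, is not associative, so a combiner must not emit partial averages. A common workaround, sketched below under an assumed "sum:count" Text encoding (illustrative, not the course code), is to pre-aggregate (sum, count) pairs and divide only in the reducer:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/* Hypothetical combiner for an average: merges (sum, count) pairs encoded
   as "sum:count" strings. The reducer performs the same merge and only
   then divides sum by count. */
public class SumCountCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0, count = 0;
        for (Text v : values) {
            String[] parts = v.toString().split(":");
            sum += Long.parseLong(parts[0]);
            count += Long.parseLong(parts[1]);
        }
        // Emit a partial (sum, count), NOT a partial average: Hadoop may run
        // the combiner zero, one, or many times, so output must stay mergeable
        context.write(key, new Text(sum + ":" + count));
    }
}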
Design Pattern: Numerical Summarizations
[Figure: structure of the Numerical Summarizations pattern]
Design Pattern: Numerical Summarizations
Examples:
• Word count, record count
  – Count the number of occurrences of each word
• Min/Max
  – Compute the max temperature per region
• Average/Median/Standard Deviation
  – Average the number of requests per page per Web site
• Inverted Index Summarization
  – The inverted index pattern is commonly used to generate an index from a data set to allow for faster searches or data enrichment capabilities.
WordCount: Example

Input:
hello world goodbye
hello fabiana
hello john
hello mike
hello mapreduce

Output:
fabiana 1
goodbye 1
hello 5
john 1
mapreduce 1
mike 1
world 1
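A hedged sketch of a Mapper and Reducer pair that would produce this output, in the style of the course's WordCount.java (class and field names are illustrative):

/* Excerpt-style sketch: nested in the driver class, as in the other listings */
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);       // emit (word, 1) for each token
        }
    }
}

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) sum += val.get();  // sum all the 1s for this word
        result.set(sum);
        context.write(key, result);         // emit (word, total count)
    }
}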
Summarization: Example
• Goal: compute the average word length by initial letter

Input:
hello world goodbye
hello fabiana
hello john
hello mike
hello mapreduce

Output (two reducers, one file each):
  reducer 1: g 7.0, m 6.5, w 5.0
  reducer 2: f 7.0, h 5.0, j 4.0
Summarization: Example
• Goal: compute the average word length by initial letter
• Check: AverageWordLengthByInitialLetter.java

public void map(Object key, Text value, Context context) {
    String line = value.toString().toLowerCase();
    /* Emit length by initial letter */
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
        String word = itr.nextToken();
        initialLetter.set(word.substring(0, 1));
        length.set(word.length());
        context.write(initialLetter, length);
    }
}

This is only an excerpt
Summarization: Example
• Goal: compute the average word length by initial letter
• Check: AverageWordLengthByInitialLetter.java

public void reduce(Text key, Iterable<IntWritable> values, Context context) {
    int sum = 0;
    int count = 0;
    for (IntWritable val : values) {
        sum += val.get();
        count++;
    }
    average.set(((float) sum / (float) count));
    context.write(key, average);
}

This is only an excerpt
Design Pattern: Filtering
• Goal: filter out records that are not of interest and keep the others.
• An application of filtering is sampling
  – Sampling can be used to get a smaller, yet representative, data set
• Structure:
  – Mapper: filters data (it does most of the work)
  – Reducer: may simply be the identity, if the job does not produce an aggregation on filtered data
Design Pattern: Filtering
[Figure: structure of the Filtering pattern]
Design Pattern: Filtering
Use cases:
• Closer view of data: to extract records that have something in common or something of interest (e.g., same event date, same user id)
• Tracking a thread of events: extract a thread of consecutive events as a case study from a larger data set
• Distributed grep
• Simple random sampling: take a simple random sample of the data set
  – use a filter with an evaluation function that randomly returns true or false (see the sketch below)
• Remove low scoring data
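A random-sampling filter can be as small as the following hedged sketch (the 1% rate and the class name are illustrative):

/* Excerpt-style sketch: simple random sampling keeping ~1% of the records;
   there is no aggregation, so reducers can be skipped with
   job.setNumReduceTasks(0) */
public static class SampleMapper extends Mapper<Object, Text, NullWritable, Text> {
    private final Random rand = new Random();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        if (rand.nextDouble() < 0.01) {      // evaluation function: random true/false
            context.write(NullWritable.get(), value);
        }
    }
}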
Filtering: Example
• Goal: implement a distributed version of grep
• grep is a command-line utility for searching plain-text data sets for lines that match a regular expression

Input:
hello world goodbye
hello fabiana
hello john
hello mike
hello mapreduce

Pattern: good

Output:
hello world goodbye
Filtering: Example
• Goal: implement a distributed version of grep
• Check: DistributedGrep.java

public static class GrepMapper extends Mapper<Object, Text, NullWritable, Text> {

    private Pattern pattern = null;

    public void setup(Context context) throws ... {
        pattern = Pattern.compile( ... );
    }

    public void map(Object key, Text value, Context context) ... {
        Matcher matcher = pattern.matcher(value.toString());
        if (matcher.find()) {
            context.write(NullWritable.get(), value);
        }
    }
}

This is only an excerpt
Design Pattern: Distinct
• Special case of the filter pattern
• Goal: filter out records that look like another record in the data set
• Structure:
  – Mapper: it takes each record and extracts the data fields for which we want unique values. The mapper outputs the record as the key, and null as the value
  – Reducer: it groups the nulls together by key; we then simply output the key. Because each key is grouped together, the output data set is guaranteed to be unique
• Examples:
  – Retrieve the list of words, with no repetitions, in a document
Distinct: Example
• Goal: retrieve the list of words, with no repetitions, in a document
• Check: DistinctWords.java

public void map(Object key, Text value, Context context) ... {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, NullWritable.get());
    }
}
...
public void reduce(Text key, Iterable<NullWritable> values, Context context) ... {
    context.write(key, NullWritable.get());
}

This is only an excerpt
Design Pattern: Data Organization
• Goal: combine and organize data in a more complex data structure.
• This pattern includes several pattern sub-categories:
  – structure-to-hierarchical pattern (e.g., denormalization)
  – partitioning and binning patterns
  – total order sorting patterns
  – shuffling patterns
Design Pattern: Structure to Hierarchical
• Goal: create new records from data stored in very different structures
  – This pattern follows the denormalization principles of big data stores
• Structure:
  – We might need to combine data from multiple data sources (use MultipleInputs; see the sketch below)
  – Map: it associates the data to be aggregated with the same key (e.g., the root of the hierarchical record). Each record can be enriched with a label identifying its source
  – Reduce: it creates the hierarchical structure from the list of received data items
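The MultipleInputs wiring in the driver might look like the following hedged sketch (paths and mapper names are illustrative; the actual code is in TopicItemsHierarchy.java):

/* Hedged driver sketch: one mapper per source, both emitting the topic id as
   the key and a source-labelled value, so the reducer can tell topics and
   items apart */
MultipleInputs.addInputPath(job, new Path("/data/topics"),
        TextInputFormat.class, TopicMapper.class);
MultipleInputs.addInputPath(job, new Path("/data/items"),
        TextInputFormat.class, ItemMapper.class);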
Design Pattern: Structure to Hierarchical
[Figure: structure of the Structure to Hierarchical pattern]
Structure to Hierarchical: Example

Input (topics):
1::Movies
2::Football teams
3::Software

Input (items):
1::Star Wars
1::Mad Max
1::Creed
2::Roma
2::Juventus
3::Autocad
3::Eclipse
3::IntelliJ
3::Microsoft Office
3::Linux
3::Google Chrome

Output (one JSON record per topic), e.g.:
{"topic":"Software","items":["Autocad","Eclipse","IntelliJ","Microsoft Office","Linux","Google Chrome"]}
Structure to Hierarchical: Example
• Goal: create a JSON structure for each topic, containing the list of its items
  – Two inputs are provided: the list of topics and the list of items
• Check: TopicItemsHierarchy.java

public void map(Object key, Text value, Context context) ... {
    String line = value.toString();
    String[] parts = line.split("::");
    if (parts.length != 2)
        return;
    String id = parts[0];
    String content = parts[1];
    outKey.set(id);
    outValue.set(valuePrefix + content);
    context.write(outKey, outValue);
}

This is only an excerpt
Structure to Hierarchical: Example
• Check: TopicItemsHierarchy.java

public void reduce(Text key, Iterable<Text> values, Context context) ... {
    Topic topic = new Topic();
    for (Text t : values) {
        String value = t.toString();
        if (ValueType.TOPIC.equals(discriminate(value))) {
            topic.setTopic(getContent(value));
        } else if (ValueType.ITEM.equals(discriminate(value))) {
            topic.addItem(getContent(value));
        }
    }
    /* Serialize topic */
    String serializedTopic = gson.toJson(topic);
    context.write(new Text(serializedTopic), NullWritable.get());
}

This is only an excerpt
Design Pattern: Partitioning
• Goal: move the records into categories (i.e., shards, partitions, or bins) without caring about the order of records
• Structure:
  – Map: in most cases, the identity mapper can be used
  – Partitioner: it determines which reducer each record is sent to; each reducer corresponds to a particular partition
  – Reduce: in most cases, the identity reducer can be used
• All you have to define is the function that determines which partition a record goes to
Design Pattern: Partitioning
[Figure: structure of the Partitioning pattern]
Partitioning: Example
• Goal: group dates by year. In this case a year represents a partition
• Check: PartitionDatesByYear.java

public static class DatePartitioner extends Partitioner<IntWritable, Text> {
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        // each year, offset by the initial year, is mapped to one reducer
        return (key.get() - CONFIG_INITIAL_YEAR) % numPartitions;
    }
}

This is only an excerpt
Two-stage MapReduce
• As map-reduce calculations get more complex, break them down into stages
  – Output of one stage = input to the next stage
• Intermediate output may be useful for different outputs too, so you can get some reuse
  – Intermediate records can be saved in the data store, forming a materialized view
• Early stages of map-reduce operations often represent the heaviest amount of data access, so building and saving them once as a basis for many downstream uses saves a lot of work
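An excerpt-style sketch of chaining two jobs in a driver (paths, job names, and the error handling are illustrative):

/* Hedged sketch: the first job's output directory becomes the second job's input */
Job stage1 = Job.getInstance(conf, "stage-1");
FileInputFormat.addInputPath(stage1, new Path("/data/input"));
FileOutputFormat.setOutputPath(stage1, new Path("/data/intermediate"));
if (!stage1.waitForCompletion(true))
    System.exit(1);                     // stop the pipeline if stage 1 fails

Job stage2 = Job.getInstance(conf, "stage-2");
FileInputFormat.addInputPath(stage2, new Path("/data/intermediate")); // reusable materialized view
FileOutputFormat.setOutputPath(stage2, new Path("/data/output"));
System.exit(stage2.waitForCompletion(true) ? 0 : 1);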
Design Pattern: Total Order Sorting
• Sort all the records of the data set
  – Sorting in a parallel manner is not easy
• Observe:
  – each individual reducer will sort its data by key, but unfortunately, this sorting is not global across all data
• Goal: we want a total order sorting where, if you concatenate the output files, the records are sorted
• Sorted data has a number of useful properties:
  – Sorted by time, it can provide a timeline view on the data
  – Finding things in a sorted data set can be done with binary search
  – Some databases can bulk load data faster if the data is sorted on the primary key or index column
Design Pattern: Total Order Sorting
• This pattern has two phases (jobs):
  – an analyze phase that determines the ranges, and an order phase that actually sorts the data

Analyze phase: identify the data set slices
• Map: it does a simple random sampling
• Reduce: only one reducer is used; it collects the sort keys and slices them into the data range boundaries

Order phase: order the data set
• Map: similar to the mapper of the analyze phase, but the record itself is stored as the value
• Partition: it loads the partition file and routes data according to the partitions
  – Hadoop provides an implementation: TotalOrderPartitioner
• Reduce: it is the identity function; the number of reducers needs to be equal to the number of partitions
Total Order Sorting: Example
• Goal: order the dataset
  – We rely on the TotalOrderPartitioner class
  – Slightly different implementation of the Analyze and Order phases
• Check: TotalOrdering.java
• Observe the driver, which defines the chain of MapReduce jobs

/* **** Job #1: Analyze phase **** */
Job sampleJob = Job.getInstance(conf, "TotalOrdering");
/* Map: samples data; Reduce: identity function */
sampleJob.setMapperClass(AnalyzePhaseMapper.class);
sampleJob.setNumReduceTasks(0);
sampleJob.setOutputFormatClass(SequenceFileOutputFormat.class);
...
if (isCompletedCorrecty(sampleJob)) {

This is only an excerpt
Total Order Sorting: Example

/* **** Job #2: Ordering phase **** */
Job orderJob = Job.getInstance(conf, "TotalOrderSortingStage");
/* Map: identity function; Reduce: emits only the key */
orderJob.setMapperClass(Mapper.class);
orderJob.setReducerClass(OrderingPhaseReducer.class);
orderJob.setNumReduceTasks(10);
/* Partitioner */
orderJob.setPartitionerClass(TotalOrderPartitioner.class);
/* Define the dataset sampling strategy to identify partition bounds */
InputSampler.writePartitionFile(orderJob, new InputSampler.RandomSampler(.3, 10));
}

This is only an excerpt
Order Phase (1)

/* **** Job #2: Ordering phase **** */
Job orderJob = Job.getInstance(conf, "TotalOrderSortingStage");
orderJob.setJarByClass(TotalOrdering.class);
/* Map: identity function, outputs the key/value pairs in the SequenceFile */
orderJob.setMapperClass(Mapper.class);
/* Reduce: identity function */
orderJob.setReducerClass(OrderingPhaseReducer.class);
orderJob.setNumReduceTasks(10);

This is only an excerpt of main in TotalOrdering.java
Order Phase (2)

/* Set input and output files: the input is the previous job's output */
orderJob.setInputFormatClass(SequenceFileInputFormat.class);
orderJob.setPartitionerClass(TotalOrderPartitioner.class);
TotalOrderPartitioner.setPartitionFile(orderJob.getConfiguration(), partitionFile);
InputSampler.writePartitionFile(orderJob, new InputSampler.RandomSampler(.3, 10));

This is only an excerpt of main in TotalOrdering.java
Analyze Phase (1)

public static class AnalyzePhaseMapper extends Mapper<Object, Text, Text, Text> {
    ...
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        outkey.set(value.toString());
        context.write(outkey, value);
    }
}

/* Map: samples data; Reduce: identity function */
sampleJob.setMapperClass(AnalyzePhaseMapper.class);
sampleJob.setNumReduceTasks(0);
/* Set input and output files */
sampleJob.setOutputFormatClass(SequenceFileOutputFormat.class);

This is only an excerpt of main in TotalOrdering.java
Design Pattern: Shuffling
• Goal: we want to shuffle our dataset, i.e., randomize the order of our records (e.g., to improve anonymity); see the sketch below
• Structure:
  – Map: it emits the record as the value, along with a random key
  – Reduce: the reducer sorts the random keys, further randomizing the data
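A hedged sketch of the two functions (class names are illustrative):

/* Excerpt-style sketch: tag each record with a random key, let the framework
   sort/group by that key, then drop the key on output. */
public static class ShuffleMapper extends Mapper<Object, Text, IntWritable, Text> {
    private final Random rand = new Random();
    private final IntWritable randomKey = new IntWritable();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        randomKey.set(rand.nextInt());      // the random key decides the record's position
        context.write(randomKey, value);
    }
}

public static class ShuffleReducer extends Reducer<IntWritable, Text, Text, NullWritable> {
    public void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values)           // emit records in random-key order, key dropped
            context.write(value, NullWritable.get());
    }
}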