

MapReduce: Design Patterns
A.A. 2018/19

Fabiana Rossi

Laurea Magistrale in Ingegneria Informatica - II anno

Macroarea di Ingegneria Dipartimento di Ingegneria Civile e Ingegneria Informatica

The reference Big Data stack

Fabiana Rossi - SABD 2018/19 1

Resource Management

Data Storage

Data Processing

High-level Interfaces Support / Integration


Main reference for this lecture
D. Miner and A. Shook, MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O'Reilly Media, 2012.


MapReduce is a Framework
• Fit your solution into the framework of map and reduce
• In some situations this might be challenging
  – MapReduce can be a constraint
  – but it provides clear boundaries for what you can and cannot do
• Figuring out how to solve a problem within these constraints requires
  – cleverness
  – a change in thinking!


MapReduce Design Pattern
What is a MapReduce design pattern?
• It is a template for solving a common and general data manipulation problem with MapReduce.
• Inspired by "Design Patterns: Elements of Reusable Object-Oriented Software" by the Gang of Four.

A pattern:
• is a general approach for solving a problem
• is not specific to a domain (e.g., text processing, graph analysis)

A design pattern allows you:
• to use tried and true design principles
• to build better software


MapReduce Design Pattern
• MapReduce is a framework
  – Fit your solution into the framework of map and reduce
  – Can be challenging in some situations
• Need to take the algorithm and break it into filter/aggregate steps
  – Filter becomes part of the map function
  – Aggregate becomes part of the reduce function
• Sometimes we need multiple MapReduce stages
• MapReduce is not a solution to every problem, not even every parallel problem
• It makes sense when:
  – Files are very large and are rarely updated
  – We need to iterate over all the files to generate some interesting property of the data in those files


Hands-on Hadoop
(our pre-configured Docker image)


Hadoop with Docker


• Create a small network named hadoop_network with one namenode (master) and three datanodes (slaves).
• We will interact with the master node, exchanging files through the volume mounted in /data.

$ docker network create --driver bridge hadoop_network

$ docker run -t -i -p 9864:9864 -d --network=hadoop_network --name=slave1 effeerre/hadoop

$ docker run -t -i -p 9863:9864 -d --network=hadoop_network --name=slave2 effeerre/hadoop

$ docker run -t -i -p 9862:9864 -d --network=hadoop_network --name=slave3 effeerre/hadoop

$ docker run -t -i -p 9870:9870 -p 8088:8088 --network=hadoop_network --name=master -v $PWD/hddata:/data effeerre/hadoop


Hadoop with Docker


• Before we start, we need to initialize our environment

• On the master node

• The WebUI tells us if everything is working properly:
  – HDFS: http://localhost:9870/
  – MapReduce Master: http://localhost:8088/

$ hdfs namenode -format

$ $HADOOP_HOME/sbin/start-dfs.sh

$ $HADOOP_HOME/sbin/start-yarn.sh


Hadoop with Docker


How to remove the containers

• stop and delete the namenode and datanodes

$ docker kill master slave1 slave2 slave3

$ docker rm master slave1 slave2 slave3

• remove the network

$ docker network rm hadoop_network


A simplified view of MapReduce

• Mappers are applied to all input key-value pairs, to generate an arbitrary number of intermediate pairs

• Reducers are applied to all intermediate values associated with the same intermediate key

• Between the map and reduce phase lies a barrier that involves a large distributed sort and group by
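The three steps above can be sketched on a single process, in plain Java. This is an illustrative simulation of the map → shuffle → reduce flow (word count), not the Hadoop API; all class and method names are made up for the example:

```java
import java.util.*;
import java.util.stream.*;

// A minimal, single-process sketch of the map -> shuffle -> reduce flow.
public class MiniMapReduce {

    // Map phase: each input line yields (word, 1) intermediate pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.split("\\s+"))
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Shuffle barrier: group all intermediate values by key (the "sort and group by").
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (var p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Reduce phase: aggregate all values associated with the same key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> out = new TreeMap<>();
        grouped.forEach((k, vs) -> out.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    public static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line));
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("hello world", "hello mapreduce")));
    }
}
```

In real Hadoop the three phases run on different machines and the shuffle moves data over the network; here they are just three local function calls.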


A more detailed view of MapReduce


• Combiner: an optimization that applies the reduce function early, on the map node
  – Hadoop does not guarantee how many times it will call the combiner
• Partitioner: when there are multiple reducers, it divides the keys into partitions, one assigned to each reducer
  – A custom partitioner can be used to control how keys are passed to the reducers, e.g., to balance load or to guarantee properties such as total ordering
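To see why a combiner helps, here is a plain-Java sketch (not the Hadoop API; in Hadoop you would register it with job.setCombinerClass): it pre-aggregates the pairs produced on one map node so that fewer pairs cross the network. This is correct only for associative and commutative operations such as sum:

```java
import java.util.*;

// Sketch of a combiner's effect: local pre-aggregation of map output.
public class CombinerSketch {

    // Intermediate (word, 1) pairs emitted by one mapper.
    static List<Map.Entry<String, Integer>> mapOutput(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.split("\\s+")) out.add(Map.entry(w, 1));
        return out;
    }

    // Combiner: sum the values per key locally, before the shuffle.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> combined = new TreeMap<>();
        for (var p : pairs) combined.merge(p.getKey(), p.getValue(), Integer::sum);
        return combined;
    }

    public static void main(String[] args) {
        var pairs = mapOutput("hello hello hello world");
        System.out.println(pairs.size());          // 4 pairs without a combiner
        System.out.println(combine(pairs).size()); // 2 pairs after combining
    }
}
```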


Job in MapReduce
• A MapReduce (i.e., Java) program, referred to as a job, consists of:
  – Code for Map and Reduce packaged together
  – Configuration parameters (where the input lies, where the output should be stored)
  – Input data set, stored on the underlying distributed file system
• Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods. They form the core of a MapReduce job.

Job MapReduce: Input
• InputFormat describes the input specification for a MapReduce job.
• The default behavior of file-based InputFormat implementations (typically sub-classes of FileInputFormat) is to split the input into logical InputSplit instances based on the total size (in bytes) of the input files.
  – The FileSystem blocksize of the input files is treated as an upper bound for input splits.
• The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.


Job MapReduce: Output

• OutputFormat describes the output-specification for a MapReduce job.

• Output files are stored in a FileSystem.

• TextOutputFormat is the default OutputFormat.


Mapper and Reducer

public class Map extends Mapper {
    public void map(Object key, Text value, Context context) {
        ...
    }
}


Context object: allows the Mapper/Reducer to interact with the Hadoop system. It includes configuration data for the job as well as interfaces which allow it to emit output.

public class Reduce extends Reducer {
    public void reduce(Text key, Iterable values, Context context) {
        ...
    }
}


Job MapReduce: Example

/* Create and configure a new MapReduce Job */
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
/* Map function */
job.setMapperClass(Mapper.class);
/* Reduce function */
job.setReducerClass(Reducer.class);
job.setNumReduceTasks(2);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
...


This is only an excerpt of WordCount.java

Design Pattern: Number Summarizations
• Goal: compute some numerical aggregate value (count, maximum, average, ...) over a set of values
• Structure:
  – Mapper: it outputs keys that consist of the fields to group by, and values consisting of any pertinent numerical items
  – Combiner: (optional) it can greatly reduce the number of intermediate key/value pairs to be sent across the network, but works well only with associative and commutative operations
  – Partitioner: (optional) it can better distribute key/value pairs across the reduce tasks
  – Reducer: it receives a set of numerical values and applies the aggregation function


Design Pattern: Number Summarizations


Design Pattern: Number Summarizations
Examples:
• Word count, record count
  – Count the number of occurrences of each word
• Min/Max
  – Compute the max temperature per region
• Average/Median/Standard Deviation
  – Average the number of requests per page per Web site
• Inverted Index Summarization
  – The inverted index pattern is commonly used to generate an index from a data set to allow for faster searches or data enrichment capabilities.
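The inverted index summarization can be sketched in plain Java (a single-process simulation of the MapReduce logic, with illustrative names, not actual Hadoop code): the map side emits (word, docId) pairs, and the reduce side collects, per word, the set of documents containing it.

```java
import java.util.*;

// Single-process sketch of the inverted index summarization.
public class InvertedIndexSketch {

    public static Map<String, Set<String>> build(Map<String, String> docs) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (var doc : docs.entrySet()) {                          // one "map task" per document
            for (String word : doc.getValue().toLowerCase().split("\\s+")) {
                index.computeIfAbsent(word, k -> new TreeSet<>())  // grouping by key = "reduce"
                     .add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of(
                "d1", "hello world",
                "d2", "hello mapreduce");
        System.out.println(build(docs)); // "hello" maps to both d1 and d2
    }
}
```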


WordCount: Example

Input:
hello world goodbye
hello fabiana
hello john
hello mike
hello mapreduce

Output:
fabiana 1
goodbye 1
hello 5
john 1
mapreduce 1
mike 1
world 1

Summarization: Example
• Goal: compute the average word length by initial letter

Input:
hello world goodbye
hello fabiana
hello john
hello mike
hello mapreduce

Output:
g 7.0
m 6.5
w 5.0
f 7.0
h 5.0
j 4.0


Summarization: Example


• Goal: compute the average word length by initial letter

• Check: AverageWordLengthByInitialLetter.java

public void map(Object key, Text value, Context context) {
    String line = value.toString().toLowerCase();
    /* Emit length by initial letter */
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
        String word = itr.nextToken();
        initialLetter.set(word.substring(0, 1));
        length.set(word.length());
        context.write(initialLetter, length);
    }
}

This is only an excerpt


Summarization: Example


• Goal: compute the average word length by initial letter

• Check: AverageWordLengthByInitialLetter.java

public void reduce(Text key, Iterable<IntWritable> values, Context context) {
    int sum = 0;
    int count = 0;
    for (IntWritable val : values) {
        sum += val.get();
        count++;
    }
    average.set((float) sum / (float) count);
    context.write(key, average);
}

This is only an excerpt
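Since only excerpts of the Hadoop job are shown, the whole computation can also be simulated in plain Java (a single-process sketch with an illustrative class name, not the actual AverageWordLengthByInitialLetter.java), reproducing the output of the example above:

```java
import java.util.*;

// Single-process simulation: the "map" emits (initial letter, word length),
// the "reduce" averages the lengths per letter.
public class AverageWordLength {

    public static Map<String, Float> run(String text) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {   // map + shuffle
            grouped.computeIfAbsent(word.substring(0, 1), k -> new ArrayList<>())
                   .add(word.length());
        }
        Map<String, Float> averages = new TreeMap<>();           // reduce
        grouped.forEach((letter, lengths) -> {
            int sum = 0;
            for (int len : lengths) sum += len;
            averages.put(letter, (float) sum / lengths.size());
        });
        return averages;
    }

    public static void main(String[] args) {
        String input = "hello world goodbye hello fabiana hello john hello mike hello mapreduce";
        System.out.println(run(input)); // f=7.0, g=7.0, h=5.0, j=4.0, m=6.5, w=5.0
    }
}
```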


Design Pattern: Filtering
• Goal: filter out records that are not of interest and keep the others.
• One application of filtering is sampling
  – Sampling can be used to get a smaller, yet representative, data set
• Structure:
  – Mapper: filters data (it does most of the work)
  – Reducer: may simply be the identity, if the job does not produce an aggregation on the filtered data

Design Pattern: Filtering


Design Pattern: Filtering
Use cases:
• Closer view of data: extract records that have something in common or something of interest (e.g., same event date, same user id)
• Tracking a thread of events: extract a thread of consecutive events as a case study from a larger data set
• Distributed grep
• Simple random sampling: take a simple random sample of the data set
  – use a filter with an evaluation function that randomly returns true or false
• Remove low-scoring data
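The simple-random-sampling use case can be sketched in plain Java (illustrative names, not Hadoop code): the map side keeps each record with probability p, and the reduce side is the identity, so it is omitted. A fixed seed is used here only to make the sketch reproducible.

```java
import java.util.*;

// Sketch of a random-sampling filter: keep each record with probability p.
public class SamplingFilter {

    public static List<String> sample(List<String> records, double p, long seed) {
        Random rnd = new Random(seed);
        List<String> kept = new ArrayList<>();
        for (String r : records)
            if (rnd.nextDouble() < p) kept.add(r);  // evaluation function: randomly true/false
        return kept;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 1000; i++) records.add("record-" + i);
        System.out.println(sample(records, 0.1, 42L).size()); // roughly 100 records kept
    }
}
```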

Filtering: Example
• Goal: implement a distributed version of grep
• grep is a command-line utility for searching plain-text data sets for lines that match a regular expression

Input:
hello world goodbye
hello fabiana
hello john
hello mike
hello mapreduce

Pattern: good

Output:
hello world goodbye


Filtering: Example


• Goal: implement a distributed version of grep• Check: DistributedGrep.java

public static class GrepMapper extends Mapper<Object, Text, NullWritable, Text> {

    private Pattern pattern = null;

    public void setup(Context context) throws ... {
        pattern = Pattern.compile( ... );
    }

    public void map(Object key, Text value, Context context) ... {
        Matcher matcher = pattern.matcher(value.toString());
        if (matcher.find()) {
            context.write(NullWritable.get(), value);
        }
    }
}

This is only an excerpt

Design Pattern: Distinct
• Special case of the filter pattern
• Goal: filter out records that look like another record in the data set
• Structure:
  – Mapper: it takes each record and extracts the data fields for which we want unique values. The mapper outputs the record as the key, and null as the value
  – Reducer: it groups the nulls together by key. We then simply output the key. Because each key is grouped together, the output data set is guaranteed to be unique
• Examples:
  – Retrieve the list of words, with no repetition, in a document


Distinct: Example


• Goal: retrieve the list of words, with no repetitions, in a document

• Check: DistinctWords.java

public void map(Object key, Text value, Context context) ... {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, NullWritable.get());
    }
}
...
public void reduce(Text key, Iterable<NullWritable> values, Context context) ... {
    context.write(key, NullWritable.get());
}

This is only an excerpt

Design Pattern: Data Organization
• Goal: combine and organize data in a more complex data structure.
• This pattern includes several sub-categories:
  – structure-to-hierarchical pattern (e.g., denormalization)
  – partitioning and binning patterns
  – total order sorting patterns
  – shuffling patterns


Design Pattern: Structure to Hierarchical
• Goal: create new records from data stored in very different structures.
  – This pattern follows the denormalization principles of big data stores
• Structure:
  – We might need to combine data from multiple data sources (use MultipleInputs)
  – Map: it associates the data to be aggregated with the same key (e.g., the root of the hierarchical record). Each record can be enriched with a label that identifies its source
  – Reduce: it creates the hierarchical structure from the list of received data items

Design Pattern: Structure to Hierarchical


Structure to Hierarchical: Example

Input (topics):
1::Movies
2::Football teams
3::Software

Input (items):
1::Star Wars
1::Mad Max
1::Creed
2::Roma
2::Juventus
3::Autocad
3::Eclipse
3::IntelliJ
3::Microsoft Office
3::Linux
3::Google Chrome

Output:
{"topic":"Software","items":["Autocad","Eclipse","IntelliJ","Microsoft Office","Linux","Google Chrome"]}

Structure to Hierarchical: Example
• Goal: create a JSON structure for each topic, which contains the list of its items
  – Two inputs are provided: the list of topics and the list of items
• Check: TopicItemsHierarchy.java

public void map(Object key, Text value, Context context) ... {
    String line = value.toString();
    String[] parts = line.split("::");
    if (parts.length != 2)
        return;
    String id = parts[0];
    String content = parts[1];
    outKey.set(id);
    outValue.set(valuePrefix + content);
    context.write(outKey, outValue);
}

This is only an excerpt


Structure to Hierarchical: Example
• Check: TopicItemsHierarchy.java

public void reduce(Text key, Iterable<Text> values, Context context) ... {
    Topic topic = new Topic();
    for (Text t : values) {
        String value = t.toString();
        if (ValueType.TOPIC.equals(discriminate(value))) {
            topic.setTopic(getContent(value));
        } else if (ValueType.ITEM.equals(discriminate(value))) {
            topic.addItem(getContent(value));
        }
    }
    /* Serialize topic */
    String serializedTopic = gson.toJson(topic);
    context.write(new Text(serializedTopic), NullWritable.get());
}

This is only an excerpt

Design Pattern: Partitioning
• Goal: move the records into categories (i.e., shards, partitions, or bins) without regard to the order of records.
• Structure:
  – Map: in most cases, the identity mapper can be used
  – Partitioner: it determines which reducer to send each record to; each reducer corresponds to a particular partition
  – Reduce: in most cases, the identity reducer can be used
• All you have to define is the function that determines which partition a record goes to.


Design Pattern: Partitioning


Partitioning: Example
• Goal: group dates by year; in this case a year represents a partition
• Check: PartitionDatesByYear.java

public static class DatePartitioner extends Partitioner<IntWritable, Text> {
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        return (key.get() - CONFIG_INITIAL_YEAR) % numPartitions;
    }
}

This is only an excerpt


Two-stage MapReduce
• As MapReduce computations get more complex, break them down into stages
  – Output of one stage = input to the next stage
• Intermediate output may be useful for different outputs too, so you can get some reuse
  – Intermediate records can be saved in the data store, forming a materialized view
• Early stages of MapReduce operations often represent the heaviest amount of data access, so building and saving them once as a basis for many downstream uses saves a lot of work
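The idea of chaining stages can be sketched in plain Java (a single-process simulation with illustrative names, not Hadoop code): stage 1 performs the heavy word count, and stage 2 consumes stage 1's output to keep only the frequent words.

```java
import java.util.*;

// Sketch of a two-stage computation: stage 2's input is stage 1's output.
public class TwoStage {

    // Stage 1: count words (map and reduce collapsed for brevity).
    public static Map<String, Integer> stage1(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines)
            for (String w : line.split("\\s+"))
                counts.merge(w, 1, Integer::sum);
        return counts;   // in Hadoop this would be materialized on the file system and reused
    }

    // Stage 2: filter the stage-1 output, keeping only frequent words.
    public static Map<String, Integer> stage2(Map<String, Integer> counts, int minCount) {
        Map<String, Integer> frequent = new TreeMap<>();
        counts.forEach((w, c) -> { if (c >= minCount) frequent.put(w, c); });
        return frequent;
    }

    public static void main(String[] args) {
        var counts = stage1(List.of("hello world", "hello mapreduce", "hello hadoop"));
        System.out.println(stage2(counts, 2)); // only "hello" occurs at least twice
    }
}
```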

Design Pattern: Total Order Sorting
• Sort all the records of the data set
  – Sorting in a parallel manner is not easy
• Observe: each individual reducer will sort its data by key, but unfortunately this sorting is not global across all data
• Goal: we want a total order sorting where, if you concatenate the output files, the records are sorted
• Sorted data has a number of useful properties:
  – Sorted by time, it can provide a timeline view on the data
  – Finding things in a sorted data set can be done with binary search
  – Some databases can bulk load data faster if the data is sorted on the primary key or index column


Design Pattern: Total Order Sorting
• This pattern has two phases (jobs): an analyze phase that determines the ranges, and an order phase that actually sorts the data.

Analyze Phase: identify the data set slices
• Map: it does a simple random sampling
• Reduce: only one reducer is used; it collects the sample keys and slices them into the data range boundaries

Order Phase: order the data set
• Map: similar to the mapper of the analyze phase, but the record itself is stored as the value
• Partition: it loads the partition file and routes data according to the partitions
  – Hadoop provides an implementation: TotalOrderPartitioner
• Reduce: it is the identity function; the number of reducers needs to be equal to the number of partitions
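The two phases can be condensed into a plain-Java sketch (illustrative names, not Hadoop code; here the "sample" is the whole data set for simplicity): the analyze step derives range boundaries, the order step routes each record to the partition owning its range, and each partition sorts locally, so that concatenating the partitions yields a totally ordered result.

```java
import java.util.*;

// Single-process sketch of range-partitioned total order sorting.
public class TotalOrderSketch {

    // Analyze phase: sort the sampled keys and pick numPartitions-1 boundaries.
    static List<Integer> boundaries(List<Integer> sample, int numPartitions) {
        List<Integer> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<Integer> bounds = new ArrayList<>();
        for (int i = 1; i < numPartitions; i++)
            bounds.add(sorted.get(i * sorted.size() / numPartitions));
        return bounds;
    }

    // Order phase: route by range, then sort each partition (the "reducers").
    public static List<Integer> sort(List<Integer> data, int numPartitions) {
        List<Integer> bounds = boundaries(data, numPartitions);
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) parts.add(new ArrayList<>());
        for (int v : data) {                     // the "partitioner"
            int p = 0;
            while (p < bounds.size() && v >= bounds.get(p)) p++;
            parts.get(p).add(v);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> part : parts) {       // each reducer sorts its own range
            Collections.sort(part);
            out.addAll(part);                    // concatenation is globally sorted
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sort(List.of(5, 3, 9, 1, 7, 2, 8), 3));
    }
}
```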

Total Order Sorting: Example
• Goal: order the data set
  – We rely on the TotalOrderPartitioner class
  – Slightly different implementation of the Analyze and Order phases
• Check: TotalOrdering.java
• Observe the driver, which defines the chain of MapReduce jobs

/* **** Job #1: Analyze phase **** */
Job sampleJob = Job.getInstance(conf, "TotalOrdering");
/* Map: samples data; Reduce: identity function */
sampleJob.setMapperClass(AnalyzePhaseMapper.class);
sampleJob.setNumReduceTasks(0);
sampleJob.setOutputFormatClass(SequenceFileOutputFormat.class);
...
if (isCompletedCorrecty(sampleJob)) {

This is only an excerpt


Total Order Sorting: Example

/* **** Job #2: Ordering phase **** */
Job orderJob = Job.getInstance(conf, "TotalOrderSortingStage");
/* Map: identity function; Reduce: emits only the key */
orderJob.setMapperClass(Mapper.class);
orderJob.setReducerClass(OrderingPhaseReducer.class);
orderJob.setNumReduceTasks(10);
/* Partitioner */
orderJob.setPartitionerClass(TotalOrderPartitioner.class);
/* Define the dataset sampling strategy to identify partition bounds */
InputSampler.writePartitionFile(orderJob, new InputSampler.RandomSampler(.3, 10));
}

This is only an excerpt

Order Phase (1)

/* **** Job #2: Ordering phase **** */
Job orderJob = Job.getInstance(conf, "TotalOrderSortingStage");
orderJob.setJarByClass(TotalOrdering.class);
/* Map: identity function outputs the key/value pairs in the SequenceFile */
orderJob.setMapperClass(Mapper.class);
/* Reduce: identity function */
orderJob.setReducerClass(OrderingPhaseReducer.class);
orderJob.setNumReduceTasks(10);

This is only an excerpt of main in TotalOrdering.java


Order Phase (2)

/* Set input and output files: the input is the previous job's output */
orderJob.setInputFormatClass(SequenceFileInputFormat.class);
orderJob.setPartitionerClass(TotalOrderPartitioner.class);
TotalOrderPartitioner.setPartitionFile(orderJob.getConfiguration(), partitionFile);
InputSampler.writePartitionFile(orderJob, new InputSampler.RandomSampler(.3, 10));

This is only an excerpt of main in TotalOrdering.java

Analyze Phase (1)

public static class AnalyzePhaseMapper extends Mapper {
    ...
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        outkey.set(value.toString());
        context.write(outkey, value);
    }
}

/* Map: samples data; Reduce: identity function */
sampleJob.setMapperClass(AnalyzePhaseMapper.class);
sampleJob.setNumReduceTasks(0);
/* Set input and output files */
sampleJob.setOutputFormatClass(SequenceFileOutputFormat.class);

This is only an excerpt of main in TotalOrdering.java


Design Pattern: Shuffling
• Goal: we want to shuffle our data set, randomizing the order of our records (e.g., to improve anonymity)
• Structure:
  – Map: it emits the record as the value, along with a random key
  – Reduce: the reducer sorts the random keys, further randomizing the data
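The shuffling pattern can be sketched in plain Java (illustrative names, not Hadoop code): pair each record with a random key, sort by that key, and emit the records in the resulting order. A fixed seed is used here only to make the sketch reproducible.

```java
import java.util.*;

// Sketch of the shuffling pattern: random keys + sort = random order.
public class ShuffleSketch {

    public static List<String> shuffle(List<String> records, long seed) {
        Random rnd = new Random(seed);
        List<Map.Entry<Double, String>> keyed = new ArrayList<>();
        for (String r : records)
            keyed.add(Map.entry(rnd.nextDouble(), r));   // map: emit (randomKey, record)
        keyed.sort(Map.Entry.comparingByKey());          // the sort by random key shuffles
        List<String> out = new ArrayList<>();
        for (var e : keyed) out.add(e.getValue());       // emit the values only
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shuffle(List.of("a", "b", "c", "d", "e"), 42L));
    }
}
```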