Spring 2019 CS4823/6643 Parallel Computing

MapReduce with Example

Wei Wang





History and Background


The Problem Addressed by MapReduce

● Originated from Google. Google needs to analyze huge sets of data (on the order of petabytes) – e.g., search engine data, web access logs, etc.

● The algorithms to process the data are relatively simple.

● However, to finish the analysis in an acceptable amount of time, the analysis task must be split and executed in parallel on thousands of machines.


The Problem Addressed by MapReduce cont’d

● While processing the data is relatively easy, making the task run in parallel on thousands of machines is very challenging. Programmers have to:

– Split and distribute the data

– Associate data with computation

– Communicate petabytes of data among thousands of machines

– Handle (recover from) failed machines and sub-tasks

– Retrieve and organize the results

● This is tedious, error-prone, and time-consuming, and it had to be repeated for each analysis problem.


The MapReduce Solution

● A general framework and programming model, called “MapReduce”, was designed at Google to simplify the development process (published in 2003 and 2004)

– MapReduce summarizes the parallel procedure:

● Map: data are split and mapped to many machines (nodes) to be processed in parallel
● Reduce: results are collected and digested (conceptually similar to the reduction in OpenMP and MPI)

– The framework takes care of the following tasks to reduce the burden on programmers:

● Data splitting and distribution of computation
● Distributed filesystem management
● Parallel execution (map/reduce stages) management and communication
● Fault tolerance (error recovery)

– The framework also simplifies:

● Execution monitoring and profiling
● Adjusting resource provisioning
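The map/reduce procedure above can be sketched sequentially in plain Java, with no Hadoop involved (a minimal toy model; the class and method names are illustrative, not framework APIs):

```java
import java.util.*;

public class WordCountModel {
    // "Map": emit a <word, 1> pair for every word in one input line
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.split("\\s+"))
            out.add(new AbstractMap.SimpleEntry<>(w, 1));
        return out;
    }

    // "Shuffle": group all emitted values by key
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return groups;
    }

    // "Reduce": digest the values of one key into a single result
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    static Map<String, Integer> run(String[] lines) {
        List<Map.Entry<String, Integer>> middle = new ArrayList<>();
        for (String line : lines) middle.addAll(map(line));            // map stage
        Map<String, Integer> result = new TreeMap<>();
        shuffle(middle).forEach((k, vs) -> result.put(k, reduce(vs))); // reduce stage
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(new String[]{"Hello World Bye World",
                                            "Hello Hadoop Goodbye Hadoop"}));
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```

The real framework runs the map and reduce stages on many machines and moves the data through a distributed filesystem; this sketch only shows the programming model's shape.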


Apache Hadoop

● Hadoop is an open-source implementation of the MapReduce programming model from the Apache Software Foundation (around 2006)

● Main components of Hadoop:

– Hadoop Distributed File System (HDFS): all data are communicated through files on HDFS

– Hadoop YARN: a resource manager for parallel execution

– Hadoop Common and Hadoop MapReduce: the supporting libraries and the implementation of the MapReduce programming model


The MapReduce Programming Model in Hadoop


The MapReduce Programming Model

[Diagram: the MapReduce data flow in Hadoop]

1. All original data are stored in files on the distributed file system (DFS).
2. The MapReduce framework splits the data and distributes it to map tasks.
3. Map tasks process the data in parallel, producing intermediate (“middle”) results.
4. Combiner tasks optionally further process the intermediate results local to each map task, generating refined intermediate results.
5. All intermediate results are saved back to the DFS.
6. The intermediate results are shuffled, sorted, regrouped, and distributed to reduce tasks.
7. Reduce tasks collect and digest the intermediate results to generate the final results, which are saved back to the DFS.


The Data in MapReduce

● Because MapReduce targets huge data sets (so-called big data), all data are saved and communicated through a distributed filesystem (DFS)

– i.e., original data, intermediate results, and final results are saved to and read from files in the DFS

– The DFS saves and distributes files among many file servers

– The DFS is optimized for distributed accesses and is fault tolerant through redundancy

– Map and reduce tasks are scheduled to minimize the communication cost between compute servers and file servers

– The DFS is the core of the MapReduce framework


Input and Output of MapReduce Tasks

● In theory, the inputs and outputs of map and reduce tasks can be in any format

● In practice, the inputs and outputs are always key-value pairs

– Each piece of data is represented as a <key, value> tuple
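For instance, one line of a web access log (a workload mentioned earlier) can be cast as a key-value pair. A minimal sketch in plain Java, assuming a hypothetical "<ip> <url>" log format (the `LogPair` class and its method are illustrative only):

```java
public class LogPair {
    // Turn one access-log line ("<ip> <url>", an assumed format)
    // into a <url, 1> key-value pair for later counting
    static String[] toKeyValue(String logLine) {
        String url = logLine.split("\\s+")[1];
        return new String[]{url, "1"};
    }

    public static void main(String[] args) {
        String[] kv = toKeyValue("10.0.0.1 /index.html");
        System.out.println("<" + kv[0] + ", " + kv[1] + ">");
        // prints </index.html, 1>
    }
}
```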


MapReduce Example: Word Count

● Count the number of occurrences of each word in a text file

– e.g., a file with the string “Hello World Bye World. \n Hello Hadoop Goodbye Hadoop.”

– Output the counts:

    Word      Occurrences   <key, value> expression
    Bye       1             <”Bye”, 1>
    Goodbye   1             <”Goodbye”, 1>
    Hadoop    2             <”Hadoop”, 2>
    Hello     2             <”Hello”, 2>
    World     2             <”World”, 2>


MapReduce Example: Word Count, The Process Procedure

[Diagram: the word-count process without a combiner]

1. Input file on the DFS: “Hello World Bye World. \n Hello Hadoop Goodbye Hadoop.”

2. Data split: the input is broken down by lines into key-value pairs:
   <1, “Hello World Bye World.”>, <2, “Hello Hadoop Goodbye Hadoop.”>

3. Two map tasks each process one line and break it into a word list. In the intermediate results, the key is the word and the value is always 1 (one occurrence):
   Map task 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
   Map task 2: <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

4. All intermediate results are written back to the DFS as key-value pairs.

5. The MapReduce framework shuffles, sorts, and regroups the intermediate results by key. The newly grouped key-value pairs are sent to the reduce tasks as inputs:
   <Bye, 1> | <Goodbye, 1> | <Hadoop, 1> <Hadoop, 1> | <Hello, 1> <Hello, 1> | <World, 1> <World, 1>

6. Five reduce tasks sum the counts for each word. The final results, the words and their numbers of occurrence, are written back to the DFS:
   <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
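The shuffle/sort/regroup step in the middle of the procedure can be sketched in plain Java (a sequential toy model; the `Shuffle` class and `regroup` method are illustrative names, not Hadoop APIs):

```java
import java.util.*;

public class Shuffle {
    // Regroup <word, 1> pairs by key, as the framework does between map and reduce;
    // each resulting group becomes the input of one reduce task
    static Map<String, List<Integer>> regroup(String[][] mapOutputs) {
        Map<String, List<Integer>> groups = new TreeMap<>(); // sorted by key
        for (String[] pair : mapOutputs)
            groups.computeIfAbsent(pair[0], k -> new ArrayList<>())
                  .add(Integer.parseInt(pair[1]));
        return groups;
    }

    public static void main(String[] args) {
        // the combined intermediate results of the two map tasks
        String[][] middle = {
            {"Hello","1"}, {"World","1"}, {"Bye","1"}, {"World","1"},
            {"Hello","1"}, {"Hadoop","1"}, {"Goodbye","1"}, {"Hadoop","1"}
        };
        System.out.println(regroup(middle));
        // prints {Bye=[1], Goodbye=[1], Hadoop=[1, 1], Hello=[1, 1], World=[1, 1]}
    }
}
```

Each reduce task then only has to sum the list it receives, e.g. the “World” reducer sums [1, 1] to get 2.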


MapReduce Example: Word Count, The Process Procedure with Combiner

[Diagram: the word-count process with a combiner]

1. From the data split (same as the no-combiner case), two map tasks each process one line and break it into a word list; the key is the word and the value is always 1 (one occurrence):
   Map task 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
   Map task 2: <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

2. The intermediate results of each map task are locally regrouped by key and sent to local combiner tasks:
   Map task 1: <World, 1> <World, 1> | <Bye, 1> | <Hello, 1>
   Map task 2: <Goodbye, 1> | <Hello, 1> | <Hadoop, 1> <Hadoop, 1>

3. Six combiner tasks (three per map task) sum the counts locally, generating the total occurrence of each word within each map task:
   Map task 1: <World, 2> <Bye, 1> <Hello, 1>
   Map task 2: <Goodbye, 1> <Hello, 1> <Hadoop, 2>

4. The refined intermediate results are written back to the DFS.

5. The MapReduce framework shuffles, sorts, and regroups the intermediate results by key. The newly grouped key-value pairs are sent to the reduce tasks as inputs:
   <Bye, 1> | <Goodbye, 1> | <Hadoop, 2> | <Hello, 1> <Hello, 1> | <World, 2>

6. Five reduce tasks sum the counts for each word. The final results are written back to the DFS:
   <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
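The effect of the combiner can be sketched in plain Java (a toy model; `CombinerSketch` is an illustrative name, not a Hadoop API). Summing locally per map task shrinks the eight raw <word, 1> pairs to six refined pairs before they are written to the DFS, without changing the final counts:

```java
import java.util.*;

public class CombinerSketch {
    // Locally sum the <word, 1> pairs emitted by ONE map task (the combiner step)
    static Map<String, Integer> combine(String[] words) {
        Map<String, Integer> local = new TreeMap<>();
        for (String w : words) local.merge(w, 1, Integer::sum);
        return local;
    }

    public static void main(String[] args) {
        // intermediate results of map task 1 and map task 2
        Map<String, Integer> c1 = combine(new String[]{"Hello","World","Bye","World"});
        Map<String, Integer> c2 = combine(new String[]{"Hello","Hadoop","Goodbye","Hadoop"});
        // 4 + 4 = 8 raw pairs shrink to 3 + 3 = 6 refined pairs
        System.out.println(c1); // prints {Bye=1, Hello=1, World=2}
        System.out.println(c2); // prints {Goodbye=1, Hadoop=2, Hello=1}
    }
}
```

This local summation is why the word-count job can reuse its reducer as the combiner: summing per-task partial counts and then summing those partial sums gives the same totals.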


MapReduce Example: Word Count, Map Task Code (Java)

// all map tasks must be declared in a Mapper-derived class
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key,      // key of the input key-val pair
                  Text value,      // val of the input key-val pair
                  Context context) // output key-val pairs
      throws IOException, InterruptedException {
    // break down the input string into tokens/words
    StringTokenizer itr = new StringTokenizer(value.toString());
    // loop over each token/word
    while (itr.hasMoreTokens()) {
      // use the next token/word as the new key
      word.set(itr.nextToken());
      // add a new key-val pair <token/word, 1> to the output
      context.write(word, one);
    }
  }
}


MapReduce Example: Word Count, Reduce Task Code (Java)

// all reduce tasks must be declared in a Reducer-derived class
public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable(); // total number of occurrences

  public void reduce(Text key,                     // key of the input key-val pairs
                     Iterable<IntWritable> values, /* vals of the input; recall that
                        key-val pairs are grouped by key, and all pairs with the
                        same key are sent to the same reducer */
                     Context context)              // output key-val pairs
      throws IOException, InterruptedException {
    int sum = 0;
    // for each val in values, sum them up
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    // output the <token/word, total_occurrence> pair
    context.write(key, result);
  }
}


MapReduce Example: Word Count, Main function

public class WordCount {
  public static class TokenizerMapper … // from the mapper slide
  public static class IntSumReducer …   // from the reducer slide

  public static void main(String[] args) throws Exception { // Java main func
    // create a new Hadoop configuration and a Hadoop job
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    // tell Hadoop which jar contains the mapper/reducer implementations
    job.setJarByClass(WordCount.class);
    // set mapper/reducer/combiner; reuse the reducer as the combiner
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    // define the types of the output keys and values
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // add the paths to the input and output files
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // start the Hadoop job and wait for it to finish
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}


HDFS and MapReduce’s Disadvantage


HDFS

● HDFS is the core component of Hadoop

– Nearly all communication in Hadoop goes through the file system

– HDFS stores files distributed across a large number of servers

– File data are replicated as a fault-tolerance technique

– File servers are usually also used as compute servers; map/reduce tasks are usually scheduled to the servers where their data are located

● Similarly, the Google File System is the backbone of Google’s MapReduce


HDFS Architecture

[Diagram: HDFS architecture]

– The name node (a server) stores file information and metadata, and manages the data nodes.

– The data nodes (servers) are the actual holders of the files; each data block (D1–D4) is replicated across multiple data nodes.

– A client first locates a file by name through the name node, then reads/writes the data directly on the data nodes.
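The division of labor between the name node and data nodes can be sketched as a toy model (illustrative only; real HDFS block placement is rack-aware and far more involved, and `ToyNameNode` is not an HDFS API):

```java
import java.util.*;

public class ToyNameNode {
    static final int REPLICATION = 2; // each block stored on 2 data nodes

    // "Name node" role: map each block of a file to the data nodes
    // that hold a replica of it (the file-to-block metadata)
    static Map<String, List<Integer>> place(String file, int numBlocks, int numDataNodes) {
        Map<String, List<Integer>> meta = new LinkedHashMap<>();
        for (int b = 0; b < numBlocks; b++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++)
                replicas.add((b + r) % numDataNodes); // round-robin placement
            meta.put(file + "/block" + b, replicas);
        }
        return meta;
    }

    public static void main(String[] args) {
        // 4 blocks (like D1..D4 in the diagram) over 4 data nodes, each replicated twice
        System.out.println(place("log.txt", 4, 4));
    }
}
```

A client would first ask this metadata map where a block lives, then contact those data nodes directly for the actual reads and writes, which is exactly the two-step interaction the diagram shows.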


The Disadvantage of MapReduce

● Criticism on novelty

– There has been criticism of the novelty of MapReduce, as many of MapReduce’s ideas had already been developed in distributed parallel databases, although MapReduce is more than just a database.

– There have been studies trying to show that MapReduce is not the optimal solution for the problem.

● Communication through the file system is slow

– MapReduce really targets:

● workloads that need to process very large data sets that are too big to fit in memory
● workloads that process data once and then discard them

– To improve performance, Hadoop allows users to cache files in memory.

– For other types of workloads, different parallel programming models have to be used

● E.g., for machine learning workloads, where data preferably reside in memory, a completely different parallel framework, called Spark, was designed at Apache.


Acknowledgement

● The “word-count” code example is adapted from the Apache Hadoop tutorial.