An Introduction to MapReduce

Francisco Pérez-Sorrosal
Distributed Systems Lab (DSL/LSD)
Universidad Politécnica de Madrid
10/Apr/2008

A brief introduction to MapReduce for a PhD course at UPM.

Outline

1. Motivation
2. What is MapReduce?
   • Simple Example
   • What is MapReduce's Main Goal?
   • Main Features
   • What does MapReduce solve?
3. Programming Model
4. Framework Overview
   • Example
5. Other Features
6. Hadoop: A MapReduce Implementation
   • Example
7. References

Motivation

• Increasing demand for large-scale processing applications
  • Web engines, semantic search tools, scientific applications...
• Most of these applications can be parallelized
• There are many ad-hoc implementations for such applications, but...

Motivation (II)

• ...developing such ad-hoc parallel applications and managing their execution was too complex
  • It usually implies the use and management of hundreds or thousands of machines
• However, they basically share the same problems:
  • Parallelization
  • Fault tolerance
  • Data distribution
  • Load balancing

What is MapReduce?

• It is a framework to...
  • ...automatically partition jobs that have large input data sets into simpler work units or tasks and distribute them across the nodes of a cluster (map), and...
  • ...combine the intermediate results of those tasks (reduce) to produce the required results.
• Presented by Google in 2004
  • http://labs.google.com/papers/mapreduce.html

Simple Example

[Figure: Input data -> Mapped data on Node 1 and Mapped data on Node 2 -> Result]


What is MapReduce’s Main Goal?

Simplify the parallelization and distribution of large-scale computations in clusters

MapReduce Main Features

• Simple interface
• Automatic partitioning, parallelization and distribution of tasks
• Fault tolerance
• Status and monitoring

What does MapReduce solve?

• It allows programmers without experience in parallel and distributed systems to use large distributed systems easily
• Used extensively in many applications inside Google and Yahoo that...
  • ...require simple processing tasks...
  • ...but have large input data sets

What does MapReduce solve?

• Examples:
  • Distributed grep
  • Distributed sort
  • Count of URL access frequency
  • Web crawling
  • Represent the structure of web documents
  • Generate summaries (pages crawled per host, most frequent queries, results returned...)

Programming Model

• Input & Output
  • Each one is a set of key/value pairs
• Map:
  • Processes input key/value pairs
  • Computes a set of intermediate key/value pairs
  • map(in_key, in_value) -> list(int_key, intermediate_value)
• Reduce:
  • Combines all the intermediate values that share the same key
  • Produces a set of merged output values (usually just one per key)
  • reduce(int_key, list(intermediate_value)) -> list(out_value)

Programming Model: Example

• Problem: count of URL access frequency
• Input: log of web page requests
• Map:
  • Processes the assigned chunk of the log
  • Computes a set of intermediate pairs <URL, 1>
• Reduce:
  • Processes the intermediate pairs <URL, 1>
  • Adds together all the values that share the same URL
  • Produces a set of pairs in the form <URL, total count>
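
The URL access count can be sketched in plain Java, running entirely in memory with no cluster involved. The class name UrlAccessCount, the log line format ("GET /path") and the run driver are assumptions made for illustration; they are not part of MapReduce or Hadoop.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of the URL access count job (hypothetical names throughout).
class UrlAccessCount {

    // Map: emit <URL, 1> for every request line in the assigned chunk of the log.
    // Assumption: the URL is the second whitespace-separated field of each line.
    static List<Map.Entry<String, Integer>> map(List<String> logChunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : logChunk) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length >= 2) {
                pairs.add(new SimpleEntry<>(fields[1], 1));
            }
        }
        return pairs;
    }

    // Reduce: add together all the values that share the same URL.
    static int reduce(String url, List<Integer> counts) {
        int total = 0;
        for (int count : counts) {
            total += count;
        }
        return total;
    }

    // Driver standing in for the framework: group the intermediate pairs by key,
    // then call reduce once per key.
    static Map<String, Integer> run(List<List<String>> chunks) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<String> chunk : chunks) {
            for (Map.Entry<String, Integer> pair : map(chunk)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            totals.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<List<String>> chunks = Arrays.asList(
            Arrays.asList("GET /index.html", "GET /about.html"),
            Arrays.asList("GET /index.html"));
        System.out.println(run(chunks)); // prints {/about.html=1, /index.html=2}
    }
}
```

In a real deployment each chunk would be mapped on a different machine; here the loop over chunks plays that role.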


Framework Overview


Example: Count # of Each Letter in a Big File

• Big File (640MB) content: a t b o m a p r r e d u c e g o o o g l e a p i m a c a c a b r a a r r o z f e i j a o t o m a t e c r u i m e s s o l
• There are 26 different keys (letters in the range [a..z])
• R = 4 output files (set by the user)
• 1 Master and 8 Workers (all idle)

1) Split the file into 10 pieces of 64MB (pieces 1 to 10)

2) Assign map and reduce tasks

[Figure: the Master assigns map and reduce tasks to the 8 idle Workers, which are divided into Mappers and Reducers; the Big File (640MB) is split into pieces 1 to 10]

3) Read the split data

[Figure: four Map Tasks (in progress) read pieces 1 to 4 of the Big File (640MB); four Reduce Tasks remain idle]

4) Process data (in memory)

• Machine 1 runs Map Task 1 (in progress)
• Partition function (used to map the letters to regions): the letters a b c d e f g h i j k l m n o p q r s t u v w x y z are divided into the regions R1, R2, R3 and R4
• Simulating the execution in memory:
  • R1: (a,1) (b,1) (a,1)
  • R2: (m,1)
  • R3: (o,1) (p,1) (r,1)
  • R4: (y,1)
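
A partition function of this kind can be sketched as a range partitioner. The slides only state that region 1 holds the letters a to g; the equal-width split used below (a-g, h-m, n-t, u-z for R = 4) is an assumption that happens to agree with the pairs shown above. LetterPartitioner and regionOf are hypothetical names.

```java
// Range partitioner sketch: maps a lowercase letter to one of R regions.
class LetterPartitioner {

    static final int ALPHABET_SIZE = 26;

    // Returns a region number in [1..regions] for a letter in [a..z].
    // Assumption: the alphabet is cut into ranges of nearly equal width.
    static int regionOf(char letter, int regions) {
        if (letter < 'a' || letter > 'z') {
            throw new IllegalArgumentException("expected a letter in [a..z]: " + letter);
        }
        int index = letter - 'a';                     // 0..25
        return (index * regions) / ALPHABET_SIZE + 1; // 1..regions
    }

    public static void main(String[] args) {
        // Keys emitted by Map Task 1 in the example: a, b, a, m, o, p, r, y
        for (char key : "abamopry".toCharArray()) {
            System.out.println(key + " -> R" + regionOf(key, 4));
        }
    }
}
```

With R = 4 this sends a and b to R1, m to R2, o, p and r to R3, and y to R4.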

5) Apply combiner function

• Machine 1 runs Map Task 1 (in progress)
• Simulating the execution in memory:
  • Before the combiner: R1: (a,1) (b,1) (a,1); R2: (m,1); R3: (o,1) (p,1) (r,1); R4: (y,1)
  • After the combiner: R1: (a,2) (b,1); R2: (m,1); R3: (o,1) (p,1) (r,1); R4: (y,1)
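
The combiner step can be sketched as a local sum over the pairs that one map task has emitted into a region. This is a minimal in-memory illustration; LetterCombiner, pair and combine are hypothetical names, not a framework API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Combiner sketch: pre-aggregates the output of a single map task.
class LetterCombiner {

    // Convenience factory for an intermediate (key, value) pair.
    static Map.Entry<Character, Integer> pair(char key, int value) {
        return new SimpleEntry<>(key, value);
    }

    // Sums the values that share a key, keeping the order of first appearance.
    static Map<Character, Integer> combine(List<Map.Entry<Character, Integer>> mapOutput) {
        Map<Character, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<Character, Integer> p : mapOutput) {
            combined.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Region R1 of Map Task 1 before the combiner: (a,1) (b,1) (a,1)
        List<Map.Entry<Character, Integer>> r1 =
            Arrays.asList(pair('a', 1), pair('b', 1), pair('a', 1));
        System.out.println(combine(r1)); // prints {a=2, b=1}
    }
}
```

Because addition is associative and commutative, the same summing logic can serve as both combiner and reducer, which is what the Hadoop job configuration later in these slides does with the Reduce class.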

6) Store results on disk

• Machine 1 runs Map Task 1 (in progress)
• The combined results are moved from memory to the local disk, partitioned into the regions R1 to R4:
  • R1: (a,2) (b,1); R2: (m,1); R3: (o,1) (p,1) (r,1); R4: (y,1)

7) Inform the master about the position of the intermediate results on the local disk

• Machine 1 (Map Task 1, in progress) sends the location of the Map Task 1 (MT1) results to the Master
• MT1 results: R1: (a,2) (b,1); R2: (m,1); R3: (o,1) (p,1) (r,1); R4: (y,1)

8) The Master assigns the next task (Map Task 5) to the Worker that has just become free

• The Worker on Machine 1 (in progress) receives the data for Map Task 5
• The Map Task 1 results stay on its local disk: R1: (a,2) (b,1); R2: (m,1); R3: (o,1) (p,1) (r,1); R4: (y,1)

9) The Master forwards the location of the intermediate results of Map Task 1 to the reducers

• Machine 1 runs Map Task 5 (in progress) and keeps the Map Task 1 (MT1) results: R1: (a,2) (b,1); R2: (m,1); R3: (o,1) (p,1) (r,1); R4: (y,1)
• The Master sends the MT1 results location for R1 to Reduce Task 1 (idle), and the location for each other region Rx to its reducer

• Reduce Task 1 is idle
• Letters in Region 1 (R1): a b c d e f g
• R1 data stored by the Map Tasks:
  (a,2) (b,1) (e,1) (d,1) (c,1) (e,1)
  (g,1)
  (e,1) (a,3) (c,1) (c,1) (a,1) (b,1)
  (a,2) (f,1) (e,1)
  (a,2) (e,1) (c,1)
  (e,1)

10) Reduce Task 1 (RT1) reads the data in region 1 (R1) from each Map Task (MT)

• Machine N runs Reduce Task 1 (in progress)
• Data read from each Map Task, stored in region 1:
  (a,2) (b,1) (e,1) (d,1) (c,1) (e,1)
  (g,1)
  (e,1) (a,3) (c,1) (c,1) (a,1) (b,1)
  (a,2) (f,1) (e,1)
  (a,2) (e,1) (c,1)
  (e,1)

11) Reduce Task 1 sorts the data

• Machine N runs Reduce Task 1 (in progress)
• Sorted data: (a,2) (a,3) (a,1) (a,2) (a,2) (b,1) (b,1) (c,1) (c,1) (c,1) (c,1) (d,1) (e,1) (e,1) (e,1) (e,1) (e,1) (e,1) (f,1) (g,1)

12) Then it passes each key and the corresponding set of intermediate values to the user's reduce function

• Machine N runs Reduce Task 1 (in progress)
• Sorted data: (a,2) (a,3) (a,1) (a,2) (a,2) (b,1) (b,1) (c,1) (c,1) (c,1) (c,1) (d,1) (e,1) (e,1) (e,1) (e,1) (e,1) (e,1) (f,1) (g,1)
• Input to the user's reduce function: (a, {2,3,1,2,2}) (b, {1,1}) (c, {1,1,1,1}) (d, {1}) (e, {1,1,1,1,1,1}) (f, {1}) (g, {1})

13) Finally, it generates output file 1 of R after executing the user's reduce function

• Machine N runs Reduce Task 1 (in progress)
• Input to the user's reduce function: (a, {2,3,1,2,2}) (b, {1,1}) (c, {1,1,1,1}) (d, {1}) (e, {1,1,1,1,1,1}) (f, {1}) (g, {1})
• Output file 1: (a, 10) (b, 2) (c, 4) (d, 1) (e, 6) (f, 1) (g, 1)
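
The reduce side of the example (read the region 1 pairs, sort them by key, hand each key with its values to the user's reduce function) can be sketched as follows. The data is the region 1 data from the slides; ReduceTaskSketch and its methods are hypothetical names, and a TreeMap stands in for the sort step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of Reduce Task 1: sort and group the pairs, then apply the user's reduce.
class ReduceTaskSketch {

    // User's reduce function: sums all the partial counts of one letter.
    static int reduce(char key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // keys[i] and values[i] together form the i-th intermediate pair.
    static Map<Character, Integer> run(char[] keys, int[] values) {
        // Sorting and grouping: a TreeMap keeps the keys in order.
        Map<Character, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            grouped.computeIfAbsent(keys[i], k -> new ArrayList<>()).add(values[i]);
        }
        // One reduce call per key, e.g. reduce('a', [2, 3, 1, 2, 2]).
        Map<Character, Integer> output = new TreeMap<>();
        for (Map.Entry<Character, List<Integer>> entry : grouped.entrySet()) {
            output.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        // Region 1 pairs in the order they were read from the map tasks.
        char[] keys = "abedcegeaccabafeaece".toCharArray();
        int[] values = {2, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1};
        System.out.println(run(keys, values));
        // prints {a=10, b=2, c=4, d=1, e=6, f=1, g=1}
    }
}
```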

Other Features: Failures

• Re-execution is the main mechanism for fault tolerance
• Worker failures:
  • The master detects worker failures via periodic heartbeats
  • The master drives the re-execution of tasks
    • Completed and in-progress map tasks are re-executed
    • In-progress reduce tasks are re-executed
• Master failure:
  • The initial implementation did not support failures of the master
  • Solutions:
    • Checkpoint the state of internal structures in the GFS
    • Use replication techniques
• Robust: once lost 1600 of 1800 machines, but finished fine
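
The worker failure rules above can be sketched as a small bookkeeping routine on the master. All names here (FailureHandlingSketch, Task, hasFailed, rescheduleAfterFailure) and the timeout-based heartbeat check are assumptions for illustration. Completed reduce tasks are left alone because, according to the MapReduce paper listed in the references, their output is already stored in the global file system, whereas map output lives on the failed machine's local disk.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how a master might decide which tasks to re-execute.
class FailureHandlingSketch {

    enum Kind { MAP, REDUCE }
    enum State { IDLE, IN_PROGRESS, COMPLETED }

    static class Task {
        final Kind kind;
        State state;
        String worker; // worker the task is or was assigned to, or null

        Task(Kind kind, State state, String worker) {
            this.kind = kind;
            this.state = state;
            this.worker = worker;
        }
    }

    // Assumption: a worker counts as failed when no heartbeat arrived within the timeout.
    static boolean hasFailed(long lastHeartbeatMillis, long nowMillis, long timeoutMillis) {
        return nowMillis - lastHeartbeatMillis > timeoutMillis;
    }

    // Resets to IDLE every task of the failed worker that must run again
    // and returns those tasks so they can be rescheduled elsewhere.
    static List<Task> rescheduleAfterFailure(List<Task> tasks, String failedWorker) {
        List<Task> toRerun = new ArrayList<>();
        for (Task task : tasks) {
            if (!failedWorker.equals(task.worker)) {
                continue;
            }
            // Map tasks: both completed and in-progress ones are re-executed.
            boolean redoMap = task.kind == Kind.MAP && task.state != State.IDLE;
            // Reduce tasks: only in-progress ones are re-executed.
            boolean redoReduce = task.kind == Kind.REDUCE && task.state == State.IN_PROGRESS;
            if (redoMap || redoReduce) {
                task.state = State.IDLE;
                task.worker = null;
                toRerun.add(task);
            }
        }
        return toRerun;
    }
}
```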


Other Features: Locality

• Most input data is read locally
• Why?
  • To avoid consuming network bandwidth
• How does it achieve that?
  • The master attempts to schedule a map task on a machine that contains a replica (in the GFS) of the corresponding input data
  • If that fails, it attempts to schedule the task near a replica (e.g. on a machine on the same network switch)
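
The two-level preference described above can be sketched as a simple selection routine. LocalityScheduler, Worker and pickWorker are hypothetical names, and the fallback to any idle worker when no nearby one exists is an assumption; the slides only describe the first two preferences.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of locality-aware placement of a map task.
class LocalityScheduler {

    static class Worker {
        final String host;
        final String networkSwitch;

        Worker(String host, String networkSwitch) {
            this.host = host;
            this.networkSwitch = networkSwitch;
        }
    }

    // Preference order: a worker holding a replica of the input,
    // then a worker on the same switch as a replica, then any idle worker.
    static Worker pickWorker(List<Worker> idleWorkers,
                             Set<String> replicaHosts,
                             Set<String> replicaSwitches) {
        for (Worker worker : idleWorkers) {
            if (replicaHosts.contains(worker.host)) {
                return worker;
            }
        }
        for (Worker worker : idleWorkers) {
            if (replicaSwitches.contains(worker.networkSwitch)) {
                return worker;
            }
        }
        return idleWorkers.isEmpty() ? null : idleWorkers.get(0);
    }

    public static void main(String[] args) {
        List<Worker> idle = Arrays.asList(
            new Worker("host-a", "switch-1"),
            new Worker("host-b", "switch-2"));
        // The only replica lives on host-c (busy), which hangs off switch-2.
        Set<String> replicaHosts = new HashSet<>(Arrays.asList("host-c"));
        Set<String> replicaSwitches = new HashSet<>(Arrays.asList("switch-2"));
        System.out.println(pickWorker(idle, replicaHosts, replicaSwitches).host); // prints host-b
    }
}
```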

Other Features: Backup Tasks

• Some tasks may be delayed (stragglers):
  • A machine that takes too long to complete one of the last few map or reduce tasks
• Causes:
  • Bad disk, competition with other processes, processor caches disabled
• Solution:
  • When close to completion, the master schedules backup tasks for the in-progress tasks
  • Whichever one finishes first "wins"
• Effect:
  • Dramatically shortens job completion time
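
The "whichever finishes first wins" rule can be illustrated with the standard java.util.concurrent API: ExecutorService.invokeAny returns the result of the first attempt that completes and cancels the rest. This is only an analogy on a single machine. BackupTaskSketch, attempt and runWithBackup are hypothetical names, and unlike the real master, which schedules backups only near the end of the job, the sketch starts both attempts at once.

```java
import java.util.Arrays;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of a backup task racing against a straggler.
class BackupTaskSketch {

    // Simulated task attempt: sleeps for delayMillis, then reports its label.
    static Callable<String> attempt(String label, long delayMillis) {
        return () -> {
            Thread.sleep(delayMillis);
            return label;
        };
    }

    // Runs the original attempt and its backup; the first to finish wins.
    static String runWithBackup(Callable<String> original, Callable<String> backup) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            return pool.invokeAny(Arrays.asList(original, backup));
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("no attempt completed", e);
        } finally {
            pool.shutdownNow(); // interrupts the attempt that lost the race
        }
    }

    public static void main(String[] args) {
        String winner = runWithBackup(attempt("original (straggler)", 2000),
                                      attempt("backup", 50));
        System.out.println("winner: " + winner);
    }
}
```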

Performance

• Tests run on a cluster of ~1800 machines:
  • 4 GB of memory
  • Dual-processor 2 GHz Xeons with Hyperthreading
  • Dual 160 GB IDE disks
  • Gigabit Ethernet per machine
  • All machines placed in the same hosting facility

Performance: Distributed Grep Program

• Searches for a rare three-character pattern
  • The pattern occurs 97337 times
• Scans through 10^10 100-byte records (input)
• Input split into pieces of approx. 64MB
  • Map tasks = 15000
• Entire output is placed in one file
  • Reducers = 1
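
The number of map tasks follows from the input size: 10^10 records of 100 bytes are 10^12 bytes (about 1 TB), and dividing by the split size gives roughly 15000 pieces. The sketch below redoes that arithmetic, assuming 64MB means 64 x 2^20 bytes; GrepInputSize and its methods are hypothetical names.

```java
// Back-of-the-envelope check of the grep test figures.
class GrepInputSize {

    // Total input size in bytes.
    static long inputBytes(long records, long bytesPerRecord) {
        return records * bytesPerRecord;
    }

    // Number of splits needed to cover the input (rounded up).
    static long mapTasks(long totalBytes, long splitBytes) {
        return (totalBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        long total = inputBytes(10_000_000_000L, 100L); // 10^10 records of 100 bytes
        long split = 64L * 1024 * 1024;                 // assumption: 64MB = 64 * 2^20 bytes
        System.out.println("input bytes: " + total);                // prints input bytes: 1000000000000
        System.out.println("map tasks: " + mapTasks(total, split)); // prints map tasks: 14902
    }
}
```

The result, 14902, is consistent with the approximate figure of 15000 map tasks given on the slide.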

Performance: Grep

• Test completes in ~150 sec
• Locality optimization helps:
  • 1800 machines read 1 TB of data at a peak of ~31 GB/s
  • Without this, rack switches would limit the rate to 10 GB/s
• Startup overhead is significant for short jobs

[Figure: scan rate plot, annotated with "1764 Workers" and "Maps are starting to finish"]

Hadoop: A MapReduce Implementation

• http://hadoop.apache.org
• Installing Hadoop MapReduce
  • Install Hadoop Core
  • Configure the Hadoop site in conf/hadoop-site.xml
    • HDFS Master
    • MapReduce Master
    • Number of replicas of each file in the cluster

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>

Hadoop: A MapReduce Implementation

• Create a distributed filesystem:
    $ bin/hadoop namenode -format
• Start the Hadoop daemons:
    $ bin/start-all.sh
    (equivalent to $ bin/start-dfs.sh + $ bin/start-mapred.sh)
• Check the namenode (HDFS): http://localhost:50070/
• Check the job tracker (MapReduce): http://localhost:50030/


Hadoop: HDFS Console


Hadoop: JobTracker Console

Hadoop: Word Count Example

    $ bin/hadoop dfs -ls /tmp/fperez-hadoop/wordcount/input/
    /tmp/fperez-hadoop/wordcount/input/file01
    /tmp/fperez-hadoop/wordcount/input/file02

    $ bin/hadoop dfs -cat /tmp/fperez-hadoop/wordcount/input/file01
    Welcome To Hadoop World

    $ bin/hadoop dfs -cat /tmp/fperez-hadoop/wordcount/input/file02
    Goodbye Hadoop World

Hadoop: Running the Example

• Run the application:
    $ bin/hadoop jar /tmp/fperez-hadoop/wordcount.jar org.myorg.WordCount /tmp/fperez-hadoop/wordcount/input /tmp/fperez-hadoop/wordcount/output
• Output:
    $ bin/hadoop dfs -cat /tmp/fperez-hadoop/wordcount/output/part-00000
    Goodbye 1
    Hadoop 2
    To 1
    Welcome 1
    World 2

Hadoop: Word Count Example

    public class WordCount extends Configured implements Tool {
        ...

        public static class MapClass extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, IntWritable> {
            ... // Map Task Definition
        }

        public static class Reduce extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {
            ... // Reduce Task Definition
        }

        public int run(String[] args) throws Exception {
            ... // Job Configuration
        }

        public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new Configuration(), new WordCount(), args);
            System.exit(res);
        }
    }

Hadoop: Job Configuration

    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), WordCount.class);
        conf.setJobName("wordcount");

        // the keys are words (strings)
        conf.setOutputKeyClass(Text.class);
        // the values are counts (ints)
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(MapClass.class);
        // the reducer doubles as the combiner
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        // input and output paths come from the command line
        conf.setInputPath(new Path(args[0]));
        conf.setOutputPath(new Path(args[1]));

        JobClient.runJob(conf);
        return 0;
    }

Hadoop: Map Class

    public static class MapClass extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // map(WritableComparable, Writable, OutputCollector, Reporter)
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            // emit <word, 1> for every token in the line
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);
            }
        }
    }

Hadoop: Reduce Class

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        // reduce(WritableComparable, Iterator, OutputCollector, Reporter)
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            // add together all the counts that share the same word
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

References

• Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04, San Francisco, CA, December 2004.
• Ralf Lämmel. Google's MapReduce Programming Model – Revisited. 2006-2007. Accepted for publication in the Science of Computer Programming journal.
• Jeff Dean, Sanjay Ghemawat. Slides from OSDI'04. http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
• Hadoop. http://hadoop.apache.org


Questions?