Problem-solving on large-scale clusters: theory and applications
Lecture 3: Bringing it all together


Page 1: Problem-solving on large-scale clusters: theory and applications

Problem-solving on large-scale clusters: theory and applications

Lecture 3: Bringing it all together

Page 2: Problem-solving on large-scale clusters: theory and applications

Today’s Outline

• Course directions, projects, and feedback

• Quiz 2

• Context / Where we are
– Why do we care about fold() and map()?
– Why do we care about parallelization and data dependencies?

• MapReduce architecture from 10,000 feet

Page 3: Problem-solving on large-scale clusters: theory and applications

Context and Review

• Data dependencies determine whether a problem can be formulated in MapReduce

• The properties of fold() and map() determine how to formulate a problem in MapReduce

How do you parallelize fold()? map()?
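One way to see the answer, sketched in Python (illustrative only, not part of the course materials): map has no dependencies between elements, so the input can be split into chunks arbitrarily; fold threads an accumulator through the whole list, so it parallelizes only when the folded function is associative with an identity element.

```python
from functools import reduce

def parallel_map(f, xs, n_chunks=4):
    # map has no data dependencies between elements: each chunk
    # could run on a different machine independently.
    chunks = [xs[i::n_chunks] for i in range(n_chunks)]
    results = [list(map(f, c)) for c in chunks]  # pretend each runs remotely
    # reassemble, preserving the original order
    out = [None] * len(xs)
    for i, r in enumerate(results):
        out[i::n_chunks] = r
    return out

def parallel_fold(f, xs, init, n_chunks=4):
    # fold threads an accumulator, so this is only correct when
    # f is associative and `init` is its identity: fold each chunk,
    # then fold the partial results together.
    chunks = [xs[i::n_chunks] for i in range(n_chunks)]
    partials = [reduce(f, c, init) for c in chunks]
    return reduce(f, partials, init)

print(parallel_map(lambda x: x * x, [1, 2, 3, 4]))            # [1, 4, 9, 16]
print(parallel_fold(lambda a, b: a + b, list(range(10)), 0))  # 45
```

With a non-associative function (e.g. subtraction), the chunked fold would give a different answer than a sequential foldl, which is exactly why data dependencies matter.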

Page 4: Problem-solving on large-scale clusters: theory and applications

MapReduce Introduction

• MapReduce is both a programming model and a clustered computing system
– A specific way of formulating a problem, which yields good parallelizability
– A system which takes a MapReduce-formulated problem and executes it on a large cluster
• Hides implementation details, such as hardware failures, grouping and sorting, scheduling, …

• Previous lectures have focused on MapReduce-the-problem-formulation

• Today will mostly focus on MapReduce-the-system

Page 5: Problem-solving on large-scale clusters: theory and applications

MR Problem Formulation: Formal Definition

MapReduce:

mapreduce fm fr l = map (reducePerKey fr) (group (map fm l))

reducePerKey fr (k, v_list) = (k, foldl (fr k) [] v_list)

– Assume map here is actually concatMap.
– Argument l is a list of documents.
– The result of the first map is a list of key-value pairs.
– The function fr takes 3 arguments: key, context, current. With currying, this locks the value of key for each list during the fold.

MapReduce maps a fold over the sorted result of a map!
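That one-liner can be transcribed almost directly into runnable Python (an illustrative sketch, not the course's reference implementation), with sorting standing in for group and word count as the example job:

```python
from functools import reduce
from itertools import groupby

def mapreduce(fm, fr, l):
    # map fm l  -- fm emits a list of (key, value) pairs per document ("concatMap")
    pairs = [kv for doc in l for kv in fm(doc)]
    # group  -- sort by key, then collect each key's value list
    pairs.sort(key=lambda kv: kv[0])
    grouped = [(k, [v for _, v in g])
               for k, g in groupby(pairs, key=lambda kv: kv[0])]
    # map (reducePerKey fr)  -- fold (fr k) over each key's values, starting from []
    return [(k, reduce(fr(k), vs, [])) for k, vs in grouped]

def fm(doc):
    # word-count mapper: one (word, 1) pair per token
    return [(w, 1) for w in doc.split()]

def fr(k):
    # curried reducer: the key is fixed; the fold sees (accumulator, value)
    return lambda acc, v: [(acc[0] if acc else 0) + v]

print(mapreduce(fm, fr, ["a b a", "b c"]))  # [('a', [2]), ('b', [2]), ('c', [1])]
```

Note how `fr(k)` mirrors the currying remark above: the key is bound once per list before the fold begins.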

Page 6: Problem-solving on large-scale clusters: theory and applications

MR System Overview (1 of 2)

Map:
– Preprocesses a set of files to generate intermediate key-value pairs
– As parallelized as you want

Group:
– Partitions intermediate key-value pairs by unique key, generating a list of all associated values

Reduce:
– For each key, iterates over the value list
– Performs computation that requires context between iterations
– Parallelizable amongst different keys, but not within one key

Page 7: Problem-solving on large-scale clusters: theory and applications

MR System Overview (2 of 2)

Shamelessly stolen from Jeff Dean’s OSDI ‘04 presentation http://labs.google.com/papers/mapreduce-osdi04-slides/index.html

Page 8: Problem-solving on large-scale clusters: theory and applications

Example: MapReduce DocInfo (1 of 2)

MapReduce:

mapreduce fm fr l = map (reducePerKey fr) (group (map fm l))

reducePerKey fr (k, v_list) = (k, foldl (fr k) [] v_list)

Pseudocode for fm:

fm contents = concat [
    [("spaces", count_spaces contents)],
    (map (emit "raw") (split contents)),
    (map (emit "scrub") (scrub (split contents)))]

emit label value = (label, (value, 1))
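A runnable Python rendering of this mapper may help; count_spaces and scrub are not defined on the slide, so the versions below (space counting, and lowercasing plus punctuation stripping) are assumptions for illustration:

```python
def emit(label):
    # curried, as on the slide: emit "raw" word -> ("raw", (word, 1))
    return lambda value: (label, (value, 1))

def count_spaces(contents):
    return contents.count(" ")

def scrub(words):
    # hypothetical scrubber: lowercase and strip surrounding punctuation
    return [w.strip(".,!?").lower() for w in words]

def fm(contents):
    # concat of three pair lists: the space count, raw tokens, scrubbed tokens
    words = contents.split()
    return ([("spaces", count_spaces(contents))]
            + [emit("raw")(w) for w in words]
            + [emit("scrub")(w) for w in scrub(words)])

print(fm("Hello, world!"))
```

Each document thus yields one "spaces" pair plus two pairs per token, all tagged so the group step can route them to the right reducer logic.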

Page 9: Problem-solving on large-scale clusters: theory and applications

Example: MapReduce DocInfo (2 of 2)

MapReduce:

mapreduce fm fr l = map (reducePerKey fr) (group (map fm l))

reducePerKey fr (k, v_list) = (k, foldl (fr k) [] v_list)

Pseudocode for fr:

fr "spaces" count (total:xs) = (total+count : xs)
fr "raw" (word, count) result = update_result (word, count) result
fr "scrub" (word, count) result = update_result (word, count) result
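The same reducer in runnable Python (a sketch; update_result is not defined on the slide, so the merge-into-sorted-counts version below is an assumption). The key dispatch replaces the pattern matching on 'spaces' / 'raw' / 'scrub':

```python
from functools import reduce

def update_result(pair, result):
    # hypothetical helper: merge one (word, count) into the accumulated counts
    word, count = pair
    merged = dict(result)
    merged[word] = merged.get(word, 0) + count
    return sorted(merged.items())

def fr(key):
    # curried on the key, as on the slide: the fold only sees (acc, value)
    def step(acc, value):
        if key == "spaces":
            # (total:xs) pattern with a [] starting accumulator
            total = acc[0] if acc else 0
            return [total + value] + acc[1:]
        else:  # "raw" and "scrub" values are (word, count) pairs
            return update_result(value, acc)
    return step

print(reduce(fr("spaces"), [3, 2], []))                       # [5]
print(reduce(fr("raw"), [("a", 1), ("a", 1), ("b", 1)], []))  # [('a', 2), ('b', 1)]
```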

Page 10: Problem-solving on large-scale clusters: theory and applications

Group Exercise

Formulate the following as MapReduces:

1. Find the set of unique words in a document
a) Input: a bunch of words
b) Output: all the unique words (no repeats)

2. Calculate per-employee taxes
a) Input: a list of (employee, salary, month) tuples
b) Output: a list of (employee, taxes due) pairs

3. Randomly reorder sentences
a) Input: a bunch of documents
b) Output: all sentences in random order (may include duplicates)

4. Compute the minesweeper grid/map
a) Input: coordinates for the location of mines
b) Output: coordinate/value pairs for all non-zero cells

Can you think of generalized techniques for decomposing problems?
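As a worked example of the kind of formulation expected, here is one possible answer to exercise 1 in condensed Python (not the official solution): the mapper emits each word as its own key, and the grouping step does all the deduplication, so the reducer only has to keep the key.

```python
from itertools import groupby

def unique_words(docs):
    # fm: emit (word, word) for every word; sorting stands in for group
    pairs = sorted((w, w) for doc in docs for w in doc.split())
    # fr: for each key, ignore the value list and output just the key
    return [k for k, _ in groupby(pairs, key=lambda kv: kv[0])]

print(unique_words(["the cat", "the dog"]))  # ['cat', 'dog', 'the']
```

The general technique on display: when the answer is a set, make each candidate element a key and let the group step collapse duplicates for free.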

Page 11: Problem-solving on large-scale clusters: theory and applications

MapReduce Parallelization: Execution


Page 12: Problem-solving on large-scale clusters: theory and applications

MapReduce Parallelization: Pipelining

• Finely granular tasks: many more map tasks than machines
– Better dynamic load balancing
– Minimizes time for fault recovery
– Can pipeline the shuffling/grouping while maps are still running

• Example: 2000 machines -> 200,000 map + 5000 reduce tasks


Page 13: Problem-solving on large-scale clusters: theory and applications

Example: MR DocInfo, revisited

Do MapReduce DocInfo in 2 passes (instead of 1), performing all the work in the "group" step.

Map1:
1. Tokenize document
2. For each token, output:
a) ("raw:<word>", 1)
b) ("scrubbed:<scrubbed_word>", 1)

Reduce1:
1. For each key, ignore the value list and output (key, 1)

Map2:
1. Tokenize document
2. For each token "type:value", output (type, 1)

Reduce2:
1. For each key, output (key, (sum values))
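A condensed Python sketch of the two passes (illustrative; the scrubber is the same assumed lowercase-and-strip-punctuation helper as before). Set deduplication stands in for pass 1's group step, since Reduce1 discards the value lists anyway; the net result is a count of unique raw words and unique scrubbed words.

```python
from collections import Counter

def scrub(word):
    # hypothetical scrubber: lowercase, strip surrounding punctuation
    return word.strip(".,!?").lower()

def pass1(docs):
    # Map1 emits ("raw:<word>", 1) and ("scrubbed:<word>", 1);
    # Reduce1 ignores the value list, so grouping deduplicates the keys.
    keys = {f"raw:{w}" for doc in docs for w in doc.split()}
    keys |= {f"scrubbed:{scrub(w)}" for doc in docs for w in doc.split()}
    return [(k, 1) for k in sorted(keys)]

def pass2(records):
    # Map2 re-keys each "type:value" token by its type; Reduce2 sums the 1s.
    counts = Counter(k.split(":", 1)[0] for k, _ in records)
    return sorted(counts.items())

docs = ["The cat", "the cat sat"]
print(pass2(pass1(docs)))  # [('raw', 4), ('scrubbed', 3)]
```

"The" and "the" are distinct raw words but collapse to one scrubbed word, which is why the two counts differ.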

Page 14: Problem-solving on large-scale clusters: theory and applications

Example: MR DocInfo, revisited

• Of the 2 DocInfo MapReduce implementations, which is better?

• Define “better”. What resources are you considering?Dev time? CPU? Network? Disk? Complexity? Reusability?

[Diagram: three Mapper nodes and two Reducer nodes, connected to each other and to GFS.

Key:
• Connections are network links
• GFS is a cluster of storage machines]

Page 15: Problem-solving on large-scale clusters: theory and applications

Hadoop-as-MapReduce

mapreduce fm fr l = map (reducePerKey fr) (group (map fm l))

reducePerKey fr (k, v_list) = (k, foldl (fr k) [] v_list)

Hadoop:
• fm and fr are function objects (classes)
• The class for fm implements the Mapper interface:

map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter)

• The class for fr implements the Reducer interface:

reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter)

Hadoop takes the generated class files and manages running them.
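For a feel of the same contract without the Java boilerplate, Hadoop Streaming lets the mapper and reducer be plain scripts reading stdin and writing tab-separated key/value lines; the sketch below is word count in that style (illustrative, not tied to a specific Hadoop version). The framework sorts by key between the two stages, which is what makes the groupby in the reducer valid.

```python
import sys
from itertools import groupby

def mapper(stdin=sys.stdin, stdout=sys.stdout):
    # analogue of Mapper.map(): input records in, "key\tvalue" lines out
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def reducer(stdin=sys.stdin, stdout=sys.stdout):
    # analogue of Reducer.reduce(): the framework delivers lines sorted
    # by key, so consecutive lines with the same key form one group
    pairs = (line.rstrip("\n").split("\t") for line in stdin)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        stdout.write(f"{key}\t{sum(int(v) for _, v in group)}\n")

if __name__ == "__main__":
    # a streaming job would invoke one of these per process,
    # e.g.: mapper() if sys.argv[1] == "map" else reducer()
    pass
```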

Page 16: Problem-solving on large-scale clusters: theory and applications

Bonus Materials: MR Runtime

• The following slides illustrate an example run of MapReduce on a Google cluster

• A sample job from the indexing pipeline that processes ~900 GB of crawled pages

Page 17: Problem-solving on large-scale clusters: theory and applications

MR Runtime (1 of 9)


Page 18: Problem-solving on large-scale clusters: theory and applications

MR Runtime (2 of 9)


Page 19: Problem-solving on large-scale clusters: theory and applications

MR Runtime (3 of 9)


Page 20: Problem-solving on large-scale clusters: theory and applications

MR Runtime (4 of 9)


Page 21: Problem-solving on large-scale clusters: theory and applications

MR Runtime (5 of 9)


Page 22: Problem-solving on large-scale clusters: theory and applications

MR Runtime (6 of 9)


Page 23: Problem-solving on large-scale clusters: theory and applications

MR Runtime (7 of 9)


Page 24: Problem-solving on large-scale clusters: theory and applications

MR Runtime (8 of 9)


Page 25: Problem-solving on large-scale clusters: theory and applications

MR Runtime (9 of 9)
