
Topic 5: MapReduce Theory and Implementation


Cloud Computing Workshop 2013, ITU


Page 1: Topic 5: MapReduce Theory and Implementation

5: MapReduce Theory and Implementation

Zubair Nabi

[email protected]

April 18, 2013

Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 1 / 34

Page 2: Topic 5: MapReduce Theory and Implementation

Outline

1 Introduction

2 Programming Model

3 Implementation

4 Refinements

5 Hadoop



Page 4: Topic 5: MapReduce Theory and Implementation

Common computations at Google

Process large amounts of data generated from crawled documents, web request logs, etc.

Compute inverted index, graph structure of web documents, summaries of pages crawled per host, etc.

Common properties:

1 Computation is conceptually simple and is distributed across hundreds or thousands of machines to leverage parallelism

2 Input data is large

3 The original simple computation is made complex by system-level code to deal with issues of work assignment and distribution, and fault-tolerance



Page 9: Topic 5: MapReduce Theory and Implementation

Enter MapReduce

Based on the insights mentioned in the previous slide, two Google engineers, Jeff Dean and Sanjay Ghemawat, designed MapReduce in 2004

I Abstraction that helps the programmer express simple computations

I Hides the gory details of parallelization, fault-tolerance, data distribution, and load balancing

I Relies on user-provided map and reduce primitives present in functional languages

Leverages one key insight: most of the computation at Google involved applying a map operator to each logical record in the input dataset to obtain a set of intermediate key/value pairs, and then applying a reduce operation to all values with the same key, for aggregation



Page 14: Topic 5: MapReduce Theory and Implementation

(Figure slide; image not captured in the transcript)

Page 15: Topic 5: MapReduce Theory and Implementation

Outline

1 Introduction

2 Programming Model

3 Implementation

4 Refinements

5 Hadoop


Page 16: Topic 5: MapReduce Theory and Implementation

Programming Model

Input: Set of key/value pairs

Output: Set of key/value pairs

The user provides the entire computation in the form of two functions: map and reduce



Page 19: Topic 5: MapReduce Theory and Implementation

User-defined functions

1 Map

I Takes an input pair and produces a set of intermediate key/value pairs

I The framework groups together the intermediate values by key for consumption by the Reduce

2 Reduce

I Takes as input a key and a list of associated values

I In the common case, it merges these values to result in a smaller set of values



Page 23: Topic 5: MapReduce Theory and Implementation

Example: Word Count

Counting the occurrence of each word in a large collection of documents

1 Map

I Emits each word and the value 1

2 Reduce

I Sums together all counts emitted for a particular word



Page 26: Topic 5: MapReduce Theory and Implementation

Example: Word Count (2)

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

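The pseudocode above can be translated into a minimal, runnable single-machine sketch in Python. The in-memory `run_mapreduce` driver simulating the group-by-key shuffle is illustrative only, not part of any real framework:

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word, values: a list of counts
    yield (key, sum(values))

def run_mapreduce(documents, map_fn, reduce_fn):
    # Shuffle: group intermediate values by key
    groups = defaultdict(list)
    for name, contents in documents.items():
        for k, v in map_fn(name, contents):
            groups[k].append(v)
    # Reduce: invoke the user-defined reduce once per key
    out = {}
    for k, vs in groups.items():
        for rk, rv in reduce_fn(k, vs):
            out[rk] = rv
    return out

docs = {"d1": "the quick fox", "d2": "the fox"}
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'the': 2, 'quick': 1, 'fox': 2}
```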

Page 27: Topic 5: MapReduce Theory and Implementation

Types

User-supplied map and reduce functions have associated types

1 Map

I map(k1, v1) → list(k2, v2)

2 Reduce

I reduce(k2, list(v2)) → list(v2)

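One way to express these signatures concretely is with Python type hints; the alias names here are my own, and Word Count instantiates the scheme with k1 = document name, v1 = contents, k2 = word, v2 = count:

```python
from typing import Callable, Iterable, TypeVar

K1, V1 = TypeVar("K1"), TypeVar("V1")
K2, V2 = TypeVar("K2"), TypeVar("V2")

# map(k1, v1) -> list(k2, v2)
MapFn = Callable[[K1, V1], Iterable[tuple[K2, V2]]]
# reduce(k2, list(v2)) -> list(v2)
ReduceFn = Callable[[K2, list[V2]], Iterable[V2]]

def wc_map(name: str, contents: str) -> list[tuple[str, int]]:
    return [(w, 1) for w in contents.split()]

def wc_reduce(word: str, counts: list[int]) -> list[int]:
    return [sum(counts)]
```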


Page 29: Topic 5: MapReduce Theory and Implementation

More applications

Distributed Grep

1 Map

F Emits a line if it matches a user-provided pattern

2 Reduce

F Identity function

Count of URL Access Frequency

1 Map

F Similar to the Word Count map; instead of words we have URLs

2 Reduce

F Similar to the Word Count reduce

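Distributed Grep can be sketched as map/reduce like so; the `"error"` pattern is illustrative (in practice the user supplies it when configuring the job):

```python
import re

PATTERN = re.compile(r"error")  # illustrative user-provided pattern

def grep_map(offset, line):
    # Emit the line itself if it matches; the value is unused
    if PATTERN.search(line):
        yield (line, "")

def grep_reduce(key, values):
    # Identity: the matching line passes through unchanged
    yield key

log = ["error: disk full", "all systems go", "fatal error in worker"]
matches = [k for i, line in enumerate(log) for k, _ in grep_map(i, line)]
```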


Page 31: Topic 5: MapReduce Theory and Implementation

More applications (2)

Inverted Index

1 Map

F Emits a sequence of <word, document_ID>

2 Reduce

F Emits <word, list(document_ID)>

Distributed Sort

1 Map

F Identity

2 Reduce

F Identity

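The Inverted Index application can be sketched the same way, with an in-memory group-by standing in for the shuffle:

```python
from collections import defaultdict

def index_map(doc_id, contents):
    # Emit <word, document_ID> once per distinct word in the document
    for word in set(contents.split()):
        yield (word, doc_id)

def index_reduce(word, doc_ids):
    # Emit <word, list(document_ID)>
    yield (word, sorted(doc_ids))

docs = {"d1": "fox quick", "d2": "fox"}
groups = defaultdict(list)      # simulated shuffle: word -> doc IDs
for doc_id, text in docs.items():
    for w, d in index_map(doc_id, text):
        groups[w].append(d)
index = {w: ids for word, ids_list in groups.items()
         for w, ids in index_reduce(word, ids_list)}
```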


Page 33: Topic 5: MapReduce Theory and Implementation

Outline

1 Introduction

2 Programming Model

3 Implementation

4 Refinements

5 Hadoop


Page 34: Topic 5: MapReduce Theory and Implementation

Cluster architecture

A large cluster of shared-nothing commodity machines connected via Ethernet

Each node is an x86 system running Linux with local memory

Commodity networking hardware connected in the form of a tree topology

As clusters consist of hundreds or thousands of machines, failure is pretty common

Each machine has local hard drives

I The Google Filesystem runs atop these disks and employs replication to ensure availability and reliability

Jobs are submitted to a scheduler, which maps tasks within that job to available machines within the cluster



Page 40: Topic 5: MapReduce Theory and Implementation

MapReduce architecture

1 Master: In charge of all metadata, work scheduling and distribution, and job orchestration

2 Workers: Contain slots to execute map or reduce functions



Page 42: Topic 5: MapReduce Theory and Implementation

Execution

1 The user writes map and reduce functions and stitches together a MapReduce specification with the location of the input dataset, the number of reduce tasks, and other attributes

2 The master logically splits the input dataset into M splits, where M = (Input_dataset_size) / (GFS_block_size)

I The GFS block size is typically a multiple of 64 MB

3 It then earmarks M map tasks and assigns them to workers. Each worker has a configurable number of task slots. Each time a worker completes a task, the master assigns it more pending map tasks

4 Once all map tasks have completed, the master assigns R reduce tasks to worker nodes

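The split count in step 2 is a simple division, rounded up so the last partial block still gets a map task; a sketch (the 64 MB default mirrors the typical GFS block size mentioned above):

```python
import math

def num_map_tasks(input_size_bytes, block_size_bytes=64 * 1024 * 1024):
    # M = ceil(input dataset size / GFS block size)
    return math.ceil(input_size_bytes / block_size_bytes)

m = num_map_tasks(10 * 1024**3)  # a 10 GB input yields M = 160 map tasks
```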


Page 47: Topic 5: MapReduce Theory and Implementation

Mappers

1 A map worker reads the contents of the input split that it has been assigned

2 It parses the file, converts it to key/value pairs, and invokes the user-defined map function for each pair

3 The intermediate key/value pairs produced by the map logic are collected (buffered) in memory

4 Once the buffered key/value pairs exceed a threshold, they are written to local disk and partitioned (using a partitioning function) into R partitions. The location of each partition is passed to the master

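The spill-time partitioning in step 4 can be sketched as follows. The default partitioner is hash(key) mod R; `crc32` stands in here as a deterministic hash, and the buffered pairs are hypothetical:

```python
import zlib

def partition(key, num_reducers):
    # hash(key) mod R: every pair with the same key lands in the
    # same partition, and hence reaches the same reduce worker
    return zlib.crc32(key.encode()) % num_reducers

R = 4
buffered = [("the", 1), ("fox", 1), ("the", 1), ("quick", 1)]
spills = {}  # partition index -> pairs destined for that reducer
for k, v in buffered:
    spills.setdefault(partition(k, R), []).append((k, v))
```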


Page 51: Topic 5: MapReduce Theory and Implementation

Reducers

1 A reduce worker gets the locations of its input partitions from the master and uses HTTP requests to retrieve them

2 Once it has read all its input, it sorts it by key to group together all occurrences of the same key

3 It then invokes the user-defined reduce for each key, passing it the key and its associated values

4 The key/value pairs generated by the reduce logic are written to a final output file, which is subsequently written to the distributed filesystem

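Steps 2 and 3, the sort-then-group pattern at the heart of the reduce worker, can be sketched in a few lines of Python:

```python
from itertools import groupby
from operator import itemgetter

def reduce_phase(intermediate, reduce_fn):
    # Sort by key so all occurrences of the same key are adjacent,
    # then invoke the user-defined reduce once per key
    intermediate.sort(key=itemgetter(0))
    out = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = [v for _, v in group]
        out.extend(reduce_fn(key, values))
    return out

result = reduce_phase([("b", 1), ("a", 1), ("b", 1)],
                      lambda k, vs: [(k, sum(vs))])
# [('a', 1), ('b', 2)]
```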


Page 55: Topic 5: MapReduce Theory and Implementation

(Figure slide; image not captured in the transcript)

Page 56: Topic 5: MapReduce Theory and Implementation

Book-keeping by the Master

The master maintains metadata for all jobs running in the cluster

For each map and reduce task, it stores the state (pending, in-progress, or completed) and the ID of the worker on which it is executing (while in-progress)

It stores the locations and sizes of the partitions produced by each map task

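The master's per-task book-keeping can be sketched as a small record type; the field and state names are illustrative, not Google's actual data structures:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class TaskState(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

@dataclass
class TaskInfo:
    state: TaskState = TaskState.PENDING
    worker_id: Optional[str] = None  # set only while in-progress
    # For map tasks: partition index -> (location, size in bytes)
    partitions: dict = field(default_factory=dict)

# The master keeps one record per task, e.g. keyed by task ID
tasks = {"map-0": TaskInfo(), "reduce-0": TaskInfo()}
tasks["map-0"].state = TaskState.IN_PROGRESS
tasks["map-0"].worker_id = "worker-7"
```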


Page 59: Topic 5: MapReduce Theory and Implementation

Fault-tolerance

For large compute clusters, failures are the norm rather than the exception

1 Worker:

I Each worker sends a periodic heartbeat signal to the master

I If the master does not receive a heartbeat from a worker within a certain amount of time, it marks the worker as failed

I In-progress map and reduce tasks are simply re-executed on other nodes. The same goes for completed map tasks (as their output is lost on machine failure)

I Completed reduce tasks are not re-executed, as their output resides on the distributed filesystem

2 Master:

I The entire computation is marked as failed

I But it is simple to checkpoint the master's soft state and re-spawn it



Locality

Network bandwidth is a scarce resource in typical clusters

GFS slices files into 64MB blocks and stores 3 replicas across the cluster

The master exploits this information by scheduling a map task near its input data. The preference order is node-local, then rack/switch-local, then any node
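The three-level preference order can be sketched as a simple selection function; the function and parameter names are illustrative, not taken from the actual scheduler.

```python
def pick_worker(replica_nodes, idle_workers, rack_of):
    """Choose a worker for a map task, preferring node-local placement,
    then rack-local, then any idle worker.

    replica_nodes: set of nodes holding a replica of the input block
    idle_workers:  list of currently idle worker nodes
    rack_of:       mapping from node name to its rack ID
    """
    # 1. Node-local: an idle worker that already holds a replica
    for w in idle_workers:
        if w in replica_nodes:
            return w
    # 2. Rack-local: an idle worker in the same rack as some replica
    replica_racks = {rack_of[n] for n in replica_nodes}
    for w in idle_workers:
        if rack_of[w] in replica_racks:
            return w
    # 3. Fall back to any idle worker (data crosses the network)
    return idle_workers[0] if idle_workers else None
```

Each step down the ladder costs more network bandwidth, which is exactly the resource the heuristic is trying to conserve.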

Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 24 / 34


Speculative re-execution

Every now and then, the entire computation is held up by a “straggler” task

Stragglers can arise for a number of reasons, such as machine load, network traffic, software/hardware bugs, etc.

To deal with stragglers, the master speculatively re-executes slow tasks on other machines

The task is marked as completed whenever either the primary or the backup finishes its execution
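The "first finisher wins" rule can be modelled in a few lines; the class is a toy illustration, with names that are assumptions rather than anything from the paper.

```python
class SpeculativeTask:
    """Toy model of backup (speculative) execution: the first attempt to
    finish, whether primary or backup, marks the task completed; any later
    finisher is discarded as a duplicate."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.completed_by = None  # attempt that won, or None if still running

    def report_done(self, attempt_id):
        # Only the first completion report is accepted
        if self.completed_by is None:
            self.completed_by = attempt_id
            return True
        return False
```

Because duplicate completions are simply ignored, launching a backup copy is always safe: it can only shorten the job's tail, never corrupt its result.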

Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 25 / 34


Scalability

Possible to run on multiple scales: from single nodes to data centers with tens of thousands of nodes

Nodes can be added/removed on the fly to scale up/down

Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 26 / 34


Outline

1 Introduction

2 Programming Model

3 Implementation

4 Refinements

5 Hadoop

Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 27 / 34


Partitioning

By default MapReduce uses hash partitioning to partition the keyspace

I hash(key) % R

Optionally, the user can provide a custom partitioning function to, say, mitigate skew or to ensure that certain keys always end up at a particular reduce worker
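Both cases can be sketched briefly. The default partitioner below follows the hash(key) % R scheme above; the hostname-based one is an illustrative custom partitioner (the paper's URL example), with names chosen for this sketch.

```python
from urllib.parse import urlparse

R = 4  # number of reduce tasks (illustrative)

def default_partition(key):
    # Default partitioner: hash the key and take it modulo R
    return hash(key) % R

def hostname_partition(url):
    # Custom partitioner: all URLs from the same host hash to the same
    # partition, so they end up at the same reduce worker
    return hash(urlparse(url).hostname) % R
```

A custom partitioner changes only where keys land, not what the reduce function computes.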

Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 28 / 34


Combiner function

For reduce functions which are commutative and associative, the user can additionally provide a combiner function, which is applied to the output of the map for local merging

Typically, the same reduce function is used as a combiner
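The classic word-count example illustrates this: since addition is commutative and associative, the reduce function can merge the map output locally before it is shuffled over the network. The function names below are illustrative.

```python
def map_fn(line):
    # Emit (word, 1) for every word, as in the canonical word-count example
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Sum is commutative and associative, so the same function can also
    # serve as the combiner
    return sum(values)

def map_with_combiner(line):
    # Apply the reduce function to the map output locally, shrinking the
    # data that must cross the network during the shuffle
    grouped = {}
    for word, count in map_fn(line):
        grouped.setdefault(word, []).append(count)
    return [(word, reduce_fn(word, counts)) for word, counts in grouped.items()]
```

For a line like "the cat the", the combiner sends ("the", 2), ("cat", 1) instead of three separate pairs.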

Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 29 / 34


Input/output formats

By default, the library supports a number of input/output formats

I For instance, text as input and key/value pairs as output

Optionally, the user can specify custom input readers and output writers

I For instance, to read/write from/to a database
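A custom reader of that kind might look like the sketch below, which yields key/value records from a SQLite database instead of text files. This is an illustrative generator, not any real MapReduce library's reader API; the function name and schema are assumptions.

```python
import sqlite3

def db_record_reader(db_path, query):
    """Illustrative custom input reader: yields (key, value) records from a
    database so the map function can consume them like any other input."""
    conn = sqlite3.connect(db_path)
    try:
        for key, value in conn.execute(query):
            yield key, value
    finally:
        conn.close()
```

The map function never sees where records come from; swapping the reader is enough to change the input source.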

Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 30 / 34


Outline

1 Introduction

2 Programming Model

3 Implementation

4 Refinements

5 Hadoop

Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 32 / 34


Hadoop

Open-source implementation of MapReduce, developed by Doug Cutting, originally at Yahoo! in 2004

Now a top-level Apache open-source project

Implemented in Java (Google’s in-house implementation is in C++)

Comes with an associated distributed filesystem, HDFS (clone of GFS)

Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 33 / 34


References

Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design & Implementation (OSDI '04), Vol. 6. USENIX Association, Berkeley, CA, USA.

Zubair Nabi 5: MapReduce Theory and Implementation April 18, 2013 34 / 34