
Page 1: Big Data Analytics - Universität Hildesheim · Map-Reduce

Big Data Analytics

Big Data Analytics

Lucas Rego Drumond

Information Systems and Machine Learning Lab (ISMLL)
Institute of Computer Science

University of Hildesheim, Germany

Map Reduce I

Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Map Reduce I 1 / 32

Page 2

Big Data Analytics

Outline

1. Introduction

2. Parallel Computing

3. Parallel programming paradigms

4. Map-Reduce

Page 3

Big Data Analytics 1. Introduction

Outline

1. Introduction

2. Parallel Computing

3. Parallel programming paradigms

4. Map-Reduce

Page 4

Big Data Analytics 1. Introduction

Overview (course layers, top to bottom):

- Part III: Machine Learning Algorithms
- Part II: Large Scale Computational Models
- Part I: Distributed Database / Distributed File System

Page 5

Big Data Analytics 2. Parallel Computing

Outline

1. Introduction

2. Parallel Computing

3. Parallel programming paradigms

4. Map-Reduce

Page 6

Big Data Analytics 2. Parallel Computing

Why do we need a Computational Model?

- Our data is nicely stored in a distributed infrastructure

- We have a number of computers at our disposal

- We want our analytics software to take advantage of all this computing power

- When programming, we want to focus on understanding our data, not our infrastructure

Page 7

Big Data Analytics 2. Parallel Computing

Shared Memory Infrastructure

[Diagram: multiple processors attached to a single shared memory]

Page 8

Big Data Analytics 2. Parallel Computing

Distributed Infrastructure

[Diagram: several processor–memory nodes connected by a network]

Page 9

Big Data Analytics 3. Parallel programming paradigms

Outline

1. Introduction

2. Parallel Computing

3. Parallel programming paradigms

4. Map-Reduce

Page 10

Big Data Analytics 3. Parallel programming paradigms

Parallel Computing principles

- We have p processors available to execute a task T
- Ideally: the more processors, the faster a task is executed
- Reality: synchronisation and communication costs
- Speedup s(T, p) of a task T by using p processors:
  - Let t(T, p) be the time needed to execute T using p processors
  - The speedup is given by:

    s(T, p) = t(T, 1) / t(T, p)

Page 11

Big Data Analytics 3. Parallel programming paradigms

Parallel Computing principles

- We have p processors available to execute a task T

- Efficiency e(T, p) of a task T by using p processors:

  e(T, p) = t(T, 1) / (p · t(T, p))
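The two definitions above can be computed directly; a minimal sketch (the class and method names are ours, not from the slides):

```java
// Speedup and parallel efficiency as defined above:
// s(T, p) = t(T, 1) / t(T, p)  and  e(T, p) = t(T, 1) / (p * t(T, p)).
public class ParallelMetrics {

    // t1: time with one processor, tp: time with p processors
    public static double speedup(double t1, double tp) {
        return t1 / tp;
    }

    public static double efficiency(double t1, double tp, int p) {
        return t1 / (p * tp);
    }

    public static void main(String[] args) {
        // A task taking 100s sequentially and 30s on 4 processors:
        System.out.println(speedup(100.0, 30.0));       // > 1: faster than sequential
        System.out.println(efficiency(100.0, 30.0, 4)); // < 1: sync/communication costs
    }
}
```

With no synchronisation or communication costs, t(T, p) = t(T, 1)/p and the efficiency is exactly 1; real tasks fall below that.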

Page 12

Big Data Analytics 3. Parallel programming paradigms

Considerations

- It is not worth using a lot of processors for solving small problems

- Algorithms should increase efficiency with problem size

Page 13

Big Data Analytics 3. Parallel programming paradigms

Paradigms - Shared Memory

- All the processors have access to all the data D := {d_1, …, d_n}

- Pieces of data can be overwritten

- Processors need to lock datapoints before using them

For each processor p:

1. lock(d_i)
2. process(d_i)
3. unlock(d_i)

Page 14

Big Data Analytics 3. Parallel programming paradigms

Word Count Example

Given a corpus of text documents

D := {d_1, …, d_n},

each containing a sequence of words

“w_1, …, w_m”

pooled from a set W of possible words, the task is to generate word counts for each word in the corpus.

Page 15

Big Data Analytics 3. Parallel programming paradigms

Word Count - Shared Memory

Shared vector for word counts: c ∈ R^|W|

c ← {0}^|W|

Each processor:

1. access a document d ∈ D
2. for each word w_i in document d:
3.   lock(c_i)
4.   c_i ← c_i + 1
5.   unlock(c_i)
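The shared-memory scheme above can be sketched with threads; here a `ConcurrentHashMap` of counters plays the role of the lock-protected count vector, since updates lock only the touched entry, mirroring the per-entry lock(c_i)/unlock(c_i) steps (class and variable names are ours):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class SharedMemoryWordCount {

    public static Map<String, AtomicLong> count(List<String> documents, int threads)
            throws InterruptedException {
        // Shared count vector c: all worker threads write into the same map.
        ConcurrentHashMap<String, AtomicLong> counts = new ConcurrentHashMap<>();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                // Each thread picks every "threads"-th document.
                for (int d = id; d < documents.size(); d += threads) {
                    for (String w : documents.get(d).split("\\s+")) {
                        // Atomic per-word increment: the lock/unlock of the slide.
                        counts.computeIfAbsent(w, k -> new AtomicLong()).incrementAndGet();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        return counts;
    }
}
```

Contended words (every thread incrementing the same counter) are exactly where the locking overhead discussed on the next slide shows up.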

Page 16

Big Data Analytics 3. Parallel programming paradigms

Paradigms - Shared Memory

- Inefficient in a distributed scenario

- Results of a process can easily be overwritten

- Possible long waiting times for a piece of data because of the lock mechanism

Page 17

Big Data Analytics 3. Parallel programming paradigms

Paradigms - Message passing

- Each processor sees only one part of the data:

  π(D, p) := {d_p, …, d_{p + n/p − 1}}

- Each processor works on its partition

- Results are exchanged between processors (message passing)

For each processor p:

1. For each d ∈ π(D, p)
2.   process(d)
3. Communicate results

Page 18

Big Data Analytics 3. Parallel programming paradigms

Word Count - Message passing

We need to define two types of processes:

1. Slave – counts the words on a subset of documents and informs the master

2. Master – gathers counts from the slaves and sums them up

Page 19

Big Data Analytics 3. Parallel programming paradigms

Word Count - Message passing

Slave:

Local memory:

1. subset of documents: π(D, p) := {d_p, …, d_{p + n/p − 1}}
2. Address of the master: addr_master
3. Local word counts: c ∈ R^|W|

c ← {0}^|W|

1. for each document d ∈ π(D, p)
2.   for each word w_i in document d:
3.     c_i ← c_i + 1

Send message send(addr_master, c)

Page 20

Big Data Analytics 3. Parallel programming paradigms

Word Count - Message passing

Master:

Local memory:

1. Global word counts: c_global ∈ R^|W|
2. List of slaves: S

c_global ← {0}^|W|
s ← {0}^|S|

For each received message (p, c_p):

1. c_global ← c_global + c_p
2. s_p ← 1
3. if ||s||_1 = |S| return c_global
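The master/slave exchange above can be sketched in a single process, with a `BlockingQueue` standing in for the network channel (that substitution, and all names, are ours; MPI-style libraries provide real send/receive primitives):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class MessagePassingWordCount {

    public static Map<String, Integer> run(List<String> documents, int slaves)
            throws InterruptedException {
        // "Mailbox" of the master: slaves send their local counts here.
        BlockingQueue<Map<String, Integer>> mailbox = new LinkedBlockingQueue<>();

        // Slaves: each counts words on its own partition, then sends the result.
        for (int p = 0; p < slaves; p++) {
            final int id = p;
            new Thread(() -> {
                Map<String, Integer> local = new HashMap<>();
                for (int d = id; d < documents.size(); d += slaves) {
                    for (String w : documents.get(d).split("\\s+")) {
                        local.merge(w, 1, Integer::sum);
                    }
                }
                mailbox.add(local); // send(addr_master, c)
            }).start();
        }

        // Master: sums incoming counts until every slave has reported once.
        Map<String, Integer> global = new HashMap<>();
        for (int received = 0; received < slaves; received++) {
            for (Map.Entry<String, Integer> e : mailbox.take().entrySet()) {
                global.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return global;
    }
}
```

Note that no entry is ever shared between slaves: each works on private memory and only communicates finished results, which is the key difference from the shared-memory scheme.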

Page 21

Big Data Analytics 3. Parallel programming paradigms

Paradigms - Message passing

- We need to manually assign master and slave roles to each processor

- Partitioning of the data needs to be done manually

- Implementations like OpenMPI only provide services to exchange messages

Page 22

Big Data Analytics 4. Map-Reduce

Outline

1. Introduction

2. Parallel Computing

3. Parallel programming paradigms

4. Map-Reduce

Page 23

Big Data Analytics 4. Map-Reduce

Map-Reduce

- Builds on the distributed message passing paradigm

- Assumes the data is partitioned over the nodes

- Pipelined procedure:

  1. Map phase
  2. Reduce phase

- High-level abstraction: the programmer only specifies a map and a reduce routine

Page 24

Big Data Analytics 4. Map-Reduce

Map-Reduce

- No need to worry about how many processors are available

- No need to specify which ones will be mappers and which ones will be reducers

Page 25

Big Data Analytics 4. Map-Reduce

Key-Value input data

- Map-Reduce requires the data to be stored in a key-value format

- Natural if one works with column databases

- Examples:

  Key      | Value
  ---------|---------------
  document | array of words
  document | word
  user     | movies
  user     | friends
  user     | tweet

Page 26

Big Data Analytics 4. Map-Reduce

The Paradigm - Formally

Given

- A set of input keys I
- A set of output keys O
- A set of input values X
- A set of intermediate values V
- A set of output values Y

we can define:

map : I × X → P(O × V)

and

reduce : O × P(V) → O × Y

where P denotes the powerset.

Page 27

Big Data Analytics 4. Map-Reduce

The Paradigm - Informally

1. Each mapper transforms a set of key-value pairs into a list of (output key, intermediate value) pairs

2. All intermediate values are grouped according to their output keys

3. Each reducer receives all the intermediate values associated with a given key

4. Each reducer associates one final value with each key

Page 28

Big Data Analytics 4. Map-Reduce

Word Count Example

Map:

- Input: document-word list pairs
- Output: word-count pairs

(d_k, “w_1, …, w_m”) ↦ [(w_i, c_i)]

Reduce:

- Input: word-(count list) pairs
- Output: word-count pairs

(w_i, [c_i]) ↦ (w_i, Σ_{c ∈ [c_i]} c)
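These two functions can be run by a toy single-machine engine; the sketch below is our illustration of the paradigm (grouping done in one process), not a real distributed implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MiniMapReduce {

    // map: (document id, text) -> list of (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : text.split("\\s+")) {
            out.add(Map.entry(w, 1));
        }
        return out;
    }

    // reduce: (word, [counts]) -> sum of the counts
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static Map<String, Integer> run(Map<String, String> documents) {
        // Shuffle: group intermediate values by their output key.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (Map.Entry<String, Integer> kv : map(doc.getKey(), doc.getValue())) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        // Reduce phase: one final value per key.
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }
}
```

The engine never needs to know how many mappers or reducers exist; only `map` and `reduce` are problem-specific, which is exactly the abstraction the paradigm promises.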

Page 29

Big Data Analytics 4. Map-Reduce

Word Count Example

Input documents:

  (d1, “love ain't no stranger”)
  (d2, “crying in the rain”)
  (d3, “looking for love”)
  (d4, “I'm crying”)
  (d5, “the deeper the love”)
  (d6, “is this love”)
  (d7, “Ain't no love”)

Mapper output (each mapper emits word-count pairs for its documents; the pairs are then shuffled to the reducers by key):

  (love, 1), (ain't, 1), (stranger, 1), (crying, 1), (rain, 1), (looking, 1),
  (love, 2), (crying, 1), (deeper, 1), (this, 1), (love, 2), (ain't, 1)

Reducer output (one final count per word):

  (love, 5), (stranger, 1), (crying, 2), (ain't, 2), (rain, 1), (looking, 1), (deeper, 1), (this, 1)

Page 30

Big Data Analytics 4. Map-Reduce

Map

public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}

Page 31

Big Data Analytics 4. Map-Reduce

Reduce

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Page 32

Big Data Analytics 4. Map-Reduce

Execution snippet

public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);

  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);

  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  JobClient.runJob(conf);
}

Page 33

Big Data Analytics 4. Map-Reduce

Considerations

- Maps are executed in parallel

- Reduces are executed in parallel

- Bottleneck: reducers can only execute after all the mappers are finished

Page 34

Big Data Analytics 4. Map-Reduce

Fault tolerance

When the master node detects node failures:

- Re-executes completed and in-progress map tasks

- Re-executes in-progress reduce tasks

When the master node detects particular key-value pairs that cause mappers to crash:

- Problematic pairs are skipped in the execution

Page 35

Big Data Analytics 4. Map-Reduce

Parallel Efficiency of Map-Reduce

- We have p processors for performing map and reduce operations

- Time to perform a task T on data D: t(T, 1) = wD

- Time for producing the intermediate data σD after the map phase: t(T_inter, 1) = σD

- Overheads:

  - intermediate data per mapper: σD/p

  - each of the p reducers needs to read one p-th of the data written by each of the p mappers:

    (σD/p) · (1/p) · p = σD/p

- Time for performing the task with Map-Reduce:

  t_MR(T, p) = wD/p + 2K · σD/p

  K – constant representing the overhead of IO operations (reading and writing data to disk)

Page 36

Big Data Analytics 4. Map-Reduce

Parallel Efficiency of Map-Reduce

- Time for performing the task on one processor: wD

- Time for performing the task with p processors on Map-Reduce:

  t_MR(T, p) = wD/p + 2K · σD/p

- Efficiency equation:

  e(T, p) = t(T, 1) / (p · t(T, p))

- Efficiency of Map-Reduce:

  e_MR(T, p) = wD / (p · (wD/p + 2K · σD/p))

Page 37

Big Data Analytics 4. Map-Reduce

Parallel Efficiency of Map-Reduce

e_MR(T, p) = wD / (p · (wD/p + 2K · σD/p))

           = wD / (wD + 2K · σD)            (cancel p)

           = 1 / (1 + 2K · (σ/w))           (divide numerator and denominator by wD)

Page 38

Big Data Analytics 4. Map-Reduce

Parallel Efficiency of Map-Reduce

e_MR(T, p) = 1 / (1 + 2K · (σ/w))

- Apparently the efficiency is independent of p

- High efficiency can thus be maintained even with a large number of processors

- If σ is high (too much intermediate data), the efficiency suffers

- In many cases σ depends on p
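These observations can be checked numerically; a small sketch, with illustrative values of K, σ and w (all made up for the example):

```java
public class MapReduceEfficiency {

    // e_MR = 1 / (1 + 2*K*(sigma/w)): independent of p,
    // but degraded by IO overhead K and the intermediate-data ratio sigma/w.
    public static double efficiency(double k, double sigma, double w) {
        return 1.0 / (1.0 + 2.0 * k * (sigma / w));
    }

    public static void main(String[] args) {
        // Little intermediate data relative to work: efficiency near 1.
        System.out.println(efficiency(1.0, 0.05, 1.0));
        // As much intermediate data as work (large sigma): efficiency drops sharply.
        System.out.println(efficiency(1.0, 1.0, 1.0));
    }
}
```

Word count is a favourable case: the combiner shrinks the intermediate data, keeping σ/w small; jobs that emit large intermediate results per input record sit at the unfavourable end.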
