
Big Data Analytics
4. Map Reduce I

Lars Schmidt-Thieme

Information Systems and Machine Learning Lab (ISMLL), Institute of Computer Science

University of Hildesheim, Germany

Original slides by Lucas Rego Drumond, ISMLL


Outline

1. Introduction

2. Parallel programming paradigms

3. Map-Reduce

1. Introduction

Overview

[Overview diagram: a layered stack, bottom to top — Part I: Distributed File System; Part II: Distributed Database; Part III: Large Scale Computational Models; Machine Learning Algorithms on top.]


Why do we need a Computational Model?

- Our data is nicely stored in a distributed infrastructure
- We have a number of computers at our disposal
- We want our analytics software to take advantage of all this computing power
- When programming, we want to focus on understanding our data, not our infrastructure



Shared Memory Infrastructure

[Diagram: several processors all attached to one shared memory.]


Distributed Infrastructure

[Diagram: several nodes, each with its own processor and memory, connected by a network.]

2. Parallel programming paradigms

Parallel Computing principles

- We have p processors available to execute a task T
- Ideally: the more processors, the faster a task is executed
- Reality: synchronisation and communication costs
- Speedup s(T, p) of a task T by using p processors:
  - Let t(T, p) be the time needed to execute T using p processors
  - The speedup is given by:

        s(T, p) = t(T, 1) / t(T, p)


Parallel Computing principles

- We have p processors available to execute a task T
- Efficiency e(T, p) of a task T by using p processors:

        e(T, p) = t(T, 1) / (p · t(T, p))
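
To make the two definitions concrete, here is a minimal Java sketch (not from the slides) that plugs hypothetical timings into them; the numbers are purely illustrative.

    public class ParallelMetrics {
        // s(T,p) = t(T,1) / t(T,p)
        static double speedup(double t1, double tp) { return t1 / tp; }
        // e(T,p) = t(T,1) / (p * t(T,p))
        static double efficiency(double t1, double tp, int p) { return t1 / (p * tp); }

        public static void main(String[] args) {
            double t1 = 120.0, t8 = 20.0;  // assumed measured times in seconds
            System.out.println("s(T,8) = " + speedup(t1, t8));       // 6.0
            System.out.println("e(T,8) = " + efficiency(t1, t8, 8)); // 0.75
        }
    }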


Considerations

- It is not worth using a lot of processors for solving small problems
- Algorithms should increase efficiency with problem size


Paradigms - Shared Memory

- All the processors have access to all the data D := {d1, ..., dn}
- Pieces of data can be overwritten
- Processors need to lock data points before using them

For each processor p:
1. lock(d_i)
2. process(d_i)
3. unlock(d_i)


Word Count Example

Given a corpus of text documents

    D := {d1, ..., dn},

each containing a sequence of words

    "w1, ..., wm"

drawn from a set W of possible words, the task is to generate word counts for each word in the corpus.


Word Count - Shared Memory

Shared vector of word counts: c ∈ R^|W|

c ← {0}^|W|

Each processor:
1. access a document d ∈ D
2. for each word w_i in document d:
   2.1 lock(c_i)
   2.2 c_i ← c_i + 1
   2.3 unlock(c_i)
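
A minimal Java sketch of this shared-memory scheme (not from the slides): worker threads share one count table, and the lock/increment/unlock of steps 2.1-2.3 is realized with atomic counters in a ConcurrentHashMap; the toy corpus and all names are illustrative.

    import java.util.List;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    public class SharedMemoryWordCount {
        // Shared count vector c: one counter per word, shared by all worker threads.
        static final ConcurrentHashMap<String, AtomicLong> counts = new ConcurrentHashMap<>();

        public static void main(String[] args) throws InterruptedException {
            List<String> docs = List.of("love ain't no stranger", "crying in the rain",
                                        "looking for love", "I'm crying");
            Thread[] workers = new Thread[docs.size()];
            for (int i = 0; i < docs.size(); i++) {
                final String d = docs.get(i);
                workers[i] = new Thread(() -> {
                    for (String w : d.split("\\s+"))
                        // the atomic increment plays the role of lock(c_i); c_i <- c_i + 1; unlock(c_i)
                        counts.computeIfAbsent(w, k -> new AtomicLong()).incrementAndGet();
                });
                workers[i].start();
            }
            for (Thread t : workers) t.join();
            counts.forEach((w, c) -> System.out.println(w + " " + c));
        }
    }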

Paradigms - Shared Memory

- Inefficient in a distributed scenario
- Results of a process can easily be overwritten
- Possible long waiting times for a piece of data because of the lock mechanism


Paradigms - Message passing

- Each processor sees only one part of the data: π(D, p) := {d_p, ..., d_{p + n/p − 1}}
- Each processor works on its partition
- Results are exchanged between processors (message passing)

For each processor p:
1. For each d ∈ π(D, p)
   1.1 process(d)
2. Communicate results


Word Count - Message passing

We need to define two types of processes:
1. Slave - counts the words on a subset of documents and informs the master
2. Master - gathers counts from the slaves and sums them up

Word Count - Message passing

Slave:
Local memory:
- subset of documents: π(D, p) := {d_p, ..., d_{p + n/p − 1}}
- address of the master: addr_master
- local word counts: c ∈ R^|W|

1. c ← {0}^|W|
2. for each document d ∈ π(D, p):
     for each word w_i in document d:
       c_i ← c_i + 1
3. Send message send(addr_master, c)

Word Count - Message passing

Master:
Local memory:
- global word counts: c_global ∈ R^|W|
- list of slaves: S

c_global ← {0}^|W|
s ← {0}^|S|

For each received message (p, c_p):
1. c_global ← c_global + c_p
2. s_p ← 1
3. if ||s||_1 = |S| return c_global
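
A minimal single-process sketch of this slave/master protocol (not from the slides), with threads standing in for processors and a BlockingQueue standing in for the network; all names and the toy corpus are illustrative, and this is not an MPI API.

    import java.util.*;
    import java.util.concurrent.*;

    public class MessagePassingWordCount {
        // A "message" is the pair (slave id, local word counts c_p).
        record Message(int slave, Map<String, Long> counts) {}

        public static void main(String[] args) throws InterruptedException {
            List<List<String>> partitions = List.of(
                List.of("love ain't no stranger", "crying in the rain"),
                List.of("looking for love", "the deeper the love"));
            BlockingQueue<Message> channel = new LinkedBlockingQueue<>();  // stands in for send()

            // Slaves: count words on their own partition, then send the result to the master.
            for (int p = 0; p < partitions.size(); p++) {
                final int id = p;
                new Thread(() -> {
                    Map<String, Long> c = new HashMap<>();
                    for (String d : partitions.get(id))
                        for (String w : d.split("\\s+")) c.merge(w, 1L, Long::sum);
                    channel.add(new Message(id, c));   // send(addr_master, c)
                }).start();
            }

            // Master: merge one message per slave, then report the global counts.
            Map<String, Long> global = new HashMap<>();
            for (int received = 0; received < partitions.size(); received++) {
                Message m = channel.take();
                m.counts().forEach((w, c) -> global.merge(w, c, Long::sum));
            }
            global.forEach((w, c) -> System.out.println(w + " " + c));
        }
    }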

Paradigms - Message passing

- We need to manually assign master and slave roles for each processor
- The partition of the data needs to be done manually
- Implementations like OpenMPI only provide services to exchange messages


3. Map-Reduce

Map-Reduce

- Builds on the distributed message passing paradigm
- Assumes the data is partitioned over the nodes
- Pipelined procedure:
  1. Map phase
  2. Reduce phase
- High level abstraction: the programmer only specifies a map and a reduce routine


- No need to worry about how many processors are available
- No need to specify which ones will be mappers and which ones will be reducers

Key-Value input data

- Map-Reduce requires the data to be stored in a key-value format
- Natural if one works with column databases
- Examples:

      Key        Value
      document   array of words
      document   word
      user       movies
      user       friends
      user       tweet


The Paradigm - Formally

Given
- a set of input keys I
- a set of output keys O
- a set of input values X
- a set of intermediate values V
- a set of output values Y

we can define

    map : I × X → P(O × V)

and

    reduce : O × P(V) → O × Y

where P denotes the powerset.
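
These two signatures can be mirrored almost literally as Java interfaces; the sketch below only illustrates the types (the names MapFn and ReduceFn are made up) and is not the Hadoop API shown later.

    import java.util.Collection;
    import java.util.Map;

    // map : I × X → P(O × V): one input pair yields a set of (output key, intermediate value) pairs
    interface MapFn<I, X, O, V> {
        Collection<Map.Entry<O, V>> map(I inputKey, X inputValue);
    }

    // reduce : O × P(V) → O × Y: an output key and all its intermediate values yield one final value
    interface ReduceFn<O, V, Y> {
        Map.Entry<O, Y> reduce(O outputKey, Collection<V> values);
    }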

The Paradigm - Informally

1. Each mapper transforms some key-value pairs into a set of pairs of an output key and an intermediate value
2. All intermediate values are grouped according to their output keys
3. Each reducer receives some pairs of a key and all its intermediate values
4. Each reducer, for each key, aggregates all its intermediate values to one final value

Word Count Example

Map:
- Input: document-word list pairs
- Output: word-count pairs

    (d_k, "w1, ..., wm") ↦ [(w_i, c_i)]

Reduce:
- Input: word-(count list) pairs
- Output: word-count pairs

    (w_i, [c_i]) ↦ (w_i, Σ_{c ∈ [c_i]} c)

Word Count Example

Input documents:
    (d1, "love ain't no stranger"), (d2, "crying in the rain"), (d3, "looking for love"),
    (d4, "I'm crying"), (d5, "the deeper the love"), (d6, "is this love"), (d7, "Ain't no love")

Mapper outputs (one mapper per partition of the documents):
    (love, 1), (ain't, 1), (stranger, 1), (crying, 1), (rain, 1),
    (looking, 1), (love, 2), (crying, 1), (deeper, 1),
    (this, 1), (love, 2), (ain't, 1)

Reducer outputs (after grouping the intermediate pairs by word and summing):
    (love, 5), (stranger, 1), (crying, 2), (ain't, 2), (rain, 1), (looking, 1), (deeper, 1), (this, 1)

Map

    // Mapper: for each input line, emit (word, 1) for every token.
    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {

        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);

        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          output.collect(word, one);  // emit intermediate pair (word, 1)
        }
      }
    }

Reduce

    // Reducer: sum all counts received for a word and emit (word, sum).
    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {

        int sum = 0;
        while (values.hasNext())
          sum += values.next().get();

        output.collect(key, new IntWritable(sum));
      }
    }

Execution snippet

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");

      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);

      conf.setMapperClass(Map.class);
      conf.setCombinerClass(Reduce.class);  // combiner: local reduce on each mapper's output
      conf.setReducerClass(Reduce.class);

      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
    }
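
Assuming the Map, Reduce, and driver classes above are packaged into a jar, such an old-API Hadoop job is typically launched along the lines of

    hadoop jar wordcount.jar WordCount <input dir> <output dir>

where the two paths end up in args[0] and args[1] of main; the jar name here is a placeholder.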

Considerations

- Maps are executed in parallel
- Reduces are executed in parallel
- Bottleneck: reducers can only execute after all the mappers have finished

Fault tolerance

When the master node detects node failures:
- Re-executes completed and in-progress map tasks
- Re-executes in-progress reduce tasks

When the master node detects particular key-value pairs that cause mappers to crash:
- The problematic pairs are skipped in the execution


Parallel Efficiency of Map-Reduce

- We have p processors for performing map and reduce operations
- Time to perform a task T on data D: t(T, 1) = wD
- Time for producing the intermediate data σD after the map phase: t(T_inter, 1) = σD
- Overheads:
  - intermediate data per mapper: σD/p
  - each of the p reducers needs to read one p-th of the data written by each of the p mappers: (σD/p) · (1/p) · p = σD/p
- Time for performing the task with Map-Reduce:

        t_MR(T, p) = wD/p + 2K · σD/p

  where K is a constant representing the overhead of IO operations (reading and writing data to disk).


Parallel Efficiency of Map-Reduce

- Time for performing the task on one processor: wD
- Time for performing the task with p processors on Map-Reduce:

        t_MR(T, p) = wD/p + 2K · σD/p

- Efficiency equation:

        e(T, p) = t(T, 1) / (p · t(T, p))

- Efficiency of Map-Reduce:

        e_MR(T, p) = wD / (p · (wD/p + 2K · σD/p))


Parallel Efficiency of Map-Reduce

        e_MR(T, p) = wD / (p · (wD/p + 2K · σD/p))
                   = wD / (wD + 2K · σD)
                   = 1 / (1 + 2K · σ/w)


Parallel Efficiency of Map-Reduce

        e_MR(T, p) = 1 / (1 + 2K · σ/w)

- Apparently the efficiency is independent of p
- High speedups can be achieved with a large number of processors
- If σ is high (too much intermediate data), the efficiency deteriorates
- In many cases σ depends on p
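
A tiny numeric check of this formula with made-up values for K and σ/w (the specific numbers are not from the slides):

    public class MapReduceEfficiency {
        // e_MR = 1 / (1 + 2·K·σ/w); note that p does not appear.
        static double eMR(double K, double sigmaOverW) { return 1.0 / (1.0 + 2.0 * K * sigmaOverW); }

        public static void main(String[] args) {
            System.out.println(eMR(1.0, 0.1)); // little intermediate data: ≈ 0.83
            System.out.println(eMR(1.0, 1.0)); // as much intermediate data as input: ≈ 0.33
        }
    }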
