MapReduce - USTH Moodle

What Why How

MapReduce

Tran Giang Son, [email protected]

ICT Department, USTH

MapReduce Tran Giang Son, [email protected] 1 / 44

What

What?

• A simple programming model that applies to many large-scale computing problems
• Parallel computation
• Workload distribution
• Load balancing
• Fault tolerance
• Not a language
• Not a library

What?

• Example
  • Count the number of students inside the USTH building at the moment?
  • Traditional way?
  • Smart way?
  • Smarter way?

What?

• Traditional way:
  • Place a counting table at the parking entrance of USTH
  • Ask everyone to go down there, queue up, and be counted

Problem: slow, bottleneck at the counting table

What?

• Smart way: place counting tables at every possible exit of the USTH building
  • Emergency exit near the museum
  • Parking exits on the ground floor (2)
  • Stair exits on the second floor (2)
• Hit the fire alarm
• Wait and count
• Still a bottleneck at the counting tables

What?

• Smarter way:
  • Go to each classroom
  • Ask the class monitor to count
  • Aggregate the results in a second step
• Less intrusive, more work done, can be better parallelized
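
The "smarter way" can be sketched in a few lines of Python. The room names and attendance lists below are made up for illustration: each class monitor counts only their own room, and the per-room counts are aggregated in a second step.

```python
# Hypothetical attendance data: room name -> students currently in the room.
classrooms = {
    "room_101": ["An", "Binh", "Chi"],
    "room_102": ["Dung", "Giang"],
    "room_201": ["Hoa", "Khanh", "Linh", "Minh"],
}

def monitor_count(students):
    """Step 1: a class monitor counts only their own room."""
    return len(students)

# Step 1 is independent per room, so the rooms could be counted in parallel.
per_room = {room: monitor_count(students) for room, students in classrooms.items()}

# Step 2: aggregate the partial counts into one total.
total = sum(per_room.values())

print(per_room)
print(total)
```

No room needs to know about any other room during step 1, which is what makes this approach easy to parallelize.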

What?

• Two operations
  • map(): “one to one” transform of each element in a set

    $\mathrm{map}_f(S) = \{\, f(x) \mid x \in S \,\}$

  • reduce(): “many to one” transform of a set of elements

    $\mathrm{reduce}_f(S) = f(\{\, x \mid x \in S \,\})$
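
As a small illustration of the two operations, here is a sketch using Python's built-in `map()` and `functools.reduce()`. The input values and the two functions (squaring, then summing) are arbitrary choices for the example, and a list stands in for the set S so that the order is predictable.

```python
from functools import reduce

S = [1, 2, 3, 4]

# map(): apply f to every element independently ("one to one").
squares = list(map(lambda x: x * x, S))

# reduce(): fold the whole collection into a single value ("many to one").
total = reduce(lambda acc, x: acc + x, squares, 0)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```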

map()

• Pre-map()
  • Reads data from source
• Transform

Why

Data explosion

• A lot of data
  • 130+ trillion webpages (2016)
  • 20 KB each
  • 2,600,000+ TB

Data explosion

• Hard drive: 100 MB/s sequential read
  • ~824 years to read
• SSD
  • SATA3, 500 MB/s sequential read: ~165 years
  • M.2, 3500 MB/s sequential read: ~23.5 years
• Processing this data
  • Sorting / Searching / Indexing / Classification
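
The read-time arithmetic can be checked with a short script. This is only a back-of-the-envelope sketch: it assumes decimal units (1 KB = 1000 bytes, 1 MB = 10^6 bytes, 1 TB = 10^12 bytes), a 365-day year, and a single device reading the whole data set sequentially without any overhead.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

pages = 130e12            # 130 trillion webpages
page_size_bytes = 20e3    # 20 KB each (decimal units assumed)
total_bytes = pages * page_size_bytes
total_tb = total_bytes / 1e12

def years_to_read(n_bytes, mb_per_second):
    """Years one device needs to read n_bytes sequentially."""
    seconds = n_bytes / (mb_per_second * 1e6)
    return seconds / SECONDS_PER_YEAR

print(f"total: {total_tb:,.0f} TB")
for name, speed in [("HDD", 100), ("SATA3 SSD", 500), ("M.2 SSD", 3500)]:
    print(f"{name}: {years_to_read(total_bytes, speed):,.1f} years")
```

Whatever the exact device, one reader is far too slow, which motivates spreading the work over many machines.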

Why MapReduce?

• Traditional programming is serial
• Break processing into independent batches
• Process concurrently
• Aggregate the results
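
The three steps (break into batches, process concurrently, aggregate) can be sketched on one machine with Python's standard `concurrent.futures` module. The helper names `split` and `process` and the sum-of-squares workload are illustrative assumptions, not part of any MapReduce framework.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, n_batches):
    """Break the input into roughly equal, independent batches."""
    size = (len(data) + n_batches - 1) // n_batches
    return [data[i:i + size] for i in range(0, len(data), size)]

def process(batch):
    """Work on one batch without looking at any other batch."""
    return sum(x * x for x in batch)

data = list(range(1, 101))
batches = split(data, 4)

# Process the batches concurrently, one worker per batch.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process, batches))

# Aggregate the partial results.
result = sum(partials)
print(result)
```

Threads are used here only to keep the sketch short; for CPU-bound work a real job would run the batches in separate processes or on separate machines.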

Parallelization

• Multi-core
• Multi-CPU
• Cluster
• Grid

Parallelization

Key        Value
Name       Sunway TaihuLight
Nodes      40,960
CPU        SW26010 (one per node), 260 cores (256 compute + 4 management), 1.45 GHz
Cores      10,649,600
Memory     1.31 PB (1,310 TB)
Storage    20 PB (20,000 TB)
Peak       125 PFLOPS
Linpack    93.01 PFLOPS
Power      15 MW
Location   National Supercomputing Center, Wuxi, China
Active     June 2016

Why MapReduce?

Challenges:

• Breaking the problem into smaller tasks
• Assigning tasks to machines?
• Partitioning and distributing data?
• Sharing intermediate data?
• Coordinating synchronization? Scheduling? Fault tolerance?

Why MapReduce?

• Scale “out”, not scale “up”
  • E.g. more workers, not more levels of management
• Failures are common
• Process data sequentially, not randomly

How

Implementations

• Google
  • Internal
  • Proprietary
• Apache Hadoop MapReduce
  • Most common open source implementation
• Amazon Elastic MapReduce
  • On EC2

Who does what?

• Implement two methods
  • map(): Mapper
  • reduce(): Reducer

MapReduce architecture

[Figure: MapReduce architecture. Input splits 0 to 5 (HDFS) feed the Map tasks in the map phase; intermediate results are stored locally, then shuffled and sorted; the Reduce tasks in the reduce phase write output 0 and output 1 (HDFS).]
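
The three phases of the architecture (map, shuffle & sort, reduce) can be imitated in memory with a few lines of Python. This is a toy sketch, not how Hadoop is implemented: `run_mapreduce` is a hypothetical helper, the splits are plain lists instead of HDFS blocks, and everything runs sequentially in one process. The sample job (longest line per first letter) is an arbitrary choice.

```python
from collections import defaultdict

def run_mapreduce(splits, mapper, reducer):
    """Run map, shuffle & sort, and reduce over in-memory input splits."""
    # Map phase: each split is processed independently.
    intermediate = []
    for split in splits:
        intermediate.extend(mapper(split))

    # Shuffle & sort: group the values by key, then order the keys.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one reducer call per key.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}

# Hypothetical job: length of the longest line per first letter.
def mapper(split):
    return [(line[0], len(line)) for line in split if line]

def reducer(key, values):
    return max(values)

splits = [["apple", "avocado"], ["banana", "blueberry", "apricot"]]
print(run_mapreduce(splits, mapper, reducer))  # {'a': 7, 'b': 9}
```

The user supplies only `mapper` and `reducer`; the grouping in the middle is the part a real runtime performs across the network.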

Execution Framework

• The execution framework (runtime) handles everything else
  • Scheduling: who does map()? who does reduce()?
  • Data distribution: moves data to the processes (workers)
  • Synchronization: gathers, sorts
  • Fault tolerance: detects failures, restarts
  • Distributed file system

Who does what?

• A “master” controls execution of “slaves”
• Mappers are put near their input block
  • Minimize network usage
• Mappers persist their outputs to disk before passing them to the reducers
  • For fault tolerance

Fault Tolerance

• Task crashes
  • Retry on another node
    • map()? OK: no dependencies
    • reduce()? OK: map outputs are saved on disk
• Important: task independence
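
The "retry on another node" idea relies on tasks being independent: a task that depends only on its input can simply be run again somewhere else. A minimal sketch, in which the node names, the `run_with_retry` helper and the simulated crash are all hypothetical:

```python
def run_with_retry(task, task_input, nodes, max_attempts=3):
    """Try a task on successive nodes until one attempt succeeds."""
    last_error = None
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]
        try:
            # The task depends only on its input, so a rerun is safe.
            return task(node, task_input)
        except RuntimeError as error:
            last_error = error
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error

# Hypothetical task: node "n1" always crashes, the other nodes work.
def count_words(node, line):
    if node == "n1":
        raise RuntimeError(f"{node} crashed")
    return len(line.split())

result = run_with_retry(count_words, "three witches watch", ["n1", "n2", "n3"])
print(result)  # 3
```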

Fault Tolerance

• Node crashes
  • Start tasks on a new node
    • map()? restart
    • reduce()? nothing else

Fault Tolerance

• Task becomes slow
  • Launch the same task on another node
  • Use the result of whichever finishes first
  • Kill the other one
• Common in large clusters
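
Running a backup copy of a slow task can be imitated with threads. In this sketch the "nodes" are simulated by sleep delays, and `run_speculatively` is a hypothetical helper; Python threads cannot be killed, so the slow copy is merely ignored, whereas a real runtime would kill it.

```python
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def run_speculatively(task, task_input, delays):
    """Run the same task once per simulated node and keep the first result."""
    def attempt(delay):
        time.sleep(delay)  # simulated node speed
        return task(task_input)

    pool = ThreadPoolExecutor(max_workers=len(delays))
    futures = [pool.submit(attempt, delay) for delay in delays]
    done, pending = wait(futures, return_when=FIRST_COMPLETED)
    for future in pending:
        # Best effort only: an already running thread keeps going.
        future.cancel()
    pool.shutdown(wait=False)
    return next(iter(done)).result()

# The first "node" is a straggler; the backup copy is fast.
result = run_speculatively(lambda s: s.upper(), "watch", delays=[0.5, 0.01])
print(result)  # WATCH
```

Both copies compute the same answer, so it does not matter which one wins; this again depends on tasks being independent and free of side effects.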

Extras

• Extra optional supporting functions
  • partition(): divide key space for parallelization
  • combine(): mini reducers to combine after map
• Barriers
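
A rough sketch of what the two optional functions might look like for a word-count job. The function bodies are assumptions made for illustration: `partition` hashes the key with `zlib.crc32` (chosen because it gives the same value on every run, unlike Python's built-in `hash()` for strings), and `combine` pre-sums the pairs produced by one mapper.

```python
import zlib
from collections import Counter

def partition(key, n_reducers):
    """Map a key to a reducer index; crc32 is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % n_reducers

def combine(pairs):
    """Pre-sum <word, count> pairs on the mapper side ("mini reducer")."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

map_output = [("three", 1), ("witches", 1), ("watch", 1), ("three", 1)]
combined = combine(map_output)
print(combined)  # [('three', 2), ('witches', 1), ('watch', 1)]

# Every pair with the same key is sent to the same reducer.
n_reducers = 2
for word, count in combined:
    print(word, count, "-> reducer", partition(word, n_reducers))
```

Combining before the shuffle reduces the number of pairs sent over the network; it is only valid when the reduce operation, such as a sum, can be applied to partial results.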

Example: Word Count

• The classic example for MapReduce
• Input: a large text file
• Output: number of occurrences of each word

Example: Word Count

• map(): count occurrences of words in a single line

1. three witches watch three swatch watches

<three, 1> <witches, 1> <watch, 1> <three, 1> <swatch, 1> <watches, 1>

2. which witch watches which swatch watch

<which, 1> <witch, 1> <watches, 1> <which, 1> <swatch, 1> <watch, 1>

Output of map() over both lines:

<three, 1> <witches, 1> <watch, 1> <three, 1> <swatch, 1> <watches, 1> <which, 1> <witch, 1> <watches, 1> <which, 1> <swatch, 1> <watch, 1>

Example: Word Count

• Group pairs that have the same key K

<three, 1> <witches, 1> <watch, 1> <three, 1> <swatch, 1> <watches, 1> <which, 1> <witch, 1> <watches, 1> <which, 1> <swatch, 1> <watch, 1>

<three, 1> <three, 1> <witches, 1> <watch, 1> <watch, 1> <swatch, 1> <swatch, 1> <watches, 1> <watches, 1> <which, 1> <which, 1> <witch, 1>

Example: Word Count

• reduce(): combine the occurrences of each word

<three, 1> <three, 1> <witches, 1> <watch, 1> <watch, 1> <swatch, 1> <swatch, 1> <watches, 1> <watches, 1> <which, 1> <which, 1> <witch, 1>

<three, 2> <witches, 1> <watch, 2> <swatch, 2> <watches, 2> <which, 2> <witch, 1>

Example: Word Count

1. three witches watch three swatch watches

2. which witch watches which swatch watch

<three, 1> <witches, 1> <watch, 1> <three, 1> <swatch, 1> <watches, 1>

<which, 1> <witch, 1> <watches, 1> <which, 1> <swatch, 1> <watch, 1>

<three, 1> <three, 1> <witches, 1> <watch, 1> <watch, 1> <swatch, 1> <swatch, 1> <watches, 1> <watches, 1> <which, 1> <which, 1> <witch, 1>

<three, 2> <witches, 1> <watch, 2> <swatch, 2> <watches, 2> <which, 2> <witch, 1>
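
The whole word-count pipeline fits in a short, framework-free Python sketch. The function names `map_line`, `group_by_key` and `reduce_word` are made up for this example; a real framework would distribute these steps over many machines.

```python
from collections import defaultdict

lines = [
    "three witches watch three swatch watches",
    "which witch watches which swatch watch",
]

def map_line(line):
    """map(): emit a <word, 1> pair for every word in one line."""
    return [(word, 1) for word in line.split()]

def group_by_key(pairs):
    """Shuffle: collect all values that share the same key."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_word(word, counts):
    """reduce(): sum the counts of one word."""
    return word, sum(counts)

pairs = [pair for line in lines for pair in map_line(line)]
groups = group_by_key(pairs)
word_counts = dict(reduce_word(word, counts) for word, counts in groups.items())
print(word_counts)
```

For the two sample lines this yields three: 2, witches: 1, watch: 2, swatch: 2, watches: 2, which: 2, witch: 1.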

Example: Word Count

Easy?

Example: Word Count Extra

three swiss witch-bitches, which wished to be switched swiss witch-bitches, watch three swiss swatch watch switches. which swiss witch-bitch, which wishes to be a switched witch-bitch, wishes to watch which swiss swatch watch switch?

<swiss, 5> <witch, 4> <watch, 4> <three, 2> <bitches, 2> <switched, 2> <swatch, 2> <bitch, 2> <wishes, 2> <wished, 1>

Practical work 4: Word Count

• Create a new directory named «WordCount»
• Use any MapReduce framework of your choice to implement the Word Count example
  • Java is OK
  • C/C++ is still preferred
    • No MapReduce framework for C/C++ at the moment
    • Invent one yourself

Practical work 4: Word Count

• Write a short report in LaTeX:
  • Name it « 04.word.count.tex »
  • Why you chose your specific MapReduce implementation
  • How your Mapper and Reducer work. Figure.
  • Who does what.
• Work in your group, in parallel
• Push your report to the corresponding forked GitHub repository

Practical work 5: The Longest Path

• Use any MapReduce framework of your choice to implement the LongestPath toy project
  • Input: set of files, one for each of your laptops
    • Each line contains one full path of a file
    • find /
  • Output: longest path(s)
• Write a short report in LaTeX:
  • Name it « 05.word.count.tex »
  • How your Mapper and Reducer work. Figure.
  • Who does what.
• Work in your group, in parallel
• Push your report to the corresponding forked GitHub repository
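
One possible way to think about the mapper and reducer for this exercise, shown as a framework-free Python sketch rather than a reference solution. It assumes "longest" means the most characters, and the sample paths and the names `map_paths` and `reduce_paths` are hypothetical.

```python
def map_paths(lines):
    """map(): keep only the longest path(s) seen in one input file."""
    paths = [line.rstrip("\n") for line in lines if line.strip()]
    if not paths:
        return []
    longest = max(len(p) for p in paths)
    return [(longest, p) for p in paths if len(p) == longest]

def reduce_paths(candidates):
    """reduce(): pick the overall longest path(s) among all mapper outputs."""
    if not candidates:
        return []
    longest = max(length for length, _ in candidates)
    return sorted({p for length, p in candidates if length == longest})

# Hypothetical `find /` outputs from two laptops.
laptop_a = ["/etc/hosts\n", "/home/a/report.tex\n"]
laptop_b = ["/usr/bin/python3\n", "/home/b/notes.txt\n", "/tmp/x\n"]

candidates = map_paths(laptop_a) + map_paths(laptop_b)
print(reduce_paths(candidates))  # ['/home/a/report.tex']
```

Each mapper forwards only its local winners, so the reducer sees a handful of candidates instead of every path on every laptop.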