MapReduce succinctly

Data everywhere

Problem - We are drowning in data

Hadoop’s place

Effective storage and processing of large chunks of data

Google GFS and MapReduce

• Google was already dealing with large amounts of data over 10 years ago

• Documented experience in a series of papers

• The MapReduce programming model

• Google File System

• Scalable model that was implemented in Hadoop

Disk speeds

• Processing a 10 TB file on a single machine

• Time – ~430 minutes

• Same file stored as 1 TB on each of 10 machines

• Time – ~43 minutes

To store data at scale you need to use multiple disks/machines
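
The figures above follow from simple arithmetic. The sketch below derives the read rate implied by the single-machine time (roughly 388 MB/s; that rate is inferred from the slide, not stated on it) and shows that splitting the file evenly over 10 machines that read in parallel divides the time by 10. It assumes 1 TB = 10^12 bytes and ignores any coordination overhead.

```python
TB = 10**12  # assumption: decimal terabytes


def scan_minutes(total_bytes, bytes_per_second, machines=1):
    """Minutes to read total_bytes when it is split evenly across machines."""
    return total_bytes / machines / bytes_per_second / 60


# Read rate implied by the slide: 10 TB in ~430 minutes on one machine.
implied_rate = 10 * TB / (430 * 60)

single_machine = scan_minutes(10 * TB, implied_rate)
ten_machines = scan_minutes(10 * TB, implied_rate, machines=10)

print(f"implied read rate: {implied_rate / 10**6:.0f} MB/s")
print(f"1 machine:   {single_machine:.0f} minutes")
print(f"10 machines: {ten_machines:.0f} minutes")
```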

Processor trends

• CPU speeds are not growing exponentially

• Processors take less power

• Processors are able to do more in one cycle

Product Name                  Intel® Core™ i7-920 Processor        Intel® Core™ i7-6700K Processor
                              (8M Cache, 2.66 GHz, 4.80 GT/s       (8M Cache, up to 4.20 GHz)
                              Intel® QPI)
Code Name                     Bloomfield                           Skylake
Launch Date                   Q4'08                                Q3'15
Lithography                   45 nm                                14 nm
Recommended Customer Price    BOX: $305.00                         BOX: $350.00
# of Cores                    4                                    4
# of Threads                  8                                    8
Processor Base Frequency      2.66 GHz                             4 GHz
Max Turbo Frequency           2.93 GHz                             4.2 GHz
TDP                           130 W                                91 W

Source - http://ark.intel.com/compare/88195,37147

To scale you need to use multiple CPUs/machines

Network speeds

• Gigabit network - Speed: 1000 Mbps

• Size: 1 TB

• Transfer time: ~2 hours

Don’t move data unless you have to
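
The transfer time can be checked with a few lines of arithmetic. This sketch assumes 1 TB = 10^12 bytes and a link that sustains its full nominal 1000 Mbps with no protocol overhead, so real transfers would take longer.

```python
def transfer_hours(size_bytes, link_megabits_per_second):
    """Hours to move size_bytes over a link running at its nominal speed."""
    bits = size_bytes * 8
    seconds = bits / (link_megabits_per_second * 10**6)
    return seconds / 3600


hours = transfer_hours(10**12, 1000)
print(f"{hours:.2f} hours")  # about 2.2 hours
```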

Example scenario

• Example that we will use to understand the problem

• Data on favorite beverage

• Calculate average cups consumed per day for each beverage

Brianna, coffee, 3

Cameron, milk, 5

Thomas, milk, 4

Wyatt, coffee, 5

coffee, 4

milk, 4.5

Example – Single Threaded

Average cups consumed by tea drinkers is 3.33

Transform

Group by beverage

Summarize and display results
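
A minimal single-threaded sketch of the three steps (transform, group by beverage, summarize), written in Python for illustration and using the four sample records from the example scenario. The function names are hypothetical and this is not the code shown in the talk.

```python
from collections import defaultdict

records = [
    "Brianna, coffee, 3",
    "Cameron, milk, 5",
    "Thomas, milk, 4",
    "Wyatt, coffee, 5",
]


def transform(line):
    """Parse 'name, beverage, cups' and keep only (beverage, cups)."""
    _name, beverage, cups = [field.strip() for field in line.split(",")]
    return beverage, int(cups)


def average_by_beverage(lines):
    # Group by beverage.
    groups = defaultdict(list)
    for line in lines:
        beverage, cups = transform(line)
        groups[beverage].append(cups)
    # Summarize each group as an average.
    return {beverage: sum(cups) / len(cups) for beverage, cups in groups.items()}


averages = average_by_beverage(records)
for beverage, average in sorted(averages.items()):
    print(f"{beverage}, {average:g}")
```

With the sample records this prints `coffee, 4` and `milk, 4.5`, matching the expected results in the scenario.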

The problem of shared state

Can we avoid shared state?

Key idea – cooperating units

• Organize program into independent but cooperating units

• Programs need to be broken into a structure that will minimize the need for any shared state

• Cooperating units can work in parallel without sharing resources and cooperate as needed

Key idea – avoid shared state

Sum large list

Add list 1

Add list 2

Add list 3

Add and display

sum
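
The diagram above can be sketched as follows: the large list is split into three sublists, each worker adds up only its own sublist, and a final step adds the partial sums. No worker touches another worker's data, so no locks are needed. The `split` helper and the sample list are assumptions made for this illustration. In a standard CPython build the threads are unlikely to run the additions truly in parallel; the point here is the structure.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Split items into at most `parts` contiguous sublists."""
    size = -(-len(items) // parts)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


numbers = list(range(1, 1001))
sublists = split(numbers, 3)  # list 1, list 2, list 3

# Each worker sums its own sublist; nothing is shared between workers.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(sum, sublists))

# Add and display the sum.
total = sum(partial_sums)
print(total)
```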

How can we apply to our problem?

• Data can be split into blocks

• Each block of data can be processed by a thread

Stage 1 - input        Stage 1 - output    Stage 2 - output     Stage 3 - output

Brianna, coffee, 1     coffee, 1           coffee, {1, 3, 4}    coffee, 2.67
Cameron, milk, 5       milk, 5             milk, {5, 4}         milk, 4.5
Thomas, milk, 4        milk, 4             tea, {1, 4}          tea, 2.5
Wyatt, tea, 1          tea, 1
Victoria, coffee, 3    coffee, 3
Grace, coffee, 4       coffee, 4
David, tea, 4          tea, 4
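
The three stages can be sketched with one thread per block. How the seven records are divided into blocks is an assumption made for this illustration (the slide does not say), and the function names are hypothetical. Each thread works only on its own block and returns its result instead of writing to a shared structure.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Assumed split of the sample data into two blocks.
blocks = [
    ["Brianna, coffee, 1", "Cameron, milk, 5", "Thomas, milk, 4", "Wyatt, tea, 1"],
    ["Victoria, coffee, 3", "Grace, coffee, 4", "David, tea, 4"],
]


def stage1(block):
    """Turn each record of one block into a (beverage, cups) pair."""
    pairs = []
    for line in block:
        _name, beverage, cups = [field.strip() for field in line.split(",")]
        pairs.append((beverage, int(cups)))
    return pairs


def stage2(pair_lists):
    """Group the cups from all blocks by beverage."""
    groups = defaultdict(list)
    for pairs in pair_lists:
        for beverage, cups in pairs:
            groups[beverage].append(cups)
    return dict(groups)


def stage3(groups):
    """Average each group, rounded to two decimals as on the slide."""
    return {b: round(sum(c) / len(c), 2) for b, c in groups.items()}


# Stage 1 runs once per block, each call on its own thread.
with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
    stage1_output = list(pool.map(stage1, blocks))

stage2_output = stage2(stage1_output)
stage3_output = stage3(stage2_output)
print(stage3_output)
```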

The Akka Actor model

• Units can send and receive messages

• Mailbox

Implementation structured to avoid shared state
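
The mailbox idea can be illustrated without Akka. The sketch below is not Akka code: it is a hypothetical Python stand-in in which an "actor" owns a queue (its mailbox) and a private dictionary, and processes one message at a time on its own thread. Other code never touches the actor's state directly; it only sends messages.

```python
import queue
import threading

STOP = object()  # sentinel message telling the actor to finish


class AveragingActor:
    """Toy actor: private state, changed only by messages from its mailbox."""

    def __init__(self):
        self.mailbox = queue.Queue()
        self._totals = {}  # beverage -> (sum of cups, number of records)
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def tell(self, message):
        """Send a message; the sender does not wait for it to be processed."""
        self.mailbox.put(message)

    def _run(self):
        # Messages are handled one at a time, so no lock is required.
        while True:
            message = self.mailbox.get()
            if message is STOP:
                break
            beverage, cups = message
            total, count = self._totals.get(beverage, (0, 0))
            self._totals[beverage] = (total + cups, count + 1)

    def stop_and_get_averages(self):
        self.tell(STOP)
        self._thread.join()
        return {b: t / c for b, (t, c) in self._totals.items()}


actor = AveragingActor()
for message in [("coffee", 1), ("milk", 5), ("milk", 4), ("tea", 1),
                ("coffee", 3), ("coffee", 4), ("tea", 4)]:
    actor.tell(message)

actor_averages = actor.stop_and_get_averages()
print(actor_averages)
```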

Implementation – Take 2

Implementation – Take 3

MapReduce

Framework: sorts, groups and sends data by key [Sort/Shuffle step]

The MapReduce framework

Preparation

• Break files into blocks that can be processed independently

• Locate and use code to read each record

Map - input            Map - output    Sort/shuffle - output    Reduce - output

Brianna, coffee, 1     coffee, 1       coffee, {1, 3, 4}        coffee, 2.67
Cameron, milk, 5       milk, 5         milk, {5, 4}             milk, 4.5
Thomas, milk, 4        milk, 4         tea, {1, 4}              tea, 2.5
Wyatt, tea, 1          tea, 1
Victoria, coffee, 3    coffee, 3
Grace, coffee, 4       coffee, 4
David, tea, 4          tea, 4
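
The division of labor in the table can be sketched in a few lines of Python. This is an in-memory illustration, not Hadoop's API: `run_mapreduce` is a hypothetical stand-in for the framework, while `map_fn` and `reduce_fn` play the part of the two functions a user would write.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(record):
    """User code: emit (beverage, cups) for one input record."""
    _name, beverage, cups = [field.strip() for field in record.split(",")]
    yield beverage, int(cups)


def reduce_fn(key, values):
    """User code: average all values that share one key."""
    values = list(values)
    return key, round(sum(values) / len(values), 2)


def run_mapreduce(records, map_fn, reduce_fn):
    """Stand-in for the framework: map, sort/shuffle by key, then reduce."""
    mapped = [pair for record in records for pair in map_fn(record)]
    mapped.sort(key=itemgetter(0))  # sort/shuffle: bring equal keys together
    return [
        reduce_fn(key, (value for _key, value in group))
        for key, group in groupby(mapped, key=itemgetter(0))
    ]


records = [
    "Brianna, coffee, 1", "Cameron, milk, 5", "Thomas, milk, 4",
    "Wyatt, tea, 1", "Victoria, coffee, 3", "Grace, coffee, 4",
    "David, tea, 4",
]

results = run_mapreduce(records, map_fn, reduce_fn)
print(results)
```

Only `map_fn` and `reduce_fn` know anything about beverages; the sort/shuffle step works on keys alone, which is what lets a real framework distribute it across machines.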

Hadoop Distributed File System

• Files are split into large blocks

• Each block is stored on multiple nodes

• Namenode tracks block location
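
A toy model of the three points above: a file is cut into fixed-size blocks, every block gets several replicas, and a lookup table (standing in for the namenode) records which nodes hold each block. The node names, the tiny block size, and the round-robin placement are all assumptions made for this illustration; real HDFS uses much larger blocks and its own placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replication):
    """Toy namenode table: block index -> nodes holding a replica."""
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


data = b"x" * 1000                      # stand-in for a large file
blocks = split_into_blocks(data, 256)   # hypothetical, tiny block size
nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names

block_locations = place_blocks(len(blocks), nodes, replication=3)
for block, holders in block_locations.items():
    print(block, holders)
```

Because every block lives on more than one node, losing a single machine in this model never makes a block unreachable.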

Other aspects

• Framework does a lot of the heavy lifting

• Machines can fail

• Tasks can fail

• Stragglers

• Users just write the Map and Reduce functions

Cup count demo – Apache Hadoop

• Demo

• Program is almost identical to what we wrote

Next steps

• Check out sample files on GitHub - https://github.com/danjebaraj/hadoopmr

• Read Google’s papers on MapReduce and GFS (the model for HDFS)

• http://research.google.com/archive/mapreduce.html

• http://research.google.com/archive/gfs.html

• Get familiar with Hadoop and Apache Spark

• Become familiar with functional programming

• Scala, F#, Clojure

• Check out Syncfusion’s free e-Books on related topics - http://www.syncfusion.com/resources/techportal/ebooks

• If working with Windows, check out Syncfusion’s easy-to-use Big Data Platform - http://www.syncfusion.com/products/big-data

Thank you

Daniel Jebaraj

www.syncfusion.com
