Introduction to MapReduce Paradigm for Data Mining
COSC 526 Class 2
Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: [email protected]
2 Verification, Validation and Uncertainty Quantification of Machine Learning Algorithms: Phase I Demonstration
Last class …
• Class Logistics
• Introduction to big data
• Types of data and compute systems
• Bonferroni Principle and “how-not-to-design-an-experiment”
• The Big Data Mining Process
This class…
• Need for the MapReduce paradigm
• MapReduce
• Design of MapReduce algorithms
• Example usage for simple statistics:
  – Word count
  – Co-occurrence counts
What is common to data mining/analytics algorithms?
Acquire (data) → Extract and Clean → Aggregate and Integrate → Represent → Analyze and Model → Interpret
• Iterate over a large set of data
• Extract some quantities of interest from the data
• Shuffle and sort the data
• Aggregate intermediate results
• Make it look pretty!
Traditional Architecture of Data Mining
• Classical machine learning / data mining
• Data is fetched from disk, loaded into main memory, and processed on the CPU
[Figure: Disk → Memory → CPU]
Compute Intensive vs. Data Intensive Computing
Compute Intensive
• Traditionally designed to optimize floating-point operations (FLOPS)
• Key assumption: the working data set fits in main memory
• Memory bandwidth is usually high (and optimized)
• “Computationally dense”: all applications must rethink how to best use the compute resources

Data Intensive
• Must be optimized for data movement, storage, and analysis
• Data operations, not FLOPS, are what matter
• Key assumption: the working data set will not fit in memory (and may not even be available on the same machine)
• Current architectures are optimized for either media or transactional use
Compute Intensive vs. Data Intensive Computing (2)
Berkin Ozisikyilmaz, Ramanathan Narayanan, Joseph Zambreno, Gokhan Memik, and Alok N. Choudhary. An architectural characterization study of data mining and bioinformatics workloads. In IISWC, pages 61–70, 2006.

[Figure: workload characterization comparing integer vs. floating-point operation mix]
Key Take home message: Current compute architectures are not optimized for Data mining/analytic operations!
Programmers shoulder responsibility in traditional HPC environments!
[Figure: message passing (P1–P5, each with private memory) vs. shared memory (P1–P5 sharing one memory)]
• Issues related to scheduling, data distribution, synchronization, inter-process communication, etc.
• Architectural considerations: SIMD/MIMD, network topology, etc.
• OS issues: mutexes, deadlocks, etc.
Scalable Algorithms for Data Mining
• Data sizes are vast (> 100 terabytes)
• Even assuming a nominal read speed of 35 MB/s, it can take over a month just to access/read the data!
• What about answering more useful questions?
  – Number of categories
  – Types of datasets represented, etc.
• Those take even longer!!
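The “over a month” figure can be checked with back-of-the-envelope arithmetic (a sketch using the slide’s own numbers: 100 TB of data, a 35 MB/s read rate, decimal units):

```python
# How long does one sequential pass over 100 TB take at 35 MB/s?
TB = 10**12  # bytes (decimal units)
MB = 10**6   # bytes

data_bytes = 100 * TB
read_rate = 35 * MB  # bytes per second

seconds = data_bytes / read_rate
days = seconds / (60 * 60 * 24)
print(f"{days:.1f} days")  # roughly 33 days: over a month
```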
Challenges
• How do we ease access to data?
  – (Reasonably) fast and efficient
  – (Somewhat) fault-tolerant access
• How do we distribute computation?
  – Parallel programming is hard!
  – Use commodity clusters for processing
Hadoop Distributed File System (HDFS) / Google File System (GFS)
Hadoop MapReduce / Google MapReduce
MapReduce is an elegant paradigm of working with Big Data
What would you change in the underlying architectures?
• Hybrid Memory Cube (HMC)
• Non-volatile random access memory (NVRAM)
• Global address space (GAS)
Synergistic Challenges in Data Intensive and Exascale Computing (DOE ASCAC report 2013)
Let’s talk about distributing computations
[Figure: a commodity cluster: many nodes, each with its own CPU, memory, and disk, connected through a hierarchy of switches]
• Commodity clusters
• What do we do when we have supercomputers?
Programming Model for Data Mining
• Transferring data over the network takes time
• Key ideas:
  – Bring the computation close to the data
  – Replicate the data multiple times for reliability
• MapReduce:
  – Provides a storage infrastructure (@Google: GFS; @class: Hadoop HDFS)
  – Provides a programming model: a parallel paradigm that is easier than conventional MPI
MapReduce Architecture
This is not the first lecture on MapReduce…
• Material is (in part) inspired by:
  – William Cohen’s lectures (10-601 class at CMU)
  – Jure Leskovec (Stanford)
  – Aditya Prakash (Virginia Tech)
  – Cloudera
  – And many, many others!
• Materials “redrawn”, “reorganized”, and “reworked” to reflect how we use them
1st Key Idea of MapReduce: Bring computations close to the data

• Programmers specify two functions:
  – map(in_key, in_value) → list of <out_key, intermediate_value>
  – reduce(out_key, list of intermediate_value) → list of out_value
• All values with the same key are reduced together
• Let the “runtime” handle everything else: scheduling, I/O, networking, inter-process communication, etc.
Visual interpretation of MapReduce
Input:          k1 v1 | k2 v2 | k3 v3 | k4 v4 | k5 v5 | k6 v6
Map output:     (a,1) (b,1) | (c,4) (b,6) | (a,5) (d,3) | (b,5) (c,4)
Shuffle & Sort: aggregate by key: a:[1,5]  b:[1,6,5]  c:[4,4]  d:[3]
Reduce output:  (a,6) (b,12) (c,8) (d,3)
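This flow can be simulated in a few lines of Python (a local, single-process sketch of what the runtime does; the variable names are illustrative, not Hadoop’s API):

```python
from collections import defaultdict

# Map output from the four mappers in the figure above
map_output = [('a', 1), ('b', 1), ('c', 4), ('b', 6),
              ('a', 5), ('d', 3), ('b', 5), ('c', 4)]

# Shuffle & sort: aggregate values by key
groups = defaultdict(list)
for key, value in map_output:
    groups[key].append(value)

# Reduce: sum the values for each key
result = {key: sum(values) for key, values in sorted(groups.items())}
print(result)  # {'a': 6, 'b': 12, 'c': 8, 'd': 3}
```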
Other things the programmer can specify

• partition(out_key, numberOfPartitions):
  – a simple hash, e.g., hash(out_key) mod n
  – divides the key space for parallel reduce operations
• combine(out_key, list of intermediate_value) → list of <out_key, intermediate_value>:
  – a mini-reduce function that runs in memory after the map phase
  – optimizes network traffic
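Both hooks are easy to sketch locally (illustrative Python, not the Hadoop API): the partitioner maps a key to one of n reducers, and the combiner performs a per-mapper mini-reduce before anything crosses the network.

```python
from collections import defaultdict

def partition(out_key, num_partitions):
    # A simple hash partitioner: divides the key space across reducers.
    return hash(out_key) % num_partitions

def combine(mapper_output):
    # Mini-reduce run in memory after the map phase: sum counts per key
    # locally, so less data is shuffled over the network.
    partial = defaultdict(int)
    for key, value in mapper_output:
        partial[key] += value
    return list(partial.items())

# One mapper emitted (d,5) and (d,3); the combiner merges them into (d,8).
print(combine([('d', 5), ('d', 3)]))  # [('d', 8)]
```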
Now, what does MapReduce look like?

With a combiner and a partitioner added (note the third mapper now emits two d pairs):

Input:          k1 v1 | k2 v2 | k3 v3 | k4 v4 | k5 v5 | k6 v6
Map output:     (a,1) (b,1) | (c,4) (b,6) | (d,5) (d,3) | (b,5) (c,4)
Combine:        per-mapper, in-memory mini-reduce; (d,5) and (d,3) merge into (d,8)
Partition:      hash(key) mod n assigns each key to a reducer
Shuffle & Sort: aggregate by key: a:[1]  b:[1,6,5]  c:[4,4]  d:[8]
Reduce output:  (a,1) (b,12) (c,8) (d,8)
Let’s understand the MapReduce Runtime
• Scheduling: workers are assigned to map and reduce tasks
• Data distribution: move processes to the data
• Synchronization: gather, sort, and shuffle intermediate data
• Fault tolerance: detect worker failures and restart them
• Hadoop Distributed File System
Now for an example: WordCount
• Let’s look at a corpus of documents
• How do we write the algorithm?
Joe likes toast
Jane likes toast with jam
Joe burnt toast
Word Count (2)
def map(String doc_id, String text):
    for each word w in text:
        emit(w, 1)

def reduce(String term, Iterator<int> values):
    int sum = 0
    for each v in values:
        sum += v
    emit(term, sum)
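The pseudocode above translates directly into runnable Python (a single-process sketch; emit is simulated by collecting pairs, and the shuffle & sort phase is a group-by on the key):

```python
from collections import defaultdict

corpus = {
    'doc1': 'Joe likes toast',
    'doc2': 'Jane likes toast with jam',
    'doc3': 'Joe burnt toast',
}

def map_fn(doc_id, text):
    # emit (word, 1) for each word in the document
    return [(word, 1) for word in text.split()]

def reduce_fn(term, values):
    # emit (term, sum of partial counts)
    return term, sum(values)

# Run all mappers, then shuffle & sort (group by key), then all reducers.
intermediate = defaultdict(list)
for doc_id, text in corpus.items():
    for word, count in map_fn(doc_id, text):
        intermediate[word].append(count)

counts = dict(reduce_fn(term, values) for term, values in intermediate.items())
print(counts['toast'], counts['likes'], counts['Joe'])  # 3 2 2
```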
WordCount (3): Slow Motion (SloMo) Map
WordCount (4): SloMo Shuffle & Sort
Input (map output, one (word, 1) pair per occurrence):
(Joe,1) (likes,1) (toast,1)
(Jane,1) (likes,1) (toast,1) (with,1) (jam,1)
(Joe,1) (burnt,1) (toast,1)

Output (after shuffle & sort, grouped by key):
Joe: [1, 1]   Jane: [1]   likes: [1, 1]   toast: [1, 1, 1]
with: [1]     jam: [1]    burnt: [1]
WordCount (5): SloMo Reduce
Input (grouped by key):
Joe: [1, 1]   Jane: [1]   likes: [1, 1]   toast: [1, 1, 1]
with: [1]     jam: [1]    burnt: [1]

Output (summed per key):
(Joe, 2) (Jane, 1) (likes, 2) (toast, 3) (with, 1) (jam, 1) (burnt, 1)
A look under the hood: what happens when you invoke WordCount?
• The input file is split into chunks of ~64 MB each (Split 0 … Split 4)
• The user program is forked into multiple copies across the cluster; one copy becomes the master, the rest become workers
• The master task is special: it assigns the M map tasks and R reduce tasks to idle workers
• Each map worker reads the split it is assigned and writes intermediate <key, value> pairs to a buffer, which is periodically flushed to local disk
• The master notifies reduce workers about the locations of the intermediate files
• Reduce workers remotely read the intermediate data, sort it by key, and reduce it
• Final results, organized by intermediate key, are written to the output files (Output File 0, Output File 1, …), one per reduce partition
How do commodity clusters use this?
[Figure: compute nodes attached to shared NAS/SAN storage]

• Main problem: how do we handle data storage and compute together?
2nd Key Idea of MapReduce: Replicate the data multiple times for reliability

• Hadoop Distributed File System:
  – Store data (and its replicas) on the local disks of the nodes
  – Start jobs on the nodes that hold the data
• Why?
  – There is not enough RAM to hold all the data in main memory
  – Disk access is slow, but streaming disk throughput is usually acceptable
File storage design
• Files are stored as chunks (e.g., 128 MB each)
• Reliability through replication: each chunk is replicated across 3+ chunkservers
• A single master coordinates access and metadata: centralized management
• No data caching: little benefit for large, streaming reads
• Simple API
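The storage cost of this design is easy to estimate (a sketch; the 128 MB chunk size and 3x replication factor are the figures mentioned above, and the 1000 MB file size is just an example):

```python
import math

def chunk_count(file_size_mb, chunk_size_mb=128):
    # A file is stored as ceil(size / chunk_size) chunks.
    return math.ceil(file_size_mb / chunk_size_mb)

file_size_mb = 1000                 # a ~1 GB file
chunks = chunk_count(file_size_mb)  # 8 chunks
replicas = 3
stored = chunks * replicas          # 24 chunk copies on disk cluster-wide
print(chunks, stored)  # 8 24
```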
How does HDFS work?

• The NameNode stores cluster metadata
  – Files and directories are represented by inodes
  – Inodes store attributes such as permissions, etc.
• Data is stored across DataNodes
  – Replicated for reliability
WordCount Code: Main
(Other input formats: KeyValueInputFormat, SequenceFileInputFormat)
WordCount: Map Function
WordCount: Reduce Function
MapReduce Limits
• Moving data is very expensive: writing and reading are both costly
• No reduce task can start until:
  – all map tasks are done
  – the data in its partition has been shuffled and sorted
Limitations of MapReduce
• No control over the order in which reduce tasks are performed
  – The only ordering guarantee is that reduce tasks start after map tasks finish
• Assume that the map and reduce tasks will run:
  – on different machines
  – in different memory spaces
Programming Pitfalls

• Don’t create a static variable and assume that other processes can read it
  – They can’t
  – It appears that they can when run locally, but they can’t in a cluster
• Do not communicate between mappers or between reducers
  – The overhead is high
  – You don’t know which mappers/reducers are actually running at any given point
  – There’s no easy way to find out which machine they’re running on (because you shouldn’t be looking for them anyway)
Thanks to Shannon Quinn for his pointers!
Designing MapReduce Algorithms
A slightly more complex example: Term Co-occurrence
• Given a large text collection, compute a matrix over all words:
  – M = an N × N matrix (N = vocabulary size)
  – M_ij = number of times terms i and j co-occur in a sentence
• Why?
  – Distributional profiles are a way of measuring semantic distance
  – Semantic distance is important for NLP tasks
Example of a large counting problem
• Term co-occurrence matrix computation involves:
  – a large event space (number of term pairs)
  – a large number of observations (number of documents)
  – keeping track of interesting statistics about events
• Approach:
  – Mappers generate partial counts
  – Reducers aggregate the counts
First Approach: “Pairs”
• Let each mapper take a sentence:
  – Generate all co-occurring term pairs
  – For each pair (a, b), emit ((a, b), count)
• Reducers sum up the counts associated with each pair
• Use combiners to aggregate partial results
• Advantages: easy to implement and understand
• Disadvantages: the upper bound on the number of pairs is unknown
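A sketch of the “pairs” approach in Python (illustrative, not Hadoop code; co-occurrence here means appearing in the same sentence, and each ordered pair is emitted separately):

```python
from collections import defaultdict
from itertools import permutations

def pairs_map(sentence):
    # Emit ((a, b), 1) for every ordered pair of co-occurring terms.
    words = sentence.split()
    return [((a, b), 1) for a, b in permutations(words, 2)]

def pairs_reduce(pair, counts):
    # Sum the partial counts for one pair.
    return pair, sum(counts)

# Run on a tiny corpus of two sentences.
sentences = ['a b c', 'a b d']
intermediate = defaultdict(list)
for s in sentences:
    for pair, one in pairs_map(s):
        intermediate[pair].append(one)

matrix = dict(pairs_reduce(p, c) for p, c in intermediate.items())
print(matrix[('a', 'b')])  # 2: 'a' and 'b' co-occur in both sentences
```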
Second approach: “Stripes”
• Group pairs together into an associative array
• Each mapper takes a sentence:
  – Generate all co-occurring term pairs
  – For each term a, emit (a, {b: count_b, c: count_c, …})
• Reducers perform an element-wise sum of the associative arrays

Pairs view:                  Stripes view:
(a, b) → 1
(a, c) → 2
(a, d) → 5                   a → {b: 1, c: 2, d: 5, e: 3, f: 2}
(a, e) → 3
(a, f) → 2

Element-wise sum in the reducer:
a → {b: 1, d: 5, e: 3}
a → {b: 1, c: 2, d: 5, f: 2}
-------------------------------------------
a → {b: 2, c: 2, d: 10, e: 3, f: 2}

• Advantages:
  – Far less sorting and shuffling of key-value pairs
  – Can make better use of combiners
• Disadvantages:
  – More difficult to implement
  – The underlying objects are “larger” than typical intermediate results
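And a sketch of the “stripes” approach (illustrative Python): each mapper emits one associative array per term, and the reducer sums the arrays element-wise.

```python
from collections import defaultdict, Counter

def stripes_map(sentence):
    # For each term a, emit (a, {b: count_b, ...}) over its co-occurring terms.
    words = sentence.split()
    stripes = []
    for i, a in enumerate(words):
        stripe = Counter(w for j, w in enumerate(words) if j != i)
        stripes.append((a, stripe))
    return stripes

def stripes_reduce(term, stripes):
    # Element-wise sum of the associative arrays for one term.
    total = Counter()
    for stripe in stripes:
        total += stripe
    return term, total

grouped = defaultdict(list)
for s in ['a b c', 'a b d']:
    for term, stripe in stripes_map(s):
        grouped[term].append(stripe)

result = dict(stripes_reduce(t, s) for t, s in grouped.items())
print(dict(result['a']))  # {'b': 2, 'c': 1, 'd': 1}
```

Note how one stripe per term replaces many individual pair emissions, which is exactly why far fewer key-value pairs are shuffled.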
How do the runtimes compare?
Summary and To Dos
Summary
• MapReduce is a big-data programming paradigm:
  – map tasks
  – reduce tasks
• Careful consideration of data movement is required
Notes, and what to expect next

• Please form project teams as soon as possible
  – 2 is good; 3 is okay
  – More team members → higher expectations!
• Assignment 1 is due today!
• Additional notes on Hadoop have been posted on the website
• Next class:
  – Probability and statistics review basics
  – Naïve Bayes and logistic regression on Hadoop
THANK YOU!!!