
Introduction to MapReduce Paradigm for Data Mining

COSC 526 Class 2

Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: [email protected]


Last class …

• Class Logistics

• Introduction to big data

• Types of data and compute systems

• Bonferroni Principle and “how-not-to-design-an-experiment”

• The Big Data Mining Process


This class…

• Need for the MapReduce paradigm

• MapReduce

• Decision making and design of MapReduce algorithms

• Example usage for easy statistics:
  – Word count
  – Co-occurrence counts


What is common to data mining/analytic algorithms?

Acquire (data) → Extract and Clean → Aggregate and Integrate → Represent → Analyze and Model → Interpret

Common pattern across these algorithms:

• Iterate over a large set of data

• Extract some quantities of interest from the data

• Shuffle and sort the data

• Aggregate intermediate results

• Make it look pretty!


Traditional Architecture of Data Mining

• Classical machine learning / data mining setting

• Data is fetched from disk, loaded into main memory, and processed on the CPU

(Figure: a single machine — Disk → Memory → CPU)


Compute Intensive vs. Data Intensive Computing

Compute Intensive:
• Traditionally designed for optimizing floating-point operations (FLOPS)
• Key assumption: the working data set will fit in main memory
• Memory bandwidth is usually high (and optimized)
• “Computationally dense”: all applications have to rethink how to optimize their use of compute resources

Data Intensive:
• Has to be optimized for data movement, storage, and analysis
• Data ops, not FLOPS, are what matter
• Key assumption: the working data set will not fit in memory (and may not even be available on the same machine)
• Current architectures are optimized for either media or transactional use


Compute Intensive vs. Data Intensive Computing (2)

(Figure: integer vs. floating-point instruction mix across workloads. Source: Berkin Ozisikyilmaz, Ramanathan Narayanan, Joseph Zambreno, Gokhan Memik, and Alok N. Choudhary. An architectural characterization study of data mining and bioinformatics workloads. In IISWC, pages 61–70, 2006.)

Key take-home message: current compute architectures are not optimized for data mining/analytic operations!


Programmers shoulder the responsibility in traditional HPC environments!

(Figure: processors P1–P5 under the Message Passing model vs. the Shared Memory model)

• Issues related to scheduling, data distribution, synchronization, inter-process communication, etc.

• Architectural considerations: SIMD/MIMD, network topology, etc.

• OS issues: mutexes, deadlocks, etc.


Scalable Algorithms for Data Mining

• Data sizes are vast (> 100 Terabytes)

• Even assuming a nominal read speed of 35 MB/sec, it can take over a month just to access/read the data! (See the arithmetic below.)

• How about answering more useful questions?
  – Number of categories
  – Types of datasets represented, etc.

• Takes even longer!!
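A quick sanity check on the “over a month” claim, assuming a single machine reading 100 TB sequentially at the quoted rate:

\[
\frac{100 \times 10^{12}\ \text{B}}{35 \times 10^{6}\ \text{B/s}} \approx 2.86 \times 10^{6}\ \text{s} \approx 33\ \text{days}
\]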


Challenges

• How to ease access to data?
  – (Reasonably) fast and efficient
  – (Somewhat) fault-tolerant access

• How to distribute computation?
  – Parallel programming is hard!
  – Use commodity clusters for processing

The Hadoop Distributed File System (HDFS) / Google File System addresses the first challenge; Hadoop MapReduce / Google MapReduce addresses the second.

MapReduce is an elegant paradigm for working with Big Data.


What would you change in the underlying architectures?

• Hybrid Memory Cube (HMC)
• Non-volatile Random Access Memory
• Global Address Space (GAS)

Synergistic Challenges in Data Intensive and Exascale Computing (DOE ASCAC report, 2013)


Let’s talk about distributing computations

(Figure: a commodity cluster — many nodes, each with its own CPU, memory, and disk, connected through a hierarchy of switches)

• Commodity clusters

• What do we do when we have supercomputers?


Programming Model for Data Mining

• Transferring data over the network can take time

• Key ideas:
  – Bring the computation close to the data
  – Replicate the data multiple times for reliability

• MapReduce:
  – Provides a storage infrastructure (@Google: GFS; @class: Hadoop HDFS)
  – Provides a programming model: a parallel paradigm that is easier to use than conventional MPI


MapReduce Architecture


This is not the first lecture on MapReduce…

• Material is (in part) inspired by:
  – William Cohen’s lectures (10-601 class at CMU)
  – Jure Leskovec (Stanford)
  – Aditya Prakash (Virginia Tech)
  – Cloudera
  – Google
  – And many, many others!

• Materials “redrawn”, “reorganized” and “reworked” to reflect how we use it


1st key idea of MapReduce: bring computations close to the data

• Programmers specify two functions:
  – Map(in_k, in_v) → <out_k, inter_v> list
  – Reduce(out_k, <inter_v> list) → <out_v> list

• All values with the same key are reduced together

• Let the “runtime” handle everything else: scheduling, I/O, networking, inter-process communication, etc.
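To make this contract concrete, here is a minimal single-machine sketch in Python (illustrative only: run_mapreduce is a made-up helper that simulates what the framework does; it is not a Hadoop API):

  from collections import defaultdict

  def run_mapreduce(records, map_fn, reduce_fn):
      # Map phase + shuffle & sort: apply map_fn to every input record
      # and group the intermediate values by key.
      groups = defaultdict(list)
      for in_k, in_v in records:
          for out_k, inter_v in map_fn(in_k, in_v):
              groups[out_k].append(inter_v)
      # Reduce phase: all values with the same key are reduced together.
      output = []
      for out_k in sorted(groups):
          output.extend(reduce_fn(out_k, groups[out_k]))
      return output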


Visual interpretation of MapReduce

Input:        (k1,v1) (k2,v2) (k3,v3) (k4,v4) (k5,v5) (k6,v6) … → four map tasks

Map output:   a:1 b:1 | c:4 b:6 | a:5 d:3 | b:5 c:4   (one group per mapper)

Shuffle & Sort (aggregate by key): a:[1,5]  b:[1,6,5]  c:[4,4]  d:[3]

Reduce output: a:6  b:12  c:8  d:3


Other things the programmer also considers

• Partition(out_k, numberOfPartitions):
  – A simple hash, e.g., hash(out_k) mod n
  – Divides the key space for parallel reduce operations

• Combine(out_k, <inter_v> list) → <out_k, inter_v> list:
  – A mini reduce function that runs in memory after the map phase
  – Optimizes network traffic
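A minimal sketch of these two hooks (illustrative Python; the function names are made up, but they correspond to Hadoop’s Partitioner and Combiner concepts):

  def partition(out_k, num_partitions):
      # Hash partitioner: decides which reducer receives this key.
      # Python's hash() stands in for the key's hashCode() in Hadoop.
      return hash(out_k) % num_partitions

  def combine(out_k, inter_vs):
      # "Mini reduce" run locally on a mapper's output before the
      # shuffle; for counting, a partial sum shrinks network traffic.
      return [(out_k, sum(inter_vs))]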


Now, what does MapReduce look like?

Map output:      a:1 b:1 | c:4 b:6 | d:5 d:3 | b:5 c:4

Combine (per mapper): the third mapper’s d:5 and d:3 merge into d:8

Partition: hash(key) mod R decides which reducer each key goes to

Shuffle & Sort (aggregate by key): a:[1]  b:[1,6,5]  c:[4,4]  d:[8]

Reduce output:   a:1  b:12  c:8  d:8


Let’s understand the MapReduce Runtime

• Scheduling: workers are assigned to map and reduce tasks

• Data distribution: move processes to the data

• Synchronization: gather, sort, and shuffle intermediate data

• Fault tolerance: detect worker failures and restart them

• Storage: the Hadoop Distributed File System


Now for an example: WordCount

• Let’s look at a corpus of documents:

  Joe likes toast
  Jane likes toast with jam
  Joe burnt toast

• How do we write the algorithm?


Word Count (2)

def map(doc_id, text):
    # Emit (word, 1) for every word in the document.
    for w in text.split():
        yield (w, 1)

def reduce(term, values):
    # Sum the partial counts for this term.
    yield (term, sum(values))
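Running this by hand on the toy corpus with the run_mapreduce sketch from earlier (again, that helper is illustrative, not a Hadoop API; note the slide’s names map/reduce shadow Python built-ins):

  docs = [("d1", "Joe likes toast"),
          ("d2", "Jane likes toast with jam"),
          ("d3", "Joe burnt toast")]

  print(run_mapreduce(docs, map, reduce))
  # [('Jane', 1), ('Joe', 2), ('burnt', 1), ('jam', 1),
  #  ('likes', 2), ('toast', 3), ('with', 1)]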


(Recap: the map → combine → partition → shuffle & sort → reduce pipeline diagram from Slide 19, repeated to situate the next step.)


WordCount (3): Slow Motion (SloMo) Map


(Recap: the map → combine → partition → shuffle & sort → reduce pipeline diagram from Slide 19, repeated to situate the next step.)


WordCount (4): SloMo Shuffle & Sort

Input (map output, in order):
  Joe:1, likes:1, toast:1, Jane:1, likes:1, toast:1, with:1, jam:1, Joe:1, burnt:1, toast:1

Output (grouped by key):
  Joe:[1,1]  Jane:[1]  likes:[1,1]  toast:[1,1,1]  with:[1]  jam:[1]  burnt:[1]


(Recap: the map → combine → partition → shuffle & sort → reduce pipeline diagram from Slide 19, repeated to situate the next step.)


WordCount (5): SloMo Reduce

Input (grouped pairs):
  Joe:[1,1]  Jane:[1]  likes:[1,1]  toast:[1,1,1]  with:[1]  jam:[1]  burnt:[1]

Output (summed counts):
  Joe:2  Jane:1  likes:2  toast:3  with:1  jam:1  burnt:1


A look under the hood: what happens when you invoke WordCount?

(Figure: the classic MapReduce execution diagram — the user program forks a master and workers; map workers read input splits and write locally; reduce workers read remotely and write output files)

• The input is split into chunks, 64 MB per piece

• Multiple copies of the program are started across the cluster

• The master task is special: it assigns the M map tasks and R reduce tasks to idle workers

• A map worker reads the split it is assigned; intermediate <key, value> pairs are written to a buffer and then to local disk

• Reduce workers are notified by the master about the locations of the intermediate files; they read them remotely, then sort and present the results

• Final results are stored in the output files, grouped by the correct intermediate key


How do commodity clusters use this?

(Figure: compute nodes connected to network-attached storage (NAS) and a storage area network (SAN))

• Main problem: how to handle the data store + compute together?


2nd key idea of MapReduce: replicate the data multiple times for reliability

• Hadoop Distributed File System:
  – Store data (replicated) on the local disks of the nodes
  – Start jobs on the nodes that already hold the data

• Why?
  – There is not enough RAM to hold the data in main memory
  – Disk access is slow (high latency), but disk throughput is usually high


File storage design

• Files are stored as chunks (e.g., 128 MB each)

• Reliability through replication:
  – Each chunk is replicated across 3+ chunkservers

• A single master coordinates access and metadata:
  – Centralized management

• No data caching:
  – Little benefit for large data sets and streaming reads

• Simple API


How does HDFS work?

• The NameNode stores the cluster metadata:
  – Files and directories are represented by inodes
  – Inodes store attributes such as permissions, etc.

• Data is stored across the DataNodes:
  – Effectively replicated across the nodes


WordCount Code: Main

(Slide shows the Java driver code, lost in this copy; a side note lists other input formats: KeyValueInputFormat, SequenceFileInputFormat.)


WordCount: Map Function
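The slide’s Java Mapper code did not survive extraction. As a stand-in, here is a minimal sketch of the same map logic for Hadoop Streaming in Python (an assumption: the original used the Java API; streaming is simply one real way to run this logic on Hadoop):

  #!/usr/bin/env python
  # mapper.py -- WordCount map step for Hadoop Streaming (sketch).
  # Reads raw text lines on stdin and emits "word<TAB>1" per word.
  import sys

  for line in sys.stdin:
      for word in line.split():
          print("%s\t%d" % (word, 1))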


WordCount: Reduce Function
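Likewise, a Hadoop Streaming sketch of the reduce step (same assumption as above; streaming delivers mapper output sorted by key, so all counts for a word arrive consecutively):

  #!/usr/bin/env python
  # reducer.py -- WordCount reduce step for Hadoop Streaming (sketch).
  # Sums runs of identical keys from the sorted mapper output.
  import sys

  current, total = None, 0
  for line in sys.stdin:
      word, count = line.rsplit("\t", 1)
      if word != current:
          if current is not None:
              print("%s\t%d" % (current, total))
          current, total = word, 0
      total += int(count)
  if current is not None:
      print("%s\t%d" % (current, total))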


MapReduce Limits

• Moving data is very expensive:
  – writing and reading are both expensive

• No reduce job can start until:
  – all map jobs are done
  – the data in its partition has been shuffled and sorted


Limitations of MapReduce

• No control over the order in which reduce jobs are performed:
  – the only ordering guarantee is that reduce jobs start after the map jobs

• Assume that the map and reduce jobs will take place:
  – across different machines
  – across different memory spaces


Programming Pitfalls

• Don’t make a static variable and assume that other processes can read it:
  – They can’t
  – It appears that they can when run locally, but they can’t

• Do not communicate between mappers or between reducers:
  – The overhead is high
  – You don’t know which mappers/reducers are actually running at any given point
  – There’s no easy way to find out what machine they’re running on (because you shouldn’t be looking for them anyway)

Thanks to Shannon Quinn for his pointers!


Designing MapReduce Algorithms


A slightly more complex example: Term Co-occurrence

• Given a large text collection, compute a matrix over all words:
  – M is an N × N matrix (N = vocabulary size)
  – M_ij = number of times terms i and j co-occur in a sentence

• Why?
  – Distributional profiles are a way of measuring semantic distance
  – Semantic distance is important for NLP tasks


Example of a large counting problem

• Term co-occurrence matrix computation:
  – A large event space (number of terms)
  – A large number of observations (number of documents)
  – Keep track of interesting statistics about events

• Approach:
  – Mappers generate partial counts
  – Reducers aggregate the counts


First Approach: “Pairs”

• Let each mapper take a sentence:
  – Generate all co-occurring term pairs
  – For each pair (a, b), emit ((a, b), count)

• Reducers sum up the counts associated with these pairs

• Use combiners to aggregate partial results (see the sketch below)

• Advantages: easy to implement and understand

• Disadvantages: the upper bound on the number of distinct pairs is unknown
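A minimal sketch of the “pairs” approach in Python (illustrative; assumes each input record is an already-tokenized sentence):

  from itertools import permutations

  def pairs_map(sent_id, tokens):
      # Emit a partial count of 1 for every ordered pair of distinct
      # terms that co-occur in this sentence.
      for a, b in permutations(set(tokens), 2):
          yield ((a, b), 1)

  def pairs_reduce(pair, counts):
      # Sum all partial counts for this (a, b) pair.
      yield (pair, sum(counts))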


Second approach: “Stripes”

• Group pairs together into an associative array

• Each mapper takes a sentence:
  – Generate all co-occurring term pairs
  – For each term a, emit a → {b: count_b, c: count_c, …}

• Reducers perform an element-wise sum of the associative arrays

Pairs view:                  Stripes view:
  (a, b) → 1
  (a, c) → 2
  (a, d) → 5                 a → {b: 1, c: 2, d: 5, e: 3, f: 2}
  (a, e) → 3
  (a, f) → 2

Element-wise sum in the reducer:
  a → {b: 1, d: 5, e: 3}
  a → {b: 1, c: 2, d: 5, f: 2}
  -------------------------------------
  a → {b: 2, c: 2, d: 10, e: 3, f: 2}

• Advantages:
  – Far less sorting and shuffling of key-value pairs
  – Can make better use of combiners

• Disadvantages:
  – More difficult to implement
  – The underlying objects are “larger” than typical intermediate results
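A matching sketch of the “stripes” mapper and reducer (illustrative Python, same assumptions as the pairs sketch):

  from collections import Counter

  def stripes_map(sent_id, tokens):
      # For each term, emit one associative array (a "stripe") counting
      # the other terms that co-occur with it in this sentence.
      terms = set(tokens)
      for a in terms:
          yield (a, Counter(b for b in terms if b != a))

  def stripes_reduce(term, stripes):
      # Element-wise sum of all stripes observed for this term.
      total = Counter()
      for stripe in stripes:
          total.update(stripe)
      yield (term, dict(total))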


How do the runtimes compare?


Summary and To Dos


Summary

• MapReduce is a big data programming paradigm:
  – map jobs
  – reduce jobs

• Careful consideration of data movement is required


Notes and what to expect next

• Please form project teams as soon as possible:
  – 2 is good; 3 is okay
  – More team members → higher expectations!

• Assignment 1 is due today!

• Additional notes on Hadoop have been put up on the website

• Next class:
  – Probability and statistics review basics
  – Naïve Bayes and Logistic Regression on Hadoop


THANK YOU!!!