Introduction to MapReduce Paradigm for Data Mining
COSC 526 Class 2
Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: [email protected]
2 Verification, Validation and Uncertainty Quantification of Machine Learning Algorithms: Phase I Demonstration
Last class …
• Class Logistics
• Introduction to big data
• Types of data and compute systems
• Bonferroni Principle and “how-not-to-design-an-experiment”
• The Big Data Mining Process
This class…
• Need for the MapReduce paradigm
• MapReduce
• Design of MapReduce algorithms
• Example usage for simple statistics:
  – Word count
  – Co-occurrence counts
What is common to data mining/analytics algorithms?
Acquire (data) → Extract and Clean → Aggregate and Integrate → Represent → Analyze and Model → Interpret
• Iterate over a large set of data
• Extract some quantities of interest from the data
• Shuffle and sort the data
• Aggregate intermediate results
• Make it look pretty!
Traditional Architecture of Data Mining
• Classical machine learning / data mining
• Data is fetched from disk, loaded into main memory, and processed on the CPU
[Figure: Disk → Memory → CPU]
Compute Intensive vs. Data Intensive Computing
Compute Intensive
• Traditionally designed to optimize floating-point operations (FLOPS)
• Key assumption: the working data set fits in main memory
• Memory bandwidth is usually high (and optimized)
• “Computationally dense”: all applications must rethink how to best use the compute resources

Data Intensive
• Must be optimized for data movement, storage, and analysis
• Data operations, not FLOPS, are what matter
• Key assumption: the working data set will not fit in memory (and may not even be available on the same machine)
• Current architectures are optimized for either media or transactional use
Compute Intensive vs. Data Intensive Computing (2)
Berkin Ozisikyilmaz, Ramanathan Narayanan, Joseph Zambreno, Gokhan Memik, and Alok N. Choudhary. An architectural characterization study of data mining and bioinformatics workloads. In IISWC, pages 61–70, 2006.

[Figure: workload characterization comparing integer vs. floating-point operation mix]
Key Take home message: Current compute architectures are not optimized for Data mining/analytic operations!
Programmers shoulder responsibility in traditional HPC environments!
[Figure: message passing (P1–P5, each with private memory) vs. shared memory (P1–P5 sharing one memory)]
• Issues related to scheduling, data distribution, synchronization, inter-process communication, etc.
• Architectural considerations: SIMD/MIMD, network topology, etc.
• OS issues: mutexes, deadlocks, etc.
Scalable Algorithms for Data Mining
• Data sizes are vast (> 100 terabytes)
• Even assuming a nominal read speed of 35 MB/s, it can take over a month just to access/read the data!
• What about answering more useful questions?
  – Number of categories
  – Types of datasets represented, etc.
• Those take even longer!!
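The “over a month” figure can be checked with back-of-the-envelope arithmetic (a sketch using the slide’s own numbers: 100 TB of data, a 35 MB/s read rate, decimal units):

```python
# How long does one sequential pass over 100 TB take at 35 MB/s?
TB = 10**12  # bytes (decimal units)
MB = 10**6   # bytes

data_bytes = 100 * TB
read_rate = 35 * MB  # bytes per second

seconds = data_bytes / read_rate
days = seconds / (60 * 60 * 24)
print(f"{days:.1f} days")  # roughly 33 days: over a month
```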
Challenges
• How do we ease access to data?
  – (Reasonably) fast and efficient
  – (Somewhat) fault-tolerant access
• How do we distribute computation?
  – Parallel programming is hard!
  – Use commodity clusters for processing
Hadoop Distributed File System (HDFS) / Google File System (GFS)
Hadoop MapReduce / Google MapReduce
MapReduce is an elegant paradigm of working with Big Data
What would you change in the underlying architectures?
• Hybrid Memory Cube (HMC)
• Non-volatile random access memory (NVRAM)
• Global address space (GAS)
Synergistic Challenges in Data Intensive and Exascale Computing (DOE ASCAC report 2013)
Let’s talk about distributing computations
[Figure: a commodity cluster: many nodes, each with its own CPU, memory, and disk, connected through a hierarchy of switches]
• Commodity clusters
• What do we do when we have supercomputers?
Programming Model for Data Mining
• Transferring data over the network takes time
• Key ideas:
  – Bring the computation close to the data
  – Replicate the data multiple times for reliability
• MapReduce:
  – Provides a storage infrastructure (@Google: GFS; @class: Hadoop HDFS)
  – Provides a programming model: a parallel paradigm that is easier than conventional MPI
MapReduce Architecture
This is not the first lecture on MapReduce…
• Material is (in part) inspired by:
  – William Cohen’s lectures (10-601 class at CMU)
  – Jure Leskovec (Stanford)
  – Aditya Prakash (Virginia Tech)
  – Cloudera
  – And many, many others!
• Materials “redrawn”, “reorganized”, and “reworked” to reflect how we use them
1st Key Idea of MapReduce: Bring computations close to the data

• Programmers specify two functions:
  – map(in_key, in_value) → list of <out_key, intermediate_value>
  – reduce(out_key, list of intermediate_value) → list of out_value
• All values with the same key are reduced together
• Let the “runtime” handle everything else: scheduling, I/O, networking, inter-process communication, etc.
Visual interpretation of MapReduce
Input:          k1 v1 | k2 v2 | k3 v3 | k4 v4 | k5 v5 | k6 v6
Map output:     (a,1) (b,1) | (c,4) (b,6) | (a,5) (d,3) | (b,5) (c,4)
Shuffle & Sort: aggregate by key: a:[1,5]  b:[1,6,5]  c:[4,4]  d:[3]
Reduce output:  (a,6) (b,12) (c,8) (d,3)
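This flow can be simulated in a few lines of Python (a local, single-process sketch of what the runtime does; the variable names are illustrative, not Hadoop’s API):

```python
from collections import defaultdict

# Map output from the four mappers in the figure above
map_output = [('a', 1), ('b', 1), ('c', 4), ('b', 6),
              ('a', 5), ('d', 3), ('b', 5), ('c', 4)]

# Shuffle & sort: aggregate values by key
groups = defaultdict(list)
for key, value in map_output:
    groups[key].append(value)

# Reduce: sum the values for each key
result = {key: sum(values) for key, values in sorted(groups.items())}
print(result)  # {'a': 6, 'b': 12, 'c': 8, 'd': 3}
```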
Other things the programmer can specify

• partition(out_key, numberOfPartitions):
  – a simple hash, e.g., hash(out_key) mod n
  – divides the key space for parallel reduce operations
• combine(out_key, list of intermediate_value) → list of <out_key, intermediate_value>:
  – a mini-reduce function that runs in memory after the map phase
  – optimizes network traffic
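Both hooks are easy to sketch locally (illustrative Python, not the Hadoop API): the partitioner maps a key to one of n reducers, and the combiner performs a per-mapper mini-reduce before anything crosses the network.

```python
from collections import defaultdict

def partition(out_key, num_partitions):
    # A simple hash partitioner: divides the key space across reducers.
    return hash(out_key) % num_partitions

def combine(mapper_output):
    # Mini-reduce run in memory after the map phase: sum counts per key
    # locally, so less data is shuffled over the network.
    partial = defaultdict(int)
    for key, value in mapper_output:
        partial[key] += value
    return list(partial.items())

# One mapper emitted (d,5) and (d,3); the combiner merges them into (d,8).
print(combine([('d', 5), ('d', 3)]))  # [('d', 8)]
```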
Now, what does MapReduce look like?

With a combiner and a partitioner added (note the third mapper now emits two d pairs):

Input:          k1 v1 | k2 v2 | k3 v3 | k4 v4 | k5 v5 | k6 v6
Map output:     (a,1) (b,1) | (c,4) (b,6) | (d,5) (d,3) | (b,5) (c,4)
Combine:        per-mapper, in-memory mini-reduce; (d,5) and (d,3) merge into (d,8)
Partition:      hash(key) mod n assigns each key to a reducer
Shuffle & Sort: aggregate by key: a:[1]  b:[1,6,5]  c:[4,4]  d:[8]
Reduce output:  (a,1) (b,12) (c,8) (d,8)
Let’s understand the MapReduce Runtime
• Scheduling: workers are assigned to map and reduce tasks
• Data distribution: move processes to the data
• Synchronization: gather, sort, and shuffle intermediate data
• Fault tolerance: detect worker failures and restart them
• Hadoop Distributed File System
Now for an example: WordCount
• Let’s look at a corpus of documents
• How do we write the algorithm?
Joe likes toast
Jane likes toast with jam
Joe burnt toast
Word Count (2)
def map(String doc_id, String text):
    for each word w in text:
        emit(w, 1)

def reduce(String term, Iterator<int> values):
    int sum = 0
    for each v in values:
        sum += v
    emit(term, sum)
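The pseudocode above translates directly into runnable Python (a single-process sketch; emit is simulated by collecting pairs, and the shuffle & sort phase is a group-by on the key):

```python
from collections import defaultdict

corpus = {
    'doc1': 'Joe likes toast',
    'doc2': 'Jane likes toast with jam',
    'doc3': 'Joe burnt toast',
}

def map_fn(doc_id, text):
    # emit (word, 1) for each word in the document
    return [(word, 1) for word in text.split()]

def reduce_fn(term, values):
    # emit (term, sum of partial counts)
    return term, sum(values)

# Run all mappers, then shuffle & sort (group by key), then all reducers.
intermediate = defaultdict(list)
for doc_id, text in corpus.items():
    for word, count in map_fn(doc_id, text):
        intermediate[word].append(count)

counts = dict(reduce_fn(term, values) for term, values in intermediate.items())
print(counts['toast'], counts['likes'], counts['Joe'])  # 3 2 2
```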
WordCount (3): Slow Motion (SloMo) Map
WordCount (4): SloMo Shuffle & Sort
Input (map output, one (word, 1) pair per occurrence):
(Joe,1) (likes,1) (toast,1)
(Jane,1) (likes,1) (toast,1) (with,1) (jam,1)
(Joe,1) (burnt,1) (toast,1)

Output (after shuffle & sort, grouped by key):
Joe: [1, 1]   Jane: [1]   likes: [1, 1]   toast: [1, 1, 1]
with: [1]     jam: [1]    burnt: [1]
WordCount (5): SloMo Reduce
Input (grouped by key):
Joe: [1, 1]   Jane: [1]   likes: [1, 1]   toast: [1, 1, 1]
with: [1]     jam: [1]    burnt: [1]

Output (summed per key):
(Joe, 2) (Jane, 1) (likes, 2) (toast, 3) (with, 1) (jam, 1) (burnt, 1)
A look under the hood: what happens when you invoke WordCount?
• The input file is split into chunks of ~64 MB each (Split 0 … Split 4)
• The user program is forked into multiple copies across the cluster; one copy becomes the master, the rest become workers
• The master task is special: it assigns the M map tasks and R reduce tasks to idle workers
• Each map worker reads the split it is assigned and writes intermediate <key, value> pairs to a buffer, which is periodically flushed to local disk
• The master notifies reduce workers about the locations of the intermediate files
• Reduce workers remotely read the intermediate data, sort it by key, and reduce it
• Final results, organized by intermediate key, are written to the output files (Output File 0, Output File 1, …), one per reduce partition
How do commodity clusters use this?
[Figure: compute nodes attached to shared NAS/SAN storage]

• Main problem: how do we handle data storage and compute together?
2nd Key Idea of MapReduce: Replicate the data multiple times for reliability

• Hadoop Distributed File System:
  – Store data (and its replicas) on the local disks of the nodes
  – Start jobs on the nodes that hold the data
• Why?
  – There is not enough RAM to hold all the data in main memory
  – Disk access is slow, but streaming disk throughput is usually acceptable
File storage design
• Files are stored as chunks (e.g., 128 MB each)
• Reliability through replication: each chunk is replicated across 3+ chunkservers
• A single master coordinates access and metadata: centralized management
• No data caching: little benefit for large, streaming reads
• Simple API
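The storage cost of this design is easy to estimate (a sketch; the 128 MB chunk size and 3x replication factor are the figures mentioned above, and the 1000 MB file size is just an example):

```python
import math

def chunk_count(file_size_mb, chunk_size_mb=128):
    # A file is stored as ceil(size / chunk_size) chunks.
    return math.ceil(file_size_mb / chunk_size_mb)

file_size_mb = 1000                 # a ~1 GB file
chunks = chunk_count(file_size_mb)  # 8 chunks
replicas = 3
stored = chunks * replicas          # 24 chunk copies on disk cluster-wide
print(chunks, stored)  # 8 24
```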
How does HDFS work?

• The NameNode stores cluster metadata
  – Files and directories are represented by inodes
  – Inodes store attributes such as permissions, etc.
• Data is stored across DataNodes
  – Replicated for reliability
WordCount Code: Main
(Other input formats: KeyValueInputFormat, SequenceFileInputFormat)
WordCount: Map Function
WordCount: Reduce Function
MapReduce Limits
• Moving data is very expensive: writing and reading are both costly
• No reduce task can start until:
  – all map tasks are done
  – the data in its partition has been shuffled and sorted
Limitations of MapReduce
• No control over the order in which reduce tasks are performed
  – The only ordering guarantee is that reduce tasks start after map tasks finish
• Assume that the map and reduce tasks will run:
  – on different machines
  – in different memory spaces
Programming Pitfalls

• Don’t create a static variable and assume that other processes can read it
  – They can’t
  – It appears that they can when run locally, but they can’t in a cluster
• Do not communicate between mappers or between reducers
  – The overhead is high
  – You don’t know which mappers/reducers are actually running at any given point
  – There’s no easy way to find out which machine they’re running on (because you shouldn’t be looking for them anyway)
Thanks to Shannon Quinn for his pointers!
Designing MapReduce Algorithms
A slightly more complex example: Term Co-occurrence
• Given a large text collection, compute a matrix over all words:
  – M = an N × N matrix (N = vocabulary size)
  – M_ij = number of times terms i and j co-occur in a sentence
• Why?
  – Distributional profiles are a way of measuring semantic distance
  – Semantic distance is important for NLP tasks
Example of a large counting problem
• Term co-occurrence matrix computation involves:
  – a large event space (number of term pairs)
  – a large number of observations (number of documents)
  – keeping track of interesting statistics about events
• Approach:
  – Mappers generate partial counts
  – Reducers aggregate the counts
First Approach: “Pairs”
• Let each mapper take a sentence:
  – Generate all co-occurring term pairs
  – For each pair (a, b), emit ((a, b), count)
• Reducers sum up the counts associated with each pair
• Use combiners to aggregate partial results
• Advantages: easy to implement and understand
• Disadvantages: the upper bound on the number of pairs is unknown
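A sketch of the “pairs” approach in Python (illustrative, not Hadoop code; co-occurrence here means appearing in the same sentence, and each ordered pair is emitted separately):

```python
from collections import defaultdict
from itertools import permutations

def pairs_map(sentence):
    # Emit ((a, b), 1) for every ordered pair of co-occurring terms.
    words = sentence.split()
    return [((a, b), 1) for a, b in permutations(words, 2)]

def pairs_reduce(pair, counts):
    # Sum the partial counts for one pair.
    return pair, sum(counts)

# Run on a tiny corpus of two sentences.
sentences = ['a b c', 'a b d']
intermediate = defaultdict(list)
for s in sentences:
    for pair, one in pairs_map(s):
        intermediate[pair].append(one)

matrix = dict(pairs_reduce(p, c) for p, c in intermediate.items())
print(matrix[('a', 'b')])  # 2: 'a' and 'b' co-occur in both sentences
```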
Second approach: “Stripes”
• Group pairs together into an associative array
• Each mapper takes a sentence:
  – Generate all co-occurring term pairs
  – For each term a, emit (a, {b: count_b, c: count_c, …})
• Reducers perform an element-wise sum of the associative arrays

Pairs view:                  Stripes view:
(a, b) → 1
(a, c) → 2
(a, d) → 5                   a → {b: 1, c: 2, d: 5, e: 3, f: 2}
(a, e) → 3
(a, f) → 2

Element-wise sum in the reducer:
a → {b: 1, d: 5, e: 3}
a → {b: 1, c: 2, d: 5, f: 2}
-------------------------------------------
a → {b: 2, c: 2, d: 10, e: 3, f: 2}

• Advantages:
  – Far less sorting and shuffling of key-value pairs
  – Can make better use of combiners
• Disadvantages:
  – More difficult to implement
  – The underlying objects are “larger” than typical intermediate results
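And a sketch of the “stripes” approach (illustrative Python): each mapper emits one associative array per term, and the reducer sums the arrays element-wise.

```python
from collections import defaultdict, Counter

def stripes_map(sentence):
    # For each term a, emit (a, {b: count_b, ...}) over its co-occurring terms.
    words = sentence.split()
    stripes = []
    for i, a in enumerate(words):
        stripe = Counter(w for j, w in enumerate(words) if j != i)
        stripes.append((a, stripe))
    return stripes

def stripes_reduce(term, stripes):
    # Element-wise sum of the associative arrays for one term.
    total = Counter()
    for stripe in stripes:
        total += stripe
    return term, total

grouped = defaultdict(list)
for s in ['a b c', 'a b d']:
    for term, stripe in stripes_map(s):
        grouped[term].append(stripe)

result = dict(stripes_reduce(t, s) for t, s in grouped.items())
print(dict(result['a']))  # {'b': 2, 'c': 1, 'd': 1}
```

Note how one stripe per term replaces many individual pair emissions, which is exactly why far fewer key-value pairs are shuffled.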
How do the runtimes compare?
Summary and To Dos
Summary
• MapReduce is a big-data programming paradigm:
  – map tasks
  – reduce tasks
• Careful consideration of data movement is required
Notes, and what to expect next

• Please form project teams as soon as possible
  – 2 is good; 3 is okay
  – More team members → higher expectations!
• Assignment 1 is due today!
• Additional notes on Hadoop have been posted on the website
• Next class:
  – Probability and statistics review basics
  – Naïve Bayes and logistic regression on Hadoop
THANK YOU!!!