A Unified Programming Model and Platform for Big Data Machine Learning & Data Mining
Yihua Huang, Ph.D., Professor
Email: [email protected]
NJU-PASA Lab for Big Data Processing
Department of Computer Science and Technology, Nanjing University
May 29, 2015, India
Slide 2
PASA Big Data Lab at Nanjing University
Our lab studies Parallel Algorithms, Systems, and Applications for Big Data processing. We are the earliest big data lab in China, having entered the big data research area in 2009. We are now contributors to Apache Spark and Tachyon.
Slide 3
Parallel Computing Models and Frameworks & Hadoop/Spark Performance Optimization
- Hadoop job and resource scheduling optimization
- Spark RDD persisting optimization
Big Data Storage and Query
- Tachyon optimization; performance benchmarking tools for Tachyon and DFS
- HBase secondary indexing (HBase + in-memory) and query system
Large-Scale Semantic Data Storage and Query
- Large-scale RDF semantic data storage and query system (HBase + in-memory)
- RDFS/OWL semantic reasoning engines on Hadoop and Spark
Machine Learning Algorithms and Systems for Big Data Analytics
- Parallel MLDM algorithm design on diversified parallel computing platforms
- Unified programming model and platform for MLDM algorithm design
Slide 4
Contents
Part 1. Parallel Algorithm Design for Machine Learning and Data Mining
Part 2. Unified Programming Model and Platform for Big Data Analytics
Slide 5
Part 1. Parallel Algorithm Design for Machine Learning and Data Mining
Slide 6
A variety of Big Data parallel computing platforms (Hadoop, Spark, MPI, etc.) are emerging. Serial machine learning algorithms cannot finish computation over large-scale datasets in acceptable time, and they do not fit any of the existing parallel computing platforms directly, so they need to be rewritten in parallel for different parallel computing platforms. Our lab entered the Big Data area in 2009, starting from writing a variety of parallel machine learning algorithms on Hadoop, Spark, etc.
Slide 7
Frequent Itemset Mining (FIM) is one of the most important and most frequently used algorithms in data mining. The Apriori algorithm is the most established algorithm for finding frequent itemsets from a transactional dataset.
Tao Xiao, Shuai Wang, Chunfeng Yuan, Yihua Huang. PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets. The Fourth International Symposium on Parallel Architectures, Algorithms and Programming (PAAP 2011), pp. 252-257, 2011.
Hongjian Qiu, Rong Gu, Chunfeng Yuan and Yihua Huang. YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark. The 3rd International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics, in conjunction with IPDPS 2014, May 23, 2014, Phoenix, USA.
Slide 8
Suppose I is an itemset consisting of items from the transaction database D. Let N be the number of transactions in D, and let M be the number of transactions that contain all the items of I. M/N is referred to as the support of I in D.
Example: here N = 4. Let I = {I1, I2}; then M = 2 because I = {I1, I2} is contained in transactions T100 and T400, so the support of I is 0.5 (2/4 = 0.5).
If sup(I) is no less than a user-defined threshold, then I is referred to as a frequent itemset.
Goal of frequent itemset mining: find all frequent k-itemsets from a transaction database (k = 1, 2, 3, ...).
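As a concrete illustration, here is a minimal Scala sketch of the support computation; the transaction contents beyond what the example above states are hypothetical.

object SupportExample {
  // support(I) = M / N, where M is the number of transactions containing every item of I
  def support(transactions: Seq[Set[String]], itemset: Set[String]): Double = {
    val m = transactions.count(t => itemset.subsetOf(t))
    m.toDouble / transactions.size
  }

  def main(args: Array[String]): Unit = {
    val db = Seq(
      Set("I1", "I2", "I5"),   // T100 (contains {I1, I2})
      Set("I2", "I4"),         // T200
      Set("I2", "I3"),         // T300
      Set("I1", "I2", "I4")    // T400 (contains {I1, I2})
    )
    println(support(db, Set("I1", "I2")))   // prints 0.5, matching the example above
  }
}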
Slide 9
Apriori algorithm: a classic frequent itemset mining algorithm. It needs multiple passes over the database. In the first pass, all frequent 1-itemsets are discovered. In each subsequent pass, frequent (k+1)-itemsets are discovered, using the frequent k-itemsets found in the previous pass as the seed from which candidate itemsets are generated. This repeats until no more frequent itemsets can be found.
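A minimal Scala sketch of the level-wise candidate-generation step described above (self-join of the frequent k-itemsets followed by the Apriori subset pruning); the function name and structure are illustrative, not the paper's exact formulation.

// Generate candidate (k+1)-itemsets from the frequent k-itemsets of the previous pass.
def genCandidates(freqK: Set[Set[String]]): Set[Set[String]] = {
  val k = freqK.headOption.map(_.size).getOrElse(0)
  val candidates =
    for {
      a <- freqK
      b <- freqK
      joined = a union b
      if joined.size == k + 1            // self-join: keep unions that grow the itemset by one item
    } yield joined
  // Apriori pruning: every k-subset of a candidate must itself be frequent
  candidates.filter(c => c.subsets(k).forall(freqK.contains))
}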
Slide 10
Apriori Algorithm [1]:
[1] Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499.
Slide 11
The FIM process is both data-intensive and compute-intensive: transactional datasets are becoming larger and larger, iteratively trying all combinations from 1-itemsets to k-itemsets is time-consuming, and FIM needs to scan the dataset iteratively many times.
Slide 12
Apriori in MapReduce:
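The original slide shows this as a figure. As a rough sketch of the idea, each pass of the parallel Apriori can be expressed as a map step that emits (candidate, 1) pairs for every candidate contained in a transaction, and a reduce step that sums the counts and filters by the minimum support; the Scala below simulates the two phases over plain collections with illustrative names.

// One Apriori pass expressed in map/reduce style (illustrative, not the PSON implementation).
def countPass(transactions: Seq[Set[String]],
              candidates: Set[Set[String]],
              minSupport: Double): Set[Set[String]] = {
  // "map": every transaction emits (candidate, 1) for each candidate it contains
  val pairs = transactions.flatMap { t =>
    candidates.collect { case c if c.subsetOf(t) => (c, 1) }
  }
  // "reduce": sum counts per candidate and keep those meeting the support threshold
  pairs.groupBy(_._1)
    .map { case (c, cs) => (c, cs.map(_._2).sum) }
    .collect { case (c, count) if count.toDouble / transactions.size >= minSupport => c }
    .toSet
}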
Slide 13
Experimental results: PSON achieves great speedup compared to the SON algorithm.
Slide 14
The parallel Apriori algorithm with MapReduce needs to run the MapReduce job iteratively. It needs to scan the dataset repeatedly and store all the intermediate data in HDFS. As a result, the parallel Apriori algorithm with MapReduce is not efficient enough.
Slide 15
YAFIM, the Apriori algorithm implemented in the Spark model, gains about 18x speedup in our experiments. YAFIM contains two phases to find all frequent itemsets. Phase I: load the transaction dataset as a Spark RDD object and generate the frequent 1-itemsets. Phase II: iteratively generate the frequent (k+1)-itemsets from the frequent k-itemsets.
Slide 16
All transaction data reside in an RDD: load all transaction data into an RDD.
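A minimal Spark (Scala) sketch of this loading step and the Phase I counting of frequent 1-itemsets; the input path, the space-separated record format, and the support threshold are assumptions for illustration, not the original YAFIM code.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("FrequentItemsetPhase1"))
// Load each transaction line as a set of items and keep the RDD in memory for later passes.
val transactions = sc.textFile("hdfs:///path/to/transactions.txt")   // hypothetical path
  .map(_.split(" ").toSet)
  .cache()

val minCount = 100L                                                   // illustrative absolute support threshold
val freq1 = transactions
  .flatMap(items => items.map(item => (item, 1L)))                    // emit (item, 1) per occurrence
  .reduceByKey(_ + _)                                                 // count each 1-itemset
  .filter { case (_, count) => count >= minCount }                    // keep the frequent 1-itemsets
  .collect()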
Slide 17
Phase I
Slide 18
Phase II
Slide 19
Methods to speed up performance:
In-memory computing with RDDs. We make full use of RDDs and complete all computation in memory.
Sharing data with broadcast variables. We adopt the broadcast variable abstraction in Spark to reduce data transfer to tasks, as sketched below.
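For example, the candidate itemsets of the current pass can be shared with all tasks through a broadcast variable rather than being shipped with every task closure. A minimal self-contained sketch, with an illustrative candidate set and function name:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Share the current pass's candidate itemsets with all tasks via a Spark broadcast variable.
def countCandidates(sc: SparkContext,
                    transactions: RDD[Set[String]],
                    candidatesK: Set[Set[String]]): RDD[(Set[String], Long)] = {
  val bcCandidates = sc.broadcast(candidatesK)       // shipped once per executor instead of per task
  transactions
    .flatMap(t => bcCandidates.value.collect { case c if c.subsetOf(t) => (c, 1L) })
    .reduceByKey(_ + _)                              // support count of each candidate
}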
Slide 20
We ran experiments with both programs on four benchmarks [3] with different characteristics: MushRoom, T10I4D100K, Chess, and Pumsb_star, achieving about 18x speedup with Spark compared to the algorithm with MapReduce.
Slide 21
Slide 22
Slide 23
Slide 24
We also apply YAFIM to a medical text semantic analysis application and achieve a 25x speedup.
Slide 25
Basic K-Means Algorithm
Input: a dataset of N data points that need to be clustered into K clusters.
Output: K clusters.
Choose K cluster centers Centers[K] as the initial cluster centers.
Loop: for each data point P in the dataset, calculate the distance between P and each Centers[i] and assign P to the nearest cluster center; then recalculate the new Centers[K].
Repeat the loop until the cluster centers converge.
Slide 26
Pseudo code for MapReduce
class Mapper
  setup() {
    read k cluster centers Centers[K];
  }
  map(key, p)   // p is a data point
  {
    minDis = Double.MAX_VALUE; index = -1;
    for i = 0 to Centers.length {
      dis = ComputeDist(p, Centers[i]);
      if dis < minDis { minDis = dis; index = i; }
    }
    emit(Centers[index].ClusterID, (p, 1));   // emit the point under its nearest center
  }
Slide 27
Pseudo code for MapReduce
To optimize data I/O and network transfer, we can use a Combiner to reduce the number of key-value pairs emitted from a Map node.
class Combiner
  reduce(ClusterID, points = [(p1,1), (p2,1), ...])
  {
    pm = 0.0; n = points.length;
    for i = 0 to n-1
      pm += points[i];
    pm = pm / n;                 // average of the points assigned to this cluster on this node
    emit(ClusterID, (pm, n));    // local mean and count, merged into the new center by the Reducer
  }
Slide 28
Pseudo code for MapReduce
class Reducer
  reduce(ClusterID, valueList = [(pm1,n1), (pm2,n2), ...])
  {
    pm = 0.0; n = 0;
    k = length of valueList belonging to this ClusterID;
    for i = 0 to k-1 {
      pm += valueList[i].pm * valueList[i].n;
      n += valueList[i].n;
    }
    pm = pm / n;                 // calculate the new center of the cluster
    emit(ClusterID, (pm, n));    // output the new center of the cluster
  }
In the main() function of the MapReduce job, set a loop to run the MapReduce job until the centers converge, as sketched below.
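A rough Scala sketch of that driver loop, assuming hypothetical KMeansMapper/KMeansCombiner/KMeansReducer classes corresponding to the pseudo code above, a hypothetical centersConverged() helper, and illustrative paths and configuration keys:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Run one MapReduce job per iteration until the cluster centers converge.
val maxIterations = 20                                                  // illustrative cap
var iteration = 0
var converged = false
while (!converged && iteration < maxIterations) {
  val conf = new Configuration()
  conf.set("kmeans.centers.path", s"/kmeans/centers-$iteration")        // centers from the previous round
  val job = Job.getInstance(conf, s"kmeans-iteration-$iteration")
  job.setMapperClass(classOf[KMeansMapper])                             // hypothetical classes
  job.setCombinerClass(classOf[KMeansCombiner])
  job.setReducerClass(classOf[KMeansReducer])
  FileInputFormat.addInputPath(job, new Path("/kmeans/input"))
  FileOutputFormat.setOutputPath(job, new Path(s"/kmeans/centers-${iteration + 1}"))
  job.waitForCompletion(true)
  // compare old and new centers against the convergence threshold (hypothetical helper)
  converged = centersConverged(s"/kmeans/centers-$iteration", s"/kmeans/centers-${iteration + 1}")
  iteration += 1
}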
Slide 29
Scala code
while (tempDist > convergeDist && tempIter < MaxIter) {
  // determine the nearest center for each point p
  var closest = data.map(p => (closestPoint(p, kPoints), (p, 1)))
  // sum the points and counts per cluster
  var pointStats = closest.reduceByKey { case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2) }
  // the average of all points in a cluster becomes the new center
  var newPoints = pointStats.map { pair => (pair._1, pair._2._1 / pair._2._2) }.collectAsMap()
  tempDist = 0.0
  for (i <- 0 until kPoints.length) {
    tempDist += squaredDistance(kPoints(i), newPoints(i))   // squared Euclidean distance between old and new centers
  }
  for ((clusterId, center) <- newPoints) {
    kPoints(clusterId) = center
  }
  tempIter += 1
}
Slide 30
Experimental results (execution time in seconds vs. number of nodes, for the 1st iteration and the following iterations): Spark achieves about 4-5x speedup compared to MapReduce.
Peng Liu, Jiayu Teng, Yihua Huang. Study of K-Means Algorithm Parallelization Performance Based on Spark. CCF Big Data 2014.
Slide 31
Naive Bayes Basic Idea
Given m classes from the training dataset, {C1, C2, ..., Cm}, predict which class a testing sample X will belong to. By Bayes' rule, P(Ci|X) = P(X|Ci)P(Ci) / P(X), so we only need to calculate P(X|Ci)P(Ci) for each class. Suppose the attributes xk are independent of each other; then P(X|Ci) = P(x1|Ci)P(x2|Ci)...P(xn|Ci). Thus, we can count from the training samples to get both P(xj|Ci) and P(Ci).
Slide 32
Training Map Pseudo Code to calculate P(xj|Ci) and P(Ci)
class Mapper
  map(key, tr)   // tr is a training sample
  {
    tr -> trid, X, Ci                // parse out the sample id, feature vector X and class label Ci
    emit(Ci, 1)                      // count for P(Ci)
    for j = 0 to X.length {
      X[j] -> xnj, xvj               // xnj: name of attribute xj, xvj: value of xj
      emit(<Ci, xnj, xvj>, 1)        // count for P(xj|Ci)
    }
  }
Slide 33
Training Reduce Pseudo Code to calculate P(xj|Ci) and P(Ci)
class Reducer
  reduce(key, value_list)   // key: either Ci or <Ci, xnj, xvj>
  {
    sum = 0;                // count for P(xj|Ci) or P(Ci)
    while (value_list.hasNext())
      sum += value_list.next().get();
    emit(key, sum)
  }
// Trim and save the output as the P(xj|Ci) and P(Ci) tables in HDFS
Slide 34
Predict Map Pseudo Code to Predict a Test Sample
class Mapper
  setup() {
    load the P(xj|Ci) and P(Ci) data from the training stage
    FC = { (Ci, P(Ci)) }, FxC = { (<Ci, xnj, xvj>, P(xj|Ci)) }
  }
  map(key, ts)   // ts is a test sample
  {
    ts -> tsid, X
    MaxF = MIN_VALUE; idx = -1;
    for (i = 0 to FC.length) {
      FXCi = 1.0
      Ci = FC[i].Ci; FCi = FC[i].P(Ci)
      for (j = 0 to X.length) {
        xnj = X[j].xnj; xvj = X[j].xvj
        use <Ci, xnj, xvj> to scan FxC and get P(xj|Ci)
        FXCi = FXCi * P(xj|Ci);
      }
      if (FXCi * FCi > MaxF) { MaxF = FXCi * FCi; idx = i; }
    }
    emit(tsid, FC[idx].Ci)
  }
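For comparison, the training counts could also be sketched in Spark (Scala). This is not the original SparkR code shown on the next slide; the input format (comma-separated id, label, then name=value features) and all names here are assumptions.

import org.apache.spark.SparkContext

case class Sample(id: String, label: String, features: Seq[(String, String)])   // (name, value) pairs

def trainCounts(sc: SparkContext, path: String) = {
  val samples = sc.textFile(path).map { line =>
    val fields = line.split(",")
    val feats = fields.drop(2).map { f => val Array(n, v) = f.split("="); (n, v) }.toSeq
    Sample(fields(0), fields(1), feats)
  }
  val classCounts = samples.map(s => (s.label, 1L)).reduceByKey(_ + _)                  // counts for P(Ci)
  val featureCounts = samples
    .flatMap(s => s.features.map { case (n, v) => ((s.label, n, v), 1L) })              // counts for P(xj|Ci)
    .reduceByKey(_ + _)
  (classCounts, featureCounts)
}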
Slide 35
Training SparkR Code to calculate P(xj|Ci) and P(Ci): parseVector

For large-scale matrix multiplication, how to partition the matrices is critical for computation performance. We developed an automatic matrix partitioning and optimized execution algorithm that chooses among HAMA blocking, CARMA blocking, and broadcasting according to the shapes and sizes of the matrices, and then schedules the partitions for execution in parallel.
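A highly simplified Scala sketch of that kind of shape-based decision; the thresholds, strategy names, and selection rules below are illustrative assumptions, not Marlin's actual heuristics.

sealed trait MultiplyStrategy
case object Broadcast     extends MultiplyStrategy   // ship the small operand to every task
case object HamaBlocking  extends MultiplyStrategy   // grid-style blocking of both operands
case object CarmaBlocking extends MultiplyStrategy   // recursive blocking along the dominant dimension

// Pick a multiplication strategy from the shapes of A (m x k) and B (k x n).
def chooseStrategy(m: Long, k: Long, n: Long, broadcastLimit: Long = 1L << 27): MultiplyStrategy = {
  val sizeA = m * k
  val sizeB = k * n
  if (math.min(sizeA, sizeB) <= broadcastLimit) Broadcast   // one operand is small enough to broadcast
  else if (k >= math.max(m, n)) CarmaBlocking               // "long" inner dimension dominates
  else HamaBlocking                                         // otherwise block both operands in a grid
}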
Slide 64
Marlin: Optimized Distributed Matrix Multiplication with Spark (OctMatrix: Distributed Matrix Computation Lib)
Experiments cover multiplying a big matrix by a small matrix and multiplying two big matrices; Marlin achieves a 4~5x speedup compared to SparkR.
Slide 67
Marlin: Optimized Distributed Matrix Multiplication with Spark (OctMatrix: Distributed Matrix Computation Lib)
Matrix multiply, 96 partitions, executor memory 10 GB, except that case 3_5 uses 20 GB.
Slide 68
OctMatrix Data Representation and Storage
\Octopus_HOME
  \user-session-id1
    \matrix-a
      info
      row_index
      \row-data
        par1.data ... parN.data
      col_index
      \col-data
        par1.data ... parN.data
    \matrix-b
    \matrix-c
  \user-session-id2
  \user-session-id3
Matrix data can be stored in local files, HDFS, and Tachyon, and R programs can read from and write to these file systems. Matrix data is organized and stored according to the structure above.
Slide 69
Machine Learning Lib built with OctMatrix
Classification and regression: Linear Regression, Logistic Regression, Softmax, Linear Support Vector Machine (SVM)
Clustering: K-Means
Feature extraction: Deep Neural Network (Auto-Encoder)
More MLDM algorithms to come
Slide 70
How Octopus Works
Octopus uses the standard R programming platform and allows users to write and implement code for a variety of MLDM algorithms based on a large-scale matrix computation model. We have integrated Octopus with Spark, Hadoop MapReduce, and MPI, allowing seamless switching and execution on top of the underlying platforms: Spark, Hadoop MapReduce, MPI, or a single machine.
Slide 71
Octopus Features Summary
Easy-to-use, high-level user APIs: high-level matrix operators and operation APIs, similar to the Matrix/Vector operation APIs in the standard R language; they do not require low-level knowledge of distributed systems or distributed programming skills.
Write once, run anywhere: programs written with Octopus can transparently run on top of different computing engines such as Spark, Hadoop MapReduce, or MPI. Users can develop with the OctMatrix APIs on small data with a single-machine R engine for testing, then run the program on large-scale data without modifying the code. Octopus supports a number of I/O sources including Tachyon, HDFS, and local file systems.
Slide 72
Octopus Features Summary
Distributed R apply functions: Octopus offers the apply() function on OctMatrix. The parameter function is executed on each element/row/column of the OctMatrix on the cluster in parallel; the functions passed to apply() can be any R functions, including UDFs.
Machine learning algorithm library: we implemented a set of scalable machine learning algorithms and demo applications built on top of OctMatrix.
Seamless integration with the R ecosystem: Octopus offers its features in an R package called OctMatrix, so it naturally takes advantage of the rich resources of the R ecosystem.
Slide 73
Demonstrations Read/Write Octopus Matrix
Slide 74
Demonstrations A Variety of R Functions on Octopus
Slide 75
Demonstrations: Logistic Regression Training, Predicting, Testing
Changing the enginetype parameter quickly switches execution to one of the underlying platforms without modifying any other code.
Slide 76
Demonstrations K-Means Algorithm Testing
Slide 77
Demonstrations Linear Regression Algorithm Testing
Slide 78
Demonstrations Code Style Comparison between R and Octopus LR
Codes with Standard R LR Codes with Octopus
Slide 79
Demonstrations Code Style Comparison between R and Octopus
K-Means Codes with Standard R K-Means Codes with Octopus
Slide 80
Demonstrations: Algorithms with MPI and Hadoop MapReduce
Linear algebra running with MPI: start an MPI daemon to run MPI-Matrix in the background.
Slide 81
Demonstrations: Algorithms with MPI and Hadoop MapReduce
Linear algebra running with Hadoop MapReduce.
Slide 82
Octopus Project Website and Documents
http://pasa-bigdata.nju.edu.cn/octopus/
Slide 83
Project Team: Yihua Huang, Rong Gu, Zhaokang Wang, Yun Tang, Haipeng Zhan
Contact Information
Dr. Yihua Huang, Professor
NJU-PASA Big Data Lab, http://pasa-bigdata.nju.edu.cn
Department of Computer Science and Technology, Nanjing University, Nanjing, P.R. China
Email: [email protected]