A Unified Programming Model and Platform for Big Data Machine Learning & Data Mining
Yihua Huang, Ph.D., Professor
Email: [email protected]
NJU-PASA Lab for Big Data Processing
Department of Computer Science and Technology, Nanjing University
May 29, 2015, India
Slide 2
PASA Big Data Lab at Nanjing University
Our lab studies Parallel Algorithms, Systems, and Applications for Big Data processing. We are the earliest big data lab in China, having entered the big data research area in 2009. We are now contributors to Apache Spark and Tachyon.
Slide 3
Parallel Computing Models and Frameworks & Hadoop/Spark Performance Optimization
- Hadoop job and resource scheduling optimization
- Spark RDD persisting optimization
Big Data Storage and Query
- Tachyon optimization; performance benchmarking tools for Tachyon and DFS
- HBase secondary indexing (HBase + in-memory) and query system
Large-Scale Semantic Data Storage and Query
- Large-scale RDF semantic data storage and query system (HBase + in-memory)
- RDFS/OWL semantic reasoning engines on Hadoop and Spark
Machine Learning Algorithms and Systems for Big Data Analytics
- Parallel MLDM algorithm design on diversified parallel computing platforms
- Unified programming model and platform for MLDM algorithm design
Slide 4
Contents
Part 1. Parallel Algorithm Design for Machine Learning and Data Mining
Part 2. Unified Programming Model and Platform for Big Data Analytics
Slide 5
Part 1. Parallel Algorithm Design for Machine Learning and Data Mining
Slide 6
A variety of Big Data parallel computing platforms (Hadoop, Spark, MPI, etc.) are emerging. Serial machine learning algorithms cannot finish computation over large-scale datasets in acceptable time, and they do not fit any of the existing parallel computing platforms directly, so they need to be rewritten in parallel for different parallel computing platforms. Our lab entered the Big Data area in 2009, starting from writing a variety of parallel machine learning algorithms on Hadoop, Spark, etc.
Slide 7
Frequent Itemset Mining (FIM) is one of the most important and most frequently used algorithms in data mining. The Apriori algorithm is the most established algorithm for finding frequent itemsets from a transactional dataset.
Tao Xiao, Shuai Wang, Chunfeng Yuan, Yihua Huang. PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets. The Fourth International Symposium on Parallel Architectures, Algorithms and Programming (PAAP 2011), pp. 252-257, 2011.
Hongjian Qiu, Rong Gu, Chunfeng Yuan and Yihua Huang. YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark. The 3rd International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics, in conjunction with IPDPS 2014, May 23, 2014, Phoenix, USA.
Slide 8
Suppose I is an itemset consisting of items from the transaction database D. Let N be the number of transactions in D, and let M be the number of transactions that contain all the items of I. M/N is referred to as the support of I in D.
Example: here N = 4. Let I = {I1, I2}; then M = 2 because I = {I1, I2} is contained in transactions T100 and T400, so the support of I is 0.5 (2/4 = 0.5).
If sup(I) is no less than a user-defined threshold, then I is referred to as a frequent itemset.
Goal of frequent itemset mining: find all frequent k-itemsets from a transaction database (k = 1, 2, 3, ...).
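As a concrete illustration, here is a minimal Scala sketch of the support computation; the transaction contents beyond what the example above states are hypothetical.

object SupportExample {
  // support(I) = M / N, where M is the number of transactions containing every item of I
  def support(transactions: Seq[Set[String]], itemset: Set[String]): Double = {
    val m = transactions.count(t => itemset.subsetOf(t))
    m.toDouble / transactions.size
  }

  def main(args: Array[String]): Unit = {
    val db = Seq(
      Set("I1", "I2", "I5"),   // T100 (contains {I1, I2})
      Set("I2", "I4"),         // T200
      Set("I2", "I3"),         // T300
      Set("I1", "I2", "I4")    // T400 (contains {I1, I2})
    )
    println(support(db, Set("I1", "I2")))   // prints 0.5, matching the example above
  }
}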
Slide 9
Apriori algorithm: a classic frequent itemset mining algorithm. It needs multiple passes over the database. In the first pass, all frequent 1-itemsets are discovered. In each subsequent pass, frequent (k+1)-itemsets are discovered, using the frequent k-itemsets found in the previous pass as the seed from which candidate itemsets are generated. This repeats until no more frequent itemsets can be found.
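A minimal Scala sketch of the level-wise candidate-generation step described above (self-join of the frequent k-itemsets followed by the Apriori subset pruning); the function name and structure are illustrative, not the paper's exact formulation.

// Generate candidate (k+1)-itemsets from the frequent k-itemsets of the previous pass.
def genCandidates(freqK: Set[Set[String]]): Set[Set[String]] = {
  val k = freqK.headOption.map(_.size).getOrElse(0)
  val candidates =
    for {
      a <- freqK
      b <- freqK
      joined = a union b
      if joined.size == k + 1            // self-join: keep unions that grow the itemset by one item
    } yield joined
  // Apriori pruning: every k-subset of a candidate must itself be frequent
  candidates.filter(c => c.subsets(k).forall(freqK.contains))
}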
Slide 10
Apriori Algorithm [1]:
[1] Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994: 487-499.
Slide 11
The FIM process is both data-intensive and compute-intensive: transactional datasets are becoming larger and larger, iteratively trying all combinations from 1-itemsets to k-itemsets is time-consuming, and FIM needs to scan the dataset iteratively many times.
Slide 12
Apriori in MapReduce:
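The original slide shows this as a figure. As a rough sketch of the idea, each pass of the parallel Apriori can be expressed as a map step that emits (candidate, 1) pairs for every candidate contained in a transaction, and a reduce step that sums the counts and filters by the minimum support; the Scala below simulates the two phases over plain collections with illustrative names.

// One Apriori pass expressed in map/reduce style (illustrative, not the PSON implementation).
def countPass(transactions: Seq[Set[String]],
              candidates: Set[Set[String]],
              minSupport: Double): Set[Set[String]] = {
  // "map": every transaction emits (candidate, 1) for each candidate it contains
  val pairs = transactions.flatMap { t =>
    candidates.collect { case c if c.subsetOf(t) => (c, 1) }
  }
  // "reduce": sum counts per candidate and keep those meeting the support threshold
  pairs.groupBy(_._1)
    .map { case (c, cs) => (c, cs.map(_._2).sum) }
    .collect { case (c, count) if count.toDouble / transactions.size >= minSupport => c }
    .toSet
}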
Slide 13
Experimental results: PSON achieves great speedup compared to the SON algorithm.
Slide 14
The parallel Apriori algorithm with MapReduce needs to run the MapReduce job iteratively. It needs to scan the dataset repeatedly and store all the intermediate data in HDFS. As a result, the parallel Apriori algorithm with MapReduce is not efficient enough.
Slide 15
YAFIM, the Apriori algorithm implemented in the Spark model, gains about 18x speedup in our experiments. YAFIM contains two phases to find all frequent itemsets. Phase I: load the transaction dataset as a Spark RDD object and generate the frequent 1-itemsets. Phase II: iteratively generate the frequent (k+1)-itemsets from the frequent k-itemsets.
Slide 16
All transaction data reside in an RDD: load all transaction data into an RDD.
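A minimal Spark (Scala) sketch of this loading step and the Phase I counting of frequent 1-itemsets; the input path, the space-separated record format, and the support threshold are assumptions for illustration, not the original YAFIM code.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("FrequentItemsetPhase1"))
// Load each transaction line as a set of items and keep the RDD in memory for later passes.
val transactions = sc.textFile("hdfs:///path/to/transactions.txt")   // hypothetical path
  .map(_.split(" ").toSet)
  .cache()

val minCount = 100L                                                   // illustrative absolute support threshold
val freq1 = transactions
  .flatMap(items => items.map(item => (item, 1L)))                    // emit (item, 1) per occurrence
  .reduceByKey(_ + _)                                                 // count each 1-itemset
  .filter { case (_, count) => count >= minCount }                    // keep the frequent 1-itemsets
  .collect()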
Slide 17
Phase I
Slide 18
Phase II
Slide 19
Methods to speed up performance:
In-memory computing with RDDs. We make full use of RDDs and complete all computation in memory.
Sharing data with broadcast variables. We adopt the broadcast variable abstraction in Spark to reduce data transfer to tasks, as sketched below.
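For example, the candidate itemsets of the current pass can be shared with all tasks through a broadcast variable rather than being shipped with every task closure. A minimal self-contained sketch, with an illustrative candidate set and function name:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Share the current pass's candidate itemsets with all tasks via a Spark broadcast variable.
def countCandidates(sc: SparkContext,
                    transactions: RDD[Set[String]],
                    candidatesK: Set[Set[String]]): RDD[(Set[String], Long)] = {
  val bcCandidates = sc.broadcast(candidatesK)       // shipped once per executor instead of per task
  transactions
    .flatMap(t => bcCandidates.value.collect { case c if c.subsetOf(t) => (c, 1L) })
    .reduceByKey(_ + _)                              // support count of each candidate
}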
Slide 20
We ran experiments with both programs on four benchmarks [3] with different characteristics: MushRoom, T10I4D100K, Chess, and Pumsb_star, achieving about 18x speedup with Spark compared to the algorithm with MapReduce.
Slide 21
Slide 22
Slide 23
Slide 24
We also apply YAFIM to a medical text semantic analysis application and achieve a 25x speedup.
Slide 25
Basic K-Means Algorithm
Input: a dataset of N data points that need to be clustered into K clusters.
Output: K clusters.
Choose K cluster centers Centers[K] as the initial cluster centers.
Loop: for each data point P in the dataset, calculate the distance between P and each Centers[i] and assign P to the nearest cluster center; then recalculate the new Centers[K].
Repeat the loop until the cluster centers converge.
Slide 26
Pseudo code for MapReduce
class Mapper
  setup() {
    read k cluster centers Centers[K];
  }
  map(key, p)   // p is a data point
  {
    minDis = Double.MAX_VALUE; index = -1;
    for i = 0 to Centers.length {
      dis = ComputeDist(p, Centers[i]);
      if dis < minDis { minDis = dis; index = i; }
    }
    emit(Centers[index].ClusterID, (p, 1));   // emit the point under its nearest center
  }
Slide 27
Pseudo code for MapReduce
To optimize data I/O and network transfer, we can use a Combiner to reduce the number of key-value pairs emitted from a Map node.
class Combiner
  reduce(ClusterID, points = [(p1,1), (p2,1), ...])
  {
    pm = 0.0; n = points.length;
    for i = 0 to n-1
      pm += points[i];
    pm = pm / n;                 // average of the points assigned to this cluster on this node
    emit(ClusterID, (pm, n));    // local mean and count, merged into the new center by the Reducer
  }
Slide 28
Pseudo code for MapReduce
class Reducer
  reduce(ClusterID, valueList = [(pm1,n1), (pm2,n2), ...])
  {
    pm = 0.0; n = 0;
    k = length of valueList belonging to this ClusterID;
    for i = 0 to k-1 {
      pm += valueList[i].pm * valueList[i].n;
      n += valueList[i].n;
    }
    pm = pm / n;                 // calculate the new center of the cluster
    emit(ClusterID, (pm, n));    // output the new center of the cluster
  }
In the main() function of the MapReduce job, set a loop to run the MapReduce job until the centers converge, as sketched below.
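A rough Scala sketch of that driver loop, assuming hypothetical KMeansMapper/KMeansCombiner/KMeansReducer classes corresponding to the pseudo code above, a hypothetical centersConverged() helper, and illustrative paths and configuration keys:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Run one MapReduce job per iteration until the cluster centers converge.
val maxIterations = 20                                                  // illustrative cap
var iteration = 0
var converged = false
while (!converged && iteration < maxIterations) {
  val conf = new Configuration()
  conf.set("kmeans.centers.path", s"/kmeans/centers-$iteration")        // centers from the previous round
  val job = Job.getInstance(conf, s"kmeans-iteration-$iteration")
  job.setMapperClass(classOf[KMeansMapper])                             // hypothetical classes
  job.setCombinerClass(classOf[KMeansCombiner])
  job.setReducerClass(classOf[KMeansReducer])
  FileInputFormat.addInputPath(job, new Path("/kmeans/input"))
  FileOutputFormat.setOutputPath(job, new Path(s"/kmeans/centers-${iteration + 1}"))
  job.waitForCompletion(true)
  // compare old and new centers against the convergence threshold (hypothetical helper)
  converged = centersConverged(s"/kmeans/centers-$iteration", s"/kmeans/centers-${iteration + 1}")
  iteration += 1
}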
Slide 29
Scala code
while (tempDist > convergeDist && tempIter < MaxIter) {
  // determine the nearest center for each point p
  var closest = data.map(p => (closestPoint(p, kPoints), (p, 1)))
  // sum the points and counts per cluster
  var pointStats = closest.reduceByKey { case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2) }
  // the average of all points in a cluster becomes the new center
  var newPoints = pointStats.map { pair => (pair._1, pair._2._1 / pair._2._2) }.collectAsMap()
  tempDist = 0.0
  for (i <- 0 until kPoints.length) {
    tempDist += squaredDistance(kPoints(i), newPoints(i))   // squared Euclidean distance between old and new centers
  }
  for ((clusterId, center) <- newPoints) {
    kPoints(clusterId) = center
  }
  tempIter += 1
}
Slide 30
Experimental results (execution time in seconds vs. number of nodes, for the 1st iteration and the following iterations): Spark achieves about 4-5x speedup compared to MapReduce.
Peng Liu, Jiayu Teng, Yihua Huang. Study of K-Means Algorithm Parallelization Performance Based on Spark. CCF Big Data 2014.
Slide 31
Naive Bayes Basic Idea
Given m classes from the training dataset, {C1, C2, ..., Cm}, predict which class a testing sample X will belong to. By Bayes' rule, P(Ci|X) = P(X|Ci)P(Ci) / P(X), so we only need to calculate P(X|Ci)P(Ci) for each class. Suppose the attributes xk are independent of each other; then P(X|Ci) = P(x1|Ci)P(x2|Ci)...P(xn|Ci). Thus, we can count from the training samples to get both P(xj|Ci) and P(Ci).
Slide 32
Training Map Pseudo Code to calculate P(xj|Ci) and P(Ci)
class Mapper
  map(key, tr)   // tr is a training sample
  {
    tr -> trid, X, Ci                // parse out the sample id, feature vector X and class label Ci
    emit(Ci, 1)                      // count for P(Ci)
    for j = 0 to X.length {
      X[j] -> xnj, xvj               // xnj: name of attribute xj, xvj: value of xj
      emit(<Ci, xnj, xvj>, 1)        // count for P(xj|Ci)
    }
  }
Slide 33
Training Reduce Pseudo Code to calculate P(xj|Ci) and P(Ci)
class Reducer
  reduce(key, value_list)   // key: either Ci or <Ci, xnj, xvj>
  {
    sum = 0;                // count for P(xj|Ci) or P(Ci)
    while (value_list.hasNext())
      sum += value_list.next().get();
    emit(key, sum)
  }
// Trim and save the output as the P(xj|Ci) and P(Ci) tables in HDFS
Slide 34
Predict Map Pseudo Code to Predict a Test Sample
class Mapper
  setup() {
    load the P(xj|Ci) and P(Ci) data from the training stage
    FC = { (Ci, P(Ci)) }, FxC = { (<Ci, xnj, xvj>, P(xj|Ci)) }
  }
  map(key, ts)   // ts is a test sample
  {
    ts -> tsid, X
    MaxF = MIN_VALUE; idx = -1;
    for (i = 0 to FC.length) {
      FXCi = 1.0
      Ci = FC[i].Ci; FCi = FC[i].P(Ci)
      for (j = 0 to X.length) {
        xnj = X[j].xnj; xvj = X[j].xvj
        use <Ci, xnj, xvj> to scan FxC and get P(xj|Ci)
        FXCi = FXCi * P(xj|Ci);
      }
      if (FXCi * FCi > MaxF) { MaxF = FXCi * FCi; idx = i; }
    }
    emit(tsid, FC[idx].Ci)
  }
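For comparison, the training counts could also be sketched in Spark (Scala). This is not the original SparkR code shown on the next slide; the input format (comma-separated id, label, then name=value features) and all names here are assumptions.

import org.apache.spark.SparkContext

case class Sample(id: String, label: String, features: Seq[(String, String)])   // (name, value) pairs

def trainCounts(sc: SparkContext, path: String) = {
  val samples = sc.textFile(path).map { line =>
    val fields = line.split(",")
    val feats = fields.drop(2).map { f => val Array(n, v) = f.split("="); (n, v) }.toSeq
    Sample(fields(0), fields(1), feats)
  }
  val classCounts = samples.map(s => (s.label, 1L)).reduceByKey(_ + _)                  // counts for P(Ci)
  val featureCounts = samples
    .flatMap(s => s.features.map { case (n, v) => ((s.label, n, v), 1L) })              // counts for P(xj|Ci)
    .reduceByKey(_ + _)
  (classCounts, featureCounts)
}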
Slide 35
Training SparkR Code to calculate P(xj|Ci) and P(Ci): parseVector

For large-scale matrix multiplication, how to partition the matrices is critical for computation performance. We developed an automatic matrix partitioning and optimized execution algorithm that chooses among HAMA blocking, CARMA blocking, and broadcasting according to the shapes and sizes of the matrices, and then schedules the partitions for execution in parallel.
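A highly simplified Scala sketch of that kind of shape-based decision; the thresholds, strategy names, and selection rules below are illustrative assumptions, not Marlin's actual heuristics.

sealed trait MultiplyStrategy
case object Broadcast     extends MultiplyStrategy   // ship the small operand to every task
case object HamaBlocking  extends MultiplyStrategy   // grid-style blocking of both operands
case object CarmaBlocking extends MultiplyStrategy   // recursive blocking along the dominant dimension

// Pick a multiplication strategy from the shapes of A (m x k) and B (k x n).
def chooseStrategy(m: Long, k: Long, n: Long, broadcastLimit: Long = 1L << 27): MultiplyStrategy = {
  val sizeA = m * k
  val sizeB = k * n
  if (math.min(sizeA, sizeB) <= broadcastLimit) Broadcast   // one operand is small enough to broadcast
  else if (k >= math.max(m, n)) CarmaBlocking               // "long" inner dimension dominates
  else HamaBlocking                                         // otherwise block both operands in a grid
}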
Slide 64
Marlin: Optimized Distributed Matrix Multiplication with Spark (OctMatrix: Distributed Matrix Computation Lib)
Experiments cover multiplying a big matrix by a small matrix and multiplying two big matrices; Marlin achieves a 4~5x speedup compared to SparkR.
Slide 67
Marlin: Optimized Distributed Matrix Multiplication with Spark (OctMatrix: Distributed Matrix Computation Lib)
Matrix multiply, 96 partitions, executor memory 10 GB, except that case 3_5 uses 20 GB.
Slide 68
OctMatrix Data Representation and Storage
\Octopus_HOME
  \user-session-id1
    \matrix-a
      info
      row_index
      \row-data
        par1.data ... parN.data
      col_index
      \col-data
        par1.data ... parN.data
    \matrix-b
    \matrix-c
  \user-session-id2
  \user-session-id3
Matrix data can be stored in local files, HDFS, and Tachyon, and R programs can read from and write to these file systems. Matrix data is organized and stored according to the structure above.
Slide 69
Machine Learning Lib built with OctMatrix
Classification and regression: Linear Regression, Logistic Regression, Softmax, Linear Support Vector Machine (SVM)
Clustering: K-Means
Feature extraction: Deep Neural Network (Auto-Encoder)
More MLDM algorithms to come
Slide 70
How Octopus Works
Octopus uses the standard R programming platform and allows users to write and implement code for a variety of MLDM algorithms based on a large-scale matrix computation model. We have integrated Octopus with Spark, Hadoop MapReduce, and MPI, allowing seamless switching and execution on top of the underlying platforms: Spark, Hadoop MapReduce, MPI, or a single machine.
Slide 71
Octopus Features Summary
Easy-to-use, high-level user APIs: high-level matrix operators and operation APIs, similar to the Matrix/Vector operation APIs in the standard R language; they do not require low-level knowledge of distributed systems or distributed programming skills.
Write once, run anywhere: programs written with Octopus can transparently run on top of different computing engines such as Spark, Hadoop MapReduce, or MPI. Users can develop with the OctMatrix APIs on small data with a single-machine R engine for testing, then run the program on large-scale data without modifying the code. Octopus supports a number of I/O sources including Tachyon, HDFS, and local file systems.
Slide 72
Octopus Features Summary
Distributed R apply functions: Octopus offers the apply() function on OctMatrix. The parameter function is executed on each element/row/column of the OctMatrix on the cluster in parallel; the functions passed to apply() can be any R functions, including UDFs.
Machine learning algorithm library: we implemented a set of scalable machine learning algorithms and demo applications built on top of OctMatrix.
Seamless integration with the R ecosystem: Octopus offers its features in an R package called OctMatrix, so it naturally takes advantage of the rich resources of the R ecosystem.
Slide 73
Demonstrations Read/Write Octopus Matrix
Slide 74
Demonstrations A Variety of R Functions on Octopus
Slide 75
Demonstrations: Logistic Regression Training, Predicting, Testing
Changing the enginetype parameter quickly switches execution to one of the underlying platforms without modifying any other code.
Slide 76
Demonstrations K-Means Algorithm Testing
Slide 77
Demonstrations Linear Regression Algorithm Testing
Slide 78
Demonstrations Code Style Comparison between R and Octopus LR
Codes with Standard R LR Codes with Octopus
Slide 79
Demonstrations Code Style Comparison between R and Octopus
K-Means Codes with Standard R K-Means Codes with Octopus
Slide 80
Demonstrations: Algorithms with MPI and Hadoop MapReduce
Linear algebra running with MPI: start an MPI daemon to run MPI-Matrix in the background.
Slide 81
Demonstrations: Algorithms with MPI and Hadoop MapReduce
Linear algebra running with Hadoop MapReduce.
Slide 82
Octopus Project Website and Documents
http://pasa-bigdata.nju.edu.cn/octopus/
Slide 83
Project Team: Yihua Huang, Rong Gu, Zhaokang Wang, Yun Tang, Haipeng Zhan
Contact Information
Dr. Yihua Huang, Professor
NJU-PASA Big Data Lab, http://pasa-bigdata.nju.edu.cn
Department of Computer Science and Technology, Nanjing University, Nanjing, P.R. China
Email: [email protected]