MLI - An API for Distributed Machine Learning
Sarang Dev
MLI - API
● Simplifies the development of high-performance, scalable, distributed ML algorithms.
● Targets common ML tasks such as data loading, feature extraction, and model training.
● Usability: comparable to MATLAB and R.
● Scalability: matches low-level systems like GraphLab and Vowpal Wabbit.
Big Picture: MLBase
● ML Optimizer: this layer aims to automate the task of ML pipeline construction.
● MLI: an experimental API for feature extraction and algorithm development that introduces high-level ML programming abstractions.
● MLlib: Apache Spark's distributed ML library. Many features in MLlib were borrowed from ML Optimizer and MLI; it is maintained by the Spark community.
Installing MLI
● https://github.com/amplab/MLI
● Java 7 (not compatible with Java 8)
● Scala 2.9.3
● Spark 0.8.0
● Needs some changes in the build files to compile:
https://drive.google.com/open?id=0B64IP8kXPIDpTVE0NmFaanFWOUU
● Uses sbt (the interactive build tool) for building. Run these commands at the sbt prompt:
>compile
>assembly (makes a jar in the target directory)
MLI Interfaces
MLTable
● MLTable is an object that provides a familiar table-like interface to the developer and is designed to mimic a SQL table.
● It is the interface for processing semi-structured, mixed-type data.
● Once the data is featurized, it can be cast into an MLNumericTable, a convenience type that most ML algorithms expect as input (a minimal loading sketch follows).
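A minimal sketch of loading data into an MLTable, assuming the spark-shell setup shown later in this deck (sc is the shell's SparkContext; the file path is illustrative):

import mli.interface._

// Wrap the existing SparkContext in an MLContext and load a text file
// into an MLTable, caching it for repeated passes over the data.
val mc = new MLContext(sc)
val table = mc.loadFile("/path/to/sample.txt").cache()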
MLI Interfaces
LocalMatrix
● LocalMatrix provides linear algebra primitives on partitions of the data. The partitions are determined automatically by the system.

Optimizer, Algorithm, and Model
● Algorithms are implemented against the Algorithm interface and return a model conforming to the Model interface (see the sketch below).
● Optimization techniques are used to converge to an approximate solution while iterating over the data.
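As a rough illustration of this contract, a minimal sketch (the trait names and signatures here are illustrative, not MLI's exact definitions):

// Illustrative sketch only; MLI's actual traits differ in detail.
trait Model {
  // Score a single featurized row.
  def predict(features: Array[Double]): Double
}
trait Algorithm {
  // Iterate over (features, label) pairs, typically driving an optimizer
  // to an approximate solution, and return a fitted Model.
  def train(data: Seq[(Array[Double], Double)]): Model
}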
Using MLI
● ADD_JARS=<path to mli jar> spark-shell

We can perform all of these tasks in a spark-shell, which always has an initialized SparkContext (sc).

import mli.feat._
import mli.interface._

val mc = new MLContext(sc)
val inputTable = mc.loadFile("/home/sarang/Downloads/sample.txt").cache() // MLTable

// c is the column on which we want to perform N-gram extraction
// n is the N-gram length, e.g., n=2 corresponds to bigrams
// k is the number of top N-grams we want to use (sorted by N-gram frequency)
val (featurizedData, ngfeaturizer) = NGrams.extractNGrams(inputTable, c = 0, n = 2, k = 10, stopWords = NGrams.stopWords)
val (scaledData, featurizer) = Scale.scale(featurizedData.filter(_.nonZeros.length > 5).cache(), 0, ngfeaturizer)
Spark
Engine for large-scale data processing (a minimal example follows this list).
● Speed: run programs up to 100x faster than Hadoop MapReduce in memory.
● Ease of use: write applications in Java, Scala, Python, or R.
● Libraries: Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.
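A minimal Spark example in Scala (assumes a spark-shell, which provides the SparkContext sc; the input path is illustrative):

// Classic word count: split lines into words, count each word in parallel.
val lines = sc.textFile("/path/to/input.txt")
val counts = lines
  .flatMap(_.split(" "))   // split each line into words
  .map(word => (word, 1))  // pair each word with a count of 1
  .reduceByKey(_ + _)      // sum the counts per word across the cluster
counts.take(5).foreach(println)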
MLlib
MLlib is a standard component of Spark providing machine learning primitives on top of Spark.
● Scalability
● Performance
● User-friendly APIs
● Integration with Spark and its other components
● Support for Java, Scala, Python
MLlib
● Classification: logistic regression, naive Bayes, ... (a short sketch follows this list)
● Regression: generalized linear regression, isotonic regression, ...
● Decision trees, random forests, and gradient-boosted trees
● Recommendation: alternating least squares (ALS)
● Clustering: K-means, Gaussian mixtures (GMMs), ...
● Topic modeling: latent Dirichlet allocation (LDA)
● Feature transformations: standardization, normalization, hashing, ...
● Model evaluation and hyper-parameter tuning
● ML Pipeline construction
● ML persistence: saving and loading models and Pipelines
● Survival analysis: accelerated failure time model
● Frequent itemset and sequential pattern mining: FP-growth, association rules, PrefixSpan
● Distributed linear algebra: singular value decomposition (SVD), principal component analysis (PCA), ...
● Statistics: summary statistics, hypothesis testing, ...
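As a taste of the classification APIs above, a minimal RDD-based sketch (assumes a spark-shell providing sc; the two toy points are illustrative):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Two toy labeled points; real data would be an RDD loaded from storage.
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.1)),
  LabeledPoint(1.0, Vectors.dense(2.0, 1.0))
))

// Fit a binary logistic regression model and score a point.
val lrModel = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)
println(lrModel.predict(Vectors.dense(2.0, 1.0))) // predicted class label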
Data Types in MLlib
● Local vector: has integer-typed, 0-based indices and double-typed values, and is stored on a single machine.
● Labeled point: a local vector, either dense or sparse, associated with a label/response.
E.g., val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
(A fuller sketch of these types follows this list.)
● Local matrix● Distributed matrix
– RowMatrix
– IndexedRowMatrix
– CoordinateMatrix
– BlockMatrix
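A short sketch of these types in Scala (assumes a spark-shell providing sc):

import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.regression.LabeledPoint

// Dense and sparse local vectors holding the same data (1.0, 0.0, 3.0).
val dense = Vectors.dense(1.0, 0.0, 3.0)
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// A labeled point: label 1.0 (positive class) plus a feature vector.
val pos = LabeledPoint(1.0, dense)

// A 2x3 local matrix, stored column-major on a single machine.
val local = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

// A distributed RowMatrix: an RDD of local vectors, one vector per row.
val mat = new RowMatrix(sc.parallelize(Seq(dense, sparse)))
println(s"${mat.numRows()} x ${mat.numCols()}")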
DataFrames in Spark SQL
A Dataset is a distributed collection of data. The Dataset API was added in Spark 1.6 and is available in Scala and Java; Python does not support it.
A DataFrame can serve the same role as the MLTable interface in MLI (a featurization sketch follows the example below).
val sentenceData = spark.createDataFrame(Seq(
(0, "Hi I heard about Spark"),
(0, "I wish Java could use case classes"),
(1, "Logistic regression models are neat")
)).toDF("label", "sentence")
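Building on sentenceData above, a short featurization sketch using spark.ml transformers (the numFeatures value of 20 is just for illustration):

import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Split each sentence into words, then hash the words into a fixed-size
// term-frequency feature vector, much like MLI's N-gram featurization.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(20)
val featurized = hashingTF.transform(wordsData)
featurized.select("label", "features").show()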
Using MLlib

● Example:
K-means clustering: partition n observations into k clusters such that each observation belongs to the cluster with the nearest mean.
● Implemented below in PySpark
● Implemented below in Scala
K-Means in PySpark

from numpy import array
from math import sqrt
from pyspark import SparkContext, SparkConf
from pyspark.mllib.clustering import KMeans, KMeansModel
conf = SparkConf().setAppName("KMeans").setMaster("local")
sc = SparkContext(conf=conf)
# Load and parse the data
data = sc.textFile("/home/sarang/Downloads/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10,
                        runs=10, initializationMode="random")
# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
center = clusters.centers[clusters.predict(point)]
return sqrt(sum([x**2 for x in (point - center)]))
WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
# Save and load model
clusters.save(sc, "/home/sarang/KMeansModel")
sameModel = KMeansModel.load(sc, "/home/sarang/KMeansModel")
We can also run these commands interactively in the pyspark shell instead of submitting a script.
K-Means in Scala
import org.apache.spark.ml.clustering.KMeans
// In spark-shell, `spark` is a pre-initialized SparkSession.
val dataset = spark.read.format("libsvm").load("/home/sarang/Downloads/kmeans_data1.txt")
// Trains a k-means model.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)
// Evaluate clustering by computing the Within Set Sum of Squared Errors.
val WSSSE = model.computeCost(dataset)
println(s"Within Set Sum of Squared Errors = $WSSSE")

// Print the resulting cluster centers.
model.clusterCenters.foreach(println)
Conclusion
● MLI is outdated; most of its features have been incorporated into MLlib.
● MLlib is a powerful tool for machine learning at scale.
References
● MLI Tutorial: http://ampcamp.berkeley.edu/3/exercises/mli-document-categorization.html
● MLlib: http://spark.apache.org/docs/latest/mllib-guide.html