MLI - An API for Distributed Machine Learning
Sarang Dev
MLI - API
● Simplifies the development of high-performance, scalable, distributed ML algorithms.
● Targets common ML tasks such as data loading, feature extraction, and model training.
● Usability: comparable to MATLAB and R.
● Scalability: matches low-level systems like GraphLab and Vowpal Wabbit.
Big Picture: MLBase
● ML Optimizer: this layer aims to automate the task of ML pipeline construction.
● MLI: an experimental API for feature extraction and algorithm development that introduces high-level ML programming abstractions.
● MLlib: Apache Spark's distributed ML library. Many features in MLlib were borrowed from ML Optimizer and MLI; it is maintained by the Spark community.
Installing MLI
● https://github.com/amplab/MLI
● Java 7 (not compatible with Java 8)
● Scala 2.9.3
● Spark 0.8.0
● Needs some changes in the build files to compile:
https://drive.google.com/open?id=0B64IP8kXPIDpTVE0NmFaanFWOUU
● Uses sbt (the interactive build tool) for building. Run these commands at the sbt prompt:
>compile
>assembly (makes a jar in the target directory)
MLI Interfaces
MLTable
● MLTable is an object that provides a familiar table-like interface to the developer and is designed to mimic a SQL table.
● It is the interface for processing semi-structured, mixed-type data.
● Once the data is featurized, it can be cast into an MLNumericTable, a convenience type that most ML algorithms expect as input (a minimal loading sketch follows).
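A minimal sketch of loading data into an MLTable, assuming the spark-shell setup shown later in this deck (sc is the shell's SparkContext; the file path is illustrative):

import mli.interface._

// Wrap the existing SparkContext in an MLContext and load a text file
// into an MLTable, caching it for repeated passes over the data.
val mc = new MLContext(sc)
val table = mc.loadFile("/path/to/sample.txt").cache()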
MLI Interfaces
LocalMatrix
● LocalMatrix provides linear algebra primitives on partitions of the data. The partitions are determined automatically by the system.

Optimizer, Algorithm, and Model
● Algorithms are implemented against the Algorithm interface and return a model conforming to the Model interface (see the sketch below).
● Optimization techniques are used to converge to an approximate solution while iterating over the data.
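As a rough illustration of this contract, a minimal sketch (the trait names and signatures here are illustrative, not MLI's exact definitions):

// Illustrative sketch only; MLI's actual traits differ in detail.
trait Model {
  // Score a single featurized row.
  def predict(features: Array[Double]): Double
}
trait Algorithm {
  // Iterate over (features, label) pairs, typically driving an optimizer
  // to an approximate solution, and return a fitted Model.
  def train(data: Seq[(Array[Double], Double)]): Model
}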
Using MLI
● ADD_JARS=<path to mli jar> spark-shell

We can perform all of these tasks in a spark-shell, which always has an initialized SparkContext (sc).

import mli.feat._
import mli.interface._

val mc = new MLContext(sc)
val inputTable = mc.loadFile("/home/sarang/Downloads/sample.txt").cache() // MLTable

// c is the column on which we want to perform N-gram extraction
// n is the N-gram length, e.g., n=2 corresponds to bigrams
// k is the number of top N-grams we want to use (sorted by N-gram frequency)
val (featurizedData, ngfeaturizer) = NGrams.extractNGrams(inputTable, c = 0, n = 2, k = 10, stopWords = NGrams.stopWords)
val (scaledData, featurizer) = Scale.scale(featurizedData.filter(_.nonZeros.length > 5).cache(), 0, ngfeaturizer)
Spark
Engine for large-scale data processing (a minimal example follows this list).
● Speed: run programs up to 100x faster than Hadoop MapReduce in memory.
● Ease of use: write applications in Java, Scala, Python, or R.
● Libraries: Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.
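A minimal Spark example in Scala (assumes a spark-shell, which provides the SparkContext sc; the input path is illustrative):

// Classic word count: split lines into words, count each word in parallel.
val lines = sc.textFile("/path/to/input.txt")
val counts = lines
  .flatMap(_.split(" "))   // split each line into words
  .map(word => (word, 1))  // pair each word with a count of 1
  .reduceByKey(_ + _)      // sum the counts per word across the cluster
counts.take(5).foreach(println)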
MLlib
MLlib is a standard component of Spark providing machine learning primitives on top of Spark.
● Scalability
● Performance
● User-friendly APIs
● Integration with Spark and its other components
● Support for Java, Scala, Python
MLlib
● Classification: logistic regression, naive Bayes, ... (a short sketch follows this list)
● Regression: generalized linear regression, isotonic regression, ...
● Decision trees, random forests, and gradient-boosted trees
● Recommendation: alternating least squares (ALS)
● Clustering: K-means, Gaussian mixtures (GMMs), ...
● Topic modeling: latent Dirichlet allocation (LDA)
● Feature transformations: standardization, normalization, hashing, ...
● Model evaluation and hyper-parameter tuning
● ML Pipeline construction
● ML persistence: saving and loading models and Pipelines
● Survival analysis: accelerated failure time model
● Frequent itemset and sequential pattern mining: FP-growth, association rules, PrefixSpan
● Distributed linear algebra: singular value decomposition (SVD), principal component analysis (PCA), ...
● Statistics: summary statistics, hypothesis testing, ...
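As a taste of the classification APIs above, a minimal RDD-based sketch (assumes a spark-shell providing sc; the two toy points are illustrative):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Two toy labeled points; real data would be an RDD loaded from storage.
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.1)),
  LabeledPoint(1.0, Vectors.dense(2.0, 1.0))
))

// Fit a binary logistic regression model and score a point.
val lrModel = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)
println(lrModel.predict(Vectors.dense(2.0, 1.0))) // predicted class label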
Data Types in MLlib
● Local vector: has integer-typed, 0-based indices and double-typed values, and is stored on a single machine.
● Labeled point: a local vector, either dense or sparse, associated with a label/response.
E.g., val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
(A fuller sketch of these types follows this list.)
● Local matrix● Distributed matrix
– RowMatrix
– IndexedRowMatrix
– CoordinateMatrix
– BlockMatrix
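A short sketch of these types in Scala (assumes a spark-shell providing sc):

import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.regression.LabeledPoint

// Dense and sparse local vectors holding the same data (1.0, 0.0, 3.0).
val dense = Vectors.dense(1.0, 0.0, 3.0)
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// A labeled point: label 1.0 (positive class) plus a feature vector.
val pos = LabeledPoint(1.0, dense)

// A 2x3 local matrix, stored column-major on a single machine.
val local = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

// A distributed RowMatrix: an RDD of local vectors, one vector per row.
val mat = new RowMatrix(sc.parallelize(Seq(dense, sparse)))
println(s"${mat.numRows()} x ${mat.numCols()}")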
DataFrames in Spark SQL
A Dataset is a distributed collection of data. The Dataset API was added in Spark 1.6 and is available in Scala and Java; Python does not support it.
A DataFrame can serve the same role as the MLTable interface in MLI (a featurization sketch follows the example below).
val sentenceData = spark.createDataFrame(Seq(
(0, "Hi I heard about Spark"),
(0, "I wish Java could use case classes"),
(1, "Logistic regression models are neat")
)).toDF("label", "sentence")
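Building on sentenceData above, a short featurization sketch using spark.ml transformers (the numFeatures value of 20 is just for illustration):

import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Split each sentence into words, then hash the words into a fixed-size
// term-frequency feature vector, much like MLI's N-gram featurization.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features").setNumFeatures(20)
val featurized = hashingTF.transform(wordsData)
featurized.select("label", "features").show()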
Using MLlib

● Example:
K-means clustering: partition n observations into k clusters such that each observation belongs to the cluster with the nearest mean.
● Implemented below in PySpark
● Implemented below in Scala
K-Means in PySpark

from numpy import array
from math import sqrt
from pyspark import SparkContext, SparkConf
from pyspark.mllib.clustering import KMeans, KMeansModel
conf = SparkConf().setAppName("KMeans").setMaster("local")
sc = SparkContext(conf=conf)
# Load and parse the data
data = sc.textFile("/home/sarang/Downloads/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10,
                        runs=10, initializationMode="random")
# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
center = clusters.centers[clusters.predict(point)]
return sqrt(sum([x**2 for x in (point - center)]))
WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
# Save and load model
clusters.save(sc, "/home/sarang/KMeansModel")
sameModel = KMeansModel.load(sc, "/home/sarang/KMeansModel")
We can also run these commands interactively in the pyspark shell instead of submitting a script.
K-Means in Scala
import org.apache.spark.ml.clustering.KMeans
// In spark-shell, `spark` is a pre-initialized SparkSession.
val dataset = spark.read.format("libsvm").load("/home/sarang/Downloads/kmeans_data1.txt")
// Trains a k-means model.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)
// Evaluate clustering by computing the Within Set Sum of Squared Errors.
val WSSSE = model.computeCost(dataset)
println(s"Within Set Sum of Squared Errors = $WSSSE")

// Print the resulting cluster centers.
model.clusterCenters.foreach(println)
Conclusion
● MLI is outdated; most of its features have been incorporated into MLlib.
● MLlib is a powerful tool for machine learning at scale.
References
● MLI Tutorial: http://ampcamp.berkeley.edu/3/exercises/mli-document-categorization.html
● MLlib: http://spark.apache.org/docs/latest/mllib-guide.html