Spark MLlib

Machine Learning with Spark MLlib

Overview

• MLlib is Spark’s library of machine learning (ML) functions designed to run in parallel on clusters. It contains a variety of learning algorithms.

• The original MLlib API invokes its algorithms on RDDs.

• Some classic ML algorithms are not included with Spark MLlib because they were not designed for parallel platforms.

Overview

• MLlib is divided into two packages:

• spark.mllib contains the original API built on top of RDDs.

• spark.ml provides a higher-level API built on top of DataFrames.

• Using spark.ml is recommended because the DataFrame-based API is more versatile and flexible. The plan is to keep supporting spark.mllib alongside the development of spark.ml. (Since Spark 2.0, the RDD-based spark.mllib API has been in maintenance mode, and new features go to spark.ml.)

Machine Learning Recap

• Machine learning algorithms try to predict or make decisions based on training data.

• There are multiple types of learning problems, including classification, regression, and clustering, each with a different objective.

Spark MLlib Data Types

• MLlib contains a few specific data types, including Vector, LabeledPoint, Rating, Matrix (local and distributed), and various Model classes.
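
To make the roles of these types concrete, here is a minimal sketch in plain Python. It does not use the Spark API: the classes SparseVec, Labeled, and UserRating are hypothetical stand-ins that only mirror the idea behind Vector (dense or sparse), LabeledPoint (a label plus a feature vector), and Rating (a user, a product, and a score).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SparseVec:
    """Hypothetical stand-in for a sparse vector: only non-zero entries are stored."""
    size: int
    values: Dict[int, float]  # index -> non-zero value

    def to_dense(self) -> List[float]:
        # Expand to a dense list, filling missing indices with 0.0.
        return [self.values.get(i, 0.0) for i in range(self.size)]


@dataclass(frozen=True)
class Labeled:
    """Hypothetical stand-in for a labeled point: a label plus its features."""
    label: float
    features: List[float]


@dataclass(frozen=True)
class UserRating:
    """Hypothetical stand-in for a rating: (user, product, rating)."""
    user: int
    product: int
    rating: float


sparse = SparseVec(size=5, values={1: 3.0, 4: 1.5})
point = Labeled(label=1.0, features=sparse.to_dense())
rating = UserRating(user=7, product=42, rating=4.0)

print(point)
print(rating)
```

A sparse representation pays off when most features are zero, which is common for text or one-hot encoded data.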

MLlib Supported Supervised Algorithms

• Binary Classification Problems

• linear SVMs, logistic regression, decision trees, random forests, gradient-boosted trees, naive Bayes

• Multiclass Classification Problems

• logistic regression, decision trees, random forests, naive Bayes

• Regression Problems

• linear least squares, Lasso, ridge regression, decision trees, random forests, gradient-boosted trees, isotonic regression
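
The first three regression methods differ only in their regularization term. The sketch below shows this for a single feature with no intercept, in plain Python rather than the Spark API. It assumes the unnormalized objective (1/2) * sum((w * x - y)^2) + lam * R(w), with R(w) = 0 for least squares, w^2 / 2 for ridge, and |w| for Lasso. The function names are hypothetical, and Spark's own scaling conventions may differ.

```python
from typing import Sequence


def _sums(xs: Sequence[float], ys: Sequence[float]):
    # Sufficient statistics for one-feature regression without intercept.
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxx, sxy


def fit_least_squares(xs, ys) -> float:
    sxx, sxy = _sums(xs, ys)
    return sxy / sxx


def fit_ridge(xs, ys, lam: float) -> float:
    # The L2 penalty adds lam to the denominator, shrinking w toward zero.
    sxx, sxy = _sums(xs, ys)
    return sxy / (sxx + lam)


def fit_lasso(xs, ys, lam: float) -> float:
    # The L1 penalty soft-thresholds the correlation term; w can become exactly zero.
    sxx, sxy = _sums(xs, ys)
    if sxy > lam:
        return (sxy - lam) / sxx
    if sxy < -lam:
        return (sxy + lam) / sxx
    return 0.0


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
print(fit_least_squares(xs, ys))
print(fit_ridge(xs, ys, lam=7.0))
print(fit_lasso(xs, ys, lam=7.0))
```

Ridge shrinks the weight smoothly, while Lasso can drive it to exactly zero once the penalty is large enough, which is why Lasso is often used for feature selection.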

MLlib Supported Unsupervised Models

• K-means

• Gaussian mixture

• Power iteration clustering (PIC)

• Latent Dirichlet allocation (LDA)

• Bisecting k-means

• Streaming k-means
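
As an illustration of the first method in this list, here is a minimal single-machine sketch of k-means (Lloyd's algorithm) in plain Python. It is not Spark's implementation, which runs distributed and uses a smarter initialization. The function name and the choice of passing initial centers explicitly are assumptions made to keep the example deterministic.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def sq_dist(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(points: Sequence[Point], initial_centers: Sequence[Point],
           max_iter: int = 20) -> List[Point]:
    centers = list(initial_centers)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters: List[List[Point]] = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # assignments are stable, so stop early
            break
        centers = new_centers
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = kmeans(points, initial_centers=[points[0], points[3]])
print(centers)
```

Each iteration alternates between assigning points and recomputing means, and both steps parallelize naturally over partitions of the data.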

Recommender Systems

• Collaborative filtering is commonly used for recommender systems.

• spark.mllib currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries.

• spark.mllib uses the alternating least squares (ALS) algorithm to learn these latent factors.
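
The idea behind ALS can be sketched in plain Python with a single latent factor per user and per product. This is a simplified illustration, not Spark's implementation: the function name als_rank1, the rank of 1, the regularization value, and the toy ratings are all assumptions. With the product factors held fixed, each user factor has a closed-form regularized least-squares solution, and vice versa, so the algorithm alternates between the two.

```python
from typing import Dict, List, Tuple

Ratings = Dict[Tuple[int, int], float]  # (user, product) -> observed rating


def als_rank1(ratings: Ratings, n_users: int, n_items: int,
              lam: float = 0.001, iterations: int = 50) -> Tuple[List[float], List[float]]:
    u = [1.0] * n_users  # one latent factor per user
    v = [1.0] * n_items  # one latent factor per product
    for _ in range(iterations):
        # Fix v, solve for each user factor using only that user's observed ratings.
        for i in range(n_users):
            num = sum(r * v[j] for (ui, j), r in ratings.items() if ui == i)
            den = lam + sum(v[j] ** 2 for (ui, j) in ratings if ui == i)
            u[i] = num / den
        # Fix u, solve for each product factor using only that product's observed ratings.
        for j in range(n_items):
            num = sum(r * u[i] for (i, ij), r in ratings.items() if ij == j)
            den = lam + sum(u[i] ** 2 for (i, ij) in ratings if ij == j)
            v[j] = num / den
    return u, v


# Toy data: user 2 has not rated product 1, so that entry is missing.
ratings = {(0, 0): 1.0, (0, 1): 2.0,
           (1, 0): 2.0, (1, 1): 4.0,
           (2, 0): 3.0}
u, v = als_rank1(ratings, n_users=3, n_items=2)
predicted = u[2] * v[1]  # estimate for the missing (user 2, product 1) entry
print(round(predicted, 2))
```

In practice more than one latent factor is used, and each alternating step then solves a small linear system per user or product instead of a scalar division.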

For more, visit https://supergloo.com