Upload
todd-mcgrath
View
213
Download
5
Embed Size (px)
Citation preview
Overview
• MLlib is Spark’s library of machine learning (ML) functions designed to run in parallel on clusters. MLlib contains a variety of learning algorithms
• MLlib invokes various algorithms on RDDs
• Some classic ML algorithms are not included with Spark MLlib because they were not designed for parallel
Overview
• Divided into two packages:
• spark.mllib contains the original API built on top of RDDs.
• spark.ml provides higher-level API built on top of DataFrames
• Using spark.ml is recommended because with DataFrames the API is more versatile and flexible. Plan is to keep supporting spark.mllib along with the development of spark.ml.
Machine Learning Recap
• Machine learning algorithms try to predict or make decisions based on training data.
• There are multiple types of learning problems, including classification, regression, or clustering. All of which have different objectives.
Spark MLlib Data Types
• MLlib contains a few specific data types including Vector, LabeledPoint, Rating, Matrix (local and distributed) and various Model classes.
MLlib Supported Supervised Algorithm Methods
• Binary Classification Problems
• linear SVMs, logistic regression, decision trees, random forests, gradient-boosted trees, naive bayes
• Multiclass Classification Problems
• logistic regression, decision trees, random forests, naive Bayes
• Regression Problems
• linear least squares, Lasso, ridge regression, decision trees, random forests, gradient-boosted trees, isotonic regression
MLlib Supported Unsupervised Models
• K-means
• Gaussian mixture
• Power iteration clustering (PIC)
• Latent Dirichlet allocation (LDA)
• Bisecting k-means
• Streaming k-means
Recommender Systems
• Collaborative filtering is commonly used for recommender systems.
• spark.mllib currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries.
• spark.mllib uses the alternating least squares (ALS) algorithm to learn these latent factors.