Machine Learning With Spark

Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Machine Learning with Apache Spark


Agenda

• Machine Learning Overview

• Spark

– Spark Essentials

– Sample Code

• Machine Learning Libraries in Spark

• MLlib

• GraphX

• Code Example



Machine Learning Overview



Machine Learning

Arthur Samuel (1959) – Machine learning: the field of study that gives computers
the ability to learn without being explicitly programmed.

– Author of an early checkers-playing program


Machine Learning

• Supervised

Regression

Classification

– SVM (Support Vector Machines)

• Unsupervised

Clustering

Recommendation

Outlier detection

Affinity analysis

• Learning theory

• Reinforcement learning


Supervised Learning

Infer a target function from a labeled dataset

Example: classification, regression

Labeled dataset

ID    Total$   Age   City   Target
101   200      25    SF     2
102   350      35    LA     2
103   25       15    LA     1
…     …        …     …      1

Test data

ID    Total$   Age   City
105   234      22    NYC
106   112      67    BOS

Labeled dataset -> Model -> Target for the test data (values shown on the slide: 1, 2)


Unsupervised Learning

Identify naturally occurring patterns in data

Example: clustering

ID    Total$   Age   City
101   200      25    SF
102   350      35    LA
103   25       15    LA
…     …        …     …

No labels

Model -> naturally occurring hidden structure


Example: Email Spam Detection

2 classes: Spam or Not-Spam

Features: words that appear (or not) in the email text
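The "words that appear (or not)" features can be sketched as a 0/1 vector per email. This is a plain Python illustration, not Spark code; the vocabulary, the sample email, and the function name binary_word_features are all hypothetical.

```python
import re

def binary_word_features(email_text, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary word appears in the email, else 0."""
    words = set(re.findall(r"[a-z']+", email_text.lower()))
    return [1 if word in words else 0 for word in vocabulary]

# Hypothetical vocabulary and email, chosen only for illustration.
vocabulary = ["free", "winner", "meeting", "invoice"]
email = "You are a WINNER! Claim your free prize now."

features = binary_word_features(email, vocabulary)
print(features)  # [1, 1, 0, 0]
```

A classifier (Spam or Not-Spam) would then be trained on many such vectors paired with their labels.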


Regression Analysis

Labeled dataset

ID    Age   City   Target
101   25    SF     $200
102   35    LA     $350
103   15    LA     $25
…     …     …      …

Test data

ID    Age   City   Target
104   17    NYC    ?

Model: learned from the labeled dataset, predicts the Target for the test data

Techniques: linear regression, decision trees, and many more
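As a rough illustration of linear regression on the table above, the sketch below fits a straight line to Age versus Target with ordinary least squares and predicts the Target for ID 104 (age 17). It is plain Python rather than Spark code, and it ignores the City column, which is a simplifying assumption; three rows are far too few for a real model.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# The three labeled rows from the slide (Age, Target in dollars).
ages = [25, 35, 15]
targets = [200, 350, 25]

slope, intercept = fit_line(ages, targets)
predicted = slope * 17 + intercept  # test row: ID 104, age 17
print(round(slope, 2), round(predicted, 2))  # 16.25 61.67
```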


Regression Example: Ad Click Through Rate (CTR) Prediction

Rank = bid * CTR

Predict CTR for each ad to determine placement, based on:

- Historical CTR

- Keyword match

- Etc…
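The ranking rule Rank = bid * CTR can be sketched as a simple sort. The ad names, bids, and predicted CTR values below are hypothetical; in practice the CTR would come from a regression model trained on features such as historical CTR and keyword match.

```python
def rank_ads(ads):
    """Sort ads by rank = bid * predicted CTR, highest rank first."""
    return sorted(ads, key=lambda ad: ad["bid"] * ad["ctr"], reverse=True)

# Hypothetical ads: bid in dollars, ctr as a predicted probability of a click.
ads = [
    {"name": "ad_a", "bid": 2.00, "ctr": 0.010},  # rank 0.020
    {"name": "ad_b", "bid": 0.50, "ctr": 0.050},  # rank 0.025
    {"name": "ad_c", "bid": 1.00, "ctr": 0.015},  # rank 0.015
]

ordered = [ad["name"] for ad in rank_ads(ads)]
print(ordered)  # ['ad_b', 'ad_a', 'ad_c']
```

Note that the lowest bid wins the top slot here because its predicted CTR is high enough to outweigh the bid difference.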


Example: Netflix Movie Recommendations


Clustering


Recommendation Engine

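Observed ratings (? = not yet rated)

User   Harry Potter   X-Men   Hobbit   Argo   Pirates
101    5              2       4        ?      ?
102    ?              ?       5        2      ?
103    1              2       ?        ?      3
104, 105, … (rows not shown)

Completed ratings (missing entries predicted)

User   Harry Potter   X-Men   Hobbit   Argo   Pirates
101    5              2       4        1      3
102    4              1       5        2      3
103    1              2       4        1      3
104, 105, … (rows not shown)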


Detecting Natural Groups


Determining Class Labels

ID    Total$   Age   City   Class
101   $200     25    SF     2
102   $350     35    LA     2
103   $25      15    LA     1
…     …        …     …      1

[Figure: groups labeled 1, 2, 2, 2]

N variables

Some techniques:

- k-means
- Spectral clustering
- DBSCAN
- Hierarchical clustering
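Of the techniques listed, k-means is the simplest to sketch: alternate between assigning each point to its nearest centroid and moving each centroid to the mean of its points. The version below is plain Python, not MLlib; the sample points are hypothetical and the initialization (first k points) is a naive choice made only to keep the example deterministic.

```python
def kmeans(points, k, iterations=10):
    """Lloyd's algorithm with squared Euclidean distance."""
    centroids = [tuple(p) for p in points[:k]]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point takes the label of its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Hypothetical 2-D points forming two well-separated groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The labels are arbitrary cluster indices; as on the slide, it is up to the analyst to decide what each group means.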


Detecting Outliers

Outlier point


Example: Credit Card Fraud Detection


Task #6: Affinity Analysis

        Item 1   Item 2   Item 3   Item 4   Item 5   …
Tx 1    Y        N        N        Y        N
Tx 2    Y        N        N        Y        N
Tx 3    Y        Y        N        Y        N
Tx 4    N        N        Y        Y        Y
Tx 5    (row not shown)
…

Goal: identify frequent itemsets

Techniques: FP-Growth, Apriori
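To make "frequent itemset" concrete, the sketch below counts how many transactions contain each item and each pair of items, then keeps those that meet a minimum support. It is a brute-force count in plain Python, not FP-Growth or Apriori. It assumes each Y/N row on the slide is one transaction with columns Item 1 to Item 5; the min_support value of 2 is an arbitrary choice.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Count every itemset up to max_size and keep those seen min_support times."""
    counts = Counter()
    for items in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(items), size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

# The four transactions shown on the slide (items marked Y).
transactions = [
    {"Item 1", "Item 4"},
    {"Item 1", "Item 4"},
    {"Item 1", "Item 2", "Item 4"},
    {"Item 3", "Item 4", "Item 5"},
]

frequent = frequent_itemsets(transactions, min_support=2)
print(frequent)
# {('Item 1',): 3, ('Item 4',): 4, ('Item 1', 'Item 4'): 3}
```

Brute-force counting grows quickly with the number of items, which is why pruning algorithms such as Apriori and FP-Growth are used on real data.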


Example: Market Basket Analysis

Use affinity analysis for

- store layout design

- Coupons



Apache Spark



Apache Spark

• Apache Spark is an open source project for fast, large-scale data processing.

– Simple and expressive programming model

– Machine learning, graph computation, and streaming

– In-memory compute for iterative workloads

• It does most of its processing in memory

• It supports several programming languages

– Java, Scala, and Python

• It provides high-level modules

– MLlib

– GraphX

– Spark Streaming

– Spark SQL

• Cluster managers

– YARN (recommended)

– Mesos

– Spark’s own standalone manager


Brief History


RDD

• It is Spark’s abstraction for a distributed collection of items

• RDD stands for Resilient Distributed Dataset

• It can be created

– from Hadoop InputFormats

– by transforming other RDDs


Actions and Transformations

Actions

Return values to the driver program

An action triggers execution of a DAG of operations

The DAG is compiled into stages, and each stage is executed as a series of tasks

Tasks: the fundamental units of work

Transformations

Return pointers to new RDDs

Transformations are lazy (not computed immediately)

A transformed RDD is recomputed each time an action runs on it, unless it is persisted
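The difference between lazy transformations and actions can be illustrated with a toy class. This is a plain Python analogy, not the Spark API: LazyDataset and its methods are hypothetical names that only mimic how an RDD records its lineage and runs it when an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations only record lineage."""

    def __init__(self, source, lineage=()):
        self._source = source
        self._lineage = lineage

    # Transformations: return a new dataset, compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._lineage + (("filter", fn),))

    # Actions: replay the whole lineage and return a value.
    def collect(self):
        items = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def count(self):
        return len(self.collect())

calls = []  # records every time square() actually runs

def square(x):
    calls.append(x)
    return x * x

evens = LazyDataset(range(5)).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # 0: nothing has been computed yet
result = evens.collect()          # first action runs the lineage
total = evens.count()             # second action recomputes it
print(calls_before_action, result, total, len(calls))  # 0 [0, 4, 16] 3 10
```

The final count of 10 calls shows the recomputation: square ran five times for each of the two actions.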


Spark in a Cluster

Spark applications run as independent sets of processes on a cluster

The SparkContext object in the driver program initiates and coordinates them

– SparkContext is created when you start the spark-shell.

– It is accessible as “sc”

– SparkContext(master: String, jobName: String)

– master: the location of the cluster

The cluster manager allocates resources on the cluster

Spark acquires executors on the nodes

Spark sends your application code and tasks to the executors


Spark Monitoring


Sample Program (Java)


Spark Sample

Running the Spark shell on a cluster

$ spark-shell --master yarn-client --num-executors 1 --driver-memory 512m \
    --executor-memory 512m --executor-cores 1

Running the SparkPi example

$ spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster \
    --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 \
    /root/Spark11/lib/spark-examples*.jar


Demo

• Fire up a Spark VM with HDP

• Start the spark-shell

• Spark Pi

• Word Count Example



MLlib (Machine Learning Library)



Spark Stack


Machine Learning

MLlib is Spark’s scalable machine learning library. It consists of common
learning algorithms and utilities, including classification, regression, clustering,
collaborative filtering, and dimensionality reduction.

Dependencies

• Breeze

– A library for numerical processing

• netlib-java and jblas

– Numeric and matrix libraries for Java

• gfortran runtime library


MLlib

• Basic Statistics

– Correlations

– Stratified sampling

– Hypothesis testing

– Random data generation

• Classification and Regression

Problem Type                Supported Methods
Binary classification       linear SVMs, logistic regression, decision trees, naive Bayes
Multiclass classification   decision trees, naive Bayes
Regression                  linear least squares, Lasso, ridge regression, decision trees


MLlib – Collaborative Filtering

• Collaborative filtering techniques aim to fill in the missing entries
of a user-item association matrix.

• MLlib currently supports model-based collaborative filtering, in which users
and products are described by a small set of latent factors that can be used
to predict missing entries.

• MLlib uses the alternating least squares (ALS) algorithm to learn these latent factors

• Reference: Large-scale Parallel Collaborative Filtering for the Netflix Prize
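The idea behind alternating least squares can be sketched with a single latent factor per user and per item: hold the item factors fixed and solve for each user factor in closed form, then swap roles. This is plain Python, not MLlib's implementation. The ratings are the observed entries from the recommendation slide, read as users 101 to 103 (an assumption about the row order); the rank of 1, the plain L2 penalty, and the reg value are simplifying assumptions.

```python
def objective(ratings, user, item, reg):
    """Squared error on observed ratings plus an L2 penalty on all factors."""
    error = sum((r - user[u] * item[i]) ** 2 for (u, i), r in ratings.items())
    penalty = reg * (sum(x * x for x in user) + sum(y * y for y in item))
    return error + penalty

def als_rank1(ratings, n_users, n_items, reg=0.1, iterations=10):
    """Rank-1 ALS: each user and each item is described by one number."""
    user = [1.0] * n_users
    item = [1.0] * n_items
    history = [objective(ratings, user, item, reg)]
    for _ in range(iterations):
        # Fix the item factors; each user factor has a closed-form solution.
        for u in range(n_users):
            rated = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            numerator = sum(r * item[i] for i, r in rated)
            denominator = sum(item[i] ** 2 for i, r in rated) + reg
            user[u] = numerator / denominator
        # Fix the user factors; solve for each item factor the same way.
        for i in range(n_items):
            raters = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            numerator = sum(r * user[u] for u, r in raters)
            denominator = sum(user[u] ** 2 for u, r in raters) + reg
            item[i] = numerator / denominator
        history.append(objective(ratings, user, item, reg))
    return user, item, history

# Observed entries as (user index, movie index) -> rating.
# Movies: 0 Harry Potter, 1 X-Men, 2 Hobbit, 3 Argo, 4 Pirates.
ratings = {
    (0, 0): 5, (0, 1): 2, (0, 2): 4,
    (1, 2): 5, (1, 3): 2,
    (2, 0): 1, (2, 1): 2, (2, 4): 3,
}

user, item, history = als_rank1(ratings, n_users=3, n_items=5)
predicted_argo_for_first_user = user[0] * item[3]  # fills one "?" entry
print(history[0] > history[-1])  # True: the objective has decreased
```

Each half-step minimizes the objective exactly over one set of factors, so the objective never increases. A single factor is too crude to reproduce the completed matrix on the slide; real models use a higher rank.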


MLlib – Clustering

• Clustering is an unsupervised learning problem

• MLlib supports k-means clustering


MLlib - Dimensionality Reduction

Dimensionality reduction is the process of reducing the number of variables

under consideration. It can be used to extract latent features from raw and noisy

features or compress data while maintaining the structure. MLlib provides

support for dimensionality reduction on the RowMatrix class.


MLlib - Feature Extraction and Transformation

• TF-IDF - Term frequency-inverse document frequency (TF-IDF) is a feature

vectorization method widely used in text mining to reflect the importance of a

term to a document in the corpus

• Word2Vec - Word2Vec computes distributed vector representations of words.

The main advantage of the distributed representations is that similar words

are close in the vector space, which makes generalization to novel patterns

easier and model estimation more robust.

• StandardScaler

• Normalizer
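A minimal TF-IDF sketch in plain Python follows. It is not MLlib code. It uses raw term counts as the term frequency and a smoothed inverse document frequency, log((m + 1) / (df + 1)) for m documents; that smoothing matches my reading of the MLlib documentation but should be treated as an assumption. The three tiny documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document (each document is a token list)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in documents
    ]

# Hypothetical tokenized documents.
docs = [
    ["spark", "is", "fast"],
    ["spark", "runs", "on", "yarn"],
    ["spark", "mllib", "mllib"],
]

weights = tf_idf(docs)
print(weights[0]["spark"])            # 0.0, the term appears in every document
print(round(weights[2]["mllib"], 3))  # 1.386, frequent here and rare elsewhere
```

A term that occurs in every document carries no information about any one of them, so its weight drops to zero, while a term concentrated in one document is weighted up.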


Machine Learning Demo

• Movie Rating



GraphX



GraphX

• Spark API for graphs and graph-parallel computation


GraphX - Demo

http://ampcamp.berkeley.edu/4/exercises/graph-analytics-with-graphx.html


Questions
