© Hortonworks Inc. 2011 – 2014. All Rights Reserved

Machine Learning With Spark

Machine Learning with Apache Spark

Agenda

• Machine Learning Overview

• Spark

– Spark Essentials

– Sample Code

• Machine Learning Libraries in Spark

• MLlib

• GraphX

• Code Example

Machine Learning Overview

Machine Learning

Arthur Samuel (1959) – Machine learning: the field of study that gives computers the ability to learn without being explicitly programmed.

– Samuel is known for his checkers-playing program

Machine Learning

• Supervised
– Regression
– Classification
   – SVM (Support Vector Machines)
• Unsupervised
– Clustering
– Recommendation
– Outlier detection
– Affinity analysis
• Learning theory
• Reinforcement learning

Supervised Learning

Infer a target function from a labeled dataset

Example: classification, regression

Labeled dataset

ID    Total$   Age   City   Target
101   200      25    SF     2
102   350      35    LA     2
103   25       15    LA     1
…     …        …     …      1

(additional Target values shown on the slide: 1, 2)

Test data

ID    Total$   Age   City
105   234      22    NYC
106   112      67    BOS

Labeled dataset → Model → Target for the test data
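A minimal sketch of the idea, using the rows from the tables above. The technique (1-nearest neighbour) is swapped in only for illustration and is not named on the slide; the City column is left out to keep the features numeric.

```python
import math

# Labeled dataset from the slide: (Total$, Age) -> Target.
# City is omitted so the sketch can use plain numeric distance.
labeled = [
    ((200, 25), 2),
    ((350, 35), 2),
    ((25, 15), 1),
]

# Unlabeled test rows from the slide (IDs 105 and 106).
test_rows = [(234, 22), (112, 67)]

def predict(features):
    """Return the target of the closest labeled row (1-nearest neighbour)."""
    nearest = min(labeled, key=lambda row: math.dist(row[0], features))
    return nearest[1]

predictions = [predict(row) for row in test_rows]
print(predictions)
```

The features are not scaled here, so Total$ dominates the distance; a real model would normalize the columns first.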

Unsupervised Learning

Identify naturally occurring patterns in data

Example: clustering

ID    Total$   Age   City
101   200      25    SF
102   350      35    LA
103   25       15    LA
…     …        …     …

No labels

Model → naturally occurring hidden structure

Example: Email Spam Detection

2 classes: Spam or Not-Spam

Features: words that appear (or not) in the email text
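One way to picture these features is a 0/1 vector over a fixed vocabulary. The sketch below assumes a hypothetical four-word vocabulary and a made-up email; neither comes from the slides.

```python
def word_presence_features(email_text, vocabulary):
    """Map an email to a 0/1 vector: 1 if the vocabulary word appears, else 0."""
    words = set(email_text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

# Hypothetical vocabulary and email, chosen only for illustration.
vocabulary = ["free", "winner", "meeting", "invoice"]
features = word_presence_features("You are a WINNER claim your free prize", vocabulary)
print(features)
```

A classifier would then be trained on such vectors together with the Spam / Not-Spam labels.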

Regression Analysis

Labeled dataset

ID    Age   City   Target
101   25    SF     $200
102   35    LA     $350
103   15    LA     $25
…     …     …      …

Test data

ID    Age   City   Target
104   17    NYC    ?

Model

Techniques: linear regression, decision trees, etc.

Many more
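As a rough illustration of linear regression on this table, the sketch below fits a straight line from Age to Target with ordinary least squares and applies it to test row 104. Using Age alone and ignoring City is a simplifying assumption, not something the slide prescribes.

```python
# Labeled rows from the slide: Age -> Target ($). City is ignored in this sketch.
ages = [25, 35, 15]
targets = [200, 350, 25]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(ages, targets)
predicted = intercept + slope * 17  # test row 104, Age 17
print(round(predicted, 2))
```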

Regression Example: Ad Click Through Rate (CTR) Prediction

Rank = bid * CTR

Predict CTR for each ad to

determine placement, based on:

- Historical CTR

- Keyword match

- Etc…
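The ranking rule Rank = bid * CTR can be sketched as follows. The ads, bids and predicted CTR values are made-up numbers used only to show the ordering.

```python
# Hypothetical ads: bids and predicted CTRs are made-up numbers.
ads = [
    {"name": "ad_a", "bid": 2.00, "ctr": 0.010},
    {"name": "ad_b", "bid": 0.50, "ctr": 0.080},
    {"name": "ad_c", "bid": 1.00, "ctr": 0.030},
]

def rank_score(ad):
    """Rank = bid * CTR, as on the slide."""
    return ad["bid"] * ad["ctr"]

# Highest score gets the best placement.
placement = sorted(ads, key=rank_score, reverse=True)
order = [ad["name"] for ad in placement]
print(order)
```

Note that the highest bidder does not win here: a low bid with a high predicted CTR can outrank it.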

Example: Netflix Movie Recommendations

Clustering

Recommendation Engine

Ratings matrix with missing entries:

5 2 4 ? ?
? ? 5 2 ?
1 2 ? ? 3

Ratings matrix after the missing entries are predicted:

5 2 4 1 3
4 1 5 2 3
1 2 4 1 3

Axis labels on the slide: movies (Harry Potter, X-Men, Hobbit, Argo, Pirates) and IDs 101 to 105

Detecting Natural Groups

Determining Class Labels

ID    Total$   Age   City   Class
101   $200     25    SF     2
102   $350     35    LA     2
103   $25      15    LA     1
…     …        …     …      1

(additional Class values shown on the slide: 1, 2, 2, 2)

N variables

Some techniques:

- K-means
- Spectral clustering
- DBSCAN
- Hierarchical clustering
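Of the techniques listed, K-means is the simplest to sketch. The version below is a plain Lloyd's-algorithm loop with fixed starting centroids; the fourth data point and the choice of starting centroids are assumptions made for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i])) for p in points]
    return centroids, labels

# (Total$, Age) for rows 101-103 from the slide, plus one hypothetical extra row.
points = [(200, 25), (350, 35), (25, 15), (40, 18)]
centroids, labels = kmeans(points, centroids=[points[0], points[2]])
print(labels)
```

The cluster indices play the role of the Class column: rows with the same index belong to the same natural group.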

Detecting Outliers

Outlier point

Example: Credit Card Fraud Detection

Task #6: Affinity Analysis

Y N N Y N
Y N N Y N
Y Y N Y N
N N Y Y Y

Axis labels on the slide: Tx 1 to Tx 5 and Item 1 to Item 5

Goal: identify frequent itemsets

Techniques: FP-Growth, Apriori
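To make "frequent itemset" concrete, the sketch below counts support by brute force over the Y/N matrix. It is neither FP-Growth nor Apriori, only the counting that both of them speed up. Reading each matrix row as one transaction over Item 1 to Item 5 is an assumption about the figure.

```python
from itertools import combinations

# Rows of the slide's Y/N matrix, read as transactions over Item 1..Item 5 (an assumption).
rows = ["YNNYN", "YNNYN", "YYNYN", "NNYYY"]
transactions = [{i + 1 for i, flag in enumerate(row) if flag == "Y"} for row in rows]

def frequent_itemsets(transactions, min_support, max_size=2):
    """Count every itemset up to max_size and keep those seen in >= min_support transactions."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                result[candidate] = support
    return result

frequent = frequent_itemsets(transactions, min_support=3)
print(frequent)
```

With a minimum support of 3, only Item 1, Item 4 and the pair (Item 1, Item 4) qualify under this reading.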

Example: Market Basket Analysis

Use affinity analysis for

- store layout design

- Coupons

Apache Spark

• Apache Spark is an open source project for fast, large-scale data processing.
– Simple and expressive programming model
– Machine learning, graph computation and streaming
– In-memory compute for iterative workloads
• It does most of its processing in memory
• It supports several programming languages
– Java, Scala and Python
• It provides high-level modules
– MLlib
– GraphX
– Spark Streaming
– Spark SQL
• Cluster managers
– YARN (recommended)
– Mesos
– Spark’s own standalone manager

Brief History

RDD

• RDD stands for Resilient Distributed Dataset
• It is Spark’s abstraction for a distributed collection of items
• It can be created
– from Hadoop InputFormats
– by transforming other RDDs

Actions and Transformations

Actions
– Return values to the driver program
– Running an action results in a DAG of operations
– The DAG is compiled into stages, and each stage is executed as a series of tasks
– Tasks are the fundamental units of work

Transformations
– Return pointers to new RDDs
– Are lazy (not computed immediately)
– A transformed RDD is recomputed each time an action runs on it
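The difference between lazy transformations and actions can be illustrated without a cluster. The class below is a toy stand-in written for this note, not the Spark API: `LazyDataset` and its methods are hypothetical names that only mimic the behaviour described above.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions trigger computation."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)  # recorded lineage of transformations

    # Transformations: return a new dataset, compute nothing.
    def map(self, fn):
        return LazyDataset(self.source, self.steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self.source, self.steps + (("filter", fn),))

    def _compute(self):
        data = iter(self.source)
        for kind, fn in self.steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: run the recorded steps and return a value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

calls = []

def square(x):
    calls.append(x)  # track when the function actually runs
    return x * x

numbers = LazyDataset([1, 2, 3, 4])
evens = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # nothing has run yet
result = evens.collect()           # first action
total = evens.count()              # second action recomputes from the source
```

Here `square` runs only when `collect` or `count` is called, and it runs again for the second action because nothing was kept from the first.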

Spark in a Cluster

Spark applications run as independent sets of processes on a cluster

The SparkContext object in the driver program initiates and coordinates them

– A SparkContext is created when you start the spark-shell
– It is accessible as “sc”
– SparkContext(master: String, jobName: String)
– master: the location of the cluster

The cluster manager allocates resources on the cluster

Spark acquires executors on the nodes

Spark sends your application code and tasks to the executors

Spark Monitoring

Sample Program (Java)

Spark Sample

Executing the Spark shell on a cluster

$ spark-shell --master yarn-client --num-executors 1 --driver-memory 512m --executor-memory 512m --executor-cores 1

Executing Spark Pi

$ spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 /root/Spark11/lib/spark-examples*.jar

Demo

• Fire up a Spark VM with HDP

• Start the spark-shell

• Spark Pi

• Word Count Example
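The demo code itself is not included in the slides. As a rough idea of what a word count computes, here is a plain-Python sketch; the comments point at the Spark RDD operations (flatMap, reduceByKey) that play the same roles, and the input lines are hypothetical.

```python
from collections import Counter

def word_count(lines):
    """Split each line into words, then sum a count of 1 per word."""
    # flatMap-like step: lines -> words
    words = (word for line in lines for word in line.lower().split())
    # reduceByKey-like step: add up the occurrences of each word
    return Counter(words)

# Hypothetical input lines.
lines = ["to be or not to be", "to learn spark"]
counts = word_count(lines)
print(counts.most_common(2))
```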

MLlib (Machine Learning Library)

Spark Stack

Machine Learning

MLlib is Spark’s scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.

Dependencies

• Breeze
– A library for numerical processing
• netlib-java and jblas
– Numeric and matrix libraries for Java
• gfortran runtime library

MLlib

• Basic Statistics
– Correlations
– Stratified sampling
– Hypothesis testing
– Random data generation
• Classification and Regression

Problem Type                Supported Methods
Binary classification       Linear SVMs, logistic regression, decision trees, naive Bayes
Multiclass classification   Decision trees, naive Bayes
Regression                  Linear least squares, Lasso, ridge regression, decision trees

MLlib – Collaborative Filtering

• Collaborative filtering techniques aim to fill in the missing entries of a user-item association matrix.
• MLlib currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries.
• MLlib uses the alternating least squares (ALS) algorithm to learn these latent factors.
• Reference: “Large-scale Parallel Collaborative Filtering for the Netflix Prize”
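The alternating idea can be shown with a single latent factor per user and per item. This is a toy sketch, not MLlib's implementation: MLlib uses several factors and regularization, and the `als_rank1` function and the small ratings dictionary below are hypothetical.

```python
def als_rank1(ratings, n_users, n_items, iterations=50, reg=0.0):
    """Fit ratings[(u, i)] ~ user[u] * item[i] by alternating least squares, one latent factor."""
    user = [1.0] * n_users
    item = [1.0] * n_items
    for _ in range(iterations):
        # Fix the item factors and solve the 1-D least-squares problem for each user.
        for u in range(n_users):
            obs = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            user[u] = sum(r * item[i] for i, r in obs) / (sum(item[i] ** 2 for i, _ in obs) + reg)
        # Fix the user factors and solve for each item.
        for i in range(n_items):
            obs = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            item[i] = sum(r * user[u] for u, r in obs) / (sum(user[u] ** 2 for u, _ in obs) + reg)
    return user, item

# Hypothetical observed ratings, keyed by (user, item); (0, 2) and (2, 0) are missing.
ratings = {(0, 0): 2, (0, 1): 1, (1, 0): 4, (1, 1): 2, (1, 2): 6, (2, 1): 3, (2, 2): 9}
user, item = als_rank1(ratings, n_users=3, n_items=3)

# Predicted values for the two missing entries.
missing_02 = user[0] * item[2]
missing_20 = user[2] * item[0]
print(round(missing_02, 3), round(missing_20, 3))
```

The observed ratings were built from an exact rank-1 pattern, so the two missing entries are recoverable; real rating data is noisy and needs `reg` greater than zero.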

MLlib – Clustering

• Clustering is an unsupervised learning problem
• MLlib supports K-means clustering

MLlib – Dimensionality Reduction

Dimensionality reduction is the process of reducing the number of variables under consideration. It can be used to extract latent features from raw and noisy features, or to compress data while maintaining its structure. MLlib provides support for dimensionality reduction on the RowMatrix class.

MLlib – Feature Extraction and Transformation

• TF-IDF: term frequency-inverse document frequency is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus.
• Word2Vec: computes distributed vector representations of words. The main advantage of distributed representations is that similar words are close in the vector space, which makes generalization to novel patterns easier and model estimation more robust.
• StandardScaler
• Normalizer
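TF-IDF can be worked through by hand on a tiny corpus. The sketch below assumes a smoothed weighting, tf * log((N + 1) / (df + 1)), where N is the number of documents and df the number of documents containing the term; the exact formula MLlib applies should be checked against its documentation, and the three documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log((N + 1) / (df + 1))."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log((n_docs + 1) / (df[t] + 1)) for t in tf})
    return weights

# Hypothetical three-document corpus.
docs = ["spark runs fast", "spark runs on yarn", "spark mllib"]
weights = tf_idf(docs)
print(weights[2])
```

A term that appears in every document ("spark") gets weight 0, while a term unique to one document ("mllib") gets the highest weight, which is the "importance of a term to a document" that the bullet describes.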

Machine Learning Demo

• Movie Rating

GraphX

• Spark API for graphs and graph-parallel computation

GraphX - Demo

http://ampcamp.berkeley.edu/4/exercises/graph-analytics-with-graphx.html

Questions