shivaji-dutta
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Machine Learning with Apache Spark
Agenda
• Machine Learning Overview
• Spark
– Spark Essentials
– Sample Code
• Machine Learning Libraries in Spark
• MLlib
• GraphX
• Code Example
Machine Learning Overview
Machine Learning
Arthur Samuel (1959) – Machine learning: the field of study
that gives computers the ability to learn without being explicitly
programmed.
– Samuel wrote a checkers-playing program
Machine Learning
• Supervised
Regression
Classification
– SVM (Support Vector Machines)
• Unsupervised
Clustering
Recommendation
Outlier detection
Affinity analysis
• Learning theory
• Reinforcement learning
Supervised Learning
Infer a target function from labeled dataset
Example: classification, regression
Labeled dataset:
ID    Total$   Age   City   Target
101   200      25    SF     2
102   350      35    LA     2
103   25       15    LA     1
…     …        …     …      1
Test data:
ID    Total$   Age   City
105   234      22    NYC
106   112      67    BOS
Labeled dataset → Model → Target (1 or 2) for each test row
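As a rough illustration of inferring a target from labeled rows, the sketch below applies a 1-nearest-neighbour rule to the sample rows on this slide. The slide does not name a method here, so the nearest-neighbour rule, the decision to use only Total$ and Age, and the function name `predict` are all assumptions made for this example.

```python
import math

# Labeled rows from the slide: (Total$, Age) -> Target.
# City is ignored here to keep the sketch short (assumption).
labeled = [((200, 25), 2), ((350, 35), 2), ((25, 15), 1)]

# Unlabeled test rows from the slide, keyed by ID.
test_rows = {105: (234, 22), 106: (112, 67)}


def predict(features):
    """Return the target of the closest labeled row (unscaled Euclidean distance)."""
    nearest = min(labeled, key=lambda row: math.dist(row[0], features))
    return nearest[1]


predictions = {row_id: predict(features) for row_id, features in test_rows.items()}
print(predictions)
```

Because the features are not scaled, Total$ dominates the distance; a real model would normally normalize the columns first.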
Unsupervised Learning
Identify naturally occurring patterns in data
Example: clustering
ID    Total$   Age   City
101   200      25    SF
102   350      35    LA
103   25       15    LA
…     …        …     …
No labels
Model → naturally occurring hidden structure
Example: Email Spam Detection
2 classes: Spam or Not-Spam
Features: words that appear (or not) in the email text
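The "words that appear (or not)" features can be pictured as a vector of 0/1 flags, one per vocabulary word. The following is a minimal sketch of that idea only; the vocabulary, the two sample emails, and the function name `word_features` are invented for illustration and are not taken from the slides.

```python
import re


def word_features(text, vocabulary):
    """Map an email to 0/1 flags: does each vocabulary word appear in the text?"""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return [1 if term in words else 0 for term in vocabulary]


# Hypothetical vocabulary and emails, invented for this example.
vocabulary = ["free", "winner", "meeting", "invoice"]
spam = "You are a WINNER! Claim your free prize now"
ham = "Agenda for the project meeting is attached"

spam_vector = word_features(spam, vocabulary)
ham_vector = word_features(ham, vocabulary)
print(spam_vector)
print(ham_vector)
```

A classifier would then be trained on such vectors together with their Spam or Not-Spam labels.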
Regression Analysis
Labeled dataset:
ID    Age   City   Target
101   25    SF     $200
102   35    LA     $350
103   15    LA     $25
…     …     …      …
Test data:
ID    Age   City   Target
104   17    NYC    ?
Labeled dataset → Model → predicted Target for the test data
Techniques: linear regression, decision trees, and many more
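To make the regression idea concrete, here is a small sketch that fits a straight line by ordinary least squares to the Age and Target columns of the sample rows and predicts the missing target for age 17. Using Age alone (City is ignored) and the helper name `fit_line` are assumptions for this example, not something the slides prescribe.

```python
# Labeled rows from the slide: Age -> Target in dollars.
# City is ignored to keep the sketch to one feature (assumption).
ages = [25, 35, 15]
targets = [200, 350, 25]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(ages, targets)
prediction = slope * 17 + intercept  # test row 104 has Age 17
print(round(prediction, 2))
```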
Regression Example: Ad Click Through Rate (CTR) Prediction
Rank = bid * CTR
Predict CTR for each ad to
determine placement, based on:
- Historical CTR
- Keyword match
- Etc…
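The ranking rule Rank = bid * CTR can be sketched in a few lines. The ad names, bids, and predicted CTR values below are hypothetical numbers chosen for illustration; in practice the CTR would come from a trained regression model.

```python
# Hypothetical ads: (name, bid in dollars, predicted CTR). Values are invented.
ads = [
    ("ad_a", 2.00, 0.010),
    ("ad_b", 0.50, 0.050),
    ("ad_c", 1.00, 0.030),
]


def rank_score(bid, ctr):
    """Rank = bid * CTR, as stated on the slide."""
    return bid * ctr


# Highest score gets the best placement.
ranked = sorted(ads, key=lambda ad: rank_score(ad[1], ad[2]), reverse=True)
placement = [name for name, _, _ in ranked]
print(placement)
```

Note that the highest bid (ad_a) does not win here, because its predicted CTR is low.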
Example: Netflix Movie Recommendations
Clustering
Recommendation Engine
Known ratings (? = not yet rated):
User   Harry Potter   X-Men   Hobbit   Argo   Pirates
101    5              2       4        ?      ?
102    ?              ?       5        2      ?
103    1              2       ?        ?      3
104    …
105    …
Ratings after the model fills in the missing entries:
User   Harry Potter   X-Men   Hobbit   Argo   Pirates
101    5              2       4        1      3
102    4              1       5        2      3
103    1              2       4        1      3
104    …
105    …
Detecting Natural Groups
Determining Class Labels
ID    Total$   Age   City   Class
101   $200     25    SF     2
102   $350     35    LA     2
103   $25      15    LA     1
…     …        …     …      1
N variables
Some techniques:
- K-means
- Spectral clustering
- DBSCAN
- Hierarchical clustering
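Of the techniques listed, k-means is the simplest to sketch. The code below is a minimal version of Lloyd's algorithm in plain Python, not MLlib's implementation. Seeding the centroids with the first k points, the fixed iteration count, and the sample points (the first three follow the slide's Total$ and Age values, the rest are hypothetical) are assumptions for this example.

```python
import math


def kmeans(points, k, iterations=10):
    """Lloyd's algorithm; the first k points seed the centroids (a simplification)."""
    centroids = [points[i] for i in range(k)]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# (Total$, Age) pairs. The first three come from the slide; the others are hypothetical.
points = [(200, 25), (25, 15), (350, 35), (30, 18), (320, 40), (40, 16)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

The resulting cluster indices play the role of the Class column: they are labels the algorithm assigns, not labels supplied with the data.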
Detecting Outliers
Outlier point
Example: Credit Card Fraud Detection
Task #6: Affinity Analysis
       Item 1   Item 2   Item 3   Item 4   Item 5   …
Tx 1   Y        N        N        Y        N
Tx 2   Y        N        N        Y        N
Tx 3   Y        Y        N        Y        N
Tx 4   N        N        Y        Y        Y
Tx 5   …
Goal: identify frequent itemsets
Techniques: FP Growth, Apriori
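To show what "frequent itemset" means in practice, the sketch below counts support by brute force over small candidate sets. It is deliberately neither FP-Growth nor Apriori, both of which prune the search far more cleverly. The transactions are an assumed reading of the four Y/N rows on the slide, and `frequent_itemsets`, `min_support`, and `max_size` are names introduced for this example.

```python
from itertools import combinations

# Assumed reading of the slide's matrix: each set holds the items marked Y.
transactions = [
    {"Item 1", "Item 4"},
    {"Item 1", "Item 4"},
    {"Item 1", "Item 2", "Item 4"},
    {"Item 3", "Item 4", "Item 5"},
]


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return every itemset (up to max_size items) found in at least min_support transactions."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                result[candidate] = support
    return result


frequent = frequent_itemsets(transactions, min_support=3)
print(frequent)
```

With a minimum support of 3, the only frequent pair is Item 1 together with Item 4.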
Example: Market Basket Analysis
Use affinity analysis for
- store layout design
- Coupons
Apache Spark
• Apache Spark is an open source project for fast, large-scale data processing.
– Simple and expressive programming model
– Machine learning, graph computation, and streaming
– In-memory compute for iterative workloads
• It does most of its processing in memory
• It supports the programming languages
– Java, Scala, and Python
• It provides high-level modules:
– MLlib
– GraphX
– Spark Streaming
– Spark SQL
• Cluster managers
– YARN (recommended)
– Mesos
– Spark's own standalone cluster manager
Brief History
RDD
• RDD stands for Resilient Distributed Dataset
• It is Spark’s abstraction for a distributed collection of items
• It can be created
– from Hadoop InputFormats
– by transforming other RDDs
Actions and Transformations
Actions
Return values to the driver program
An action triggers execution of a DAG of operations
The DAG is compiled into stages, and each stage is executed as a series of tasks
Tasks: the fundamental units of work
Transformations
Return pointers to new RDDs
Transformations are lazy (not computed immediately)
By default, a transformed RDD is recomputed each time an action runs on it
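The difference between lazy transformations and actions can be mimicked in plain Python. The class below is a toy stand-in written for this example; `LazyDataset` is a hypothetical name, it is not Spark's RDD API, and it runs on a single machine. It only shows that transformations record work, actions trigger it, and the recorded work runs again for every action.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new dataset, compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def _compute(self):
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: run the recorded steps and return a value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []


def square(x):
    calls.append(x)  # record every evaluation so recomputation is visible
    return x * x


evens = LazyDataset([1, 2, 3, 4]).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)   # nothing has run yet
collected = evens.collect()        # first action runs the pipeline
counted = evens.count()            # second action runs it again
print(calls_before_action, collected, counted, len(calls))
```

In this toy, `square` is evaluated eight times for four inputs because each action recomputes the pipeline; Spark avoids that cost when an RDD is explicitly cached or persisted.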
Spark in a Cluster
A Spark application runs as an independent set of processes in a cluster
The SparkContext object in the driver program initiates and coordinates them
– SparkContext is created when you start the spark-shell.
– It is accessible as “sc”
– SparkContext(master: String, jobName: String)
– master: the location of the cluster
The cluster manager allocates resources on the cluster
Spark acquires executors on the nodes
Spark sends your application code and tasks to the executors
Spark Monitoring
Sample Program (Java)
Spark Sample
Executing the Spark shell on a cluster
$ spark-shell --master yarn-client --num-executors 1 --driver-memory 512m --executor-memory 512m --executor-cores 1
Executing the Spark Pi example
$ spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 /root/Spark11/lib/spark-examples*.jar
Demo
• Fire up a Spark VM with HDP
• Start the spark-shell
• Spark Pi
• Word Count Example
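The word count logic from the demo can be previewed without a cluster. The sketch below is plain Python, not the demo's Spark code: it splits lines into words and sums one count per word, which is the same computation a Spark job would typically express with `flatMap`, `map`, and `reduceByKey` on an RDD. The input lines are hypothetical.

```python
from collections import Counter


def word_count(lines):
    """Split every line into lowercase words and count occurrences of each word."""
    words = (word for line in lines for word in line.lower().split())
    return Counter(words)


# Hypothetical input; the demo would read lines from a file instead.
lines = ["to be or not to be", "that is the question"]
counts = word_count(lines)
print(counts.most_common(2))
```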
MLlib (Machine Learning Library)
Spark Stack
Machine Learning
MLlib is Spark’s scalable machine learning library consisting of common
learning algorithms and utilities, including classification, regression, clustering,
collaborative filtering, and dimensionality reduction.
Dependencies
• Breeze
– A library for numerical processing
• netlib-java and jblas
– Numeric and matrix libraries for Java
• gfortran runtime library
MLlib
• Basic Statistics
– Correlations
– Stratified sampling
– Hypothesis testing
– Random data generation
• Classification and Regression
Problem Type                 Supported Methods
Binary classification        linear SVMs, logistic regression, decision trees, naive Bayes
Multiclass classification    decision trees, naive Bayes
Regression                   linear least squares, Lasso, ridge regression, decision trees
MLlib – Collaborative Filtering
• Collaborative filtering techniques aim to fill in the missing entries
of a user-item association matrix.
• MLlib currently supports model-based collaborative filtering, in which users
and products are described by a small set of latent factors that can be used
to predict missing entries.
• MLlib uses the alternating least squares (ALS) algorithm to learn these latent factors
• Reference: “Large-scale Parallel Collaborative Filtering for the Netflix Prize”
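To give a feel for how alternating least squares fills in missing entries, the sketch below runs ALS with a single latent factor per user and per item, where each update has a closed form. It is a teaching toy in plain Python and not MLlib's implementation. The rank of 1, the regularization value, the starting values, and the ratings (read from the Recommendation Engine slide as users 101 to 103 over five movies) are assumptions for this example.

```python
# Observed ratings as (user index, item index) -> rating.
# Assumed reading of the Recommendation Engine slide; missing entries are absent.
ratings = {
    (0, 0): 5, (0, 1): 2, (0, 2): 4,
    (1, 2): 5, (1, 3): 2,
    (2, 0): 1, (2, 1): 2, (2, 4): 3,
}
N_USERS, N_ITEMS, REG = 3, 5, 0.1


def objective(u, v):
    """Squared error on observed ratings plus L2 regularization."""
    error = sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())
    return error + REG * (sum(x * x for x in u) + sum(x * x for x in v))


def als_rank1(iterations=20):
    """Alternate closed-form least-squares updates for user and item factors."""
    u = [1.0] * N_USERS
    v = [1.0] * N_ITEMS
    for _ in range(iterations):
        # Fix item factors, solve for each user factor.
        for i in range(N_USERS):
            num = sum(r * v[j] for (a, j), r in ratings.items() if a == i)
            den = sum(v[j] ** 2 for (a, j) in ratings if a == i) + REG
            u[i] = num / den
        # Fix user factors, solve for each item factor.
        for j in range(N_ITEMS):
            num = sum(r * u[i] for (i, b), r in ratings.items() if b == j)
            den = sum(u[i] ** 2 for (i, b) in ratings if b == j) + REG
            v[j] = num / den
    return u, v


initial = objective([1.0] * N_USERS, [1.0] * N_ITEMS)
u, v = als_rank1()
final = objective(u, v)
predicted = u[0] * v[3]  # user 101's unseen rating for the fourth movie
print(round(initial, 2), round(final, 2), round(predicted, 2))
```

Each half-step exactly minimizes the objective over one set of factors while the other is held fixed, so the objective never increases; real systems use more than one latent factor per user and item.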
MLlib – Clustering
• Clustering is an unsupervised learning problem
• MLlib supports k-means clustering
MLlib - Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of variables
under consideration. It can be used to extract latent features from raw and noisy
features or compress data while maintaining the structure. MLlib provides
support for dimensionality reduction on the RowMatrix class.
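One common way to reduce dimensions is to project the data onto its first principal component. The sketch below finds that component with power iteration on the covariance matrix in plain Python. It does not use MLlib or the RowMatrix class, and the sample rows, the iteration count, and the function names are assumptions for this example.

```python
import math


def first_principal_component(rows, iterations=100):
    """Power iteration on the sample covariance matrix; returns (column means, unit vector)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    vec = [1.0] * d
    for _ in range(iterations):
        vec = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
    return means, vec


def project(row, means, component):
    """Reduce one row to a single coordinate along the component."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))


# Hypothetical two-column data lying close to the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0)]
means, component = first_principal_component(rows)
reduced = [project(r, means, component) for r in rows]
print([round(c, 3) for c in component], [round(x, 2) for x in reduced])
```

The two original columns collapse into one coordinate per row while keeping almost all of the variation, since the points sit near a single line.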
MLlib - Feature Extraction and Transformation
• TF-IDF - Term frequency-inverse document frequency (TF-IDF) is a feature
vectorization method widely used in text mining to reflect the importance of a
term to a document in the corpus
• Word2Vec - Word2Vec computes distributed vector representations of words.
The main advantage of the distributed representations is that similar words
are close in the vector space, which makes generalization to novel patterns
easier and model estimation more robust.
• StandardScaler
• Normalizer
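The TF-IDF weighting described above can be worked through by hand on a tiny corpus. The sketch below is plain Python rather than MLlib's feature transformers; the three documents are hypothetical, and the smoothed formula log((N + 1) / (df + 1)) for the inverse document frequency is an assumption chosen for this example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, weight = term count * smoothed IDF."""
    n_docs = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights.append({
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in term_freq.items()
        })
    return weights


# Hypothetical corpus, invented for this example.
documents = ["spark is fast", "spark runs on yarn", "spark mllib is scalable"]
weights = tf_idf(documents)
print({term: round(w, 3) for term, w in weights[1].items()})
```

A term that appears in every document ("spark") gets weight 0, while a term unique to one document ("yarn") gets the highest weight, which is the "importance to a document in the corpus" idea from the slide.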
Machine Learning Demo
• Movie Rating
GraphX
• Spark API for graphs and graph-parallel computation
GraphX - Demo
http://ampcamp.berkeley.edu/4/exercises/graph-analytics-with-graphx.html
Questions