Machine Learning :: Introduction
Part II
Konstantin Tretyakov ([email protected])
MTAT.03.183 – Data Mining
November 5, 2009
In the previous episode
Approaches to data analysis
The general principle is the same, though:
1. Define a set of patterns of interest
2. Define a measure of goodness for the patterns
3. Find the best pattern in the data
In the previous episode
Supervised Learning: Manual
In the previous episode
Supervised Learning: Automatic (figures: regression and classification examples)
In the previous episode
Supervised learning: regression and classification
Validation: training set vs. test set, holdout, cross-validation
Metrics: accuracy, precision, recall; TP, FP, TN, FN (formulas recalled below)
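As a quick reminder, the standard definitions behind those abbreviations (not spelled out on the original slide):

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)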
In the previous episode
Supervised learning: regression and classification
Covered so far: decision trees (ID3, C4.5)
Today
Supervised learning: regression and classification
Naïve Bayes, linear regression, k-nearest neighbors (KNN)
Supervised learning
Four examples of approaches:
Ad-hoc: decision tree induction
Probabilistic modeling: Naïve Bayes classifier
Objective function optimization: linear least squares regression
Instance-based methods: k-nearest neighbors
Shall we play tennis today?
Probabilistic model:
P(Yes) = 9/14 = 0.64
P(No) = 5/14 = 0.36
⇒ predict Yes

It's windy today. Tennis, anyone?
Probabilistic model:
P(Weak) = 8/14
P(Strong) = 6/14
P(Yes | Weak) = 6/8
P(No | Weak) = 2/8
P(Yes | Strong) = 3/6
P(No | Strong) = 3/6
More attributes
Probabilistic model:
P(High,Weak) = 4/14
P(Yes | High,Weak) = 2/4
P(No | High,Weak) = 2/4
P(High,Strong) = 3/14
P(Yes | High,Strong) = 1/3
P(No | High,Strong) = 2/3
…
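For concreteness, here is a minimal Python sketch of how such a table of estimates can be obtained by counting, assuming the training data is available as a list of (humidity, wind, play) records; the function and variable names are illustrative, not from the original slides:

from collections import Counter

def estimate_probabilities(records):
    """records: list of (humidity, wind, play) tuples, e.g. ("High", "Weak", "Yes").
    Returns P(humidity, wind) and P(play | humidity, wind), estimated by counting."""
    n = len(records)
    attr_counts = Counter((h, w) for h, w, _ in records)
    joint_counts = Counter(((h, w), play) for h, w, play in records)
    p_attrs = {hw: c / n for hw, c in attr_counts.items()}            # e.g. P(High, Weak) = 4/14
    p_class_given_attrs = {(hw, play): c / attr_counts[hw]            # e.g. P(Yes | High, Weak) = 2/4
                           for (hw, play), c in joint_counts.items()}
    return p_attrs, p_class_given_attrs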
The Bayes Classifier
In general:
1. Estimate from data: P(Class | X1, X2, X3, …)
2. For a given instance (x1, x2, x3, …), predict the class whose conditional probability is larger:
P(C1 | X1, X2, X3, …) > P(C2 | X1, X2, X3, …)  ⇒  predict C1
The Bayes Classifier
In general:
1. P(C1 | X) > P(C2 | X)  ⇒  predict C1
2. P(X | C1) P(C1) / P(X) > P(X | C2) P(C2) / P(X)  ⇒  predict C1   (rewriting both sides with Bayes' rule)
3. P(X | C1) P(C1) > P(X | C2) P(C2)  ⇒  predict C1   (the common factor P(X) cancels)
4. P(X | C1) / P(X | C2) > P(C2) / P(C1)  ⇒  predict C1   (rearranged as a likelihood ratio against a prior ratio)
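For example, applying rule 1 to the single-attribute wind model above: P(Yes | Weak) = 6/8 > P(No | Weak) = 2/8, so on a weak-wind day we predict Yes; for strong wind P(Yes | Strong) = P(No | Strong) = 3/6, so Wind alone cannot separate the classes and the tie must be broken by convention or by adding more attributes.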
The Bayes Classifier
If the true underlying distribution is known, the Bayes classifier is optimal, i.e. it achieves the minimal possible probability of error.
Problem
We need an exponential amount of data: a separate conditional probability has to be estimated for every combination of attribute values, and the number of such combinations grows exponentially with the number of attributes.
P(High,Weak) = 4/14
P(Yes | High,Weak) = 2/4
P(No | High,Weak) = 2/4
P(High,Strong) = 3/14
P(Yes | High,Strong) = 1/3
P(No | High,Strong) = 2/3
…
The Naïve Bayes Classifier
To scale beyond 2–3 attributes, use a trick: assume that, given the class, the attributes are independent:
P(X1, X2, X3, … | Class) = P(X1 | Class) P(X2 | Class) P(X3 | Class) …
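For the two-attribute example above this reads P(High, Weak | Yes) ≈ P(High | Yes) · P(Weak | Yes), so only single-attribute conditionals need to be estimated. Each factor is cheap to obtain; e.g. by Bayes' rule, P(Weak | Yes) = P(Yes | Weak) P(Weak) / P(Yes) = (6/8)(8/14)/(9/14) = 6/9, reusing the counts from the earlier slides.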
The Naïve Bayes Classifier
P(X | C1) / P(X | C2) > P(C2) / P(C1)  ⇒  predict C1
becomes
[P(X1 | C1) P(X2 | C1) …] / [P(X1 | C2) P(X2 | C2) …] > P(C2) / P(C1)  ⇒  predict C1
Naïve Bayes Classifier
1. Training:
For each value v of each attribute i, compute from the data P(Xi = v | C1) and P(Xi = v | C2), together with the class priors P(C1) and P(C2).
2. Classification:
For a given instance (x1, x2, …) compute the ratio R = [P(x1 | C1) P(x2 | C1) …] / [P(x1 | C2) P(x2 | C2) …].
If R > P(C2) / P(C1), classify as C1, else C2. (A code sketch follows below.)
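As an illustration, here is a minimal Python sketch of this training and classification procedure for two classes and categorical attributes; the data layout, function names, and the absence of smoothing are simplifying assumptions, not part of the original slides:

from collections import Counter

def train_naive_bayes(examples):
    """examples: list of (attributes, label) pairs, attributes being a tuple of categorical values."""
    class_counts = Counter(label for _, label in examples)
    value_counts = Counter()   # value_counts[(i, v, label)]: #examples of class `label` with X_i = v
    for attrs, label in examples:
        for i, v in enumerate(attrs):
            value_counts[(i, v, label)] += 1
    priors = {c: n / len(examples) for c, n in class_counts.items()}
    def cond_prob(i, v, label):
        # P(X_i = v | label); real implementations usually smooth these counts
        # to avoid zero probabilities.
        return value_counts[(i, v, label)] / class_counts[label]
    return priors, cond_prob

def classify(attrs, priors, cond_prob, c1, c2):
    # Likelihood ratio against the prior ratio, exactly the rule stated above.
    ratio = 1.0
    for i, v in enumerate(attrs):
        ratio *= cond_prob(i, v, c1) / cond_prob(i, v, c2)
    return c1 if ratio > priors[c2] / priors[c1] else c2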
Naïve Bayes Classifier
Extendable to continuous attributes via kernel density estimation.
The goods:
Easy to implement, efficient
Won't overfit, interpretable
Works better than you would expect (e.g. spam filtering)
The bads:
"Naïve", linear
Usually won't work well when there are many classes
Supervised learning
Four examples of approaches:
Ad-hoc: decision tree induction
Probabilistic modeling: Naïve Bayes classifier
Objective function optimization: linear least squares regression
Instance-based methods: k-nearest neighbors
Linear regression
Consider data points of the type (x1, y1), (x2, y2), …, (xn, yn).
Let us search for a function of the form f(x) = wx
that would have the least possible sum of squared errors:
E(w) = Σi (yi − w xi)²
Linear regression
Set of functions f(x) = wx for various w.
Linear regression
The function having the least error
Linear regression
We need to solve:
minimize E(w), where E(w) = Σi (yi − w xi)²
Setting the derivative to zero:
dE/dw = −2 Σi xi (yi − w xi) = 0
we get
w = (Σi xi yi) / (Σi xi²)
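A minimal numerical sketch of this closed-form solution in Python (using numpy; the sample points are made up for illustration):

import numpy as np

def fit_slope(x, y):
    """Least-squares slope for f(x) = w*x (no intercept): w = sum(x*y) / sum(x*x), as derived above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum(x * y) / np.sum(x * x)

# Illustrative usage:
x = [1.0, 2.0, 3.0]
y = [2.1, 3.9, 6.2]
w = fit_slope(x, y)              # ≈ 2.04
predictions = [w * xi for xi in x]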
Linear regression
The idea naturally generalizes to more complex cases (e.g. multivariate regression).
The goods:
Simple, easy
Interpretable, popular, extendable to many variants
Won't overfit
The bads:
Linear, i.e. works well only for a limited set of datasets
Nearly never a perfect fit
Supervised learning
Four examples of approaches:
Ad-hoc: decision tree induction
Probabilistic modeling: Naïve Bayes classifier
Objective function optimization: linear least squares regression
Instance-based methods: k-nearest neighbors
K-Nearest Neighbors
Training:
Store and index all training instances.
Classification:
Given an instance to classify, find the k training instances nearest to it.
Predict the majority class among these k instances.
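A minimal brute-force sketch of this procedure in Python; the Euclidean distance and the (features, label) data layout are illustrative assumptions:

from collections import Counter
import math

def knn_classify(query, training, k=3):
    """training: list of (feature_vector, label) pairs; returns the majority label
    among the k training instances nearest to `query`."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(training, key=lambda item: dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]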
K-Nearest Neighbors
The goods:
Trivial concept
Easy to implement
Asymptotically optimal
Handles various kinds of data
The bads:
Difficult to implement efficiently
Not interpretable (no explicit model)
On smaller datasets loses to other methods
Supervised learning: a map of approaches
Ad-hoc: decision trees, forests; rule induction, ILP; fuzzy reasoning; …
Probabilistic models: Naïve Bayes; graphical models; regression models; density estimation; …
Objective optimization: regression models; kernel methods, SVM, RBF; neural networks; …
Instance-based: k-NN, LOWESS; kernel densities; SVM, RBF; …
Ensemble learners: Arcing, Boosting, Bagging, Dagging, Voting, Stacking
Coming up next
“Machine learning”
Terminology, foundations, general framework.
Supervised machine learning
Basic ideas, algorithms & toy examples.
Statistical challenges
Learning theory, consistency, bias-variance, overfitting, ...
State of the art techniques
SVM, kernel methods, graphical models, latent variable models, boosting, bagging, LASSO, on-line learning, deep learning, reinforcement learning, …