Machine Learning: Classification Algorithms
Review
A picture’s worth a thousand words:
• http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#example-classification-plot-classifier-comparison-py
Comparison of algorithms on: problem type; whether results are interpretable by you; whether the algorithm is easy to explain to others; average predictive accuracy; training speed; prediction speed; amount of parameter tuning needed (excluding feature selection); whether it performs well with a small number of observations; whether it handles lots of irrelevant features well (separates signal from noise); whether it automatically learns feature interactions; whether it gives calibrated probabilities of class membership; whether it is parametric; whether features might need scaling.

KNN (problem type: Either): interpretable: Yes; easy to explain: Yes; accuracy: Lower; training: Fast; prediction: Depends on n; tuning: Minimal; small data: No; irrelevant features: No; feature interactions: No; calibrated probabilities: Yes; parametric: No; scaling needed: Yes

Linear regression (Regression): interpretable: Yes; easy to explain: Yes; accuracy: Lower; training: Fast; prediction: Fast; tuning: None (excluding regularization); small data: Yes; irrelevant features: No; feature interactions: No; calibrated probabilities: N/A; parametric: Yes; scaling needed: No (unless regularized)

Logistic regression (Classification): interpretable: Somewhat; easy to explain: Somewhat; accuracy: Lower; training: Fast; prediction: Fast; tuning: None (excluding regularization); small data: Yes; irrelevant features: No; feature interactions: No; calibrated probabilities: Yes; parametric: Yes; scaling needed: No (unless regularized)

Naive Bayes (Classification): interpretable: Somewhat; easy to explain: Somewhat; accuracy: Lower; training: Fast (excluding feature extraction); prediction: Fast; tuning: Some (for feature extraction); small data: Yes; irrelevant features: Yes; feature interactions: No; calibrated probabilities: No; parametric: Yes; scaling needed: No

Decision trees (Either): interpretable: Somewhat; easy to explain: Somewhat; accuracy: Lower; training: Fast; prediction: Fast; tuning: Some; small data: No; irrelevant features: No; feature interactions: Yes; calibrated probabilities: Possibly; parametric: No; scaling needed: No

Random Forests (Either): interpretable: A little; easy to explain: No; accuracy: Higher; training: Slow; prediction: Moderate; tuning: Some; small data: No; irrelevant features: Yes (unless noise ratio is very high); feature interactions: Yes; calibrated probabilities: Possibly; parametric: No; scaling needed: No

AdaBoost (Either): interpretable: A little; easy to explain: No; accuracy: Higher; training: Slow; prediction: Fast; tuning: Some; small data: No; irrelevant features: Yes; feature interactions: Yes; calibrated probabilities: Possibly; parametric: No; scaling needed: No

Neural networks (Either): interpretable: No; easy to explain: No; accuracy: Higher; training: Slow; prediction: Fast; tuning: Lots; small data: No; irrelevant features: Yes; feature interactions: Yes; calibrated probabilities: Possibly; parametric: No; scaling needed: Yes
Parametric: assumes an underlying distribution. Non-parametric: no underlying distributional assumptions.
Calibrated probabilities: a probability between 0 and 1 is computed, rather than simply determining the class. Tuning parameters: variables that you can manipulate to get better fits.
SUMMARY OF MACHINE LEARNING ALGORITHM FEATURES
Nearest Neighbor Classifiers
• Basic idea:
  – If it walks like a duck, quacks like a duck, then it’s probably a duck
[Figure: compute the distance from a test record to the training records, then choose k of the “nearest” records]
Nearest-Neighbor Classifiers
• Requires three things:
  – The set of stored records
  – A distance metric to compute distance between records
  – The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
  – Compute the distance to the other training records
  – Identify the k nearest neighbors
  – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
Definition of Nearest Neighbor
[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
Nearest Neighbor Classification
• Compute the distance between two points:
  – Euclidean distance: d(p, q) = √( Σᵢ (pᵢ − qᵢ)² )
• Determine the class from the nearest neighbor list:
  – Take the majority vote of class labels among the k nearest neighbors
  – Weigh the vote according to distance
    • weight factor: w = 1/d²
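As a sketch of the two voting schemes above, here is scikit-learn's `KNeighborsClassifier` on an invented toy dataset. Note that `weights='distance'` uses 1/d rather than the slide's 1/d² (a custom callable could implement 1/d²); the data and k value are illustrative only.

```python
# Sketch: k-NN with majority vote vs. distance-weighted vote.
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D training set: two clusters, one per class.
X_train = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y_train = [0, 0, 0, 1, 1, 1]

# Majority vote among the k=3 nearest (Euclidean) neighbors.
knn_uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform')
knn_uniform.fit(X_train, y_train)

# Votes weighted by 1/distance, so closer neighbors count more.
knn_weighted = KNeighborsClassifier(n_neighbors=3, weights='distance')
knn_weighted.fit(X_train, y_train)

print(knn_uniform.predict([[0.5, 0.5]]))   # query near the class-0 cluster
print(knn_weighted.predict([[5.5, 5.5]]))  # query near the class-1 cluster
```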
Nearest Neighbor Classification…
• Choosing the value of k:
  – If k is too small, the classifier is sensitive to noise points
  – If k is too large, the neighborhood may include points from other classes
Nearest Neighbor Classification…
• Scaling issues
  – Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
  – Example:
    • height of a person may vary from 1.5m to 1.8m
    • weight of a person may vary from 90lb to 300lb
    • income of a person may vary from $10K to $1M
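The example above can be sketched with scikit-learn's `StandardScaler`: the invented rows below use the slide's height/weight/income ranges, and standardization keeps the income column from dominating Euclidean distances.

```python
# Sketch: standardize features so no single attribute dominates distance.
from sklearn.preprocessing import StandardScaler

X = [[1.5, 90, 10_000],      # height (m), weight (lb), income ($)
     [1.8, 300, 1_000_000],
     [1.6, 150, 50_000]]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column has mean 0 and unit variance.
print(X_scaled.mean(axis=0))  # ~[0, 0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1, 1]
```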
What is a Decision Tree?
• An inductive learning task
  – Use particular facts to make more generalized conclusions
• A predictive model based on a branching series of Boolean tests
  – These smaller Boolean tests are less complex than a one-stage classifier
• Let’s look at a sample decision tree…
Predicting Commute Time
[Decision tree:
  Leave At?
    10 AM → Stall? (No → Short, Yes → Long)
    9 AM  → Accident? (No → Medium, Yes → Long)
    8 AM  → Long]
If we leave at 10 AM and there are no cars stalled on the road, what will our commute time be?
Inductive Learning
• In this decision tree, we made a series of Boolean decisions and followed the corresponding branch:
  – Did we leave at 10 AM?
  – Did a car stall on the road?
  – Is there an accident on the road?
• By answering each of these yes/no questions, we then came to a conclusion on how long our commute might take
Decision Tree Algorithms
• The basic idea behind any decision tree algorithm is as follows:
  – Choose the best attribute(s) to split the remaining instances and make that attribute a decision node
  – Repeat this process recursively for each child
  – Stop when:
    • All the instances have the same target attribute value
    • There are no more attributes
    • There are no more instances
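The recursive loop above can be sketched in plain Python. This is an illustrative toy, not a production algorithm: the `mixedness` split criterion (count of impure children) stands in for the Gini/entropy measures discussed below, and the record format is assumed to be a list of dicts.

```python
# Sketch of the generic greedy, recursive decision-tree induction loop.
from collections import Counter

def build_tree(instances, attributes, target):
    labels = [inst[target] for inst in instances]
    # Stop: all instances share the same target value -> leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no attributes left -> majority-class leaf.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedily pick the "best" attribute: fewest mixed-label children
    # (real implementations use Gini or information gain instead).
    def mixedness(attr):
        groups = {}
        for inst in instances:
            groups.setdefault(inst[attr], []).append(inst[target])
        return sum(len(set(g)) > 1 for g in groups.values())
    best = min(attributes, key=mixedness)
    remaining = [a for a in attributes if a != best]
    groups = {}
    for inst in instances:
        groups.setdefault(inst[best], []).append(inst)
    # Recurse on each child partition; keys are (attribute, value) branches.
    return {(best, value): build_tree(subset, remaining, target)
            for value, subset in groups.items()}

data = [
    {'outlook': 'sunny', 'windy': 'no',  'play': 'yes'},
    {'outlook': 'sunny', 'windy': 'yes', 'play': 'no'},
    {'outlook': 'rain',  'windy': 'no',  'play': 'yes'},
    {'outlook': 'rain',  'windy': 'yes', 'play': 'yes'},
]
tree = build_tree(data, ['outlook', 'windy'], 'play')
print(tree)
```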
How to determine the Best Split
Before splitting: 10 records of class C0, 10 records of class C1.
[Figure: three candidate splits:
  Own Car? → Yes: C0 6, C1 4 / No: C0 4, C1 6
  Car Type? → Family: C0 1, C1 3 / Sports: C0 8, C1 0 / Luxury: C0 1, C1 7
  Student ID? → c1 … c20: each child contains a single record]
Which test condition is the best?
How to determine the Best Split
• Greedy approach:
  – Nodes with homogeneous class distribution are preferred
• Need a measure of node impurity:
[Figure: C0 5, C1 5 → non-homogeneous, high degree of impurity; C0 9, C1 1 → homogeneous, low degree of impurity]
Measures of Node Impurity
• Gini Index
• Entropy
• Misclassification error
Measure of Impurity: GINI
• Gini Index for a given node t:
  GINI(t) = 1 − Σⱼ [p(j | t)]²
  (NOTE: p(j | t) is the relative frequency of class j at node t.)
  – Maximum (1 − 1/nc) when records are equally distributed among all classes, implying least interesting information
  – Minimum (0.0) when all records belong to one class, implying most interesting information
• Examples:
  C1: 0, C2: 6 → Gini = 0.000
  C1: 1, C2: 5 → Gini = 0.278
  C1: 2, C2: 4 → Gini = 0.444
  C1: 3, C2: 3 → Gini = 0.500
Examples for computing GINI
  GINI(t) = 1 − Σⱼ [p(j | t)]²
  C1: 0, C2: 6 → P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 − 0² − 1² = 0
  C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6; Gini = 1 − (1/6)² − (5/6)² = 0.278
  C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6; Gini = 1 − (2/6)² − (4/6)² = 0.444
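The worked examples above can be reproduced with a few lines of Python; the `gini` helper below is an illustrative name, taking per-class record counts for a node.

```python
# Sketch of GINI(t) = 1 - sum_j p(j|t)^2 from per-class counts.
def gini(counts):
    """Gini index of a node given its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(round(gini([0, 6]), 3))  # 0.0
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444
print(round(gini([3, 3]), 3))  # 0.5
```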
Alternative Splitting Criteria based on INFO
• Entropy at a given node t:
  Entropy(t) = − Σⱼ p(j | t) log p(j | t)
  (NOTE: p(j | t) is the relative frequency of class j at node t.)
  – Measures homogeneity of a node.
    • Maximum (log nc) when records are equally distributed among all classes, implying least information
    • Minimum (0.0) when all records belong to one class, implying most information
  – Entropy-based computations are similar to the GINI index computations
Examples for computing Entropy
  Entropy(t) = − Σⱼ p(j | t) log₂ p(j | t)
  C1: 0, C2: 6 → P(C1) = 0, P(C2) = 1; Entropy = −0 log 0 − 1 log 1 = −0 − 0 = 0
  C1: 1, C2: 5 → P(C1) = 1/6, P(C2) = 5/6; Entropy = −(1/6) log₂(1/6) − (5/6) log₂(5/6) = 0.65
  C1: 2, C2: 4 → P(C1) = 2/6, P(C2) = 4/6; Entropy = −(2/6) log₂(2/6) − (4/6) log₂(4/6) = 0.92
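The entropy examples check out in code; the `entropy` helper below (an illustrative name) uses the convention 0·log 0 = 0 from the first example.

```python
import math

# Sketch of Entropy(t) = -sum_j p(j|t) * log2 p(j|t), with 0 log 0 := 0.
def entropy(counts):
    """Entropy (bits) of a node given its per-class record counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(round(entropy([0, 6]), 2))  # 0.0
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92
```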
Splitting Based on INFO...
• Information Gain:
  GAIN_split = Entropy(p) − Σᵢ₌₁..k (nᵢ/n) Entropy(i)
  where parent node p is split into k partitions and nᵢ is the number of records in partition i
  – Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN)
  – Used in ID3 and C4.5
  – Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
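The gain formula can be sketched directly from the entropy helper; the function names and the perfectly-separating example split are illustrative.

```python
import math

# Entropy(t) = -sum_j p(j|t) * log2 p(j|t), with 0 log 0 := 0.
def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# Sketch of GAIN_split = Entropy(parent) - sum_i (n_i/n) * Entropy(child_i).
def information_gain(parent_counts, children_counts):
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child)
                   for child in children_counts)
    return entropy(parent_counts) - weighted

# Parent: 10 records of C0 and 10 of C1, split into two pure children.
print(information_gain([10, 10], [[10, 0], [0, 10]]))  # 1.0 (maximal gain)
```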
Stopping Criteria for Tree Induction
• Stop expanding a node when all the records belong to the same class
• Stop expanding a node when all the records have similar attribute values
• Early termination (to be discussed later)
Decision Tree Based Classification
• Advantages:
  – Inexpensive to construct
  – Extremely fast at classifying unknown records
  – Easy to interpret for small-sized trees
  – Accuracy is comparable to other classification techniques for many simple data sets
Practical Issues of Classification
• Underfitting and Overfitting
• Missing Values
• Costs of Classification
Notes on Overfitting
• Overfitting results in decision trees that are more complex than necessary
• Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
• Need new ways for estimating errors
How to Address Overfitting
• Pre-Pruning (Early Stopping Rule)
  – Stop the algorithm before it becomes a fully-grown tree
  – Typical stopping conditions for a node:
    • Stop if all instances belong to the same class
    • Stop if all the attribute values are the same
  – More restrictive conditions:
    • Stop if the number of instances is less than some user-specified threshold
    • Stop if the class distribution of instances is independent of the available features (e.g., using a χ² test)
    • Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)
Bayes Classifiers
Intuitively, Naïve Bayes computes the probability of a previously unseen instance belonging to each class, then simply picks the most probable class.
http://blog.yhat.com/posts/naive-bayes-in-python.html
Bayes Classifiers
• Bayesian classifiers use Bayes’ theorem, which says:
  p(cⱼ | d) = p(d | cⱼ) p(cⱼ) / p(d)
• p(cⱼ | d) = probability of instance d being in class cⱼ. This is what we are trying to compute.
• p(d | cⱼ) = probability of generating instance d given class cⱼ. We can imagine that being in class cⱼ causes you to have feature d with some probability.
• p(cⱼ) = probability of occurrence of class cⱼ. This is just how frequent the class cⱼ is in our database.
• p(d) = probability of instance d occurring. This can actually be ignored, since it is the same for all classes.
Different Naïve Bayes Models
• Multi-variate Bernoulli Naive Bayes: the Bernoulli model is useful if your feature vectors are binary (i.e., 0s and 1s). One application would be text classification with a bag-of-words model, where the 1s and 0s mean "word occurs in the document" and "word does not occur in the document".
• Multinomial Naive Bayes: the multinomial model is typically used for discrete counts. E.g., in a text classification problem, we can take the idea of Bernoulli trials one step further: instead of "word occurs in the document", we count how often the word occurs in the document. You can think of it as "the number of times outcome x_i is observed over the n trials".
• Gaussian Naive Bayes: here we assume that the features follow a normal distribution. Instead of discrete counts, we have continuous features (e.g., the popular Iris dataset, where the features are sepal width, petal width, sepal length, and petal length).
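As a sketch of the Gaussian variant on continuous features, scikit-learn's `GaussianNB` can be run on an invented two-cluster dataset (the data below is illustrative, not the Iris dataset):

```python
# Sketch: Gaussian Naive Bayes on continuous features.
from sklearn.naive_bayes import GaussianNB

# Toy continuous features: two well-separated clusters, one per class.
X = [[1.0, 2.0], [1.2, 1.9], [0.8, 2.1],
     [5.0, 6.0], [5.2, 5.9], [4.8, 6.1]]
y = [0, 0, 0, 1, 1, 1]

model = GaussianNB()
model.fit(X, y)

# Class predictions and per-class probabilities for new points.
print(model.predict([[1.1, 2.0], [5.1, 6.0]]))
print(model.predict_proba([[1.1, 2.0]]))
```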
Check out these websites for more!
• http://www.datasciencecentral.com/profiles/blogs/naive-bayes-for-dummies-a-simple-explanation
• http://blog.yhat.com/posts/naive-bayes-in-python.html
• In Sklearn:
• http://scikit-learn.org/stable/modules/naive_bayes.html
Logistic Regression vs. Naïve Bayes
• Naïve Bayes allows computing P(Y|X) by learning P(Y) and P(X|Y)
• Logistic Regression idea: why not learn P(Y|X) directly?
The Logistic Function
• We want a model that predicts probabilities between 0 and 1, that is, S-shaped.
• There are lots of S-shaped curves. We use the logistic model:
  Probability = exp(β₀ + β₁X) / [1 + exp(β₀ + β₁X)], or equivalently loge[P/(1−P)] = β₀ + β₁X
• The function on the left, loge[P/(1−P)], is called the logit function.
[Figure: S-shaped logistic curve P(y|x) = e^(β₀+β₁x) / (1 + e^(β₀+β₁x)), rising from 0 to 1 as x increases]
Logistic Regression Function
• Logistic regression models the logit of the outcome instead of the outcome itself; i.e., instead of winning or losing, we build a model for the log odds of winning or losing
• Natural logarithm of the odds of the outcome, ln(probability of the outcome (P) / probability of not having the outcome (1 − P)):
  ln(P/(1−P)) = α + β₁x₁ + β₂x₂ + … + βᵢxᵢ
• Equivalently: P(y|x) = e^(α+βx) / (1 + e^(α+βx))
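The two equivalent forms above can be checked numerically; the coefficient values in this sketch are arbitrary illustrations.

```python
import math

# Sketch of the logistic model P(y|x) = e^(a+bx) / (1 + e^(a+bx)),
# equivalently ln(P/(1-P)) = a + bx. Coefficients a, b are illustrative.
def logistic(x, a=0.0, b=1.0):
    z = a + b * x
    return math.exp(z) / (1.0 + math.exp(z))

def log_odds(p):
    """The logit: natural log of the odds p/(1-p)."""
    return math.log(p / (1.0 - p))

print(logistic(0.0))            # 0.5: log-odds of 0 maps to probability 0.5
print(log_odds(logistic(2.0)))  # recovers the linear predictor a + b*x = 2.0
```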
ROC Curves
• Originated from signal detection theory
  – Binary signal corrupted by Gaussian noise
  – What is the optimal threshold (i.e., operating point)?
• Depends on 3 factors
  – Signal strength
  – Noise variance
  – Personal tolerance of hit / false alarm rate
ROC Curves
• Receiver operating characteristic
• Summarizes & presents the performance of any binary classification model
• Measures the model’s ability to distinguish between false & true positives
Use Multiple Contingency Tables
• Sample contingency tables from a range of thresholds/probabilities.
• TRUE POSITIVE RATE (also called SENSITIVITY):
  True Positives / (True Positives + False Negatives)
• FALSE POSITIVE RATE (also called 1 − SPECIFICITY):
  False Positives / (False Positives + True Negatives)
• Plot Sensitivity vs. (1 − Specificity) for the sampled thresholds and you are done
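The threshold sweep above can be sketched by hand (scikit-learn's `roc_curve` automates this); the labels, scores, and thresholds below are invented for illustration.

```python
# Sketch: sweep a threshold over predicted scores and compute the
# (FPR, TPR) pairs that make up the points of an ROC curve.
def roc_points(y_true, scores, thresholds):
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fn = sum(1 for y, s in zip(y_true, scores) if s < t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        tn = sum(1 for y, s in zip(y_true, scores) if s < t and y == 0)
        tpr = tp / (tp + fn)  # sensitivity
        fpr = fp / (fp + tn)  # 1 - specificity
        points.append((fpr, tpr))
    return points

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_points(y_true, scores, [0.0, 0.5, 1.1]))
# [(1.0, 1.0), (0.0, 0.5), (0.0, 0.0)]
```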
Pros/Cons of Various Classification Algorithms
Logistic regression: no distribution requirement, performs well with categorical variables that have few categories, computes the logistic distribution, easy to interpret, computes confidence intervals, suffers from multicollinearity
Decision Trees: no distribution requirement, heuristic, good for variables with few categories, does not suffer from multicollinearity (by choosing one of the correlated variables), interpretable
Naïve Bayes: generally no requirements, good for variables with few categories, computes the product of independent distributions, suffers from multicollinearity
SVM: no distribution requirement, computes hinge loss, flexible selection of kernels for nonlinear correlation, does not suffer from multicollinearity, hard to interpret
Bagging, boosting, ensemble methods (RF, AdaBoost, etc.): generally outperform any single algorithm listed above. (Source: Quora)
Prediction Error and the Bias-Variance Tradeoff
• A good measure of the quality of an estimator f̂(x) is the mean squared error. Let f₀(x) be the true value of f(x) at the point x. Then
  Mse[f̂(x)] = E[(f̂(x) − f₀(x))²]
• This can be written as
  Mse[f̂(x)] = Var[f̂(x)] + [E f̂(x) − f₀(x)]²
  i.e., variance + bias².
• Typically, when bias is low, variance is high, and vice versa. Choosing estimators often involves a tradeoff between bias and variance.
Note the tradeoff between Bias and Variance!
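The decomposition Mse = variance + bias² can be verified empirically; the true value, sample size, and deliberately biased estimator in this sketch are all invented for illustration.

```python
import random

# Sketch: empirically check Mse[f_hat] = Var[f_hat] + bias^2.
# We repeatedly estimate f0 = 3.0 with a sample mean plus a deliberate +0.5 bias.
random.seed(0)
f0 = 3.0
estimates = []
for _ in range(10_000):
    sample = [f0 + random.gauss(0, 1) for _ in range(5)]
    estimates.append(sum(sample) / len(sample) + 0.5)  # biased estimator

n = len(estimates)
mean_est = sum(estimates) / n
mse = sum((e - f0) ** 2 for e in estimates) / n
variance = sum((e - mean_est) ** 2 for e in estimates) / n
bias_sq = (mean_est - f0) ** 2

# The identity holds exactly for these empirical quantities.
print(abs(mse - (variance + bias_sq)) < 1e-9)  # True
print(bias_sq)  # close to 0.5^2 = 0.25
```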