Classification II
Tamara Berg
CS 560 Artificial Intelligence
Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer, Rob Pless, Kilian Weinberger, Deva Ramanan
Announcements
• HW3 due tomorrow, 11:59pm
• Midterm 2 next Wednesday, Nov 4
  – Bring a simple calculator
  – You may bring one 3x5 notecard of notes (both sides)
• Monday, Nov 2 we will have in-class practice questions
Midterm Topic List
Probability
– Random variables
– Axioms of probability
– Joint, marginal, conditional probability distributions
– Independence and conditional independence
– Product rule, chain rule, Bayes rule
Bayesian Networks General
– Structure and parameters
– Calculating joint and conditional probabilities
– Independence in Bayes Nets (Bayes Ball)
Bayesian Inference
– Exact Inference (Inference by Enumeration, Variable Elimination)
– Approximate Inference (Forward Sampling, Rejection Sampling, Likelihood Weighting)
– Networks for which efficient inference is possible
Midterm Topic List

Naïve Bayes
– Parameter learning including Laplace smoothing
– Likelihood, prior, posterior
– Maximum likelihood (ML), maximum a posteriori (MAP) inference
– Application to spam/ham classification and image classification
HMMs
– Markov Property
– Markov Chains
– Hidden Markov Model (initial distribution, transitions, emissions)
– Filtering (forward algorithm)
– Application to speech recognition and robot localization
Midterm Topic List
Machine Learning
– Unsupervised/supervised/semi-supervised learning
– K Means clustering
– Hierarchical clustering (agglomerative, divisive)
– Training, tuning, testing, generalization
– Nearest Neighbor
– Decision Trees
– Boosting
– Application of algorithms to research problems (e.g. visual word discovery, pose estimation, im2gps, scene completion, face detection)
The basic classification framework
y = f(x)
• Learning: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the parameters of the prediction function f
• Inference: apply f to a never-before-seen test example x and output the predicted value y = f(x)

Here y is the output, f is the classification function, and x is the input.
Classification by Nearest Neighbor
Word vector document classification – here the vector space is illustrated as having 2 dimensions. How many dimensions would the data actually live in?
Classification by Nearest Neighbor
Classify the test document as the class of the document "nearest" to the query document (use a vector distance, e.g. Euclidean distance, to find the most similar doc)
Classification by kNN
Classify the test document as the majority class of the k documents “nearest” to the query document.
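A minimal sketch of this rule in Python, assuming each document is already represented as a fixed-length word-count vector (all names here are illustrative):

```python
import math
from collections import Counter

def euclidean(u, v):
    # Distance between two equal-length word-count vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_classify(train_docs, train_labels, query, k=3):
    # Sort training documents by distance to the query document.
    neighbors = sorted(zip(train_docs, train_labels),
                       key=lambda pair: euclidean(pair[0], query))
    # Take the labels of the k nearest and return the majority class.
    top_k = [label for _, label in neighbors[:k]]
    return Counter(top_k).most_common(1)[0][0]
```

With k = 1 this reduces to the nearest-neighbor rule from the previous slide.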
Decision tree classifier
Example problem: decide whether to wait for a table at a restaurant, based on the following attributes:

1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Decision tree classifier

[Figure: a decision tree for the restaurant problem]
Shall I play tennis today?
How do we choose the best attribute?

[Figure: a partially grown tree with leaf nodes; we must choose the next attribute for splitting]
Criterion for attribute selection
• Which is the best attribute?
  – The one which will result in the smallest tree
  – Heuristic: choose the attribute that produces the "purest" nodes
• Need a good measure of purity!
Information Gain
Which test is more informative? Splitting on Humidity (≤ 75% vs. > 75%) or on Wind (≤ 20 vs. > 20)?
Information Gain
Impurity/Entropy (informal): measures the level of impurity in a group of examples
Impurity
[Figure: three groups of examples, from a very impure group, to a less impure group, to a group with minimum impurity]
Entropy: a common way to measure impurity

• Entropy = −Σi pi log2 pi

pi is the probability of class i. Compute it as the proportion of class i in the set.
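This formula translates directly into Python (a small sketch; the class proportions are passed in precomputed):

```python
import math

def entropy(proportions):
    # proportions: fraction of examples in each class (summing to 1).
    # Terms with p = 0 contribute 0 by convention (p log p -> 0).
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([14/30, 16/30]))  # ≈ 0.996, used in the worked example below
```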
2-Class Cases:

• What is the entropy of a group in which all examples belong to the same class?
  entropy = −1 · log2(1) = 0   (minimum impurity)

• What is the entropy of a group with 50% in either class?
  entropy = −0.5 log2(0.5) − 0.5 log2(0.5) = 1   (maximum impurity)
Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
• Information gain tells us how useful a given attribute of the feature vectors is.
• We can use it to decide the ordering of attributes in the nodes of a decision tree.
Calculating Information Gain

Entire population (30 instances, a 14/16 class split):

parent entropy = −(14/30) log2(14/30) − (16/30) log2(16/30) = 0.996

Left child (17 instances, a 13/4 split):

child entropy = −(13/17) log2(13/17) − (4/17) log2(4/17) = 0.787

Right child (13 instances, a 1/12 split):

child entropy = −(1/13) log2(1/13) − (12/13) log2(12/13) = 0.391

(Weighted) Average Entropy of Children = (17/30) · 0.787 + (13/30) · 0.391 = 0.615

Information Gain = entropy(parent) − [weighted average entropy(children)] = 0.996 − 0.615 = 0.38
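These numbers can be checked in a few lines of Python (the entropy helper is restated so the snippet runs standalone):

```python
import math

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

parent = entropy([14/30, 16/30])               # ≈ 0.996
left   = entropy([13/17, 4/17])                # ≈ 0.787 (17 instances)
right  = entropy([1/13, 12/13])                # ≈ 0.391 (13 instances)
children = (17/30) * left + (13/30) * right    # ≈ 0.615
print(round(parent - children, 2))             # information gain: 0.38
```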
[Attribute selection in decision tree learning, e.g. based on information gain]
Model Ensembles
Random Forests
A variant of bagging proposed by Breiman.

The classifier consists of a collection of tree-structured classifiers. Each tree casts a vote for the class of input x.
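For illustration (not part of the slides), scikit-learn's RandomForestClassifier implements this recipe; a minimal usage sketch on made-up data:

```python
from sklearn.ensemble import RandomForestClassifier

# Toy data: six 2-D points with binary labels (illustrative only).
X = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y = [0, 0, 0, 1, 1, 1]

# Each of the 100 trees is trained on a bootstrap sample and
# casts a vote on the class of a new input x.
forest = RandomForestClassifier(n_estimators=100).fit(X, y)
print(forest.predict([[5.5, 5.5]]))  # -> [1]
```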
Boosting

• A simple algorithm for learning robust classifiers
  – Freund & Schapire, 1995
  – Friedman, Hastie, Tibshirani, 1998
• Provides an efficient algorithm for sparse visual feature selection
  – Tieu & Viola, 2000
  – Viola & Jones, 2003
• Easy to implement; doesn't require external optimization tools. Used for many real problems in AI.
Boosting

• Defines a classifier using an additive model:

  F(x) = α1 f1(x) + α2 f2(x) + α3 f3(x) + …

  where F(x) is the strong classifier, each fi(x) is a weak classifier, each αi is its weight, and x is the input feature vector.
Boosting

• Defines a classifier using an additive model:

  F(x) = α1 f1(x) + α2 f2(x) + α3 f3(x) + …

• We need to define a family of weak classifiers; each fi(x) is selected from this family of weak classifiers.
Adaboost

Input: training samples
Initialize weights on samples
For T iterations:
  Select best weak classifier based on weighted error
  Update sample weights
Output: final strong classifier (combination of selected weak classifier predictions)
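A runnable sketch of this loop in Python, using decision stumps as the weak family (a simplified discrete AdaBoost; the names and the exhaustive stump search are my illustration, not code from the course):

```python
import numpy as np

def best_stump(X, y, w):
    # Exhaustively pick the (feature, threshold, sign) stump
    # with the lowest weighted error under the current weights w.
    best = None
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] > thresh, sign, -sign)
                err = w[pred != y].sum() / w.sum()
                if best is None or err < best[0]:
                    best = (err, j, thresh, sign)
    return best

def adaboost(X, y, T=10):
    # X: (n, d) feature matrix; y: labels in {+1, -1}.
    w = np.ones(len(y))                        # initialize weights on samples
    stumps, alphas = [], []
    for _ in range(T):
        err, j, thresh, sign = best_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # weight of this weak classifier
        pred = np.where(X[:, j] > thresh, sign, -sign)
        w = w * np.exp(-alpha * y * pred)      # update sample weights
        stumps.append((j, thresh, sign))
        alphas.append(alpha)

    def strong(x):
        # Final strong classifier: sign of the weighted combination.
        total = sum(a * (s if x[j] > t else -s)
                    for a, (j, t, s) in zip(alphas, stumps))
        return 1 if total > 0 else -1
    return strong
```

Calling strong = adaboost(X, y) with X an (n, d) numpy array and y in {+1, −1} returns the final strong classifier as a function of a single feature vector.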
Boosting

• It is a sequential procedure. Each data point has a class label, yt = +1 or −1, and a weight, initialized to wt = 1.

[Figure: weighted training points xt=1, xt=2, …, xt]
Toy example

Weak learners come from the family of lines. A line h for which p(error) = 0.5 is at chance. Each data point has a class label, yt = +1 or −1, and a weight, initialized to wt = 1.
Toy example

This one seems to be the best. This is a 'weak classifier': it performs slightly better than chance. Each data point has a class label, yt = +1 or −1, and a weight wt.
Toy example (repeated over several rounds)

At each round we update the weights:

  wt ← wt exp{−yt Ht}

Each data point has a class label, yt = +1 or −1, and a weight wt. Points misclassified by the current classifier Ht (where yt and Ht disagree) have their weights increased, and correctly classified points have their weights decreased.
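A tiny numeric check of this update (illustrative values, not from the slides), showing that the misclassified points gain weight:

```python
import numpy as np

y = np.array([+1, +1, -1, -1])   # true labels
H = np.array([+1, -1, -1, +1])   # classifier outputs: 2nd and 4th are wrong
w = np.ones(4)                   # current weights

w = w * np.exp(-y * H)           # the update rule from the slide
print(w.round(2))                # [0.37 2.72 0.37 2.72]
```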
Toy example

The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers f1, f2, f3, f4.
Adaboost

Input: training samples
Initialize weights on samples
For T iterations:
  Select best weak classifier based on weighted error
  Update sample weights
Output: final strong classifier (combination of selected weak classifier predictions)
Boosting for Face Detection
Face detection

• We slide a window over the image
• Extract features for each window (features: ?)
• Classify each window into face (+1) / not face (−1) (classifier: ?)

Pipeline: x → F(x) → y, with the features and the classifier still to be chosen.
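A sketch of that sliding-window loop (the window size, stride, and classify function are illustrative placeholders):

```python
import numpy as np

def detect_faces(image, classify, win=24, stride=4):
    # Slide a win x win window over a grayscale image and keep the
    # locations the classifier labels +1 (face).
    detections = []
    H, W = image.shape
    for r in range(0, H - win + 1, stride):
        for c in range(0, W - win + 1, stride):
            if classify(image[r:r + win, c:c + win]) == 1:
                detections.append((r, c))
    return detections
```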
What is a face?
• Eyes are dark (eyebrows + shadows)
• Cheeks and forehead are bright
• Nose is bright
Paul Viola, Michael Jones, Robust Real-time Object Detection, IJCV 04
Basic feature extraction

• Information type:
  – intensity
• Sum over:
  – gray and white rectangles
• Output: (sum over gray) − (sum over white)
• Separate output value for:
  – each type
  – each scale
  – each position in the window
• FEX(im) = x = [x1, x2, …, xn]
Paul Viola, Michael Jones, Robust Real-time Object Detection, IJCV 04
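A sketch of one such feature in numpy: a two-rectangle feature that sums intensity over a "gray" region and subtracts the sum over a "white" region (the specific regions and the 24x24 window size are illustrative):

```python
import numpy as np

def rect_sum(im, r0, r1, c0, c1):
    # Sum of pixel intensities in rows r0:r1, cols c0:c1.
    return im[r0:r1, c0:c1].sum()

def two_rect_feature(window):
    # One feature: intensity summed over a "gray" rectangle
    # minus intensity summed over a "white" rectangle.
    h, w = window.shape
    gray  = rect_sum(window, 0, h // 2, 0, w)   # top half
    white = rect_sum(window, h // 2, h, 0, w)   # bottom half
    return gray - white

window = np.random.rand(24, 24)   # a 24x24 grayscale detection window
x1 = two_rect_feature(window)     # one entry of x = [x1, x2, ..., xn]
```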
Decision trees

• Stump:
  – 1 root
  – 2 leaves
• If xi > a then positive, else negative
• Very simple: a "weak classifier"
Paul Viola, Michael Jones, Robust Real-time Object Detection, IJCV 04
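The stump in code (a sketch; the feature index i and threshold a are the parameters chosen during learning):

```python
def stump(x, i, a):
    # Weak classifier: positive if feature x[i] exceeds threshold a.
    return 1 if x[i] > a else -1
```

For example, the stump shown on the summary slide below is stump(x, 234, 1.3).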
Summary: Face detection
• Use decision stumps as weak classifiers
• Use boosting to build a strong classifier
• Use a sliding window to detect the face

Example stump from the figure: is x234 > 1.3? Yes → +1 (Face); No → −1 (Non-face)
Discriminant Function

• It can be an arbitrary function of x, such as:
  – Nearest Neighbor
  – Decision Tree
  – Linear Functions