The Classification Problem
• Assign instances, usually represented by feature vectors, to one of several predefined categories.
• Example: Image classification
Starting from the simplest setting
• Two-class
• Samples are linearly separable
[Figure: linearly separable Class 1 and Class 2 samples, with a separating hyperplane g(x) = wTx + w0 = 0; w is the weight vector, w0 the threshold; g(x) > 0 on one side of the hyperplane and g(x) < 0 on the other]

How many classifiers could separate the data? Infinitely many!
Formulation
• Given training data (xi, yi), i = 1, 2, …, N
  – xi: feature vector
  – yi: label
• Learn a hyperplane which separates all the data
  – variables: w and w0
• Testing: decision function f(x) = sign(wTx + w0)
  – x: test data
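A minimal NumPy sketch of this decision function; the weight vector and threshold below are hypothetical values chosen only for illustration:

```python
import numpy as np

def decision(x, w, w0):
    """Decision function f(x) = sign(w^T x + w0)."""
    return np.sign(w @ x + w0)

# Hypothetical hyperplane parameters, for illustration only.
w = np.array([1.0, -2.0])   # weight vector
w0 = 0.5                    # threshold

x_test = np.array([3.0, 1.0])
print(decision(x_test, w, w0))  # +1.0 -> class 1, -1.0 -> class 2
```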
[Figure: Class 1 and Class 2 samples with three separating hyperplanes H1, H2, and H3]

Hyperplanes H1, H2, and H3 are all candidate classifiers. Which one is preferred? Why?
[Figure: Class 1 and Class 2 separated by the hyperplane wTx + w0 = 0; the closest points on either side lie on the hyperplanes wTx + w0 = δ and wTx + w0 = -δ]

Scale w, w0 (divide both by δ) so that

  wTx + w0 = 1 for the closest points of class ω1,
  wTx + w0 = -1 for the closest points of class ω2.

Margin? Each of the two scaled hyperplanes lies at distance 1/||w|| from the separating hyperplane, so the margin is

  1/||w|| + 1/||w|| = 2/||w||, where ||w|| = sqrt(wTw) = sqrt(w1² + w2² + …)
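A quick numerical check of the 2/||w|| margin, assuming scikit-learn (not prescribed by the slides); a very large C approximates the hard-margin case:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable point clouds (toy data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.3, (20, 2)),
               rng.normal([-2, -2], 0.3, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

# A large C approximates the hard-margin SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_.ravel()
margin = 2.0 / np.linalg.norm(w)   # margin = 2 / ||w||
print("w =", w, " margin =", margin)
```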
Formulation
• Compute w, w0 so as to:

  minimize J(w) = (1/2) ||w||² = (1/2) wTw
  subject to yi(wTxi + w0) ≥ 1, i = 1, 2, …, N
Side information: this is a convex quadratic programming (QP) problem.
Formulation
• The problem is equivalent to the optimization task:

  maximize Σi λi − (1/2) Σi Σj λi λj yi yj xiTxj
  subject to Σi λi yi = 0, λi ≥ 0

  (the λi are the Lagrange multipliers)

• w can be recovered by w = Σi=1..N λi yi xi
• Classification rule: assign x to ω1 (ω2) if

  g(x) = Σi=1..N λi yi xiTx + w0 > 0 (< 0)
Remarks
• Only some of the λi are nonzero.
• The xi with nonzero λi are called support vectors.
• The hyperplane is determined only by the support vectors.
• The cost function is expressed entirely in inner products: it does not depend explicitly on the dimensionality of the input space!
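A sketch of solving this dual task with cvxopt (one possible QP solver; an assumption, not a tool the slides prescribe). It also illustrates the remark above: only a few λi come out nonzero, and those index the support vectors from which w and w0 are recovered:

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options["show_progress"] = False

def svm_dual(X, y):
    """Hard-margin dual:
       max  Σ λi - 1/2 Σij λi λj yi yj xi^T xj
       s.t. Σ λi yi = 0, λi >= 0
       cast as cvxopt's QP: min 1/2 λ^T P λ + q^T λ."""
    N = len(y)
    K = X @ X.T                          # Gram matrix of inner products
    P = matrix(np.outer(y, y) * K)
    q = matrix(-np.ones(N))
    G = matrix(-np.eye(N))               # -λi <= 0  <=>  λi >= 0
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1).astype(float))   # Σ λi yi = 0
    b = matrix(0.0)
    lam = np.ravel(solvers.qp(P, q, G, h, A, b)["x"])

    sv = lam > 1e-6                      # only a few λi are nonzero
    w = ((lam * y)[:, None] * X).sum(axis=0)     # w = Σ λi yi xi
    w0 = np.mean(y[sv] - X[sv] @ w)      # from yi (w^T xi + w0) = 1 on SVs
    return w, w0, sv

# Toy usage: two separable clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 0.4, (15, 2)), rng.normal(-2, 0.4, (15, 2))])
y = np.array([1.0] * 15 + [-1.0] * 15)
w, w0, sv = svm_dual(X, y)
print("support vectors:", sv.sum(), "out of", len(y))
```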
Non-separable Classes
[Figure: overlapping Class 1 and Class 2 samples that no hyperplane can separate]

Allow training errors!
Previous constraint: yi(wTxi + w0) ≥ 1
Introduce slack variables ξi for the errors: yi(wTxi + w0) ≥ 1 − ξi
  – ξi > 1: the sample is misclassified
  – 0 < ξi ≤ 1: the sample is classified correctly but falls inside the margin
  – otherwise, ξi = 0
Formulation
• Compute w, w0 so as to:

  minimize J(w, w0, ξ) = (1/2) wTw + C Σi=1..N ξi
  subject to yi(wTxi + w0) ≥ 1 − ξi, i = 1, 2, …, N
             ξi ≥ 0, i = 1, 2, …, N

  (C is the penalty parameter)
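A sketch, assuming scikit-learn, of how the penalty parameter C trades margin violations against margin width on overlapping data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Overlapping classes: not linearly separable.
X = np.vstack([rng.normal([1, 1], 1.0, (50, 2)),
               rng.normal([-1, -1], 1.0, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Smaller C -> more slack tolerated -> usually more support vectors.
    print(f"C={C:<6} support vectors: {clf.n_support_.sum()}")
```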
Formulation
• The dual problem:

  maximize Σi λi − (1/2) Σi Σj λi λj yi yj xiTxj
  subject to 0 ≤ λi ≤ C, i = 1, 2, …, N
             Σi λi yi = 0
Non-linear Case
• Linearly separable in other spaces?
• Idea: map the feature vector to a higher-dimensional space
  minimize J(w, w0, ξ) = (1/2) wTw + C Σi=1..N ξi
  subject to yi(wTφ(xi) + w0) ≥ 1 − ξi, i = 1, 2, …, N
             ξi ≥ 0, i = 1, 2, …, N
• Problems
  – High computational burden
  – Hard to get a good estimate
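A sketch of the computational burden of an explicit mapping, using scikit-learn's PolynomialFeatures (an assumption) as the map φ; the mapped dimension grows combinatorially with the degree:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.ones((1, 50))                   # a single 50-dimensional sample
for d in (2, 3, 4):
    phi = PolynomialFeatures(degree=d).fit_transform(x)
    print(f"degree {d}: mapped dimension = {phi.shape[1]}")
# Explicitly computing and storing phi(x) quickly becomes impractical.
```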
Kernel Trick
• Recall that in the dual problem, w can be recovered by

  w = Σi=1..N λi yi xi

• g(x) = wTx + w0 = Σi=1..Ns λi yi xiTx + w0 (Ns: number of support vectors)

All we need here is the inner product of (transformed) feature vectors!
Kernel Trick
• Decision function

  g(x) = wTφ(x) + w0 = Σi=1..Ns λi yi φ(xi)Tφ(x) + w0

• Kernel function

  K(xi, xj) = φ(xi)Tφ(xj)
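A sketch, assuming scikit-learn's SVC, that evaluates this decision function by hand from the stored support vectors and λi yi coefficients (dual_coef_) and checks it against the library's own output:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # not linearly separable

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

x = np.array([[0.3, -0.7]])
# RBF kernel between x and each support vector: K = exp(-gamma ||xi - x||^2)
d2 = ((clf.support_vectors_ - x) ** 2).sum(axis=1)
K = np.exp(-gamma * d2)
# dual_coef_[0] holds λi * yi for the support vectors.
g = clf.dual_coef_[0] @ K + clf.intercept_[0]

print(g, clf.decision_function(x)[0])        # the two values should match
```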
Remarks
• In practice, we specify K, thereby specifying φ(·) indirectly, instead of choosing φ(·) directly
• Intuitively, K(x, y) represents the similarity between the data points x and y
• K(x, y) must satisfy the Mercer condition in order for φ(·) to exist
Examples of Kernel Functions
• Polynomial kernel with degree d: K(x, y) = (xTy + 1)^d
• Radial basis function kernel with width σ: K(x, y) = exp(−||x − y||² / (2σ²))
• Sigmoid kernel with parameters κ and θ: K(x, y) = tanh(κ xTy + θ)
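The three kernels written out in NumPy; the parameter names (d, sigma, kappa, theta) follow the notation above:

```python
import numpy as np

def poly_kernel(x, y, d=3):
    """Polynomial kernel: K(x, y) = (x^T y + 1)^d."""
    return (x @ y + 1.0) ** d

def rbf_kernel(x, y, sigma=1.0):
    """RBF kernel: K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    """Sigmoid kernel: K(x, y) = tanh(kappa x^T y + theta)."""
    return np.tanh(kappa * (x @ y) + theta)
```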
Pros and Cons
• Strengths
  – Training is relatively easy
  – It scales relatively well to high-dimensional data
  – The tradeoff between classifier complexity and error can be controlled explicitly
• Weaknesses
  – No practical method for selecting the best kernel function
  – Binary classification only; multi-class problems require combining classifiers (next slides)
Combining SVM binary classifiers for a multi-class problem (1)
• M-category classification (ω1, ω2, … , ωM)
• Two popular approaches
  1. One-against-all (ωi vs. the M−1 others)
     • M classifiers
     • Choose the one with the largest output

[Example: 5 categories; the classifier for ω1 gives the largest output, so the winner is ω1]
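A one-against-all sketch, assuming scikit-learn's OneVsRestClassifier; the 5-category data here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# 5-category toy problem, mirroring the slide's example.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=6, n_classes=5, random_state=0)

# Trains M = 5 binary classifiers (each class vs. the rest);
# prediction picks the class whose classifier outputs the largest score.
ova = OneVsRestClassifier(LinearSVC()).fit(X, y)
print(ova.predict(X[:3]))
```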
Combining SVM binary classifiers for a multi-class problem (2)
  2. Pairwise coupling (ωi vs. ωj)
     • M(M−1)/2 classifiers
     • Aggregate the outputs

[Example: 5 categories; the 10 pairwise SVM outputs are aggregated into a decision by voting]
  Votes: ω1: 4, ω2: 1, ω3: 3, ω4: 0, ω5: 2
  Winner: ω1
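A pairwise-coupling sketch, again on synthetic 5-category data; scikit-learn's OneVsOneClassifier (an assumption) trains the M(M−1)/2 binary classifiers and aggregates their outputs by voting:

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=6, n_classes=5, random_state=0)

# Trains M(M-1)/2 = 10 binary classifiers and decides by voting.
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)
print(len(ovo.estimators_), ovo.predict(X[:3]))   # 10 pairwise classifiers
```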
Data normalization
• The features may have different ranges.
  Example: we use weight (w) and height (h) for classifying male and female college students.
  – male: avg(w) = 69.80 kg, avg(h) = 174.36 cm
  – female: avg(w) = 52.86 kg, avg(h) = 159.77 cm
  Different scales!
Data normalization
• “Data pre-processing”
• Equalize scales among different features
  – Zero mean and unit variance
  – Two cases in practice:
    • (0, 1) if all feature values are positive
    • (−1, 1) if feature values may be positive or negative
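A normalization sketch, assuming scikit-learn's scalers: StandardScaler gives zero mean and unit variance; MinMaxScaler covers the (0, 1) and (−1, 1) cases. Note that the scaling is fitted on training data only and then applied to test data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Toy (weight kg, height cm) data with very different scales.
X_train = np.array([[69.8, 174.4], [52.9, 159.8], [75.1, 180.2]])

# Zero mean and unit variance per feature.
std = StandardScaler().fit(X_train)          # fit on training data only
print(std.transform(X_train))

# Map each feature into (-1, 1); use (0, 1) if all values are positive.
mm = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
print(mm.transform(X_train))
```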
Assignment #4
• Develop an SVM classifier using either
  – OpenCV, or
  – LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
• Use “training.txt” to train your classifier, and evaluate its performance on “test.txt”
• Write a 1-page report that summarizes how you implemented your classifier and the classification accuracy rate
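A minimal starting point, assuming the official LIBSVM Python bindings (the libsvm-official pip package) and assuming “training.txt” and “test.txt” are in LIBSVM's sparse text format; the OpenCV route is equally acceptable per the assignment:

```python
# pip install libsvm-official
from libsvm.svmutil import svm_read_problem, svm_train, svm_predict

# Assumes the files are in LIBSVM's "label index:value ..." format.
y_train, x_train = svm_read_problem("training.txt")
y_test, x_test = svm_read_problem("test.txt")

# -t 2: RBF kernel, -c 1: penalty parameter C (tune both in practice).
model = svm_train(y_train, x_train, "-t 2 -c 1")

# p_acc[0] is the classification accuracy rate to report.
p_label, p_acc, p_val = svm_predict(y_test, x_test, model)
print("accuracy:", p_acc[0])
```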