Support Vector Machines (SVM): A Tool for Machine Learning
Yixin Chen, Ph.D. Candidate, CSE
1/10/2002
Presentation Outline
Introduction
Linear Learning Machines
Support Vector Machines (SVM)
Examples
Conclusions
Introduction
Building machines capable of learning from experience.
Experience is usually specified by a finite amount of training data.
The goal is to achieve high generalization performance via learning from the training set.
The construction of a good learning machine is a compromise between the accuracy attained on a particular training set and the “capacity” of the machine.
SVMs have large learning capacity and can achieve excellent generalization performance.
Linear Learning Machines
Binary classification uses a linear function g(x) = w^T x + w_0, where x is the feature vector, w is the weight vector, and w_0 is the bias or threshold weight.
A two-category classifier implements the decision rule: decide class 1 if g(x) > 0 and class −1 if g(x) < 0.
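A minimal sketch of this decision rule in Python; the weight vector and bias below are made-up values for illustration:

```python
import numpy as np

def predict(w, w0, x):
    """Decide class 1 if g(x) > 0, class -1 if g(x) < 0, where g(x) = w^T x + w0."""
    g = np.dot(w, x) + w0
    return 1 if g > 0 else -1

# Hypothetical 2-D classifier: w = (1, -1), w0 = 0.5.
w, w0 = np.array([1.0, -1.0]), 0.5
print(predict(w, w0, np.array([2.0, 0.0])))   # g = 2.5 > 0  -> class 1
print(predict(w, w0, np.array([0.0, 3.0])))   # g = -2.5 < 0 -> class -1
```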
A Simple Linear Classifier
Some Properties of Linear Learning Machines
The decision surface is a hyperplane.
The feature space is divided into two half-spaces.
Several Questions
Does there exist a hyperplane that separates the training set?
If yes, how do we compute it? Is it unique?
If it is not unique, how can we find an “optimal” one?
What can we do if no separating hyperplane exists?
Facts
If the training set is linearly separable, then there exist infinitely many separating hyperplanes for it.
If the training set is linearly inseparable, then no separating hyperplane exists for it.
Support Vector Machines
Linearly Separable
Support Vector Machines
H1: w^T x − w_0 = 1, H: w^T x − w_0 = 0, H2: w^T x − w_0 = −1
Margin (distance between H1 and H2): 2/|w|
Support Vector Machines
Maximize the margin 2/|w|, i.e., minimize |w| (equivalently, minimize |w|²/2, which yields the quadratic program below).
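A quick numerical check of the margin formula, with an arbitrary example weight vector:

```python
import numpy as np

# For w = (3, 4), |w| = 5, so the margin between H1 and H2 is 2/|w| = 0.4.
w = np.array([3.0, 4.0])
print(2 / np.linalg.norm(w))   # 0.4
```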
Support Vector Machines
Quadratic Program (Maximal Margin)
min_{w,w_0} |w|²/2
s.t. w^T x_i ≥ w_0 + 1 for y_i = 1, and w^T x_i ≤ w_0 − 1 for y_i = −1
(or equivalently, y_i(w^T x_i − w_0) ≥ 1)
Dual QP (Maximal Margin)
min_α 0.5 Σ_{i=1,…,m} Σ_{j=1,…,m} y_i y_j α_i α_j x_i^T x_j − Σ_{i=1,…,m} α_i
s.t. Σ_{i=1,…,m} y_i α_i = 0, α_i ≥ 0, i = 1,…,m
Support Vectors
w is a linear combination of the training points, w = Σ_{i=1,…,m} α_i y_i x_i; only the support vectors (the points with α_i > 0) contribute to the sum.
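The dual is a standard QP, so any off-the-shelf QP solver applies. Below is a minimal sketch using the cvxopt package (an assumed tool choice, not one prescribed by the deck); it recovers w from the optimal α, flags the support vectors, and recovers w_0 from a support vector via y_i(w^T x_i − w_0) = 1:

```python
import numpy as np
from cvxopt import matrix, solvers  # assumes cvxopt is installed

def hard_margin_svm(X, y):
    """Solve the dual QP: min 0.5*a'Pa - 1'a  s.t.  y'a = 0, a >= 0,
    where P[i, j] = y_i * y_j * x_i^T x_j."""
    m = X.shape[0]
    P = matrix(np.outer(y, y) * (X @ X.T))
    q = matrix(-np.ones(m))
    G = matrix(-np.eye(m))                      # -a_i <= 0 encodes a_i >= 0
    h = matrix(np.zeros(m))
    A = matrix(y.reshape(1, -1).astype(float))  # equality constraint y'a = 0
    b = matrix(0.0)
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    sv = alpha > 1e-6                           # support vectors have alpha_i > 0
    w = (alpha * y) @ X                         # w = sum_i alpha_i y_i x_i
    i = int(np.argmax(sv))                      # index of any support vector
    w0 = X[i] @ w - y[i]                        # from y_i (w^T x_i - w0) = 1
    return w, w0, sv
```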
Support Vector Machines
Linearly Inseparable
Support Vector Machines
Maximize Margin and Minimize Error (Soft Margin)
min_{w,w_0,z} |w|²/2 + C Σ_{i=1,…,m} z_i
s.t. y_i(w^T x_i − w_0) + z_i ≥ 1, z_i ≥ 0, i = 1,…,m
(z_i is a slack or error variable)
Dual QP (Soft Margin)
min_α 0.5 Σ_{i=1,…,m} Σ_{j=1,…,m} y_i y_j α_i α_j x_i^T x_j − Σ_{i=1,…,m} α_i
s.t. Σ_{i=1,…,m} y_i α_i = 0, C ≥ α_i ≥ 0, i = 1,…,m
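Relative to the hard-margin dual, only the inequality constraints change: each α_i gains the upper bound C. In the cvxopt sketch above, that amounts to stacking one extra block onto G and h (again a sketch under the same assumptions):

```python
import numpy as np
from cvxopt import matrix

m, C = 100, 1000.0   # m: number of samples (placeholder); C as in the later examples

# Encode 0 <= alpha_i <= C as  -alpha_i <= 0  and  alpha_i <= C.
G = matrix(np.vstack([-np.eye(m), np.eye(m)]))
h = matrix(np.hstack([np.zeros(m), C * np.ones(m)]))
```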
Support Vector Machines
Nonlinear Mappings via Kernels
Idea: Map the original features into a higher-dimensional feature space, x → Φ(x). Design the classifier in the new feature space. The classifier is nonlinear in the original feature space but linear in the new feature space. (With an appropriate nonlinear mapping to a sufficiently high dimension, data from two categories can always be separated by a hyperplane.)
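A tiny sketch of this idea on XOR-like data, which no hyperplane separates in 2-D; the mapping Φ(x1, x2) = (x1, x2, x1·x2) used below is one illustrative choice, not the one prescribed by the deck:

```python
import numpy as np

# XOR labels: not linearly separable in the original 2-D space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# Map to 3-D: Phi(x1, x2) = (x1, x2, x1*x2).
Phi = np.column_stack([X, X[:, 0] * X[:, 1]])

# In the new space the hyperplane x1 + x2 - 2*x1*x2 = 0.5 separates the classes.
w, w0 = np.array([1.0, 1.0, -2.0]), -0.5
print(np.sign(Phi @ w + w0))   # [-1.  1.  1. -1.], matching y
```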
Support Vector Machines
Maximal Margin
min_α 0.5 Σ_{i=1,…,m} Σ_{j=1,…,m} y_i y_j α_i α_j Φ(x_i)^T Φ(x_j) − Σ_{i=1,…,m} α_i
s.t. Σ_{i=1,…,m} y_i α_i = 0, α_i ≥ 0, i = 1,…,m
Soft Margin
min_α 0.5 Σ_{i=1,…,m} Σ_{j=1,…,m} y_i y_j α_i α_j Φ(x_i)^T Φ(x_j) − Σ_{i=1,…,m} α_i
s.t. Σ_{i=1,…,m} y_i α_i = 0, C ≥ α_i ≥ 0, i = 1,…,m
Support Vector Machines
Role of Kernels
Simplify the computation of inner products in the new feature space:
K(x, y) = Φ(x)^T Φ(y)
Some Popular Kernels
Polynomial: K(x, y) = (x^T y + 1)^p
Gaussian: K(x, y) = e^{−|x−y|²/(2σ²)}
Sigmoid: K(x, y) = tanh(κ x^T y − δ)
Both the maximal margin and soft margin QPs can be written in terms of K.
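As a sanity check of the identity K(x, y) = Φ(x)^T Φ(y): for the degree-2 polynomial kernel on 2-D inputs, one standard explicit feature map is Φ(x) = (1, √2·x1, √2·x2, x1², x2², √2·x1·x2). A minimal sketch verifying the identity numerically:

```python
import numpy as np

def poly2_phi(x):
    """Explicit feature map for K(x, y) = (x^T y + 1)^2 with 2-D inputs."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 * x1, x2 * x2, s * x1 * x2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((x @ z + 1) ** 2)             # kernel value: 4.0
print(poly2_phi(x) @ poly2_phi(z))  # same value via the explicit map: 4.0
```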
Support Vector Machines
Maximal Margin
min_α 0.5 Σ_{i=1,…,m} Σ_{j=1,…,m} y_i y_j α_i α_j K(x_i, x_j) − Σ_{i=1,…,m} α_i
s.t. Σ_{i=1,…,m} y_i α_i = 0, α_i ≥ 0, i = 1,…,m
Soft Margin
min_α 0.5 Σ_{i=1,…,m} Σ_{j=1,…,m} y_i y_j α_i α_j K(x_i, x_j) − Σ_{i=1,…,m} α_i
s.t. Σ_{i=1,…,m} y_i α_i = 0, C ≥ α_i ≥ 0, i = 1,…,m
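In practice these kernelized QPs are solved by packaged trainers such as SVMLight or scikit-learn's SVC. A minimal usage sketch with SVC (the tool choice here is an assumption; note that its 'rbf' kernel parameter gamma corresponds to 1/(2σ²) in the Gaussian kernel above):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two Gaussian blobs (made up for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)

sigma = 1.0
clf = SVC(kernel='rbf', C=1000, gamma=1 / (2 * sigma ** 2))  # soft margin, Gaussian kernel
clf.fit(X, y)
print(clf.score(X, y))   # training accuracy
```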
Examples
Checker-Board Problem
[Figure: 3×3 checker-board training data; both axes range from 0 to 180]
Checker-Board Problem
[Figures: decision region boundaries for σ = 10 and σ = 5]
169 training samples, Gaussian kernel, soft margin, C = 1000
Checker-Board Problem
[Figures: decision region boundaries for σ = 15 and σ = 20]
169 training samples, Gaussian kernel, soft margin, C = 1000
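A rough reproduction of this experiment; the deck does not state how the 169 points were sampled, so a 13×13 grid over the square is assumed here, along with scikit-learn as the trainer:

```python
import numpy as np
from sklearn.svm import SVC

# 13 x 13 = 169 grid points over [0, 180)^2 (assumed sampling scheme).
g = np.linspace(0, 179, 13)
X = np.array([[a, b] for a in g for b in g])
# 3x3 checker-board labels from the parity of the 60 x 60 cell indices.
y = np.where(((X[:, 0] // 60) + (X[:, 1] // 60)) % 2 == 0, 1, -1)

for sigma in (5, 10, 15, 20):
    clf = SVC(kernel='rbf', C=1000, gamma=1 / (2 * sigma ** 2))
    clf.fit(X, y)
    print(sigma, clf.score(X, y))
```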
Examples
Two-Spiral Problem
[Figure: two-spiral training data; both axes range from −30 to 30]
Two-Spiral Problem
[Figures: decision region boundaries for σ = 2 and σ = 1; legend: spiral 1, spiral 2]
154 training samples, Gaussian kernel, soft margin, C = 1000
Two-Spiral Problem
[Figures: decision region boundaries for σ = 5 and σ = 7; legend: spiral 1, spiral 2]
154 training samples, Gaussian kernel, soft margin, C = 1000
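A similar sketch for the two-spiral data; the generating equations are not given in the deck, so an Archimedean-spiral parametrization with 77 points per spiral (154 total) is assumed:

```python
import numpy as np
from sklearn.svm import SVC

# Two interleaved spirals, 77 points each; radius grows from 1 to about 30.
t = np.linspace(0, 3 * np.pi, 77)
r = 1 + 29 * t / (3 * np.pi)
spiral1 = np.column_stack([r * np.cos(t), r * np.sin(t)])
spiral2 = -spiral1                      # the second spiral, rotated 180 degrees
X = np.vstack([spiral1, spiral2])
y = np.array([1] * 77 + [-1] * 77)

for sigma in (1, 2, 5, 7):
    clf = SVC(kernel='rbf', C=1000, gamma=1 / (2 * sigma ** 2))
    clf.fit(X, y)
    print(sigma, clf.score(X, y))
```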
Conclusions
Advantages
Training is a convex QP, so it always finds a global minimum.
Simple and clear geometric interpretation.
Limitations
Choosing an appropriate kernel.
Training a multi-class SVM in one step.