Support Vector Machines (SVM): A Tool for Machine Learning
Yixin Chen, Ph.D. Candidate, CSE
1/10/2002
Presentation Outline
Introduction
Linear Learning Machines
Support Vector Machines (SVM)
Examples
Conclusions
Introduction
Building machines capable of learning from experience.
Experience is usually specified by a finite amount of training data.
The goal is to achieve high generalization performance via learning from the training set.
The construction of a good learning machine is a compromise between the accuracy attained on a particular training set and the “capacity” of the machine.
SVMs have large learning capacity and can achieve excellent generalization performance.
Linear Learning Machines
Binary classification uses a linear function g(x) = w^T x + w_0, where x is the feature vector, w is the weight vector, and w_0 is the bias or threshold weight.
A two-category classifier implements the decision rule: decide class 1 if g(x) > 0 and class −1 if g(x) < 0.
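A minimal sketch of this decision rule in Python; the weight vector and bias below are made-up values for illustration:

```python
import numpy as np

def predict(w, w0, x):
    """Decide class 1 if g(x) > 0, class -1 if g(x) < 0, where g(x) = w^T x + w0."""
    g = np.dot(w, x) + w0
    return 1 if g > 0 else -1

# Hypothetical 2-D classifier: w = (1, -1), w0 = 0.5.
w, w0 = np.array([1.0, -1.0]), 0.5
print(predict(w, w0, np.array([2.0, 0.0])))   # g = 2.5 > 0  -> class 1
print(predict(w, w0, np.array([0.0, 3.0])))   # g = -2.5 < 0 -> class -1
```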
A Simple Linear Classifier
Some Properties of Linear Learning Machines
The decision surface is a hyperplane.
The feature space is divided into two half-spaces.
Several Questions
Does there exist a hyperplane that separates the training set?
If yes, how do we compute it? Is it unique?
If it is not unique, how can we find an “optimal” one?
What can we do if no separating hyperplane exists?
Facts
If the training set is linearly separable, then there exist infinitely many separating hyperplanes for it.
If the training set is linearly inseparable, then no separating hyperplane exists for it.
Support Vector Machines
Linearly Separable
Support Vector Machines
H1: w^T x − w_0 = 1, H: w^T x − w_0 = 0, H2: w^T x − w_0 = −1
Margin (distance between H1 and H2): 2/|w|
Support Vector Machines
Maximize the margin 2/|w|, i.e., minimize |w| (equivalently, minimize |w|²/2, which yields the quadratic program below).
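A quick numerical check of the margin formula, with an arbitrary example weight vector:

```python
import numpy as np

# For w = (3, 4), |w| = 5, so the margin between H1 and H2 is 2/|w| = 0.4.
w = np.array([3.0, 4.0])
print(2 / np.linalg.norm(w))   # 0.4
```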
Support Vector Machines
Quadratic Program (Maximal Margin)
min_{w,w_0} |w|²/2
s.t. w^T x_i ≥ w_0 + 1 for y_i = 1, and w^T x_i ≤ w_0 − 1 for y_i = −1
(or equivalently, y_i(w^T x_i − w_0) ≥ 1)
Dual QP (Maximal Margin)
min_α 0.5 Σ_{i=1,…,m} Σ_{j=1,…,m} y_i y_j α_i α_j x_i^T x_j − Σ_{i=1,…,m} α_i
s.t. Σ_{i=1,…,m} y_i α_i = 0, α_i ≥ 0, i = 1,…,m
Support Vectors
w is a linear combination of the training points, w = Σ_{i=1,…,m} α_i y_i x_i; only the support vectors (the points with α_i > 0) contribute to the sum.
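The dual is a standard QP, so any off-the-shelf QP solver applies. Below is a minimal sketch using the cvxopt package (an assumed tool choice, not one prescribed by the deck); it recovers w from the optimal α, flags the support vectors, and recovers w_0 from a support vector via y_i(w^T x_i − w_0) = 1:

```python
import numpy as np
from cvxopt import matrix, solvers  # assumes cvxopt is installed

def hard_margin_svm(X, y):
    """Solve the dual QP: min 0.5*a'Pa - 1'a  s.t.  y'a = 0, a >= 0,
    where P[i, j] = y_i * y_j * x_i^T x_j."""
    m = X.shape[0]
    P = matrix(np.outer(y, y) * (X @ X.T))
    q = matrix(-np.ones(m))
    G = matrix(-np.eye(m))                      # -a_i <= 0 encodes a_i >= 0
    h = matrix(np.zeros(m))
    A = matrix(y.reshape(1, -1).astype(float))  # equality constraint y'a = 0
    b = matrix(0.0)
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    sv = alpha > 1e-6                           # support vectors have alpha_i > 0
    w = (alpha * y) @ X                         # w = sum_i alpha_i y_i x_i
    i = int(np.argmax(sv))                      # index of any support vector
    w0 = X[i] @ w - y[i]                        # from y_i (w^T x_i - w0) = 1
    return w, w0, sv
```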
Support Vector Machines
Linearly Inseparable
Support Vector Machines
Maximize Margin and Minimize Error (Soft Margin)
min_{w,w_0,z} |w|²/2 + C Σ_{i=1,…,m} z_i
s.t. y_i(w^T x_i − w_0) + z_i ≥ 1, z_i ≥ 0, i = 1,…,m
(z_i is a slack or error variable)
Dual QP (Soft Margin)
min_α 0.5 Σ_{i=1,…,m} Σ_{j=1,…,m} y_i y_j α_i α_j x_i^T x_j − Σ_{i=1,…,m} α_i
s.t. Σ_{i=1,…,m} y_i α_i = 0, C ≥ α_i ≥ 0, i = 1,…,m
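Relative to the hard-margin dual, only the inequality constraints change: each α_i gains the upper bound C. In the cvxopt sketch above, that amounts to stacking one extra block onto G and h (again a sketch under the same assumptions):

```python
import numpy as np
from cvxopt import matrix

m, C = 100, 1000.0   # m: number of samples (placeholder); C as in the later examples

# Encode 0 <= alpha_i <= C as  -alpha_i <= 0  and  alpha_i <= C.
G = matrix(np.vstack([-np.eye(m), np.eye(m)]))
h = matrix(np.hstack([np.zeros(m), C * np.ones(m)]))
```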
Support Vector Machines
Nonlinear Mappings via Kernels
Idea: Map the original features into a higher-dimensional feature space, x → Φ(x). Design the classifier in the new feature space. The classifier is nonlinear in the original feature space but linear in the new feature space. (With an appropriate nonlinear mapping to a sufficiently high dimension, data from two categories can always be separated by a hyperplane.)
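A tiny sketch of this idea on XOR-like data, which no hyperplane separates in 2-D; the mapping Φ(x1, x2) = (x1, x2, x1·x2) used below is one illustrative choice, not the one prescribed by the deck:

```python
import numpy as np

# XOR labels: not linearly separable in the original 2-D space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# Map to 3-D: Phi(x1, x2) = (x1, x2, x1*x2).
Phi = np.column_stack([X, X[:, 0] * X[:, 1]])

# In the new space the hyperplane x1 + x2 - 2*x1*x2 = 0.5 separates the classes.
w, w0 = np.array([1.0, 1.0, -2.0]), -0.5
print(np.sign(Phi @ w + w0))   # [-1.  1.  1. -1.], matching y
```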
Support Vector Machines
Maximal Margin
min_α 0.5 Σ_{i=1,…,m} Σ_{j=1,…,m} y_i y_j α_i α_j Φ(x_i)^T Φ(x_j) − Σ_{i=1,…,m} α_i
s.t. Σ_{i=1,…,m} y_i α_i = 0, α_i ≥ 0, i = 1,…,m
Soft Margin
min_α 0.5 Σ_{i=1,…,m} Σ_{j=1,…,m} y_i y_j α_i α_j Φ(x_i)^T Φ(x_j) − Σ_{i=1,…,m} α_i
s.t. Σ_{i=1,…,m} y_i α_i = 0, C ≥ α_i ≥ 0, i = 1,…,m
Support Vector Machines
Role of Kernels
Simplify the computation of inner products in the new feature space:
K(x, y) = Φ(x)^T Φ(y)
Some Popular Kernels
Polynomial: K(x, y) = (x^T y + 1)^p
Gaussian: K(x, y) = e^{−|x−y|²/(2σ²)}
Sigmoid: K(x, y) = tanh(κ x^T y − δ)
Both the maximal margin and soft margin QPs can be written in terms of K.
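As a sanity check of the identity K(x, y) = Φ(x)^T Φ(y): for the degree-2 polynomial kernel on 2-D inputs, one standard explicit feature map is Φ(x) = (1, √2·x1, √2·x2, x1², x2², √2·x1·x2). A minimal sketch verifying the identity numerically:

```python
import numpy as np

def poly2_phi(x):
    """Explicit feature map for K(x, y) = (x^T y + 1)^2 with 2-D inputs."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 * x1, x2 * x2, s * x1 * x2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((x @ z + 1) ** 2)             # kernel value: 4.0
print(poly2_phi(x) @ poly2_phi(z))  # same value via the explicit map: 4.0
```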
Support Vector Machines
Maximal Margin
min_α 0.5 Σ_{i=1,…,m} Σ_{j=1,…,m} y_i y_j α_i α_j K(x_i, x_j) − Σ_{i=1,…,m} α_i
s.t. Σ_{i=1,…,m} y_i α_i = 0, α_i ≥ 0, i = 1,…,m
Soft Margin
min_α 0.5 Σ_{i=1,…,m} Σ_{j=1,…,m} y_i y_j α_i α_j K(x_i, x_j) − Σ_{i=1,…,m} α_i
s.t. Σ_{i=1,…,m} y_i α_i = 0, C ≥ α_i ≥ 0, i = 1,…,m
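In practice these kernelized QPs are solved by packaged trainers such as SVMLight or scikit-learn's SVC. A minimal usage sketch with SVC (the tool choice here is an assumption; note that its 'rbf' kernel parameter gamma corresponds to 1/(2σ²) in the Gaussian kernel above):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two Gaussian blobs (made up for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)

sigma = 1.0
clf = SVC(kernel='rbf', C=1000, gamma=1 / (2 * sigma ** 2))  # soft margin, Gaussian kernel
clf.fit(X, y)
print(clf.score(X, y))   # training accuracy
```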
Examples
Checker-Board Problem
[Figure: 3×3 checker-board training data; both axes range from 0 to 180]
Checker-Board Problem
[Figures: decision region boundaries for σ = 10 and σ = 5]
169 training samples, Gaussian kernel, soft margin, C = 1000
Checker-Board Problem
[Figures: decision region boundaries for σ = 15 and σ = 20]
169 training samples, Gaussian kernel, soft margin, C = 1000
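A rough reproduction of this experiment; the deck does not state how the 169 points were sampled, so a 13×13 grid over the square is assumed here, along with scikit-learn as the trainer:

```python
import numpy as np
from sklearn.svm import SVC

# 13 x 13 = 169 grid points over [0, 180)^2 (assumed sampling scheme).
g = np.linspace(0, 179, 13)
X = np.array([[a, b] for a in g for b in g])
# 3x3 checker-board labels from the parity of the 60 x 60 cell indices.
y = np.where(((X[:, 0] // 60) + (X[:, 1] // 60)) % 2 == 0, 1, -1)

for sigma in (5, 10, 15, 20):
    clf = SVC(kernel='rbf', C=1000, gamma=1 / (2 * sigma ** 2))
    clf.fit(X, y)
    print(sigma, clf.score(X, y))
```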
Examples
Two-Spiral Problem
[Figure: two-spiral training data; both axes range from −30 to 30]
Two-Spiral Problem
[Figures: decision region boundaries for σ = 2 and σ = 1; legend: spiral 1, spiral 2]
154 training samples, Gaussian kernel, soft margin, C = 1000
Two-Spiral Problem
[Figures: decision region boundaries for σ = 5 and σ = 7; legend: spiral 1, spiral 2]
154 training samples, Gaussian kernel, soft margin, C = 1000
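A similar sketch for the two-spiral data; the generating equations are not given in the deck, so an Archimedean-spiral parametrization with 77 points per spiral (154 total) is assumed:

```python
import numpy as np
from sklearn.svm import SVC

# Two interleaved spirals, 77 points each; radius grows from 1 to about 30.
t = np.linspace(0, 3 * np.pi, 77)
r = 1 + 29 * t / (3 * np.pi)
spiral1 = np.column_stack([r * np.cos(t), r * np.sin(t)])
spiral2 = -spiral1                      # the second spiral, rotated 180 degrees
X = np.vstack([spiral1, spiral2])
y = np.array([1] * 77 + [-1] * 77)

for sigma in (1, 2, 5, 7):
    clf = SVC(kernel='rbf', C=1000, gamma=1 / (2 * sigma ** 2))
    clf.fit(X, y)
    print(sigma, clf.score(X, y))
```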
Conclusions
Advantages
Training is a convex QP, so it always finds a global minimum.
Simple and clear geometric interpretation.
Limitations
Choosing an appropriate kernel.
Training a multi-class SVM in one step.