
Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 40

An Equivalent QP

Maximize

$$\sum_{k=1}^{R} \alpha_k \;-\; \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R} \alpha_k \alpha_l Q_{kl}
\qquad\text{where}\quad Q_{kl} = y_k y_l \,(\mathbf{x}_k \cdot \mathbf{x}_l)$$

Subject to these constraints:

$$0 \le \alpha_k \le C \;\;\forall k
\qquad\qquad
\sum_{k=1}^{R} \alpha_k y_k = 0$$

Then define:

$$\mathbf{w} = \sum_{k=1}^{R} \alpha_k y_k \mathbf{x}_k
\qquad\qquad
b = y_K(1 - \varepsilon_K) - \mathbf{x}_K \cdot \mathbf{w}
\quad\text{where}\quad K = \operatorname*{argmax}_k \alpha_k$$

Then classify with: f(x, w, b) = sign(w · x + b)

Warning: up until Rong Zhang spotted my error in Oct 2003, the equation for b had been wrong in earlier versions of the notes. This version is correct.

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 41

An Equivalent QP

(The same QP as Slide 40, with the same warning, plus two observations.)

Datapoints with α_k > 0 will be the support vectors...

...so the sum w = Σ_k α_k y_k x_k only needs to run over the support vectors (probably ≪ R).


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 42

An Equivalent QP

(Same QP and support-vector observations as the previous two slides.)

Why did I tell you about this equivalent QP?
• It's a formulation that QP packages can optimize more quickly (a sketch of such a dual solve follows below).
• Because of further jaw-dropping developments you're about to learn.
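Here is a minimal sketch (not Moore's code) of what handing this dual to a QP package looks like in practice: scikit-learn's SVC solves exactly this QP, and its dual_coef_ attribute exposes the α_k y_k of the support vectors. The toy two-blob dataset is made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_k * y_k for the support vectors only:
# exactly the "datapoints with alpha_k > 0" from the slide.
alpha_y = clf.dual_coef_.ravel()
w = alpha_y @ clf.support_vectors_          # w = sum_k alpha_k y_k x_k
b = clf.intercept_[0]
print(len(clf.support_), "support vectors out of R =", len(X))  # usually << R
print("classify:", np.sign(X @ w + b)[:5])  # f(x) = sign(w . x + b)
```

Note that w is recovered by summing over the support vectors only, which is the point of the slide's annotation.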

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 43

Suppose we're in 1 dimension

What would SVMs do with this data?

[Figure: a handful of positive and negative points on a line, centred on x = 0]


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 44

Suppose we're in 1 dimension

Not a big surprise.

[Figure: the maximum-margin separator on the line, with the positive "plane" and negative "plane" on either side of x = 0]

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 45

Harder 1-dimensional dataset

That's wiped the smirk off SVM's face. What can be done about this?

[Figure: positive and negative points interleaved along the line through x = 0]

It doesn't look like slack variables will save us this time…


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 46

Harder 1-dimensional dataset

We're going to make up a new feature. Sort of: we'll compute it from the feature(s) we have.

$$\mathbf{z}_k = (x_k,\; x_k^2)$$

New features are sometimes called basis functions.

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 47

Harder 1-dimensional dataset

With the new feature z_k = (x_k, x_k²), the data become separable. MAGIC!

Just put this "augmented" data into our linear SVM, as in the sketch below.
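A minimal sketch of the trick with made-up 1-d data: inner points form one class, outer points the other, and the mapping x_k → z_k = (x_k, x_k²) makes the classes linearly separable, so a plain linear SVM handles it.

```python
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])  # hypothetical 1-d data
y = np.array([1, 1, -1, -1, 1, 1])               # inner vs outer points

Z = np.column_stack([x, x ** 2])             # the "augmented" data z_k
clf = SVC(kernel="linear", C=1e6).fit(Z, y)  # large C: near-hard margin
print((clf.predict(Z) == y).all())           # True: separable after augmenting
```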


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 48

Common SVM "extra features"

z_k = (polynomial terms of x_k of degree 1 to q)

z_k = (radial basis functions of x_k), i.e.

$$z_k[j] = \varphi_j(\mathbf{x}_k) = \mathrm{KernelFn}\!\left(\frac{|\mathbf{x}_k - \mathbf{c}_j|}{KW}\right)$$

z_k = (sigmoid functions of x_k)

This is sensible (the radial case is sketched below). Is that the end of the story? No… there's one more trick!
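A sketch of the radial "extra feature" formula, assuming a Gaussian for KernelFn; the centres c_j and the kernel width KW are hypothetical choices, not fixed by the slide.

```python
import numpy as np

def rbf_features(X, centres, KW):
    """z_k[j] = KernelFn(|x_k - c_j| / KW) for every datapoint k, centre j."""
    # pairwise distances |x_k - c_j|
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return np.exp(-0.5 * (d / KW) ** 2)   # Gaussian as the assumed KernelFn

X = np.random.default_rng(1).normal(size=(5, 2))
centres = X[:3]                 # e.g. reuse some datapoints as centres
print(rbf_features(X, centres, KW=1.0).shape)   # (5, 3): one feature per centre
```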

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 49

Quadratic Basis Functions

$$\Phi(\mathbf{x}) = \Bigl(
\underbrace{1}_{\text{constant term}},\;
\underbrace{\sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ \ldots,\ \sqrt{2}\,x_m}_{\text{linear terms}},\;
\underbrace{x_1^2,\ x_2^2,\ \ldots,\ x_m^2}_{\text{pure quadratic terms}},\;
\underbrace{\sqrt{2}\,x_1 x_2,\ \sqrt{2}\,x_1 x_3,\ \ldots,\ \sqrt{2}\,x_{m-1} x_m}_{\text{quadratic cross-terms}}
\Bigr)^{\!\top}$$

Number of terms (assuming m input dimensions) = (m+2)-choose-2 = (m+2)(m+1)/2 = (as near as makes no difference) m²/2.

You may be wondering what those √2's are doing.
• You should be happy that they do no harm.
• You'll find out why they're there soon.
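A sketch that materialises Φ(x) exactly as laid out above: constant, linear, pure quadratic, then cross-terms, with the √2 factors in place.

```python
import numpy as np
from itertools import combinations

def phi_quadratic(x):
    x = np.asarray(x, dtype=float)
    # cross-terms sqrt(2) * x_i * x_j for all i < j
    cross = [np.sqrt(2) * x[i] * x[j]
             for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

print(len(phi_quadratic([1.0, 2.0, 3.0])))  # m=3 gives (m+2)(m+1)/2 = 10 terms
```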


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 50

Quadratic Dot Products

Take the dot product of two such basis vectors, grouping the entries: the constant entries contribute 1, the linear entries contribute 2Σ aᵢbᵢ, the pure quadratic entries contribute Σ aᵢ²bᵢ², and the cross-terms contribute 2ΣΣ aᵢaⱼbᵢbⱼ:

$$\Phi(\mathbf{a})\cdot\Phi(\mathbf{b})
= 1 \;+\; 2\sum_{i=1}^{m} a_i b_i \;+\; \sum_{i=1}^{m} a_i^2 b_i^2 \;+\; 2\sum_{i=1}^{m}\sum_{j=i+1}^{m} a_i a_j b_i b_j$$

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 51

Quadratic Dot Products

$$\Phi(\mathbf{a})\cdot\Phi(\mathbf{b})
= 1 + 2\sum_{i=1}^{m} a_i b_i + \sum_{i=1}^{m} a_i^2 b_i^2 + 2\sum_{i=1}^{m}\sum_{j=i+1}^{m} a_i a_j b_i b_j$$

Just out of casual, innocent interest, let's look at another function of a and b:

$$(\mathbf{a}\cdot\mathbf{b}+1)^2
= (\mathbf{a}\cdot\mathbf{b})^2 + 2\,\mathbf{a}\cdot\mathbf{b} + 1
= \Bigl(\sum_{i=1}^{m} a_i b_i\Bigr)^{\!2} + 2\sum_{i=1}^{m} a_i b_i + 1$$

$$= \sum_{i=1}^{m}\sum_{j=1}^{m} a_i b_i a_j b_j + 2\sum_{i=1}^{m} a_i b_i + 1
= \sum_{i=1}^{m} (a_i b_i)^2 + 2\sum_{i=1}^{m}\sum_{j=i+1}^{m} a_i b_i a_j b_j + 2\sum_{i=1}^{m} a_i b_i + 1$$


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 52

Quadratic Dot Products

Comparing the two expansions term by term:

$$\Phi(\mathbf{a})\cdot\Phi(\mathbf{b})
= 1 + 2\sum_{i} a_i b_i + \sum_{i} a_i^2 b_i^2 + 2\sum_{i}\sum_{j>i} a_i a_j b_i b_j
= (\mathbf{a}\cdot\mathbf{b}+1)^2$$

They're the same! And (a·b+1)² is only O(m) to compute! (A numerical check follows below.)
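A quick numerical check of "they're the same", redefining the explicit quadratic feature map from the earlier sketch so the snippet stands alone.

```python
import numpy as np
from itertools import combinations

def phi_quadratic(x):
    # same map as in the earlier sketch: 1, sqrt(2)x_i, x_i^2, sqrt(2)x_i x_j
    x = np.asarray(x, dtype=float)
    cross = [np.sqrt(2) * x[i] * x[j]
             for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

rng = np.random.default_rng(2)
a, b = rng.normal(size=5), rng.normal(size=5)
lhs = phi_quadratic(a) @ phi_quadratic(b)   # O(m^2) terms, formed explicitly
rhs = (a @ b + 1) ** 2                      # O(m) work
print(np.isclose(lhs, rhs))                 # True: they're the same
```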

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 53

Higher Order Polynomials

Polynomial | φ(x)                           | Cost to build Q_kl traditionally | Cost if 100 inputs | φ(a)·φ(b) | Cost to build Q_kl sneakily | Cost if 100 inputs
-----------|--------------------------------|----------------------------------|--------------------|-----------|-----------------------------|-------------------
Quadratic  | all m²/2 terms up to degree 2  | m² R² / 4                        | 2,500 R²           | (a·b+1)²  | m R² / 2                    | 50 R²
Cubic      | all m³/6 terms up to degree 3  | m³ R² / 12                       | 83,000 R²          | (a·b+1)³  | m R² / 2                    | 50 R²
Quartic    | all m⁴/24 terms up to degree 4 | m⁴ R² / 48                       | 1,960,000 R²       | (a·b+1)⁴  | m R² / 2                    | 50 R²

(The "sneaky" column is sketched in code below.)
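A sketch of the "sneaky" column of the table: the whole Q matrix comes from one R×R matrix product plus an elementwise power, never forming φ(x) at all. The dataset is random, for shape-checking only.

```python
import numpy as np

def Q_poly(X, y, d):
    """Q_kl = y_k y_l (x_k . x_l + 1)^d, at O(m R^2) cost for any degree d."""
    return np.outer(y, y) * (X @ X.T + 1.0) ** d

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 100))                # R = 10 datapoints, m = 100 inputs
y = np.where(rng.normal(size=10) > 0, 1, -1)
print(Q_poly(X, y, d=4).shape)                # (10, 10): quartic at quadratic's price
```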


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 54

QP with Quintic basis functions

Maximize

$$\sum_{k=1}^{R} \alpha_k \;-\; \frac{1}{2}\sum_{k=1}^{R}\sum_{l=1}^{R} \alpha_k \alpha_l Q_{kl}
\qquad\text{where}\quad Q_{kl} = y_k y_l \,(\Phi(\mathbf{x}_k)\cdot\Phi(\mathbf{x}_l))$$

Subject to these constraints:

$$0 \le \alpha_k \le C \;\;\forall k
\qquad\qquad
\sum_{k=1}^{R} \alpha_k y_k = 0$$

Then define:

$$\mathbf{w} = \sum_{k\,:\,\alpha_k > 0} \alpha_k y_k \Phi(\mathbf{x}_k)
\qquad\qquad
b = y_K(1 - \varepsilon_K) - \mathbf{x}_K \cdot \mathbf{w}
\quad\text{where}\quad K = \operatorname*{argmax}_k \alpha_k$$

Then classify with: f(x, w, b) = sign(w · Φ(x) + b)

We must do R²/2 dot products to get this matrix ready. In 100-d, each dot product now needs 103 operations instead of 75 million. But there are still worrying things lurking away. What are they?

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 55

QP with Quintic basis functions

(Same QP as the previous slide.) The worrying things lurking away:

• The fear of overfitting with this enormous number of terms.
• The evaluation phase (doing a set of predictions on a test set) will be very expensive. (Why?)


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 56

QP with Quintic basis functions

(Same QP again.) Why is evaluation so expensive? Because each w · Φ(x) (see the next slide) needs 75 million operations. What can be done?

The use of Maximum Margin magically makes this not a problem.

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 57

QP with Quintic basis functions

(Same QP, one last time.) Only Sm operations are needed (S = #support vectors):

$$\mathbf{w}\cdot\Phi(\mathbf{x})
= \sum_{k\,:\,\alpha_k > 0} \alpha_k y_k \,\Phi(\mathbf{x}_k)\cdot\Phi(\mathbf{x})
= \sum_{k\,:\,\alpha_k > 0} \alpha_k y_k \,(\mathbf{x}_k \cdot \mathbf{x} + 1)^5$$

(A sketch of this prediction rule in code follows.)
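A sketch of the Sm-operation prediction: keep only the S support vectors and their α_k y_k values, and never touch the 75-million-term Φ(x). The fitted quantities here are made up for illustration.

```python
import numpy as np

def predict(x, sv_X, alpha_y, b, d=5):
    """sign( sum over {k: alpha_k > 0} of alpha_k y_k (x_k . x + 1)^d + b )"""
    return np.sign(alpha_y @ (sv_X @ x + 1.0) ** d + b)

sv_X = np.array([[1.0, 0.0], [0.0, 1.0]])   # hypothetical support vectors
alpha_y = np.array([0.5, -0.5])             # hypothetical alpha_k * y_k values
print(predict(np.array([2.0, 1.0]), sv_X, alpha_y, b=0.1))   # +1.0
```

Each call costs one m-dimensional dot product per support vector, which is the "only Sm operations" claim on the slide.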


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 58

SVM Kernel Functions

• K(a, b) = (a · b + 1)^d is an example of an SVM Kernel Function.
• Beyond polynomials there are other very high-dimensional basis functions that can be made practical by finding the right Kernel Function.
• Radial-basis-style Kernel Function:

$$K(\mathbf{a},\mathbf{b}) = \exp\!\left(-\frac{(\mathbf{a}-\mathbf{b})^2}{2\sigma^2}\right)$$

• Neural-net-style Kernel Function:

$$K(\mathbf{a},\mathbf{b}) = \tanh(\kappa\,\mathbf{a}\cdot\mathbf{b} - \delta)$$

σ, κ and δ are magic parameters that must be chosen by a model selection method such as CV or VCSRM* (*see last lecture). All three kernels are written out as functions below.
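The three kernels from this slide as plain functions; d, σ, κ and δ are the "magic parameters" model selection must choose.

```python
import numpy as np

def poly_kernel(a, b, d):
    # K(a, b) = (a . b + 1)^d
    return (a @ b + 1.0) ** d

def rbf_kernel(a, b, sigma):
    # K(a, b) = exp(-(a - b)^2 / (2 sigma^2))
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def neural_net_kernel(a, b, kappa, delta):
    # K(a, b) = tanh(kappa * a . b - delta)
    return np.tanh(kappa * (a @ b) - delta)
```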

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 59

VC-dimension of an SVM

• Very, very, very loosely speaking, there is some theory which, under some different assumptions, puts an upper bound on the VC dimension as

$$\left\lceil \frac{\text{Diameter}}{\text{Margin}} \right\rceil$$

where
• Diameter is the diameter of the smallest sphere that can enclose all the high-dimensional term-vectors derived from the training set.
• Margin is the smallest margin we'll let the SVM use.
• This can be used in SRM (Structural Risk Minimization) for choosing the polynomial degree, RBF σ, etc.
• But most people just use Cross-Validation.


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 60

SVM Performance

• Anecdotally they work very, very well indeed.
• Example: they are currently the best-known classifier on a well-studied hand-written-character recognition benchmark.
• Another example: Andrew knows several reliable people doing practical real-world work who claim that SVMs have saved them when their other favorite classifiers did poorly.
• There is a lot of excitement and religious fervor about SVMs as of 2001.
• Despite this, some practitioners (including your lecturer) are a little skeptical.

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 61

Doing multi-class classification

• SVMs can only handle two-class outputs (i.e. a categorical output variable with arity 2).
• What can be done?
• Answer: with output arity N, learn N SVMs:
  • SVM 1 learns "Output == 1" vs "Output != 1"
  • SVM 2 learns "Output == 2" vs "Output != 2"
  • :
  • SVM N learns "Output == N" vs "Output != N"
• Then to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region (sketched below).
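A sketch of the one-vs-rest recipe above: N binary SVMs, then pick the class whose SVM pushes the input furthest into its positive region. The toy 3-class dataset is made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 2)) + np.repeat([[0, 0], [4, 0], [0, 4]], 10, axis=0)
y = np.repeat([0, 1, 2], 10)

# one SVM per class: "Output == c" vs "Output != c"
models = {c: SVC(kernel="linear").fit(X, np.where(y == c, 1, -1))
          for c in [0, 1, 2]}

# furthest into the positive region wins
scores = np.column_stack([m.decision_function(X) for m in models.values()])
pred = np.array(list(models))[scores.argmax(axis=1)]
print((pred == y).mean())   # training accuracy on the toy data
```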


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 62

References

• An excellent tutorial on VC-dimension and Support Vector Machines:
  C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. http://citeseer.nj.nec.com/burges98tutorial.html
• The VC/SRM/SVM Bible:
  Statistical Learning Theory by Vladimir Vapnik, Wiley-Interscience, 1998. BUT YOU SHOULD PROBABLY READ ALMOST ANYTHING ELSE ABOUT SVMS FIRST.

Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 63

What You Should Know

• Linear SVMs.
• The definition of a maximum margin classifier.
• What QP can do for you (but, for this class, you don't need to know how it does it).
• How Maximum Margin can be turned into a QP problem.
• How we deal with noisy (non-separable) data.
• How we permit "non-linear boundaries".
• How SVM Kernel functions permit us to pretend we're working with a zillion features.


Copyright © 2001, 2003, Andrew W. Moore Support Vector Machines: Slide 64

What really happens

• Johnny Machine Learning gets a dataset.
• Wants to try SVMs.
• Linear: "Not bad, but I think it could be better."
• Adjusts C to trade off margin vs. slack.
• Still not satisfied: tries kernels, typically polynomial. Starts with quadratic, then goes up to about degree 5.
• Johnny goes to a Machine Learning conference.
• Johnny: "Wow, a quartic kernel with C = 2.375 works great!"
• Audience member: "Why did you pick those, Johnny?"
• Johnny: "Cross-validation told me to!"