
Page 1: CS434a/541a: Pattern Recognition Prof. Olga Veksler

CS840a Learning and Computer Vision

Prof. Olga Veksler

Lecture 8 SVM

Some pictures from C. Burges

Page 2: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM

• Said to start in 1979 with Vladimir Vapnik’s paper

• Major developments throughout the 1990's

• Elegant theory

• Has good generalization properties

• Has been applied very successfully to diverse problems over the last 15-20 years

Page 3: CS434a/541a: Pattern Recognition Prof. Olga Veksler

Linear Discriminant Functions

g(x) = w^t x + w0

g(x) > 0 ⇒ x ∈ class 1
g(x) < 0 ⇒ x ∈ class 2

• which separating hyperplane should we choose?
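For concreteness, here is a minimal Python/NumPy sketch of this decision rule (the weight vector and sample below are made up for illustration):

```python
import numpy as np

def classify(x, w, w0):
    """Linear discriminant: class 1 if g(x) > 0, class 2 if g(x) < 0."""
    g = w @ x + w0
    return 1 if g > 0 else 2

# hypothetical 2D example
w, w0 = np.array([1.0, -2.0]), 0.5
print(classify(np.array([3.0, 1.0]), w, w0))   # g(x) = 1.5 > 0, so class 1
```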

Page 4: CS434a/541a: Pattern Recognition Prof. Olga Veksler

Margin Intuition
• Training data is just a subset of all possible data
• Suppose the hyperplane is close to sample xi
• Then a new sample that is close to xi is likely to fall on the wrong side of the hyperplane
• Poor generalization

Page 5: CS434a/541a: Pattern Recognition Prof. Olga Veksler

Margin Intuition
• Place the hyperplane as far as possible from any sample
• New samples close to the old samples are then more likely to be classified correctly
• Good generalization

Page 6: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM
• Idea: maximize the distance to the closest example
• For the optimal hyperplane, the distance to the closest negative example equals the distance to the closest positive example

Page 7: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM: Linearly Separable Case • SVM: maximize the margin

• the margin is twice the absolute value of the distance b from the closest example to the separating hyperplane

• Better generalization • in practice and in theory

Page 8: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM: Linearly Separable Case

• Support vectors are the samples closest to the separating hyperplane
  • intuitively, they are the most difficult patterns to classify
  • the optimal hyperplane is completely defined by the support vectors
  • we do not know beforehand which samples will be support vectors

Page 9: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM: Formula for the Margin
• g(x) = w^t x + w0
• the absolute distance between x and the boundary g(x) = 0 is |w^t x + w0| / ||w||
• this distance is unchanged for the hyperplane g1(x) = α g(x):
  |α w^t x + α w0| / ||α w|| = |w^t x + w0| / ||w||
• Let xi be an example closest to the boundary, and set |w^t xi + w0| = 1
• Now the largest margin hyperplane is unique

Page 10: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM: Formula for the Margin

• now the distance from the closest sample xi to g(x) = 0 is |w^t xi + w0| / ||w|| = 1 / ||w||
• Thus the margin is m = 2 / ||w||
• For uniqueness, set |w^t xi + w0| = 1 for any example xi closest to the boundary
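A minimal NumPy sketch of the two formulas above (the hyperplane is an arbitrary example):

```python
import numpy as np

def distance_to_hyperplane(x, w, w0):
    """Absolute distance from x to the hyperplane w^t x + w0 = 0."""
    return abs(w @ x + w0) / np.linalg.norm(w)

def margin(w):
    """Margin m = 2/||w||, assuming the closest samples satisfy |w^t xi + w0| = 1."""
    return 2.0 / np.linalg.norm(w)

# hypothetical hyperplane in 2D
w, w0 = np.array([3.0, 4.0]), -1.0
print(distance_to_hyperplane(np.array([1.0, 1.0]), w, w0))  # |3 + 4 - 1| / 5 = 1.2
print(margin(w))                                            # 2 / 5 = 0.4
```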

Page 11: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM: Optimal Hyperplane

• Maximize the margin m = 2 / ||w||
• subject to the constraints
  w^t xi + w0 ≥ 1  if xi is a positive example
  w^t xi + w0 ≤ −1  if xi is a negative example
• Let
  zi = +1  if xi is a positive example
  zi = −1  if xi is a negative example
• Convert our problem to:
  minimize J(w) = (1/2) ||w||²
  constrained to zi (w^t xi + w0) ≥ 1 ∀i
• J(w) is a convex function, thus it has a single global minimum

Page 12: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM: Optimal Hyperplane • Use Kuhn-Tucker theorem to convert our problem to:

maximize
  LD(α) = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj zi zj xi^t xj
constrained to
  αi ≥ 0 ∀i  and  Σ_{i=1}^n αi zi = 0

• α = {α1,…, αn} are new variables, one for each sample
• Rewrite LD(α) using the n by n matrix H:
  LD(α) = Σ_{i=1}^n αi − (1/2) α^t H α
• where the value in the i-th row and j-th column of H is Hij = zi zj xi^t xj

Page 13: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM: Optimal Hyperplane

• Use Kuhn-Tucker theorem to convert our problem to:

maximize
  LD(α) = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj zi zj xi^t xj
constrained to
  αi ≥ 0 ∀i  and  Σ_{i=1}^n αi zi = 0

• α = {α1,…, αn} are new variables, one for each sample
• LD(α) can be optimized by quadratic programming (see the sketch below)
• LD(α) is formulated in terms of α only; it does not depend on w and w0
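The dual above can be handed to any off-the-shelf QP package. A minimal sketch, assuming NumPy and the cvxopt solver are available and the training data is linearly separable (the function name is made up for illustration):

```python
import numpy as np
from cvxopt import matrix, solvers   # assumption: the cvxopt QP package is installed

def svm_dual_qp(X, z):
    """Hard-margin dual: maximize sum(a) - 1/2 a^t H a  s.t.  a_i >= 0, sum_i a_i z_i = 0."""
    n = X.shape[0]
    H = (z[:, None] * z[None, :]) * (X @ X.T)   # H_ij = z_i z_j x_i^t x_j
    P = matrix(H.astype(float))
    q = matrix(-np.ones(n))                     # cvxopt minimizes 1/2 a^t P a + q^t a
    G = matrix(-np.eye(n))                      # -a_i <= 0, i.e. a_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(z.reshape(1, -1).astype(float))  # equality constraint sum_i a_i z_i = 0
    b = matrix(np.zeros(1))
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])                   # the optimal alpha
```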

Page 14: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM: Optimal Hyperplane

• After finding the optimal α = {α1,…, αn}:
  • compute w = Σ_{i=1}^n αi zi xi
  • for every sample i, one of the following must hold:
    • αi = 0 (sample i is not a support vector)
    • αi ≠ 0 and zi (w^t xi + w0) − 1 = 0 (sample i is a support vector)
  • solve for w0 using any αi > 0 and αi [zi (w^t xi + w0) − 1] = 0, which gives w0 = 1/zi − w^t xi
• Final discriminant function:
  g(x) = Σ_{xi ∈ S} αi zi xi^t x + w0
• where S is the set of support vectors, S = {xi | αi ≠ 0}
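Continuing the earlier sketch (illustrative names), w and w0 can be recovered from the optimal α exactly as on this slide:

```python
import numpy as np

def recover_hyperplane(X, z, alpha, tol=1e-8):
    """Recover w, w0 and the support-vector indices from the dual solution alpha (sketch)."""
    w = (alpha * z) @ X                  # w = sum_i alpha_i z_i x_i
    support = np.where(alpha > tol)[0]   # support vectors: alpha_i > 0
    i = support[0]                       # any support vector works
    w0 = 1.0 / z[i] - X[i] @ w           # from alpha_i [z_i (w^t x_i + w0) - 1] = 0
    return w, w0, support
```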

Page 15: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM: Optimal Hyperplane

maximize
  LD(α) = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj zi zj xi^t xj
constrained to
  αi ≥ 0 ∀i  and  Σ_{i=1}^n αi zi = 0

• LD(α) depends on the number of samples, not on the dimension of the samples
• the samples appear only through the dot products xi^t xj
• This will become important when looking for a nonlinear discriminant function

Page 16: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM: Non Separable Case

• A linear classifier may still be appropriate when the data is not linearly separable but is almost linearly separable (e.g., due to a few outliers)

• Can adapt SVM to the almost linearly separable case

Page 17: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM: Non Separable Case • Introduce non-negative slack variables ξ1,…, ξn

• one for each sample

• Change the constraints from zi (w^t xi + w0) ≥ 1 ∀i to
  zi (w^t xi + w0) ≥ 1 − ξi ∀i
• ξi measures the deviation from the ideal position for sample xi
  • ξi > 1: xi is on the wrong side of the hyperplane
  • 0 < ξi < 1: xi is on the right side of the hyperplane, but within the region of maximum margin

Page 18: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM: Non Separable Case

• Wish to minimize
  J(w, ξ1,…, ξn) = (1/2) ||w||² + β Σ_{i=1}^n I(ξi > 0)
• where I(ξi > 0) = 1 if ξi > 0 and 0 if ξi ≤ 0, so the second term counts the number of samples not in the ideal position
• constrained to zi (w^t xi + w0) ≥ 1 − ξi and ξi ≥ 0 ∀i
• β measures the relative weight of the first and second terms
  • if β is small, we allow a lot of samples not in the ideal position
  • if β is large, we allow very few samples not in the ideal position
  • choosing β appropriately is important

Page 19: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM: Non Separable Case

J(w, ξ1,…, ξn) = (1/2) ||w||² + β Σ_{i=1}^n I(ξi > 0), where the second term counts the samples not in the ideal position

• large β: few samples not in the ideal position
• small β: many samples not in the ideal position

Page 20: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM: Non Separable Case

• Minimize
  J(w, ξ1,…, ξn) = (1/2) ||w||² + β Σ_{i=1}^n I(ξi > 0)
• where I(ξi > 0) = 1 if ξi > 0 and 0 if ξi ≤ 0 (the number of samples not in the ideal position)
• constrained to zi (w^t xi + w0) ≥ 1 − ξi and ξi ≥ 0 ∀i
• This minimization problem is NP-hard due to the discontinuity of I(ξi)

Page 21: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM: Non Separable Case
• Instead we minimize
  J(w, ξ1,…, ξn) = (1/2) ||w||² + β Σ_{i=1}^n ξi
  • the second term is now a measure of the number of misclassified examples
• constrained to zi (w^t xi + w0) ≥ 1 − ξi ∀i and ξi ≥ 0 ∀i
• Use Kuhn-Tucker theorem to convert to (a code sketch follows below):
  maximize
    LD(α) = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj zi zj xi^t xj
  constrained to
    0 ≤ αi ≤ β ∀i  and  Σ_{i=1}^n αi zi = 0
• find w using w = Σ_{i=1}^n αi zi xi
• solve for w0 using any 0 < αi < β and αi [zi (w^t xi + w0) − 1] = 0
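Compared with the separable-case QP sketch earlier, only the inequality constraints change: each αi is now boxed between 0 and β (β plays the same role as the C parameter in many SVM libraries). Again assuming cvxopt:

```python
import numpy as np
from cvxopt import matrix, solvers   # assumption: cvxopt is installed

def soft_margin_dual_qp(X, z, beta):
    """Soft-margin dual: same objective, but 0 <= alpha_i <= beta."""
    n = X.shape[0]
    H = (z[:, None] * z[None, :]) * (X @ X.T)
    P, q = matrix(H.astype(float)), matrix(-np.ones(n))
    # stack -a_i <= 0 and a_i <= beta into a single inequality system G a <= h
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))
    h = matrix(np.hstack([np.zeros(n), beta * np.ones(n)]))
    A = matrix(z.reshape(1, -1).astype(float))
    b = matrix(np.zeros(1))
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])
```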

Page 22: CS434a/541a: Pattern Recognition Prof. Olga Veksler

Non Linear Mapping • Cover’s theorem:

• “pattern-classification problem cast in a high dimensional space non-linearly is more likely to be linearly separable than in a low-dimensional space”

• Not linearly separable in 1D


• Lift to 2D space with h(x) = (x, x²)

Page 23: CS434a/541a: Pattern Recognition Prof. Olga Veksler

Non Linear Mapping

• To solve a non-linear problem with a linear classifier:
  1. Project the data x to a high dimension using a function ϕ(x)
  2. Find a linear discriminant function for the transformed data ϕ(x)
  3. The final nonlinear discriminant function is g(x) = w^t ϕ(x) + w0
• For the 1D example above, ϕ(x) = (x, x²)
• In 2D, the discriminant function is linear:
  g([x(1), x(2)]^t) = w1 x(1) + w2 x(2) + w0
• In 1D, the discriminant function is not linear:
  g(x) = w1 x + w2 x² + w0
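A small sketch of this lifting step, assuming NumPy and scikit-learn's LinearSVC are available (the 1D data below is made up so that it is separable by a threshold on x² but not by a threshold on x):

```python
import numpy as np
from sklearn.svm import LinearSVC   # assumption: scikit-learn is installed

# 1D data: class 1 in the middle, class 2 on both sides -> not linearly separable in 1D
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([2, 2, 1, 1, 1, 2, 2])

# lift to 2D with phi(x) = (x, x^2); now a line can separate the two classes
phi = np.column_stack([x, x ** 2])
clf = LinearSVC(C=1.0).fit(phi, y)

x_new = np.array([0.5, 2.5])
print(clf.predict(np.column_stack([x_new, x_new ** 2])))  # expected: class 1, then class 2
```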

Page 24: CS434a/541a: Pattern Recognition Prof. Olga Veksler

Non Linear Mapping: Another Example

Page 25: CS434a/541a: Pattern Recognition Prof. Olga Veksler

Non Linear SVM

• Can use any linear classifier after lifting data into a higher dimensional space

• However, we would then have to deal with the “curse of dimensionality”:
  1. poor generalization to test data
  2. computationally expensive
• SVM avoids the “curse of dimensionality” because
  • enforcing the largest margin permits good generalization
  • computation in the higher dimensional space is performed only implicitly, through the use of kernel functions

Page 26: CS434a/541a: Pattern Recognition Prof. Olga Veksler

Non Linear SVM: Kernels

• Recall the SVM optimization:
  maximize LD(α) = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj zi zj xi^t xj
• The optimization depends on the samples xi only through the dot products xi^t xj
• If we lift xi to a high dimension using ϕ(x), we need to compute the high dimensional products ϕ(xi)^t ϕ(xj):
  maximize LD(α) = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj zi zj ϕ(xi)^t ϕ(xj)
• Idea: find a kernel function K(xi, xj) such that K(xi, xj) = ϕ(xi)^t ϕ(xj)

Page 27: CS434a/541a: Pattern Recognition Prof. Olga Veksler

Non Linear SVM: Kernels

• Kernel trick
  • only need to compute K(xi, xj) instead of ϕ(xi)^t ϕ(xj)
  • no need to lift the data into the high dimension explicitly; the computation is performed in the original dimension
  maximize LD(α) = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj zi zj K(xi, xj)

Page 28: CS434a/541a: Pattern Recognition Prof. Olga Veksler

Non Linear SVM: Kernels

• Suppose we have 2 features and K(x, y) = (x^t y)²
• Which mapping ϕ(x) does it correspond to?
  K(x, y) = (x^t y)²
          = (x(1) y(1) + x(2) y(2))²
          = (x(1))² (y(1))² + 2 x(1) y(1) x(2) y(2) + (x(2))² (y(2))²
          = [(x(1))²  √2 x(1)x(2)  (x(2))²] [(y(1))²  √2 y(1)y(2)  (y(2))²]^t
• Thus
  ϕ(x) = [(x(1))²  √2 x(1)x(2)  (x(2))²]^t
• (a numeric check of this identity is sketched below)
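A quick numeric check of the identity above (a NumPy sketch with arbitrary vectors):

```python
import numpy as np

def phi(x):
    """Explicit mapping for the degree-2 kernel K(x, y) = (x^t y)^2 with 2 features."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((x @ y) ** 2)      # kernel value: (3 - 2)^2 = 1
print(phi(x) @ phi(y))   # same value through the explicit mapping
```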

Page 29: CS434a/541a: Pattern Recognition Prof. Olga Veksler

Non Linear SVM: Kernels

• How to choose the kernel K(xi, xj)?
  • K(xi, xj) should correspond to a product ϕ(xi)^t ϕ(xj) in some higher dimensional space
  • Mercer's condition states which kernel functions can be expressed as a dot product of two vectors
  • Kernels not satisfying Mercer's condition can sometimes be used, but there is no geometrical interpretation
• Common choices satisfying Mercer's condition:
  • Polynomial kernel: K(xi, xj) = (xi^t xj + 1)^p
  • Gaussian radial basis kernel (the data is lifted into infinite dimensions):
    K(xi, xj) = exp(−(1/(2σ²)) ||xi − xj||²)
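The two common kernels, written out as code (a minimal NumPy sketch; p and σ are hyperparameters to be chosen):

```python
import numpy as np

def polynomial_kernel(xi, xj, p=2):
    """K(xi, xj) = (xi^t xj + 1)^p"""
    return (xi @ xj + 1.0) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    """K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))
```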

Page 30: CS434a/541a: Pattern Recognition Prof. Olga Veksler

Non Linear SVM

• Choose ϕ(x) so that the first (“0”th) dimension is the augmented dimension, with its feature value fixed to 1:
  ϕ(x) = [1  x(1)  x(2)  …]^t
• search for a separating hyperplane in the high dimensional space
• With the constant-1 feature, the threshold w0 gets folded into the vector w, and the hyperplane can be written as w^t ϕ(x) = 0

Page 31: CS434a/541a: Pattern Recognition Prof. Olga Veksler

Non Linear SVM

• Thus we seek a hyperplane
  w^t ϕ(x) = 0
• or, equivalently, a hyperplane that goes through the origin in the high dimensional space
  • this removes only one degree of freedom
  • but we introduced many new degrees of freedom when we lifted the data into the high dimension

Page 32: CS434a/541a: Pattern Recognition Prof. Olga Veksler

Non Linear SVM Recipe

• Start with x1,…, xn in the original feature space of dimension d
• Choose a kernel K(xi, xj)
  • this implicitly chooses a function ϕ(xi) that takes xi to a higher dimensional space
  • it gives the dot product in that high dimensional space
• Find the largest margin linear classifier in the higher dimensional space by using a quadratic programming package to solve (a kernelized code sketch follows below):
  maximize
    LD(α) = Σ_{i=1}^n αi − (1/2) Σ_{i=1}^n Σ_{j=1}^n αi αj zi zj K(xi, xj)
  constrained to
    0 ≤ αi ≤ β ∀i  and  Σ_{i=1}^n αi zi = 0
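The only change from the earlier linear-case sketch is that the Gram matrix is built from K(xi, xj) instead of xi^t xj (again assuming cvxopt; kernel can be any function such as the ones defined above):

```python
import numpy as np
from cvxopt import matrix, solvers   # assumption: cvxopt is installed

def kernel_svm_dual_qp(X, z, kernel, beta):
    """Soft-margin dual in the kernel-induced feature space (sketch)."""
    n = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    H = (z[:, None] * z[None, :]) * K          # H_ij = z_i z_j K(x_i, x_j)
    P, q = matrix(H.astype(float)), matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))
    h = matrix(np.hstack([np.zeros(n), beta * np.ones(n)]))
    A, b = matrix(z.reshape(1, -1).astype(float)), matrix(np.zeros(1))
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])
```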

Page 33: CS434a/541a: Pattern Recognition Prof. Olga Veksler

Non Linear SVM Recipe

• Weight vector w in the high dimensional space:
  w = Σ_{xi ∈ S} αi zi ϕ(xi)
• where S is the set of support vectors, S = {xi | αi ≠ 0}
• Linear discriminant function in the high dimensional space:
  g(ϕ(x)) = w^t ϕ(x) = [Σ_{xi ∈ S} αi zi ϕ(xi)]^t ϕ(x) = Σ_{xi ∈ S} αi zi K(xi, x)
• Non linear discriminant function in the original space:
  g(x) = Σ_{xi ∈ S} αi zi K(xi, x)
• Decide class 1 if g(x) > 0, otherwise decide class 2
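Evaluating the resulting nonlinear discriminant needs only the support vectors and the kernel. A minimal sketch (following the augmented-feature convention of the previous slides, so there is no separate w0 term):

```python
def discriminant(x, X_sv, z_sv, alpha_sv, kernel):
    """g(x) = sum over support vectors of alpha_i z_i K(x_i, x)."""
    return sum(a * zi * kernel(xi, x) for a, zi, xi in zip(alpha_sv, z_sv, X_sv))

def classify(x, X_sv, z_sv, alpha_sv, kernel):
    """Decide class 1 if g(x) > 0, otherwise class 2."""
    return 1 if discriminant(x, X_sv, z_sv, alpha_sv, kernel) > 0 else 2
```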

Page 34: CS434a/541a: Pattern Recognition Prof. Olga Veksler

Non Linear SVM

• Nonlinear discriminant function:
  g(x) = Σ_{xi ∈ S} αi zi K(xi, x)
  • the sum runs over the most important training samples, i.e. the support vectors
  • αi is the weight of support vector xi
  • zi ∈ {−1, +1} is the class label of xi
  • K(xi, x) measures the similarity between x and support vector xi, e.g. for the Gaussian kernel
    K(xi, x) = exp(−(1/(2σ²)) ||xi − x||²)

Page 35: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM Example: XOR Problem

• Class 1: x1 = [1, −1], x2 = [−1, 1]
• Class 2: x3 = [1, 1], x4 = [−1, −1]
• Use a polynomial kernel of degree 2:
  K(xi, xj) = (xi^t xj + 1)²
• This kernel corresponds to the mapping
  ϕ(x) = [1  √2 x(1)  √2 x(2)  √2 x(1)x(2)  (x(1))²  (x(2))²]^t
• Need to maximize
  LD(α) = Σ_{i=1}^4 αi − (1/2) Σ_{i=1}^4 Σ_{j=1}^4 αi αj zi zj (xi^t xj + 1)²
• constrained to
  0 ≤ αi ∀i  and  α1 + α2 − α3 − α4 = 0

Page 36: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM Example: XOR Problem

• Rewrite
  LD(α) = Σ_{i=1}^4 αi − (1/2) α^t H α
• where α = [α1 α2 α3 α4]^t and
  H = [  9   1  −1  −1
         1   9  −1  −1
        −1  −1   9   1
        −1  −1   1   9 ]
• Take the derivative with respect to α and set it to 0:
  d/dα LD(α) = [1 1 1 1]^t − H α = 0
• The solution is α1 = α2 = α3 = α4 = 1/8 (each row of H sums to 8; a numeric check is sketched below)
  • all samples are support vectors
  • the solution satisfies the constraints αi ≥ 0 ∀i and α1 + α2 − α3 − α4 = 0
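A quick numeric check of this solution with NumPy: setting the gradient to zero means solving H α = [1 1 1 1]^t, and the resulting α also satisfies the constraints.

```python
import numpy as np

H = np.array([[ 9,  1, -1, -1],
              [ 1,  9, -1, -1],
              [-1, -1,  9,  1],
              [-1, -1,  1,  9]], dtype=float)

alpha = np.linalg.solve(H, np.ones(4))
print(alpha)          # [0.125 0.125 0.125 0.125], i.e. alpha_i = 1/8, all >= 0
z = np.array([1, 1, -1, -1])
print(alpha @ z)      # 0.0, so the equality constraint sum_i alpha_i z_i = 0 holds
```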

Page 37: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM Example: XOR Problem

• The weight vector w is
  w = Σ_{i=1}^4 αi zi ϕ(xi) = (1/8) [ϕ(x1) + ϕ(x2) − ϕ(x3) − ϕ(x4)]
• by plugging in x1 = [1, −1], x2 = [−1, 1], x3 = [1, 1], x4 = [−1, −1] with
  ϕ(x) = [1  √2 x(1)  √2 x(2)  √2 x(1)x(2)  (x(1))²  (x(2))²]^t
  this gives w = [0  0  0  −1/√2  0  0]^t
• The nonlinear discriminant function is
  g(x) = w^t ϕ(x) = −(1/√2) · √2 x(1)x(2) = −x(1)x(2)

Page 38: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM Example: XOR Problem

g(x) = −x(1)x(2)

[figure: in the original (x(1), x(2)) space the decision boundaries are nonlinear; in the lifted coordinates ((x(1))², √2 x(1)x(2)) the decision boundary is linear]

Page 39: CS434a/541a: Pattern Recognition Prof. Olga Veksler

Degree 3 Polynomial Kernel

• Left: in the linearly separable case, the decision boundary is roughly linear, indicating that the dimensionality is controlled

• Right: the nonseparable case is handled by a polynomial of degree 3

Page 40: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM as Unconstrained Minimization
• SVM is formulated as constrained optimization: minimize
  J(w, ξ1,…, ξn) = (1/2) ||w||² + β Σ_{i=1}^n ξi
• constrained to zi (w^t xi + w0) ≥ 1 − ξi ∀i and ξi ≥ 0 ∀i
• Let us name f(xi) = w^t xi + w0
• The constraints can be rewritten as zi f(xi) ≥ 1 − ξi and ξi ≥ 0 ∀i
• which implies ξi = max(0, 1 − zi f(xi))
• The SVM objective can therefore be rewritten as unconstrained optimization:
  J(w) = (1/2) ||w||² + β Σ_{i=1}^n max(0, 1 − zi f(xi))
  • the first term is weight regularization, the second term is the loss function

Page 41: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM as Unconstrained Minimization

• The SVM objective can be rewritten as unconstrained optimization:
  J(w) = (1/2) ||w||² + β Σ_{i=1}^n max(0, 1 − zi f(xi))
  • the first term is weight regularization, the second term is the loss function
• zi f(xi) > 1: xi is on the right side of the hyperplane and outside the margin, no loss
• zi f(xi) = 1: xi is on the margin, no loss
• zi f(xi) < 1: xi is inside the margin, or on the wrong side of the hyperplane, and contributes to the loss

Page 42: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM: Hinge Loss
• SVM uses the hinge loss per sample xi:
  L(xi) = max(0, 1 − zi f(xi))
• The hinge loss encourages classification with a margin of 1 (a code sketch follows below)
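The hinge loss in code (a one-line NumPy sketch):

```python
import numpy as np

def hinge_loss(z, f_x):
    """Per-sample hinge loss max(0, 1 - z_i f(x_i)); it is zero once the margin reaches 1."""
    return np.maximum(0.0, 1.0 - z * f_x)

print(hinge_loss(np.array([1, 1, -1]), np.array([2.0, 0.5, 0.5])))  # [0.  0.5  1.5]
```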

Page 43: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM: Hinge Loss
• Can optimize with gradient descent; the objective is convex:
  J(w) = (1/2) ||w||² + β Σ_{i=1}^n max(0, 1 − zi f(xi)),  with f(xi) = w^t xi + w0
• Gradient (single sample):
  ∂J/∂w = w − β zi xi  if zi f(xi) < 1
  ∂J/∂w = w            otherwise
• Gradient descent update, single sample, with learning rate η:
  w ← w − η (w − β zi xi)  if zi f(xi) < 1
  w ← w − η w              otherwise
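A minimal sketch of this single-sample update in NumPy (the learning rate, number of epochs, and the bias update are choices not specified on the slide):

```python
import numpy as np

def svm_sgd(X, z, beta=1.0, lr=0.01, epochs=100):
    """Stochastic (sub)gradient descent on the hinge-loss SVM objective (sketch).
    X: n x d data matrix, z: labels in {-1, +1}."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            if z[i] * (X[i] @ w + w0) < 1:        # margin violated: hinge loss is active
                w -= lr * (w - beta * z[i] * X[i])
                w0 -= lr * (-beta * z[i])         # assumed bias update (same subgradient idea)
            else:
                w -= lr * w                        # only the regularizer contributes
    return w, w0
```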

Page 44: CS434a/541a: Pattern Recognition Prof. Olga Veksler

SVM Summary
• Advantages:
  • nice theory
  • good generalization properties
  • the objective function has no local minima
  • can be used to find non linear discriminant functions
  • often works well in practice, even without a lot of training data
• Disadvantages:
  • tends to be slower than other methods
  • quadratic programming is computationally expensive