A Simple Review on SVM
Honglin Yu
Australian National University, NICTA
September 2, 2013
Outline
1 The Tutorial Routine
  Overview
  Linear SVC in the Separable Case: Largest Margin Classifier
  Soft Margin
  Solving SVM
  Kernel Trick and Non-linear SVM
2 Some Topics
  Why the Name: Support Vectors?
  Why SVC Works Well: A Simple Example
  Relation with Logistic Regression etc.
3 Packages
The Tutorial Routine Some Topics Packages
Overview
SVMs (Support Vector Machines) are supervised learning methods
They include methods for both classification and regression
In this talk, we focus on binary classification.
Symbols
training data: (x1, y1), ..., (xm, ym) ∈ X × {±1}
patterns: xi, i = 1, 2, ..., m
pattern space: X
targets: yi, i = 1, 2, ..., m
features: xi = Φ(xi)
feature space: H
feature mapping: Φ : X → H
Separable Case: Largest Margin Classifier
Figure: Simplest Case
“Separable” means: ∃ a line w · x + b = 0 that correctly separates all the training data.
“Margin”: d+ + d−, where d± = min over {i : yi = ±1} of dist(xi, w · x + b = 0)
In this case, the SVC simply looks for a line maximizing the margin.
Separable Case: Largest Margin Classifier
Another way to express separability: yi (w · xi + b) > 0
Because the training data is finite, ∃ε > 0 such that yi (w · xi + b) ≥ ε
This is equivalent to yi ((w/ε) · xi + b/ε) ≥ 1
w · x + b = 0 and (w/ε) · x + b/ε = 0 are the same line.
So we can directly write the constraints as yi (w · xi + b) ≥ 1
This removes the scaling redundancy in w, b
We also want the separating plane to lie in the middle (which means d+ = d−).
So the optimization problem can be formulated as

arg max_{w,b} 2 min_i |w · xi + b| / ||w||
s.t. yi (w · xi + b) ≥ 1, i = 1, 2, ..., N
(1)

This is equivalent to:

arg min_{w,b} ||w||²
s.t. yi (w · xi + b) ≥ 1, i = 1, 2, ..., N
(2)

But, so far, we can only confirm that Eq. 2 is a necessary condition for finding the plane we want (correct and in the middle)
Largest Margin Classifier
It can be proved that, when the data is separable, for the following problem

min_{w,b} (1/2)||w||²
s.t. yi (w · xi + b) ≥ 1, i = 1, ..., m.
(3)

we have:
1 When ||w|| is minimized, the equality holds for some xi.
2 The equality holds for at least some xi, xj with yi yj < 0.
3 Based on 1) and 2) we can calculate that the margin is 2/||w||, so the margin is maximized.
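As a sanity check on the margin-equals-2/||w|| claim, here is a small sketch using sklearn's SVC; the toy data and the large-C approximation of the hard-margin problem are my own assumptions, not from the slides.

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (hypothetical example)
X = np.array([[-2.0, 0.0], [-1.0, -1.0], [1.0, 1.0], [2.0, 0.0]])
y = np.array([-1, -1, 1, 1])

# A very large C approximates the hard-margin (separable) problem
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)

# The equality y_i (w.x_i + b) = 1 holds at the support vectors
activations = y * (X @ w + b)
```

For this data the optimum is w = (0.5, 0.5), b = 0, so the margin is 2/||w|| = 2√2, and every point attains equality.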
Proof of Previous Slide (Warning: My Proof)
1 If ∃c > 1 such that ∀xi, yi (w · xi + b) ≥ c, then w/c and b/c also satisfy the constraints and have smaller length.
2 If not, assume that ∃c > 1 with

yi (w · xi + b) ≥ 1, where yi = 1
yi (w · xi + b) ≥ c, where yi = −1
(4)

Adding (c − 1)/2 to each side where yi = 1 and subtracting (c − 1)/2 from each side where yi = −1 (i.e. shifting b by (c − 1)/2), we get:

yi (w · xi + b + (c − 1)/2) ≥ (c + 1)/2
(5)

Because (c + 1)/2 > 1, similarly to 1), the ||w|| here is not the smallest.
3 Pick x1, x2 where the equality holds and y1 y2 < 0; the margin is just the distance between x1 and the line y2 (w · x + b) = 1, which can easily be calculated as 2/||w||.
Non Separable Case
Figure: Non-separable case: misclassified points exist
Non Separable Case
Constraints yi (w · xi + b) ≥ 1, i = 1, 2, ..., m cannot all be satisfied
Solution: add slack variables ξi and reformulate the problem as

min_{w,b,ξ} (1/2)||w||² + C Σ_{i=1}^m ξi
s.t. yi (w · xi + b) ≥ 1 − ξi, i = 1, 2, ..., m
ξi ≥ 0
(6)

Note the trade-off, controlled by C, between the margin (∝ 1/||w||) and the penalty (Σ ξi).
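A quick way to see the role of C is to fit the same overlapping data with a small and a large C and compare the resulting margins; the data and the specific C values below are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian blobs (hypothetical data)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 1.0, rng.randn(50, 2) + 1.0])
y = np.array([-1] * 50 + [1] * 50)

soft = SVC(kernel="linear", C=0.01).fit(X, y)   # slack is cheap: wider margin
hard = SVC(kernel="linear", C=100.0).fit(X, y)  # slack is expensive: narrower margin

wide = 2.0 / np.linalg.norm(soft.coef_)
narrow = 2.0 / np.linalg.norm(hard.coef_)
```

Smaller C shrinks ||w|| and widens the margin at the cost of more slack; larger C does the opposite.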
Solving SVM: Lagrangian Dual
Constrained optimization → Lagrangian dual
Primal form:

min_{w,b,ξ} (1/2)||w||² + C Σ_{i=1}^m ξi
s.t. yi (w · xi + b) ≥ 1 − ξi, i = 1, 2, ..., m
ξi ≥ 0
(7)

The primal Lagrangian:

L(w, b, ξ, α, µ) = (1/2)||w||² + C Σ_i ξi − Σ_i αi {yi (w · xi + b) − 1 + ξi} − Σ_i µi ξi

Because (7) is convex, the Karush-Kuhn-Tucker conditions hold.
Applying KKT Conditions
Stationarity:

∂L/∂w = 0 → w = Σ_i αi yi xi
∂L/∂b = 0 → Σ_i αi yi = 0
∂L/∂ξi = 0 → C − αi − µi = 0, ∀i

Primal feasibility: yi (w · xi + b) ≥ 1 − ξi, ∀i
Dual feasibility: αi ≥ 0, µi ≥ 0
Complementary slackness, ∀i:

µi ξi = 0
αi {yi (w · xi + b) − 1 + ξi} = 0

When αi ≠ 0, the corresponding xi are called support vectors
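These identities can be checked numerically: a fitted sklearn SVC exposes the support vectors and the signed multipliers αi yi (the toy data below is an assumption for illustration).

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data
X = np.array([[-2.0, 0.0], [-1.0, -1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 3.0]])
y = np.array([-1, -1, 1, 1, 1])
clf = SVC(kernel="linear", C=10.0).fit(X, y)

alpha_y = clf.dual_coef_[0]           # alpha_i * y_i, support vectors only
balance = alpha_y.sum()               # stationarity: sum_i alpha_i y_i = 0
w = alpha_y @ clf.support_vectors_    # stationarity: w = sum_i alpha_i y_i x_i
```

Points far from the boundary (like [3, 3]) get αi = 0 and simply do not appear in `support_vectors_`.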
Dual Form
Using the equations derived from the KKT conditions, eliminate w, b, ξi, µi from the primal form to get the dual form:

max_α Σ_i αi − (1/2) Σ_{i,j} αi αj yi yj xiᵀxj
s.t. Σ_i αi yi = 0
0 ≤ αi ≤ C
(8)

And the decision function is: y = sign(Σ_i αi yi xiᵀx + b)
(b = yk − w · xk for any k with 0 < αk < C)
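The dual decision function can be reproduced from a fitted model and compared against sklearn's own decision_function; the data and the test point are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data
X = np.array([[-1.0, -1.0], [-2.0, -1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear", C=1.0).fit(X, y)

alpha_y = clf.dual_coef_[0]              # alpha_i y_i at the support vectors
sv = clf.support_vectors_
b = clf.intercept_[0]

x_new = np.array([0.5, 0.8])
score = alpha_y @ (sv @ x_new) + b       # sum_i alpha_i y_i <x_i, x> + b
label = np.sign(score)
```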
We Need Nonlinear Classifier
Figure: Case that a linear classifier cannot handle
Finding an appropriate form of curve is hard, but we can transform the data!
Mapping Training Data to Feature Space
Φ(x) = (x, x²)ᵀ
Figure: Feature Mapping Helps Classification
To solve a nonlinear classification problem, we can define some mapping Φ : X → H and do linear classification in the feature space H
Recap the Dual Form: An important Fact
Dual form:

max_α Σ_i αi − (1/2) Σ_{i,j} αi αj yi yj xiᵀxj
s.t. Σ_i αi yi = 0
0 ≤ αi ≤ C
(9)

Decision function: y = sign(Σ_i αi yi xiᵀx + b)

To train an SVC, or use an SVC to predict, we only need to know the inner products between the x's!
If we want to apply a linear SVC in H, we do NOT need to know Φ(x); we ONLY need to know k(x, x′) = ⟨Φ(x), Φ(x′)⟩. The function k(x, x′) is called the “kernel function”.
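The “inner products only” point can be made concrete with sklearn's kernel="precomputed" mode, where the SVC is handed a Gram matrix instead of the raw patterns; the data below is an illustrative assumption.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data
X = np.array([[-1.0, -1.0], [-2.0, -1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])

G = X @ X.T                         # Gram matrix: all inner products between training x's
clf = SVC(kernel="precomputed").fit(G, y)

X_test = np.array([[-0.8, -1.0]])
pred = clf.predict(X_test @ X.T)    # test-vs-train inner products suffice for prediction
```

The classifier never sees the patterns themselves, only k(xi, xj) values, which is exactly what makes the kernel trick possible.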
Kernel Functions
The input of a kernel function k : X × X → R is two patterns x, x′ in X; the output is the canonical inner product between Φ(x) and Φ(x′) in H
By using k(·, ·), we can implicitly transform the data by some Φ(·) (which is often infinite-dimensional). E.g. for k(x, x′) = (xx′ + 1)², Φ(x) = (x², √2x, 1)ᵀ
But not every function X × X → R has a corresponding Φ(x). Kernel functions must satisfy Mercer's conditions
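The quoted example can be verified directly: assuming scalar inputs, (xx′ + 1)² equals the inner product of the explicit features Φ(x) = (x², √2x, 1)ᵀ.

```python
import numpy as np

def k(x, xp):
    # polynomial kernel (x x' + 1)^2 on scalar inputs
    return (x * xp + 1.0) ** 2

def phi(x):
    # explicit feature map (x^2, sqrt(2) x, 1)
    return np.array([x ** 2, np.sqrt(2) * x, 1.0])

x, xp = 1.5, -0.7
implicit = k(x, xp)          # kernel value, no feature map needed
explicit = phi(x) @ phi(xp)  # same value via the explicit map
```

Expanding (xx′ + 1)² = x²x′² + 2xx′ + 1 shows term by term why the two agree.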
Conditions of Kernel Functions
Necessity: the kernel matrix K = [k(xi, xj)]_{m×m} must be positive semidefinite:

tᵀKt = Σ_{i,j} ti tj k(xi, xj) = Σ_{i,j} ti tj ⟨Φ(xi), Φ(xj)⟩
     = ⟨Σ_i ti Φ(xi), Σ_j tj Φ(xj)⟩ = |Σ_i ti Φ(xi)|² ≥ 0

Sufficiency in continuous form (Mercer's condition): for any symmetric function k : X × X → R which is square-integrable on X × X, if it satisfies

∫_{X×X} k(x, x′) f(x) f(x′) dx dx′ ≥ 0 for all f ∈ L²(X)

then there exist functions φi : X → R and numbers λi ≥ 0 such that

k(x, x′) = Σ_i λi φi(x) φi(x′) for all x, x′ in X
Commonly Used Kernel Functions
Linear kernel: k(x, x′) = xᵀx′
RBF kernel: k(x, x′) = e^{−γ|x−x′|²} (figure for γ = 1/2, from Wikipedia)
Polynomial kernel: k(x, x′) = (γ xᵀx′ + r)^d (figure for γ = 1, d = 2, from Wikipedia)
etc.
Mechanical Analogy
Recall from the KKT conditions:

∂L/∂w = 0 → w = Σ_i αi yi xi
∂L/∂b = 0 → Σ_i αi yi = 0

Imagine every support vector xi exerts a force Fi = αi yi w/||w|| on the “separating plane + margin”. Then:

Σ Forces = Σ_i αi yi w/||w|| = (w/||w||) Σ_i αi yi = 0

Σ Torques = Σ_i xi × (αi yi w/||w||) = (Σ_i αi yi xi) × w/||w|| = w × w/||w|| = 0

This is why the {xi} are called “support vectors”
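A numeric sanity check of the analogy (toy data assumed, not from the slides): the forces from the support vectors sum to zero, and so do the 2-D torques.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical separable toy data
X = np.array([[-2.0, 0.0], [-1.0, -1.0], [1.0, 1.0], [2.0, 0.0]])
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
unit_w = w / np.linalg.norm(w)
alpha_y = clf.dual_coef_[0]              # alpha_i y_i at the support vectors
sv = clf.support_vectors_

forces = alpha_y[:, None] * unit_w       # F_i = alpha_i y_i w / |w|
total_force = forces.sum(axis=0)         # ~ (0, 0)

# in 2-D the cross product x_i x F_i reduces to a scalar z-component
torques = sv[:, 0] * forces[:, 1] - sv[:, 1] * forces[:, 0]
total_torque = torques.sum()             # ~ 0
```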
Why SVC Works Well
Let's first consider using linear regression for classification; the decision function is y = sign(w · x + b)
Figure: Feature Mapping Helps Classification
In SVM, by contrast, only the points near the boundary matter
Min-Loss Framework
Primal form:

min_{w,b,ξ} (1/2)||w||² + C Σ_{i=1}^m ξi
s.t. yi (w · xi + b) ≥ 1 − ξi, i = 1, 2, ..., m
ξi ≥ 0
(10)

Rewritten in min-loss form:

min_{w,b} (1/2)||w||² + C Σ_{i=1}^m max{0, 1 − yi (w · xi + b)}
(11)

The term max{0, 1 − yi (w · xi + b)} is called the hinge loss.
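A minimal sketch of the hinge loss in Eq. 11; the weight vector and data below are assumptions for illustration.

```python
import numpy as np

def hinge_loss(w, b, X, y):
    # sum_i max{0, 1 - y_i (w.x_i + b)}
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins).sum()

# Hypothetical data and classifier
X = np.array([[2.0, 0.0], [-1.0, 0.0], [0.2, 0.0]])
y = np.array([1.0, -1.0, 1.0])
w, b = np.array([1.0, 0.0]), 0.0

# the first two points have margin >= 1 (zero loss);
# the third lies inside the margin and contributes 1 - 0.2 = 0.8
loss = hinge_loss(w, b, X, y)
```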
See C-SVM and LMC from a Unified Direction
Rewriting the LMC (largest margin classifier):

min_{w,b} (1/2)||w||² + Σ_{i=1}^m ∞ · (sign(1 − yi (w · xi + b)) + 1)
(12)

Regularised logistic regression (yi ∈ {0, 1}, not {−1, 1}, pi = 1/(1 + e^{−w·xi})):

min_{w,b} (1/2)||w||² + Σ_{i=1}^m −(yi log(pi) + (1 − yi) log(1 − pi))
(13)
Relation with Logistic Regression etc.
Figure: black: 0-1 loss; red: logistic loss (−log(1/(1 + e^{−yi w·x}))); blue: hinge loss; green: quadratic loss.
“0-1 loss” and “hinge loss” are not affected by correctly classified outliers.
BTW, logistic regression can also be “kernelised”.
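The difference shows up numerically: at large positive margins the 0-1 and hinge losses vanish, while the quadratic loss grows again; the sample margins below are illustrative assumptions.

```python
import numpy as np

z = np.array([-2.0, 0.0, 1.0, 3.0])   # sample margins y * (w.x + b)

zero_one = (z < 0).astype(float)
hinge = np.maximum(0.0, 1.0 - z)
logistic = np.log1p(np.exp(-z))       # -log(1 / (1 + e^{-z}))
quadratic = (1.0 - z) ** 2

# at z = 3, a correctly classified outlier:
# zero-one and hinge losses are 0, the quadratic loss is 4
```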
Commonly Used Packages
libsvm (and liblinear), SVMlight, and sklearn (whose SVC is a Python wrapper around libsvm)
Code example in sklearn:

import numpy as np
from sklearn.svm import SVC

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
clf = SVC()
clf.fit(X, y)
clf.predict([[-0.8, -1]])
Things Not Covered
Algorithms (SMO, SGD)
Generalisation bound and VC dimension
ν-SVM, one-class SVM etc.
SVR
etc.