A Simple Review on SVM
Honglin Yu
Australian National University, NICTA
September 2, 2013
Outline
1 The Tutorial Routine
  Overview
  Linear SVC in the Separable Case: Largest Margin Classifier
  Soft Margin
  Solving SVM
  Kernel Trick and Non-linear SVM
2 Some Topics
  Why the Name: Support Vectors?
  Why SVC Works Well: A Simple Example
  Relation with Logistic Regression etc.
3 Packages
The Tutorial Routine Some Topics Packages
Overview
SVMs (Support Vector Machines) are supervised learning methods
They include methods for both classification and regression
In this talk, we focus on binary classification.
Symbols
training data: (x1, y1), ..., (xm, ym) ∈ X × {±1}
patterns: xi, i = 1, 2, ..., m
pattern space: X
targets: yi, i = 1, 2, ..., m
features: xi = Φ(xi)
feature space: H
feature mapping: Φ : X → H
Separable Case: Largest Margin Classifier
Figure: Simplest Case
“Separable” means: ∃ a line w · x + b = 0 that correctly separates all the training data.
“Margin”: d+ + d−, where d± = min over {i : yi = ±1} of dist(xi, w · x + b = 0)
In this case, the SVC simply looks for a line maximizing the margin.
Separable Case: Largest Margin Classifier
Another way to express separability: yi (w · xi + b) > 0
Because the training data is finite, ∃ε > 0 such that yi (w · xi + b) ≥ ε
This is equivalent to yi ((w/ε) · xi + b/ε) ≥ 1
w · x + b = 0 and (w/ε) · x + b/ε = 0 are the same line.
So we can directly write the constraints as yi (w · xi + b) ≥ 1
This removes the scaling redundancy in w, b
We also want the separating plane to lie in the middle (which means d+ = d−).
So the optimization problem can be formulated as

arg max_{w,b} 2 min_i |w · xi + b| / ||w||
s.t. yi (w · xi + b) ≥ 1, i = 1, 2, ..., N
(1)

This is equivalent to:

arg min_{w,b} ||w||²
s.t. yi (w · xi + b) ≥ 1, i = 1, 2, ..., N
(2)

But, so far, we can only confirm that Eq. 2 is a necessary condition for finding the plane we want (correct and in the middle)
Largest Margin Classifier
It can be proved that, when the data is separable, for the following problem

min_{w,b} (1/2)||w||²
s.t. yi (w · xi + b) ≥ 1, i = 1, ..., m.
(3)

we have:
1 When ||w|| is minimized, the equality holds for some xi.
2 The equality holds for at least some xi, xj with yi yj < 0.
3 Based on 1) and 2) we can calculate that the margin is 2/||w||, so the margin is maximized.
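As a sanity check on the margin-equals-2/||w|| claim, here is a small sketch using sklearn's SVC; the toy data and the large-C approximation of the hard-margin problem are my own assumptions, not from the slides.

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (hypothetical example)
X = np.array([[-2.0, 0.0], [-1.0, -1.0], [1.0, 1.0], [2.0, 0.0]])
y = np.array([-1, -1, 1, 1])

# A very large C approximates the hard-margin (separable) problem
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)

# The equality y_i (w.x_i + b) = 1 holds at the support vectors
activations = y * (X @ w + b)
```

For this data the optimum is w = (0.5, 0.5), b = 0, so the margin is 2/||w|| = 2√2, and every point attains equality.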
Proof of Previous Slide (Warning: My Proof)
1 If ∃c > 1 such that ∀xi, yi (w · xi + b) ≥ c, then w/c and b/c also satisfy the constraints and have smaller length.
2 If not, assume that ∃c > 1 with

yi (w · xi + b) ≥ 1, where yi = 1
yi (w · xi + b) ≥ c, where yi = −1
(4)

Adding (c − 1)/2 to each side where yi = 1 and subtracting (c − 1)/2 from each side where yi = −1 (i.e. shifting b by (c − 1)/2), we get:

yi (w · xi + b + (c − 1)/2) ≥ (c + 1)/2
(5)

Because (c + 1)/2 > 1, similarly to 1), the ||w|| here is not the smallest.
3 Pick x1, x2 where the equality holds and y1 y2 < 0; the margin is just the distance between x1 and the line y2 (w · x + b) = 1, which can easily be calculated as 2/||w||.
Non Separable Case
Figure: Non-separable case: misclassified points exist
Non Separable Case
Constraints yi (w · xi + b) ≥ 1, i = 1, 2, ..., m cannot all be satisfied
Solution: add slack variables ξi and reformulate the problem as

min_{w,b,ξ} (1/2)||w||² + C Σ_{i=1}^m ξi
s.t. yi (w · xi + b) ≥ 1 − ξi, i = 1, 2, ..., m
ξi ≥ 0
(6)

Note the trade-off, controlled by C, between the margin (∝ 1/||w||) and the penalty (Σ ξi).
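A quick way to see the role of C is to fit the same overlapping data with a small and a large C and compare the resulting margins; the data and the specific C values below are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian blobs (hypothetical data)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 1.0, rng.randn(50, 2) + 1.0])
y = np.array([-1] * 50 + [1] * 50)

soft = SVC(kernel="linear", C=0.01).fit(X, y)   # slack is cheap: wider margin
hard = SVC(kernel="linear", C=100.0).fit(X, y)  # slack is expensive: narrower margin

wide = 2.0 / np.linalg.norm(soft.coef_)
narrow = 2.0 / np.linalg.norm(hard.coef_)
```

Smaller C shrinks ||w|| and widens the margin at the cost of more slack; larger C does the opposite.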
Solving SVM: Lagrangian Dual
Constrained optimization → Lagrangian dual
Primal form:

min_{w,b,ξ} (1/2)||w||² + C Σ_{i=1}^m ξi
s.t. yi (w · xi + b) ≥ 1 − ξi, i = 1, 2, ..., m
ξi ≥ 0
(7)

The primal Lagrangian:

L(w, b, ξ, α, µ) = (1/2)||w||² + C Σ_i ξi − Σ_i αi {yi (w · xi + b) − 1 + ξi} − Σ_i µi ξi

Because (7) is convex, the Karush-Kuhn-Tucker conditions hold.
Applying KKT Conditions
Stationarity:

∂L/∂w = 0 → w = Σ_i αi yi xi
∂L/∂b = 0 → Σ_i αi yi = 0
∂L/∂ξi = 0 → C − αi − µi = 0, ∀i

Primal feasibility: yi (w · xi + b) ≥ 1 − ξi, ∀i
Dual feasibility: αi ≥ 0, µi ≥ 0
Complementary slackness, ∀i:

µi ξi = 0
αi {yi (w · xi + b) − 1 + ξi} = 0

When αi ≠ 0, the corresponding xi are called support vectors
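These identities can be checked numerically: a fitted sklearn SVC exposes the support vectors and the signed multipliers αi yi (the toy data below is an assumption for illustration).

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data
X = np.array([[-2.0, 0.0], [-1.0, -1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 3.0]])
y = np.array([-1, -1, 1, 1, 1])
clf = SVC(kernel="linear", C=10.0).fit(X, y)

alpha_y = clf.dual_coef_[0]           # alpha_i * y_i, support vectors only
balance = alpha_y.sum()               # stationarity: sum_i alpha_i y_i = 0
w = alpha_y @ clf.support_vectors_    # stationarity: w = sum_i alpha_i y_i x_i
```

Points far from the boundary (like [3, 3]) get αi = 0 and simply do not appear in `support_vectors_`.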
Dual Form
Using the equations derived from the KKT conditions, eliminate w, b, ξi, µi from the primal form to get the dual form:

max_α Σ_i αi − (1/2) Σ_{i,j} αi αj yi yj xiᵀxj
s.t. Σ_i αi yi = 0
0 ≤ αi ≤ C
(8)

And the decision function is: y = sign(Σ_i αi yi xiᵀx + b)
(b = yk − w · xk for any k with 0 < αk < C)
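The dual decision function can be reproduced from a fitted model and compared against sklearn's own decision_function; the data and the test point are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data
X = np.array([[-1.0, -1.0], [-2.0, -1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear", C=1.0).fit(X, y)

alpha_y = clf.dual_coef_[0]              # alpha_i y_i at the support vectors
sv = clf.support_vectors_
b = clf.intercept_[0]

x_new = np.array([0.5, 0.8])
score = alpha_y @ (sv @ x_new) + b       # sum_i alpha_i y_i <x_i, x> + b
label = np.sign(score)
```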
We Need Nonlinear Classifier
Figure: Case that a linear classifier cannot handle
Finding an appropriate form of curve is hard, but we can transform the data!
Mapping Training Data to Feature Space
Φ(x) = (x, x²)ᵀ
Figure: Feature Mapping Helps Classification
To solve a nonlinear classification problem, we can define some mapping Φ : X → H and do linear classification in the feature space H
Recap the Dual Form: An important Fact
Dual form:

max_α Σ_i αi − (1/2) Σ_{i,j} αi αj yi yj xiᵀxj
s.t. Σ_i αi yi = 0
0 ≤ αi ≤ C
(9)

Decision function: y = sign(Σ_i αi yi xiᵀx + b)

To train an SVC, or use an SVC to predict, we only need to know the inner products between the x's!
If we want to apply a linear SVC in H, we do NOT need to know Φ(x); we ONLY need to know k(x, x′) = ⟨Φ(x), Φ(x′)⟩. The function k(x, x′) is called the “kernel function”.
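The “inner products only” point can be made concrete with sklearn's kernel="precomputed" mode, where the SVC is handed a Gram matrix instead of the raw patterns; the data below is an illustrative assumption.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data
X = np.array([[-1.0, -1.0], [-2.0, -1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])

G = X @ X.T                         # Gram matrix: all inner products between training x's
clf = SVC(kernel="precomputed").fit(G, y)

X_test = np.array([[-0.8, -1.0]])
pred = clf.predict(X_test @ X.T)    # test-vs-train inner products suffice for prediction
```

The classifier never sees the patterns themselves, only k(xi, xj) values, which is exactly what makes the kernel trick possible.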
Kernel Functions
The input of a kernel function k : X × X → R is two patterns x, x′ in X; the output is the canonical inner product between Φ(x) and Φ(x′) in H
By using k(·, ·), we can implicitly transform the data by some Φ(·) (which is often infinite-dimensional). E.g. for k(x, x′) = (xx′ + 1)², Φ(x) = (x², √2x, 1)ᵀ
But not every function X × X → R has a corresponding Φ(x). Kernel functions must satisfy Mercer's conditions
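The quoted example can be verified directly: assuming scalar inputs, (xx′ + 1)² equals the inner product of the explicit features Φ(x) = (x², √2x, 1)ᵀ.

```python
import numpy as np

def k(x, xp):
    # polynomial kernel (x x' + 1)^2 on scalar inputs
    return (x * xp + 1.0) ** 2

def phi(x):
    # explicit feature map (x^2, sqrt(2) x, 1)
    return np.array([x ** 2, np.sqrt(2) * x, 1.0])

x, xp = 1.5, -0.7
implicit = k(x, xp)          # kernel value, no feature map needed
explicit = phi(x) @ phi(xp)  # same value via the explicit map
```

Expanding (xx′ + 1)² = x²x′² + 2xx′ + 1 shows term by term why the two agree.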
Conditions of Kernel Functions
Necessity: the kernel matrix K = [k(xi, xj)]_{m×m} must be positive semidefinite:

tᵀKt = Σ_{i,j} ti tj k(xi, xj) = Σ_{i,j} ti tj ⟨Φ(xi), Φ(xj)⟩
     = ⟨Σ_i ti Φ(xi), Σ_j tj Φ(xj)⟩ = |Σ_i ti Φ(xi)|² ≥ 0

Sufficiency in continuous form (Mercer's condition): for any symmetric function k : X × X → R which is square-integrable on X × X, if it satisfies

∫_{X×X} k(x, x′) f(x) f(x′) dx dx′ ≥ 0 for all f ∈ L²(X)

then there exist functions φi : X → R and numbers λi ≥ 0 such that

k(x, x′) = Σ_i λi φi(x) φi(x′) for all x, x′ in X
Commonly Used Kernel Functions
Linear kernel: k(x, x′) = xᵀx′
RBF kernel: k(x, x′) = e^{−γ|x−x′|²} (figure for γ = 1/2, from Wikipedia)
Polynomial kernel: k(x, x′) = (γ xᵀx′ + r)^d (figure for γ = 1, d = 2, from Wikipedia)
etc.
Mechanical Analogy
Recall from the KKT conditions:

∂L/∂w = 0 → w = Σ_i αi yi xi
∂L/∂b = 0 → Σ_i αi yi = 0

Imagine every support vector xi exerts a force Fi = αi yi w/||w|| on the “separating plane + margin”. Then:

Σ Forces = Σ_i αi yi w/||w|| = (w/||w||) Σ_i αi yi = 0

Σ Torques = Σ_i xi × (αi yi w/||w||) = (Σ_i αi yi xi) × w/||w|| = w × w/||w|| = 0

This is why the {xi} are called “support vectors”
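A numeric sanity check of the analogy (toy data assumed, not from the slides): the forces from the support vectors sum to zero, and so do the 2-D torques.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical separable toy data
X = np.array([[-2.0, 0.0], [-1.0, -1.0], [1.0, 1.0], [2.0, 0.0]])
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
unit_w = w / np.linalg.norm(w)
alpha_y = clf.dual_coef_[0]              # alpha_i y_i at the support vectors
sv = clf.support_vectors_

forces = alpha_y[:, None] * unit_w       # F_i = alpha_i y_i w / |w|
total_force = forces.sum(axis=0)         # ~ (0, 0)

# in 2-D the cross product x_i x F_i reduces to a scalar z-component
torques = sv[:, 0] * forces[:, 1] - sv[:, 1] * forces[:, 0]
total_torque = torques.sum()             # ~ 0
```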
Why SVC Works Well
Let's first consider using linear regression for classification; the decision function is y = sign(w · x + b)
Figure: Feature Mapping Helps Classification
In SVM, by contrast, only the points near the boundary matter
Min-Loss Framework
Primal form:

min_{w,b,ξ} (1/2)||w||² + C Σ_{i=1}^m ξi
s.t. yi (w · xi + b) ≥ 1 − ξi, i = 1, 2, ..., m
ξi ≥ 0
(10)

Rewritten in min-loss form:

min_{w,b} (1/2)||w||² + C Σ_{i=1}^m max{0, 1 − yi (w · xi + b)}
(11)

The term max{0, 1 − yi (w · xi + b)} is called the hinge loss.
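A minimal sketch of the hinge loss in Eq. 11; the weight vector and data below are assumptions for illustration.

```python
import numpy as np

def hinge_loss(w, b, X, y):
    # sum_i max{0, 1 - y_i (w.x_i + b)}
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins).sum()

# Hypothetical data and classifier
X = np.array([[2.0, 0.0], [-1.0, 0.0], [0.2, 0.0]])
y = np.array([1.0, -1.0, 1.0])
w, b = np.array([1.0, 0.0]), 0.0

# the first two points have margin >= 1 (zero loss);
# the third lies inside the margin and contributes 1 - 0.2 = 0.8
loss = hinge_loss(w, b, X, y)
```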
See C-SVM and LMC from a Unified Direction
Rewriting the LMC (largest margin classifier):

min_{w,b} (1/2)||w||² + Σ_{i=1}^m ∞ · (sign(1 − yi (w · xi + b)) + 1)
(12)

Regularised logistic regression (yi ∈ {0, 1}, not {−1, 1}, pi = 1/(1 + e^{−w·xi})):

min_{w,b} (1/2)||w||² + Σ_{i=1}^m −(yi log(pi) + (1 − yi) log(1 − pi))
(13)
Relation with Logistic Regression etc.
Figure: black: 0-1 loss; red: logistic loss (−log(1/(1 + e^{−yi w·x}))); blue: hinge loss; green: quadratic loss.
“0-1 loss” and “hinge loss” are not affected by correctly classified outliers.
BTW, logistic regression can also be “kernelised”.
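The difference shows up numerically: at large positive margins the 0-1 and hinge losses vanish, while the quadratic loss grows again; the sample margins below are illustrative assumptions.

```python
import numpy as np

z = np.array([-2.0, 0.0, 1.0, 3.0])   # sample margins y * (w.x + b)

zero_one = (z < 0).astype(float)
hinge = np.maximum(0.0, 1.0 - z)
logistic = np.log1p(np.exp(-z))       # -log(1 / (1 + e^{-z}))
quadratic = (1.0 - z) ** 2

# at z = 3, a correctly classified outlier:
# zero-one and hinge losses are 0, the quadratic loss is 4
```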
Commonly Used Packages
libsvm (and liblinear), SVMlight, and sklearn (whose SVC is a Python wrapper around libsvm)
Code example in sklearn:

import numpy as np
from sklearn.svm import SVC

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
clf = SVC()
clf.fit(X, y)
clf.predict([[-0.8, -1]])
Things Not Covered
Algorithms (SMO, SGD)
Generalisation bound and VC dimension
ν-SVM, one-class SVM etc.
SVR
etc.