Survey of Kernel Methods
by Jinsan Yang
(c) 2003 SNU Biointelligence Lab.
Introduction
Support Vector Machines: Formulation of SVM, Optimization Theorem, Dual Formulation of SVM
Reproducing Kernel Hilbert Space
Kernel Machines
SVM Formulation
Support vector classifiers: a separating hyperplane in the input space is
$L = \{ x : f(x) = \beta_0 + \beta' x = 0 \}$
Properties:
- $\beta^* = \beta / \|\beta\|$ is the unit vector normal to $L$: for any $x_1, x_2 \in L$, $\beta'(x_1 - x_2) = 0$.
- For any $x_0 \in L$, $\beta' x_0 = -\beta_0$.
- The signed distance from a point $x$ to $L$ is
$dist(x, L) = \beta^{*\prime}(x - x_0) = (\beta' x + \beta_0) / \|\beta\| = f(x) / \|f'(x)\|$
[Figure: a separating hyperplane in the $(x_1, x_2)$ plane with margin $C = 1/\|\beta\|$.]
SVM Formulation
[Figure: the decision boundary $\{(x_1, x_2) : \beta_0 + \beta' x = 0\}$ shown with the two margin hyperplanes $\{(x_1, x_2) : \beta_0 + \beta' x = \pm 1\}$.]
Optimal separating hyperplane
Optimize:
$\max_{\beta_0, \beta} C$ subject to $\frac{1}{\|\beta\|} y_i (x_i' \beta + \beta_0) \ge C, \; i = 1, \dots, N$, where $y_i \in \{-1, 1\}$ (i.e., $y_i \, dist(x_i, L) \ge C$).
Note: any positively scaled multiple of $(\beta, \beta_0)$ satisfies the constraints, so set $\|\beta\| = 1/C$. The problem becomes
$\min_{\beta_0, \beta} \frac{1}{2} \|\beta\|^2$ subject to $y_i (x_i' \beta + \beta_0) \ge 1, \; i = 1, \dots, N$
Optimization Theorem
Because of the many constraints, the SVM optimization problem is still too complicated to solve directly.
Change it to the corresponding dual formulation.
This requires some theorems about duality: the Kuhn-Tucker theorem, the Kuhn-Tucker saddle point condition (saddle point theorem), and Wolfe's theorem (existence of a dual solution).
Optimization Theorem
Generalization of the following optimization problems.
Theorem (Fermat, 1629): For a convex $f$, $w^*$ is a minimum of $f(w)$ iff
$\frac{\partial f(w^*)}{\partial w} = 0$
Theorem (Lagrange, 1797): For a convex Lagrangian
$L(w, \alpha) = f(w) + \sum_{i=1}^{m} \alpha_i h_i(w),$
$w^*$ is a minimum of $f(w)$ subject to $h_i(w) = 0, \; i = 1, \dots, m$ iff
$\frac{\partial L(w^*, \alpha^*)}{\partial w} = 0, \quad \frac{\partial L(w^*, \alpha^*)}{\partial \alpha} = 0$
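As a concrete check of Lagrange's condition, the stationarity equations $\partial L / \partial w = 0$ and $\partial L / \partial \alpha = 0$ form a linear system for a quadratic objective with an affine constraint. The toy problem below is my own illustration, not one from the slides:

```python
import numpy as np

# Toy instance of Lagrange's theorem (a sketch):
#   minimize f(w) = w1^2 + w2^2  subject to  h(w) = w1 + w2 - 1 = 0
# L(w, a) = w1^2 + w2^2 + a*(w1 + w2 - 1); stationarity gives
#   2*w1 + a = 0
#   2*w2 + a = 0
#   w1 + w2  = 1
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])
w1, w2, a = np.linalg.solve(A, rhs)
print(w1, w2, a)  # -> 0.5 0.5 -1.0
```

The solution $w^* = (1/2, 1/2)$ is the closest point of the constraint line to the origin, as expected.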
Optimization Theorem
Kuhn and Tucker suggested a solution to the so-called convex optimization problem, where one minimizes a (convex) objective function under (convex) constraints of inequality type.
Problem: minimize $f(w)$
subject to:
$g_i(w) \le 0, \; i = 1, \dots, k$
$h_i(w) = 0, \; i = 1, \dots, m$
Generalized Lagrangian function:
$L(w, \alpha, \beta) = f(w) + \alpha' g(w) + \beta' h(w)$
Optimization Theorem
Lagrangian dual problem: maximize $\theta(\alpha, \beta)$ subject to $\alpha \ge 0$, where $\theta(\alpha, \beta) = \inf_w L(w, \alpha, \beta)$.
Theorem (weak duality theorem): for feasible solutions of the primal and dual problems, $\theta(\alpha, \beta) \le f(w)$.
Corollary: $\sup \{ \theta(\alpha, \beta) : \alpha \ge 0 \} \le \inf \{ f(w) : g(w) \le 0, \; h(w) = 0 \}$
Corollary: If $f(w^*) = \theta(\alpha^*, \beta^*)$ for feasible $w^*, \alpha^*, \beta^*$, then they are optimal and $\alpha_i^* g_i(w^*) = 0, \; i = 1, \dots, k$.
Duality gap: the difference between the optimal values of the primal and dual problems.
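Weak duality and a zero duality gap can be seen on a 1-D convex problem. The example below is my own, not from the slides: for $f(w) = w^2$ with $g(w) = 1 - w \le 0$, the dual function works out to $\theta(\alpha) = \alpha - \alpha^2/4$.

```python
# Sketch of weak/strong duality on a 1-D convex toy problem:
#   primal: minimize f(w) = w^2  subject to  g(w) = 1 - w <= 0
# L(w, a) = w^2 + a*(1 - w); for a >= 0 the inner inf over w is
# attained at w = a/2, giving theta(a) = a - a^2/4.
def f(w):
    return w * w

def theta(a):
    w = a / 2.0                    # argmin_w of the Lagrangian
    return w * w + a * (1.0 - w)

grid = [i / 100.0 for i in range(0, 501)]
dual_best = max(theta(a) for a in grid)                 # attained at a = 2
primal_best = min(f(w) for w in grid if 1.0 - w <= 0)   # attained at w = 1
print(dual_best, primal_best)  # both 1.0: zero duality gap
```

Every grid value satisfies $\theta(\alpha) \le f(w)$ for feasible $w$, and the two optima coincide, as strong duality predicts for this convex problem with an affine constraint.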
Optimization Theorem
Saddle point:
$L(w^*, \alpha, \beta) \le L(w^*, \alpha^*, \beta^*) \le L(w, \alpha^*, \beta^*)$
Theorem: $(w^*, \alpha^*, \beta^*)$ is a saddle point of the Lagrangian function for the primal problem iff there is no duality gap for the optimal solutions.
Theorem (strong duality theorem, Wolfe): if the domain of the primal problem is convex and the functions $h$ and $g$ are affine, the duality gap is zero.
Optimization Theorem
Theorem (Kuhn-Tucker, 1951): For a primal optimization problem with convex domain and affine $g$ and $h$, $w^*$ is an optimal solution iff there are $\alpha^*, \beta^*$ such that
$\frac{\partial L(w^*, \alpha^*, \beta^*)}{\partial w} = 0, \quad \frac{\partial L(w^*, \alpha^*, \beta^*)}{\partial \beta} = 0$
$\alpha_i^* g_i(w^*) = 0, \; i = 1, \dots, k$
$g_i(w^*) \le 0, \quad \alpha_i^* \ge 0, \; i = 1, \dots, k$
(the Kuhn-Tucker conditions)
Optimization Theorem
In the Kuhn-Tucker conditions, if $g_i(w^*) < 0$ then $\alpha_i^* = 0$, and in that case the corresponding constraint is inactive in the primal optimization problem (since $\alpha_i^* g_i(w^*) = 0$). The constraint can be active ($g_i(w^*) = 0$) with either $\alpha_i^* = 0$ or $\alpha_i^* > 0$.
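The active/inactive distinction can be checked numerically at the optimum of a small toy problem; the problem and the optimal point below are my own illustration:

```python
# KKT conditions on a toy problem with one active and one inactive constraint:
#   minimize f(w) = w^2
#   subject to g1(w) = 1 - w  <= 0   (forces w >= 1  -> active at the optimum)
#              g2(w) = -w - 5 <= 0   (forces w >= -5 -> inactive at the optimum)
# The optimum is w* = 1 with multipliers a1* = 2 and a2* = 0.
w, a1, a2 = 1.0, 2.0, 0.0

g1 = 1.0 - w                 # = 0  -> active,   paired with a1 > 0
g2 = -w - 5.0                # = -6 -> inactive, paired with a2 = 0
grad_L = 2.0 * w - a1 - a2   # dL/dw for L = w^2 + a1*g1 + a2*g2

print(grad_L)            # 0.0     (stationarity)
print(a1 * g1, a2 * g2)  # 0.0 0.0 (complementary slackness)
```

All four Kuhn-Tucker conditions hold: stationarity, complementary slackness, primal feasibility ($g_1, g_2 \le 0$), and dual feasibility ($\alpha_1, \alpha_2 \ge 0$).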
Dual form of SVM
Primal problem:
$\min_{w, b} \frac{1}{2} \|w\|^2$ subject to $y_i (w' x_i + b) \ge 1, \; i = 1, \dots, l$
Lagrangian:
$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{l} \alpha_i \left( y_i (w' x_i + b) - 1 \right)$
$\left( \frac{\partial L}{\partial w} = 0: \; w = \sum_{i=1}^{l} \alpha_i y_i x_i, \quad \frac{\partial L}{\partial b} = 0: \; \sum_{i=1}^{l} \alpha_i y_i = 0 \right)$
Dual problem: with $\theta(\alpha) = \inf_{w, b} L(w, b, \alpha)$,
$\max_{\alpha} \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
subject to $\alpha_i \ge 0 \; (i = 1, \dots, l), \quad \sum_{i=1}^{l} \alpha_i y_i = 0$
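The dual above can be solved numerically. The following sketch runs projected gradient ascent on the dual objective for a two-point toy set; the data, step size, and solver choice are my own (a real implementation would use a QP solver):

```python
import numpy as np

# Minimal sketch of solving the SVM dual for a tiny symmetric toy set.
X = np.array([[1.0, 1.0], [-1.0, -1.0]])   # two points
y = np.array([1.0, -1.0])                   # their labels

K = X @ X.T                                 # Gram matrix <x_i, x_j>
Q = (y[:, None] * y[None, :]) * K           # y_i y_j <x_i, x_j>

alpha = np.zeros(2)
eta = 0.05                                  # step size
for _ in range(500):
    grad = 1.0 - Q @ alpha                  # gradient of the dual objective
    # project the gradient onto {d : sum_i d_i y_i = 0} so the equality
    # constraint stays satisfied, then take an ascent step and clip at 0
    grad -= (grad @ y) / (y @ y) * y
    alpha = np.maximum(alpha + eta * grad, 0.0)

w = (alpha * y) @ X                         # w = sum_i alpha_i y_i x_i
b = y[0] - w @ X[0]                         # support vector: y_i(w'x_i + b) = 1
print(alpha, w, b)  # alpha -> [0.25 0.25], w -> [0.5 0.5], b -> 0.0
```

Both points are support vectors ($\alpha_i > 0$), and the recovered hyperplane $w = (1/2, 1/2)$, $b = 0$ attains the maximal margin $1/\|w\| = \sqrt{2}$.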
Nonlinear SVM
Dual problem: replace the inner product $\langle x_i, x_j \rangle$ by a kernel evaluation,
$\max_{\alpha} \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \langle \Phi(x_i), \Phi(x_j) \rangle = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)$
subject to $\alpha_i \ge 0 \; (i = 1, \dots, l), \quad \sum_{i=1}^{l} \alpha_i y_i = 0$
Reproducing Kernel Hilbert Space
Dual representation of the hypothesis:
$f(x) = \langle w, x \rangle + b = \sum_{i=1}^{l} \alpha_i y_i \langle x_i, x \rangle + b$
Kernel: a function $K$ such that for all $x, z \in X$,
$K(x, z) = \langle \Phi(x), \Phi(z) \rangle$
Using a kernel, we can compute the inner product $\langle \Phi(x), \Phi(z) \rangle$ in the feature space directly as a function of the original input space:
$f(x) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b$
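A minimal numeric illustration of this identity, using a polynomial kernel and its explicit feature map (the kernel and sample points are my own choices):

```python
import math

# Kernel trick sketch: for x, z in R^2, the polynomial kernel
# K(x, z) = <x, z>^2 equals the ordinary dot product of the explicit
# feature map Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) in R^3.
def K(x, z):
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

x, z = (1.0, 2.0), (3.0, -1.0)
lhs = K(x, z)                                     # kernel in input space
rhs = sum(p * q for p, q in zip(phi(x), phi(z)))  # dot product in feature space
print(lhs, rhs)  # both equal 1 (up to floating point)
```

The kernel evaluation needs $O(d)$ work in the input dimension, while the explicit map lives in a higher-dimensional space; for higher-degree polynomials or the RBF kernel this gap becomes the whole point of the trick.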
Reproducing Kernel Hilbert Space
For a given kernel, what is the corresponding feature mapping? (Answer: Mercer's theorem.)
Theorem (Mercer): If $k$ is a continuous symmetric kernel of a positive integral operator $K$, that is,
$(Kf)(y) = \int_C k(x, y) f(x) \, dx$
with
$\int_{C \times C} k(x, y) f(x) f(y) \, dx \, dy \ge 0$ for all $f \in L_2(C)$,
it can be expanded in a uniformly convergent series in terms of eigenfunctions $\psi_j$ and positive eigenvalues $\lambda_j$:
$k(x, y) = \sum_{j=1}^{N_F} \lambda_j \psi_j(x) \psi_j(y), \quad N_F \le \infty$
(cf. the matrix eigendecomposition $A = \lambda_1 v_1 v_1' + \lambda_2 v_2 v_2' + \cdots = \sum_m \lambda_m v_m v_m'$.)
Reproducing Kernel Hilbert Space
Note: construction of a feature map corresponding to a kernel $k$.
Proposition: If $k$ is a continuous kernel of a positive integral operator (positive semi-definite in the discrete case), one can construct a mapping $\Phi$ into a space where $k$ acts as a dot product:
$\Phi : x \mapsto (\sqrt{\lambda_1} \psi_1(x), \sqrt{\lambda_2} \psi_2(x), \dots)$
$\langle \Phi(x), \Phi(y) \rangle = k(x, y)$
The kernel $k$ in Mercer's theorem is called a Mercer kernel.
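In the discrete case the proposition can be checked directly: eigendecompose the Gram matrix and take $\Phi = V \sqrt{\Lambda}$. The sample points and the RBF kernel below are my own choices:

```python
import numpy as np

# Discrete analogue of the proposition: K = V diag(lam) V', feature map
# Phi = V * sqrt(lam), so that <Phi(x_i), Phi(x_j)> recovers K_ij.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))              # five sample points in R^2

def rbf(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

K = np.array([[rbf(a, b) for b in X] for a in X])

lam, V = np.linalg.eigh(K)               # eigenvalues ascending, columns = eigenvectors
Phi = V * np.sqrt(np.clip(lam, 0.0, None))   # row i is the feature vector of x_i
print(lam.min() > -1e-10)                # True: K is positive semi-definite
print(np.allclose(Phi @ Phi.T, K))       # True: the kernel acts as a dot product
```

The clipping guards against tiny negative eigenvalues produced by round-off; mathematically the RBF Gram matrix of distinct points is positive definite.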
Reproducing Kernel Hilbert Space
A vector space $X$ is called an inner product space if there is a real bilinear map $\langle \cdot, \cdot \rangle$ satisfying:
- $\langle x, y \rangle = \langle y, x \rangle$
- $\langle x, x \rangle \ge 0$, and $\langle x, x \rangle = 0$ iff $x = 0$
Hilbert space: a complete separable inner product space. (A space $H$ is separable if there exists a countable subset $D$ such that every element of $H$ is the limit of a sequence of elements of $D$.)
RKHS: a Hilbert space of functions $f$ on some set $C$ such that all evaluation functionals
$T_y(f) = f(y)$
are continuous (Wahba).
Reproducing Kernel Hilbert Space
Riesz representation theorem: Let $H$ be a Hilbert space and let $T \in H^*$ be given. Then there is a unique $f_0 \in H$ such that $T(f) = \langle f, f_0 \rangle$ for all $f \in H$, and $\|T\| = \|f_0\|$.
Recall: if $H_{RKHS}$ is an RKHS, then for each $y \in C$, $T_y : H_{RKHS} \to \mathbb{R}$ (defined as $T_y(f) = f(y)$) is continuous.
By the Riesz representation theorem, for each $y \in C$ there exists a unique function of $x$, say $k(\cdot, y) \in H_{RKHS}$, such that
$f(y) = \langle f, k(\cdot, y) \rangle \quad (*)$
Reproducing Kernel Hilbert Space
$\{ k(\cdot, y) : y \in C \}$ spans the whole RKHS: by (*), $\langle f, k(\cdot, y) \rangle = 0$ for all $y$ implies $f = 0$.
By (*),
$\langle k(\cdot, x), k(\cdot, y) \rangle = k(x, y) \quad (**)$
the inner product on the RKHS corresponds to a value of the reproducing kernel $k$.
Cf. $L_2(\mathbb{R}^n)$ is the completion of the continuous functions w.r.t. the $L_2$ norm.
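The reproducing property (**) can be exercised numerically on the span of finitely many kernel sections; the RBF kernel and sample points below are my own choices for the demonstration:

```python
import numpy as np

# Checking the reproducing property (**) on the span of {k(., x_i)}:
# for f = sum_i a_i k(., x_i), (**) gives <f, k(., y)> = sum_i a_i k(x_i, y),
# which must equal the pointwise evaluation f(y).
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))              # six points in R^3

def k(x, z):
    return np.exp(-0.5 * np.sum((x - z) ** 2))   # an RBF kernel

a = rng.normal(size=6)                   # coefficients of f

# f evaluated directly at each sample point
f_vals = np.array([sum(a[i] * k(X[j], X[i]) for i in range(6)) for j in range(6)])
# <f, k(., x_j)> expanded through (**): <k(., x_i), k(., x_j)> = k(x_i, x_j)
inner = np.array([sum(a[i] * k(X[i], X[j]) for i in range(6)) for j in range(6)])
print(np.allclose(f_vals, inner))        # True
```

For such finite expansions the inner product reduces to Gram-matrix arithmetic, $\langle f, g \rangle = a' K b$, which is how RKHS computations are carried out in practice.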
Reproducing Kernel Hilbert Space
For a Mercer kernel $k$, it is possible to construct a dot product such that $k$ becomes a reproducing kernel for a Hilbert space of functions of the form
$f(x) = \sum_i a_i k(x, x_i) = \sum_i a_i \sum_{j=1}^{N_F} \lambda_j \psi_j(x_i) \psi_j(x)$
(check) Since $k$ is symmetric, choose the $\psi_j$ orthogonal, with
$\langle \psi_n, \psi_j \rangle = \delta_{nj} / \lambda_j$
Then
$\langle f, k(\cdot, y) \rangle = \sum_i a_i \sum_{j,n=1}^{N_F} \lambda_j \psi_j(x_i) \, \lambda_n \psi_n(y) \, \langle \psi_j, \psi_n \rangle = \sum_i a_i \sum_{j=1}^{N_F} \lambda_j \psi_j(x_i) \psi_j(y) = f(y)$
Reproducing Kernel Hilbert Space
Feature space vs RKHS: the feature space is an RKHS. Rewriting the functions of the RKHS w.r.t. the orthonormal basis $(\sqrt{\lambda_n} \psi_n)_{n=1,\dots,N_F}$ of Mercer's theorem:
$f(x) = \sum_{n=1}^{N_F} \gamma_n \sqrt{\lambda_n} \psi_n(x) = \langle \gamma, \Phi(x) \rangle = \gamma' \Phi(x), \quad \text{where } \gamma_n = \sum_i a_i \sqrt{\lambda_n} \psi_n(x_i) \text{ for } f = \sum_i a_i k(\cdot, x_i)$
$\Phi(x)$ is nothing but the coordinate representation of the kernel as a function of one argument:
$\Phi(x) = (\sqrt{\lambda_1} \psi_1(x), \sqrt{\lambda_2} \psi_2(x), \dots) \leftrightarrow k(\cdot, x)$, so that $f(x) = \langle f, k(\cdot, x) \rangle$, where $k(\cdot, x) \in H_{RKHS}$.
Reproducing Kernel Hilbert Space
The representation ability of a kernel $k$ and $l$ data points: the corresponding feature space $H$ is spanned by $\{ k(\cdot, x_1), \dots, k(\cdot, x_l) \}$.
The feature mapping is w.r.t. the corresponding Mercer eigenfunctions, and an objective function $f(t)$ may be expressed as a linear combination of these eigenfunctions.
Since $H$ is an RKHS, any such nonlinear function $f(t)$ can be approximated with these kernels.
Example
Nonlinear regression for a training set $S = \{ (x_1, y_1), \dots, (x_l, y_l) \}$ generated from a target function $t(x)$.
Assume a dual representation:
$f(x) = \sum_{i=1}^{l} \alpha_i K(x, x_i)$
Minimize the norm
$\|f - t\|_H^2 = \left\langle \sum_{i=1}^{l} \alpha_i K(\cdot, x_i) - t, \; \sum_{i=1}^{l} \alpha_i K(\cdot, x_i) - t \right\rangle = \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j K(x_i, x_j) - 2 \sum_{i=1}^{l} \alpha_i y_i + \|t\|^2$
using the reproducing property $\langle K(\cdot, x_i), t \rangle = t(x_i) = y_i$.
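Setting the gradient of the last expression with respect to $\alpha$ to zero gives the linear system $K \alpha = y$. A sketch under my own choices of target $t(x) = \sin x$ and an RBF kernel:

```python
import numpy as np

# Minimizing sum_ij a_i a_j K_ij - 2 sum_i a_i y_i over alpha gives the
# normal equations K alpha = y.
def K_rbf(x, z, gamma=1.0):
    return np.exp(-gamma * (x - z) ** 2)

x_train = np.linspace(0.0, 2.0 * np.pi, 8)
y_train = np.sin(x_train)                       # y_i = t(x_i)

K = K_rbf(x_train[:, None], x_train[None, :])   # Gram matrix K_ij = K(x_i, x_j)
alpha = np.linalg.solve(K, y_train)             # minimizer of the quadratic

def f(x):                                       # dual representation of the fit
    return K_rbf(x, x_train) @ alpha

residual = max(abs(f(xi) - yi) for xi, yi in zip(x_train, y_train))
print(residual < 1e-6)   # True: f interpolates the training data
```

With an invertible Gram matrix the minimizer interpolates the data exactly; in practice one would add a regularization term (ridge), solving $(K + \mu I)\alpha = y$ instead, to trade interpolation for smoothness.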