Support Vector Machines and Kernel Methods

Kenan Gençol
Department of Electrical and Electronics Engineering
Anadolu University

submitted in the course MAT592 Seminar

Advisor: Prof. Dr. Yalçın Küçük
Department of Mathematics
Agenda

Linear Discriminant Functions and Decision Hyperplanes
Introduction to SVM
Support Vector Machines
Introduction to Kernels
Nonlinear SVM
Kernel Methods
Linear Discriminant Functions and Decision Hyperplanes
Figure 1. Two classes of patterns and a linear decision function
Linear Discriminant Functions and Decision Hyperplanes

Each pattern is represented by a vector x = [x1 x2]^T.

The linear decision function has the equation

w1 x1 + w2 x2 + w0 = 0

where w1, w2 are weights and w0 is the bias term.
Linear Discriminant Functions and Decision Hyperplanes

The general decision hyperplane equation in d-dimensional space has the form:

w . x + w0 = 0

where w = [w1 w2 .... wd]^T is the weight vector and w0 is the bias term.
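To make this concrete, here is a minimal sketch of evaluating such a decision function in Python; the weight values and the test point are made-up numbers, not from the slides:

```python
import numpy as np

# Made-up weights and bias for a 2-dimensional example
w = np.array([2.0, -1.0])   # w = [w1 w2]
w0 = 0.5                    # bias term

def g(x):
    """Linear discriminant function g(x) = w . x + w0."""
    return np.dot(w, x) + w0

# g(x) = 0 is the hyperplane; the sign of g(x) decides the class
x = np.array([1.0, 3.0])
print(g(x))                       # 2*1 - 1*3 + 0.5 = -0.5
print(1 if g(x) >= 0 else -1)     # classified as -1
```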
Introduction to SVM

There are many hyperplanes that separate the two classes.
Figure 2. An example of two possible classifiers
Introduction to SVM

THE GOAL: Our goal is to search for a direction w and bias w0 that give the maximum possible margin, or in other words, to orient this hyperplane in such a way as to be as far as possible from the closest members of both classes.
SVM: Linearly Separable Case
Figure 3. Hyperplane through two linearly separable classes
SVM: Linearly Separable Case

Our training data is of the form:

{(xi, yi)}, i = 1, ...., L, where yi in {-1, +1} and xi in R^d.

This hyperplane can be described by

w . x + b = 0

and is called the separating hyperplane.
SVM: Linearly Separable Case

Select variables w and b so that:

xi . w + b >= +1  for  yi = +1
xi . w + b <= -1  for  yi = -1

These equations can be combined into:

yi(xi . w + b) - 1 >= 0  for all i
SVM: Linearly Separable Case

The points that lie closest to the separating hyperplane are called support vectors (circled points in the diagram), and the hyperplanes

H1: xi . w + b = +1
H2: xi . w + b = -1

are called supporting hyperplanes.
SVM: Linearly Separable Case
Figure 3. Hyperplane through two linearly separable classes (repeated)
SVM: Linearly Separable Case

The hyperplane's equidistance from H1 and H2 means that d1 = d2, and this quantity is known as the SVM margin:

d1 + d2 = 2 / ||w||

d1 = d2 = 1 / ||w||
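Numerically, with an arbitrary example weight vector, the margin follows directly from this formula (a sketch, not slide material):

```python
import numpy as np

w = np.array([3.0, 4.0])              # arbitrary example weights
d = 1.0 / np.linalg.norm(w)           # d1 = d2 = 1/||w||
margin = 2.0 / np.linalg.norm(w)      # d1 + d2 = 2/||w||
print(d, margin)                      # 0.2 0.4
```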
SVM: Linearly Separable Case

Maximizing the margin 2/||w|| is equivalent to minimizing ||w||:

min ||w||  such that  yi(xi . w + b) - 1 >= 0

Minimizing ||w|| is in turn equivalent to minimizing (1/2)||w||^2, which allows us to perform Quadratic Programming (QP) optimization.
SVM: Linearly Separable Case

Optimization problem:

Minimize  (1/2)||w||^2

subject to  yi(xi . w + b) - 1 >= 0  for all i
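As an illustration of this QP, the primal problem can be handed to a general-purpose solver; a minimal sketch with scipy's SLSQP on a tiny made-up separable data set (the data, the starting point, and the expected solution in the comment are all illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set, two points per class (made up)
X = np.array([[2.0, 2.0], [3.0, 2.0], [-2.0, -2.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Pack the unknowns as v = [w1, w2, b]
def objective(v):
    w = v[:2]
    return 0.5 * np.dot(w, w)                    # (1/2)||w||^2

# SLSQP 'ineq' constraints mean fun(v) >= 0, which matches
# yi(xi . w + b) - 1 >= 0 exactly
cons = [{'type': 'ineq',
         'fun': lambda v, i=i: y[i] * (X[i] @ v[:2] + v[2]) - 1.0}
        for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), method='SLSQP', constraints=cons)
w, b = res.x[:2], res.x[2]
print(w, b)   # for this symmetric toy set: w ~ [0.25, 0.25], b ~ 0
```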
SVM: Linearly Separable Case

This is an inequality constrained optimization problem with Lagrangian function:

L(w, b, α) = (1/2)||w||^2 - Σ_{i=1..L} αi [ yi(xi . w + b) - 1 ]        (1)

where αi >= 0, i = 1, 2, ...., L are the Lagrange multipliers.
SVM

The corresponding KKT conditions are:

∂L/∂w = 0  =>  w = Σ_{i=1..L} αi yi xi        (2)

∂L/∂b = 0  =>  Σ_{i=1..L} αi yi = 0        (3)
SVM

This is a convex optimization problem. The cost function is convex, and the constraints are linear and define a convex set of feasible solutions. Such problems can be solved by considering the so-called Lagrangian Duality.
SVM

Substituting (2) and (3) into (1) gives a new formulation which, being dependent on α, we need to maximize.
SVM

This is called the Dual form (Lagrangian Dual) of the primal form. The dual form only requires the dot product of each input vector to be calculated.

This is important for the Kernel Trick, which will be described later.
SVM

So the problem becomes a dual problem:

Maximize

L_D = Σ_{i=1..L} αi - (1/2) α^T H α,   where H_ij = yi yj (xi . xj)

subject to

αi >= 0, i = 1, ...., L   and   Σ_{i=1..L} αi yi = 0
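A hedged sketch of this dual QP with the cvxopt solver, reusing the toy data from the primal sketch; the tiny diagonal ridge added to H for numerical stability is an implementation assumption, not part of the formulation:

```python
import numpy as np
from cvxopt import matrix, solvers

# Same made-up toy data as in the primal sketch
X = np.array([[2.0, 2.0], [3.0, 2.0], [-2.0, -2.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
L = len(y)

# H_ij = yi yj (xi . xj); maximizing sum(a) - (1/2) a^T H a is the
# same as minimizing (1/2) a^T H a - sum(a), cvxopt's convention
H = np.outer(y, y) * (X @ X.T)
P = matrix(H + 1e-8 * np.eye(L))   # small ridge keeps the QP well-posed
q = matrix(-np.ones(L))
G = matrix(-np.eye(L))             # -ai <= 0, i.e. ai >= 0
h = matrix(np.zeros(L))
A = matrix(y.reshape(1, -1))       # sum_i ai yi = 0
b = matrix(0.0)

solvers.options['show_progress'] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

# Recover the hyperplane: w = sum_i ai yi xi, b from any support vector
w = (alpha * y) @ X
sv = alpha > 1e-6
b_opt = np.mean(y[sv] - X[sv] @ w)
print(alpha, w, b_opt)             # matches the primal solution
```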
SVM

Differentiating with respect to the αi's and using the constraint equation, a system of equations is obtained. Solving the system, the Lagrange multipliers are found, and the optimum hyperplane is given according to the formula:

w = Σ_{i=1..L} αi yi xi
SVM

Some Notes:

SUPPORT VECTORS are the feature vectors with αi > 0, i = 1, 2, ...., L.
The cost function is strictly convex.
The Hessian matrix is positive definite.
Any local minimum is also global and unique.
The optimal hyperplane classifier of an SVM is UNIQUE.
Although the solution is unique, the resulting Lagrange multipliers are not unique.
Kernels: Introduction

When applying our SVM to linearly separable data, we started by creating a matrix H from the dot product of our input variables:

H_ij = yi yj (xi . xj),   with   k(xi, xj) = xi . xj = xi^T xj

being known as the Linear Kernel, an example of a family of functions called kernel functions.
Kernels: Introduction

All kernel functions are based on calculating the inner products of two vectors.

This means that if the inputs are mapped to a higher-dimensional space by a nonlinear mapping function

Φ: x -> Φ(x)

only the inner products of the mapped inputs need to be determined, without needing to explicitly calculate Φ.

This is called the "Kernel Trick".
Kernels: Introduction

The Kernel Trick is useful because there are many classification/regression problems that are not fully separable/regressable in the input space but are separable/regressable in a higher-dimensional space:

Φ: R^d -> H

xi . xj  ->  Φ(xi) . Φ(xj) = <Φ(xi), Φ(xj)>
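A small numerical check of this equivalence, using the homogeneous degree-2 polynomial kernel k(x, z) = (x . z)^2, whose explicit 2-D feature map is known in closed form (the test points are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit feature map for k(x, z) = (x . z)^2 in two dimensions:
    # phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])

lhs = np.dot(phi(x), phi(z))   # inner product computed in feature space
rhs = np.dot(x, z) ** 2        # kernel computed in input space
print(lhs, rhs)                # both 25.0 -- phi never needed explicitly
```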
Kernels: Introduction

Popular Kernel Families:

Radial Basis Function (RBF) Kernel: k(xi, xj) = exp(-||xi - xj||^2 / (2σ^2))

Polynomial Kernel: k(xi, xj) = (xi . xj + c)^p

Sigmoidal (Hyperbolic Tangent) Kernel: k(xi, xj) = tanh(κ (xi . xj) + δ)
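These standard forms translate directly into code; a sketch in numpy, where σ, c, p, κ, and δ are free parameters and the defaults are arbitrary choices:

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def polynomial_kernel(x, z, c=1.0, p=2):
    # k(x, z) = (x . z + c)^p
    return (np.dot(x, z) + c) ** p

def sigmoid_kernel(x, z, kappa=1.0, delta=0.0):
    # k(x, z) = tanh(kappa (x . z) + delta)
    return np.tanh(kappa * np.dot(x, z) + delta)

x, z = np.array([1.0, 2.0]), np.array([3.0, 1.0])
print(rbf_kernel(x, z), polynomial_kernel(x, z), sigmoid_kernel(x, z))
```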
Nonlinear Support Vector Machines

The support vector machine with kernel functions becomes:

Maximize  Σ_{i=1..L} αi - (1/2) Σ_{i,j} αi αj yi yj k(xi, xj)
subject to  αi >= 0 for all i,   Σ_{i=1..L} αi yi = 0

and the resulting classifier:

f(x) = sgn( Σ_{i=1..L} αi yi k(xi, x) + b )
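In practice this classifier is available off the shelf; a hedged sketch using scikit-learn's SVC on a toy data set that is not linearly separable in the input space (the parameter values are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaved half-moons: no separating hyperplane exists in R^2
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# RBF-kernel SVM; internally the decision function is
# sgn(sum_i ai yi k(xi, x) + b)
clf = SVC(kernel='rbf', C=1.0, gamma=1.0)
clf.fit(X, y)

print(clf.n_support_)    # number of support vectors per class
print(clf.score(X, y))   # training accuracy, typically ~1.0 here
```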
Nonlinear Support Vector Machines
Figure 4. The SVM architecture employing kernel functions.
Kernel Methods

Recall that a kernel function computes the inner product of the images of two data points under an embedding:

k: X × X -> R,   k(x, z) = <Φ(x), Φ(z)>

k is a kernel if
1. k is symmetric: k(x, y) = k(y, x)
2. k is positive semi-definite, i.e., the "Gram Matrix" K_ij = k(xi, xj) is positive semi-definite.
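Condition 2 can be checked numerically on any finite sample by inspecting the eigenvalues of the Gram matrix; a sketch with an RBF kernel and arbitrary sample points:

```python
import numpy as np

def gram(k, X):
    """Gram matrix K_ij = k(xi, xj) over a finite sample X."""
    n = len(X)
    return np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

k = lambda x, z: np.exp(-np.sum((x - z) ** 2))    # an RBF kernel
X = np.random.default_rng(0).normal(size=(5, 2))  # arbitrary sample

K = gram(k, X)
eigvals = np.linalg.eigvalsh(K)                   # real, since K symmetric
print(np.allclose(K, K.T))                        # symmetric
print(eigvals.min() >= -1e-10)                    # positive semi-definite
```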
Kernel Methods

The answer to the question of for which kernels there exists a pair {H, Φ} with the properties described above, and for which there does not, is given by Mercer's condition.

Mercer's condition

Let X be a compact subset of R^n and let Φ be a mapping

Φ: x in X -> Φ(x) in H

where H is a Euclidean space. Then the inner product operation has an equivalent representation

k(x, z) = <Φ(x), Φ(z)> = Σ_r Φ_r(x) Φ_r(z)

and k(x, z) is a symmetric function satisfying the following condition

∫∫_{X×X} k(x, z) g(x) g(z) dx dz >= 0

for any g(x), x in X, such that

∫_X g(x)^2 dx < ∞
Mercer's Theorem

Theorem. Suppose K is a continuous symmetric non-negative definite kernel. Then there is an orthonormal basis {e_i} of L2[a, b] consisting of eigenfunctions of T_K,

(T_K φ)(x) = ∫_a^b K(x, s) φ(s) ds

such that the corresponding sequence of eigenvalues {λ_i} is nonnegative. The eigenfunctions corresponding to non-zero eigenvalues are continuous on [a, b], and K has the representation

K(s, t) = Σ_{j=1..∞} λ_j e_j(s) e_j(t)

where the convergence is absolute and uniform.
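On a finite sample the Gram matrix plays the role of K, and its eigendecomposition is the discrete analogue of this expansion; a sketch verifying the reconstruction numerically (the kernel and sample are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))                 # arbitrary sample points

# RBF Gram matrix as a discrete stand-in for K(s, t)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)

lam, E = np.linalg.eigh(K)        # columns of E: orthonormal eigenvectors
K_rebuilt = (E * lam) @ E.T       # sum_j lambda_j e_j e_j^T
print(lam.min() >= -1e-10)        # nonnegative spectrum, as the theorem says
print(np.allclose(K, K_rebuilt))  # the expansion reconstructs K
```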
Kernel Methods

Suppose k1 and k2 are valid (symmetric, positive definite) kernels on X. Then the following are valid kernels (see [5]):

1. k(x, z) = c k1(x, z), for c > 0
2. k(x, z) = k1(x, z) + k2(x, z)
3. k(x, z) = k1(x, z) k2(x, z)
4. k(x, z) = f(x) f(z), for any function f
5. k(x, z) = k1(Φ(x), Φ(z)), for a mapping Φ
6. k(x, z) = exp(k1(x, z))
7. k(x, z) = p(k1(x, z)), for a polynomial p with nonnegative coefficients
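These closure rules are easy to sanity-check on finite samples: the Gram matrices of k1 + k2 and of the pointwise product k1 k2 stay positive semi-definite (the product case is the Schur product theorem). A sketch with arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 2))                 # arbitrary sample points

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K1 = np.exp(-sq)                            # RBF Gram matrix
K2 = (X @ X.T + 1.0) ** 2                   # polynomial Gram matrix

def is_psd(K):
    return np.linalg.eigvalsh(K).min() >= -1e-10

print(is_psd(K1), is_psd(K2))               # both valid kernels
print(is_psd(K1 + K2))                      # rule 2: sum of kernels
print(is_psd(K1 * K2))                      # rule 3: elementwise product
```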
References

[1] C.J.C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery 2, 121-167, 1998.

[2] J.P. Marques de Sa, "Pattern Recognition: Concepts, Methods and Applications", Springer, 2001.

[3] S. Theodoridis, "Pattern Recognition", Elsevier Academic Press, 2003.

[4] T. Fletcher, "Support Vector Machines Explained", UCL, March 2005.

[5] N. Cristianini, J. Shawe-Taylor, "Kernel Methods for Pattern Analysis", Cambridge University Press, 2004.

[6] "Mercer's Theorem", Wikipedia: http://en.wikipedia.org/wiki/Mercer's_theorem
Thank You