Support Vector Machines and Kernel Methods

Kenan Gençol
Department of Electrical and Electronics Engineering
Anadolu University

submitted in the course MAT592 Seminar

Advisor: Prof. Dr. Yalçın Küçük
Department of Mathematics
Agenda

Linear Discriminant Functions and Decision Hyperplanes
Introduction to SVM
Support Vector Machines
Introduction to Kernels
Nonlinear SVM
Kernel Methods
Linear Discriminant Functions and Decision Hyperplanes
Figure 1. Two classes of patterns and a linear decision function
Linear Discriminant Functions and Decision Hyperplanes

Each pattern is represented by a vector x = [x1 x2]^T.

The linear decision function has the equation

w1 x1 + w2 x2 + w0 = 0

where w1, w2 are weights and w0 is the bias term.
Linear Discriminant Functions and Decision Hyperplanes

The general decision hyperplane equation in d-dimensional space has the form:

w . x + w0 = 0

where w = [w1 w2 .... wd]^T is the weight vector and w0 is the bias term.
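To make this concrete, here is a minimal sketch of evaluating such a decision function in Python; the weight values and the test point are made-up numbers, not from the slides:

```python
import numpy as np

# Made-up weights and bias for a 2-dimensional example
w = np.array([2.0, -1.0])   # w = [w1 w2]
w0 = 0.5                    # bias term

def g(x):
    """Linear discriminant function g(x) = w . x + w0."""
    return np.dot(w, x) + w0

# g(x) = 0 is the hyperplane; the sign of g(x) decides the class
x = np.array([1.0, 3.0])
print(g(x))                       # 2*1 - 1*3 + 0.5 = -0.5
print(1 if g(x) >= 0 else -1)     # classified as -1
```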
Introduction to SVM

There are many hyperplanes that separate the two classes.
Figure 2. An example of two possible classifiers
Introduction to SVM

THE GOAL: Our goal is to search for a direction w and bias w0 that give the maximum possible margin, or in other words, to orient this hyperplane in such a way as to be as far as possible from the closest members of both classes.
SVM: Linearly Separable Case
Figure 3. Hyperplane through two linearly separable classes
SVM: Linearly Separable Case

Our training data is of the form:

{(xi, yi)}, i = 1, ...., L, where yi in {-1, +1} and xi in R^d.

This hyperplane can be described by

w . x + b = 0

and is called the separating hyperplane.
SVM: Linearly Separable Case

Select variables w and b so that:

xi . w + b >= +1  for  yi = +1
xi . w + b <= -1  for  yi = -1

These equations can be combined into:

yi(xi . w + b) - 1 >= 0  for all i
SVM: Linearly Separable Case

The points that lie closest to the separating hyperplane are called support vectors (circled points in the diagram), and the hyperplanes

H1: xi . w + b = +1
H2: xi . w + b = -1

are called supporting hyperplanes.
SVM: Linearly Separable Case
Figure 3. Hyperplane through two linearly separable classes (repeated)
SVM: Linearly Separable Case

The hyperplane's equidistance from H1 and H2 means that d1 = d2, and this quantity is known as the SVM margin:

d1 + d2 = 2 / ||w||

d1 = d2 = 1 / ||w||
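Numerically, with an arbitrary example weight vector, the margin follows directly from this formula (a sketch, not slide material):

```python
import numpy as np

w = np.array([3.0, 4.0])              # arbitrary example weights
d = 1.0 / np.linalg.norm(w)           # d1 = d2 = 1/||w||
margin = 2.0 / np.linalg.norm(w)      # d1 + d2 = 2/||w||
print(d, margin)                      # 0.2 0.4
```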
SVM: Linearly Separable Case

Maximizing the margin 2/||w|| is equivalent to minimizing ||w||:

min ||w||  such that  yi(xi . w + b) - 1 >= 0

Minimizing ||w|| is in turn equivalent to minimizing (1/2)||w||^2, which allows us to perform Quadratic Programming (QP) optimization.
SVM: Linearly Separable Case

Optimization problem:

Minimize  (1/2)||w||^2

subject to  yi(xi . w + b) - 1 >= 0  for all i
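As an illustration of this QP, the primal problem can be handed to a general-purpose solver; a minimal sketch with scipy's SLSQP on a tiny made-up separable data set (the data, the starting point, and the expected solution in the comment are all illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set, two points per class (made up)
X = np.array([[2.0, 2.0], [3.0, 2.0], [-2.0, -2.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Pack the unknowns as v = [w1, w2, b]
def objective(v):
    w = v[:2]
    return 0.5 * np.dot(w, w)                    # (1/2)||w||^2

# SLSQP 'ineq' constraints mean fun(v) >= 0, which matches
# yi(xi . w + b) - 1 >= 0 exactly
cons = [{'type': 'ineq',
         'fun': lambda v, i=i: y[i] * (X[i] @ v[:2] + v[2]) - 1.0}
        for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), method='SLSQP', constraints=cons)
w, b = res.x[:2], res.x[2]
print(w, b)   # for this symmetric toy set: w ~ [0.25, 0.25], b ~ 0
```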
SVM: Linearly Separable Case

This is an inequality constrained optimization problem with Lagrangian function:

L(w, b, α) = (1/2)||w||^2 - Σ_{i=1..L} αi [ yi(xi . w + b) - 1 ]        (1)

where αi >= 0, i = 1, 2, ...., L are the Lagrange multipliers.
SVM

The corresponding KKT conditions are:

∂L/∂w = 0  =>  w = Σ_{i=1..L} αi yi xi        (2)

∂L/∂b = 0  =>  Σ_{i=1..L} αi yi = 0        (3)
SVM

This is a convex optimization problem. The cost function is convex, and the constraints are linear and define a convex set of feasible solutions. Such problems can be solved by considering the so-called Lagrangian Duality.
SVM

Substituting (2) and (3) into (1) gives a new formulation which, being dependent on α, we need to maximize.
SVM

This is called the Dual form (Lagrangian Dual) of the primal form. The dual form only requires the dot product of each input vector to be calculated.

This is important for the Kernel Trick, which will be described later.
SVM

So the problem becomes a dual problem:

Maximize

L_D = Σ_{i=1..L} αi - (1/2) α^T H α,   where H_ij = yi yj (xi . xj)

subject to

αi >= 0, i = 1, ...., L   and   Σ_{i=1..L} αi yi = 0
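A hedged sketch of this dual QP with the cvxopt solver, reusing the toy data from the primal sketch; the tiny diagonal ridge added to H for numerical stability is an implementation assumption, not part of the formulation:

```python
import numpy as np
from cvxopt import matrix, solvers

# Same made-up toy data as in the primal sketch
X = np.array([[2.0, 2.0], [3.0, 2.0], [-2.0, -2.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
L = len(y)

# H_ij = yi yj (xi . xj); maximizing sum(a) - (1/2) a^T H a is the
# same as minimizing (1/2) a^T H a - sum(a), cvxopt's convention
H = np.outer(y, y) * (X @ X.T)
P = matrix(H + 1e-8 * np.eye(L))   # small ridge keeps the QP well-posed
q = matrix(-np.ones(L))
G = matrix(-np.eye(L))             # -ai <= 0, i.e. ai >= 0
h = matrix(np.zeros(L))
A = matrix(y.reshape(1, -1))       # sum_i ai yi = 0
b = matrix(0.0)

solvers.options['show_progress'] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

# Recover the hyperplane: w = sum_i ai yi xi, b from any support vector
w = (alpha * y) @ X
sv = alpha > 1e-6
b_opt = np.mean(y[sv] - X[sv] @ w)
print(alpha, w, b_opt)             # matches the primal solution
```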
SVM

Differentiating with respect to the αi's and using the constraint equation, a system of equations is obtained. Solving the system, the Lagrange multipliers are found, and the optimum hyperplane is given according to the formula:

w = Σ_{i=1..L} αi yi xi
SVM

Some Notes:

SUPPORT VECTORS are the feature vectors with αi > 0, i = 1, 2, ...., L.
The cost function is strictly convex.
The Hessian matrix is positive definite.
Any local minimum is also global and unique.
The optimal hyperplane classifier of an SVM is UNIQUE.
Although the solution is unique, the resulting Lagrange multipliers are not unique.
Kernels: Introduction

When applying our SVM to linearly separable data, we started by creating a matrix H from the dot product of our input variables:

H_ij = yi yj (xi . xj),   with   k(xi, xj) = xi . xj = xi^T xj

being known as the Linear Kernel, an example of a family of functions called kernel functions.
Kernels: Introduction

All kernel functions are based on calculating the inner products of two vectors.

This means that if the inputs are mapped to a higher-dimensional space by a nonlinear mapping function

Φ: x -> Φ(x)

only the inner products of the mapped inputs need to be determined, without needing to explicitly calculate Φ.

This is called the "Kernel Trick".
Kernels: Introduction

The Kernel Trick is useful because there are many classification/regression problems that are not fully separable/regressable in the input space but are separable/regressable in a higher-dimensional space:

Φ: R^d -> H

xi . xj  ->  Φ(xi) . Φ(xj) = <Φ(xi), Φ(xj)>
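A small numerical check of this equivalence, using the homogeneous degree-2 polynomial kernel k(x, z) = (x . z)^2, whose explicit 2-D feature map is known in closed form (the test points are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit feature map for k(x, z) = (x . z)^2 in two dimensions:
    # phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])

lhs = np.dot(phi(x), phi(z))   # inner product computed in feature space
rhs = np.dot(x, z) ** 2        # kernel computed in input space
print(lhs, rhs)                # both 25.0 -- phi never needed explicitly
```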
Kernels: Introduction

Popular Kernel Families:

Radial Basis Function (RBF) Kernel: k(xi, xj) = exp(-||xi - xj||^2 / (2σ^2))

Polynomial Kernel: k(xi, xj) = (xi . xj + c)^p

Sigmoidal (Hyperbolic Tangent) Kernel: k(xi, xj) = tanh(κ (xi . xj) + δ)
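These standard forms translate directly into code; a sketch in numpy, where σ, c, p, κ, and δ are free parameters and the defaults are arbitrary choices:

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def polynomial_kernel(x, z, c=1.0, p=2):
    # k(x, z) = (x . z + c)^p
    return (np.dot(x, z) + c) ** p

def sigmoid_kernel(x, z, kappa=1.0, delta=0.0):
    # k(x, z) = tanh(kappa (x . z) + delta)
    return np.tanh(kappa * np.dot(x, z) + delta)

x, z = np.array([1.0, 2.0]), np.array([3.0, 1.0])
print(rbf_kernel(x, z), polynomial_kernel(x, z), sigmoid_kernel(x, z))
```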
Nonlinear Support Vector Machines

The support vector machine with kernel functions becomes:

Maximize  Σ_{i=1..L} αi - (1/2) Σ_{i,j} αi αj yi yj k(xi, xj)
subject to  αi >= 0 for all i,   Σ_{i=1..L} αi yi = 0

and the resulting classifier:

f(x) = sgn( Σ_{i=1..L} αi yi k(xi, x) + b )
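In practice this classifier is available off the shelf; a hedged sketch using scikit-learn's SVC on a toy data set that is not linearly separable in the input space (the parameter values are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaved half-moons: no separating hyperplane exists in R^2
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# RBF-kernel SVM; internally the decision function is
# sgn(sum_i ai yi k(xi, x) + b)
clf = SVC(kernel='rbf', C=1.0, gamma=1.0)
clf.fit(X, y)

print(clf.n_support_)    # number of support vectors per class
print(clf.score(X, y))   # training accuracy, typically ~1.0 here
```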
Nonlinear Support Vector Machines
Figure 4. The SVM architecture employing kernel functions.
Kernel Methods

Recall that a kernel function computes the inner product of the images of two data points under an embedding:

k: X × X -> R,   k(x, z) = <Φ(x), Φ(z)>

k is a kernel if
1. k is symmetric: k(x, y) = k(y, x)
2. k is positive semi-definite, i.e., the "Gram Matrix" K_ij = k(xi, xj) is positive semi-definite.
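Condition 2 can be checked numerically on any finite sample by inspecting the eigenvalues of the Gram matrix; a sketch with an RBF kernel and arbitrary sample points:

```python
import numpy as np

def gram(k, X):
    """Gram matrix K_ij = k(xi, xj) over a finite sample X."""
    n = len(X)
    return np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

k = lambda x, z: np.exp(-np.sum((x - z) ** 2))    # an RBF kernel
X = np.random.default_rng(0).normal(size=(5, 2))  # arbitrary sample

K = gram(k, X)
eigvals = np.linalg.eigvalsh(K)                   # real, since K symmetric
print(np.allclose(K, K.T))                        # symmetric
print(eigvals.min() >= -1e-10)                    # positive semi-definite
```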
Kernel Methods

The answer to the question of for which kernels there exists a pair {H, Φ} with the properties described above, and for which there does not, is given by Mercer's condition.

Mercer's condition

Let X be a compact subset of R^n and let Φ be a mapping

Φ: x in X -> Φ(x) in H

where H is a Euclidean space. Then the inner product operation has an equivalent representation

k(x, z) = <Φ(x), Φ(z)> = Σ_r Φ_r(x) Φ_r(z)

and k(x, z) is a symmetric function satisfying the following condition

∫∫_{X×X} k(x, z) g(x) g(z) dx dz >= 0

for any g(x), x in X, such that

∫_X g(x)^2 dx < ∞
Mercer's Theorem

Theorem. Suppose K is a continuous symmetric non-negative definite kernel. Then there is an orthonormal basis {e_i} of L2[a, b] consisting of eigenfunctions of T_K,

(T_K φ)(x) = ∫_a^b K(x, s) φ(s) ds

such that the corresponding sequence of eigenvalues {λ_i} is nonnegative. The eigenfunctions corresponding to non-zero eigenvalues are continuous on [a, b], and K has the representation

K(s, t) = Σ_{j=1..∞} λ_j e_j(s) e_j(t)

where the convergence is absolute and uniform.
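On a finite sample the Gram matrix plays the role of K, and its eigendecomposition is the discrete analogue of this expansion; a sketch verifying the reconstruction numerically (the kernel and sample are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))                 # arbitrary sample points

# RBF Gram matrix as a discrete stand-in for K(s, t)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)

lam, E = np.linalg.eigh(K)        # columns of E: orthonormal eigenvectors
K_rebuilt = (E * lam) @ E.T       # sum_j lambda_j e_j e_j^T
print(lam.min() >= -1e-10)        # nonnegative spectrum, as the theorem says
print(np.allclose(K, K_rebuilt))  # the expansion reconstructs K
```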
Kernel Methods

Suppose k1 and k2 are valid (symmetric, positive definite) kernels on X. Then the following are valid kernels (see [5]):

1. k(x, z) = c k1(x, z), for c > 0
2. k(x, z) = k1(x, z) + k2(x, z)
3. k(x, z) = k1(x, z) k2(x, z)
4. k(x, z) = f(x) f(z), for any function f
5. k(x, z) = k1(Φ(x), Φ(z)), for a mapping Φ
6. k(x, z) = exp(k1(x, z))
7. k(x, z) = p(k1(x, z)), for a polynomial p with nonnegative coefficients
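These closure rules are easy to sanity-check on finite samples: the Gram matrices of k1 + k2 and of the pointwise product k1 k2 stay positive semi-definite (the product case is the Schur product theorem). A sketch with arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 2))                 # arbitrary sample points

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K1 = np.exp(-sq)                            # RBF Gram matrix
K2 = (X @ X.T + 1.0) ** 2                   # polynomial Gram matrix

def is_psd(K):
    return np.linalg.eigvalsh(K).min() >= -1e-10

print(is_psd(K1), is_psd(K2))               # both valid kernels
print(is_psd(K1 + K2))                      # rule 2: sum of kernels
print(is_psd(K1 * K2))                      # rule 3: elementwise product
```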
References

[1] C.J.C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery 2, 121-167, 1998.

[2] J.P. Marques de Sa, "Pattern Recognition: Concepts, Methods and Applications", Springer, 2001.

[3] S. Theodoridis, "Pattern Recognition", Elsevier Academic Press, 2003.

[4] T. Fletcher, "Support Vector Machines Explained", UCL, March 2005.

[5] N. Cristianini, J. Shawe-Taylor, "Kernel Methods for Pattern Analysis", Cambridge University Press, 2004.

[6] "Mercer's Theorem", Wikipedia: http://en.wikipedia.org/wiki/Mercer's_theorem
Thank You