MACHINE LEARNING 09/10
Support Vector Machines: Hyperplane Classifiers
Support Vector Machines
• Supervised Learning Technique with:
  1. Improved generalization ability
     • Uses low-complexity functions (hyperplanes).
     • Minimizes bounds on the true risk.
  2. Global solution
     • Convex quadratic programming formulation.
  3. Can cope with non-linear problems through kernels
     • Transform the original data into a higher-dimensional space (the feature space).
     • Perform the optimization in that space, where linear methods can be employed.
Motivation
• Animations in:
  http://www.youtube.com/watch?v=3liCbRZPrZA
Applications
• Pattern Recognition / Classification
• x_i ∈ R^d
• y_i ∈ {−1, +1}
[Figure: labeled training samples in the (x1, x2) plane.]
Applications
• Regression
• x_i ∈ R^d
• y_i ∈ R
[Figure: training samples and fitted curve in the (x, y) plane.]
Historical Background
• Based on the Generalized Portrait Algorithm (60’s, Russia: Vapnik, Lerner, Chervonenkis)
• Developed in the 90’s at AT&T Bell Labs (Vapnik, Boser, Guyon, Cortes, Schölkopf)
• Initial industrial context: OCR (mid 90’s)
• Excellent performance found in regression and time series prediction (late 90’s)
Why “Support Vector” Machine (SVM)?
• Supervised Learning
  • Collect data from real experiments {(x_i, y_i)}, i = 1, …, n, where y_i = f(x_i)
• Use the training data to estimate an approximation f' of f
• The SVM selectively chooses, from the input vectors, the ones that are “important”, the SUPPORT VECTORS; all the others are disregarded.
Distinguishing Features
• Sound theoretical formulation (statistical learning theory)
• Bounds on performance
• Addresses the generalization problem (structural risk minimization)
The Generalization Problem
[Figure: a fitted curve over training samples and test samples in the (x, y) plane.]
• Lessons from neural networks (NN):
  • Too few units (parameters): high training error and high test error
  • Too many units (parameters): low training error and high test error
Statistical Learning Theory Framework
• Machine to learn the map x_i ↦ y_i = f(x_i, α), where α denotes the adjustable parameters
• Data drawn i.i.d. from P(x, y)
• Actual risk
• Empirical risk
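Written out in the usual two-class form (labels in {−1, +1}, loss ½|y − f(x, α)|), the two risks are:

$$
R(\alpha)=\int \tfrac{1}{2}\,\lvert y-f(\mathbf{x},\alpha)\rvert\,dP(\mathbf{x},y),
\qquad
R_{\mathrm{emp}}(\alpha)=\frac{1}{2n}\sum_{i=1}^{n}\lvert y_i-f(\mathbf{x}_i,\alpha)\rvert .
$$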
Bound on Generalization Performance
• 2-class pattern recognition problem: y_i ∈ {−1, +1}
• Choose η ∈ [0, 1]. With probability 1 − η the following bound holds:
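With n training samples and VC dimension h (defined on the next slides), the bound in its standard form is:

$$
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha) \;+\; \sqrt{\frac{h\left(\log(2n/h)+1\right)-\log(\eta/4)}{n}} .
$$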
• h is the Vapnik-Chervonenkis (VC) dimension and is a measure of the “capacity” of the machine.
• Within a set of learning machines, the best is the one that minimizes the right-hand side.
VC Confidence
• The second term on the right-hand side of the bound is called the VC confidence; for fixed n it grows with the VC dimension h.
VC Dimension
• Property of a set of functions F = {f(x, α)}
• A set N of n points can be labeled in 2^n possible ways
• If, for each labeling, a function in the set F can correctly assign those labels, then N is shattered by F
• The VC dimension of F is the maximum number of training points that can be shattered by F
Shattering with Oriented Hyperplanes in R^2
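As an illustration (not from the original slides), the small Python check below tests every labeling of three non-collinear points in R^2 for linear separability via a feasibility linear program; the point set, tolerance-free LP formulation, and use of SciPy are my own choices.

```python
# Illustrative sketch: brute-force check that 3 non-collinear points in R^2
# are shattered by oriented hyperplanes. Assumes NumPy and SciPy.
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Return True if some (w, b) satisfies y_i (w.x_i + b) >= 1 for all i."""
    n, d = X.shape
    # Unknowns z = [w_1, ..., w_d, b]; constraints -y_i (w.x_i + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.success

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # three non-collinear points
for labels in itertools.product([-1.0, 1.0], repeat=len(X)):
    print(labels, linearly_separable(X, np.array(labels)))
# All 2^3 = 8 labelings are separable, so the set is shattered; no set of
# 4 points in R^2 can be shattered, in line with the VC dimension d + 1 = 3.
```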
Hyperplane VC Dimension
• The VC dimension of oriented hyperplanes in R^d is d + 1.
VC Dimension and the number of parameters
• VC dimension ≠ number of parameters
• Striking example: a one-parameter function that shatters infinitely many points
• f(x, α) = sin(αx), with x, α ∈ R; classification is by the sign of f (see the sketch below)
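As a quick illustrative check (not from the slides), the classical construction picks the points x_i = 10^(-i); a single parameter α then reproduces any labeling. The number of points and the random labels below are my own choices.

```python
# Sketch of the classical construction: with x_i = 10^{-i}, one parameter
# alpha reproduces any labeling via sgn(sin(alpha * x)).
import numpy as np

rng = np.random.default_rng(0)
n = 6
x = 10.0 ** (-np.arange(1, n + 1))           # x_i = 10^{-i}, i = 1..n
y = rng.choice([-1.0, 1.0], size=n)          # an arbitrary labeling

# One-parameter choice that realizes exactly this labeling:
alpha = np.pi * (1.0 + np.sum((1.0 - y) * 10.0 ** np.arange(1, n + 1) / 2.0))

print(y)                                     # requested labels
print(np.sign(np.sin(alpha * x)))            # labels produced by sin(alpha x)
```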
Linearly Separable SVMs
• Separating hyperplane H: w·x + b = 0
• Margin: shortest distance from H to the closest positive (negative) sample.
• The SVM computes the hyperplane H with the largest symmetric margin.
[Figure: separating hyperplane H with margin hyperplanes H1 and H2; the support vectors lie on H1 and H2.]
  H:  w·x + b = 0
  H1: w·x + b = +1
  H2: w·x + b = −1
  Margin: m = 2 / ||w||
The optimization problem
• Maximize the margin m = 2/||w||  ⇒  minimize ||w||²
• Constraints: y_i (w·x_i + b) ≥ 1, i = 1, …, n
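In compact form, the resulting training problem is the convex quadratic program:

$$
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}
\quad\text{s.t.}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1,\quad i=1,\dots,n .
$$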
Primal Lagrangian
• Convex quadratic programming problem
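Written out, with one multiplier α_i ≥ 0 per constraint, the primal Lagrangian is:

$$
L_P=\tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}-\sum_{i=1}^{n}\alpha_i\big[y_i(\mathbf{w}\cdot\mathbf{x}_i+b)-1\big].
$$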
Wolfe dual formulation
• Easier solution and better extension to the non-linear case.
• The primal problem is equivalent to:
  Maximize L_P with respect to α,
  subject to: α_i ≥ 0 and ∇_{w,b} L_P = 0.
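Setting the gradients to zero gives w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0; substituting back yields the usual dual objective, a function of α alone:

$$
\max_{\boldsymbol{\alpha}}\; L_D=\sum_{i}\alpha_i-\tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j\,y_i y_j\,\mathbf{x}_i\cdot\mathbf{x}_j
\quad\text{s.t.}\quad \alpha_i\ge 0,\;\; \sum_i \alpha_i y_i = 0 .
$$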
Primal vs Dual Problems
• There are two equivalent ways to compute the solution:
  • Maximize the distance between two parallel supporting planes (primal).
  • Find the closest points in the convex hulls of the two classes (dual).
• Ref: Bennett and Bredensteiner, “Duality and Geometry in SVM Classifiers”, 2000.
Karush-Kuhn-Tucker Conditions
• Solve the dual problem ⇒ obtain the α_i
• Use the Karush-Kuhn-Tucker (KKT) conditions (necessary and sufficient in the SVM problem); see below
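For the hard-margin problem, the KKT conditions read:

$$
\begin{aligned}
&\mathbf{w}-\sum_i \alpha_i y_i \mathbf{x}_i = 0, \qquad \sum_i \alpha_i y_i = 0,\\
&y_i(\mathbf{w}\cdot\mathbf{x}_i+b)-1 \ge 0, \qquad \alpha_i \ge 0, \qquad
\alpha_i\big[y_i(\mathbf{w}\cdot\mathbf{x}_i+b)-1\big]=0 .
\end{aligned}
$$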
Computing the Solution
• To compute the solution we must rely on numerical methods.
• There are several numerical packages that solve the quadratic programming problem.
• In the lab we will use the Support Vector Machine Toolbox by Steve Gunn, which uses interior point optimization.
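For illustration only (the lab uses the MATLAB toolbox above), here is a minimal Python sketch that feeds the hard-margin dual to a generic QP solver; the toy data, the support-vector tolerance, and the choice of cvxopt are my own assumptions.

```python
# Minimal sketch: solve the hard-margin SVM dual with the cvxopt QP solver
# on a toy linearly separable data set in R^2.
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)

# Dual: min 1/2 a'Qa - 1'a  s.t.  a >= 0,  y'a = 0,  with Q_ij = y_i y_j x_i.x_j
Q = (y[:, None] * X) @ (y[:, None] * X).T
P = matrix(Q)
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))           # encodes -a_i <= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))     # encodes y'a = 0
b = matrix(0.0)

solvers.options['show_progress'] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

sv = alpha > 1e-6                                 # support vectors: alpha_i > 0
w = ((alpha * y)[:, None] * X).sum(axis=0)        # w = sum_i alpha_i y_i x_i
b_off = np.mean(y[sv] - X[sv] @ w)                # b from y_s (w.x_s + b) = 1

print("support vectors:", np.where(sv)[0])
print("w =", w, " b =", b_off)
```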
Complementarity Condition
• KKT complementarity condition: α_i [ y_i (w·x_i + b) − 1 ] = 0
• It can be satisfied in two ways:
  • α_i = 0: the data vector is not on the margin boundary and is irrelevant to the solution.
  • α_i > 0 and y_i (w·x_i + b) − 1 = 0: the data vector lies on the margin boundary; these are the support vectors.
• Support vectors: S = { x_i : α_i > 0 }, and y_i f(x_i) = 1 for all x_i ∈ S.
Primal Solution
• After solving the dual problem, the primal problem solution is:
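From the KKT conditions (with S the set of support vectors):

$$
\mathbf{w}=\sum_{i:\,\alpha_i>0}\alpha_i y_i \mathbf{x}_i,
\qquad
b=y_k-\mathbf{w}\cdot\mathbf{x}_k \quad\text{for any support vector } \mathbf{x}_k .
$$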
Test Phase
[Diagram: the new input x is compared, via dot products, with each support vector x_i; the results are weighted by α_i y_i, summed together with the bias b, and passed through sgn.]
  y = sgn( Σ_i α_i y_i (x · x_i) + b )
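A direct transcription of that diagram into code (a sketch; the function and array names are my own assumptions), classifying a new point from the stored support vectors alone:

```python
# Test phase: y = sgn( sum_i alpha_i y_i (x_new . x_i) + b ), using only the
# support vectors, their labels, their multipliers, and the bias b.
import numpy as np

def svm_predict(x_new, X_sv, y_sv, alpha_sv, b):
    """Classify x_new from support vectors X_sv with labels y_sv and multipliers alpha_sv."""
    return np.sign(np.sum(alpha_sv * y_sv * (X_sv @ x_new)) + b)
```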
Example
• Comparison of decision boundaries: Support Vector Machine vs. Perceptron [figures].
Mechanical Analogy
• Each support vector exerts a force F_i = α_i y_i (w / ||w||) on the separating hyperplane.
• The resulting force and torque are zero!
• Forces sum to zero:   Σ_i α_i y_i = 0  ⇒  Σ_i F_i = Σ_i α_i y_i (w / ||w||) = 0
• Torques sum to zero:  w = Σ_i α_i y_i x_i  ⇒  Σ_i x_i × F_i = Σ_i α_i y_i x_i × (w / ||w||) = w × (w / ||w||) = 0