MACHINE LEARNING 09/10
Support Vector Machines: Hyperplane Classifiers
Support Vector Machines
• Supervised Learning Technique with:
  1. Improved generalization ability
     • Uses low-complexity functions (hyperplanes).
     • Minimizes bounds on the true risk.
  2. Global solution
     • Convex quadratic programming formulation.
  3. Can cope with non-linear problems through kernels
     • Transform the original data into a higher-dimensional space (the feature space).
     • Perform the optimization in that space, where linear methods can be employed.
Motivation
• Animations in:
  http://www.youtube.com/watch?v=3liCbRZPrZA
Applications
• Pattern Recognition / Classification
• x_i ∈ R^d
• y_i ∈ {−1, +1}
[Figure: labeled training samples in the (x1, x2) plane.]
Applications
• Regression
• x_i ∈ R^d
• y_i ∈ R
[Figure: training samples and fitted curve in the (x, y) plane.]
Historical Background
• Based on the Generalized Portrait Algorithm (60’s, Russia: Vapnik, Lerner, Chervonenkis)
• Developed in the 90’s at AT&T Bell Labs (Vapnik, Boser, Guyon, Cortes, Schölkopf)
• Initial industrial context: OCR (mid 90’s)
• Excellent performance found in regression and time series prediction (late 90’s)
Why “Support Vector” Machine (SVM)?
• Supervised Learning
  • Collect data from real experiments {(x_i, y_i)}, i = 1, …, n, where y_i = f(x_i)
• Use the training data to estimate an approximation f' of f
• The SVM selectively chooses, from the input vectors, the ones that are “important”, the SUPPORT VECTORS; all the others are disregarded.
Distinguishing Features
• Sound theoretical formulation (statistical learning theory)
• Bounds on performance
• Addresses the generalization problem (structural risk minimization)
The Generalization Problem
[Figure: a fitted curve over training samples and test samples in the (x, y) plane.]
• Lessons from neural networks (NN):
  • Too few units (parameters): high training error and high test error
  • Too many units (parameters): low training error and high test error
Statistical Learning Theory Framework
• Machine to learn the map x_i ↦ y_i = f(x_i, α), where α denotes the adjustable parameters
• Data drawn i.i.d. from P(x, y)
• Actual risk
• Empirical risk
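Written out in the usual two-class form (labels in {−1, +1}, loss ½|y − f(x, α)|), the two risks are:

$$
R(\alpha)=\int \tfrac{1}{2}\,\lvert y-f(\mathbf{x},\alpha)\rvert\,dP(\mathbf{x},y),
\qquad
R_{\mathrm{emp}}(\alpha)=\frac{1}{2n}\sum_{i=1}^{n}\lvert y_i-f(\mathbf{x}_i,\alpha)\rvert .
$$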
Bound on Generalization Performance
• 2-class pattern recognition problem: y_i ∈ {−1, +1}
• Choose η ∈ [0, 1]. With probability 1 − η the following bound holds:
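With n training samples and VC dimension h (defined on the next slides), the bound in its standard form is:

$$
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha) \;+\; \sqrt{\frac{h\left(\log(2n/h)+1\right)-\log(\eta/4)}{n}} .
$$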
• h is the Vapnik-Chervonenkis (VC) dimension and is a measure of the “capacity” of the machine.
• Within a set of learning machines, the best is the one that minimizes the right-hand side.
VC Confidence
• The second term on the right-hand side of the bound is called the VC confidence; for fixed n it grows with the VC dimension h.
VC Dimension
• Property of a set of functions F = {f(x, α)}
• A set N of n points can be labeled in 2^n possible ways
• If, for each labeling, a function in the set F can correctly assign those labels, then N is shattered by F
• The VC dimension of F is the maximum number of training points that can be shattered by F
Shattering with Oriented Hyperplanes in R^2
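As an illustration (not from the original slides), the small Python check below tests every labeling of three non-collinear points in R^2 for linear separability via a feasibility linear program; the point set, tolerance-free LP formulation, and use of SciPy are my own choices.

```python
# Illustrative sketch: brute-force check that 3 non-collinear points in R^2
# are shattered by oriented hyperplanes. Assumes NumPy and SciPy.
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Return True if some (w, b) satisfies y_i (w.x_i + b) >= 1 for all i."""
    n, d = X.shape
    # Unknowns z = [w_1, ..., w_d, b]; constraints -y_i (w.x_i + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.success

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # three non-collinear points
for labels in itertools.product([-1.0, 1.0], repeat=len(X)):
    print(labels, linearly_separable(X, np.array(labels)))
# All 2^3 = 8 labelings are separable, so the set is shattered; no set of
# 4 points in R^2 can be shattered, in line with the VC dimension d + 1 = 3.
```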
Hyperplane VC Dimension
• The VC dimension of oriented hyperplanes in R^d is d + 1.
VC Dimension and the number of parameters
• VC dimension ≠ number of parameters
• Striking example: a one-parameter function that shatters infinitely many points
• f(x, α) = sin(αx), with x, α ∈ R; classification is by the sign of f (see the sketch below)
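As a quick illustrative check (not from the slides), the classical construction picks the points x_i = 10^(-i); a single parameter α then reproduces any labeling. The number of points and the random labels below are my own choices.

```python
# Sketch of the classical construction: with x_i = 10^{-i}, one parameter
# alpha reproduces any labeling via sgn(sin(alpha * x)).
import numpy as np

rng = np.random.default_rng(0)
n = 6
x = 10.0 ** (-np.arange(1, n + 1))           # x_i = 10^{-i}, i = 1..n
y = rng.choice([-1.0, 1.0], size=n)          # an arbitrary labeling

# One-parameter choice that realizes exactly this labeling:
alpha = np.pi * (1.0 + np.sum((1.0 - y) * 10.0 ** np.arange(1, n + 1) / 2.0))

print(y)                                     # requested labels
print(np.sign(np.sin(alpha * x)))            # labels produced by sin(alpha x)
```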
Linearly Separable SVMs
• Separating hyperplane H: w·x + b = 0
• Margin: shortest distance from H to the closest positive (negative) sample.
• The SVM computes the hyperplane H with the largest symmetric margin.
[Figure: separating hyperplane H with margin hyperplanes H1 and H2; the support vectors lie on H1 and H2.]
  H:  w·x + b = 0
  H1: w·x + b = +1
  H2: w·x + b = −1
  Margin: m = 2 / ||w||
The optimization problem
• Maximize the margin m = 2/||w||  ⇒  minimize ||w||²
• Constraints: y_i (w·x_i + b) ≥ 1, i = 1, …, n
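In compact form, the resulting training problem is the convex quadratic program:

$$
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}
\quad\text{s.t.}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1,\quad i=1,\dots,n .
$$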
Primal Lagrangian
• Convex quadratic programming problem
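Written out, with one multiplier α_i ≥ 0 per constraint, the primal Lagrangian is:

$$
L_P=\tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}-\sum_{i=1}^{n}\alpha_i\big[y_i(\mathbf{w}\cdot\mathbf{x}_i+b)-1\big].
$$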
Wolfe dual formulation
• Easier solution and better extension to the non-linear case.
• The primal problem is equivalent to:
  Maximize L_P with respect to α,
  subject to: α_i ≥ 0 and ∇_{w,b} L_P = 0.
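Setting the gradients to zero gives w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0; substituting back yields the usual dual objective, a function of α alone:

$$
\max_{\boldsymbol{\alpha}}\; L_D=\sum_{i}\alpha_i-\tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j\,y_i y_j\,\mathbf{x}_i\cdot\mathbf{x}_j
\quad\text{s.t.}\quad \alpha_i\ge 0,\;\; \sum_i \alpha_i y_i = 0 .
$$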
Primal vs Dual Problems
• There are two equivalent ways to compute the solution:
  • Maximize the distance between two parallel supporting planes (primal).
  • Find the closest points in the convex hulls of the two classes (dual).
• Ref: Bennett and Bredensteiner, “Duality and Geometry in SVM Classifiers”, 2000.
Karush-Kuhn-Tucker Conditions
• Solve the dual problem ⇒ obtain the α_i
• Use the Karush-Kuhn-Tucker (KKT) conditions (necessary and sufficient in the SVM problem); see below
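For the hard-margin problem, the KKT conditions read:

$$
\begin{aligned}
&\mathbf{w}-\sum_i \alpha_i y_i \mathbf{x}_i = 0, \qquad \sum_i \alpha_i y_i = 0,\\
&y_i(\mathbf{w}\cdot\mathbf{x}_i+b)-1 \ge 0, \qquad \alpha_i \ge 0, \qquad
\alpha_i\big[y_i(\mathbf{w}\cdot\mathbf{x}_i+b)-1\big]=0 .
\end{aligned}
$$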
Computing the Solution
• To compute the solution we must rely on numerical methods.
• There are several numerical packages that solve the quadratic programming problem.
• In the lab we will use the Support Vector Machine Toolbox by Steve Gunn, which uses interior point optimization.
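For illustration only (the lab uses the MATLAB toolbox above), here is a minimal Python sketch that feeds the hard-margin dual to a generic QP solver; the toy data, the support-vector tolerance, and the choice of cvxopt are my own assumptions.

```python
# Minimal sketch: solve the hard-margin SVM dual with the cvxopt QP solver
# on a toy linearly separable data set in R^2.
import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)

# Dual: min 1/2 a'Qa - 1'a  s.t.  a >= 0,  y'a = 0,  with Q_ij = y_i y_j x_i.x_j
Q = (y[:, None] * X) @ (y[:, None] * X).T
P = matrix(Q)
q = matrix(-np.ones(n))
G = matrix(-np.eye(n))           # encodes -a_i <= 0
h = matrix(np.zeros(n))
A = matrix(y.reshape(1, -1))     # encodes y'a = 0
b = matrix(0.0)

solvers.options['show_progress'] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

sv = alpha > 1e-6                                 # support vectors: alpha_i > 0
w = ((alpha * y)[:, None] * X).sum(axis=0)        # w = sum_i alpha_i y_i x_i
b_off = np.mean(y[sv] - X[sv] @ w)                # b from y_s (w.x_s + b) = 1

print("support vectors:", np.where(sv)[0])
print("w =", w, " b =", b_off)
```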
Complementarity Condition
• KKT complementarity condition: α_i [ y_i (w·x_i + b) − 1 ] = 0
• It can be satisfied in two ways:
  • α_i = 0: the data vector is not on the margin boundary and is irrelevant to the solution.
  • α_i > 0 and y_i (w·x_i + b) − 1 = 0: the data vector lies on the margin boundary; these are the support vectors.
• Support vectors: S = { x_i : α_i > 0 }, and y_i f(x_i) = 1 for all x_i ∈ S.
Primal Solution
• After solving the dual problem, the primal problem solution is:
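From the KKT conditions (with S the set of support vectors):

$$
\mathbf{w}=\sum_{i:\,\alpha_i>0}\alpha_i y_i \mathbf{x}_i,
\qquad
b=y_k-\mathbf{w}\cdot\mathbf{x}_k \quad\text{for any support vector } \mathbf{x}_k .
$$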
Test Phase
[Diagram: the new input x is compared, via dot products, with each support vector x_i; the results are weighted by α_i y_i, summed together with the bias b, and passed through sgn.]
  y = sgn( Σ_i α_i y_i (x · x_i) + b )
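A direct transcription of that diagram into code (a sketch; the function and array names are my own assumptions), classifying a new point from the stored support vectors alone:

```python
# Test phase: y = sgn( sum_i alpha_i y_i (x_new . x_i) + b ), using only the
# support vectors, their labels, their multipliers, and the bias b.
import numpy as np

def svm_predict(x_new, X_sv, y_sv, alpha_sv, b):
    """Classify x_new from support vectors X_sv with labels y_sv and multipliers alpha_sv."""
    return np.sign(np.sum(alpha_sv * y_sv * (X_sv @ x_new)) + b)
```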
Example
• Comparison of decision boundaries: Support Vector Machine vs. Perceptron [figures].
Mechanical Analogy
• Each support vector exerts a force F_i = α_i y_i (w / ||w||) on the separating hyperplane.
• The resulting force and torque are zero!
• Forces sum to zero:   Σ_i α_i y_i = 0  ⇒  Σ_i F_i = Σ_i α_i y_i (w / ||w||) = 0
• Torques sum to zero:  w = Σ_i α_i y_i x_i  ⇒  Σ_i x_i × F_i = Σ_i α_i y_i x_i × (w / ||w||) = w × (w / ||w||) = 0