
Support Vector Machines

Javier Béjar

LSI - FIB

Term 2012/2013


Outline

1 Support Vector Machines - Introduction

2 Maximum Margin Classification

3 Soft Margin Classification

4 Non-linear large margin classification

5 Application


Support Vector Machines - Introduction


Linear separators

As we saw with the perceptron, one way to perform classification is to find a linear separator (for binary classification)

The idea is to find the equation of a line wT x + b = 0 that divides the set of examples into the target classes so that:

wT x + b > 0 ⇒ Positive class

wT x + b < 0 ⇒ Negative class

This is just the problem that the single-layer perceptron solves (b is the bias weight)
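A minimal sketch of this decision rule on made-up values (NumPy is an illustrative choice, not something the slides prescribe):

```python
# Sketch of the linear decision rule sign(w^T x + b) on hypothetical values
import numpy as np

def linear_classify(w, b, X):
    """Return +1 for the positive class and -1 for the negative class."""
    return np.where(X @ w + b > 0, 1, -1)

w = np.array([1.0, -1.0])                 # hypothetical weight vector
b = 0.5                                   # hypothetical bias
X = np.array([[2.0, 0.0], [0.0, 2.0]])    # two examples to classify
print(linear_classify(w, b, X))           # [ 1 -1]
```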


Optimal linear separator

Actually, any line that divides the examples can be an answer to the problem

The question we can ask is: which line is the best one?


Some boundaries seem a bad choice due to poor generalization

We have to maximize the probability of classifying correctly unseen instances

We want to minimize the expected generalization loss (instead of the expected empirical loss)


Maximum Margin Classification


Maximizing the distance of the separator to the examples seems the right choice (actually supported by PAC learning theory)

This means that only the nearest instances to the separator matter (the rest can be ignored)


Classification Margin

We can compute the distance from an example xi to the separator as:

r = (wT xi + b) / ||w||

The examples closest to the separator are the support vectors

The margin (ρ) of the separator is the distance between the support vectors


Large Margin Decision Boundary

The separator has to be as far as possible from the examples of both classes

This means that we have to maximize the margin

We can normalize the equation of the separator so that the distance at the support vectors is 1 or −1, using

r = (wT x + b) / ||w||

So the length of the optimal margin is m = 2 / ||w||

This means that maximizing the margin is the same as minimizing the norm of the weights
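A small numeric check of these two formulas on a made-up separator and support vectors (NumPy is an illustrative choice):

```python
# Distance r = (w^T x + b)/||w|| and margin m = 2/||w|| for a separator
# normalized so that the support vectors satisfy |w^T x + b| = 1 (toy values)
import numpy as np

w = np.array([1.0, 1.0])
b = -3.0                            # separator x1 + x2 - 3 = 0
x_pos_sv = np.array([2.0, 2.0])     # hypothetical positive support vector: w.x + b = +1
x_neg_sv = np.array([1.0, 1.0])     # hypothetical negative support vector: w.x + b = -1

r_pos = (w @ x_pos_sv + b) / np.linalg.norm(w)
r_neg = (w @ x_neg_sv + b) / np.linalg.norm(w)
print(r_pos, r_neg)                           # +1/sqrt(2) and -1/sqrt(2)
print(r_pos - r_neg, 2 / np.linalg.norm(w))   # both equal the margin 2/||w||
```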


Computing the decision boundary

Given a set of examples {x1, x2, . . . , xn} with class labels yi ∈ {+1, −1}, a decision boundary that classifies all the examples correctly satisfies

yi (wT xi + b) ≥ 1, ∀i

This allows us to define the problem of learning the weights as an optimization problem


The primal formulation of the optimization problem is:

Minimize (1/2) ||w||²

subject to yi (wT xi + b) ≥ 1, ∀i

This problem is difficult to solve in this formulation, but we can rewrite it in a more convenient form


Constrained optimization (a little digression)

Suppose we want to minimize f (x) subject to g(x) = 0

A necessary condition for a point x0 to be a solution is

∂/∂x (f (x) + α g(x)) |x=x0 = 0

g(x0) = 0

where α is the Lagrange multiplier

If we have multiple constraints, we just need one multiplier per constraint

∂/∂x (f (x) + Σi=1..n αi gi (x)) |x=x0 = 0

gi (x0) = 0, ∀i ∈ 1 . . . n
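A tiny worked illustration of the equality-constrained case (the example function and the use of SymPy are illustrative choices, not from the slides): minimizing f(x, y) = x² + y² subject to x + y − 1 = 0 by setting the gradient of the Lagrangian to zero gives x = y = 1/2.

```python
# Illustrative check of the Lagrange-multiplier condition with SymPy:
# minimize f(x, y) = x^2 + y^2 subject to g(x, y) = x + y - 1 = 0
import sympy as sp

x, y, alpha = sp.symbols('x y alpha')
f = x**2 + y**2
g = x + y - 1
L = f + alpha * g                     # the Lagrangian

# Stationarity of L in (x, y) together with the constraint g = 0
solution = sp.solve([sp.diff(L, x), sp.diff(L, y), g], [x, y, alpha])
print(solution)                       # {x: 1/2, y: 1/2, alpha: -1}
```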


For inequality constraints gi (x) ≤ 0 we have to add the condition that the Lagrange multipliers are non-negative, αi ≥ 0

If x0 is a solution to the constrained optimization problem

minx f (x) subject to gi (x) ≤ 0, ∀i ∈ 1 . . . n

there must exist αi ≥ 0 such that x0 satisfies

∂/∂x (f (x) + Σi=1..n αi gi (x)) |x=x0 = 0

αi gi (x0) = 0, ∀i ∈ 1 . . . n (each constraint is either active or has a zero multiplier)

The function f (x) + Σi=1..n αi gi (x) is called the Lagrangian

The solution is a point where the gradient of the Lagrangian is 0


Computing the decision boundary (we are back)

The primal formulation of the optimization problem is:

Minimize (1/2) ||w||²

subject to 1 − yi (wT xi + b) ≤ 0, ∀i

The Lagrangian is (using ||w||² = wTw):

L = (1/2) wTw + Σi=1..n αi (1 − yi (wT xi + b))


If we compute the derivatives of L with respect to w and b and set them to zero (we are computing the gradient):

w − Σi=1..n αi yi xi = 0 ⇒ w = Σi=1..n αi yi xi

and

Σi=1..n αi yi = 0


The dual formulation

If we substitute w into the Lagrangian L:

L = (1/2) (Σi=1..n αi yi xi)T (Σj=1..n αj yj xj) + Σi=1..n αi (1 − yi (Σj=1..n αj yj xjT xi + b))

= (1/2) Σi=1..n Σj=1..n αi αj yi yj xiT xj + Σi=1..n αi − Σi=1..n Σj=1..n αi αj yi yj xjT xi − b Σi=1..n αi yi

(and given that Σi=1..n αi yi = 0)

= − (1/2) Σi=1..n Σj=1..n αi αj yi yj xiT xj + Σi=1..n αi (rearranging terms)


This problem is known as the dual problem (the original is the primal)

This formulation only depends on the αi

Both problems are linked: given w we can compute the αi, and vice versa

This means that we can solve one instead of the other

Now this problem is a maximization problem (αi ≥ 0)


The dual problem

The formulation of the dual problem is

Maximize Σi=1..n αi − (1/2) Σi=1..n Σj=1..n αi αj yi yj xiT xj

subject to Σi=1..n αi yi = 0, αi ≥ 0 ∀i

This is a Quadratic Programming (QP) problem

This means that the objective is a paraboloidal surface in the parameters, so an optimum can be found

We can obtain w as

w = Σi=1..n αi yi xi

b can be obtained from a positive support vector (xpsv), knowing that wT xpsv + b = 1
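A possible way to solve this small QP numerically, sketched on toy data with SciPy's general-purpose constrained solver (the data, the use of scipy.optimize and the way b is recovered are illustrative choices, not the method prescribed in the slides):

```python
# Hard-margin dual QP on toy, linearly separable data (illustrative sketch)
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Matrix with entries y_i y_j x_i^T x_j
G = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(alpha):
    # negative dual objective: we minimize -(sum(alpha) - 1/2 alpha^T G alpha)
    return 0.5 * alpha @ G @ alpha - alpha.sum()

constraints = ({'type': 'eq', 'fun': lambda a: a @ y},)   # sum_i alpha_i y_i = 0
bounds = [(0, None)] * n                                  # alpha_i >= 0

alpha = minimize(neg_dual, np.zeros(n), bounds=bounds, constraints=constraints).x

# Recover w from the multipliers and b from a support vector (alpha_i > 0)
w = (alpha * y) @ X
sv = alpha > 1e-6
b = np.mean(y[sv] - X[sv] @ w)
print(np.round(alpha, 3), w, b)
```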


Geometric Interpretation

[Figure: geometric interpretation of the maximum margin separator, with the weight vector w orthogonal to the decision boundary]


Characteristics of the solution

Many of the αi are zero

w is a linear combination of a small number of examples

We obtain a sparse representation of the data (data compression)

The examples xi with non-zero αi are the support vectors (SV)

The vector of parameters can be expressed as:

w = Σi∈SV αi yi xi


Classifying new instances

In order to classify a new instance z we just have to compute

sign(wT z + b) = sign(Σi∈SV αi yi xiT z + b)

This means that w does not need to be explicitly computed
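A minimal sketch of this prediction rule using only the support vectors (the vectors, labels and multipliers below are made-up illustrative values):

```python
# Classify a new instance z from the support vectors alone, without forming w
import numpy as np

support_X = np.array([[2.0, 2.0], [1.0, 1.0]])   # hypothetical support vectors
support_y = np.array([1.0, -1.0])                # their labels
alpha = np.array([1.0, 1.0])                     # hypothetical multipliers
b = -3.0

def classify(z):
    return np.sign(np.sum(alpha * support_y * (support_X @ z)) + b)

print(classify(np.array([3.0, 3.0])))            # 1.0 (positive class)
```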


Solving the QP problem

Quadratic programming is a well known optimization problem

Many algorithms have been proposed

One of the most popular is Sequential Minimal Optimization (SMO)

It picks a pair of variables and solves the QP problem for those two variables (trivial)

It repeats the procedure until convergence


Soft Margin Classification


Non separable problems

This algorithm works well for separable problems

Sometimes the data has errors and we want to ignore them to obtain a better solution

Sometimes the data is simply not linearly separable

We can obtain better linear separators by being less strict

[Figure: a separator that passes too close to the examples vs. a better one with a larger margin]


Soft Margin Classification

We want to be permissive with certain examples, allowing their classification by the separator to diverge from the real class

This can be obtained by adding to the problem what are called slack variables (ξi)

These variables represent the deviation of the examples from the margin

Doing this we relax the margin: we are using a soft margin


We are allowing an error in classification based on the separator wT x + b

The values of ξi approximate the number of misclassifications


Soft Margin Hyperplane

In order to minimize the error, we can minimize Σi ξi, introducing the slack variables into the constraints of the problem:

wT xi + b ≥ 1 − ξi   if yi = 1

wT xi + b ≤ −1 + ξi   if yi = −1

ξi ≥ 0

All ξi = 0 if there are no errors (linearly separable problem)

The number of resulting support vectors and slack variables gives an upper bound on the leave-one-out error


The primal optimization problem

We need to introduce these slack variables into the original problem, so we now have:

Minimize (1/2) ||w||² + C Σi=1..n ξi

subject to yi (wT xi + b) ≥ 1 − ξi, ∀i, ξi ≥ 0

Now we have an additional parameter C that controls the trade-off between the error and the margin

We will need to adjust this parameter
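An illustrative sketch of the effect of C, using scikit-learn's SVC on synthetic data (the library, the data and the parameter values are assumptions for illustration, not part of the slides):

```python
# Smaller C tolerates more margin violations (softer margin, usually more
# support vectors); larger C penalizes violations harder (illustrative only)
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)),
               rng.normal(loc=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(C, clf.n_support_, np.linalg.norm(clf.coef_))
```

In practice, adjusting this parameter usually means selecting C by validation over a grid of candidate values.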


The dual optimization problem

Performing the transformation to the dual problem we obtain the following:

Maximize Σi=1..n αi − (1/2) Σi=1..n Σj=1..n αi αj yi yj xiT xj

subject to Σi=1..n αi yi = 0, C ≥ αi ≥ 0 ∀i

We can recover the solution as w = Σi∈SV αi yi xi

This problem is very similar to the linearly separable case, except that there is an upper bound C on the values of αi

This is also a QP problem


Non-linear large margin classification


So far we have only considered large margin classifiers that use a linear boundary

In order to have better performance we have to be able to obtain non-linear boundaries

The idea is to transform the data from the input space (the original attributes of the examples) to a higher dimensional space using a function φ(x)

This new space is called the feature space

The advantage of the transformation is that linear operations in the feature space are equivalent to non-linear operations in the input space

Remember the RBF networks?


Feature Space transformation

[Figure: examples mapped from the input space to a higher dimensional feature space where they become linearly separable]

The XOR problem

The XOR problem is not linearly separable in its original definition, but we can make it linearly separable if we add a new feature x1 · x2

x1   x2   x1 · x2   x1 XOR x2
0    0    0         0
0    1    0         1
1    0    0         1
1    1    1         0

The linear function h(x) = 2x1x2 − x1 − x2 + 1 takes value 1 exactly on the examples where the XOR is 0 and value 0 where it is 1, so a threshold on h classifies all the examples correctly
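A small check of this construction (illustrative code; the threshold at 1/2 follows from the values h takes in the table above):

```python
# XOR becomes linearly separable after adding the feature x1*x2
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
target = np.array([0, 1, 1, 0])                 # x1 XOR x2

Z = np.column_stack([X, X[:, 0] * X[:, 1]])     # extended features (x1, x2, x1*x2)

# The slide's linear function h(x) = 2*x1*x2 - x1 - x2 + 1
h = 2 * Z[:, 2] - Z[:, 0] - Z[:, 1] + 1
print(h)                                        # [1. 0. 0. 1.]
print(((h < 0.5).astype(int) == target).all())  # True: thresholding h recovers XOR
```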


Transforming the data

Working directly in the feature space can be costly

We have to explicitly create the feature space and operate in it

We may need infinite features to make a problem linearly separable

We can use what is called the Kernel trick


The Kernel Trick

In the problem that defines an SVM, only inner products between the examples are needed

This means that if we can define how to compute this product in the feature space, then it is not necessary to explicitly build that space

There are many geometric operations that can be expressed as inner products, like angles or distances

We can define the kernel function as:

K (xi , xj) = φ(xi)T φ(xj)


Kernels - Example

We can show how the kernel trick works with an example

Let us assume a feature space defined as:

φ([x1, x2]) = (1, √2 x1, √2 x2, x1², x2², √2 x1x2)

An inner product in this feature space is:

⟨φ([x1, x2]), φ([y1, y2])⟩ = (1 + x1y1 + x2y2)²

So we can define a kernel function that computes inner products in this space as:

K (x , y) = (1 + x1y1 + x2y2)²

and we only use the features from the input space
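A quick numeric verification of this identity on made-up vectors (NumPy is an illustrative choice):

```python
# Check that the explicit feature map phi and the kernel (1 + x.y)^2 agree
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def K(x, y):
    return (1.0 + x @ y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(phi(x) @ phi(y), K(x, y))   # both print 4.0
```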


Kernel functions - Properties

Given a set of examples X = {x1, x2, . . . , xn} and a symmetric function k(x , z), we can define the kernel matrix K as the n × n matrix

K = φTφ, with entries Kij = k(xi , xj)

If K is a symmetric and positive semi-definite matrix, then k(x , z) is a kernel function

The justification is that a symmetric positive semi-definite K can be decomposed as:

K = VΛVT

V can be interpreted as a projection into the feature space (remember feature projection?)


Kernel functions

Examples of functions that satisfy this condition are:

Polynomial kernel of degree d: K (x , y) = (xT y + 1)^d

Gaussian kernel with width σ: K (x , y) = exp(−||x − y||² / 2σ²)

Sigmoid (hyperbolic tangent) with parameters k and θ: K (x , y) = tanh(k xT y + θ) (only for certain values of the parameters)

Kernel functions can also be interpreted as similarity functions

A similarity function can be transformed into a kernel provided that it satisfies the positive semi-definite matrix condition


The dual problem

Due to the introduction of the kernel function, the optimization problem has to be modified:

Maximize Σi=1..n αi − (1/2) Σi=1..n Σj=1..n αi αj yi yj K (xi , xj)

subject to Σi=1..n αi yi = 0, C ≥ αi ≥ 0 ∀i

For classifying new examples:

h(z) = sign( Σi∈SV αi yi K (xi , z) + b )
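An illustrative sketch of training a kernelized soft-margin SVM with scikit-learn's SVC (the library, the RBF kernel, its parameters and the synthetic data are assumptions, not taken from the slides):

```python
# A problem that is not linearly separable in the input space:
# class +1 inside a disc, class -1 outside (illustrative data)
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)

clf = SVC(kernel='rbf', C=1.0, gamma=1.0).fit(X, y)
print(clf.score(X, y), len(clf.support_))   # training accuracy and number of SVs
```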


The advantage of using kernels

Since the training of the SVM only needs the values of K (xi , xj), there are no constraints on how the examples are represented

We only have to define the kernel as a similarity between examples

We can define similarity functions for different representations:

Strings, sequences

Graphs/Trees

Documents

...
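One way to exploit this, sketched with scikit-learn's precomputed-kernel interface (the toy string similarity below is a hypothetical choice, not from the slides; it is a sum of position-wise match indicators, which keeps the kernel matrix positive semi-definite):

```python
# SVM over arbitrary objects (toy strings) through a precomputed kernel matrix
import numpy as np
from sklearn.svm import SVC

docs = ["aab", "aba", "bba", "bbb"]
labels = [1, 1, -1, -1]

def sim(s, t):
    # toy similarity: number of positions where the two strings agree
    return sum(a == b for a, b in zip(s, t))

K = np.array([[sim(s, t) for t in docs] for s in docs], dtype=float)
clf = SVC(kernel='precomputed').fit(K, labels)

# To classify new objects, only their kernel values against the training set are needed
K_new = np.array([[sim("aaa", t) for t in docs]], dtype=float)
print(clf.predict(K_new))
```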


Application


Splice-Junction gene sequences

Identification of gene sequence type (Bioinformatics)

60 Attributes (Discrete)

Attributes: DNA base at position 1-60

3190 instances

3 classes

Methods: Support vector machines

Validation: 10 fold cross validation
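An illustrative sketch of this evaluation protocol with scikit-learn; the random placeholder sequences below stand in for the actual splice-junction data (whose loading is not shown in the slides), and the one-hot encoding of the discrete attributes is an assumption:

```python
# 10-fold cross-validation of a linear SVM on placeholder DNA-like data
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_raw = rng.choice(list("ACGT"), size=(99, 60))    # placeholder for the 3190 sequences
y = np.repeat(["class1", "class2", "class3"], 33)  # placeholder for the 3 classes

X = OneHotEncoder().fit_transform(X_raw)           # one column per (position, base)
scores = cross_val_score(SVC(kernel='linear', C=1.0), X, y, cv=10)
print(scores.mean())
```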


Splice-Junction gene sequences: Models

SVM (linear): accuracy 90.9%, 1331 Support Vectors

SVM (linear, C=1): accuracy 91.8%, 608 Support Vectors

SVM (linear, C=5): accuracy 91.5%, 579 Support Vectors

SVM (quadratic): accuracy 91.4%, 1305 Support Vectors

SVM (quadratic, C=5): accuracy 91.6%, 1054 Support Vectors

SVM (cubic, C=5): accuracy 91.9%, 1206 Support Vectors
