Elements - Chapter 12 - SVM

Henry Tan

Georgetown University

April 13, 2015

Georgetown University SVM 1

Introduction to Support Vector Machines

General Idea

We want to be able to classify inputs into one of two classes. It's all about how you phrase the question, not how you solve it.

First Steps - Support Vector Classification

Solve a linearly separable problem (no overlap) as a convex optimization problem. The separating hyperplane must be a "flat" (linear) surface.

Extend to SVM

Solve the non-separable case by allowing some slack in the constraints.

Non-linear separation using basis expansion and Kernel functions

Georgetown University SVM 2

Support Vectors - Linear, Fully Separable

Georgetown University SVM 3

Hyperplane Separation

Training data: N pairs (x_1, y_1), ..., (x_N, y_N)
p dimensions: x_i ∈ R^p
Class labels: y_i ∈ {−1, 1}

A hyperplane is defined by
{x : f(x) = x^T β + β_0 = 0} (12.1)
with ||β|| = 1 and some constant β_0.

A straight line in 2D, a flat plane in 3D...

Note: equation numbering follows the Elements of Statistical Learning PDF (10th printing).

Georgetown University SVM 4

Hyperplane Separation 2

f(x) = x^T β + β_0 gives the signed distance from a point x to the hyperplane.

Classification Rule

Given the parameters of a hyperplane β, β_0, we can plug in any observation x_i and see which 'side' of the plane it is on.

G(x) = sign[x^T β + β_0] (12.2)
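As a concrete illustration, here is a minimal Python sketch of (12.2) with a hypothetical hyperplane (β, β_0) that I made up; since ||β|| = 1 here, f(x) is the signed distance itself:

```python
import numpy as np

beta = np.array([0.6, 0.8])          # unit norm, so f(x) is the signed distance
beta0 = -1.0

def f(x):
    """Signed distance from x to the hyperplane {x : x^T beta + beta_0 = 0}."""
    return x @ beta + beta0

def G(x):
    """Classification rule (12.2): which side of the hyperplane x falls on."""
    return np.sign(f(x))

x = np.array([2.0, 1.0])
print(f(x), G(x))                    # 1.0, 1.0 -> x lies on the +1 side
```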

Georgetown University SVM 5

Hyperplane Separation 3

f(x) = x^T β + β_0 gives the signed distance from a point x to the hyperplane.

Since we assume the classes are linearly separable, there must exist a separating hyperplane, i.e., some f(x) = x^T β + β_0 such that y_i f(x_i) > 0 ∀i.

Optimisation problem

Find the hyperplane with the largest margin M between training points

max_{β, β_0, ||β||=1} M (12.3)

subject to y_i(x_i^T β + β_0) ≥ M, i = 1, ..., N
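To make the objective concrete, the sketch below (with made-up toy data and a hand-picked candidate hyperplane) computes the quantity that (12.3) maximizes: the smallest signed distance of any training point to the hyperplane.

```python
import numpy as np

def geometric_margin(beta, beta0, X, y):
    """Smallest signed distance y_i (x_i^T beta + beta_0) / ||beta||.

    For a separating hyperplane this is the margin M that (12.3) maximizes;
    a non-positive value means some point is misclassified. beta need not
    have unit norm here, hence the division by ||beta||.
    """
    return np.min(y * (X @ beta + beta0)) / np.linalg.norm(beta)

# Hypothetical separable toy data and a candidate hyperplane.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
print(geometric_margin(np.array([1.0, 1.0]), 0.0, X, y))   # about 2.12
```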

Georgetown University SVM 6

Hyperplanes for Non-Separable Case

Question

Grace What is the intuition/physical meaning of using the slack variables?

Georgetown University SVM 7

Hyperplanes for Non-Separable Case

For Non-Separable Classes

Allow some points to be on the wrong side of the margin. Define slack variables ξ = (ξ_1, ..., ξ_N) and require

y_i(x_i^T β + β_0) ≥ M(1 − ξ_i) ∀i (12.6)

Some observations are allowed to be on the wrong side of the margin, but we still attempt to maximize the margin.

Georgetown University SVM 8

Hyperplanes for Non-Separable Case 2

More Constraints

Slack must be non-negative: ξ_i ≥ 0

Total slack is bounded by some constant: Σ_{i=1}^N ξ_i ≤ k

If ξ_i > 1, that training sample is misclassified in the solution.
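A small sketch of how the slack values behave, using the normalization introduced two slides later (margin at y_i f(x_i) = 1); the data and hyperplane here are hypothetical:

```python
import numpy as np

def slacks(beta, beta0, X, y):
    """xi_i = max(0, 1 - y_i f(x_i)).

    xi_i = 0     : on the correct side of the margin
    0 < xi_i < 1 : inside the margin but still correctly classified
    xi_i > 1     : misclassified
    """
    return np.maximum(0.0, 1.0 - y * (X @ beta + beta0))

# Hypothetical data and hyperplane.
X = np.array([[2.0, 0.0], [0.5, 0.0], [-0.2, 0.0], [-2.0, 0.0]])
y = np.array([1, 1, 1, -1])
xi = slacks(np.array([1.0, 0.0]), 0.0, X, y)
print(xi)                      # [0.  0.5 1.2 0. ]
print(np.sum(xi > 1))          # 1 misclassified point
```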

Note

There is another way to introduce the slack, y_i(x_i^T β + β_0) ≥ M − ξ_i, which measures the overlap in actual distance from the margin, but it leads to a nonconvex problem (I'm not too sure why).

Georgetown University SVM 9

Hyperplanes for Non-Separable Case 3

The norm constraint on β can be dropped and M set to 1/||β|| to get the equivalent formulation

min ||β|| subject to y_i(x_i^T β + β_0) ≥ 1 − ξ_i ∀i, (12.7)
ξ_i ≥ 0, Σ ξ_i ≤ constant

We can see that correctly classified points far from the boundary, i.e. those with y_i(x_i^T β + β_0) = y_i f(x_i) > 1, are inactive in the constraints and therefore do not affect the solution.

Georgetown University SVM 10

Lagrange and his equations

Questions

Yifang Could you walk us through the Lagrange function reduction, from 12.9 to 12.17?

Sicong I am not clear about the Lagrange function in section 12.2, can you make some detailed illustrations of it?

Yuankai What is the Lagrange function and how is it used in margin-based methods?

Disclaimer

I don't properly know Lagrange functions and the duals and all that. The following is what I could figure out in the last few days.

Georgetown University SVM 11

A Detour - The Lagrange Function

Minimum distance to travel from M → P → C. Note that the gradient of the ellipse at the solution is the same as the gradient of the constraint (the path through P) at the solution.

Source: http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html

Georgetown University SVM 12

The Lagrange Function - Lagrange Multipliers

Consider the following problem:
minimize f(x, y) = x^2 + y^2
subject to the constraint g(x, y) = x + y − 2 = 0
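A quick check of this example with SymPy, solving the simultaneous equations ∇f = α∇g and g = 0 that the next slide sets up (the symbol names are mine):

```python
import sympy as sp

x, y, a = sp.symbols("x y alpha")
f = x**2 + y**2
g = x + y - 2

# Lagrange conditions: grad f = alpha * grad g, together with g = 0.
solution = sp.solve([sp.diff(f, x) - a * sp.diff(g, x),
                     sp.diff(f, y) - a * sp.diff(g, y),
                     g],
                    [x, y, a], dict=True)
print(solution)                      # [{x: 1, y: 1, alpha: 2}]
print(f.subs(solution[0]))           # minimum value 2
```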

Readings: http://www.cs.cmu.edu/~ggordon/lp.pdf

Georgetown University SVM 13

The Lagrange Function - Lagrange Multipliers 2

The gradient of the objective function is a multiple of the gradient of the constraint. This can be re-stated as a set of simultaneous equations:

g(x, y) = 0 ← from the original constraint

∇f(x, y) = α ∇g(x, y) ← new

α is called the Lagrange multiplier.

Georgetown University SVM 14

The Lagrange Function

The Lagrangian

This can be restated in compact form as the Lagrangian L:
L(x, y, α) = f(x, y) + α g(x, y),
where the equations to solve are ∇L = 0.

Multiple Constraints → Multiple (Independent) Lagrange multipliers

Minimizing f(x) with constraints g_i(x) = 0 for 1 ≤ i ≤ N yields the Lagrangian

L(x, α) = f(x) + Σ_{1 ≤ i ≤ N} α_i g_i(x)

with Lagrange multipliers α = (α_1, ..., α_N).

Georgetown University SVM 15

The Lagrange Function - Inequality Constraints

Theorem: Solution is at a Saddle Point

The solution, if it exists, is one where the Lagrangian cannot be decreased further by changing the original variables, or increased by changing the multipliers.

Inequality Constraints?

Previously, all the constraints were equalities. To deal with an inequality constraint g_i(x) ≥ 0, we require the corresponding multiplier p_i ≤ 0 (and for g_i(x) ≤ 0, p_i ≥ 0). This follows from the theorem above.

Georgetown University SVM 16

Back onto linear SVM - The Lagrangian

Reformulating 12.7

The previous equation can be converted into the following form

min_{β, β_0} (1/2)||β||^2 + C Σ_{i=1}^N ξ_i (12.8)

subject to ξ_i ≥ 0, y_i(x_i^T β + β_0) ≥ 1 − ξ_i ∀i

Intuition

Previously, we had the constraint Σ ξ_i ≤ constant.

A small ||β||^2 corresponds to a large margin, which in turn requires more slack. Instead of bounding the total slack by a constant, (12.8) minimizes ||β||^2 (i.e., maximizes the margin) while also penalizing the total slack, with C controlling the trade-off.
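As an illustration, (12.8) can be solved directly as a convex program. The sketch below uses CVXPY on made-up toy data; it is a didactic sketch, not how production SVM solvers work.

```python
import numpy as np
import cvxpy as cp

# Hypothetical toy data: two overlapping Gaussian blobs, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(20, 2)), rng.normal(+1, 1, size=(20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)
N, p = X.shape
C = 1.0

beta, beta0, xi = cp.Variable(p), cp.Variable(), cp.Variable(N)

# Objective and constraints of (12.8).
objective = cp.Minimize(0.5 * cp.sum_squares(beta) + C * cp.sum(xi))
constraints = [xi >= 0,
               cp.multiply(y, X @ beta + beta0) >= 1 - xi]
cp.Problem(objective, constraints).solve()

print("beta:", beta.value, "beta_0:", beta0.value)
```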

Georgetown University SVM 17

The Lagrangian Primal Form of SVM

General Form

Minimizing f(x) with constraints g_i(x) = 0 for 1 ≤ i ≤ N yields the Lagrangian
L(x, α) = f(x) + Σ_{1 ≤ i ≤ N} α_i g_i(x)

Original SVM Optimization Problem

min_{β, β_0} (1/2)||β||^2 + C Σ_{i=1}^N ξ_i

subject to ξ_i ≥ 0, y_i(x_i^T β + β_0) ≥ 1 − ξ_i ∀i

yields the Lagrangian

L_P = (1/2)||β||^2 + C Σ_{i=1}^N ξ_i − Σ_{i=1}^N α_i [y_i(x_i^T β + β_0) − (1 − ξ_i)] − Σ_{i=1}^N µ_i ξ_i (12.9)

with Lagrange multipliers α_i, µ_i for 1 ≤ i ≤ N.

Georgetown University SVM 18

SVM Lagrangian Breakdown

Previously

Note from earlier that the Lagrangian is mainly a compact representation; what we actually want to solve is ∇L = 0. Setting the derivatives of L_P with respect to the primal variables to zero yields

∂L_P/∂β = 0   ⇒   β = Σ_{i=1}^N α_i y_i x_i (12.10)

∂L_P/∂β_0 = 0   ⇒   0 = Σ_{i=1}^N α_i y_i (12.11)

∂L_P/∂ξ_i = 0   ⇒   α_i = C − µ_i ∀i (12.12)

along with the positivity constraints α_i, µ_i, ξ_i ≥ 0 ∀i, which hold because the corresponding primal constraints are inequalities.

Georgetown University SVM 19

SVM Lagrangian Breakdown - Wolfe Dual

Substituting the previous equations into the Lagrangian yields the Wolfe dual objective function:

β = Σ_{i=1}^N α_i y_i x_i,    0 = Σ_{i=1}^N α_i y_i,    α_i = C − µ_i ∀i

The detailed workings (done on the board if necessary) reduce to the following substitutions:

L_P = (1/2)||β||^2 + C Σ_{i=1}^N ξ_i − Σ_{i=1}^N α_i [y_i(x_i^T β + β_0) − (1 − ξ_i)] − Σ_{i=1}^N µ_i ξ_i (12.9)

(1/2)||β||^2 = (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j x_i^T x_j

C Σ_{i=1}^N ξ_i − Σ_{i=1}^N µ_i ξ_i = Σ_{i=1}^N α_i ξ_i

− Σ_{i=1}^N α_i [y_i(x_i^T β + β_0) − (1 − ξ_i)] = − Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j x_i^T x_j − Σ_{i=1}^N α_i ξ_i + Σ_{i=1}^N α_i

Georgetown University SVM 20

SVM Lagrangian Breakdown - Wolfe Dual 2

Putting the pieces together

Simply adding these pieces up yields

L_D = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j x_i^T x_j (12.13)

This also provides a lower bound on the primal objective function (this is weak duality: the dual objective at any feasible α never exceeds the primal optimum). I do not have a full derivation of why this construction is called the Wolfe dual.
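For illustration, the dual can be maximized numerically subject to 0 ≤ α_i ≤ C and Σ α_i y_i = 0 (the box constraint follows from α_i = C − µ_i with µ_i ≥ 0). A CVXPY sketch on made-up toy data:

```python
import numpy as np
import cvxpy as cp

# Hypothetical toy data, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(20, 2)), rng.normal(+1, 1, size=(20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)
N, C = len(y), 1.0

alpha = cp.Variable(N)
beta_expr = X.T @ cp.multiply(alpha, y)          # beta = sum_i alpha_i y_i x_i (12.10)
L_D = cp.sum(alpha) - 0.5 * cp.sum_squares(beta_expr)

problem = cp.Problem(cp.Maximize(L_D),
                     [alpha >= 0, alpha <= C, y @ alpha == 0])
problem.solve()

beta_hat = X.T @ (alpha.value * y)
print("non-zero alphas (support vectors):", np.sum(alpha.value > 1e-6))
print("beta_hat:", beta_hat)
```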

Georgetown University SVM 21

SVM Lagrangian Breakdown - Karush-Kuhn-Tucker

Question

Jiyun Why do the Karush-Kuhn-Tucker conditions include the constraints 12.14-12.16?

Grace What are the intuition/physical meanings of the KKT conditions (formulas 12.14-12.16)?

Georgetown University SVM 22

SVM Lagrangian Breakdown - Karush-Kuhn-Tucker

Karush-Kuhn-Tucker

The KKT conditions are necessary conditions for a solution of a non-linear program to be optimal.
Equations 12.10-12.12: stationarity (the solution is at a stationary point of the Lagrangian)
Equations 12.14 and 12.15: complementary slackness
Equation 12.16: primal feasibility (the original constraint must still hold)

α_i [y_i(x_i^T β + β_0) − (1 − ξ_i)] = 0 (12.14)

µ_i ξ_i = 0 (12.15)

y_i(x_i^T β + β_0) − (1 − ξ_i) ≥ 0 (12.16)

for i = 1, ..., N

Readings: https://www.cs.cmu.edu/~ggordon/10725-F12/slides/16-kkt.pdf

Georgetown University SVM 23

SVM Lagrangian Breakdown - KKT

Complementary Slackness Intuition

The primal and dual problems are related in their variables and constraints. If a variable in the primal is non-zero, then the corresponding constraint must be binding in the dual.

Georgetown University SVM 24

SVM Lagrangian

Looking back

Equation 12.10 already gives us the form of the solution: β̂ = Σ_{i=1}^N α̂_i y_i x_i.

Given constraint 12.14, for non-zero α_i constraint 12.16 must hold with equality:

α_i [y_i(x_i^T β + β_0) − (1 − ξ_i)] = 0 (12.14)

y_i(x_i^T β + β_0) − (1 − ξ_i) ≥ 0 (12.16)

for i = 1, ..., N

Support Vectors

The observations where this holds, i.e. y_i f(x_i) = 1 − ξ_i with ξ_i ≥ 0, are the support vectors: they lie exactly on the margin (ξ_i = 0) or past it (ξ_i > 0). β_0 can be solved for using any of the margin observations (those with ξ_i = 0); in practice an average over all such solutions is used.
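This can be checked numerically with scikit-learn's SVC, whose dual_coef_ attribute stores α̂_i y_i for the support vectors; the toy data below is made up:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data; any (almost) linearly separable two-class set works.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(30, 2)), rng.normal(+2, 1, size=(30, 2))])
y = np.array([-1] * 30 + [1] * 30)

svc = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only, so (12.10) gives:
beta_hat = svc.dual_coef_ @ svc.support_vectors_   # shape (1, p)
print(np.allclose(beta_hat, svc.coef_))            # True: matches sklearn's beta
print("beta_0:", svc.intercept_)                   # sklearn's (averaged) beta_0
```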

Georgetown University SVM 25

SVM Lagrangian

Finale

Maximizing the dual is a simpler convex quadratic programming problem than the primal and can be solved with standard techniques.

Georgetown University SVM 26

SVM Kernel Functions

Questions

Yuankai Can you explain what is kernel in section 12.3?

Sicong What is the role of the kernel in SVM? And what about the eigen expansion of a kernel?

Questions - Kernel Trick

Sicong It seems that SVM can perform non-linear classification by something called the "kernel trick". Can you introduce it a little bit in your presentation?

Jiyun Can you explain in detail how SVM works on non-linearly separable data?

Tavish In section 12.3, the text mentions that for SVM a linear boundary function is calculated on training data and then translated to non-linear boundaries in the original space. Why is this statement valid and how are the functions translated?

Yifang On page 423, it says "We can represent the optimization problem (12.9) and its solution in a special way that only involves the input features via inner products. We do this directly for the transformed feature vectors h(x_i)." I do not understand why they could do this directly for the transformed feature vectors.

Georgetown University SVM 27

Kernels - Basis Functions

What is basis expansion?

Select a set of basis functions h_m(x) for m = 1, ..., M and fit, using the previously described method, on h(x_i) = (h_1(x_i), ..., h_M(x_i)) instead of on x_i. These functions can be arbitrary in the general case; the SVM trickery will be covered in the following slides.

Support Vector Machine

Support Vector Machine classification uses an extension of the previously described support vector classification together with specific sets of basis functions to classify in larger, potentially infinite-dimensional, spaces.

Georgetown University SVM 28

Basis Expansion Example - Non-linearly separable

Consider the toy example shown below. As demonstrated, any straight line will have a huge training error.

Georgetown University SVM 29

Basis Expansion Example - Curved separation

A good separator will be some curved function, which is not allowed in the linear formulation.

Georgetown University SVM 30

Basis Expansion Example - Choosing proper bases

Let h(x, y) = (h_1(x, y), h_2(x, y)) with h_1(x, y) = x and h_2(x, y) = y − x^2.
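A sketch of this particular basis expansion on hypothetical toy data generated around the parabola y = x^2: a linear classifier struggles on the raw (x, y) features but separates the classes easily after the map h.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data in the spirit of the slides: class +1 above the
# parabola y = x^2, class -1 below it -- not separable by any straight line.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y_coord = rng.uniform(-1, 5, size=200)
labels = np.where(y_coord > x**2, 1, -1)

X_raw = np.column_stack([x, y_coord])            # original features (x, y)
X_basis = np.column_stack([x, y_coord - x**2])   # h(x, y) = (x, y - x^2)

print(SVC(kernel="linear").fit(X_raw, labels).score(X_raw, labels))      # noticeably below 1.0
print(SVC(kernel="linear").fit(X_basis, labels).score(X_basis, labels))  # at or very near 1.0
```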

Georgetown University SVM 31

Kernel Trickery

As seen previously on...

We saw in the Lagrangian section that the Lagrangian dual, which we solve, looks like

L_D = Σ_{i=1}^N α_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j ⟨h(x_i)|h(x_j)⟩

and the solution function is of the form

f(x) = h(x)^T β + β_0 = Σ_{i=1}^N α_i y_i ⟨h(x)|h(x_i)⟩ + β_0

where the bra-ket notation is another way of writing the inner product.
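The same structure shows up in library implementations. The sketch below (hypothetical toy data, radial basis kernel) rebuilds the decision function from scikit-learn's stored dual coefficients and support vectors and compares it to decision_function:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Hypothetical toy data with a circular class boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)

svc = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

# f(x) = sum_i alpha_i y_i K(x, x_i) + beta_0, summed over support vectors only.
X_new = rng.normal(size=(5, 2))
K = rbf_kernel(X_new, svc.support_vectors_, gamma=0.5)
f_manual = K @ svc.dual_coef_.ravel() + svc.intercept_

print(np.allclose(f_manual, svc.decision_function(X_new)))   # True
```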

Georgetown University SVM 32

Kernel Trickery 2

A Kernel

Solving both only requires the inner products of the mapped features, not actually computing the h(x) transformations. This inner product over the transformed space is called the kernel function. A specific choice of kernel implies a specific set of transformations.

K(x, x′) = ⟨h(x), h(x′)⟩ (12.21)

Popular Kernels

dth-degree polynomial: K(x, x′) = (1 + ⟨x, x′⟩)^d

Radial basis: K(x, x′) = exp(−γ ||x − x′||^2)

Neural network: K(x, x′) = tanh(κ_1 ⟨x, x′⟩ + κ_2)

Georgetown University SVM 33

Kernel Trickery Example

2nd-Degree polynomial

The equations below show that the 2nd-degree polynomial kernel (with inputs X_1, X_2) can be expanded and, for a particular set of transformation functions h(X), written as an inner product.

K(X, Y) = (1 + ⟨X|Y⟩)^2
        = 1 + 2 X_1 Y_1 + 2 X_2 Y_2 + (X_1 Y_1)^2 + (X_2 Y_2)^2 + 2 X_1 Y_1 X_2 Y_2

Define h_1(X) = 1, h_2(X) = √2 X_1, h_3(X) = √2 X_2, h_4(X) = X_1^2, h_5(X) = X_2^2, h_6(X) = √2 X_1 X_2.

Then the kernel function is exactly K(X, Y) = ⟨h(X), h(Y)⟩, as desired. Therefore, at least for these popular kernels, computing the kernel is simply computing a simple function of the original, untransformed observations.
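A quick numerical check of this identity (the function names are mine):

```python
import numpy as np

def poly2_kernel(x, y):
    """2nd-degree polynomial kernel K(X, Y) = (1 + <X, Y>)^2."""
    return (1.0 + x @ y) ** 2

def h(x):
    """Explicit feature map from the slide: h(X) in R^6."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     x1 ** 2,
                     x2 ** 2,
                     np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(0)
for _ in range(5):
    x, y = rng.normal(size=2), rng.normal(size=2)
    assert np.isclose(poly2_kernel(x, y), h(x) @ h(y))
print("Kernel matches the explicit inner product <h(X), h(Y)>.")
```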

Georgetown University SVM 34

The Tuning Parameter C

A larger C discourages large ξ_i, since errors have more impact relative to ||β|| on the value of the objective function. This may lead to overfitting (complicated, wiggly boundaries).

A smaller C encourages a smaller ||β||, i.e., a larger margin and a smoother boundary.

C is also known as the regularization parameter.
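A small scikit-learn sketch (on made-up overlapping data) showing the effect of C on ||β||, the margin width 2/||β||, and the number of support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical overlapping two-class data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.5, size=(100, 2)), rng.normal(+1, 1.5, size=(100, 2))])
y = np.array([-1] * 100 + [1] * 100)

for C in (0.01, 1.0, 100.0):
    svc = SVC(kernel="linear", C=C).fit(X, y)
    norm_beta = np.linalg.norm(svc.coef_)
    margin = 2.0 / norm_beta                     # margin width = 2 / ||beta||
    print(f"C={C:7.2f}  ||beta||={norm_beta:.3f}  "
          f"margin={margin:.3f}  support vectors={svc.n_support_.sum()}")
```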

Georgetown University SVM 35

SVM as a Penalization Method

Question

JiYun - SVM's loss function seems like a reward function to me. Can we map SVM's solution to a gradient algorithm?

Formulating an optimisation problem for the hyperplane with a loss function can give the same solution as the original SVM equation.

Intuitively

The objective function is increased when an observation is on the wrong side of its margin, i.e., is a support vector with ξ_i > 0. Observations beyond the margin on the correct side have no effect on the objective function.

min_{β_0, β} Σ_{i=1}^N [1 − y_i f(x_i)]_+ + (λ/2) ||β||^2 (12.25)

Georgetown University SVM 36

Function Estimation/Reproducing Kernel Hilbert Spaces

Questions

Sicong What is the role of the kernel in SVM? And what about the eigen expansion of a kernel?

Yifang In equation 12.26, what is δ_m? The Dirac function?

Georgetown University SVM 37

Function Estimation/Reproducing Kernel Hilbert Spaces

More Kernel Trickery

As mentioned previously, using specific kernel functions allows for high-order basis expansions h_m without having to calculate the h_m explicitly.

This is possible simply by taking any positive definite kernel K and considering its eigen-expansion in some function space:

K(x, x′) = ⟨h(x), h(x′)⟩ = Σ_{m=1}^∞ φ_m(x) φ_m(x′) δ_m

where h_m(x) = √δ_m φ_m(x) and δ_m ≥ 0 is the m-th expansion coefficient (an eigenvalue, not the Dirac delta).

Georgetown University SVM 38

Curse of Dimensionality

Questions

Brandon In Section 12.3.4, the authors argue that knowing a priori which features to discount is uninteresting because it makes statistical learning easier in general, but knowing which SVM algorithm to use seems roughly equivalent? Will BRUTO and MARS always do well against noise, or were the results problem-specific?

Tavish With respect to the curse of dimensionality, when is it a good idea to use SVM? And from an application perspective, what sort of data characteristics/required outcomes make SVM the first option for classification?

Georgetown University SVM 39

Manages the Curse of Dimensionality?

No, not quite

While SVMs can effectively perform basis expansions to an infinite number of dimensions, they cannot easily select which dimensions to concentrate on.

Via Previous Example

As shown in the 2nd-degree polynomial kernel example, the basis functions for that kernel are fixed and weighted very specifically.

To answer the questions specifically: selecting the kernel to use is a good start, but if you don't know which kernel to use, or even how to represent the kernel you want, you have to make do with a similar one. I'd say SVM is always a good starting point because it can work in very high-dimensional expanded spaces, or whenever you think the distribution of the data matches the basis functions.

Georgetown University SVM 40

Error Curves - Question

Question

Brandon - I'm having trouble understanding Figure 12.6. Can you point out the interesting features?

γ ← the coefficient in the exponent of the radial basis function
C ← the regularisation/cost parameter

Georgetown University SVM 41

Skipping - Path Algorithm

Question

Grace - Can you explain section 12.3.5 in more detail? It is related to the SVM ranker.

Unfortunately

I didn't understand it. It seems to be a way to vary C, since C is so important in properly tuning the model. C can be chosen via cross-validation, or by using the path algorithm described.

Unsure of the connection to a ranking algorithm.

Georgetown University SVM 42

SVM and Regression

Linear Version

f(x) = x^T β + β_0. We want to estimate β (as usual).

Consider minimizing

H(β, β_0) = Σ_{i=1}^N V(y_i − f(x_i)) + (λ/2) ||β||^2 (12.37)

where

V_ε(r) = 0 if |r| < ε,  and  V_ε(r) = |r| − ε otherwise. (12.38)
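A minimal sketch of the ε-insensitive loss (12.38) and the objective (12.37); note that scikit-learn's SVR optimizes an equivalent objective parameterized by C and epsilon rather than λ:

```python
import numpy as np

def eps_insensitive_loss(r, eps=0.1):
    """V_eps(r) from (12.38): zero inside the eps-tube, linear outside it."""
    return np.maximum(np.abs(r) - eps, 0.0)

def H(beta, beta0, X, y, lam=1.0, eps=0.1):
    """Objective (12.37): eps-insensitive loss plus a ridge penalty on beta."""
    residuals = y - (X @ beta + beta0)
    return eps_insensitive_loss(residuals, eps).sum() + 0.5 * lam * beta @ beta
```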

Georgetown University SVM 43

SVM and Regression 2

As with the SVM classifier, this objective has a term which increases once points leave the margin (here, the ε-tube), and which is larger the further outside it they fall.

Note

I don’t know how to derive the following equations.

Georgetown University SVM 44

SVM and Regression 3

If β̂, β̂_0 minimize H, then the solution functions are

β̂ = Σ_{i=1}^N (α̂*_i − α̂_i) x_i (12.39)

f̂(x) = Σ_{i=1}^N (α̂*_i − α̂_i) ⟨x, x_i⟩ + β̂_0 (12.40)

where α̂_i, α̂*_i are positive and solve the quadratic programming problem

min_{α_i, α*_i}  ε Σ_{i=1}^N (α*_i + α_i) − Σ_{i=1}^N y_i (α*_i − α_i) + (1/2) Σ_{i=1}^N Σ_{i′=1}^N (α*_i − α_i)(α*_{i′} − α_{i′}) ⟨x_i, x_{i′}⟩

subject to  0 ≤ α_i, α*_i ≤ 1/λ,
            Σ_{i=1}^N (α*_i − α_i) = 0, (12.41)
            α_i α*_i = 0

Georgetown University SVM 45

SVM and Regression 4

Important Points

Similar to earlier, the solution f̂(x) depends only on inner products between inputs.

Due to the constraints, typically only some of the values (α̂*_i − α̂_i) are non-zero; only these contribute to the solution function, and the corresponding observations are the support vectors. λ is the traditional regularisation parameter, playing the role C played earlier.

Georgetown University SVM 46

Multiclass Classification

Perform pairwise (one-vs-one) classification of a sample and select the class that wins the most pairwise contests.
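scikit-learn's SVC does exactly this: for more than two classes it trains one binary SVM per pair of classes and lets the pairwise winners vote. A short sketch on the 3-class iris data:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)           # 3 classes
svc = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)

print(svc.decision_function(X[:1]).shape)   # (1, 3): one score per class pair
print(svc.predict(X[:1]))                   # class with the most pairwise wins
```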

Georgetown University SVM 47