
Support Vector Machines and Kernel Methods


Page 1: Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods

Machine Learning
March 25, 2010

Page 2: Support Vector Machines and Kernel Methods

Last Time

• Basics of the Support Vector Machines

Page 3: Support Vector Machines and Kernel Methods


Review: Max Margin

• How can we pick which decision boundary is best?

• Maximize the size of the margin.

Are these really “equally valid”?

[Figure: the same data separated by a small-margin boundary vs. a large-margin boundary.]

Page 4: Support Vector Machines and Kernel Methods


• The margin is the projection of x1 – x2 onto w, the normal of the hyperplane.

Review: Max Margin Optimization

Size of the Margin:

Projection:
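The equations for this slide are not reproduced in the extracted text. A standard reconstruction (an assumption), using the canonical scaling in which the closest points x1 and x2 on either side of the boundary satisfy w·x + b = ±1:

\[
\text{margin} \;=\; \frac{w^\top (x_1 - x_2)}{\lVert w \rVert}
\;=\; \frac{(w^\top x_1 + b) - (w^\top x_2 + b)}{\lVert w \rVert}
\;=\; \frac{2}{\lVert w \rVert}.
\]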

Page 5: Support Vector Machines and Kernel Methods


Review: Maximizing the margin

• Goal: maximize the margin

Constraint: linear separability of the data by the decision boundary
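The optimization itself is not in the extracted text; it is presumably the standard hard-margin problem:

\[
\max_{w,\,b} \; \frac{2}{\lVert w \rVert}
\quad \text{subject to} \quad
y_i\,(w^\top x_i + b) \ge 1 \;\; \text{for all } i,
\]

which is equivalent to minimizing \(\tfrac{1}{2}\lVert w \rVert^2\) under the same linear-separability constraints.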

Page 6: Support Vector Machines and Kernel Methods

Review: Max Margin Loss Function

Primal

Dual
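The primal and dual objectives shown on the slide do not survive the extraction; their standard forms are:

Primal (Lagrangian):
\[
L(w, b, \alpha) \;=\; \frac{1}{2}\lVert w \rVert^2 \;-\; \sum_i \alpha_i \big[\, y_i (w^\top x_i + b) - 1 \,\big], \qquad \alpha_i \ge 0.
\]

Dual:
\[
\max_{\alpha} \; \sum_i \alpha_i \;-\; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, x_i^\top x_j
\quad \text{subject to} \quad \alpha_i \ge 0, \;\; \sum_i \alpha_i y_i = 0.
\]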

Page 7: Support Vector Machines and Kernel Methods


Review: Support Vector Expansion

• When αi is non-zero, xi is a support vector.

• When αi is zero, xi is not a support vector.

New decision function: independent of the dimension of x!
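In its standard form, the support vector expansion and the resulting decision function are:

\[
w = \sum_i \alpha_i y_i \, x_i
\qquad \Rightarrow \qquad
f(x) = \operatorname{sign}\!\Big( \sum_{i \in \text{SV}} \alpha_i y_i \, x_i^\top x \;+\; b \Big),
\]

so x enters only through dot products with the support vectors, never through its individual coordinates.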

Page 8: Support Vector Machines and Kernel Methods


Review: Visualization of Support Vectors

Page 9: Support Vector Machines and Kernel Methods

Today

• How support vector machines deal with data that are not linearly separable:
  – Soft margin
  – Kernels!

Page 10: Support Vector Machines and Kernel Methods


Why we like SVMs

• They work.
  – Good generalization.

• Easily interpreted.
  – The decision boundary is based on the data, in the form of the support vectors.
  – Not so in multilayer perceptron networks.

• Principled bounds on test error from learning theory (VC dimension).

Page 11: Support Vector Machines and Kernel Methods


SVM vs. MLP

• SVMs have many fewer parameters.
  – SVM: maybe just a kernel parameter.
  – MLP: the number and arrangement of nodes, plus the learning rate eta.

• SVM: a convex optimization task.
  – MLP: the likelihood is non-convex, so training can get stuck in local minima.

Page 12: Support Vector Machines and Kernel Methods

Linear Separability

• So far, support vector machines can only handle linearly separable data.

• But most data isn't linearly separable.

Page 13: Support Vector Machines and Kernel Methods


• Points are allowed within the margin, but a cost is introduced.

Soft margin example

Hinge Loss
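The hinge loss referred to here is, in its usual form,

\[
\ell_{\text{hinge}}(x_i, y_i) \;=\; \max\!\big( 0,\; 1 - y_i\, (w^\top x_i + b) \big),
\]

which is zero for points beyond the margin on the correct side and grows linearly as a point slips into the margin or across the boundary.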

Page 14: Support Vector Machines and Kernel Methods


Soft margin classification

• There can be outliers on the wrong side of the decision boundary, or outliers that force a small margin.
• Solution: introduce a penalty term into the constraint function.
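A standard way to write the penalized ("soft-margin") primal, assuming the conventional slack variables ξi and cost parameter C:

\[
\min_{w,\, b,\, \xi} \;\; \frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i
\quad \text{subject to} \quad
y_i\,(w^\top x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0.
\]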

Page 15: Support Vector Machines and Kernel Methods


Soft Max Dual

Still Quadratic Programming!
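In its standard form, the soft-margin dual differs from the hard-margin dual only in the box constraint on each αi:

\[
\max_{\alpha} \; \sum_i \alpha_i \;-\; \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, x_i^\top x_j
\quad \text{subject to} \quad 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0.
\]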

Page 16: Support Vector Machines and Kernel Methods


Probabilities from SVMs

• Support Vector Machines are discriminant functions.
  – Discriminant functions: f(x) = c
  – Discriminative models: f(x) = argmax_c p(c|x)
  – Generative models: f(x) = argmax_c p(x|c)p(c)/p(x)

• No (principled) probabilities from SVMs.
• SVMs are not based on probability distribution functions of class instances.

Page 17: Support Vector Machines and Kernel Methods


Efficiency of SVMs

• Not especially fast.
• Training: O(n^3)
  – Quadratic programming efficiency.
• Evaluation: O(n)
  – Need to evaluate against each support vector (potentially n of them).

Page 18: Support Vector Machines and Kernel Methods

Kernel Methods

• Points that are not linearly separable in 2 dimensions might be linearly separable in 3.

Page 19: Support Vector Machines and Kernel Methods

Kernel Methods

• Points that are not linearly separable in 2 dimensions might be linearly separable in 3.

Page 20: Support Vector Machines and Kernel Methods

Kernel Methods

• We will look at a way to add dimensionality to the data in order to make it linearly separable (see the sketch below).

• In the extreme, we can construct a dimension for each data point.

• This may lead to overfitting.
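A minimal sketch of the idea (the data, the feature map, and the threshold are illustrative assumptions, not the figures from the slides): points inside vs. outside a circle are not linearly separable in 2-D, but appending the squared radius x1² + x2² as a third coordinate lets a plane separate them.

```python
import numpy as np

# Hypothetical data: label +1 inside the unit circle, -1 outside.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1.0, 1, -1)

def phi(X):
    """Explicit feature map from R^2 to R^3: append the squared radius."""
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0]**2 + X[:, 1]**2])

Z = phi(X)
# In the lifted space the plane z = 1 separates the two classes exactly.
print(np.all((Z[:, 2] < 1.0) == (y == 1)))  # True
```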

Page 21: Support Vector Machines and Kernel Methods


Remember the Dual?

Primal

Dual

Page 22: Support Vector Machines and Kernel Methods


Basis of Kernel Methods

• The decision process doesn't depend on the dimensionality of the data.
• We can map the data to a higher-dimensional space.

• Note: data points only appear within a dot product.
• The objective function is based on the dot product of data points, not the data points themselves.

Page 23: Support Vector Machines and Kernel Methods


Basis of Kernel Methods

• Since data points only appear within a dot product, we can map to another space by replacing that dot product.

• The objective function is based on the dot product of data points, not the data points themselves.
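The replacement is the kernel trick: every dot product in the dual is swapped for a kernel evaluation,

\[
x_i^\top x_j \;\longrightarrow\; K(x_i, x_j) = \phi(x_i)^\top \phi(x_j),
\qquad
\max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j).
\]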

Page 24: Support Vector Machines and Kernel Methods

Kernels

• The objective function is based on a dot product of data points, rather than the data points themselves.

• We can represent this dot product as a kernel.
  – Kernel function, kernel matrix

• The kernel matrix K(xi, xj) has finite (if large) dimensionality, unrelated to the dimensionality of x.

Page 25: Support Vector Machines and Kernel Methods

Kernels

• Kernels correspond to a mapping ϕ of the data into another feature space.

Page 26: Support Vector Machines and Kernel Methods

Kernels

• Gram Matrix:

Consider the following Kernel:

Page 27: Support Vector Machines and Kernel Methods

Kernels

• Gram Matrix:

Consider the following Kernel:
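The specific kernel on the slide is not reproduced in the extracted text. As a hedged illustration, this is how a Gram (kernel) matrix is built for any kernel function, using a plain linear kernel as a stand-in:

```python
import numpy as np

def gram_matrix(X, kernel):
    """Gram matrix K[i, j] = kernel(x_i, x_j) over all pairs of rows of X."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

# Stand-in example: the linear kernel on a tiny dataset.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print(gram_matrix(X, kernel=lambda a, b: a @ b))  # 3x3 symmetric matrix of dot products
```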

Page 28: Support Vector Machines and Kernel Methods

Kernels

• In general we don't need to know the form of ϕ.

• Just specifying the kernel function is sufficient.
• A good kernel: computing K(xi, xj) is cheaper than computing ϕ(xi).
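A concrete (illustrative) instance of "K is cheaper than ϕ": the degree-2 polynomial kernel (x·z)² costs one dot product, while its explicit feature map produces a number of monomial features that grows quadratically with the input dimension.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly2_kernel(a, b):
    """(a . b)^2 computed directly: one dot product, regardless of dimension."""
    return (a @ b) ** 2

def phi_poly2(x):
    """Explicit degree-2 monomial map (sqrt(2) scaling on the cross terms)."""
    return np.array([x[i] * x[j] * (1.0 if i == j else np.sqrt(2))
                     for i, j in combinations_with_replacement(range(len(x)), 2)])

a, b = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])
print(poly2_kernel(a, b), phi_poly2(a) @ phi_poly2(b))  # both 20.25 (up to rounding)
```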

Page 29: Support Vector Machines and Kernel Methods

Kernels

• Valid kernels:
  – Symmetric
  – Must be decomposable into ϕ functions
    • Harder to show.
    • Equivalently, the Gram matrix is positive semi-definite (psd).
    • Positive entries are definitely psd.
    • Negative entries may still be psd.
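A quick numerical check of the psd condition on a Gram matrix (an assumed utility, not from the slides):

```python
import numpy as np

def is_valid_gram(K, tol=1e-10):
    """A kernel (Gram) matrix must be symmetric and positive semi-definite."""
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh(K) >= -tol)  # all eigenvalues >= 0, up to tolerance
    return symmetric and psd
```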

Page 30: Support Vector Machines and Kernel Methods

Kernels

• Given valid kernels K(x,z) and K'(x,z), more kernels can be made from them:
  – cK(x,z), for c > 0
  – K(x,z) + K'(x,z)
  – K(x,z)K'(x,z)
  – exp(K(x,z))
  – …and more

Page 31: Support Vector Machines and Kernel Methods

Incorporating Kernels in SVMs

• Optimize the αi's and the bias w.r.t. the kernel.
• Decision function:
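A sketch of the kernelized decision function (the variable names are illustrative; the αi, support vectors, and bias are assumed to come from solving the dual):

```python
import numpy as np

def decision_function(x, support_X, support_y, alphas, b, kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, summed over support vectors only."""
    return sum(a * yi * kernel(xi, x)
               for a, yi, xi in zip(alphas, support_y, support_X)) + b

def predict(x, support_X, support_y, alphas, b, kernel):
    return np.sign(decision_function(x, support_X, support_y, alphas, b, kernel))
```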

Page 32: Support Vector Machines and Kernel Methods

Some popular kernels

• Polynomial kernels
• Radial basis functions
• String kernels
• Graph kernels

Page 33: Support Vector Machines and Kernel Methods


Polynomial Kernels

• The kernel is a polynomial power of the original dot product (shifted by a constant c).

• If c is large, the kernel emphasizes the linear terms.
• If c is small, it emphasizes the higher-order terms.
• Very fast to calculate.
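The polynomial kernel these bullets describe is conventionally written

\[
K(x, z) = (x^\top z + c)^d,
\]

and expanding the power shows why a large c weights the low-order (near-linear) terms more heavily, while a small c emphasizes the high-order terms.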

Page 34: Support Vector Machines and Kernel Methods


Radial Basis Functions

• The inner product of two points is related to the distance in space between the two points.

• Placing a bump on each point.
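The radial basis function (Gaussian) kernel is conventionally written

\[
K(x, z) = \exp\!\left( -\frac{\lVert x - z \rVert^2}{2\sigma^2} \right),
\]

so the kernel value depends only on the distance between the two points, and each training point contributes a Gaussian "bump" centered on itself.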

Page 35: Support Vector Machines and Kernel Methods


String kernels

• Not a Gaussian, but still a legitimate kernel (one concrete example is sketched below):
  – K(s, s') = difference in length
  – K(s, s') = count of different letters
  – K(s, s') = minimum edit distance

• Kernels allow for infinite-dimensional inputs.
  – The kernel is a FUNCTION defined over the input space; we don't need to specify the input space exactly.

• We don't need to manually encode the input.
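As a small illustration (an assumption, not one of the slide's exact definitions): a letter-count kernel is valid because it is the dot product of explicit bag-of-characters feature vectors.

```python
from collections import Counter

def char_count_kernel(s, t):
    """Dot product of the two strings' letter-count vectors (a valid, psd kernel)."""
    cs, ct = Counter(s), Counter(t)
    return sum(cs[ch] * ct[ch] for ch in cs.keys() & ct.keys())

print(char_count_kernel("kernel", "learner"))  # 8
```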

Page 36: Support Vector Machines and Kernel Methods


Graph Kernels

• Define the kernel function based on graph properties.
  – These properties must be computable in poly-time:
    • Walks of length < k
    • Paths
    • Spanning trees
    • Cycles

• Kernels allow us to incorporate knowledge about the input without direct “feature extraction”.
  – Just similarity in some space.

Page 37: Support Vector Machines and Kernel Methods

Where else can we apply Kernels?

• Anywhere that the dot product of x is used in an optimization.

• Perceptron:
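A minimal sketch of the kernelized perceptron (the update rule is the standard one; the interface is an assumption): the weight vector w = Σ αi yi ϕ(xi) is never formed explicitly, only kernel evaluations are used.

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=10):
    """Train a kernel perceptron; returns one coefficient per training point."""
    y = np.asarray(y)            # labels in {+1, -1}
    n = len(y)
    alpha = np.zeros(n)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])  # precomputed Gram matrix
    for _ in range(epochs):
        for i in range(n):
            # Predict with kernel evaluations only; update the coefficient on a mistake.
            if np.sign(np.sum(alpha * y * K[:, i])) != y[i]:
                alpha[i] += 1.0
    return alpha
```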

Page 38: Support Vector Machines and Kernel Methods

Kernels in Clustering

• In clustering, it's very common to define cluster similarity by the distance between points.
  – k-nn (k-means)

• This distance can be replaced by a kernel.

• We’ll return to this more in the section on unsupervised techniques
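The substitution works because squared distances in the feature space can be computed from kernel evaluations alone:

\[
\lVert \phi(x) - \phi(z) \rVert^2 \;=\; K(x, x) - 2K(x, z) + K(z, z),
\]

so any distance-based clustering step can be kernelized without ever forming ϕ explicitly.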

Page 39: Support Vector Machines and Kernel Methods

Bye

• Next time:
  – Supervised Learning Review
  – Clustering