65
Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 1

Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

Embed Size (px)

Citation preview

Page 1: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

1

Support vector machine

LING 572Fei Xia

Week 8: 2/23/2010

Page 2: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

2

Why another learning method?

• It is based on some “beautifully simple” ideas.

• It can use the kernel trick: kernel-based vs. feature-based.

• It produces good performance on many applications.

• It has some nice properties w.r.t. training and test error rates.

Page 3: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

3

Kernel methods• Rich family of “pattern analysis” algorithms

• Best known element is the Support Vector Machine (SVM)

• Very general task: given a set of data, find patterns

• Examples:– Classification– Regression– Correlation– Clustering– ….

Page 4: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

4

History of SVM

• Linear classifier: 1962– Use a hyperplane to separate examples– Choose the hyperplane that maximizes the

minimal margin

• Kernel trick: 1992

Page 5: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

5

History of SVM (cont)

• Soft margin: 1995– To deal with non-separable data or noise

• Transductive SVM: 1999– To take advantage of unlabeled data

Page 6: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

6

Main ideas

• Use a hyperplane to separate the examples.

• Among all the hyperplanes, choose the one with the maximum margin.

• Maximizing the margin is the same as minimizing ||w|| subject to some constraints.

Page 7: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

7

Main ideas (cont)

• For the data set that are not linear separable, map the data to a higher dimensional space and separate them there by a hyperplane.

• The Kernel trick allows the mapping to be “done” efficiently.

• Soft margin deals with noise and/or inseparable data set.

Page 8: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

8

Outline

• Linear SVM– Maximizing the margin– Soft margin

• Nonlinear SVM– Kernel trick

• A case study

• Handling multi-class problems

Page 9: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

9

Papers

• (Manning et al., 2008) – Chapter 15

• (Collins and Duffy, 2001): tree kernel

• (Scholkopf, 2000)– Sections 3, 4, and 7

• (Hearst et al., 1998)

Page 10: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

Inner product vs. dot product

Page 11: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

Dot product

Page 12: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

Inner product• An inner product is a generalization of the dot

product.

• It is a function that satisfies the following properties:

Page 13: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

Some examples

Page 14: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

14

Linear SVM

Page 15: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

15

The setting

• Input: x 2 X

• Output: y 2 Y, Y = {-1, +1}

• Training set: S = {(x1, y1), …, (xi, yi)}

• Goal: Find a function y = f(x) that fits the data: f: X ! R

Page 16: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

16

Notations

Page 17: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

17

Inner product• Inner product between two vectors

Page 18: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

18

Inner product (cont)

Page 19: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

19

Hyperplane

A hyperplane:

<w, x> + b > 0

<w, x> + b < 0

Page 20: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

20

An example of hyperplane: <w,x> + b = 0

(2,0)

(0,1)

x1

x2

x1 + 2 x2 – 2 = 0 w=(1,2), b=-2

10x1 + 20 x2 – 20 = 0 w=(10,20), b=-20

Page 21: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

21

Finding a hyperplane

• Given the training instances, we want to find a hyperplane that separates them.

• If there are more than one hyperplane, SVM chooses the one with the maximum margin.

Page 22: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

22

Maximizing the margin

++

+++

+

<w,x>+b=0Training: to find w and b.

Page 23: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

23

Support vectors

++

++++

<w,x>+b=0<w,x>+b=-1

<w,x>+b=1

Page 24: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

24

Training: Max margin = minimize norm

subject to the constraint

This is a quadratic programming (QP) problem.

++

++++

Page 25: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

25

Decoding

Hyperplane: w=(1,2), b=-2 f(x) = x1 + 2 x2 – 2

x=(3,1)

x=(0,0)

f(x) = 3+2-2 = 3 > 0f(x) = 0+0-2 = -2 < 0

Page 26: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

26

Lagrangian**

Page 27: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

27

Finding the solution

• This is a Quadratic Programming (QP) problem.

• The function is convex and there is no local minima.

• Solvable in polynomial time.

Page 28: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

28

The dual problem **

• Maximize

• Subject to

Page 29: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

29

Page 30: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

30

The solution

Training:

Decoding:

Page 31: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

kNN vs. SVM• Majority voting: c* = arg maxc g(c) • Weighted voting: weighting is on each neighbor c* = arg maxc i wi (c, fi(x))

• Weighted voting allows us to use more training examples: e.g., wi = 1/dist(x, xi) We can use all the training examples.

31

(weighted kNN, 2-class)

(SVM)

Page 32: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

32

Summary of linear SVM

• Main ideas:– Choose a hyperplane to separate instances: <w,x> + b = 0 – Among all the allowed hyperplanes, choose the

one with the max margin– Maximizing margin is the same as minimizing ||

w||– Choosing w and b is the same as choosing ®i

Page 33: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

33

The problem

Page 34: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

34

The dual problem

Page 35: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

35

Remaining issues

• Linear classifier: what if the data is not separable?– The data would be linear separable without noise soft margin

– The data is not linear separable map the data to a higher-dimension space

Page 36: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

36

The dual problem**

• Maximize

• Subject to

Page 37: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

37

Outline

• Linear SVM– Maximizing the margin– Soft margin

• Nonlinear SVM– Kernel trick

• A case study

• Handling multi-class problems

Page 38: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

38

Non-linear SVM

Page 39: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

39

The highlight

• Problem: Some data are not linear separable.

• Intuition: to transform the data to a high dimension space

Input space Feature space

Page 40: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

40

Example: the two spirals

Separated by a hyperplane in feature space (Gaussian kernels)

Page 41: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

41

Feature space

• Learning a non-linear classifier using SVM:– Define Á– Calculate Á(x) for each training example– Find a linear SVM in the feature space.

• Problems:– Feature space can be high dimensional or even have

infinite dimensions.– Calculating Á(x) is very inefficient and even impossible.– Curse of dimensionality

Page 42: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

42

Kernels

• Kernels are similarity functions that return inner products between the images of data points.

• Kernels can often be computed efficiently even for very high dimensional spaces.

• Choosing K is equivalent to choosing Á. the feature space is implicitly defined by K

Page 43: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

43

An example

Page 44: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

44

An example**

Page 45: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

45

From Page 750 of (Russell and Norvig, 2002)

Page 46: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

46

Another example**

Page 47: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

47

The kernel trick

• No need to know what Á is and what the feature space is.

• No need to explicitly map the data to the feature space.

• Define a kernel function K, and replace the dot produce <x,z> with a kernel function K(x,z) in both training and testing.

Page 48: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

48

TrainingMaximize

Subject to

Page 49: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

49

DecodingLinear SVM: (without mapping)

Non-linear SVM: w could be infinite dimensional

Page 50: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

50

Kernel vs. features

Page 51: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

A tree kernel

Page 52: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

52

Common kernel functions

• Linear :

• Polynominal:

• Radial basis function (RBF):

• Sigmoid:

Page 53: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

53

Other kernels

• Kernels for– trees– sequences– sets– graphs– general structures– …

• A tree kernel example next time

Page 54: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

54

The choice of kernel function

• Given a function, we can test whether it is a kernel function by using Mercer’s theorem.

• Different kernel functions could lead to very different results.

• Need some prior knowledge in order to choose a good kernel.

Page 55: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

55

Summary so far

• Find the hyperplane that maximizes the margin.

• Introduce soft margin to deal with noisy data

• Implicitly map the data to a higher dimensional space to deal with non-linear problems.

• The kernel trick allows infinite number of features and efficient computation of the dot product in the feature space.

• The choice of the kernel function is important.

Page 56: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

56

MaxEnt vs. SVMMaxEnt SVM

Modeling Maximize P(Y|X, ¸) Maximize the margin

Training Learn ¸i for each feature function

Learn ®i for each training instance

Decoding Calculate P(y|x) Calculate the sign of f(x). It is not prob

Things to decide FeaturesRegularizationTraining algorithm

KernelRegularizationTraining algorithmBinarization

Page 57: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

57

More info

• Website: www.kernel-machines.org

• Textbook (2000): www.support-vector.net

• Tutorials: http://www.svms.org/tutorials/

• Workshops at NIPS

Page 58: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

58

Soft margin

Page 59: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

59

The highlight

• Problem: Some data set is not separable or there are mislabeled examples.

• Idea: split the data as cleanly as possible, while maximizing the distance to the nearest cleanly split examples.

• Mathematically, introduce the slack variables

Page 60: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

60

Objective function

• Minimizing

such that

where C is the penalty, k = 1 or 2

Page 61: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

61

Additional slides

Page 62: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

62

Linear kernel

• The map Á is linear.

• The kernel adjusts the weight of the features according to their importance.

Page 63: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

63

The Kernel Matrix(a.k.a. the Gram matrix)

K(1,1) K(1,2) K(1,3) … K(1,m)

K(2,1) K(2,2) K(2,3) … K(2,m)

K(m,1) K(m,2) K(m,3) … K(m,m)

Page 64: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

64

Mercer’s Theorem

• The kernel matrix is symmetric positive definite.

• Any symmetric positive definite matrix can be regarded as a kernel matrix; that is, there exists a Á such that K(x,z) = <Á(x), Á(z)>

Page 65: Support vector machine LING 572 Fei Xia Week 8: 2/23/2010 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A 1

65

Making kernels

• The set of kernels is closed under some operations. For instance, if K1 and K2 are kernels, so are the following:– K1 +K2

– cK1 and cK2 for c > 0

– cK1 +dK2 for c> 0 and d>0

• One can make complicated kernels from simples ones