Page 1:

Support Vector Machines & Kernels Lecture 6

David Sontag

New York University

Slides adapted from Luke Zettlemoyer, Carlos Guestrin, and Vibhav Gogate

Page 2:

SVMs in the dual

Primal: Solve for w, b (and slack variables ξj):

  minimize over w, b, ξ:   (1/2)‖w‖² + C Σj ξj
  subject to:   yj (w·xj + b) ≥ 1 − ξj  and  ξj ≥ 0  for all j

Dual: Solve for α:

  maximize over α:   Σj αj − (1/2) Σj,k αj αk yj yk (xj·xk)
  subject to:   0 ≤ αj ≤ C  for all j,  and  Σj αj yj = 0

The data enters the dual only through the dot products (xj·xk).

The dual is also a quadratic program, and can be efficiently solved to optimality
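The dual above can be handed to any off-the-shelf quadratic programming solver. Below is a minimal sketch, assuming NumPy and the cvxopt package; the function name svm_dual and the variables X (one row per example), y (labels in {−1, +1}), and C are illustrative, not part of the slides.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y, C):
    """Solve the soft-margin SVM dual with a generic QP solver (sketch)."""
    n = X.shape[0]
    K = X @ X.T                                       # all pairwise dot products x_j . x_k
    P = matrix(np.outer(y, y) * K)                    # quadratic term: y_j y_k (x_j . x_k)
    q = matrix(-np.ones(n))                           # linear term (we maximize sum_j alpha_j)
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))    # inequality constraints 0 <= alpha_j <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))        # equality constraint sum_j alpha_j y_j = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])                         # the dual variables alpha
```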

Page 3:

Support vectors

•  Complementary slackness conditions (they hold at the optimum):

   αj > 0  ⟹  yj (w·xj + b) = 1 − ξj
   αj < C  ⟹  ξj = 0

•  Support vectors: points xj such that αj > 0 (includes all j such that yj (w·xj + b) = 1, but also additional points where ξj > 0, i.e. points inside the margin or misclassified); a small sketch of reading them off the dual solution follows below.

•  Note: the SVM dual solution may not be unique!
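Given the αj from the dual, the support vectors, w, and the offset b follow directly from these conditions. A minimal NumPy sketch, using the same illustrative names X, y, C as above and assuming a linear kernel:

```python
import numpy as np

def support_vectors_and_b(X, y, alpha, C, tol=1e-6):
    """Read support vectors, w, and b off the dual solution (sketch)."""
    sv = alpha > tol                          # support vectors: alpha_j > 0
    w = (alpha[sv] * y[sv]) @ X[sv]           # w = sum_j alpha_j y_j x_j
    # For margin support vectors (0 < alpha_j < C), slackness gives y_j (w.x_j + b) = 1.
    on_margin = (alpha > tol) & (alpha < C - tol)
    b = np.mean(y[on_margin] - X[on_margin] @ w)
    return sv, w, b
```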

Page 4:

Dual SVM interpretation: Sparsity

[Figure: maximum-margin separator showing the hyperplanes w·x + b = +1, w·x + b = 0, and w·x + b = −1.]

Support Vectors:
•  αj > 0

Non-support Vectors:
•  αj = 0
•  moving them will not change w

Final solution tends to be sparse

• αj=0 for most j

• don’t need to store these points to compute w or make predictions

Page 5:

Classification rule using dual solution

Using the dual solution, classify a new example x as

  ŷ = sign( Σj αj yj (xj·x) + b )

where (xj·x) is the dot product of the new example's feature vector with each support vector; only the points with αj > 0 contribute to the sum.
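As a concrete sketch of this rule (the names are illustrative: X_sv, y_sv, alpha_sv hold only the support vectors):

```python
import numpy as np

def predict(x_new, X_sv, y_sv, alpha_sv, b):
    """Classify x_new using only the support vectors (sketch)."""
    dots = X_sv @ x_new                        # dot product of x_new with each support vector
    return np.sign(np.sum(alpha_sv * y_sv * dots) + b)
```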

Page 6:

SVM with kernels

•  Never compute the features φ(x) explicitly!!!
   –  Compute the dot products K(xj, xk) = φ(xj)·φ(xk) in closed form

•  O(n²) time in the size of the dataset to compute the objective
   –  much work on speeding this up

Predict with:

  ŷ = sign( Σj αj yj K(xj, x) + b )
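To make the O(n²) point concrete, here is a sketch of building the kernel (Gram) matrix and evaluating the dual objective without ever forming φ(x) explicitly; K_fun, X, y, and alpha are illustrative names.

```python
import numpy as np

def gram_matrix(X, K_fun):
    """n x n matrix of kernel values -- the O(n^2) cost in the dataset size."""
    n = X.shape[0]
    K = np.empty((n, n))
    for j in range(n):
        for k in range(n):
            K[j, k] = K_fun(X[j], X[k])
    return K

def dual_objective(alpha, y, K):
    """sum_j alpha_j - 1/2 sum_{j,k} alpha_j alpha_k y_j y_k K(x_j, x_k)."""
    v = alpha * y
    return alpha.sum() - 0.5 * v @ K @ v
```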

Page 7:

Efficient dot-product of polynomials

Setting ∂L/∂w = w − Σj αj yj xj to zero gives w = Σj αj yj xj, so predictions depend on the data only through dot products, even when the feature mapping is huge, e.g. φ(x) = (x(1), ..., x(n), x(1)x(2), x(1)x(3), ..., e^x(1), ...).

Polynomials of degree exactly d:

d = 1:
  φ(u) = (u1, u2)
  φ(u)·φ(v) = u1 v1 + u2 v2 = u·v

d = 2:
  φ(u) = (u1², u1 u2, u2 u1, u2²)
  φ(u)·φ(v) = u1² v1² + 2 u1 v1 u2 v2 + u2² v2² = (u1 v1 + u2 v2)² = (u·v)²

For any d (we will skip the proof):
  φ(u)·φ(v) = (u·v)^d

•  Cool! Taking a dot product and exponentiating gives the same result as mapping into the high-dimensional space and then taking the dot product there.
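A quick numerical sanity check of the d = 2 identity, with the explicit degree-2 feature map written out (a small sketch, not part of the slides):

```python
import numpy as np

def phi2(x):
    """Explicit feature map for polynomials of degree exactly 2 (two inputs)."""
    return np.array([x[0] * x[0], x[0] * x[1], x[1] * x[0], x[1] * x[1]])

u = np.array([1.0, 2.0])
v = np.array([0.5, -3.0])

lhs = phi2(u) @ phi2(v)       # dot product in the mapped space
rhs = (u @ v) ** 2            # kernel trick: (u.v)^2
assert np.isclose(lhs, rhs)   # the two quantities agree
```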

Page 8:

Quadratic kernel

[Tommi Jaakkola]

Page 9:

Quadratic kernel

[Cynthia Rudin]

Feature mapping given by φ(x) = (x1², √2 x1x2, x2², √2 x1, √2 x2, 1), for which φ(u)·φ(v) = (1 + u·v)².

Page 10:

Common kernels

•  Polynomials of degree exactly d:   K(u, v) = (u·v)^d

•  Polynomials of degree up to d:   K(u, v) = (u·v + 1)^d

•  Gaussian kernels:   K(u, v) = exp( −‖u − v‖² / (2σ²) ), where ‖u − v‖² is the Euclidean distance, squared

•  And many others: very active area of research! (e.g., structured kernels that use dynamic programming to evaluate)

Minimal sketches of these kernel functions are given below.
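These sketches assume NumPy; d and sigma are the parameters you would tune.

```python
import numpy as np

def poly_exact(u, v, d):
    """Polynomials of degree exactly d."""
    return (u @ v) ** d

def poly_up_to(u, v, d):
    """Polynomials of degree up to d."""
    return (u @ v + 1.0) ** d

def gaussian(u, v, sigma):
    """Gaussian (RBF) kernel; the exponent uses the squared Euclidean distance."""
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))
```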

Page 11:

Gaussian kernel

[Cynthia Rudin] [mblondel.org]

Page 12:

Kernel algebra

[Justin Domke]

Q: How would you prove that the "Gaussian kernel" is a valid kernel?
A: Expand the squared Euclidean norm, ‖u − v‖² = ‖u‖² − 2 u·v + ‖v‖², so that

  exp( −‖u − v‖² / (2σ²) ) = exp( −‖u‖² / (2σ²) ) · exp( u·v / σ² ) · exp( −‖v‖² / (2σ²) )

Then, apply (e) from above.

To see that the middle factor exp( u·v / σ² ) is a kernel, use the Taylor series expansion of the exponential, together with repeated application of (a), (b), and (c):

  exp( u·v / σ² ) = Σ_{i≥0} (u·v)^i / (σ^{2i} i!)

The feature mapping is infinite dimensional!
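A kernel is valid exactly when every Gram matrix it produces is positive semidefinite. The argument above establishes this for the Gaussian kernel; the snippet below is only a numerical spot-check of that property on random data (a sketch, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                 # 30 random points in 5 dimensions
sigma = 1.5

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))     # Gaussian kernel Gram matrix

eigvals = np.linalg.eigvalsh(K)              # eigenvalues of a symmetric matrix
print(eigvals.min())                         # should be >= 0 up to round-off error
assert eigvals.min() > -1e-8
```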

Page 13:

Overfitting?

•  Huge feature space with kernels: should we worry about overfitting? –  SVM objective seeks a solution with large margin

•  Theory says that large margin leads to good generalization (we will see this in a couple of lectures)

–  But everything overfits sometimes!!!

–  Can control by:

•  Setting C

•  Choosing a better Kernel

•  Varying parameters of the kernel (width of the Gaussian, etc.); see the tuning sketch below
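For instance, a common way to set both C and the Gaussian width is a small search scored on held-out data. A sketch assuming scikit-learn; the toy data and the grids of C and gamma values are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data so the sketch is self-contained: two noisy Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(+1.0, 1.0, (100, 2))])
y = np.array([-1] * 100 + [+1] * 100)
X_train, y_train = X[::2], y[::2]            # even rows: training set
X_val, y_val = X[1::2], y[1::2]              # odd rows: validation set

best = None
for C in [0.1, 1.0, 10.0, 100.0]:
    for gamma in [0.01, 0.1, 1.0]:           # for the RBF kernel, gamma plays the role of 1/(2 sigma^2)
        clf = SVC(kernel='rbf', C=C, gamma=gamma).fit(X_train, y_train)
        acc = clf.score(X_val, y_val)        # score on validation data, not training data
        if best is None or acc > best[0]:
            best = (acc, C, gamma)

print('best (validation accuracy, C, gamma):', best)
```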

Page 14:

Software

•  SVMlight: one of the most widely used SVM packages. Fast optimization, can handle very large datasets, C++ code.

•  LIBSVM

•  Both of these handle multi-class, weighted SVM for unbalanced data, etc.

•  There are several new approaches to solving the SVM objective that can be much faster:

–  Stochastic subgradient method (discussed in a few lectures)

–  Distributed computation (also to be discussed)

•  See http://mloss.org, “machine learning open source software”


Page 15:

Machine learning methodology: Cross Validation

Page 16:

Choosing among several hypotheses

•  Suppose you are choosing between several different hypotheses, e.g.

•  For the SVM, we get one linear classifier for each choice of the regularization parameter C

•  How do you choose between them?

[Figure: several candidate regression fits to the same data.]

Page 17:

General strategy

Split the data up into three parts: a training set, a validation set, and a test set.

This assumes that the available data is randomly allocated to these three, e.g. 60/20/20.
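A minimal NumPy sketch of such a random 60/20/20 allocation (the names are illustrative):

```python
import numpy as np

def split_data(X, y, seed=0):
    """Randomly allocate the data: 60% training, 20% validation, 20% test."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    tr, va, te = idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])
```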

Page 18:

Typical approach

•  Learn a model from the training set (e.g., fix a C and learn the SVM)

•  Estimate your future performance with the validation data

This is the model you learned.

Page 19:

More data is better

With more data you can learn a better model.

[Figure: blue: observed data; red: predicted curve; green: true distribution.]

Compare the predicted curves.

Page 20:

Cross Validation

Recycle the data!

Use (almost) all of this for training:

Page 21:

LOOCV (Leave-one-out Cross Validation)

Let's say we have N data points, and k indexes the data points, i.e. k = 1...N.

Let (xk, yk) be the kth example (your single test data point).

Temporarily remove (xk, yk) from the dataset.

Train on the remaining N-1 data points.

Test your error on (xk, yk).

Do this for each k = 1...N and report the average error.


Once the best parameters (e.g., choice of C for the SVM) are found, re-train using all of the training data
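The procedure above transcribes almost line for line into code. A sketch, where train_fn and error_fn are hypothetical stand-ins for whatever learner and loss you are using:

```python
import numpy as np

def loocv_error(X, y, train_fn, error_fn):
    """Leave-one-out cross validation: N runs, each holding out one point."""
    N = X.shape[0]
    errors = []
    for k in range(N):
        keep = np.arange(N) != k                    # temporarily remove (x_k, y_k)
        model = train_fn(X[keep], y[keep])          # train on the remaining N-1 points
        errors.append(error_fn(model, X[k], y[k]))  # test on the held-out point
    return np.mean(errors)                          # report the average error
```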

Page 22:

LOOCV (Leave-one-out Cross Validation)

There are N data points. Repeat learning N times.

Notice the test data (shown in red) is changing each time

Page 23:

K-fold cross validation

[Figure: the data is divided into k splits; in each run, train on the other (k - 1) splits and validate on the remaining split.]

In 3-fold cross validation, there are 3 runs. In 5-fold cross validation, there are 5 runs. In 10-fold cross validation, there are 10 runs.

The error is averaged over all runs.
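The same idea in code, in the style of the LOOCV sketch above and with the same hypothetical train_fn and error_fn:

```python
import numpy as np

def kfold_error(X, y, k, train_fn, error_fn, seed=0):
    """k-fold cross validation: k runs, each validating on one split."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)   # train on the other (k - 1) splits
        model = train_fn(X[train], y[train])
        errors.append(error_fn(model, X[fold], y[fold]))
    return np.mean(errors)                         # the error is averaged over all runs
```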