
Support Vector Machines

Machine Learning – CSE446
Sham Kakade, University of Washington

November 2, 2016
©2016 Sham Kakade


Announcements:

- Project Milestones coming up
- No class Thurs, Nov 10
- HW2
  - Let's figure it out…
- HW3 posted this week.
  - Let's get state of the art on MNIST!
  - It'll be collaborative
- Today:
  - Review: Kernels
  - SVMs
  - Generalization/review


Kernels

Machine Learning – CSE446
Sham Kakade, University of Washington

November 1, 2016


What if the data is not linearly separable?

Use features of features of features of features….

Feature space can get really large really quickly!
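To make "really large really quickly" concrete, here is a standard count (a hedged aside of mine, not from the slides): the number of monomials of degree at most $d$ in $p$ raw features is

$$\binom{p+d}{d},$$

so with $p = 100$ features and degree $d = 3$ there are already $\binom{103}{3} = 176{,}851$ expanded features. This blow-up is exactly what kernels let us sidestep by computing inner products in the expanded space implicitly.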


Common kernels

- Polynomials of degree exactly d
- Polynomials of degree up to d
- Gaussian (squared exponential) kernel
- Sigmoid
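The kernel formulas themselves appear to have been worked out by hand on the slide; as a hedged sketch of the usual definitions (my reconstruction, with the formulas in the comments), in Python:

```python
import numpy as np

def poly_kernel_exact(x, xp, d):
    """Polynomial of degree exactly d: K(x, x') = (x . x')^d."""
    return np.dot(x, xp) ** d

def poly_kernel_up_to(x, xp, d):
    """Polynomial of degree up to d: K(x, x') = (1 + x . x')^d."""
    return (1.0 + np.dot(x, xp)) ** d

def gaussian_kernel(x, xp, sigma=1.0):
    """Gaussian / squared-exponential: K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    diff = x - xp
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, xp, a=1.0, b=0.0):
    """Sigmoid: K(x, x') = tanh(a * x . x' + b). (Not PSD for every choice of a, b.)"""
    return np.tanh(a * np.dot(x, xp) + b)
```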

Mercer's Theorem

- When do we have a kernel K(x,x')?
- Definition 1: when there exists an embedding $\phi$ with $K(x, x') = \phi(x) \cdot \phi(x')$
- Mercer's Theorem:
  - K(x,x') is a valid kernel if and only if K is positive semi-definite.
  - PSD in the following sense:
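The "following sense" was filled in on the board; the standard statement (my paraphrase) is that for any finite set of points $x_1, \dots, x_m$ and any coefficients $c \in \mathbb{R}^m$, $\sum_{i,j} c_i c_j K(x_i, x_j) \ge 0$, i.e., every Gram matrix built from K is positive semi-definite. A quick numerical sanity check, my own illustration rather than anything from the slides:

```python
import numpy as np

def gram_matrix(kernel, X):
    """Gram matrix G[i, j] = K(x_i, x_j) for the rows x_i of X."""
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

# All eigenvalues of the Gram matrix should be >= 0 (up to numerical error)
X = np.random.randn(20, 5)
G = gram_matrix(lambda x, xp: (1.0 + np.dot(x, xp)) ** 3, X)  # degree-3 polynomial kernel
print(np.min(np.linalg.eigvalsh(G)) >= -1e-8)
```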


Support Vector Machines

Machine Learning – CSE446
Sham Kakade, University of Washington

November 1, 2016

Linear classifiers – Which line is better?


Pick the one with the largest margin!

"confidence" $= y_j(w \cdot x_j + w_0)$

Maximize the margin

$$\max_{\gamma, w, w_0} \;\gamma \quad \text{s.t.} \quad y_j(w \cdot x_j + w_0) \ge \gamma, \;\; \forall j \in \{1, \dots, N\}$$


But there are many planes…

Review: Normal to a plane

$$x_j = \bar{x}_j + \alpha \, \frac{w}{\|w\|}$$


A Convention: Normalized margin – Canonical hyperplanes

[Figure: the canonical points x+ and x- on either side of the separating hyperplane]

$$x_j = \bar{x}_j + \alpha \, \frac{w}{\|w\|}$$

Margin maximization using canonical hyperplanes

Unnormalized problem:

$$\max_{\gamma, w, w_0} \;\gamma \quad \text{s.t.} \quad y_j(w \cdot x_j + w_0) \ge \gamma, \;\; \forall j \in \{1, \dots, N\}$$

Normalized problem:

$$\min_{w, w_0} \;\|w\|_2^2 \quad \text{s.t.} \quad y_j(w \cdot x_j + w_0) \ge 1, \;\; \forall j \in \{1, \dots, N\}$$
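The step connecting the two problems was worked through in lecture; a hedged reconstruction of the usual argument: the unnormalized problem only determines $(w, w_0)$ up to a positive rescaling, so we may fix the scale by requiring the closest points to satisfy $y_j(w \cdot x_j + w_0) = 1$ (the canonical-hyperplane convention). Under that convention the geometric margin is $1/\|w\|$, and maximizing $1/\|w\|$ is the same as minimizing $\|w\|_2^2$, which gives the normalized problem.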


Support vector machines (SVMs)

- Solve efficiently by many methods, e.g.,
  - quadratic programming (QP)
- Well-studied solution algorithms
  - Stochastic gradient descent
- Hyperplane defined by support vectors

$$\min_{w, w_0} \;\|w\|_2^2 \quad \text{s.t.} \quad y_j(w \cdot x_j + w_0) \ge 1, \;\; \forall j \in \{1, \dots, N\}$$
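As a quick illustration of the solver route (scikit-learn is my choice here, not something the slides reference): a library SVM solver returns exactly the objects named above, the hyperplane $(w, w_0)$ and the support vectors that define it. A large C approximates the hard-margin problem on separable data.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two well-separated Gaussian blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([1] * 20 + [-1] * 20)

# Large C approximates the hard-margin SVM on separable data
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("w  =", clf.coef_[0])                       # hyperplane normal
print("w0 =", clf.intercept_[0])                  # offset
print("support vectors:\n", clf.support_vectors_)  # the hyperplane is defined by these
```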

What if the data is not linearly separable?

Use features of features of features of features….


What if the data is still not linearly separable?

- If data is not linearly separable, some points don't satisfy the margin constraint:
- How bad is the violation?
- Trade off margin violation with ||w||:

$$\min_{w, w_0} \;\|w\|_2^2 \quad \text{s.t.} \quad y_j(w \cdot x_j + w_0) \ge 1, \;\; \forall j$$

SVMs for Non-Linearly Separable data, meet my friend the Perceptron…

- Perceptron was minimizing the hinge loss:

$$\sum_{j=1}^{N} \left(-y_j(w \cdot x_j + w_0)\right)_+$$

- SVMs minimize the regularized hinge loss!!

$$\|w\|_2^2 + C \sum_{j=1}^{N} \left(1 - y_j(w \cdot x_j + w_0)\right)_+$$
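A summary I'm adding for clarity (not a slide): writing $z_j = y_j(w \cdot x_j + w_0)$ for the confidence, the perceptron charges $(-z_j)_+$ per example, which drops to zero as soon as a point is on the correct side, while the SVM charges $(1 - z_j)_+$, which keeps charging until the point clears the margin $z_j \ge 1$, and adds the $\|w\|_2^2$ regularizer on top.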


Stochastic Gradient Descent for SVMs

- Perceptron minimization:

$$\sum_{j=1}^{N} \left(-y_j(w \cdot x_j + w_0)\right)_+$$

- SGD for Perceptron:

$$w^{(t+1)} \leftarrow w^{(t)} + \mathbf{1}\!\left[y^{(t)}\left(w^{(t)} \cdot x^{(t)}\right) \le 0\right] y^{(t)} x^{(t)}$$

- SVMs minimization:

$$\|w\|_2^2 + C \sum_{j=1}^{N} \left(1 - y_j(w \cdot x_j + w_0)\right)_+$$

- SGD for SVMs:
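The "SGD for SVMs" update was derived on the board; below is a hedged sketch of one standard choice, a constant-step stochastic subgradient method on the regularized hinge objective above. The step size, number of epochs, and the way the $\|w\|_2^2$ term is split across examples are my own choices, not necessarily the variant shown in class.

```python
import numpy as np

def svm_sgd(X, y, C=1.0, eta=0.01, epochs=20, seed=0):
    """Stochastic subgradient descent on  ||w||_2^2 + C * sum_j (1 - y_j (w.x_j + w0))_+ .

    Each step uses the subgradient of  (1/N)||w||_2^2 + C * (1 - y_j (w.x_j + w0))_+ ,
    so the per-example terms sum back to the full objective.
    """
    rng = np.random.RandomState(seed)
    N, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        for j in rng.permutation(N):  # one pass in random order
            margin = y[j] * (np.dot(w, X[j]) + w0)
            grad_w, grad_w0 = (2.0 / N) * w, 0.0   # regularizer, spread over N examples
            if margin < 1:                          # hinge term is active
                grad_w -= C * y[j] * X[j]
                grad_w0 -= C * y[j]
            w -= eta * grad_w
            w0 -= eta * grad_w0
    return w, w0
```

Each step is essentially the perceptron update with two differences: the check is margin < 1 rather than margin <= 0, and the weights are additionally shrunk by the regularizer's gradient.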

SVMs vs logistic regression

- We often want probabilities/confidences (logistic wins here)
- For classification loss, they are comparable
- Multiclass setting:
  - Softmax naturally generalizes logistic regression
  - SVMs have …
- What about good old least squares?


Multiple Classes

- One can generalize the hinge loss (see the sketch after this slide)
  - If no error (by some margin) -> no loss
  - If error, penalize what you said against the best
- SVMs vs logistic regression
  - We often want probabilities/confidences (logistic wins here)
  - For classification loss, they are comparable
- Latent SVMs
  - When you have many classes it's difficult to do logistic regression
- 2) Kernels
  - Warp the feature space
  - This idea is actually more general
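A hedged reconstruction of the multiclass hinge loss being described (the formula itself was not typeset on the slide): with one weight vector $w_1, \dots, w_K$ per class and an example $(x, y)$,

$$\ell(w; x, y) = \max_{y' \ne y} \left(1 + w_{y'} \cdot x - w_y \cdot x\right)_+,$$

which is zero when the correct class beats every other class by a margin of 1 (no error by some margin means no loss) and otherwise charges the gap to the best-scoring competitor (penalize what you said against the best).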


Generalization/Model Comparisons

Machine Learning – CSE446
Sham Kakade, University of Washington

November 1, 2016

What method should I use?

- Linear regression, logistic, SVMs?
- No regularization? Ridge? L1?
- I ran SGD without any regularization and it was ok?


Generalization

- You get N samples.
- You learn a classifier/regressor f^.
- How close are you to optimal?

$$L(\hat{f}) - L(f^*) \;<\; ???$$

- (We can look at the above in expectation or with 'high' probability.)

Finite Case:

- You get N samples.
- You learn a classifier/regressor f^ among K classifiers:

$$L(\hat{f}) - L(f^*) \;<\; ???$$
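The right-hand side here was filled in during lecture; a standard answer for a finite class of K predictors and a bounded loss (a hedged addition of mine, holding with high probability over the N samples) is

$$L(\hat{f}) - L(f^*) \;\le\; O\!\left(\sqrt{\frac{\log K}{N}}\right),$$

which comes from a Hoeffding-style concentration bound for each classifier plus a union bound over all K of them.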


Linear Regression

- N samples, d dimensions.
- L is the square loss.
- w^ is the least squares estimate.

$$L(\hat{w}) - L(w^*) \;<\; O(d/N)$$

- Need about N = O(d) samples

Sparse Linear Regression

- N samples, d dimensions, L is the square loss.
- w^ is the best fit which uses only k features (computationally intractable)

$$L(\hat{w}) - L(w^*) \;<\; \frac{k \log(d)}{N}$$

- True of Lasso under stronger assumptions: "incoherence"
- When do we like sparse regression?
  - When we believe there are a few GOOD features.


Learning a Halfspace

- You get N samples, in d dimensions.
- L is the 0/1 loss.
- w^ is the empirical risk minimizer (computationally infeasible to compute)

$$L(\hat{w}) - L(w^*) \;<\; \sqrt{\frac{d \log(N)}{N}}$$

- Need N = O(d) samples

What about Regularization?

- Let's look at the (dual) constrained problem
- Minimize:

$$\min_{w} \;\hat{L}(w) \quad \text{such that} \quad \|w\| \le W$$

- Where L^ is our training error.
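One concrete way to solve this constrained form is projected stochastic gradient descent: take a gradient step on the training loss, then project back onto the norm ball. The sketch below is my own illustration (squared loss, L2 ball, and constant step size are all assumptions, not choices made on the slides); for the L2 ball the projection is just a rescaling.

```python
import numpy as np

def project_l2(w, W):
    """Project w onto the L2 ball of radius W (rescale if it sticks out)."""
    norm = np.linalg.norm(w)
    return w if norm <= W else (W / norm) * w

def constrained_sgd(X, y, W, eta=0.01, epochs=20, seed=0):
    """Projected SGD on the squared loss, keeping ||w||_2 <= W throughout."""
    rng = np.random.RandomState(seed)
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for j in rng.permutation(N):
            grad = 2.0 * (np.dot(w, X[j]) - y[j]) * X[j]  # gradient of (w.x_j - y_j)^2
            w = project_l2(w - eta * grad, W)              # step, then project
    return w
```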


Optimization and Regularization?

- I did SGD without regularization and it was fine?
- "Early stopping" implicitly regularizes (in L2)

L2 Regularization

- Assume ||w||_2 ≤ W_2 and ||x||_2 ≤ R_2
- L is some convex loss (logistic, hinge, square)
- w^ is the constrained minimizer (computationally tractable to compute)

$$L(\hat{w}) - L(w^*) \;<\; \frac{W_2 R_2}{\sqrt{N}}$$

- A DIMENSION-FREE "margin" bound!


L1 Regularization

- Assume ||w||_1 ≤ W_1 and ||x||_∞ ≤ R_∞
- L is some convex loss (logistic, hinge, square)
- w^ is the constrained minimizer (computationally tractable to compute)

$$L(\hat{w}) - L(w^*) \;<\; W_1 R_\infty \sqrt{\frac{\log(d)}{N}}$$

- Promotes sparsity; one can think of W_1 as the "sparsity level/k" (mild dimension dependence, log(d)).
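A closing comparison I'm adding for review (not a slide from the deck, and the L1 rate above is my reconstruction of a garbled formula): side by side, the two bounds are

$$\text{L2:} \;\; \frac{W_2 R_2}{\sqrt{N}} \qquad \text{vs.} \qquad \text{L1:} \;\; W_1 R_\infty \sqrt{\frac{\log(d)}{N}}.$$

Both are essentially dimension-free (the L1 bound only pays a $\log(d)$ factor); the L1 version tends to win when $w^*$ is sparse, so $W_1$ is small, while the features are dense, so $R_2$ can be as large as $\sqrt{d}\, R_\infty$. That is exactly the "few GOOD features" regime from the sparse-regression slide.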