8/3/2019 Andrew Rosenberg- Support Vector Machines (and Kernel Methods in general)
Support Vector Machines
(and Kernel Methods in general)
Machine Learning
March 23, 2010
Last Time
Multilayer Perceptron / Logistic Regression Networks
Neural Networks
Error Backpropagation
Today
Support Vector Machines
Note: we'll rely on some math from Optimality Theory that we won't derive.
Maximum Margin
Perceptron (and other linear classifiers) can lead to many equally valid choices for the decision boundary.
Are these really equally valid?
Max Margin
How can we pick which is best?
Maximize the size of the margin.
(Figure: candidate decision boundaries, labeled "Small Margin" and "Large Margin".)
Support Vectors
Support Vectors are those input points (vectors) closest to the decision boundary.
1. They are vectors.
2. They support the decision hyperplane.
Support Vectors
Define this as a decision problem.
The decision hyperplane:
w^T x + b = 0
No fancy math, just the equation of a hyperplane.
Support Vectors
Aside: Why do some classifiers use {0, 1} and others {-1, +1}?
Simplicity of the math and interpretation.
For probability density function estimation, t_i ∈ {0, 1} has a clear correlate.
For classification, a decision boundary of 0 is more easily interpretable than .5.
Here, x_i are the data and t_i ∈ {-1, +1} are the labels.
Support Vectors
Define this as a decision problem.
The decision hyperplane:
w^T x + b = 0
Decision function:
D(x_i) = sign(w^T x_i + b)
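The decision function is simple enough to state directly in code. A minimal sketch; the weight vector, bias, and test points below are made-up illustrative values, not trained quantities.

```python
def decide(w, x, b):
    """Linear decision function: D(x) = sign(w^T x + b)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical hyperplane w = (1, -1), b = 0 separates points by x[0] - x[1].
print(decide([1.0, -1.0], [3.0, 1.0], 0.0))  # 1
print(decide([1.0, -1.0], [0.0, 2.0], 0.0))  # -1
```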
Support Vectors
Define this as a decision problem.
The decision hyperplane:
w^T x + b = 0
Margin hyperplanes:
w^T x + b = ε
w^T x + b = -ε
Support Vectors
The decision hyperplane:
w^T x + b = 0
Scale invariance:
c w^T x + c b = 0
Support Vectors
The decision hyperplane:
w^T x + b = 0
Scale invariance:
c w^T x + c b = 0
w^T x + b = ε
w^T x + b = -ε
Support Vectors
The decision hyperplane:
w^T x + b = 0
Scale invariance:
c w^T x + c b = 0
This scaling does not change the decision hyperplane, or the support vector hyperplanes. But we will eliminate a variable from the optimization:
w^T x + b = +1
w^T x + b = -1
What are we optimizing?
We will represent the size of the margin in terms of w.
This will allow us to simultaneously:
Identify a decision boundary
Maximize the margin
How do we represent the size of the margin in terms of w?
1. There must be at least one point that lies on each support hyperplane.
Proof outline: If not, we could define a larger-margin support hyperplane that does touch the nearest point(s).
How do we represent the size of the margin in terms of w?
1. There must be at least one point that lies on each support hyperplane.
2. Thus:
w^T x_1 + b = +1
w^T x_2 + b = -1
3. And:
w^T (x_1 - x_2) = 2
How do we represent the size of the margin in terms of w?
The vector w is perpendicular to the decision hyperplane w^T x + b = 0.
If the dot product of two vectors equals zero, the two vectors are perpendicular.
w^T (x_1 - x_2) = 2
How do we represent the size of the margin in terms of w?
The margin is the projection of x_1 - x_2 onto w, the normal of the hyperplane.
w^T (x_1 - x_2) = 2
Aside: Vector Projection
We want the length of the projection of v onto u (call it goal).
cos(θ) = adjacent / hypotenuse
v · u = ||v|| ||u|| cos(θ)
Here hypotenuse = ||v|| and adjacent = goal, so cos(θ) = goal / ||v||.
Then v · u = ||v|| ||u|| (goal / ||v||) = ||u|| · goal, so:
goal = (v · u) / ||u||
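The projection formula from this aside can be checked numerically. A minimal sketch; the vectors are arbitrary examples.

```python
import math

def projection_length(v, u):
    """Length of the projection of v onto u: (v . u) / ||u||."""
    dot = sum(vi * ui for vi, ui in zip(v, u))
    return dot / math.sqrt(sum(ui * ui for ui in u))

# Projecting v = (3, 4) onto the x-axis recovers its first coordinate;
# projecting v onto itself recovers ||v|| = 5.
print(projection_length([3.0, 4.0], [1.0, 0.0]))  # 3.0
print(projection_length([3.0, 4.0], [3.0, 4.0]))  # 5.0
```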
How do we represent the size of the margin in terms of w?
The margin is the projection of x_1 - x_2 onto w, the normal of the hyperplane.
w^T (x_1 - x_2) = 2
Projection: (v · u) / ||u||
Size of the margin: w^T (x_1 - x_2) / ||w|| = 2 / ||w||
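The resulting margin size 2 / ||w|| is easy to compute; a sketch with an arbitrary example weight vector:

```python
import math

def margin_size(w):
    """Distance between w^T x + b = +1 and w^T x + b = -1: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# For w = (3, 4), ||w|| = 5, so the margin is 2/5.
print(margin_size([3.0, 4.0]))  # 0.4
```

Minimizing ||w|| is therefore the same as maximizing the margin.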
Max Margin Loss Function
If constraint optimization, then Lagrange Multipliers.
Optimize the Primal:
min ||w||, where t_i(w^T x_i + b) ≥ 1
L(w, b) = (1/2) w · w - Σ_{i=0}^{N-1} α_i [t_i((w · x_i) + b) - 1]
Max Margin Loss Function
Optimize the Primal:
L(w, b) = (1/2) w · w - Σ_{i=0}^{N-1} α_i [t_i((w · x_i) + b) - 1]
Partial wrt b:
∂L(w, b)/∂b = 0  ⇒  Σ_{i=0}^{N-1} α_i t_i = 0
Max Margin Loss Function
Optimize the Primal:
L(w, b) = (1/2) w · w - Σ_{i=0}^{N-1} α_i [t_i((w · x_i) + b) - 1]
Partial wrt w:
∂L(w, b)/∂w = 0  ⇒  w - Σ_{i=0}^{N-1} α_i t_i x_i = 0  ⇒  w = Σ_{i=0}^{N-1} α_i t_i x_i
Max Margin Loss Function
Construct the dual:
L(w, b) = (1/2) w · w - Σ_{i=0}^{N-1} α_i [t_i((w · x_i) + b) - 1]
Substitute w = Σ_{i=0}^{N-1} α_i t_i x_i:
W(α) = Σ_{i=0}^{N-1} α_i - (1/2) Σ_{i,j=0}^{N-1} α_i α_j t_i t_j (x_i · x_j)
where α_i ≥ 0 and Σ_{i=0}^{N-1} α_i t_i = 0
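The substitution step between the primal Lagrangian and the dual is not spelled out on the slide; sketched in full, it goes:

```latex
\begin{aligned}
L(w, b) &= \tfrac{1}{2}\, w \cdot w - \sum_{i=0}^{N-1} \alpha_i \left[ t_i\big((w \cdot x_i) + b\big) - 1 \right] \\
\text{with } w &= \sum_{i=0}^{N-1} \alpha_i t_i x_i \ \text{ and } \ \sum_{i=0}^{N-1} \alpha_i t_i = 0: \\
L &= \tfrac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)
   - \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)
   - b \underbrace{\sum_{i=0}^{N-1} \alpha_i t_i}_{=\,0}
   + \sum_{i=0}^{N-1} \alpha_i \\
  &= \sum_{i=0}^{N-1} \alpha_i - \tfrac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j) = W(\alpha)
\end{aligned}
```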
Dual formulation of the error
Optimize this quadratic program to identify the Lagrange multipliers, and thus the weights:
W(α) = Σ_{i=0}^{N-1} α_i - (1/2) Σ_{i,j=0}^{N-1} α_i α_j t_i t_j (x_i · x_j)
where α_i ≥ 0
There exist (extremely) fast approaches to quadratic optimization in C, C++, Python, Java, and R.
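As an illustration of such a solver, here is a sketch that maximizes W(α) for a tiny made-up 2-D dataset using scipy's general-purpose SLSQP optimizer (an assumed choice for this example; the lecture does not name a specific library). Dedicated SVM solvers are far faster, but the constraints are exactly the ones on this slide: α_i ≥ 0 and Σ α_i t_i = 0.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up, linearly separable toy data: two positives, two negatives.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])

# Gram-like matrix G[i, j] = t_i t_j (x_i . x_j).
G = (t[:, None] * X) @ (t[:, None] * X).T

def neg_dual(a):
    # Minimize -W(alpha); maximizing W(alpha) is the same problem.
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, np.zeros(len(t)), method="SLSQP",
               bounds=[(0.0, None)] * len(t),
               constraints={"type": "eq", "fun": lambda a: a @ t})
alpha = res.x
w = (alpha * t) @ X                 # w = sum_i alpha_i t_i x_i
sv = int(np.argmax(alpha))          # index of a support vector
b = t[sv] - w @ X[sv]               # from t_sv (w . x_sv + b) = 1
print(np.sign(X @ w + b))           # recovers the labels t
```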
Quadratic Programming
minimize f(x) = (1/2) x^T Q x + c^T x
subject to (one or more): A x ≤ k, B x = l
If Q is positive semidefinite, then f(x) is convex. If f(x) is convex, then there is a single minimum.
W(α) = Σ_{i=0}^{N-1} α_i - (1/2) Σ_{i,j=0}^{N-1} α_i α_j t_i t_j (x_i · x_j)
where α_i ≥ 0
Support Vector Expansion
When α_i is non-zero, x_i is a support vector. When α_i is zero, x_i is not a support vector.
w = Σ_{i=0}^{N-1} α_i t_i x_i
New decision function:
D(x) = sign(w^T x + b)
     = sign((Σ_{i=0}^{N-1} α_i t_i x_i)^T x + b)
     = sign(Σ_{i=0}^{N-1} α_i t_i (x_i^T x) + b)
Independent of the dimension of x!
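A sketch of this expanded decision function, which touches the data only through dot products. The alphas, labels, and points are made-up illustrative values, not the output of a real training run.

```python
def decide_expanded(alphas, labels, points, b, x):
    """D(x) = sign(sum_i alpha_i t_i (x_i . x) + b)."""
    s = sum(a * t * sum(pi * xi for pi, xi in zip(p, x))
            for a, t, p in zip(alphas, labels, points)) + b
    return 1 if s >= 0 else -1

# Two "support vectors" with equal weight; every other alpha is zero,
# so those points simply do not appear in the sum.
alphas = [0.5, 0.5]
labels = [1, -1]
points = [[2.0, 0.0], [0.0, 2.0]]
print(decide_expanded(alphas, labels, points, 0.0, [3.0, 1.0]))  # 1
print(decide_expanded(alphas, labels, points, 0.0, [1.0, 3.0]))  # -1
```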
Kuhn-Tucker Conditions
In constraint optimization, at the optimal solution:
Constraint * Lagrange Multiplier = 0
α_i (1 - t_i(w^T x_i + b)) = 0
If α_i ≠ 0, then t_i(w^T x_i + b) = 1.
Only points on the decision boundary contribute to the solution!
Visualization of Support Vectors
(Figure: data points with α = 0 away from the boundary; support vectors with α > 0 on the margin.)
Interpretability of SVM parameters
What else can we tell from the alphas? If an alpha is large, then the associated data point is quite important. It's either an outlier, or incredibly important.
But this only gives us the best solution for linearly separable data sets.
Basis of Kernel Methods
The decision process doesn't depend on the dimensionality of the data.
We can map to a higher-dimensional data space.
Note: data points only appear within a dot product. The error is based on the dot product of data points, not the data points themselves.
W(α) = Σ_{i=0}^{N-1} α_i - (1/2) Σ_{i,j=0}^{N-1} α_i α_j t_i t_j (x_i · x_j)
w = Σ_{i=0}^{N-1} α_i t_i x_i
Basis of Kernel Methods
Since data points only appear within a dot product, we can map to another space through a replacement:
x_i · x_j  →  Φ(x_i) · Φ(x_j)
The error is based on the dot product of data points, not the data points themselves.
W(α) = Σ_{i=0}^{N-1} α_i - (1/2) Σ_{i,j=0}^{N-1} α_i α_j t_i t_j (x_i · x_j)
becomes
W(α) = Σ_{i=0}^{N-1} α_i - (1/2) Σ_{i,j=0}^{N-1} α_i α_j t_i t_j (Φ(x_i) · Φ(x_j))
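A sketch of the replacement: the only change to the machinery is swapping the dot-product function. The RBF kernel below is a standard example of a k(x_i, x_j) = Φ(x_i) · Φ(x_j); the gamma value is an illustrative choice, not something this slide specifies.

```python
import math

def linear_kernel(x, z):
    """Plain dot product: the kernel used so far."""
    return sum(xi * zi for xi, zi in zip(x, z))

def rbf_kernel(x, z, gamma=1.0):
    """k(x, z) = exp(-gamma * ||x - z||^2), a dot product in an implicit space."""
    sq = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq)

print(linear_kernel([1.0, 2.0], [3.0, 4.0]))  # 11.0
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))     # 1.0 (identical points)
```

The dual objective W(α) only touches the data through these calls, so swapping the kernel changes the feature space without changing the QP.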
Learning Theory bases of SVMs
Theoretical bounds on testing error:
The upper bound doesn't depend on the dimensionality of the space.
The lower bound is maximized by maximizing the margin associated with the decision boundary.
Why we like SVMs
They work. Good generalization.
Easily interpreted: the decision boundary is based on the data, in the form of the support vectors. (Not so in multilayer perceptron networks.)
Principled bounds on testing error from Learning Theory (VC dimension).
SVM vs. MLP
SVMs have many fewer parameters. SVM: maybe just a kernel parameter. MLP: number and arrangement of nodes, and eta, the learning rate.
SVM: convex optimization task. MLP: likelihood is non-convex -- local minima.
R(θ) = (1/N) Σ_{n=0}^{N-1} (1/2) (y_n - g(Σ_k w_kl g(Σ_j w_jk g(Σ_i w_ij x_{n,i}))))^2
Soft margin classification
There can be outliers on the other side of the decision boundary, or leading to a small margin.
Solution: Introduce a penalty term to the constraint function:
min ||w|| + C Σ_{i=0}^{N-1} ξ_i, where t_i(w^T x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0
L(w, b) = (1/2) w · w + C Σ_{i=0}^{N-1} ξ_i - Σ_{i=0}^{N-1} α_i [t_i((w · x_i) + b) + ξ_i - 1]
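A sketch of the penalized objective, with made-up data and a made-up hyperplane: the slack ξ_i = max(0, 1 - t_i(w · x_i + b)) measures how far each point falls short of its margin constraint.

```python
def soft_margin_objective(w, b, X, t, C):
    """(1/2) w.w + C * sum_i xi_i, with xi_i = max(0, 1 - t_i (w . x_i + b))."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    slacks = [max(0.0, 1.0 - ti * (sum(wi * xi for wi, xi in zip(w, x)) + b))
              for x, ti in zip(X, t)]
    return margin_term + C * sum(slacks), slacks

# Illustrative 1-D-style data on the x-axis with hyperplane w = (1, 0), b = 0.
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]]
t = [1, 1, -1]
obj, slacks = soft_margin_objective([1.0, 0.0], 0.0, X, t, C=10.0)
print(slacks)  # [0.0, 0.5, 0.0] -- only the second point violates its margin
print(obj)     # 5.5
```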
Soft Max Dual
min ||w|| + C Σ_{i=0}^{N-1} ξ_i, where t_i(w^T x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0
L(w, b) = (1/2) w · w + C Σ_{i=0}^{N-1} ξ_i - Σ_{i=0}^{N-1} α_i [t_i((w · x_i) + b) + ξ_i - 1]
W(α) = Σ_{i=0}^{N-1} α_i - (1/2) Σ_{i,j=0}^{N-1} t_i t_j α_i α_j (x_i · x_j)
where 0 ≤ α_i ≤ C and Σ_{i=0}^{N-1} α_i t_i = 0
Still Quadratic Programming!
Soft margin example
Points are allowed within the margin, but a cost is introduced.
(Figure: points inside the margin with slack ξ_i; Hinge Loss.)
Probabilities from SVMs
Support Vector Machines are discriminant functions.
Discriminant functions: f(x) = c
Discriminative models: f(x) = argmax_c p(c|x)
Generative models: f(x) = argmax_c p(x|c)p(c)/p(x)
No (principled) probabilities from SVMs. SVMs are not based on probability distribution functions of class instances.
Efficiency of SVMs
Not especially fast.
Training: n^3 -- quadratic programming efficiency.
Evaluation: n -- need to evaluate against each support vector (potentially n of them).
Good Bye
Next time: The Kernel Trick -> Kernel Methods, or: How can we use SVMs on data that are not linearly separable?