8/3/2019 Andrew Rosenberg- Support Vector Machines (and Kernel Methods in general)
Support Vector Machines
(and Kernel Methods in general)
Machine Learning
March 23, 2010
Last Time
Multilayer Perceptron / Logistic Regression Networks
Neural Networks
Error Backpropagation
Today
Support Vector Machines
Note: we'll rely on some math from Optimality Theory that we won't derive.
Maximum Margin
Perceptron (and other linear classifiers) can lead to many equally valid choices for the decision boundary.
Are these really equally valid?
Max Margin
How can we pick which is best?
Maximize the size of the margin.
(Figure: candidate decision boundaries, labeled "Small Margin" and "Large Margin".)
Support Vectors
Support Vectors are those input points (vectors) closest to the decision boundary.
1. They are vectors.
2. They support the decision hyperplane.
Support Vectors
Define this as a decision problem.
The decision hyperplane:
w^T x + b = 0
No fancy math, just the equation of a hyperplane.
Support Vectors
Aside: Why do some classifiers use {0, 1} and others {-1, +1}?
Simplicity of the math and interpretation.
For probability density function estimation, t_i ∈ {0, 1} has a clear correlate.
For classification, a decision boundary of 0 is more easily interpretable than .5.
Here, x_i are the data and t_i ∈ {-1, +1} are the labels.
Support Vectors
Define this as a decision problem.
The decision hyperplane:
w^T x + b = 0
Decision function:
D(x_i) = sign(w^T x_i + b)
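The decision function is simple enough to state directly in code. A minimal sketch; the weight vector, bias, and test points below are made-up illustrative values, not trained quantities.

```python
def decide(w, x, b):
    """Linear decision function: D(x) = sign(w^T x + b)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical hyperplane w = (1, -1), b = 0 separates points by x[0] - x[1].
print(decide([1.0, -1.0], [3.0, 1.0], 0.0))  # 1
print(decide([1.0, -1.0], [0.0, 2.0], 0.0))  # -1
```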
Support Vectors
Define this as a decision problem.
The decision hyperplane:
w^T x + b = 0
Margin hyperplanes:
w^T x + b = ε
w^T x + b = -ε
Support Vectors
The decision hyperplane:
w^T x + b = 0
Scale invariance:
c w^T x + c b = 0
Support Vectors
The decision hyperplane:
w^T x + b = 0
Scale invariance:
c w^T x + c b = 0
w^T x + b = ε
w^T x + b = -ε
Support Vectors
The decision hyperplane:
w^T x + b = 0
Scale invariance:
c w^T x + c b = 0
This scaling does not change the decision hyperplane, or the support vector hyperplanes. But we will eliminate a variable from the optimization:
w^T x + b = +1
w^T x + b = -1
What are we optimizing?
We will represent the size of the margin in terms of w.
This will allow us to simultaneously:
Identify a decision boundary
Maximize the margin
How do we represent the size of the margin in terms of w?
1. There must be at least one point that lies on each support hyperplane.
Proof outline: If not, we could define a larger-margin support hyperplane that does touch the nearest point(s).
How do we represent the size of the margin in terms of w?
1. There must be at least one point that lies on each support hyperplane.
2. Thus:
w^T x_1 + b = +1
w^T x_2 + b = -1
3. And:
w^T (x_1 - x_2) = 2
How do we represent the size of the margin in terms of w?
The vector w is perpendicular to the decision hyperplane w^T x + b = 0.
If the dot product of two vectors equals zero, the two vectors are perpendicular.
w^T (x_1 - x_2) = 2
How do we represent the size of the margin in terms of w?
The margin is the projection of x_1 - x_2 onto w, the normal of the hyperplane.
w^T (x_1 - x_2) = 2
Aside: Vector Projection
We want the length of the projection of v onto u (call it goal).
cos(θ) = adjacent / hypotenuse
v · u = ||v|| ||u|| cos(θ)
Here hypotenuse = ||v|| and adjacent = goal, so cos(θ) = goal / ||v||.
Then v · u = ||v|| ||u|| (goal / ||v||) = ||u|| · goal, so:
goal = (v · u) / ||u||
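The projection formula from this aside can be checked numerically. A minimal sketch; the vectors are arbitrary examples.

```python
import math

def projection_length(v, u):
    """Length of the projection of v onto u: (v . u) / ||u||."""
    dot = sum(vi * ui for vi, ui in zip(v, u))
    return dot / math.sqrt(sum(ui * ui for ui in u))

# Projecting v = (3, 4) onto the x-axis recovers its first coordinate;
# projecting v onto itself recovers ||v|| = 5.
print(projection_length([3.0, 4.0], [1.0, 0.0]))  # 3.0
print(projection_length([3.0, 4.0], [3.0, 4.0]))  # 5.0
```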
How do we represent the size of the margin in terms of w?
The margin is the projection of x_1 - x_2 onto w, the normal of the hyperplane.
w^T (x_1 - x_2) = 2
Projection: (v · u) / ||u||
Size of the margin: w^T (x_1 - x_2) / ||w|| = 2 / ||w||
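The resulting margin size 2 / ||w|| is easy to compute; a sketch with an arbitrary example weight vector:

```python
import math

def margin_size(w):
    """Distance between w^T x + b = +1 and w^T x + b = -1: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# For w = (3, 4), ||w|| = 5, so the margin is 2/5.
print(margin_size([3.0, 4.0]))  # 0.4
```

Minimizing ||w|| is therefore the same as maximizing the margin.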
Max Margin Loss Function
If constraint optimization, then Lagrange Multipliers.
Optimize the Primal:
min ||w||, where t_i(w^T x_i + b) ≥ 1
L(w, b) = (1/2) w · w - Σ_{i=0}^{N-1} α_i [t_i((w · x_i) + b) - 1]
Max Margin Loss Function
Optimize the Primal:
L(w, b) = (1/2) w · w - Σ_{i=0}^{N-1} α_i [t_i((w · x_i) + b) - 1]
Partial wrt b:
∂L(w, b)/∂b = 0  ⇒  Σ_{i=0}^{N-1} α_i t_i = 0
Max Margin Loss Function
Optimize the Primal:
L(w, b) = (1/2) w · w - Σ_{i=0}^{N-1} α_i [t_i((w · x_i) + b) - 1]
Partial wrt w:
∂L(w, b)/∂w = 0  ⇒  w - Σ_{i=0}^{N-1} α_i t_i x_i = 0  ⇒  w = Σ_{i=0}^{N-1} α_i t_i x_i
Max Margin Loss Function
Construct the dual:
L(w, b) = (1/2) w · w - Σ_{i=0}^{N-1} α_i [t_i((w · x_i) + b) - 1]
Substitute w = Σ_{i=0}^{N-1} α_i t_i x_i:
W(α) = Σ_{i=0}^{N-1} α_i - (1/2) Σ_{i,j=0}^{N-1} α_i α_j t_i t_j (x_i · x_j)
where α_i ≥ 0 and Σ_{i=0}^{N-1} α_i t_i = 0
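The substitution step between the primal Lagrangian and the dual is not spelled out on the slide; sketched in full, it goes:

```latex
\begin{aligned}
L(w, b) &= \tfrac{1}{2}\, w \cdot w - \sum_{i=0}^{N-1} \alpha_i \left[ t_i\big((w \cdot x_i) + b\big) - 1 \right] \\
\text{with } w &= \sum_{i=0}^{N-1} \alpha_i t_i x_i \ \text{ and } \ \sum_{i=0}^{N-1} \alpha_i t_i = 0: \\
L &= \tfrac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)
   - \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j)
   - b \underbrace{\sum_{i=0}^{N-1} \alpha_i t_i}_{=\,0}
   + \sum_{i=0}^{N-1} \alpha_i \\
  &= \sum_{i=0}^{N-1} \alpha_i - \tfrac{1}{2} \sum_{i,j=0}^{N-1} \alpha_i \alpha_j t_i t_j (x_i \cdot x_j) = W(\alpha)
\end{aligned}
```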
Dual formulation of the error
Optimize this quadratic program to identify the Lagrange multipliers, and thus the weights:
W(α) = Σ_{i=0}^{N-1} α_i - (1/2) Σ_{i,j=0}^{N-1} α_i α_j t_i t_j (x_i · x_j)
where α_i ≥ 0
There exist (extremely) fast approaches to quadratic optimization in C, C++, Python, Java, and R.
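As an illustration of such a solver, here is a sketch that maximizes W(α) for a tiny made-up 2-D dataset using scipy's general-purpose SLSQP optimizer (an assumed choice for this example; the lecture does not name a specific library). Dedicated SVM solvers are far faster, but the constraints are exactly the ones on this slide: α_i ≥ 0 and Σ α_i t_i = 0.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up, linearly separable toy data: two positives, two negatives.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])

# Gram-like matrix G[i, j] = t_i t_j (x_i . x_j).
G = (t[:, None] * X) @ (t[:, None] * X).T

def neg_dual(a):
    # Minimize -W(alpha); maximizing W(alpha) is the same problem.
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, np.zeros(len(t)), method="SLSQP",
               bounds=[(0.0, None)] * len(t),
               constraints={"type": "eq", "fun": lambda a: a @ t})
alpha = res.x
w = (alpha * t) @ X                 # w = sum_i alpha_i t_i x_i
sv = int(np.argmax(alpha))          # index of a support vector
b = t[sv] - w @ X[sv]               # from t_sv (w . x_sv + b) = 1
print(np.sign(X @ w + b))           # recovers the labels t
```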
Quadratic Programming
minimize f(x) = (1/2) x^T Q x + c^T x
subject to (one or more): A x ≤ k, B x = l
If Q is positive semidefinite, then f(x) is convex. If f(x) is convex, then there is a single minimum.
W(α) = Σ_{i=0}^{N-1} α_i - (1/2) Σ_{i,j=0}^{N-1} α_i α_j t_i t_j (x_i · x_j)
where α_i ≥ 0
Support Vector Expansion
When α_i is non-zero, x_i is a support vector. When α_i is zero, x_i is not a support vector.
w = Σ_{i=0}^{N-1} α_i t_i x_i
New decision function:
D(x) = sign(w^T x + b)
     = sign((Σ_{i=0}^{N-1} α_i t_i x_i)^T x + b)
     = sign(Σ_{i=0}^{N-1} α_i t_i (x_i^T x) + b)
Independent of the dimension of x!
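A sketch of this expanded decision function, which touches the data only through dot products. The alphas, labels, and points are made-up illustrative values, not the output of a real training run.

```python
def decide_expanded(alphas, labels, points, b, x):
    """D(x) = sign(sum_i alpha_i t_i (x_i . x) + b)."""
    s = sum(a * t * sum(pi * xi for pi, xi in zip(p, x))
            for a, t, p in zip(alphas, labels, points)) + b
    return 1 if s >= 0 else -1

# Two "support vectors" with equal weight; every other alpha is zero,
# so those points simply do not appear in the sum.
alphas = [0.5, 0.5]
labels = [1, -1]
points = [[2.0, 0.0], [0.0, 2.0]]
print(decide_expanded(alphas, labels, points, 0.0, [3.0, 1.0]))  # 1
print(decide_expanded(alphas, labels, points, 0.0, [1.0, 3.0]))  # -1
```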
Kuhn-Tucker Conditions
In constraint optimization, at the optimal solution:
Constraint * Lagrange Multiplier = 0
α_i (1 - t_i(w^T x_i + b)) = 0
If α_i ≠ 0, then t_i(w^T x_i + b) = 1.
Only points on the decision boundary contribute to the solution!
Visualization of Support Vectors
(Figure: data points with α = 0 away from the boundary; support vectors with α > 0 on the margin.)
Interpretability of SVM parameters
What else can we tell from the alphas? If an alpha is large, then the associated data point is quite important. It's either an outlier, or incredibly important.
But this only gives us the best solution for linearly separable data sets.
Basis of Kernel Methods
The decision process doesn't depend on the dimensionality of the data.
We can map to a higher-dimensional data space.
Note: data points only appear within a dot product. The error is based on the dot product of data points, not the data points themselves.
W(α) = Σ_{i=0}^{N-1} α_i - (1/2) Σ_{i,j=0}^{N-1} α_i α_j t_i t_j (x_i · x_j)
w = Σ_{i=0}^{N-1} α_i t_i x_i
Basis of Kernel Methods
Since data points only appear within a dot product, we can map to another space through a replacement:
x_i · x_j  →  Φ(x_i) · Φ(x_j)
The error is based on the dot product of data points, not the data points themselves.
W(α) = Σ_{i=0}^{N-1} α_i - (1/2) Σ_{i,j=0}^{N-1} α_i α_j t_i t_j (x_i · x_j)
becomes
W(α) = Σ_{i=0}^{N-1} α_i - (1/2) Σ_{i,j=0}^{N-1} α_i α_j t_i t_j (Φ(x_i) · Φ(x_j))
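A sketch of the replacement: the only change to the machinery is swapping the dot-product function. The RBF kernel below is a standard example of a k(x_i, x_j) = Φ(x_i) · Φ(x_j); the gamma value is an illustrative choice, not something this slide specifies.

```python
import math

def linear_kernel(x, z):
    """Plain dot product: the kernel used so far."""
    return sum(xi * zi for xi, zi in zip(x, z))

def rbf_kernel(x, z, gamma=1.0):
    """k(x, z) = exp(-gamma * ||x - z||^2), a dot product in an implicit space."""
    sq = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq)

print(linear_kernel([1.0, 2.0], [3.0, 4.0]))  # 11.0
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))     # 1.0 (identical points)
```

The dual objective W(α) only touches the data through these calls, so swapping the kernel changes the feature space without changing the QP.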
Learning Theory bases of SVMs
Theoretical bounds on testing error:
The upper bound doesn't depend on the dimensionality of the space.
The lower bound is maximized by maximizing the margin associated with the decision boundary.
Why we like SVMs
They work. Good generalization.
Easily interpreted: the decision boundary is based on the data, in the form of the support vectors. (Not so in multilayer perceptron networks.)
Principled bounds on testing error from Learning Theory (VC dimension).
SVM vs. MLP
SVMs have many fewer parameters. SVM: maybe just a kernel parameter. MLP: number and arrangement of nodes, and eta, the learning rate.
SVM: convex optimization task. MLP: likelihood is non-convex -- local minima.
R(θ) = (1/N) Σ_{n=0}^{N-1} (1/2) (y_n - g(Σ_k w_kl g(Σ_j w_jk g(Σ_i w_ij x_{n,i}))))^2
Soft margin classification
There can be outliers on the other side of the decision boundary, or leading to a small margin.
Solution: Introduce a penalty term to the constraint function:
min ||w|| + C Σ_{i=0}^{N-1} ξ_i, where t_i(w^T x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0
L(w, b) = (1/2) w · w + C Σ_{i=0}^{N-1} ξ_i - Σ_{i=0}^{N-1} α_i [t_i((w · x_i) + b) + ξ_i - 1]
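A sketch of the penalized objective, with made-up data and a made-up hyperplane: the slack ξ_i = max(0, 1 - t_i(w · x_i + b)) measures how far each point falls short of its margin constraint.

```python
def soft_margin_objective(w, b, X, t, C):
    """(1/2) w.w + C * sum_i xi_i, with xi_i = max(0, 1 - t_i (w . x_i + b))."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    slacks = [max(0.0, 1.0 - ti * (sum(wi * xi for wi, xi in zip(w, x)) + b))
              for x, ti in zip(X, t)]
    return margin_term + C * sum(slacks), slacks

# Illustrative 1-D-style data on the x-axis with hyperplane w = (1, 0), b = 0.
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]]
t = [1, 1, -1]
obj, slacks = soft_margin_objective([1.0, 0.0], 0.0, X, t, C=10.0)
print(slacks)  # [0.0, 0.5, 0.0] -- only the second point violates its margin
print(obj)     # 5.5
```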
Soft Max Dual
min ||w|| + C Σ_{i=0}^{N-1} ξ_i, where t_i(w^T x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0
L(w, b) = (1/2) w · w + C Σ_{i=0}^{N-1} ξ_i - Σ_{i=0}^{N-1} α_i [t_i((w · x_i) + b) + ξ_i - 1]
W(α) = Σ_{i=0}^{N-1} α_i - (1/2) Σ_{i,j=0}^{N-1} t_i t_j α_i α_j (x_i · x_j)
where 0 ≤ α_i ≤ C and Σ_{i=0}^{N-1} α_i t_i = 0
Still Quadratic Programming!
Soft margin example
Points are allowed within the margin, but a cost is introduced.
(Figure: points inside the margin with slack ξ_i; Hinge Loss.)
Probabilities from SVMs
Support Vector Machines are discriminant functions.
Discriminant functions: f(x) = c
Discriminative models: f(x) = argmax_c p(c|x)
Generative models: f(x) = argmax_c p(x|c)p(c)/p(x)
No (principled) probabilities from SVMs. SVMs are not based on probability distribution functions of class instances.
Efficiency of SVMs
Not especially fast.
Training: n^3 -- quadratic programming efficiency.
Evaluation: n -- need to evaluate against each support vector (potentially n of them).
Good Bye
Next time: The Kernel Trick -> Kernel Methods, or: How can we use SVMs on data that are not linearly separable?