Multiclass Classification
Alan Ritter (many slides from Greg Durrett, Vivek Srikumar, Stanford CS231n)
This Lecture
‣ Multiclass fundamentals
‣ Multiclass logistic regression
‣ Multiclass SVM
‣ Feature extraction
‣ Optimization
Multiclass Fundamentals
Text Classification
~20 classes
Sports
Health
Image Classification
‣ Thousands of classes (ImageNet)
Car
Dog
Entity Linking
‣ 4,500,000 classes (all articles in Wikipedia)
Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.
Lance Edward Armstrong is an American former professional road cyclist
Armstrong County is a county in Pennsylvania…
??
Reading Comprehension
‣ Multiple choice questions, 4 classes (but classes change per example)
Richardson (2013)
Binary Classification
‣ Binary classification: one weight vector defines positive and negative classes
[figure: positive (+) and negative (-) training points separated by a single linear boundary]
Multiclass Classification
[figure: data points from three classes (1, 2, 3) in feature space]
‣ Can we just use binary classifiers here?
Multiclass Classification
‣ One-vs-all: train k classifiers, one to distinguish each class from all the rest
[figure: the three-class data shown with one-vs-rest separators]
‣ How do we reconcile multiple positive predictions? Highest score? (see the sketch below)
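A minimal sketch (not from the slides) of reconciling one-vs-all predictions: given k binary weight vectors, score all of them and take the highest, even when several classifiers fire positive. The weights and the 2-feature input are made-up numbers for illustration.

```python
import numpy as np

def one_vs_all_predict(W, x):
    """W: (k, d) matrix, one binary weight vector per class; x: (d,) feature vector."""
    scores = W @ x                 # one binary score per class
    return int(np.argmax(scores))  # highest score wins, even if several are > 0

# toy example: k = 3 classes, d = 2 features (made-up numbers)
W = np.array([[ 1.0, -0.5],
              [-0.2,  0.8],
              [ 0.3,  0.3]])
x = np.array([0.9, 0.4])
print(one_vs_all_predict(W, x))    # -> 0
```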
Multiclass Classification
‣ Not all classes may even be separable using this approach
[figure: class 3 lies between classes 1 and 2, so no single line separates it from the rest]
‣ Can separate 1 from 2+3 and 2 from 1+3, but not 3 from the others (with these features)
Multiclass Classification
‣ All-vs-all: train n(n-1)/2 classifiers to differentiate each pair of classes
[figure: pairwise subsets of the data, e.g. classes 1 and 3 only, and classes 1 and 2 only]
‣ Again, how to reconcile?
Multiclass Classification
[figure: the binary (+/-) data and the three-class (1/2/3) data side by side]
‣ Binary classification: one weight vector defines both classes
‣ Multiclass classification: different weights and/or features per class
Multiclass Classification
‣ Formally: instead of two labels, we have an output space Y containing a number of possible classes
‣ Same machinery that we'll use later for exponentially large output spaces, including sequences and trees
‣ Decision rule: argmax_{y ∈ Y} w^⊤ f(x, y)
  ‣ Multiple feature vectors, one weight vector
  ‣ Features depend on the choice of label now! Note: this isn't the gold label
‣ Can also have one weight vector per class: argmax_{y ∈ Y} w_y^⊤ f(x)
‣ The single weight vector approach will generalize to structured output spaces, whereas per-class weight vectors won't
Feature Extraction
Block Feature Vectors
‣ Decision rule: argmax_{y ∈ Y} w^⊤ f(x, y)
‣ Example: too many drug trials, too few patients; classes: Health, Sports, Science
‣ Base feature function: f(x) = [I[contains drug], I[contains patients], I[contains baseball]] = [1, 1, 0]
‣ Feature vector blocks for each label, e.g. I[contains drug & label = Health]:
  f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]
  f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]
‣ Equivalent to having three weight vectors in this case (a small sketch follows below)
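A small sketch of how block feature vectors can be built, assuming a toy bag-of-words base feature function like the slide's; the helper names and tokenization are illustrative, not part of the lecture.

```python
import numpy as np

LABELS = ["Health", "Sports", "Science"]

def base_features(text):
    # toy base feature function f(x), mirroring the slide's indicators
    words = text.lower().split()
    return np.array([float("drug" in words),
                     float("patients" in words),
                     float("baseball" in words)])

def block_features(text, y):
    # copy f(x) into the block for label y; all other blocks stay zero
    fx = base_features(text)
    f = np.zeros(len(LABELS) * len(fx))
    i = LABELS.index(y)
    f[i * len(fx):(i + 1) * len(fx)] = fx
    return f

x = "too many drug trials , too few patients"
print(block_features(x, "Health"))   # [1. 1. 0. 0. 0. 0. 0. 0. 0.]
print(block_features(x, "Sports"))   # [0. 0. 0. 1. 1. 0. 0. 0. 0.]
```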
Making Decisions
‣ Example: too many drug trials, too few patients; classes: Health, Sports, Science
  f(x) = [I[contains drug], I[contains patients], I[contains baseball]]
  f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]
  f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]
  w = [+2.1, +2.3, -5, -2.1, -3.8, 0, +1.1, -1.7, -1.3]   ("word drug in Science article" = +1.1)
  w^⊤ f(x, y) = Health: +4.4, Sports: -5.9, Science: -0.6
  argmax → Health (checked in the sketch below)
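A quick check of the decision rule on the slide's numbers; the block feature vectors and weight vector are copied from above (the Science block features follow the same pattern).

```python
import numpy as np

# Score each label with the shared weight vector and block features, then take the argmax.
w = np.array([2.1, 2.3, -5.0, -2.1, -3.8, 0.0, 1.1, -1.7, -1.3])
f = {
    "Health":  np.array([1., 1, 0, 0, 0, 0, 0, 0, 0]),
    "Sports":  np.array([0., 0, 0, 1, 1, 0, 0, 0, 0]),
    "Science": np.array([0., 0, 0, 0, 0, 0, 1, 1, 0]),
}
scores = {y: float(w @ fy) for y, fy in f.items()}
print(scores)                        # Health ~ +4.4, Sports ~ -5.9, Science ~ -0.6
print(max(scores, key=scores.get))   # Health
```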
Another example: POS tagging
‣ Classify blocks as one of 36 POS tags (candidates shown: NNS, VBZ, NN, DT, …)
‣ Example x: sentence with a word (in this case, blocks) highlighted: the router blocks the packets
‣ Extract features with respect to this word (a small sketch follows after this slide):
  f(x, y = VBZ) = I[curr_word = blocks & tag = VBZ], I[prev_word = router & tag = VBZ], I[next_word = the & tag = VBZ], I[curr_suffix = s & tag = VBZ]
  (not saying that the is tagged as VBZ! saying that the follows the VBZ word)
‣ Next two lectures: sequence labeling!
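A small illustrative sketch of extracting the tag-conjoined indicator features for the highlighted word. Feature names are plain strings here; a real system would map them to weight indices (an assumption, not from the slides).

```python
def pos_features(words, i, tag):
    # indicator features for word i, conjoined with the candidate tag
    feats = [
        f"curr_word={words[i]}&tag={tag}",
        f"curr_suffix={words[i][-1]}&tag={tag}",
    ]
    if i > 0:
        feats.append(f"prev_word={words[i - 1]}&tag={tag}")
    if i + 1 < len(words):
        feats.append(f"next_word={words[i + 1]}&tag={tag}")
    return feats

sentence = "the router blocks the packets".split()
print(pos_features(sentence, 2, "VBZ"))
# ['curr_word=blocks&tag=VBZ', 'curr_suffix=s&tag=VBZ',
#  'prev_word=router&tag=VBZ', 'next_word=the&tag=VBZ']
```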
Multiclass Logistic Regression
Multiclass Logistic Regression
‣ Compare to binary:
  P(y = 1 | x) = exp(w^⊤ f(x)) / (1 + exp(w^⊤ f(x)))
  (the negative class implicitly had f(x, y = 0) = the zero vector)
‣ Multiclass:
  P_w(y | x) = exp(w^⊤ f(x, y)) / Σ_{y' ∈ Y} exp(w^⊤ f(x, y'))
  (sum over output space to normalize)
Multiclass Logistic Regression
  P_w(y | x) = exp(w^⊤ f(x, y)) / Σ_{y' ∈ Y} exp(w^⊤ f(x, y'))   (sum over output space to normalize)
‣ Why? Interpret raw classifier scores as probabilities
  Example: too many drug trials, too few patients   (gold label: Health)
  w^⊤ f(x, y): Health: +2.2, Sports: +3.1, Science: -0.6
  exp (probabilities must be >= 0) → unnormalized probabilities: 6.05, 22.2, 0.55
  normalize (probabilities must sum to 1) → probabilities: 0.21, 0.77, 0.02   (the softmax function; a minimal sketch follows below)
  compare to the correct (gold) probabilities: 1.00, 0.00, 0.00
‣ Objective: L(x, y) = Σ_{j=1}^n log P(y*_j | x_j),   with L(x_j, y*_j) = w^⊤ f(x_j, y*_j) - log Σ_y exp(w^⊤ f(x_j, y))
  log(0.21) = -1.56
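A minimal softmax sketch using the raw scores above: exponentiating makes every score nonnegative, and normalizing makes the values sum to 1. The max-subtraction is a standard numerical-stability trick, not something shown on the slide.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / e.sum()                    # nonnegative and sums to 1

scores = np.array([2.2, 3.1, -0.6])       # raw scores for Health, Sports, Science
probs = softmax(scores)
print(probs, probs.sum())                 # most of the mass goes to Sports; sums to 1
```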
Multiclass Logistic Regression
  P_w(y | x) = exp(w^⊤ f(x, y)) / Σ_{y' ∈ Y} exp(w^⊤ f(x, y'))   (sum over output space to normalize)
‣ Training: maximize
  L(x, y) = Σ_{j=1}^n log P(y*_j | x_j) = Σ_{j=1}^n [ w^⊤ f(x_j, y*_j) - log Σ_y exp(w^⊤ f(x_j, y)) ]
Training
‣ Multiclass logistic regression: P_w(y | x) = exp(w^⊤ f(x, y)) / Σ_{y' ∈ Y} exp(w^⊤ f(x, y'))
‣ Likelihood: L(x_j, y*_j) = w^⊤ f(x_j, y*_j) - log Σ_y exp(w^⊤ f(x_j, y))
‣ Gradient:
  ∂L(x_j, y*_j)/∂w_i = f_i(x_j, y*_j) - [Σ_y f_i(x_j, y) exp(w^⊤ f(x_j, y))] / [Σ_y exp(w^⊤ f(x_j, y))]
  = f_i(x_j, y*_j) - Σ_y f_i(x_j, y) P_w(y | x_j)
  = f_i(x_j, y*_j) - E_y[f_i(x_j, y)]
  (gold feature value minus the model's expectation of the feature value; a small sketch follows below)
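A minimal sketch of this gradient for a single example: the gold feature vector minus the model's expected feature vector. Here `feats(x, y)` is an assumed helper returning the block feature vector f(x, y); the function name is illustrative.

```python
import numpy as np

def log_reg_gradient(w, x, y_star, labels, feats):
    fs = np.stack([feats(x, y) for y in labels])   # |Y| x d matrix of f(x, y)
    scores = fs @ w
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                           # P_w(y | x) via softmax
    expected = probs @ fs                          # sum_y P_w(y|x) f(x, y)
    return feats(x, y_star) - expected             # direction of steepest ascent
```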
Training
‣ Example: too many drug trials, too few patients,   y* = Health
  f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]
  f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]
  P_w(y | x) = [0.21, 0.77, 0.02]
  ∂L(x_j, y*_j)/∂w_i = f_i(x_j, y*_j) - Σ_y f_i(x_j, y) P_w(y | x_j)
  gradient: [1, 1, 0, 0, 0, 0, 0, 0, 0] - 0.21 [1, 1, 0, 0, 0, 0, 0, 0, 0] - 0.77 [0, 0, 0, 1, 1, 0, 0, 0, 0] - 0.02 [0, 0, 0, 0, 0, 0, 1, 1, 0]
  = [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0]
  update: [1.3, 0.9, -5, 3.2, -0.1, 0, 1.1, -1.7, -1.3] + [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0] = [2.09, 1.69, -5, 2.43, -0.87, 0, 1.08, -1.72, -1.3]
  new P_w(y | x) = [0.89, 0.10, 0.01]   (reproduced in the sketch below)
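A small script reproducing the arithmetic above, taking P_w(y | x) = [0.21, 0.77, 0.02] and the starting weights shown on the slide as given, and applying one gradient-ascent step with step size 1.

```python
import numpy as np

f_health  = np.array([1., 1, 0, 0, 0, 0, 0, 0, 0])
f_sports  = np.array([0., 0, 0, 1, 1, 0, 0, 0, 0])
f_science = np.array([0., 0, 0, 0, 0, 0, 1, 1, 0])
p = np.array([0.21, 0.77, 0.02])             # P_w(Health|x), P_w(Sports|x), P_w(Science|x)

# gradient = f(x, y*) - sum_y P_w(y|x) f(x, y), with gold y* = Health
grad = f_health - (p[0] * f_health + p[1] * f_sports + p[2] * f_science)
w = np.array([1.3, 0.9, -5.0, 3.2, -0.1, 0.0, 1.1, -1.7, -1.3])
w_new = w + grad                              # step size 1, as on the slide
print(np.round(grad, 2))    # [ 0.79  0.79  0.   -0.77 -0.77  0.   -0.02 -0.02  0.  ]
print(np.round(w_new, 2))   # [ 2.09  1.69 -5.    2.43 -0.87  0.    1.08 -1.72 -1.3 ]
```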
Logistic Regression: Summary
‣ Model: P_w(y | x) = exp(w^⊤ f(x, y)) / Σ_{y' ∈ Y} exp(w^⊤ f(x, y'))
‣ Inference: argmax_y P_w(y | x)
‣ Learning: gradient ascent on the discriminative log-likelihood
  f(x, y*) - E_y[f(x, y)] = f(x, y*) - Σ_y [P_w(y | x) f(x, y)]
  "towards gold feature value, away from expectation of feature value"
Multiclass SVM
Soft Margin SVM
‣ Minimize λ‖w‖₂² + Σ_{j=1}^m ξ_j
  s.t. ∀j (2y_j - 1)(w^⊤ x_j) ≥ 1 - ξ_j and ∀j ξ_j ≥ 0
  (slack variables ξ_j > 0 iff the example is a support vector)
Image credit: Lang Van Tran
Multiclass SVM
‣ Correct prediction now has to beat every other class
‣ Minimize λ‖w‖₂² + Σ_{j=1}^m ξ_j
  s.t. ∀j ∀y ∈ Y: w^⊤ f(x_j, y*_j) ≥ w^⊤ f(x_j, y) + ℓ(y, y*_j) - ξ_j and ∀j ξ_j ≥ 0
‣ The 1 that was here is replaced by a loss function
‣ Score comparison is more explicit now
‣ Slack variables ξ_j > 0 iff the example is a support vector
Training (loss-augmented)
‣ Are all decisions equally costly?
  Example: too many drug trials, too few patients   (gold: Health)
  Predicted Science: not so bad
  Predicted Sports: bad error
‣ We can define a loss function ℓ(y, y*):
  ℓ(y = Sports, y* = Health) = 3
  ℓ(y = Science, y* = Health) = 1
Multiclass SVM
‣ Loss-augmented scores w^⊤ f(x, y) + ℓ(y, y*) for each y ∈ Y (gold: Health):
  Health: 2.4 + 0
  Science: 1.8 + 1
  Sports: 1.3 + 3
‣ Does gold beat every label + loss? No!
‣ Perceptron would make no update here
‣ Most violated constraint is Sports; what is ξ_j?
  ∀j ∀y ∈ Y: w^⊤ f(x_j, y*_j) ≥ w^⊤ f(x_j, y) + ℓ(y, y*_j) - ξ_j
‣ ξ_j = 4.3 - 2.4 = 1.9   (a small sketch follows below)
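A tiny sketch reproducing the slack computation above; the pairing of raw scores with classes follows the ξ_j arithmetic on the slide.

```python
# raw scores w.f(x, y) and losses l(y, y*), with gold y* = Health
scores = {"Health": 2.4, "Science": 1.8, "Sports": 1.3}
loss   = {"Health": 0.0, "Science": 1.0, "Sports": 3.0}

augmented = {y: scores[y] + loss[y] for y in scores}   # Health 2.4, Science 2.8, Sports 4.3
y_max = max(augmented, key=augmented.get)              # most violated constraint: Sports
xi = augmented[y_max] - scores["Health"]               # 4.3 - 2.4 = 1.9 (>= 0 since gold is in the max)
print(y_max, round(xi, 2))
```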
Multiclass SVM
‣ Minimize λ‖w‖₂² + Σ_{j=1}^m ξ_j
  s.t. ∀j ξ_j ≥ 0 and ∀j ∀y ∈ Y: w^⊤ f(x_j, y*_j) ≥ w^⊤ f(x_j, y) + ℓ(y, y*_j) - ξ_j
‣ One slack variable per example, so it's set to be whatever the most violated constraint is for that example:
  ξ_j = max_{y ∈ Y} [ w^⊤ f(x_j, y) + ℓ(y, y*_j) ] - w^⊤ f(x_j, y*_j)
‣ Plug in the gold y and you get 0, so slack is always nonnegative!
Computing the Subgradient
‣ Minimize λ‖w‖₂² + Σ_{j=1}^m ξ_j
  s.t. ∀j ξ_j ≥ 0 and ∀j ∀y ∈ Y: w^⊤ f(x_j, y*_j) ≥ w^⊤ f(x_j, y) + ℓ(y, y*_j) - ξ_j
  ξ_j = max_{y ∈ Y} [ w^⊤ f(x_j, y) + ℓ(y, y*_j) ] - w^⊤ f(x_j, y*_j)
‣ If ξ_j = 0, the example is not a support vector and the gradient is zero
‣ Otherwise, ∂ξ_j/∂w_i = f_i(x_j, y_max) - f_i(x_j, y*_j)
  (the update looks backwards: we're minimizing here!)
‣ Perceptron-like, but we update away from the *loss-augmented* prediction (sketch below)
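A minimal sketch of one subgradient step for a single example, assuming helpers `feats(x, y)` and `loss(y, y_star)` (illustrative names). If the loss-augmented argmax is the gold label, the slack is zero and nothing changes; otherwise we step toward the gold features and away from the loss-augmented prediction, since we are minimizing ξ_j.

```python
import numpy as np

def svm_subgradient_step(w, x, y_star, labels, feats, loss, lr=0.1):
    # loss-augmented decode: the most violated constraint for this example
    aug = {y: float(w @ feats(x, y)) + loss(y, y_star) for y in labels}
    y_max = max(aug, key=aug.get)
    if y_max == y_star:                          # xi_j = 0: not a support vector, zero gradient
        return w
    # d(xi_j)/dw = f(x, y_max) - f(x, y*); step in the negative of that direction
    return w + lr * (feats(x, y_star) - feats(x, y_max))
```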
Putting it Together
‣ Minimize λ‖w‖₂² + Σ_{j=1}^m ξ_j
  s.t. ∀j ξ_j ≥ 0 and ∀j ∀y ∈ Y: w^⊤ f(x_j, y*_j) ≥ w^⊤ f(x_j, y) + ℓ(y, y*_j) - ξ_j
‣ (Unregularized) gradients:
  ‣ SVM: f(x, y*) - f(x, y_max)   (loss-augmented max)
  ‣ Logreg: f(x, y*) - E_y[f(x, y)] = f(x, y*) - Σ_y [P_w(y | x) f(x, y)]
‣ SVM: max over y's (with ℓ(y, y*)) to compute the gradient. LR: need to sum over y's
Optimization
Recap
‣ Four elements of a machine learning method:
  ‣ Model: probabilistic, max-margin, deep neural network
  ‣ Objective: e.g., log-likelihood (logistic regression) or the margin objective (SVM)
  ‣ Inference: just maxes and simple expectations so far, but will get harder
  ‣ Training: gradient descent?
Optimization
‣ Stochastic gradient *ascent*
‣ Very simple to code up:
  w ← w + α g,   g = ∂L/∂w
  [figure from CS231n (Fei-Fei Li, Justin Johnson, Serena Yeung): loss contours over two weights W_1, W_2]
  (a minimal SGD sketch follows below)
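A minimal stochastic gradient ascent loop, assuming a helper `gradient_fn(w, example)` such as the logistic regression gradient sketched earlier; the loop structure and learning rate are illustrative.

```python
import random

def sgd_ascent(w, data, gradient_fn, lr=0.1, epochs=5):
    for _ in range(epochs):
        random.shuffle(data)                # visit examples in a random order
        for example in data:
            g = gradient_fn(w, example)     # per-example gradient dL/dw
            w = w + lr * g                  # ascent: step along the gradient
    return w
```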
Optimization
‣ Stochastic gradient *ascent*; very simple to code up: w ← w + α g, g = ∂L/∂w
‣ What if the loss changes quickly in one direction and slowly in another direction? What does gradient descent do?
  The loss function has a high condition number: the ratio of the largest to smallest singular value of the Hessian matrix is large
  Very slow progress along the shallow dimension, jitter along the steep direction
Optimization
‣ Stochastic gradient *ascent*; very simple to code up: w ← w + α g, g = ∂L/∂w
‣ What if the loss function has a local minimum or saddle point?
  Saddle points are much more common in high dimensions
  Dauphin et al., "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization", NIPS 2014
Optimization
‣ Stochastic gradient *ascent*; very simple to code up: w ← w + α g, g = ∂L/∂w
‣ "First-order" technique: only relies on having the gradient
  First-Order Optimization: (1) use the gradient to form a linear approximation of the loss, (2) step to minimize the approximation
  Second-Order Optimization: (1) use the gradient and Hessian to form a quadratic approximation, (2) step to the minimum of the approximation
  [figures from CS231n: the loss as a function of w1, with linear and quadratic approximations]
Optimization
‣ Stochastic gradient *ascent*; very simple to code up: w ← w + α g, g = ∂L/∂w
‣ "First-order" technique: only relies on having the gradient
‣ Setting the step size is hard (decrease it when held-out performance worsens?)
‣ Newton's method: w ← w + (∂²L/∂w²)^{-1} g
  ‣ Second-order technique
  ‣ Optimizes a quadratic instantly
  ‣ Inverse Hessian: n x n matrix, expensive!
‣ Quasi-Newton methods: L-BFGS, etc. approximate the inverse Hessian
AdaGrad
‣ Duchi et al. (2011): optimized for problems with sparse features
‣ Per-parameter learning rate: smaller updates are made to parameters that get updated frequently
‣ Added element-wise scaling of the gradient based on the historical sum of squares in each dimension ("per-parameter learning rates" or "adaptive learning rates")
  w_i ← w_i + α g_{t,i} / √(ε + Σ_{τ=1}^t g²_{τ,i})
  ((smoothed) sum of squared gradients from all updates)
‣ Generally more robust than SGD, requires less tuning of the learning rate
‣ Other techniques for optimizing deep models: more later!
  Duchi et al., "Adaptive subgradient methods for online learning and stochastic optimization", JMLR 2011
  (a minimal AdaGrad sketch follows below)
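A minimal AdaGrad sketch of the update above: keep a per-dimension sum of squared gradients and scale each coordinate's step by its inverse square root. The smoothing constant ε, the step size, and the toy gradients are illustrative.

```python
import numpy as np

def adagrad_step(w, g, sum_sq, lr=0.1, eps=1e-8):
    sum_sq = sum_sq + g ** 2                    # historical sum of squared gradients, per dimension
    w = w + lr * g / np.sqrt(eps + sum_sq)      # ascent step with a per-parameter learning rate
    return w, sum_sq

w, sum_sq = np.zeros(3), np.zeros(3)
for g in [np.array([1.0, 0.1, 0.0]), np.array([1.0, 0.0, 0.2])]:   # made-up gradients
    w, sum_sq = adagrad_step(w, g, sum_sq)
print(w)   # the frequently-updated first coordinate takes smaller steps per unit gradient
```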
Summary
‣ Design tradeoffs need to reflect interactions:
‣ Model and objective are coupled: probabilistic model <-> maximize likelihood
‣ …but not always: a linear model or neural network can be trained to minimize any differentiable loss function
‣ Inference governs what learning we can do: need to be able to compute expectations to use logistic regression