Multiclass Classification
Alan Ritter (many slides from Greg Durrett, Vivek Srikumar, Stanford CS231n)
This Lecture
‣ Multiclass fundamentals
‣ Multiclass logistic regression
‣ Multiclass SVM
‣ Feature extraction
‣ Optimization
Multiclass Fundamentals
Text Classification
~20 classes
Sports
Health
Image Classification
‣ Thousands of classes (ImageNet)
Car
Dog
Entity Linking
‣ 4,500,000 classes (all articles in Wikipedia)
Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.
Lance Edward Armstrong is an American former professional road cyclist
Armstrong County is a county in Pennsylvania…
??
Reading Comprehension
‣ Multiple choice questions, 4 classes (but classes change per example)
Richardson (2013)
Binary Classification
‣ Binary classification: one weight vector defines positive and negative classes
[figure: positive (+) and negative (-) training points separated by a single linear boundary]
Multiclass Classification
[figure: data points from three classes (1, 2, 3) in feature space]
‣ Can we just use binary classifiers here?
Multiclass Classification
‣ One-vs-all: train k classifiers, one to distinguish each class from all the rest
[figure: the three-class data shown with one-vs-rest separators]
‣ How do we reconcile multiple positive predictions? Highest score? (see the sketch below)
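A minimal sketch (not from the slides) of reconciling one-vs-all predictions: given k binary weight vectors, score all of them and take the highest, even when several classifiers fire positive. The weights and the 2-feature input are made-up numbers for illustration.

```python
import numpy as np

def one_vs_all_predict(W, x):
    """W: (k, d) matrix, one binary weight vector per class; x: (d,) feature vector."""
    scores = W @ x                 # one binary score per class
    return int(np.argmax(scores))  # highest score wins, even if several are > 0

# toy example: k = 3 classes, d = 2 features (made-up numbers)
W = np.array([[ 1.0, -0.5],
              [-0.2,  0.8],
              [ 0.3,  0.3]])
x = np.array([0.9, 0.4])
print(one_vs_all_predict(W, x))    # -> 0
```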
Multiclass Classification
‣ Not all classes may even be separable using this approach
[figure: class 3 lies between classes 1 and 2, so no single line separates it from the rest]
‣ Can separate 1 from 2+3 and 2 from 1+3, but not 3 from the others (with these features)
Multiclass Classification
‣ All-vs-all: train n(n-1)/2 classifiers to differentiate each pair of classes
[figure: pairwise subsets of the data, e.g. classes 1 and 3 only, and classes 1 and 2 only]
‣ Again, how to reconcile?
Multiclass Classification
[figure: the binary (+/-) data and the three-class (1/2/3) data side by side]
‣ Binary classification: one weight vector defines both classes
‣ Multiclass classification: different weights and/or features per class
Multiclass Classification
‣ Formally: instead of two labels, we have an output space Y containing a number of possible classes
‣ Same machinery that we'll use later for exponentially large output spaces, including sequences and trees
‣ Decision rule: argmax_{y ∈ Y} w^⊤ f(x, y)
  ‣ Multiple feature vectors, one weight vector
  ‣ Features depend on the choice of label now! Note: this isn't the gold label
‣ Can also have one weight vector per class: argmax_{y ∈ Y} w_y^⊤ f(x)
‣ The single weight vector approach will generalize to structured output spaces, whereas per-class weight vectors won't
Feature Extraction
Block Feature Vectors
‣ Decision rule: argmax_{y ∈ Y} w^⊤ f(x, y)
‣ Example: too many drug trials, too few patients; classes: Health, Sports, Science
‣ Base feature function: f(x) = [I[contains drug], I[contains patients], I[contains baseball]] = [1, 1, 0]
‣ Feature vector blocks for each label, e.g. I[contains drug & label = Health]:
  f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]
  f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]
‣ Equivalent to having three weight vectors in this case (a small sketch follows below)
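A small sketch of how block feature vectors can be built, assuming a toy bag-of-words base feature function like the slide's; the helper names and tokenization are illustrative, not part of the lecture.

```python
import numpy as np

LABELS = ["Health", "Sports", "Science"]

def base_features(text):
    # toy base feature function f(x), mirroring the slide's indicators
    words = text.lower().split()
    return np.array([float("drug" in words),
                     float("patients" in words),
                     float("baseball" in words)])

def block_features(text, y):
    # copy f(x) into the block for label y; all other blocks stay zero
    fx = base_features(text)
    f = np.zeros(len(LABELS) * len(fx))
    i = LABELS.index(y)
    f[i * len(fx):(i + 1) * len(fx)] = fx
    return f

x = "too many drug trials , too few patients"
print(block_features(x, "Health"))   # [1. 1. 0. 0. 0. 0. 0. 0. 0.]
print(block_features(x, "Sports"))   # [0. 0. 0. 1. 1. 0. 0. 0. 0.]
```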
Making Decisions
‣ Example: too many drug trials, too few patients; classes: Health, Sports, Science
  f(x) = [I[contains drug], I[contains patients], I[contains baseball]]
  f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]
  f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]
  w = [+2.1, +2.3, -5, -2.1, -3.8, 0, +1.1, -1.7, -1.3]   ("word drug in Science article" = +1.1)
  w^⊤ f(x, y) = Health: +4.4, Sports: -5.9, Science: -0.6
  argmax → Health (checked in the sketch below)
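A quick check of the decision rule on the slide's numbers; the block feature vectors and weight vector are copied from above (the Science block features follow the same pattern).

```python
import numpy as np

# Score each label with the shared weight vector and block features, then take the argmax.
w = np.array([2.1, 2.3, -5.0, -2.1, -3.8, 0.0, 1.1, -1.7, -1.3])
f = {
    "Health":  np.array([1., 1, 0, 0, 0, 0, 0, 0, 0]),
    "Sports":  np.array([0., 0, 0, 1, 1, 0, 0, 0, 0]),
    "Science": np.array([0., 0, 0, 0, 0, 0, 1, 1, 0]),
}
scores = {y: float(w @ fy) for y, fy in f.items()}
print(scores)                        # Health ~ +4.4, Sports ~ -5.9, Science ~ -0.6
print(max(scores, key=scores.get))   # Health
```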
Another example: POS tagging
‣ Classify blocks as one of 36 POS tags (candidates shown: NNS, VBZ, NN, DT, …)
‣ Example x: sentence with a word (in this case, blocks) highlighted: the router blocks the packets
‣ Extract features with respect to this word (a small sketch follows after this slide):
  f(x, y = VBZ) = I[curr_word = blocks & tag = VBZ], I[prev_word = router & tag = VBZ], I[next_word = the & tag = VBZ], I[curr_suffix = s & tag = VBZ]
  (not saying that the is tagged as VBZ! saying that the follows the VBZ word)
‣ Next two lectures: sequence labeling!
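A small illustrative sketch of extracting the tag-conjoined indicator features for the highlighted word. Feature names are plain strings here; a real system would map them to weight indices (an assumption, not from the slides).

```python
def pos_features(words, i, tag):
    # indicator features for word i, conjoined with the candidate tag
    feats = [
        f"curr_word={words[i]}&tag={tag}",
        f"curr_suffix={words[i][-1]}&tag={tag}",
    ]
    if i > 0:
        feats.append(f"prev_word={words[i - 1]}&tag={tag}")
    if i + 1 < len(words):
        feats.append(f"next_word={words[i + 1]}&tag={tag}")
    return feats

sentence = "the router blocks the packets".split()
print(pos_features(sentence, 2, "VBZ"))
# ['curr_word=blocks&tag=VBZ', 'curr_suffix=s&tag=VBZ',
#  'prev_word=router&tag=VBZ', 'next_word=the&tag=VBZ']
```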
Multiclass Logistic Regression
Multiclass Logistic Regression
‣ Compare to binary:
  P(y = 1 | x) = exp(w^⊤ f(x)) / (1 + exp(w^⊤ f(x)))
  (the negative class implicitly had f(x, y = 0) = the zero vector)
‣ Multiclass:
  P_w(y | x) = exp(w^⊤ f(x, y)) / Σ_{y' ∈ Y} exp(w^⊤ f(x, y'))
  (sum over output space to normalize)
Multiclass Logistic Regression
  P_w(y | x) = exp(w^⊤ f(x, y)) / Σ_{y' ∈ Y} exp(w^⊤ f(x, y'))   (sum over output space to normalize)
‣ Why? Interpret raw classifier scores as probabilities
  Example: too many drug trials, too few patients   (gold label: Health)
  w^⊤ f(x, y): Health: +2.2, Sports: +3.1, Science: -0.6
  exp (probabilities must be >= 0) → unnormalized probabilities: 6.05, 22.2, 0.55
  normalize (probabilities must sum to 1) → probabilities: 0.21, 0.77, 0.02   (the softmax function; a minimal sketch follows below)
  compare to the correct (gold) probabilities: 1.00, 0.00, 0.00
‣ Objective: L(x, y) = Σ_{j=1}^n log P(y*_j | x_j),   with L(x_j, y*_j) = w^⊤ f(x_j, y*_j) - log Σ_y exp(w^⊤ f(x_j, y))
  log(0.21) = -1.56
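A minimal softmax sketch using the raw scores above: exponentiating makes every score nonnegative, and normalizing makes the values sum to 1. The max-subtraction is a standard numerical-stability trick, not something shown on the slide.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / e.sum()                    # nonnegative and sums to 1

scores = np.array([2.2, 3.1, -0.6])       # raw scores for Health, Sports, Science
probs = softmax(scores)
print(probs, probs.sum())                 # most of the mass goes to Sports; sums to 1
```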
Multiclass Logistic Regression
  P_w(y | x) = exp(w^⊤ f(x, y)) / Σ_{y' ∈ Y} exp(w^⊤ f(x, y'))   (sum over output space to normalize)
‣ Training: maximize
  L(x, y) = Σ_{j=1}^n log P(y*_j | x_j) = Σ_{j=1}^n [ w^⊤ f(x_j, y*_j) - log Σ_y exp(w^⊤ f(x_j, y)) ]
Training
‣ Multiclass logistic regression: P_w(y | x) = exp(w^⊤ f(x, y)) / Σ_{y' ∈ Y} exp(w^⊤ f(x, y'))
‣ Likelihood: L(x_j, y*_j) = w^⊤ f(x_j, y*_j) - log Σ_y exp(w^⊤ f(x_j, y))
‣ Gradient:
  ∂L(x_j, y*_j)/∂w_i = f_i(x_j, y*_j) - [Σ_y f_i(x_j, y) exp(w^⊤ f(x_j, y))] / [Σ_y exp(w^⊤ f(x_j, y))]
  = f_i(x_j, y*_j) - Σ_y f_i(x_j, y) P_w(y | x_j)
  = f_i(x_j, y*_j) - E_y[f_i(x_j, y)]
  (gold feature value minus the model's expectation of the feature value; a small sketch follows below)
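A minimal sketch of this gradient for a single example: the gold feature vector minus the model's expected feature vector. Here `feats(x, y)` is an assumed helper returning the block feature vector f(x, y); the function name is illustrative.

```python
import numpy as np

def log_reg_gradient(w, x, y_star, labels, feats):
    fs = np.stack([feats(x, y) for y in labels])   # |Y| x d matrix of f(x, y)
    scores = fs @ w
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                           # P_w(y | x) via softmax
    expected = probs @ fs                          # sum_y P_w(y|x) f(x, y)
    return feats(x, y_star) - expected             # direction of steepest ascent
```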
Training
‣ Example: too many drug trials, too few patients,   y* = Health
  f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]
  f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]
  P_w(y | x) = [0.21, 0.77, 0.02]
  ∂L(x_j, y*_j)/∂w_i = f_i(x_j, y*_j) - Σ_y f_i(x_j, y) P_w(y | x_j)
  gradient: [1, 1, 0, 0, 0, 0, 0, 0, 0] - 0.21 [1, 1, 0, 0, 0, 0, 0, 0, 0] - 0.77 [0, 0, 0, 1, 1, 0, 0, 0, 0] - 0.02 [0, 0, 0, 0, 0, 0, 1, 1, 0]
  = [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0]
  update: [1.3, 0.9, -5, 3.2, -0.1, 0, 1.1, -1.7, -1.3] + [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0] = [2.09, 1.69, -5, 2.43, -0.87, 0, 1.08, -1.72, -1.3]
  new P_w(y | x) = [0.89, 0.10, 0.01]   (reproduced in the sketch below)
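A small script reproducing the arithmetic above, taking P_w(y | x) = [0.21, 0.77, 0.02] and the starting weights shown on the slide as given, and applying one gradient-ascent step with step size 1.

```python
import numpy as np

f_health  = np.array([1., 1, 0, 0, 0, 0, 0, 0, 0])
f_sports  = np.array([0., 0, 0, 1, 1, 0, 0, 0, 0])
f_science = np.array([0., 0, 0, 0, 0, 0, 1, 1, 0])
p = np.array([0.21, 0.77, 0.02])             # P_w(Health|x), P_w(Sports|x), P_w(Science|x)

# gradient = f(x, y*) - sum_y P_w(y|x) f(x, y), with gold y* = Health
grad = f_health - (p[0] * f_health + p[1] * f_sports + p[2] * f_science)
w = np.array([1.3, 0.9, -5.0, 3.2, -0.1, 0.0, 1.1, -1.7, -1.3])
w_new = w + grad                              # step size 1, as on the slide
print(np.round(grad, 2))    # [ 0.79  0.79  0.   -0.77 -0.77  0.   -0.02 -0.02  0.  ]
print(np.round(w_new, 2))   # [ 2.09  1.69 -5.    2.43 -0.87  0.    1.08 -1.72 -1.3 ]
```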
Logistic Regression: Summary
‣ Model: P_w(y | x) = exp(w^⊤ f(x, y)) / Σ_{y' ∈ Y} exp(w^⊤ f(x, y'))
‣ Inference: argmax_y P_w(y | x)
‣ Learning: gradient ascent on the discriminative log-likelihood
  f(x, y*) - E_y[f(x, y)] = f(x, y*) - Σ_y [P_w(y | x) f(x, y)]
  "towards gold feature value, away from expectation of feature value"
Multiclass SVM
Soft Margin SVM
‣ Minimize λ‖w‖₂² + Σ_{j=1}^m ξ_j
  s.t. ∀j (2y_j - 1)(w^⊤ x_j) ≥ 1 - ξ_j and ∀j ξ_j ≥ 0
  (slack variables ξ_j > 0 iff the example is a support vector)
Image credit: Lang Van Tran
Multiclass SVM
‣ Correct prediction now has to beat every other class
‣ Minimize λ‖w‖₂² + Σ_{j=1}^m ξ_j
  s.t. ∀j ∀y ∈ Y: w^⊤ f(x_j, y*_j) ≥ w^⊤ f(x_j, y) + ℓ(y, y*_j) - ξ_j and ∀j ξ_j ≥ 0
‣ The 1 that was here is replaced by a loss function
‣ Score comparison is more explicit now
‣ Slack variables ξ_j > 0 iff the example is a support vector
Training (loss-augmented)
‣ Are all decisions equally costly?
  Example: too many drug trials, too few patients   (gold: Health)
  Predicted Science: not so bad
  Predicted Sports: bad error
‣ We can define a loss function ℓ(y, y*):
  ℓ(y = Sports, y* = Health) = 3
  ℓ(y = Science, y* = Health) = 1
Multiclass SVM
‣ Loss-augmented scores w^⊤ f(x, y) + ℓ(y, y*) for each y ∈ Y (gold: Health):
  Health: 2.4 + 0
  Science: 1.8 + 1
  Sports: 1.3 + 3
‣ Does gold beat every label + loss? No!
‣ Perceptron would make no update here
‣ Most violated constraint is Sports; what is ξ_j?
  ∀j ∀y ∈ Y: w^⊤ f(x_j, y*_j) ≥ w^⊤ f(x_j, y) + ℓ(y, y*_j) - ξ_j
‣ ξ_j = 4.3 - 2.4 = 1.9   (a small sketch follows below)
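A tiny sketch reproducing the slack computation above; the pairing of raw scores with classes follows the ξ_j arithmetic on the slide.

```python
# raw scores w.f(x, y) and losses l(y, y*), with gold y* = Health
scores = {"Health": 2.4, "Science": 1.8, "Sports": 1.3}
loss   = {"Health": 0.0, "Science": 1.0, "Sports": 3.0}

augmented = {y: scores[y] + loss[y] for y in scores}   # Health 2.4, Science 2.8, Sports 4.3
y_max = max(augmented, key=augmented.get)              # most violated constraint: Sports
xi = augmented[y_max] - scores["Health"]               # 4.3 - 2.4 = 1.9 (>= 0 since gold is in the max)
print(y_max, round(xi, 2))
```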
Multiclass SVM
‣ Minimize λ‖w‖₂² + Σ_{j=1}^m ξ_j
  s.t. ∀j ξ_j ≥ 0 and ∀j ∀y ∈ Y: w^⊤ f(x_j, y*_j) ≥ w^⊤ f(x_j, y) + ℓ(y, y*_j) - ξ_j
‣ One slack variable per example, so it's set to be whatever the most violated constraint is for that example:
  ξ_j = max_{y ∈ Y} [ w^⊤ f(x_j, y) + ℓ(y, y*_j) ] - w^⊤ f(x_j, y*_j)
‣ Plug in the gold y and you get 0, so slack is always nonnegative!
Computing the Subgradient
‣ Minimize λ‖w‖₂² + Σ_{j=1}^m ξ_j
  s.t. ∀j ξ_j ≥ 0 and ∀j ∀y ∈ Y: w^⊤ f(x_j, y*_j) ≥ w^⊤ f(x_j, y) + ℓ(y, y*_j) - ξ_j
  ξ_j = max_{y ∈ Y} [ w^⊤ f(x_j, y) + ℓ(y, y*_j) ] - w^⊤ f(x_j, y*_j)
‣ If ξ_j = 0, the example is not a support vector and the gradient is zero
‣ Otherwise, ∂ξ_j/∂w_i = f_i(x_j, y_max) - f_i(x_j, y*_j)
  (the update looks backwards: we're minimizing here!)
‣ Perceptron-like, but we update away from the *loss-augmented* prediction (sketch below)
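A minimal sketch of one subgradient step for a single example, assuming helpers `feats(x, y)` and `loss(y, y_star)` (illustrative names). If the loss-augmented argmax is the gold label, the slack is zero and nothing changes; otherwise we step toward the gold features and away from the loss-augmented prediction, since we are minimizing ξ_j.

```python
import numpy as np

def svm_subgradient_step(w, x, y_star, labels, feats, loss, lr=0.1):
    # loss-augmented decode: the most violated constraint for this example
    aug = {y: float(w @ feats(x, y)) + loss(y, y_star) for y in labels}
    y_max = max(aug, key=aug.get)
    if y_max == y_star:                          # xi_j = 0: not a support vector, zero gradient
        return w
    # d(xi_j)/dw = f(x, y_max) - f(x, y*); step in the negative of that direction
    return w + lr * (feats(x, y_star) - feats(x, y_max))
```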
Putting it Together
‣ Minimize λ‖w‖₂² + Σ_{j=1}^m ξ_j
  s.t. ∀j ξ_j ≥ 0 and ∀j ∀y ∈ Y: w^⊤ f(x_j, y*_j) ≥ w^⊤ f(x_j, y) + ℓ(y, y*_j) - ξ_j
‣ (Unregularized) gradients:
  ‣ SVM: f(x, y*) - f(x, y_max)   (loss-augmented max)
  ‣ Logreg: f(x, y*) - E_y[f(x, y)] = f(x, y*) - Σ_y [P_w(y | x) f(x, y)]
‣ SVM: max over y's (with ℓ(y, y*)) to compute the gradient. LR: need to sum over y's
Optimization
Recap
‣ Four elements of a machine learning method:
  ‣ Model: probabilistic, max-margin, deep neural network
  ‣ Objective: e.g., log-likelihood (logistic regression) or the margin objective (SVM)
  ‣ Inference: just maxes and simple expectations so far, but will get harder
  ‣ Training: gradient descent?
Optimization
‣ Stochastic gradient *ascent*
‣ Very simple to code up:
  w ← w + α g,   g = ∂L/∂w
  [figure from CS231n (Fei-Fei Li, Justin Johnson, Serena Yeung): loss contours over two weights W_1, W_2]
  (a minimal SGD sketch follows below)
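A minimal stochastic gradient ascent loop, assuming a helper `gradient_fn(w, example)` such as the logistic regression gradient sketched earlier; the loop structure and learning rate are illustrative.

```python
import random

def sgd_ascent(w, data, gradient_fn, lr=0.1, epochs=5):
    for _ in range(epochs):
        random.shuffle(data)                # visit examples in a random order
        for example in data:
            g = gradient_fn(w, example)     # per-example gradient dL/dw
            w = w + lr * g                  # ascent: step along the gradient
    return w
```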
Optimization
‣ Stochastic gradient *ascent*; very simple to code up: w ← w + α g, g = ∂L/∂w
‣ What if the loss changes quickly in one direction and slowly in another direction? What does gradient descent do?
  The loss function has a high condition number: the ratio of the largest to smallest singular value of the Hessian matrix is large
  Very slow progress along the shallow dimension, jitter along the steep direction
Optimization
‣ Stochastic gradient *ascent*; very simple to code up: w ← w + α g, g = ∂L/∂w
‣ What if the loss function has a local minimum or saddle point?
  Saddle points are much more common in high dimensions
  Dauphin et al., "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization", NIPS 2014
Optimization
‣ Stochastic gradient *ascent*; very simple to code up: w ← w + α g, g = ∂L/∂w
‣ "First-order" technique: only relies on having the gradient
  First-Order Optimization: (1) use the gradient to form a linear approximation of the loss, (2) step to minimize the approximation
  Second-Order Optimization: (1) use the gradient and Hessian to form a quadratic approximation, (2) step to the minimum of the approximation
  [figures from CS231n: the loss as a function of w1, with linear and quadratic approximations]
Optimization
‣ Stochastic gradient *ascent*; very simple to code up: w ← w + α g, g = ∂L/∂w
‣ "First-order" technique: only relies on having the gradient
‣ Setting the step size is hard (decrease it when held-out performance worsens?)
‣ Newton's method: w ← w + (∂²L/∂w²)^{-1} g
  ‣ Second-order technique
  ‣ Optimizes a quadratic instantly
  ‣ Inverse Hessian: n x n matrix, expensive!
‣ Quasi-Newton methods: L-BFGS, etc. approximate the inverse Hessian
AdaGrad
‣ Duchi et al. (2011): optimized for problems with sparse features
‣ Per-parameter learning rate: smaller updates are made to parameters that get updated frequently
‣ Added element-wise scaling of the gradient based on the historical sum of squares in each dimension ("per-parameter learning rates" or "adaptive learning rates")
  w_i ← w_i + α g_{t,i} / √(ε + Σ_{τ=1}^t g²_{τ,i})
  ((smoothed) sum of squared gradients from all updates)
‣ Generally more robust than SGD, requires less tuning of the learning rate
‣ Other techniques for optimizing deep models: more later!
  Duchi et al., "Adaptive subgradient methods for online learning and stochastic optimization", JMLR 2011
  (a minimal AdaGrad sketch follows below)
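A minimal AdaGrad sketch of the update above: keep a per-dimension sum of squared gradients and scale each coordinate's step by its inverse square root. The smoothing constant ε, the step size, and the toy gradients are illustrative.

```python
import numpy as np

def adagrad_step(w, g, sum_sq, lr=0.1, eps=1e-8):
    sum_sq = sum_sq + g ** 2                    # historical sum of squared gradients, per dimension
    w = w + lr * g / np.sqrt(eps + sum_sq)      # ascent step with a per-parameter learning rate
    return w, sum_sq

w, sum_sq = np.zeros(3), np.zeros(3)
for g in [np.array([1.0, 0.1, 0.0]), np.array([1.0, 0.0, 0.2])]:   # made-up gradients
    w, sum_sq = adagrad_step(w, g, sum_sq)
print(w)   # the frequently-updated first coordinate takes smaller steps per unit gradient
```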
Summary
‣ Design tradeoffs need to reflect interactions:
‣ Model and objective are coupled: probabilistic model <-> maximize likelihood
‣ …but not always: a linear model or neural network can be trained to minimize any differentiable loss function
‣ Inference governs what learning we can do: need to be able to compute expectations to use logistic regression