CS224d Deep Learning for Natural Language Processing Lecture 3: More Word Vectors Richard Socher

Refresher:

• Main cost function J:

• With probabilities defined as:

• We derived the gradient for the internal vectors $v_c$

4/5/16 Richard Socher Lecture 1, Slide 2
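The formulas on this slide did not survive extraction. As a sketch, assuming the standard skip-gram formulation (with $T$ the number of positions in the corpus, $c$ the window size, $v$ for center vectors, $u$ for outside vectors, and $V$ the vocabulary size), the cost and the probabilities might be written as follows. The cost is given here as a negative average log-likelihood so that it is minimized; that sign convention is an assumption. Note that the subscript in $v_c$ indexes the center word, not the window size.

```latex
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t)

p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}
```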

• We went through gradients for each center vector v in a window
• We also need gradients for outside vectors u
• Derive at home!

• Generally in each window we will compute updates for all parameters that are being used in that window.

• For example, window size c = 1, sentence: “I like learning.”

• First window computes gradients for:
  • internal vector $v_{like}$ and external vectors $u_I$ and $u_{learning}$

• Next window in that sentence?
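To make the window example concrete, here is a minimal sketch that lists, for each position in the sentence, which center vector and which outside vectors receive gradients. The helper name `windows` and the choice to treat the final period as its own token are assumptions for illustration.

```python
def windows(tokens, c=1):
    """Yield (center, outside_words) for every position in the sentence."""
    for t, center in enumerate(tokens):
        left = tokens[max(0, t - c):t]      # up to c words to the left
        right = tokens[t + 1:t + 1 + c]     # up to c words to the right
        yield center, left + right

# Assumed tokenization: the period is its own token.
sentence = ["I", "like", "learning", "."]
window_list = list(windows(sentence, c=1))

for center, outside in window_list:
    # Each window touches one internal vector v and the external vectors u.
    print(f"v_{center}: " + ", ".join(f"u_{w}" for w in outside))
```

Under this tokenization, the window centered on "like" touches `v_like`, `u_I` and `u_learning`, and the next window is centered on "learning".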

• We often define the set of ALL parameters in a model in terms of one long vector $\theta$

• In our case with d-dimensional vectors and V many words:
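The formula for this slide was lost in extraction. A sketch, assuming the parameters are simply the $V$ center vectors followed by the $V$ outside vectors, each of dimension $d$ (the ordering is an assumption; only the total size $2dV$ follows from the counts):

```latex
\theta = \begin{bmatrix} v_{1} \\ \vdots \\ v_{V} \\ u_{1} \\ \vdots \\ u_{V} \end{bmatrix} \in \mathbb{R}^{2dV}
```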

• To minimize $J(\theta)$ over the full batch (the entire training data) would require us to compute gradients for all windows

• Updates would be for each element of $\theta$:

• With step size $\alpha$
• In matrix notation for all parameters:
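The update equations themselves are missing from the extracted text. Assuming plain gradient descent with step size $\alpha$, they would take the usual form, first per element and then in vector notation:

```latex
\theta_j^{\text{new}} = \theta_j^{\text{old}} - \alpha \frac{\partial}{\partial \theta_j^{\text{old}}} J(\theta)

\theta^{\text{new}} = \theta^{\text{old}} - \alpha \nabla_\theta J(\theta)
```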

• For a simple convex function over two parameters.

• Contour lines show levels of objective function
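As an illustration of the intuition, here is a minimal sketch of full-batch gradient descent on a convex function of two parameters. The quadratic `J`, the starting point and the step size are all made up for this example and are not taken from the lecture.

```python
import numpy as np

def J(theta):
    """A made-up convex objective over two parameters."""
    return theta[0] ** 2 + 3.0 * theta[1] ** 2

def grad_J(theta):
    """Analytic gradient of J."""
    return np.array([2.0 * theta[0], 6.0 * theta[1]])

alpha = 0.1                        # step size (assumed value)
theta = np.array([2.0, -1.5])      # arbitrary starting point
history = [J(theta)]

for _ in range(100):
    theta = theta - alpha * grad_J(theta)   # theta_new = theta_old - alpha * gradient
    history.append(J(theta))

print(theta, history[-1])
```

Each step moves perpendicular to the contour line through the current point, and for this step size the objective decreases at every iteration.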

• But the corpus may have 40B tokens and windows
• You would wait a very long time before making a single update!

• Very bad idea for pretty much all neural nets!
• Instead: We will update parameters after each window t

→ Stochastic gradient descent (SGD)

• But in each window, we only have at most 2c+1 words, so the gradient for that window is very sparse!

• We may as well only update the word vectors that actually appear!

• Solution: either keep around a hash for word vectors or only update certain columns of the full embedding matrices U and V

• Important if you have millions of word vectors and do distributed computing, so that you do not have to send gigantic updates around.
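A minimal sketch of the "only update certain columns" option, assuming word vectors are stored as columns of a NumPy matrix. The sizes, word ids and gradient values are made up; the sketch also assumes the word ids in a window are unique (repeated ids would need an accumulating update instead).

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 4, 10
V = rng.normal(size=(d, vocab_size))    # center vectors, one column per word
V_before = V.copy()

# Hypothetical ids of the words seen in the current window (assumed unique).
window_word_ids = np.array([2, 5, 7])
# Hypothetical gradients for just those columns, one column per word id.
window_grads = rng.normal(size=(d, window_word_ids.size))
alpha = 0.05

# Sparse SGD step: only the columns of words in the window are touched.
V[:, window_word_ids] -= alpha * window_grads

untouched = np.setdiff1d(np.arange(vocab_size), window_word_ids)
print(np.array_equal(V[:, untouched], V_before[:, untouched]))
```

Only `d * 3` numbers change here, which is also all that would need to be communicated in a distributed setting.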



Approximations: PSet 1

• The normalization factor is too computationally expensive

• Hence, in PSet 1 you will implement the skip-gram model

• Main idea: train binary logistic regressions for a true pair (center word and word in its context window) and a couple of random pairs (the center word with a random word)

PSet 1: The skip-gram model and negative sampling

• From paper: “Distributed Representations of Words and Phrases and their Compositionality” (Mikolov et al. 2013)

• Overall objective function:

• Where k is the number of negative samples and we use,

• The sigmoid function! (we’ll become good friends soon)

• So we maximize the probability of two words co-occurring in first log →
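The objective and the sigmoid did not survive extraction. As a sketch, the negative sampling objective from Mikolov et al. (2013), rewritten in this lecture's notation ($v_c$ for the center vector, $u_o$ for the true outside vector, $P(w)$ for the noise distribution), would read as follows. This per-window objective is maximized, and the "first log" refers to its first term.

```latex
J_t(\theta) = \log \sigma(u_o^\top v_c) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)} \left[ \log \sigma(-u_j^\top v_c) \right]

\sigma(x) = \frac{1}{1 + e^{-x}}
```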

PSet 1: The skip-gram model and negative sampling

• Slightly clearer notation:

• Maximize the probability that the real outside word appears, minimize the probability that random words appear around the center word

• $P(w) = U(w)^{3/4}/Z$, the unigram distribution U(w) raised to the 3/4 power (we provide this function in the starter code).

• The power makes less frequent words be sampled more often
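A minimal sketch of the sampling distribution $P(w) = U(w)^{3/4}/Z$, using made-up word counts. This is not the starter-code function mentioned above, only an illustration of how the 3/4 power shifts probability mass toward rarer words.

```python
import numpy as np

# Made-up corpus counts, for illustration only.
counts = {"the": 1000, "like": 50, "learning": 10, "aardvark": 1}
words = list(counts)
freq = np.array([counts[w] for w in words], dtype=float)

unigram = freq / freq.sum()          # U(w)
powered = unigram ** 0.75            # U(w)^(3/4)
neg_dist = powered / powered.sum()   # divide by Z so the result sums to 1

rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=neg_dist)   # k = 5 negative samples

for w, u, p in zip(words, unigram, neg_dist):
    print(f"{w:10s} U(w)={u:.4f}  P(w)={p:.4f}")
```

In this toy example the most frequent word loses probability relative to the plain unigram distribution, while the rarest word gains.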

PSet 1: The continuous bag of words model

• Main idea for continuous bag of words (CBOW): Predict center word from sum of surrounding word vectors instead of predicting surrounding single words from center word as in skip-gram model

• To make PSet slightly easier:

Count based vs direct prediction

Count based: LSA, HAL (Lund & Burgess), COALS (Rohde et al), Hellinger-PCA (Lebret & Collobert)

• Fast training
• Efficient usage of statistics
• Primarily used to capture word similarity
• Disproportionate importance given to large counts

Direct prediction: NNLM, HLBL, RNN, Skip-gram/CBOW (Bengio et al; Collobert & Weston; Huang et al; Mnih & Hinton; Mikolov et al; Mnih & Kavukcuoglu)

• Scales with corpus size
• Inefficient usage of statistics
• Generate improved performance on other tasks
• Can capture complex patterns beyond word similarity

[Figure labels: litoria, leptodactylidae, rana, eleutherodactylus]


• We end up with U and V from all the vectors u and v (in columns)

• Both capture similar co-occurrence information. It turns out, the best solution is to simply sum them up:

$X_{final} = U + V$

• One of many hyperparameters explored in GloVe: Global Vectors for Word Representation (Pennington et al. 2014)

• Related to general evaluation in NLP: Intrinsic vs extrinsic
• Intrinsic:
  • Evaluation on a specific/intermediate subtask
  • Fast to compute
  • Helps to understand that system
  • Not clear if really helpful unless correlation to real task is established

• Extrinsic:
  • Evaluation on a real task
  • Can take a long time to compute accuracy
  • Unclear if the subsystem is the problem or its interaction or other subsystems
  • If replacing one subsystem with another improves accuracy → Winning!

• Word Vector Analogies

• Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions

• Discarding the input words from the search!

• Problem: What if the information is there but not linear?
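A minimal sketch of the analogy evaluation "a is to b as c is to ?", assuming the answer is the word whose vector has the highest cosine similarity to $x_b - x_a + x_c$, with the three input words discarded from the search. The function name `analogy` and the 3-dimensional toy vectors are made up for illustration; real evaluations use trained vectors.

```python
import numpy as np

def analogy(a, b, c, vectors):
    """Return the word d maximizing cos(x_b - x_a + x_c, x_d), skipping a, b, c."""
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue                      # discard the input words from the search
        score = float(vec @ target / np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Toy vectors, hand-built so that one axis roughly encodes gender and one royalty.
toy = {
    "man": np.array([1.0, 0.0, 0.1]),
    "woman": np.array([1.0, 1.0, 0.1]),
    "king": np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([-1.0, 0.2, 0.0]),
}
result = analogy("man", "woman", "king", toy)
print(result)
```

Because the search is a single linear offset, this evaluation can miss relationships that are encoded in the vectors but not linearly, which is the problem raised on the slide.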





GloVe Visualizations: Company - CEO


• Word Vector Analogies: Syntactic and Semantic examples from http://code.google.com/p/word2vec/source/browse/trunk/questions-words.txt

: city-in-state (problem: different cities may have same name)
Chicago Illinois Houston Texas
Chicago Illinois Philadelphia Pennsylvania
Chicago Illinois Phoenix Arizona
Chicago Illinois Dallas Texas
Chicago Illinois Jacksonville Florida
Chicago Illinois Indianapolis Indiana
Chicago Illinois Austin Texas
Chicago Illinois Detroit Michigan
Chicago Illinois Memphis Tennessee
Chicago Illinois Boston Massachusetts

• Word Vector Analogies: Syntactic and Semantic examples from the same questions-words.txt file

: capital-world (problem: can change)
Abuja Nigeria Accra Ghana
Abuja Nigeria Algiers Algeria
Abuja Nigeria Amman Jordan
Abuja Nigeria Ankara Turkey
Abuja Nigeria Antananarivo Madagascar
Abuja Nigeria Apia Samoa
Abuja Nigeria Ashgabat Turkmenistan
Abuja Nigeria Asmara Eritrea
Abuja Nigeria Astana Kazakhstan


• Very careful analysis: GloVe word vectors

• Asymmetric context (only words to the left) is not as good

• Best dimensions ~300, slight drop-off afterwards
• But this might be different for downstream tasks!

• Window size of 8 around each center word is good for GloVe vectors

[Figure: (a) Symmetric context, x-axis: Vector Dimension; (b) Symmetric context, x-axis: Window Size; (c) Asymmetric context, x-axis: Window Size]

• More training time helps

[Figure: (a) GloVe vs CBOW; (b) GloVe vs Skip-Gram. Axes: Iterations (GloVe), Negative Samples (CBOW or Skip-Gram), Training Time (hrs)]

• More data helps, Wikipedia is better than news text!

• Word vector distances and their correlation with human judgments
• Example dataset: WordSim353

Word 1      Word 2      Human (mean)
tiger       cat         7.35
tiger       tiger       10.00
book        paper       7.46
computer    internet    7.58
plane       car         5.77
professor   doctor      6.62
stock       phone       1.62
stock       CD          1.31
stock       jaguar      0.92
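The slide does not say which correlation is used; one common choice for this kind of evaluation is the Spearman rank correlation, sketched below. The human scores are the WordSim353 means listed above (leaving out the tiger/tiger pair); the model similarities are made-up numbers standing in for cosine similarities from trained vectors, and the helper `spearman` assumes there are no tied scores.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation for score lists without ties."""
    rx = np.argsort(np.argsort(x)).astype(float)   # rank of each score in x
    ry = np.argsort(np.argsort(y)).astype(float)   # rank of each score in y
    return float(np.corrcoef(rx, ry)[0, 1])        # Pearson correlation of the ranks

# Human means: tiger/cat, book/paper, computer/internet, plane/car,
# professor/doctor, stock/phone, stock/CD, stock/jaguar.
human = [7.35, 7.46, 7.58, 5.77, 6.62, 1.62, 1.31, 0.92]
# Made-up model cosine similarities for the same pairs, for illustration only.
model = [0.61, 0.58, 0.66, 0.42, 0.37, 0.12, 0.15, 0.05]

rho = spearman(human, model)
print(f"Spearman correlation: {rho:.3f}")
```

A rank correlation only asks whether the model orders the pairs the way humans do, so the differing scales (0 to 10 versus cosine similarity) do not matter.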



• Word vector distances and their correlation with human judgments

• Some ideas from the GloVe paper have been shown to improve the skip-gram (SG) model also (e.g. sum both vectors)

• You may hope that one vector captures both kinds of information (run = verb and noun) but then the vector is pulled in different directions

• Alternative described in: Improving Word Representations Via Global Context And Multiple Word Prototypes (Huang et al. 2012)

• Idea: Cluster word windows around words, retrain with each word assigned to multiple different clusters $bank_1$, $bank_2$, etc

• Improving Word Representations Via Global Context And Multiple Word Prototypes (Huang et al. 2012)

• Extrinsic evaluation of word vectors: All subsequent tasks in this class

• One example where good word vectors should help directly: named entity recognition: finding a person, organization or location

• Next: How to use word vectors in neural net models!

• What is the major benefit of deep learned word vectors?
• Ability to also classify words accurately

  • Countries cluster together → classifying location words should be possible with word vectors

• Incorporate any information from other tasks into them

  • Project sentiment into words to find most positive/negative words in corpus

Logistic regression = Softmax classification on word vector x to obtain probability for class y:

[Figure: network diagram with inputs x1, x2, x3 and outputs a1, a2]

The softmax - details

• Terminology: Loss function = cost function = objective function
• Loss for softmax: Cross entropy

• To compute p(y|x): first take the y'th row of W and multiply that row with x:

• Compute all $f_c$ for c = 1, …, C
• Normalize to obtain probability with softmax function:
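The steps above can be sketched in a few lines, assuming W has one row per class (shape C x d), so that $f_c = W_{c\cdot} x$ and $p(y \mid x) = \exp(f_y) / \sum_{c=1}^{C} \exp(f_c)$. The function name `softmax_probs`, the sizes, and the random values are made up for illustration; subtracting the maximum score is a standard numerical-stability trick that leaves the probabilities unchanged.

```python
import numpy as np

def softmax_probs(W, x):
    """p(c | x) for every class c, given weights W (C x d) and word vector x (d,)."""
    f = W @ x                  # f_c = (c'th row of W) times x, for c = 1..C
    f = f - f.max()            # shift for numerical stability
    exp_f = np.exp(f)
    return exp_f / exp_f.sum() # normalize to obtain probabilities

rng = np.random.default_rng(0)
C, d = 3, 5                    # assumed number of classes and vector dimension
W = rng.normal(size=(C, d))
x = rng.normal(size=d)

p = softmax_probs(W, x)
y = 1                          # hypothetical correct class
loss = -np.log(p[y])           # negative log probability of the correct class
print(p, loss)
```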

The softmax and cross-entropy error

• The loss wants to maximize the probability of the correct class y

• Hence, we minimize the negative log probability of that class:

• As before: we sum up multiple cross entropy errors if we have multiple classifications in our total error function over the corpus (more next lecture)

• Assuming a ground truth (or gold or target) probability distribution that is 1 at the right class and 0 everywhere else: p = [0, …, 0, 1, 0, …, 0] and our computed probability is q, then the cross entropy is:

• Because of one-hot p, the only term left is the negative log probability of the true class

• Cross-entropy can be re-written in terms of the entropy and Kullback-Leibler divergence between the two distributions:

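• Cross entropy:
• Because H(p) is zero in our case (and even if it wasn’t, it would be fixed and contribute nothing to the gradient), minimizing the cross entropy is equivalent to minimizing the KL divergence between p and q

• The KL divergence is not a distance but a non-symmetric measure of the difference between two probability distributions p and q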
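A small numerical check of these statements, assuming the usual definitions $H(p, q) = -\sum_c p(c) \log q(c)$, $H(p) = H(p, p)$ and $D_{KL}(p \| q) = \sum_c p(c) \log \frac{p(c)}{q(c)}$, so that $H(p, q) = H(p) + D_{KL}(p \| q)$. The distributions below are made up for illustration.

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_c p(c) log q(c); terms with p(c) = 0 contribute nothing."""
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(q[mask])))

def entropy(p):
    """H(p) = H(p, p)."""
    return cross_entropy(p, p)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_c p(c) log(p(c) / q(c))."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

q = np.array([0.1, 0.7, 0.2])          # made-up model distribution
p_onehot = np.array([0.0, 1.0, 0.0])   # one-hot target, true class is index 1
p_soft = np.array([0.2, 0.5, 0.3])     # made-up non-one-hot target

print(cross_entropy(p_onehot, q), -np.log(q[1]))   # both are the negative log probability of the true class
print(entropy(p_onehot))                           # zero for a one-hot target
print(cross_entropy(p_soft, q), entropy(p_soft) + kl_divergence(p_soft, q))
print(kl_divergence(p_soft, q), kl_divergence(q, p_soft))   # not symmetric
```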

PSet 1

• Derive the gradient of the cross entropy error with respect to the input word vector x and the matrix W

• Example: Sentiment

• Two options: train only softmax weights W and fix word vectors or also train word vectors

• Question: What are the advantages and disadvantages of training the word vectors?

• Pro: better fit on training data
• Con: Worse generalization because the words move in the vector space

• Single word classification has no context!

• Let’s add context by taking in windows and classifying the center word of that window!

• Possible: Softmax and cross entropy error or max-margin loss

• Next class!

