CS 6956: Deep Learning for NLP Word Embeddings


Overview

• Representing meaning
• Word embeddings: Early work
• Word embeddings via language models
• Word2vec and GloVe
• Evaluating embeddings
• Design choices and open questions


Word embeddings via language models

The goal: To find vector embeddings of words

High-level approach:
1. Train a model for a surrogate task (in this case, language modeling)
2. Word embeddings are a byproduct of this process

Neural network language models

• A multi-layer neural network [Bengio et al 2003]
  – Words → embedding layer → hidden layers → softmax
  – Cross-entropy loss

• Instead of producing a probability, just produce a score for the next word (no softmax) [Collobert and Weston, 2008]
  – Ranking loss (sketched below)
  – Intuition: Valid word sequences should get a higher score than invalid ones

• No need for a multi-layer network, a shallow network is good enough [Mikolov, 2013, word2vec]
  – Simpler model, fewer parameters
  – Faster to train

Context = previous words in sentence
Context = previous and next words in sentence
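To make the ranking-loss intuition concrete, here is a minimal sketch of a margin-based ranking loss in the spirit of Collobert and Weston, not their exact architecture: a valid window is scored by a linear layer over concatenated embeddings, a corrupted window replaces the center word with a random word, and the valid window must outscore the corrupted one by a margin. The sizes, scoring function, and margin below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, window = 1000, 50, 5

# Illustrative parameters: an embedding table and a linear scorer.
E = rng.normal(scale=0.1, size=(vocab_size, emb_dim))
u = rng.normal(scale=0.1, size=(window * emb_dim,))

def score(window_ids):
    """Score a window of word ids: concatenate embeddings, apply a linear layer."""
    return float(u @ E[window_ids].reshape(-1))

def ranking_loss(valid_ids, margin=1.0):
    """Hinge loss: a valid window should outscore a corrupted one by a margin."""
    corrupted = valid_ids.copy()
    corrupted[window // 2] = rng.integers(vocab_size)  # replace the center word
    return max(0.0, margin - score(valid_ids) + score(corrupted))

print(ranking_loss(np.array([3, 14, 159, 26, 535])))
```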

This lecture

• The word2vec models: CBOW and Skipgram
• Connection between word2vec and matrix factorization
• GloVe

Word2Vec

• Two architectures for learning word embeddings
  – Skipgram and CBOW

• Both have two key differences from the older Bengio/C&W approaches
  1. No hidden layers
  2. Extra context (both left and right context)

• Several computational tricks to make things faster

[Mikolov et al ICLR 2013, Mikolov et al NIPS 2013]


Continuous Bag of Words (CBOW)

Given a window of words of length 2m+1, call them: $x_{-m}, \dots, x_{-1}, x_0, x_1, \dots, x_m$

Define a probabilistic model for predicting the middle word:

$P(x_0 \mid x_{-m}, \dots, x_{-1}, x_1, \dots, x_m)$

Train the model by minimizing the loss over the dataset:

$L = -\sum \log P(x_0 \mid x_{-m}, \dots, x_{-1}, x_1, \dots, x_m)$

We need to define this probability to complete the model.

The CBOW model

• The classification task
  – Input: context words $x_{-m}, \dots, x_{-1}, x_1, \dots, x_m$
  – Output: the center word $x_0$
  – These words correspond to one-hot vectors
    • E.g., cat would be associated with a dimension; its one-hot vector has 1 in that dimension and zero everywhere else

• Notation:
  – n: the embedding dimension (e.g., 300)
  – V: the vocabulary of words we want to embed

• Define two matrices:
  1. $\mathcal{V}$: a matrix of size $n \times |V|$
  2. $\mathcal{W}$: a matrix of size $|V| \times n$

The CBOW model

Input: context $x_{-m}, \dots, x_{-1}, x_1, \dots, x_m$. Output: the center word $x_0$.

1. Map all the context words into the n-dimensional space using $\mathcal{V}$
   – We get 2m vectors $\mathcal{V}x_{-m}, \dots, \mathcal{V}x_{-1}, \mathcal{V}x_1, \dots, \mathcal{V}x_m$

2. Average these vectors to get a context vector
   $\hat{v} = \frac{1}{2m} \sum_{i=-m,\, i \neq 0}^{m} \mathcal{V}x_i$

3. Use this to compute a score vector for the output
   $score = \mathcal{W}\hat{v}$

4. Use the score to compute a probability via softmax
   $P(x_0 = \cdot \mid context) = \mathrm{softmax}(\mathcal{W}\hat{v})$

Exercise: Write this as a computation graph.

Word embeddings: the rows of the matrix corresponding to the output, that is, the rows of $\mathcal{W}$.
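As a concrete reading of these four steps, here is a minimal numpy sketch of the CBOW forward pass. It keeps the slide's shapes ($\mathcal{V}$ is $n \times |V|$, $\mathcal{W}$ is $|V| \times n$) and uses integer word ids instead of explicit one-hot vectors, since multiplying by a one-hot vector just selects a column of $\mathcal{V}$; the sizes and random initialization are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, V = 300, 10_000                           # embedding dimension and vocabulary size (illustrative)
Vmat = rng.normal(scale=0.01, size=(n, V))   # context embeddings, n x |V|
Wmat = rng.normal(scale=0.01, size=(V, n))   # output embeddings, |V| x n

def cbow_forward(context_ids):
    """Probability distribution over the center word given 2m context word ids."""
    v_hat = Vmat[:, context_ids].mean(axis=1)       # steps 1-2: look up and average
    scores = Wmat @ v_hat                           # step 3: one score per vocabulary word
    scores -= scores.max()                          # numerical stability
    return np.exp(scores) / np.exp(scores).sum()    # step 4: softmax

probs = cbow_forward([12, 7, 431, 9])   # a window with m = 2
print(probs.shape, probs.sum())         # (10000,) ~1.0
```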

The CBOW loss: A worked example

Consider the loss for one example with context size 2 on each side. Denote the words by a b c d e, with c being the output.

• Step 1: Project a, b, d, e using the matrix $\mathcal{V}$. This gives us columns of the matrix: $v_a, v_b, v_d, v_e$.

• Step 2: Take their average:
  $\hat{v} = \frac{v_a + v_b + v_d + v_e}{4}$

• Step 3: The score $= \mathcal{W}\hat{v}$
  – Each element of this score corresponds to the score for a single word.

• Step 4: The probability of a word being the center word:
  $P(\cdot \mid a, b, d, e) = \mathrm{softmax}(\mathcal{W}\hat{v})$

More concretely:

$P(c \mid a, b, d, e) = \frac{\exp(w_c^\top \hat{v})}{\sum_{i=1}^{|V|} \exp(w_i^\top \hat{v})}$

The loss requires the negative log of this quantity:

$Loss = -w_c^\top \hat{v} + \log \sum_{i=1}^{|V|} \exp(w_i^\top \hat{v})$

Exercise: Calculate the derivative of this with respect to all the w's and the v's.

Note that this sum requires us to iterate over the entire vocabulary for each example!
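Continuing in the same spirit (a self-contained sketch with placeholder matrices), the loss for the a b c d e example can be computed directly from this formula; the log of the normalizer is evaluated with the log-sum-exp trick so the sum over the vocabulary stays numerically stable.

```python
import numpy as np

rng = np.random.default_rng(0)
n, V = 300, 10_000
Vmat = rng.normal(scale=0.01, size=(n, V))   # context embeddings, n x |V|
Wmat = rng.normal(scale=0.01, size=(V, n))   # output embeddings, |V| x n

def cbow_loss(context_ids, center_id):
    """-log P(center | context) for one example, matching the formula above."""
    v_hat = Vmat[:, context_ids].mean(axis=1)     # average of the context vectors
    scores = Wmat @ v_hat                         # w_i^T v_hat for every word i
    m = scores.max()                              # log-sum-exp for stability
    log_normalizer = m + np.log(np.exp(scores - m).sum())
    return -scores[center_id] + log_normalizer    # -w_c^T v_hat + log sum_i exp(w_i^T v_hat)

# a b c d e with c as the center word; the ids are arbitrary placeholders
a, b, c, d, e = 11, 42, 5, 907, 333
print(cbow_loss([a, b, d, e], c))
```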


Skipgram

The other word2vec model.

Given a window of words of length 2m+1, call them: $x_{-m}, \dots, x_{-1}, x_0, x_1, \dots, x_m$

Define a probabilistic model for predicting each context word:

$P(x_{context} \mid x_0)$

This inverts the inputs and outputs from CBOW. As far as the probabilistic model is concerned:
Input: the center word
Output: all the words in the context

The Skipgram model

• The classification task
  – Input: the center word $x_0$
  – Output: context words $x_{-m}, \dots, x_{-1}, x_1, \dots, x_m$
  – As before, these words correspond to one-hot vectors

• Notation:
  – n: the embedding dimension (e.g., 300)
  – V: the vocabulary of words we want to embed

• Define two matrices:
  1. $\mathcal{V}$: a matrix of size $n \times |V|$
  2. $\mathcal{W}$: a matrix of size $|V| \times n$

The Skipgram model

Input: the center word $x_0$. Output: context $x_{-m}, \dots, x_{-1}, x_1, \dots, x_m$.

1. Map the center word into the n-dimensional space using $\mathcal{W}$
   – We get an n-dimensional vector $w = \mathcal{W}x_0$ (the row of $\mathcal{W}$ for the center word)

2. For the $i^{th}$ context position, compute the score for a word occupying that position as
   $v_i = \mathcal{V}w$

3. Normalize the score for each position to get a probability
   $P(x_i = \cdot \mid x_0) = \mathrm{softmax}(v_i)$

Exercise: Write this as a computation graph.

Remember the goal of learning: make this probability highest for the observed words in this context.
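A minimal numpy sketch of this computation, again with integer ids standing in for one-hot vectors: indexing a row of $\mathcal{W}$ plays the role of $\mathcal{W}x_0$, and multiplying by the transpose of $\mathcal{V}$ (so the shapes work out with $\mathcal{V}$ stored as $n \times |V|$) gives one score per vocabulary word. Since only the center word enters the computation, the same distribution is produced for every context position; shapes and initialization are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, V = 300, 10_000
Vmat = rng.normal(scale=0.01, size=(n, V))   # context embeddings, n x |V|
Wmat = rng.normal(scale=0.01, size=(V, n))   # word embeddings, |V| x n

def skipgram_probs(center_id):
    """P(context word = . | center word); the same distribution for every position."""
    w = Wmat[center_id]                # step 1: the row of W for the center word
    scores = Vmat.T @ w                # step 2: v_i^T w for every word i in the vocabulary
    scores -= scores.max()             # numerical stability
    return np.exp(scores) / np.exp(scores).sum()   # step 3: softmax

probs = skipgram_probs(5)
print(probs.shape, probs.argmax())
```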

The Skipgram loss: A worked example

Consider the loss for one example with context size 2 on each side. Denote the words by a b c d e, with c being the center word.

Step 1: Get the vector $w_c = \mathcal{W}c$

Step 2: For every position, compute the score for a word occupying that position as $v = \mathcal{V}w_c$

Step 3: Normalize the score for each position using softmax:

$P(x_i = \cdot \mid x_0 = c) = \mathrm{softmax}(v)$

Or, more specifically:

$P(x_{-2} = a \mid x_0 = c) = \frac{\exp(v_a^\top w_c)}{\sum_{i=1}^{|V|} \exp(v_i^\top w_c)}$

The loss for this example is the sum of the negative log of this over all the context words:

$Loss = \sum_{k \in \{a, b, d, e\}} \left( -v_k^\top w_c + \log \sum_{i=1}^{|V|} \exp(v_i^\top w_c) \right)$

Exercise: Calculate the derivative of this with respect to all the w's and the v's.

Note that this sum requires us to iterate over the entire vocabulary for each example!
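A self-contained sketch of this per-example loss with placeholder matrices; note that the log-normalizer $\log \sum_i \exp(v_i^\top w_c)$ depends only on the center word, so it is computed once and reused across the four context positions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, V = 300, 10_000
Vmat = rng.normal(scale=0.01, size=(n, V))   # context embeddings, n x |V|
Wmat = rng.normal(scale=0.01, size=(V, n))   # word embeddings, |V| x n

def skipgram_loss(center_id, context_ids):
    """Sum over context words of -log P(context word | center word)."""
    w_c = Wmat[center_id]                     # w_c
    scores = Vmat.T @ w_c                     # v_i^T w_c for every word i
    m = scores.max()
    log_normalizer = m + np.log(np.exp(scores - m).sum())   # shared across positions
    return sum(-scores[k] + log_normalizer for k in context_ids)

a, b, c, d, e = 11, 42, 5, 907, 333
print(skipgram_loss(c, [a, b, d, e]))
```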

Negative sampling

• Can we make it faster?

• Answer [Mikolov et al 2013]: change the objective, defining a new objective function that does not have the same problem
  – Negative Sampling

• The overall method is called Skipgram with Negative Sampling (SGNS)

The term $\log \sum_{i=1}^{|V|} \exp(v_i^\top w_c)$ is the problem: this sum requires us to iterate over the entire vocabulary for each example!

Negative sampling: The intuition

• A new task: Given a pair of words (w, c), is this a valid pair or not?
  – That is, can word c occur in the context window of w or not?

• This is a binary classification problem
  – We can solve this using logistic regression
  – The probability of a pair of words being valid is defined as
    $P(c \mid w) = \sigma(v_c^\top w_w) = \frac{1}{1 + \exp(-v_c^\top w_w)}$

• Positive examples are all pairs that occur in the data; negative examples are all pairs that don't occur in the data, but this is still a massive set!

• Key insight: Instead of generating all possible negative examples, randomly sample k of them in each epoch of the learning loop
  – That is, there are only k negatives for each positive example, instead of the entire vocabulary

We will visit negative sampling in the first homework.
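The resulting per-pair objective, as described by Mikolov et al. (2013), rewards $\log \sigma(v_c^\top w_w)$ for an observed pair and $\log \sigma(-v_{c'}^\top w_w)$ for each of the k sampled negatives. The sketch below uses uniform negative sampling for simplicity (word2vec samples from a smoothed unigram distribution) and placeholder matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, V = 300, 10_000
Vmat = rng.normal(scale=0.01, size=(n, V))   # context embeddings, n x |V|
Wmat = rng.normal(scale=0.01, size=(V, n))   # word embeddings, |V| x n

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(word_id, context_id, k=5):
    """Negative log-likelihood for one (w, c) pair with k sampled negatives."""
    w = Wmat[word_id]                                      # w_w
    pos = np.log(sigmoid(Vmat[:, context_id] @ w))         # observed pair should score high
    neg_ids = rng.integers(V, size=k)                      # uniform here; word2vec samples
    neg = np.log(sigmoid(-Vmat[:, neg_ids].T @ w)).sum()   # from a smoothed unigram distribution
    return -(pos + neg)

print(sgns_loss(5, 42))
```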

Word2vec notes

There are many other tricks that are needed to make this work and scale:
– A scaling term in the loss function to ensure that frequent words do not dominate the loss
– Hierarchical softmax, if you don't want to use negative sampling
– A clever learning rate schedule
– Very efficient code

See the reading for more details.


Recall: matrix factorization for embeddings

The general agenda:

1. Construct a word-word matrix M whose entries are some function extracted from data involving words in context (e.g., counts, normalized counts, etc.)

2. Factorize the matrix using SVD to produce lower dimensional embeddings of the words

3. Use one of the resulting matrices as word embeddings
   – Or some combination thereof
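A minimal sketch of this agenda on a toy count matrix: build M, run SVD, and keep the top d dimensions. Scaling the left singular vectors by the singular values is one common choice for the final embeddings; the toy counts here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50, 10                                      # toy vocabulary size and embedding dimension
M = rng.poisson(1.0, size=(V, V)).astype(float)    # stand-in word-word count matrix

# Step 2: factorize with SVD and truncate to rank d.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
embeddings = U[:, :d] * S[:d]                      # Step 3: rows are d-dimensional word vectors

print(embeddings.shape)                            # (50, 10)
```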

Word2vec and matrix factorization

[Levy and Goldberg, NIPS 2014]: Skipgram with negative sampling is implicitly factorizing a specific matrix of this kind.

Two key points to note:

1. The entries in the matrix are a shifted pointwise mutual information (SPPMI) between a word and its context word:

   $PMI(w, c) = \log \frac{p(w, c)}{p(w)\, p(c)}$

   $SPPMI(w, c) = PMI(w, c) - \log k$

   These probabilities are computed by counting over the data and normalizing.

2. The matrix factorization method is not truncated SVD.
   – It instead minimizes the objective function to compute the factorized matrices.
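A sketch of constructing such a matrix from a word-context count matrix: counts are normalized into joint and marginal probabilities, PMI is computed, and the shift by log k is applied (k being the number of negative samples). The final positive clipping is the SPPMI variant from the Levy and Goldberg paper; the toy counts are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
V, k = 50, 5
counts = rng.poisson(1.0, size=(V, V)).astype(float)   # word-context co-occurrence counts

total = counts.sum()
p_wc = counts / total                      # joint probability p(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)      # marginal p(w)
p_c = p_wc.sum(axis=0, keepdims=True)      # marginal p(c)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))       # -inf where the pair never co-occurs
shifted = pmi - np.log(k)                  # the matrix SGNS implicitly factorizes
sppmi = np.maximum(shifted, 0.0)           # positive variant (SPPMI)

print(sppmi.shape)
```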


What matrix to factorize?

If we are building word embeddings by factorizing a matrix, what matrix should we consider?

• Word counts [Rhode et al 2005]
• Shifted PPMI (implicitly) [Mikolov 2013, Levy & Goldberg 2014]
• Another answer: log co-occurrence counts [Pennington et al 2014]

Co-occurrence probabilities

Given two words i and j that occur in text, their co-occurrence probability is defined as the probability of seeing j in the context of i:

$P(j \mid i) = \frac{\mathrm{count}(j \text{ in context of } i)}{\sum_k \mathrm{count}(k \text{ in context of } i)}$

Claim: If we want to distinguish between two words, it is not enough to look at their co-occurrences; we need to look at the ratio of their co-occurrences with other words.
– Formalizing this intuition gives us an optimization problem.

The GloVe objective

Notation:
• $i$: a word, $j$: a context word
• $w_i$: the word embedding for $i$
• $c_j$: the context embedding for $j$
• $b_i, b_j$: two bias terms, word- and context-specific
• $X_{ij}$: the number of times word $i$ occurs in the context of $j$

The intuition:
1. Construct a word-context matrix whose $(i, j)^{th}$ entry is $\log X_{ij}$
2. Find vectors $w_i, c_j$ and biases $b_i, b_j$ such that the dot product of the vectors added to the biases approximates the matrix entries

Objective:

$J = \sum_{i,j=1}^{|V|} \left( w_i^\top c_j + b_i + b_j - \log X_{ij} \right)^2$

Problem: Pairs that frequently co-occur tend to dominate the objective.

Answer: Correct for this by adding an extra term that prevents this:

$J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top c_j + b_i + b_j - \log X_{ij} \right)^2$

$f$: a weighting function that assigns lower relative importance to frequent co-occurrences.
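A sketch of evaluating this weighted objective for given parameters. The weighting function shown is the one proposed in the GloVe paper, $f(x) = (x/x_{max})^{\alpha}$ for $x < x_{max}$ and 1 otherwise, with $x_{max} = 100$ and $\alpha = 0.75$; the counts and parameters are random placeholders, and the toy counts are shifted to be positive since GloVe only sums over nonzero $X_{ij}$.

```python
import numpy as np

rng = np.random.default_rng(0)
V, n = 50, 25
X = rng.poisson(2.0, size=(V, V)).astype(float) + 1.0   # toy counts, kept > 0 so log is defined
W = rng.normal(scale=0.1, size=(V, n))    # word vectors w_i
C = rng.normal(scale=0.1, size=(V, n))    # context vectors c_j
bw = np.zeros(V)                          # word biases b_i
bc = np.zeros(V)                          # context biases b_j

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: damp the influence of very frequent co-occurrences."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_objective():
    pred = W @ C.T + bw[:, None] + bc[None, :]    # w_i^T c_j + b_i + b_j
    return np.sum(f(X) * (pred - np.log(X)) ** 2)

print(glove_objective())
```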

GloVe: Global Vectors

Essentially a matrix factorization method.

It does not compute a standard SVD though:
1. Re-weighting for frequency
2. Two-way factorization, unlike SVD which produces $U, \Sigma, V$
3. Bias terms

Final word embedding for a word: the average of the word and the context vectors of that word.

Summary

• We saw three different methods for word embeddings

• Many, many, many variants and improvements exist

• Various tunable parameters/training choices:
  – Dimensionality of embeddings
  – Text for training the embeddings
  – The context window size, and whether it should be symmetric
  – And the usual stuff: learning algorithm to use, the loss function, hyper-parameters

• See references for more details