Overview

• Representing meaning
• Word embeddings: Early work
• Word embeddings via language models
• Word2vec and GloVe
• Evaluating embeddings
• Design choices and open questions
Word embeddings via language models

The goal: to find vector embeddings of words.

High-level approach:
1. Train a model for a surrogate task (in this case, language modeling).
2. Word embeddings are a byproduct of this process.
Neural network language models

• A multi-layer neural network [Bengio et al., 2003]
  – Words → embedding layer → hidden layers → softmax
  – Cross-entropy loss
• Instead of producing a probability, just produce a score for the next word (no softmax) [Collobert and Weston, 2008]
  – Ranking loss
  – Intuition: valid word sequences should get a higher score than invalid ones
• No need for a multi-layer network; a shallow network is good enough [Mikolov, 2013, word2vec]
  – Simpler model, fewer parameters
  – Faster to train
Context = previous words in the sentence (as in a language model)
Context = previous and next words in the sentence (as in word2vec)
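To make the first bullet above concrete, here is a minimal numpy sketch (my own illustration, not code from the lecture; all sizes and word ids are invented) of a Bengio-style neural language model: embed the previous m words, concatenate, apply one hidden layer, and softmax over the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, n, h, m = 10000, 300, 128, 4            # vocab size, embedding dim, hidden dim, context length

E  = rng.normal(scale=0.1, size=(V, n))    # embedding layer
W1 = rng.normal(scale=0.1, size=(m * n, h))
W2 = rng.normal(scale=0.1, size=(h, V))

def nnlm_probs(prev_word_ids):
    """P(next word | previous m words) for one context."""
    x = E[prev_word_ids].reshape(-1)       # look up and concatenate the m embeddings
    hidden = np.tanh(x @ W1)               # hidden layer
    scores = hidden @ W2                   # one score per vocabulary word
    scores -= scores.max()                 # numerical stability
    p = np.exp(scores)
    return p / p.sum()                     # softmax

probs = nnlm_probs([12, 7, 345, 9])        # four made-up previous-word ids
loss = -np.log(probs[42])                  # cross-entropy loss if the observed next word has id 42
```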
This lecture

• The word2vec models: CBOW and Skipgram
• Connection between word2vec and matrix factorization
• GloVe
Word2Vec
[Mikolov et al., ICLR 2013; Mikolov et al., NIPS 2013]

• Two architectures for learning word embeddings
  – Skipgram and CBOW
• Both have two key differences from the older Bengio / C&W approaches:
  1. No hidden layers
  2. Extra context (both left and right context)
• Several computational tricks to make things faster
Continuous Bag of Words (CBOW)

Given a window of words of length 2m+1, call them $x_{-m}, \dots, x_{-1}, x_0, x_1, \dots, x_m$.

Define a probabilistic model for predicting the middle word:

$P(x_0 \mid x_{-m}, \dots, x_{-1}, x_1, \dots, x_m)$

Train the model by minimizing the loss over the dataset:

$L = -\sum \log P(x_0 \mid x_{-m}, \dots, x_{-1}, x_1, \dots, x_m)$

We still need to define this probability to complete the model.
The CBOW model

• The classification task
  – Input: context words $x_{-m}, \dots, x_{-1}, x_1, \dots, x_m$
  – Output: the center word $x_0$
  – These words correspond to one-hot vectors
    • E.g., cat is associated with one dimension; its one-hot vector has a 1 in that dimension and zeros everywhere else
• Notation:
  – n: the embedding dimension (e.g., 300)
  – V: the vocabulary of words we want to embed
• Define two matrices:
  1. 𝒱: a matrix of size $n \times |V|$
  2. 𝒲: a matrix of size $|V| \times n$
The CBOW model

1. Map all the context words into the n-dimensional space using 𝒱
   – We get 2m vectors: $\mathcal{V}x_{-m}, \dots, \mathcal{V}x_{-1}, \mathcal{V}x_1, \dots, \mathcal{V}x_m$
2. Average these vectors to get a context vector
   $v_A = \frac{1}{2m} \sum_{i=-m,\, i \neq 0}^{m} \mathcal{V}x_i$
3. Use this to compute a score vector for the output: $score = \mathcal{W}v_A$
4. Use the score to compute a probability via softmax: $P(x_0 = \cdot \mid context) = \mathrm{softmax}(\mathcal{W}v_A)$

Exercise: Write this as a computation graph.

Word embeddings: the rows of the matrix corresponding to the output, that is, the rows of 𝒲.
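To make the four steps concrete, here is a minimal numpy sketch (my own illustration, not code from the lecture; matrix sizes and word ids are invented) that computes the CBOW probability and the corresponding negative log-likelihood loss for one example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n, m = 10000, 300, 2                      # |V|, embedding dim, context size per side
V_in  = rng.normal(scale=0.1, size=(n, vocab_size))   # 𝒱: n x |V| (context embeddings, one per column)
W_out = rng.normal(scale=0.1, size=(vocab_size, n))   # 𝒲: |V| x n (word embeddings, one per row)

def cbow_loss(context_ids, center_id):
    """-log P(center word | 2m context words) for one example."""
    v_avg = V_in[:, context_ids].mean(axis=1)         # steps 1-2: look up columns of 𝒱 and average
    scores = W_out @ v_avg                             # step 3: one score per vocabulary word
    scores -= scores.max()                             # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())  # step 4: log-softmax
    return -log_probs[center_id]

# Example: context words a, b, d, e predicting center word c (arbitrary made-up ids)
loss = cbow_loss(context_ids=[11, 42, 7, 99], center_id=3)
```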
The CBOW loss: A worked example

Consider the loss for one example with a context of size 2 on each side. Denote the words by a b c d e, with c being the output.

• Step 1: Project a, b, d, e using the matrix 𝒱. This gives us the corresponding columns of the matrix: $v_a, v_b, v_d, v_e$.
• Step 2: Take their average:
  $v_A = \frac{v_a + v_b + v_d + v_e}{4}$
• Step 3: The score $= \mathcal{W}v_A$
  – Each element of this score vector corresponds to the score for a single word.
• Step 4: The probability of a word being the center word:
  $P(\cdot \mid a, b, d, e) = \mathrm{softmax}(\mathcal{W}v_A)$

More concretely, writing $w_i$ for the $i$-th row of 𝒲:

$P(c \mid a, b, d, e) = \frac{\exp(w_c^\top v_A)}{\sum_{i=1}^{|V|} \exp(w_i^\top v_A)}$

The loss requires the negative log of this quantity:

$\mathrm{Loss} = -w_c^\top v_A + \log \sum_{i=1}^{|V|} \exp(w_i^\top v_A)$

Exercise: Calculate the derivative of this with respect to all the w's and the v's.

Note that this sum requires us to iterate over the entire vocabulary for each example!
This lecture

• The word2vec models: CBOW and Skipgram
• Connection between word2vec and matrix factorization
• GloVe
Skipgram

Given a window of words of length 2m+1, call them $x_{-m}, \dots, x_{-1}, x_0, x_1, \dots, x_m$.

Define a probabilistic model for predicting each context word:

$P(x_{\mathrm{context}} \mid x_0)$

This inverts the inputs and outputs from CBOW. As far as the probabilistic model is concerned:
Input: the center word
Output: all the words in the context
The Skipgram model

• The classification task
  – Input: the center word $x_0$
  – Output: context words $x_{-m}, \dots, x_{-1}, x_1, \dots, x_m$
  – As before, these words correspond to one-hot vectors
• Notation:
  – n: the embedding dimension (e.g., 300)
  – V: the vocabulary of words we want to embed
• Define two matrices:
  1. 𝒱: a matrix of size $n \times |V|$
  2. 𝒲: a matrix of size $|V| \times n$
The Skipgram model

1. Map the center word into the n-dimensional space using 𝒲
   – We get an n-dimensional vector $w = \mathcal{W}^\top x_0$ (the row of 𝒲 corresponding to the center word)
2. For the $i$-th context position, compute the score for a word occupying that position as
   $v_i = \mathcal{V}^\top w$
3. Normalize the score for each position to get a probability
   $P(x_i = \cdot \mid x_0) = \mathrm{softmax}(v_i)$

Exercise: Write this as a computation graph.

Remember the goal of learning: make this probability highest for the observed words in this context.
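Again as an illustration only (my own sketch, with invented sizes and word ids), here is the Skipgram forward pass in numpy, summing the loss over all context positions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n = 10000, 300
V_ctx  = rng.normal(scale=0.1, size=(n, vocab_size))   # 𝒱: context embeddings, one per column
W_word = rng.normal(scale=0.1, size=(vocab_size, n))    # 𝒲: center-word embeddings, one per row

def skipgram_loss(center_id, context_ids):
    """Sum of -log P(context word | center word) over all context positions."""
    w = W_word[center_id]                  # the row of 𝒲 for the center word, shape (n,)
    scores = V_ctx.T @ w                   # one score per vocabulary word, shape (|V|,)
    scores -= scores.max()                 # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -sum(log_probs[c] for c in context_ids)

# Example: center word c predicting context words a, b, d, e (arbitrary made-up ids)
loss = skipgram_loss(center_id=3, context_ids=[11, 42, 7, 99])
```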
The Skipgram loss: A worked example

Consider the loss for one example with a context of size 2 on each side. Denote the words by a b c d e, with c being the center word.

Step 1: Get the vector $w_c = \mathcal{W}^\top x_c$, the row of 𝒲 corresponding to c.

Step 2: For every context position, compute the score for a word occupying that position: $v = \mathcal{V}^\top w_c$.

Step 3: Normalize the score for each position using softmax:

$P(x_i = \cdot \mid x_0 = c) = \mathrm{softmax}(v)$

Or, more specifically, writing $v_i$ for the column of 𝒱 corresponding to word $i$:

$P(x_{-2} = a \mid x_0 = c) = \frac{\exp(v_a^\top w_c)}{\sum_{i=1}^{|V|} \exp(v_i^\top w_c)}$

The loss for this example is the sum of the negative log of this over all the context words:

$\mathrm{Loss} = \sum_{k \in \{a, b, d, e\}} \left( -v_k^\top w_c + \log \sum_{i=1}^{|V|} \exp(v_i^\top w_c) \right)$

Exercise: Calculate the derivative of this with respect to all the w's and the v's.

Note that this sum requires us to iterate over the entire vocabulary for each example!
Negative sampling

• Can we make it faster?
• Answer [Mikolov et al., 2013]: change the objective, and define a new objective function that does not have the same problem
  – Negative sampling
• The overall method is called Skipgram with Negative Sampling (SGNS)

The expensive term in the loss is

$\log \sum_{i=1}^{|V|} \exp(v_i^\top w_c)$

This sum requires us to iterate over the entire vocabulary for each example!
Negative sampling: The intuition

• A new task: given a pair of words (w, c), is this a valid pair or not?
  – That is, can word c occur in the context window of w or not?
• This is a binary classification problem
  – We can solve this using logistic regression
  – The probability of a pair of words being valid is defined as
    $P(c \mid w) = \sigma(v_c^\top w) = \frac{1}{1 + \exp(-v_c^\top w)}$
• Positive examples are all pairs that occur in the data; negative examples are all pairs that do not occur in the data, but this is still a massive set!
• Key insight: instead of generating all possible negative examples, randomly sample k of them in each epoch of the learning loop
  – That is, there are only k negatives for each positive example, instead of the entire vocabulary

We will visit negative sampling in the first homework.
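The slides defer the details to the homework; purely as an illustration, here is a minimal numpy sketch of the standard SGNS objective from Mikolov et al. (2013): push σ(v_c·w) toward 1 for observed pairs and toward 0 for k sampled negatives. The sizes, word ids, and the uniform sampling distribution below are invented (the original samples negatives from a smoothed unigram distribution).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n, k = 10000, 300, 5
C = rng.normal(scale=0.1, size=(vocab_size, n))   # context vectors v_c, one per row
W = rng.normal(scale=0.1, size=(vocab_size, n))   # word vectors w, one per row

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(word_id, context_id, noise_probs):
    """Negative-sampling loss for one (word, context) pair with k sampled negatives."""
    w = W[word_id]
    pos = -np.log(sigmoid(C[context_id] @ w))           # observed pair: valid
    neg_ids = rng.choice(vocab_size, size=k, p=noise_probs)
    neg = -np.log(sigmoid(-C[neg_ids] @ w)).sum()        # k sampled pairs: treated as invalid
    return pos + neg

uniform = np.full(vocab_size, 1.0 / vocab_size)           # stand-in for the noise distribution
loss = sgns_loss(word_id=3, context_id=42, noise_probs=uniform)
```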
Word2vec notes

There are many other tricks needed to make this work and scale:
– A scaling term in the loss function to ensure that frequent words do not dominate the loss
– Hierarchical softmax, if you don't want to use negative sampling
– A clever learning rate schedule
– Very efficient code

See the reading for more details.
This lecture

• The word2vec models: CBOW and Skipgram
• Connection between word2vec and matrix factorization
• GloVe
Recall: Matrix factorization for embeddings

The general agenda:
1. Construct a word-word matrix M whose entries are some function extracted from data involving words in context (e.g., counts, normalized counts, etc.)
2. Factorize the matrix using SVD to produce lower-dimensional embeddings of the words
3. Use one of the resulting matrices as word embeddings
   – Or some combination thereof
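As a reminder of what step 2 looks like in practice, here is a small numpy sketch (my own illustration; the matrix below is random rather than real co-occurrence data) of truncating an SVD to get d-dimensional embeddings.

```python
import numpy as np

def svd_embeddings(M, d):
    """Truncated SVD of a word-word matrix M; returns one d-dimensional row per word."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :d] * S[:d]          # one common choice; U[:, :d] alone is another

# Toy example: a random stand-in for a 1000 x 1000 co-occurrence matrix
rng = np.random.default_rng(0)
M = rng.poisson(1.0, size=(1000, 1000)).astype(float)
emb = svd_embeddings(M, d=50)        # shape (1000, 50)
```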
Word2vec and matrix factorization

[Levy and Goldberg, NIPS 2014]: Skipgram with negative sampling is implicitly factorizing a specific matrix of this kind.

Two key points to note:
1. The entries in the matrix are a shifted (positive) pointwise mutual information (SPPMI) between a word and its context word:

$\mathrm{PMI}(w, c) = \log \frac{p(w, c)}{p(w)\, p(c)}$

$\mathrm{SPPMI}(w, c) = \mathrm{PMI}(w, c) - \log k$

These probabilities are computed by counting over the data and normalizing. (A small sketch of this construction appears after the next point.)
2. The matrix factorization method is not truncated SVD.
   – It instead minimizes the objective function to compute the factorized matrices.
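Here is a small numpy sketch (my own illustration, using a random count matrix) of constructing such a matrix from word-context counts. Following Levy and Goldberg, the "positive" part of SPPMI additionally clips negative values at zero, and k plays the role of the number of negative samples in SGNS.

```python
import numpy as np

def sppmi_matrix(counts, k=5):
    """Shifted positive PMI matrix from a word-context count matrix.

    counts[w, c] = number of times context word c appeared in the window of word w.
    """
    total = counts.sum()
    p_wc = counts / total                        # joint probabilities p(w, c)
    p_w = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore"):           # log(0) -> -inf for unseen pairs
        pmi = np.log(p_wc) - np.log(p_w) - np.log(p_c)
    return np.maximum(pmi - np.log(k), 0.0)      # shift by log k, clip at 0

# Toy example with a random count matrix
rng = np.random.default_rng(0)
counts = rng.poisson(0.5, size=(500, 500)).astype(float)
M = sppmi_matrix(counts, k=5)
```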
This lecture

• The word2vec models: CBOW and Skipgram
• Connection between word2vec and matrix factorization
• GloVe [Pennington et al., 2014]
What matrix to factorize?

If we are building word embeddings by factorizing a matrix, what matrix should we consider?

• Word counts [Rohde et al., 2005]
• Shifted PPMI (implicitly) [Mikolov et al., 2013; Levy & Goldberg, 2014]
• Another answer: log co-occurrence counts [Pennington et al., 2014]
Co-occurrence probabilities

Given two words i and j that occur in text, their co-occurrence probability is defined as the probability of seeing j in the context of i:

$P(j \mid i) = \frac{\mathrm{count}(j \text{ in context of } i)}{\sum_k \mathrm{count}(k \text{ in context of } i)}$

Claim: If we want to distinguish between two words, it is not enough to look at their co-occurrences; we need to look at the ratio of their co-occurrences with other words.
– Formalizing this intuition gives us an optimization problem.
The GloVe objective

Notation:
• $i$: a word, $j$: a context word
• $w_i$: the word embedding for $i$
• $c_j$: the context embedding for $j$
• $b_i$, $b_j$: two bias terms, one word-specific and one context-specific
• $X_{ij}$: the number of times word $i$ occurs in the context of $j$

The intuition:
1. Construct a word-context matrix whose $(i, j)$-th entry is $\log X_{ij}$.
2. Find vectors $w_i$, $c_j$ and biases $b_i$, $b_j$ such that the dot product of the vectors, added to the biases, approximates the matrix entries.
Objective:

$J = \sum_{i,j=1}^{|V|} \left( w_i^\top c_j + b_i + b_j - \log X_{ij} \right)^2$

Problem: pairs that frequently co-occur tend to dominate the objective.

Answer: correct for this by adding an extra term that prevents this.
The corrected objective:

$J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top c_j + b_i + b_j - \log X_{ij} \right)^2$

$f$: a weighting function that assigns lower relative importance to frequent co-occurrences.
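A minimal numpy sketch (my own, not the official GloVe implementation) of evaluating this weighted least-squares objective. The specific weighting $f(x) = \min(x/x_{\max}, 1)^{\alpha}$ with $x_{\max} = 100$, $\alpha = 0.75$ follows the choice in Pennington et al. (2014); the co-occurrence matrix below is random toy data.

```python
import numpy as np

def glove_objective(W, C, b_w, b_c, X, x_max=100.0, alpha=0.75):
    """Weighted least-squares GloVe objective for a co-occurrence matrix X.

    W:   |V| x n word vectors        C:   |V| x n context vectors
    b_w: |V| word biases             b_c: |V| context biases
    """
    f = np.minimum(X / x_max, 1.0) ** alpha      # down-weight frequent pairs
    f[X == 0] = 0.0                               # pairs that never co-occur get zero weight
    with np.errstate(divide="ignore"):
        logX = np.where(X > 0, np.log(X), 0.0)    # log X_ij only where it is defined
    residual = W @ C.T + b_w[:, None] + b_c[None, :] - logX
    return np.sum(f * residual ** 2)

# Toy example
rng = np.random.default_rng(0)
V, n = 500, 50
X = rng.poisson(0.5, size=(V, V)).astype(float)
J = glove_objective(rng.normal(size=(V, n)) * 0.1, rng.normal(size=(V, n)) * 0.1,
                    np.zeros(V), np.zeros(V), X)
```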
GloVe: Global Vectors

Essentially a matrix factorization method.

It does not compute a standard SVD, though:
1. Re-weighting for frequency
2. Two-way factorization, unlike SVD, which produces $U, \Sigma, V$
3. Bias terms

Final word embeddings for a word: the average of the word vector and the context vector of that word.
Summary

• We saw three different methods for word embeddings
• Many, many, many variants and improvements exist
• Various tunable parameters and training choices:
  – Dimensionality of the embeddings
  – Text for training the embeddings
  – The context window size, and whether it should be symmetric
  – And the usual stuff: learning algorithm, loss function, hyper-parameters
• See the references for more details