Natural Language Processing
Word vectors
Many slides borrowed from Richard Socher, Chris Manning, and Hugo Larochelle
Lecture plan
• Word representations
• Word vectors (embeddings)
• Skip-gram algorithm
• Relation to matrix factorization
• Evaluation
Representing words
Representing words
Definition: meaning (Webster dictionary)
• the idea that is represented by a word, phrase, etc.
• the idea that a person wants to express by using words, signs, etc.
• the idea that is expressed in a work of writing, art, etc.
In linguistics:
signifier <—> signified (idea or thing) = denotation
Taxonomies
[Figure: a taxonomy (hypernym tree) rooted at “beverage”]
Representing words with computers
A word is the set of meanings it has in a taxonomy (a graph of meanings).
Hypernym: “is-a” relation. Hyponym: the opposite of a hypernym.
Drawbacks
• Expensive!
• Subjective (how do you split different synsets?)
• Incomplete
• e.g., new and slang senses are missing: wicked, badass, nifty, crack, ace, wizard, genius, ninja
• Missing functionality:
• How do you compute word similarity?
• How do you compose meanings?
Discrete representation
Words are atomic symbols (one-hot representation):
V = {hotel, motel, walk, wife, spouse}
|V| ≈ 100,000
hotel [1 0 0 0 0]
motel [0 1 0 0 0]
walk [0 0 1 0 0]
wife [0 0 0 1 0]
spouse [0 0 0 0 1]
Drawback
Barack Obama’s wife ≈ Barack Obama’s spouse; Barack Obama’s wife ≉ Barack Obama’s advisors
Seattle motels ≈ Seattle hotels; Seattle motels ≉ Seattle attractions
But all one-hot word vectors are orthogonal and equidistant.
Goal: word vectors with a natural notion of similarity:
$\langle \text{“hotel”}, \text{“motel”} \rangle > \langle \text{“hotel”}, \text{“spouse”} \rangle$
Distributional similarity
“You shall know a word by the company it keeps” (Firth, 1957)
“… cashed a check at the bank across the street…” “… that bank holds the mortgage on my home…” “… said that the bank raised his forecast for…” “… employees of the bank have confessed to the charges”
Central idea: represent words by their context
Idea 1

word → context
wife → {met: 3, married: 4, children: 2, wedded: 1, …}
spouse → {met: 2, married: 5, children: 2, kids: 1, …}

Problem: related context words are counted as unrelated:
• married <==> wedded
• children <==> kids
Distributed representations
language = [0.278, −0.911, 0.792, −0.177, 0.109, −0.542, −0.0003]

• Represent words as low-dimensional vectors
• Represent similarity with vector similarity metrics
Word vectors
Motivation
• Word embeddings are widely used
• (other options exist: word-parts, character-level, …)
• The great innovation of 2018: contextualized word embeddings
Supervised learning
• Input: training set $\{(x_i, y_i)\}_{i=1}^N$, $(x_i, y_i) \sim \mathcal{D}(\mathcal{X} \times \mathcal{Y})$
• Output (probabilistic model): $f : \mathcal{X} \to \mathcal{Y}$, $f(x) = \arg\max_y p(y \mid x)$
• Example: train a spam detector from spam and non-spam e-mails.
Word embeddings (Mikolov et al., 2013)
“… that bank holds the mortgage on my home …”
1. Define a supervised learning task from raw text (no manual annotation!):
1. (x, y) = (bank, that)
2. (x, y) = (bank, holds)
3. (x, y) = (holds, bank)
4. (x, y) = (holds, the)
5. (x, y) = (the, holds)
6. (x, y) = (the, mortgage)
7. (x, y) = (mortgage, the)
8. (x, y) = (mortgage, on)
9. (x, y) = (on, mortgage)
10. (x, y) = (on, my)
11. (x, y) = (my, on)
12. (x, y) = (my, home)
…
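To make this step concrete, here is a minimal sketch (not from the original slides) of window-based pair extraction; whitespace tokenization and the window size m are illustrative assumptions:

```python
# Minimal sketch: extract (center, context) skip-gram training pairs
# from raw text with a symmetric window of size m (illustrative values).

def skipgram_pairs(text, m=2):
    tokens = text.lower().split()  # assumes simple whitespace tokenization
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

print(skipgram_pairs("that bank holds the mortgage on my home", m=1))
# [('that', 'bank'), ('bank', 'that'), ('bank', 'holds'), ('holds', 'bank'), ...]
```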
Word embeddings
2. Define a model for the output given the input: p(“holds” | “bank”)

$p_\theta(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}$

• u: vector for the “outside” word; v: vector for the “center” word; V: number of words in the vocabulary; 𝛉: all parameters
• Multi-class classification model (number of classes? V)
• How many parameters are in the model? $|\theta| = 2 \cdot V \cdot d$, where $u, v \in \mathbb{R}^d$
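A small numpy sketch of this softmax model (toy sizes and names are illustrative; U and Vmat stand for the outside-word and center-word vector matrices):

```python
# Minimal sketch of the skip-gram softmax model p_theta(o | c).
# U and Vmat are both V x d, so |theta| = 2*V*d as noted above.
import numpy as np

V, d = 5, 3                      # toy vocabulary size and embedding dim
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))      # u_w: outside-word vectors
Vmat = rng.normal(size=(V, d))   # v_w: center-word vectors

def p_outside_given_center(c):
    """Return the full distribution p(o | c) for center word index c."""
    scores = U @ Vmat[c]                        # u_o^T v_c for every o
    exp_scores = np.exp(scores - scores.max())  # numerically stable softmax
    return exp_scores / exp_scores.sum()

print(p_outside_given_center(0))  # sums to 1 over the V classes
```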
Word embeddings
3. Define an objective function for a corpus of length T:

$L(\theta) = \prod_{t=1}^{T} \; \prod_{-m \le j \le m,\; j \ne 0} p_\theta(w_{t+j} \mid w_t)$

$J(\theta) = \log L(\theta) = \sum_{t=1}^{T} \; \sum_{-m \le j \le m,\; j \ne 0} \log p_\theta(w_{t+j} \mid w_t)$

Find parameters that maximize the objective.
Class 1 Recap
• Word representations:
• Ontology-based
• Pros: polysemy, similarity metrics
• Cons: expensive, compositionality, granularity
• One-hot
• Pros: cheap, simple, scales, compositionality
• Cons: no similarity
• Embeddings:
• Cheap, simple, scales, compositionality, similarity
Today
• Word2vec
• Efficiency:
• Hierarchical softmax
• Skip-gram with negative sampling (assignment 1)
• Skip-gram as matrix factorization
• Evaluation (GloVe)
Word embeddings (recap)
2. Model for the output given the input:

$p_\theta(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}$

• u: vector for the “outside” word; v: vector for the “center” word; V: number of words in the vocabulary; 𝛉: all parameters
• We don’t really need the distribution, only the representation!
$J(\theta) = \log L(\theta) = \sum_{t=1}^{T} \; \sum_{-m \le j \le m,\; j \ne 0} \log p_\theta(w_{t+j} \mid w_t)$
Word embeddings
• What probabilities would maximize the objective?
$L(\Theta) = \prod_{c,o} p(o \mid c)^{\#(c,o)}$

We can solve separately for each center word c:

$L_c(\Theta) = \prod_{o} p(o \mid c)^{\#(c,o)}$

Solve for:

$J_c(\Theta) = \sum_i \#(c, o_i) \log p(o_i \mid c) \quad \text{s.t.} \quad \sum_i p(o_i \mid c) = 1, \;\; p(o_i \mid c) \ge 0$

Use Lagrange multipliers:

$\mathcal{L}(\Theta, \lambda) = \sum_i \#(c, o_i) \log p(o_i \mid c) - \lambda \Big( \big( \sum_i p(o_i \mid c) \big) - 1 \Big)$

$\nabla_{p(o_i \mid c)} \mathcal{L} = \frac{\#(c, o_i)}{p(o_i \mid c)} - \lambda = 0 \;\Rightarrow\; p(o_i \mid c) = \frac{\#(c, o_i)}{\lambda}$

$\sum_i p(o_i \mid c) = \sum_i \frac{\#(c, o_i)}{\lambda} = 1 \;\Rightarrow\; \lambda = \sum_i \#(c, o_i)$

$p(o_i \mid c) = \frac{\#(c, o_i)}{\sum_i \#(c, o_i)}$
Questions
• Intuitions:
• Why should similar words have similar vectors?
• Why do we have different parameters for the center word and the output word?
Gradient descent
3. How to find parameters that minimize the objective?
• Start at some point and move in the opposite direction of the gradient
Gradient descent

$f(x) = x^4 + 3x^3 + 2$

$f'(x) = 4x^3 + 9x^2$
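As a sketch, gradient descent on this example function (the step size and starting point are arbitrary illustrative choices):

```python
# Sketch: gradient descent on the example f(x) = x^4 + 3x^3 + 2,
# using f'(x) = 4x^3 + 9x^2. Step size and start are arbitrary choices.

def f(x):
    return x**4 + 3 * x**3 + 2

def f_prime(x):
    return 4 * x**3 + 9 * x**2

x = -1.0        # starting point (illustrative)
alpha = 0.01    # step size
for _ in range(2000):
    x -= alpha * f_prime(x)   # move against the gradient
print(x, f(x))  # approaches the minimum at x = -9/4 = -2.25
```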
Gradient descent
• We want to minimize:

$J(\theta) = -\sum_{t=1}^{T} \sum_{j} \log p_\theta(w_{t+j} \mid w_t)$

• Update rule:

$\theta_j^{\text{new}} = \theta_j^{\text{old}} - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$

$\theta^{\text{new}} = \theta^{\text{old}} - \alpha \nabla J(\theta)$

• 𝛂 is a step size
• $\theta \in \mathbb{R}^{2Vd}$
Stochastic gradient descent
• For large corpora (billions of tokens) this update is very slow
• Sample a window t
• Update gradients based on that window

$\theta^{\text{new}} = \theta^{\text{old}} - \alpha \nabla J_t(\theta)$
Deriving the gradient
• Mostly applications of the chain rule
• Let’s derive the gradient of a center word for a single output word: $\log p_\theta(w_{t+j} \mid w_t)$
• You will do this again in the assignment (and more)
Gradient derivation

$\mathcal{L}(\Theta) = \log p(o \mid c) = \log \frac{\exp(u_o^\top v_c)}{\sum_i \exp(u_{o_i}^\top v_c)} = u_o^\top v_c - \log \sum_i \exp(u_{o_i}^\top v_c)$

$\nabla_{v_c} \mathcal{L}(\Theta) = u_o - \frac{1}{\sum_j \exp(u_{o_j}^\top v_c)} \cdot \sum_i \exp(u_{o_i}^\top v_c) \cdot u_{o_i}$

$= u_o - \sum_i \frac{\exp(u_{o_i}^\top v_c)}{\sum_j \exp(u_{o_j}^\top v_c)} \cdot u_{o_i}$

$= u_o - \sum_i p(o_i \mid c) \cdot u_{o_i} = u_o - \mathbb{E}_{o_i \sim p(o_i \mid c)}[u_{o_i}]$
Recap
• Goal: represent words with low-dimensional vectors
• Approach: define a supervised learning problem from a corpus
• We defined the necessary components for skip-gram:
• Model (softmax over word labels for each word)
• Objective (minimize the negative log-likelihood)
• Optimize with SGD
• We computed the gradient for some parameters by hand
Computational problem
• Computing the partition function is too expensive
• Solution 1: hierarchical softmax (Morin and Bengio, 2005) reduces computation time to log |V| by constructing a binary tree over the vocabulary
• Solution 2: change the objective
• Skip-gram with negative sampling (home assignment 1)
Hierarchical softmax
• p(“cat” | “dog”) = p(left at 1) × p(right at 2) × p(right at 5)
= (1 − p(right at 1)) × p(right at 2) × p(right at 5)

[Figure: binary tree over the vocabulary; internal nodes 1–7, leaves: he, she, and, cat, the, have, be, are]

$p(\text{cat} \mid \text{dog}) = (1 - \sigma(o_1^\top c_{\text{dog}})) \times \sigma(o_2^\top c_{\text{dog}}) \times \sigma(o_5^\top c_{\text{dog}})$
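A toy sketch of this computation (the tree layout, node numbering, and vectors below are hypothetical, mirroring the 8-leaf example): the probability of a word is a product of at most log |V| branch sigmoids rather than a V-way softmax.

```python
# Illustrative sketch of the hierarchical-softmax probability above.
# Each internal node n has a vector o_n; a word's probability is the
# product of branch probabilities along its root-to-leaf path.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d = 4
rng = np.random.default_rng(0)
node_vecs = {n: rng.normal(size=d) for n in range(1, 8)}  # internal nodes 1..7
c_dog = rng.normal(size=d)                                # center-word vector

# Path to "cat" as in the slide: left at node 1, right at 2, right at 5.
path_to_cat = [(1, False), (2, True), (5, True)]

p = 1.0
for node, go_right in path_to_cat:
    p_right = sigmoid(node_vecs[node] @ c_dog)
    p *= p_right if go_right else (1.0 - p_right)
print("p(cat | dog) =", p)  # only log|V| sigmoids per probability
```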
Hierarchical softmax
• How to construct the tree?
• Randomly (doesn’t work well, but better than you’d think)
• Using external knowledge like WordNet
• Learn word representations somehow and then cluster
Skip-gram with Negative Sampling

(x, y) = ((bank, holds), 1)
(x, y) = ((bank, table), 0)
(x, y) = ((bank, eat), 0)
(x, y) = ((holds, bank), 1)
(x, y) = ((holds, quickly), 0)
(x, y) = ((holds, which), 0)
(x, y) = ((the, mortgage), 1)
(x, y) = ((the, eat), 0)
(x, y) = ((the, who), 0)

What information is lost?
$\sum_{o \in V} p(y = 1 \mid c, o) = \;?$
Skip-gram with Negative Sampling

• Model:

$p_\theta(y = 1 \mid c, o) = \frac{1}{1 + \exp(-u_o^\top v_c)} = \sigma(u_o^\top v_c)$

$p_\theta(y = 0 \mid c, o) = 1 - \sigma(u_o^\top v_c) = \sigma(-u_o^\top v_c)$

• Objective:

$\sum_{t,j} \Big[ \log \sigma(u_{w_{t+j}}^\top v_{w_t}) + \sum_{k \sim p(w)} \log \sigma(-u_{w^{(k)}}^\top v_{w_t}) \Big]$

• Negative samples are drawn from $p(w) \propto U(w)^{3/4}$, the unigram distribution raised to the power 3/4
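A minimal numpy sketch of one SGNS update for a single (center, outside) pair with k negative samples; names are illustrative, and negatives are drawn uniformly here rather than from U(w)^{3/4} for brevity:

```python
# Sketch: one stochastic gradient ascent step on the SGNS objective
#   log sigma(u_o . v_c) + sum_k log sigma(-u_neg . v_c)
# Uniform negative sampling is a simplification of U(w)^{3/4}.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

V, d, k, lr = 100, 16, 5, 0.05
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(V, d))     # outside vectors u_w
Vmat = rng.normal(scale=0.1, size=(V, d))  # center vectors v_w

def sgns_step(c, o):
    negs = rng.integers(0, V, size=k)      # k negative samples
    v_c = Vmat[c]
    # positive pair: push u_o and v_c together
    g = 1.0 - sigmoid(U[o] @ v_c)          # d/dx log sigma(x) = sigma(-x)
    grad_v = g * U[o]
    U[o] += lr * g * v_c
    # negative pairs: push u_neg and v_c apart
    for n in negs:
        gn = sigmoid(U[n] @ v_c)           # d/dx log sigma(-x) = -sigma(x)
        grad_v -= gn * U[n]
        U[n] -= lr * gn * v_c
    Vmat[c] += lr * grad_v

sgns_step(c=3, o=7)
```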
Summary
• We defined the three necessary components:
• Model (binary classification)
• Objective (maximum likelihood with negative sampling)
• Optimization method (SGD)
Many variants
• CBOW: predict the center word from the context
• Defining context:
• How big is the window?
• Is it sequential or based on syntactic information?
• A different model for every context position?
• Use stop words?
• …
Matrix factorization
Matrix factorization
• Consider the word-context co-occurrence matrix for a corpus (window size 1):

“I like deep learning. I like NLP. I enjoy flying.”

counts   | I | like | enjoy | deep | learning | NLP | flying | .
I        | 0 | 2    | 1     | 0    | 0        | 0   | 0      | 0
like     | 2 | 0    | 0     | 1    | 0        | 1   | 0      | 0
enjoy    | 1 | 0    | 0     | 0    | 0        | 0   | 1      | 0
deep     | 0 | 1    | 0     | 0    | 1        | 0   | 0      | 0
learning | 0 | 0    | 0     | 1    | 0        | 0   | 0      | 1
NLP      | 0 | 1    | 0     | 0    | 0        | 0   | 0      | 1
flying   | 0 | 0    | 1     | 0    | 0        | 0   | 0      | 1
.        | 0 | 0    | 0     | 0    | 1        | 1   | 1      | 0

Landauer and Dumais (1997)
Matrix factorization
• Reconstruct the matrix from low-dimensional word-context representations.
• Minimizes:

$\sum_{i,j} (A_{ij} - A^k_{ij})^2 = ||A - A_k||^2$
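For instance, a truncated SVD gives the best rank-k reconstruction in exactly this squared-error sense (Eckart-Young). A sketch on the toy matrix above (row/column order as in the table):

```python
# Sketch: rank-k approximation of the toy co-occurrence matrix A above
# via truncated SVD, which minimizes ||A - A_k||^2 over rank-k matrices.
import numpy as np

# rows/columns: I, like, enjoy, deep, learning, NLP, flying, .
A = np.array([
    [0, 2, 1, 0, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
], dtype=float)

U_svd, S, Vt = np.linalg.svd(A)
k = 2
A_k = U_svd[:, :k] * S[:k] @ Vt[:k]   # best rank-k reconstruction
word_vectors = U_svd[:, :k] * S[:k]   # one k-dimensional vector per word
print(np.linalg.norm(A - A_k) ** 2)   # the minimized squared error
```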
Relation to skip-gram
• The output of skip-gram can be viewed as factorizing a word-context matrix:

$M = V U^\top, \quad M \in \mathbb{R}^{|V| \times |V|}, \quad V, U \in \mathbb{R}^{|V| \times d}$

• Which matrix M does skip-gram decompose?

Levy and Goldberg, 2014
Relation to skip-gram

$\#(c) = \sum_{o'} \#(c, o')$

$\#(o) = \sum_{c'} \#(c', o)$

$T = \sum_{(c,o)} \#(c, o)$

$\frac{\#(o)}{T}$: unigram probability of o

$P_T$: the unigram distribution

$P_T(w) = \frac{c(w)}{|D|} = \frac{c(w) \cdot m}{|D| \cdot m} = \frac{\#(o)}{T}$
Relation to skip-gram
• Re-write the objective:

$L(\theta) = \sum_{c,o} \#(c,o) \Big[ \log \sigma(u_o^\top v_c) + k \cdot \mathbb{E}_{o' \sim P_T}[\log \sigma(-u_{o'}^\top v_c)] \Big]$

distribute:

$= \sum_{c,o} \#(c,o) \log \sigma(u_o^\top v_c) + \sum_{c,o} \#(c,o) \cdot k \cdot \mathbb{E}_{o' \sim P_T}[\log \sigma(-u_{o'}^\top v_c)]$

the expectation is constant for o:

$= \sum_{c,o} \#(c,o) \log \sigma(u_o^\top v_c) + \sum_{c} \#(c) \cdot k \cdot \mathbb{E}_{o' \sim P_T}[\log \sigma(-u_{o'}^\top v_c)]$

open the expectation:

$= \sum_{c,o} \#(c,o) \log \sigma(u_o^\top v_c) + \sum_{c} \#(c) \cdot k \cdot \sum_{o'} \frac{\#(o')}{T} \log \sigma(-u_{o'}^\top v_c)$

gather terms:

$= \sum_{c,o} \Big[ \#(c,o) \log \sigma(u_o^\top v_c) + \#(c) \cdot k \cdot \frac{\#(o)}{T} \log \sigma(-u_o^\top v_c) \Big]$
Relation to skip-gram
• Let’s assume the dot products are independent of one another. Let $x = u_o^\top v_c$:

$\ell(x) = \#(c,o) \log \sigma(x) + \#(c) \cdot k \cdot \frac{\#(o)}{T} \log \sigma(-x)$

$L(\theta) = \sum_{c,o} \ell(x)$

$\frac{\partial \ell(x)}{\partial x} = \#(c,o)\,\sigma(-x) - \#(c) \cdot k \cdot \frac{\#(o)}{T}\,\sigma(x) = 0$

$x = \log \Big( \frac{\#(c,o) \cdot T}{\#(c) \cdot \#(o)} \cdot \frac{1}{k} \Big)$

$x = \log \Big( \frac{p(c,o)}{p(c) \cdot p(o)} \Big) - \log k = \mathrm{PMI}(c,o) - \log k$
Relation to skip-gram
• Conclusion: skip-gram with negative sampling implicitly factorizes a “shifted” PMI matrix
• Many NLP methods factorize the PMI matrix with matrix decomposition methods to obtain dense vectors.
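A sketch of this classical route (not the lecture’s code): build the shifted PMI matrix from the #(c, o) counts and then factorize it; clipping at zero gives the common positive-PMI variant.

```python
# Sketch: the (shifted) PMI matrix that SGNS implicitly factorizes,
# computed from a #(c, o) count matrix; k is the number of negatives.
# Assumes every word occurs at least once (no all-zero rows/columns).
import numpy as np

def shifted_pmi(counts, k=1):
    """PMI(c, o) - log k; zero-count cells are clipped to 0 (positive PMI)."""
    T = counts.sum()
    p_co = counts / T
    p_c = counts.sum(axis=1, keepdims=True) / T
    p_o = counts.sum(axis=0, keepdims=True) / T
    with np.errstate(divide="ignore"):
        pmi = np.log(p_co / (p_c * p_o)) - np.log(k)
    return np.maximum(pmi, 0.0)  # dense vectors come from factorizing this
```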
Evaluation
Evaluation
• Intrinsic vs. extrinsic evaluation:
• Intrinsic: define some artificial task that tries to directly measure the quality of your learning algorithm (a bit of that in home assignment 1).
• Extrinsic: check whether your output is useful in a real NLP task
Intrinsic evaluation
• Word analogies:
• Normalize all word vectors to unit length
• man::woman <—> king::??
• a::b <—> c::d

$d = \arg\max_i \frac{(x_b - x_a + x_c)^\top x_i}{||x_b - x_a + x_c||}$
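A sketch of this analogy rule, assuming a hypothetical embedding matrix X with unit-normalized rows and a word2id lookup; the query words are excluded from the argmax, as is standard practice:

```python
# Sketch of the analogy rule above, with unit-normalized rows of X.
import numpy as np

def analogy(X, word2id, a, b, c):
    """Solve a::b <-> c::? by maximizing cosine with x_b - x_a + x_c."""
    target = X[word2id[b]] - X[word2id[a]] + X[word2id[c]]
    target /= np.linalg.norm(target)
    scores = X @ target                  # cosine, since rows are unit-norm
    for w in (a, b, c):                  # standard: exclude the query words
        scores[word2id[w]] = -np.inf
    return scores.argmax()               # index of the predicted word d
```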
Visualization
[Figures: visualizations of learned word vectors]
GloVe (Pennington et al., 2014)
• An objective that attempts to create a semantic space with linear structure
• Probability ratios are more important than probabilities
GloVe (Pennington et al., 2014)
• Try to find word embeddings such that (roughly):

$(v_{c_1} - v_{c_2})^\top u_o = \frac{P_{c_1 o}}{P_{c_2 o}}$

where $P_{co}$ is the probability of an output word o given a center word c.

• As an example:

$v_{\text{ice}} - v_{\text{steam}} \approx u_{\text{solid}}, \quad v_{\text{steam}} - v_{\text{ice}} \approx u_{\text{gas}}$
Word analogies evaluation
[Figure: accuracy results on the word-analogy task]
Human correlation intrinsic evaluation

word 1   | word 2   | human judgement
tiger    | cat      | 7.35
book     | paper    | 7.46
computer | internet | 7.58
plane    | car      | 5.77
stock    | phone    | 1.62
stock    | CD       | 1.31
stock    | jaguar   | 0.92
Human correlation intrinsic evaluation
• Compute the Spearman rank correlation between human similarity judgements and model similarity predictions (WordSim-353):
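A sketch of this evaluation, assuming a dict vec from words to vectors and a list of human-scored word pairs (hypothetical inputs); scipy’s spearmanr computes the rank correlation:

```python
# Sketch of a WordSim-353-style evaluation: Spearman rank correlation
# between human judgements and model cosine similarities.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def wordsim_eval(pairs, human_scores, vec):
    """pairs: list of (w1, w2); vec: dict word -> embedding."""
    model_scores = [cosine(vec[w1], vec[w2]) for w1, w2 in pairs]
    rho, _ = spearmanr(human_scores, model_scores)
    return rho
```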
Extrinsic evaluation
• Task: named entity recognition. Find mentions of person, location, and organization in text.
• Using good word representations might be useful.
Summary
• Words are central to language
• Most NLP systems use some form of word representation
• Graph-based representations are difficult to manipulate and compose
• One-hot vectors are useful with enough data but lose all generalization information
• Word embeddings provide a compact way to encode word meaning and similarity (but what about inference relations?)
• Skip-gram with negative sampling is a popular approach for learning word embeddings by casting an unsupervised problem as a supervised one
• It is strongly related to classical matrix decomposition methods
Current Research
• Contextualized word representations
• Sentence representations
Assignment 1
• Implement skip-gram with negative sampling
• There is ample literature if you want to consider this for a project
Gradient checks
• This is the single-parameter case
• For parameter vectors, iterate over all parameters and compute the numerical gradient for each one

$\frac{\partial J(\theta)}{\partial \theta} = \lim_{\epsilon \to 0} \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2\epsilon}$
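A sketch of this check, applying the central difference coordinate-wise with a small fixed ε standing in for the limit:

```python
# Sketch: numerical gradient via central differences, coordinate-wise.
import numpy as np

def numerical_gradient(J, theta, eps=1e-5):
    grad = np.zeros_like(theta)
    for i in range(theta.size):          # iterate over all parameters
        theta[i] += eps
        plus = J(theta)
        theta[i] -= 2 * eps
        minus = J(theta)
        theta[i] += eps                  # restore the original value
        grad[i] = (plus - minus) / (2 * eps)
    return grad

# usage: compare against an analytic gradient
theta = np.array([1.0, -2.0])
print(numerical_gradient(lambda t: (t**2).sum(), theta))  # ~[2, -4]
```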