Word Embeddings Winter 2020 COS 598C Advanced Topics in Computer Science: Deep Learning for Natural Language Processing

Word Embeddings · 2020. 2. 12. · • “Our experiments indicate that values of k in the range 5–20 are useful for small training datasets, while for large datasets the k can


Page 1: Word Embeddings

Word Embeddings

Winter 2020

COS 598C Advanced Topics in Computer Science: Deep Learning for Natural Language Processing

Page 2: Word Embeddings

Most popular topics

• #1: Adversarial examples (63% chose “I love to”)
• #2: Bias in Language (53%)
• #3: Dialogue I (50%)
• #4: Interpretability
• #5: Generalization
• #6: Reading Comprehension

Page 3: Word Embeddings

Least popular topics

• #1: Coreference resolution (47% chose “I am not really interested”)
• #2: Annotation artifacts (43%)
• #3: Semantic parsing (43%)

Page 4: Word Embeddings

Suggested topics

Page 5: Word Embeddings

Overview

• (Mikolov et al., 2013) Distributed Representations of Words and Phrases and their Compositionality

• (Baroni et al., 2014) Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors

Page 6: Word Embeddings
Page 7: Word Embeddings

Distributed representation of words

$$\text{employees} = \begin{pmatrix} 0.286 \\ 0.792 \\ -0.177 \\ -0.107 \\ 10.109 \\ -0.542 \\ 0.349 \\ 0.271 \\ 0.487 \end{pmatrix}$$

Page 8: Word Embeddings

Linguistic regularities

(Mikolov et al., 2013) Linguistic Regularities in Continuous Space Word Representations

Page 9: Word Embeddings

Distributional hypothesis

Distributional hypothesis: words that occur in similar contexts tend to have similar meanings

J. R. Firth, 1957

• “You shall know a word by the company it keeps”

• One of the most successful ideas of modern statistical NLP!

These context words will represent banking.

Page 10: Word Embeddings

Latent Semantic Analysis (SVD-based methods)

“context-counting” vectors

(Deerwester et al., 1990): Indexing by latent semantic analysis
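The counting approach can be sketched in a few lines: build a word-word co-occurrence matrix, then truncate its SVD to get dense vectors. The toy corpus, window size, and dimensionality below are illustrative assumptions, not the setup from the paper:

```python
import numpy as np

# "Context-counting" sketch: co-occurrence counts + truncated SVD.
corpus = [["I", "like", "deep", "learning"],
          ["I", "like", "NLP"],
          ["I", "enjoy", "flying"]]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):          # window of size 1
            if 0 <= j < len(sent):
                C[idx[w], idx[sent[j]]] += 1

# Keep the top-k singular directions as the embedding space.
U, S, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
vectors = U[:, :k] * S[:k]                # one k-dim vector per word
print(vectors.shape)
```

Each row of `vectors` is a dense, low-dimensional representation of one word, in the spirit of LSA.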

Page 11: Word Embeddings

Collobert & Weston vectors

(Collobert et al., 2011) Natural Language Processing (Almost) from Scratch

Idea: a word and its context form a positive training sample; a random word in that same context gives a negative training sample:

How do we formalize this idea? Ask that the score of the true sample be higher than the score of any corrupted sample by a margin (a pairwise ranking loss).
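A minimal sketch of that criterion, with a hypothetical scorer: the true window's score should beat the corrupted window's score by a margin of 1, giving a hinge loss that is zero once the margin is satisfied.

```python
# Pairwise ranking (hinge) loss sketch for the C&W idea; the scores here
# would come from a window-scoring network, which is assumed, not shown.
def ranking_loss(score_true, score_corrupted, margin=1.0):
    return max(0.0, margin - score_true + score_corrupted)

print(ranking_loss(2.0, 0.5))   # margin satisfied: no loss
print(ranking_loss(0.2, 0.5))   # corrupted window scores too high: loss
```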

Page 12: Word Embeddings

Collobert & Weston vectors

Page 13: Word Embeddings

(Mikolov et al., 2013): Main Contributions

• An improved version of the skip-gram algorithm
• Negative sampling (vs. hierarchical softmax in the earlier paper)
• Subsampling of frequent words

• You can also learn good vector representations for phrases!

Page 14: Word Embeddings

The Skip-gram model

• The idea: we want to use words to predict their context words
• Context: a fixed window of size 2m

Page 15: Word Embeddings

The Skip-gram model

Page 16: Word Embeddings

Skip-gram: objective function

• For each position $t = 1, 2, \ldots, T$, predict context words within context size $m$, given center word $w_t$:

$$L(\theta) = \prod_{t=1}^{T} \; \prod_{-m \le j \le m,\; j \ne 0} P(w_{t+j} \mid w_t; \theta)$$

$\theta$: all the parameters to be optimized

• The objective function $J(\theta)$ is the (average) negative log-likelihood:

$$J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-m \le j \le m,\; j \ne 0} \log P(w_{t+j} \mid w_t; \theta)$$
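The (center, context) pairs that the skip-gram objective ranges over can be enumerated directly; the corpus and window size below are toy assumptions:

```python
# Enumerate skip-gram training pairs: for each position t, every context
# word w_{t+j} with -m <= j <= m, j != 0, that falls inside the sentence.
def skipgram_pairs(tokens, m):
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                yield center, tokens[t + j]

pairs = list(skipgram_pairs(["the", "cat", "sat", "on", "mat"], m=2))
print(pairs[:4])
```

Training maximizes the log-probability of the second element of each pair given the first.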

Page 17: Word Embeddings

How to define $P(w_{t+j} \mid w_t; \theta)$?

• We have two sets of vectors for each word in the vocabulary

$u_i \in \mathbb{R}^d$: embedding for center word $i$

$v_{i'} \in \mathbb{R}^d$: embedding for context word $i'$

• Use the inner product $u_i \cdot v_{i'}$ to measure how likely word $i$ appears with context word $i'$; the larger, the better

$$P(w_{t+j} \mid w_t) = \frac{\exp(u_{w_t} \cdot v_{w_{t+j}})}{\sum_{k \in V} \exp(u_{w_t} \cdot v_k)}$$

$\theta = \{\{u_k\}, \{v_k\}\}$ are all the parameters in this model!

$|V|$ is large: $10^5$–$10^7$. Computing these probabilities is very expensive!
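A small numpy sketch of this softmax, with toy sizes assumed, makes the cost visible: every single probability requires scoring the center vector against all |V| context vectors.

```python
import numpy as np

# Full-softmax P(w_{t+j} | w_t) over a toy vocabulary.
rng = np.random.default_rng(0)
V, d = 10, 4                     # toy vocabulary size and dimension
U = rng.normal(size=(V, d))      # center-word embeddings u_k
Vc = rng.normal(size=(V, d))     # context-word embeddings v_k

def p_context_given_center(center_id):
    scores = Vc @ U[center_id]   # u_{w_t} . v_k for every k in V -- O(|V|d)
    scores -= scores.max()       # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()

p = p_context_given_center(3)
print(p.sum())
```

The `Vc @ U[center_id]` line is the bottleneck that hierarchical softmax and negative sampling are designed to avoid.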

Page 18: Word Embeddings

Hierarchical softmax

(Morin and Bengio, 2005) Hierarchical probabilistic neural network language model

Page 19: Word Embeddings

Hierarchical softmax

• Huffman tree:
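One way to see why a Huffman tree helps: frequent words get short codes, so hierarchical softmax replaces a |V|-way normalization with only about log2(|V|) binary decisions along the root-to-leaf path. A minimal sketch with assumed toy counts (not word2vec's actual implementation):

```python
import heapq

# Build a Huffman tree over toy word counts, then read off binary codes.
counts = {"the": 50, "cat": 10, "sat": 8, "mat": 3}
heap = [(c, i, w) for i, (w, c) in enumerate(counts.items())]
heapq.heapify(heap)
uid = len(heap)                      # unique tiebreaker for the heap
while len(heap) > 1:
    c1, _, left = heapq.heappop(heap)
    c2, _, right = heapq.heappop(heap)
    heapq.heappush(heap, (c1 + c2, uid, (left, right)))
    uid += 1
tree = heap[0][2]

def code(node, word, prefix=""):
    """Return the 0/1 path from the root to `word`'s leaf."""
    if node == word:
        return prefix
    if isinstance(node, tuple):
        return code(node[0], word, prefix + "0") or code(node[1], word, prefix + "1")
    return None

print(code(tree, "the"), code(tree, "mat"))
```

The frequent word "the" gets a shorter path than the rare word "mat", so predicting it needs fewer binary classifiers.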

Page 20: Word Embeddings

Negative sampling

• SGNS = Skip-gram with negative sampling

• Intuition: for each positive pair $(w, c)$, we sample $k$ negative pairs $(w, c')$:

$$P(D = 1 \mid w, c) = \frac{1}{1 + \exp(-u_w \cdot v_c)}$$

$$P(D = 0 \mid w, c') = \frac{\exp(-u_w \cdot v_{c'})}{1 + \exp(-u_w \cdot v_{c'})}$$

• Negatives are drawn from the smoothed unigram distribution:

$$P_n(w) = \frac{U(w)^{3/4}}{Z}$$
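These quantities can be sketched with toy vectors; the sizes, counts, and number of negatives below are assumptions for illustration:

```python
import numpy as np

# SGNS sketch: a sigmoid classifies a (w, c) pair as data (D=1) vs noise
# (D=0); k negatives per positive come from Pn(w) ~ U(w)^(3/4).
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
V, d = 6, 4
U = rng.normal(size=(V, d))          # center vectors u_w
Vc = rng.normal(size=(V, d))         # context vectors v_c

w, c = 0, 1
p_pos = sigmoid(U[w] @ Vc[c])        # P(D=1 | w, c)
p_neg = sigmoid(-(U[w] @ Vc[c]))     # P(D=0 | w, c) = 1 - P(D=1 | w, c)

# Smoothed unigram noise distribution Pn(w) = U(w)^(3/4) / Z
unigram = np.array([10, 5, 3, 2, 1, 1], dtype=float)
Pn = unigram ** 0.75
Pn /= Pn.sum()
negatives = rng.choice(V, size=5, p=Pn)   # k = 5 negative samples
print(p_pos + p_neg, negatives.shape)
```

Note that the two probabilities sum to one: negative sampling is a binary classification per pair, never a normalization over the whole vocabulary.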

Page 21: Word Embeddings

Noise Contrastive Estimation (NCE)

• Recommended reading: (Dyer, 2014) Notes on Noise Contrastive Estimation and Negative Sampling

• “Although they are superficially similar, NCE is a general parameter estimation technique that is asymptotically unbiased, while negative sampling is best understood as a family of binary classification models that are useful for learning word representations but not as a general-purpose estimator.”

Page 22: Word Embeddings

Hierarchical softmax vs Negative sampling

• Pros and Cons

Page 23

Subsampling of Frequent Words

• Probability of discarding word w_i: P(w_i) = 1 − √(t / f(w_i)), where f(w_i) is the word's relative frequency in the corpus

• t = 10⁻⁵
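The discard rule, P(discard w) = 1 − √(t / f(w)), can be sketched as follows (the `discard_prob` helper is a hypothetical name):

```python
import math

def discard_prob(freq, t=1e-5):
    """Subsampling of frequent words (Mikolov et al., 2013):
    P(discard w) = 1 - sqrt(t / f(w)), where f(w) is the word's
    relative frequency. Clamped at 0 for rare words."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

# Frequent words are aggressively discarded, rare words are kept:
print(discard_prob(0.01))   # a very frequent word, e.g. a stopword
print(discard_prob(1e-6))   # a rare word: never discarded
```

This speeds up training and improves the vectors of less frequent words, since the many occurrences of stopwords carry little extra information.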

Page 24

Experimental setup

• Google dataset: 1 billion words
• Vocabulary size: 692K
• Context size: 5
• Dimension: 300
• “Our experiments indicate that values of k in the range 5–20 are useful for small training datasets, while for large datasets the k can be as small as 2–5.”

• Pre-trained word vectors: 100 billion words, 300-dimensional vectors for 3 million words and phrases.

Page 25

Evaluation: analogical reasoning

Word analogy: man : woman ≈ king : ?

answer = argmax_i cos(u_i, u_b − u_a + u_c)

semantic: Chicago : Illinois ≈ Philadelphia : ?

syntactic: bad : worst ≈ cool : ?

More examples at http://download.tensorflow.org/data/questions-words.txt
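The analogy rule above can be sketched as a brute-force argmax over cosine similarities. The toy 2-dimensional embeddings are hypothetical, chosen so that the second coordinate behaves like a "gender" direction:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def analogy(emb, a, b, c):
    """Answer 'a : b ≈ c : ?' via argmax_i cos(u_i, u_b - u_a + u_c),
    excluding the three query words themselves."""
    target = [ub - ua + uc for ua, ub, uc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Hypothetical embeddings: second coordinate encodes gender.
emb = {"man": [1.0, 0.0], "woman": [1.0, 1.0],
       "king": [2.0, 0.0], "queen": [2.0, 1.0], "prince": [2.0, -1.0]}
print(analogy(emb, "man", "woman", "king"))  # queen
```

Note that excluding the query words matters: in real embedding spaces, u_b − u_a + u_c is often closest to u_c itself.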

Page 26

Evaluation: analogical reasoning

No word similarity evaluation!

Page 27

Learning phrases

• “New York Times” ≠ “New” + “York” + “Times”
• “Air Canada” ≠ “Air” + “Canada”

• A simple data-driven approach to select phrases:
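The paper scores each bigram by score(w_i, w_j) = (count(w_i w_j) − δ) / (count(w_i) × count(w_j)) and merges pairs scoring above a threshold; δ discounts pairs built from very infrequent words. A small sketch with hypothetical counts:

```python
def phrase_score(bigram_count, count_a, count_b, delta=5):
    """Mikolov et al. (2013) phrase score:
    score(a, b) = (count(ab) - delta) / (count(a) * count(b)).
    Bigrams scoring above a threshold are merged into a single
    token; delta prevents merging very infrequent pairs."""
    return (bigram_count - delta) / (count_a * count_b)

# "new york" co-occurs far more often than chance would predict:
print(phrase_score(bigram_count=1000, count_a=5000, count_b=2000))
# a chance-level bigram scores near (or below) zero:
print(phrase_score(bigram_count=6, count_a=5000, count_b=2000))
```

Running this scoring pass 2–4 times with decreasing thresholds lets longer phrases (e.g. "new york times") form out of already-merged bigrams.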

Page 28

Evaluation: analogical reasoning for phrases

Page 29

Evaluation: analogical reasoning for phrases

Page 30

Comparison to previous models

No quantitative evaluation! No downstream evaluation!

Page 31

What is good about word2vec?

• Discussion

• .. vs Collobert & Weston?

Page 32

Page 33

Main contributions

• A systematic comparative evaluation of count and predict vectors.

Motivation: these silly deep learning people keep writing papers but don't compare to traditional distributional semantics models. So we will.

Conclusion: okay, those people are actually right.

• Main result: predict vectors >> count vectors.

Page 34

Count vs predict models

• “Count” models: collect raw co-occurrence counts in a corpus, and transform them into vectors with dimensionality reduction (and reweighting)

• “Predict” models: estimate the word vectors directly by maximizing the probability of the contexts in which the word is observed in the corpus
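As an illustration of the “count” pipeline, here is a minimal PPMI reweighting sketch (the `ppmi_matrix` helper is a hypothetical name; a real count model would follow this with SVD or another factorization to 200–500 dimensions):

```python
import math

def ppmi_matrix(cooc, word_totals, ctx_totals, total):
    """Positive PMI reweighting of raw co-occurrence counts:
    PPMI(w, c) = max(0, log[ P(w, c) / (P(w) P(c)) ]).
    `cooc` maps (word, context) pairs to counts; the totals are
    marginal counts and `total` is the grand total."""
    ppmi = {}
    for (w, c), n in cooc.items():
        pmi = math.log((n / total) /
                       ((word_totals[w] / total) * (ctx_totals[c] / total)))
        ppmi[(w, c)] = max(0.0, pmi)
    return ppmi

# Toy example: one (word, context) cell of the matrix.
m = ppmi_matrix({("cat", "purrs"): 2}, {"cat": 2}, {"purrs": 2}, 4)
print(m[("cat", "purrs")])
```

Truncating negative PMI values to zero keeps the matrix sparse, which is what makes dimensionality reduction on vocabulary-sized matrices tractable.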

Page 35

Experimental setup

• Corpus: 2.8 billion tokens (ukWaC, English Wikipedia, British National Corpus)

• Vocabulary: 300k most frequent words.

• Count models
  • Context size: 2 or 5
  • Two weighting schemes: positive Pointwise Mutual Information (PPMI), Local Mutual Information
  • Dimensionality reduction: SVD, two other non-negative matrix factorization methods
  • Dimensions: 200, 300, 400, 500

Page 36

Experimental setup

• Predict models: CBOW
  • Dimensions: 200, 300, 400, 500
  • Context size: 2, 5
  • Hierarchical softmax and negative sampling (k = 5 or 10)
  • Subsampling: t = 10⁻⁵

• Out-of-the-box models
  • (Baroni and Lenci, 2010): count models relying on syntactic information
  • Collobert & Weston vectors

Page 37

Benchmarks

Page 38

Semantic relatedness

• Compare the correlation between the average scores that human subjects assigned to the pairs and the cosine similarity between the corresponding vectors.

• Similarity vs relatedness: “car” vs “vehicle” AND “car” vs “journey”

• Metric: Spearman rank correlation
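A dependency-free sketch of the metric (the `rank` and `spearman` helpers are hypothetical names): Spearman's ρ is simply the Pearson correlation computed on the ranks of the two score lists, so it only cares whether the model orders the pairs the way humans do.

```python
def rank(xs):
    """Average ranks, 1-based, with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(human_scores, model_scores):
    """Spearman rho = Pearson correlation of the rank vectors."""
    rx, ry = rank(human_scores), rank(model_scores)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Human scores vs model cosine similarities for three word pairs:
print(spearman([7.5, 3.2, 9.1], [0.62, 0.15, 0.88]))  # 1.0 (same order)
```

Using ranks rather than raw values matters here: cosine similarities and human ratings live on different scales, and only the ordering is comparable.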

Page 39

Synonym detection

TOEFL test • levied: imposed, believed, requested, correlated

Page 40

Concept categorization

• “helicopters”, “motorcycles”
• “elephants”, “mammal”

Page 41

Selectional preferences

• Verb-noun pairs
• “People” received a high average score as the subject of “to eat”, and a low score as the object of the same verb.

Page 42

Final performance

Page 43

Top count models

Page 44

Top predict models

Page 45

Recommended reading

• (Dyer, 2014) Notes on Noise Contrastive Estimation and Negative Sampling

• (Pennington et al, 2014) GloVe: Global Vectors for Word Representation

• (Levy et al, 2015) Improving Distributional Similarity with Lessons Learned from Word Embeddings
  • “We reveal that much of the performance gains of word embeddings are due to certain system design choices and hyperparameter optimizations, rather than the embedding algorithms themselves.”