Page 1

Natural Language Processing

Word vectors

Many slides borrowed from Richard Socher, Chris Manning, and Hugo Larochelle

Page 2

Lecture plan

• Word representations

• Word vectors (embeddings)

• skip-gram algorithm

• Relation to matrix factorization

• Evaluation

Page 3

Representing words

Page 4

Representing words

Definition: meaning (Webster dictionary)

• the idea that is represented by a word, phrase, etc.

• The idea that a person wants to express by using words, signs, etc.

• the idea that is expressed in a work of writing, art, etc.

In linguistics:

signifier <—> signified (idea or thing) = denotation

Page 5

Taxonomies

Page 6

Taxonomies

“beverage”

Page 7

Representing words with computers

A word is the set of meanings it has in a taxonomy (graph of meanings).

Hypernym: the “is-a” relation
Hyponym: the opposite of “hypernym”

Page 8

Drawbacks

• Expensive!

• Subjective (how to split different synsets?)

• Incomplete

• e.g., newer or slang senses of “good”: wicked, badass, nifty, crack, ace, wizard, genius, ninja

• Missing functionality:

• How do you compute word similarity?

• How to compose meanings?

Page 9

Discrete representation

Words are atomic symbols (one-hot representation):

V = {hotel,motel,walk,wife, spouse}

|V| ≈ 100,000

hotel [1 0 0 0 0]

motel [0 1 0 0 0]

walk [0 0 1 0 0]

wife [0 0 0 1 0]

spouse [0 0 0 0 1]

Page 10

Drawback

Barack Obama’s wife ≈ Barack Obama’s spouse
Barack Obama’s wife ≉ Barack Obama’s advisors

Seattle motels ≈ Seattle hotels
Seattle motels ≉ Seattle attractions

But all word vectors are orthogonal and equidistant

Goal: word vectors with a natural notion of similarity

⟨“hotel” · “motel”⟩ > ⟨“hotel” · “spouse”⟩

Page 11

Distributional similarity

“You shall know a word by the company it keeps”

(Firth, 1957)

“… cashed a check at the bank across the street…” “… that bank holds the mortgage on my home…” “… said that the bank raised his forecast for…” “… employees of the bank have confessed to the charges”

Central idea: represent words by their context

Page 12

Idea 1

word → context

wife → {met: 3, married: 4, children: 2, wedded: 1, …}

spouse → {met: 2, married: 5, children: 2, kids: 1, …}

Problem: context words are themselves atomic symbols, so related contexts do not match:
• married <==> wedded
• children <==> kids

Page 13

Distributed representations

language = [0.278, −0.911, 0.792, −0.177, 0.109, −0.542, −0.0003]

• Represent words as low-dimensional vectors
• Represent similarity with vector similarity metrics

Page 14

Word vectors

Page 15

Motivation

• Word embeddings are widely used

• (other options exist: word-parts, character-level,…).

• The great innovation of 2018: contextualized word embeddings.

Page 16

Supervised learning

• Input: training set {(x_i, y_i)}_{i=1}^{N}, (x_i, y_i) ∼ D(X × Y)

• Output (probabilistic model): f : X → Y, f(x) = argmax_y p(y | x)

• Example: train a spam detector from spam and non-spam e-mails.

Intro to ML prerequisite

Page 17

Word embeddings

“… that bank holds the mortgage on my home…”

1. Define a supervised learning task from raw text (no manual annotation!):

1. (x, y) = (bank, that)
2. (x, y) = (bank, holds)
3. (x, y) = (holds, bank)
4. (x, y) = (holds, the)
5. (x, y) = (the, holds)
6. (x, y) = (the, mortgage)
7. (x, y) = (mortgage, the)
8. (x, y) = (mortgage, on)
9. (x, y) = (on, mortgage)
10. (x, y) = (on, my)
11. (x, y) = (my, on)
12. (x, y) = (my, home)
…
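Below is a minimal Python sketch of this pair-generation step. The window size (m = 2) and whitespace tokenization are assumptions for illustration, not taken from the slide.

```python
# Minimal sketch: generate (center, context) pairs from raw text.
# Window size m=2 and whitespace tokenization are illustrative assumptions.

def skipgram_pairs(text, m=2):
    tokens = text.split()
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue
            pairs.append((center, tokens[t + j]))
    return pairs

print(skipgram_pairs("that bank holds the mortgage on my home"))
# [('that', 'bank'), ('that', 'holds'), ('bank', 'that'), ('bank', 'holds'), ...]
```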

Page 18

Word embeddings

2. Define model for output given input — p(“holds” | “bank”)

p_θ(o | c) = exp(u_o^T v_c) / Σ_{w=1}^{V} exp(u_w^T v_c)

• u: vector for “outside” word, v: vector for “center”, V: number of words in vocabulary, 𝛉: all parameters

• Multi-class classification model (number of classes?)

• How many parameters are in the model:

Intro to ML prerequisite

Page 19

Word embeddings

2. Define model for output given input — p(“holds” | “bank”)

p_θ(o | c) = exp(u_o^T v_c) / Σ_{w=1}^{V} exp(u_w^T v_c)

• u: vector for “outside” word, v: vector for “center”, V: number of words in vocabulary, 𝛉: all parameters

• Multi-class classification model (number of classes?)

• How many parameters are in the model:

|θ| = 2 · V · d,   u, v ∈ R^d

Intro to ML prerequisite
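To make the model and the parameter count concrete, here is a minimal numpy sketch. The vocabulary size V = 10 and dimension d = 4 are toy values assumed for illustration.

```python
import numpy as np

V, d = 10, 4                 # toy vocabulary size and embedding dimension (assumed)
U = np.random.randn(V, d)    # "outside" vectors u_w
W = np.random.randn(V, d)    # "center" vectors v_w

def p_o_given_c(o, c):
    scores = U @ W[c]                     # u_w^T v_c for every word w
    exp = np.exp(scores - scores.max())   # numerically stabilized softmax
    return exp[o] / exp.sum()

print(p_o_given_c(o=3, c=7))
print("number of parameters:", U.size + W.size)   # 2 * V * d
```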

Page 20

Word embeddings

3. Define objective function for corpus of length T:

L(θ) = ∏_{t=1}^{T} ∏_{−m ≤ j ≤ m, j ≠ 0} p_θ(w_{t+j} | w_t)

J(θ) = log L(θ) = Σ_{t=1}^{T} Σ_{−m ≤ j ≤ m, j ≠ 0} log p_θ(w_{t+j} | w_t)

Find parameters that maximize the objective

Intro to ML prerequisite
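A minimal sketch of computing J(θ) on a toy corpus, assuming random parameters, window size m = 2, and a whitespace-tokenized vocabulary (all assumptions for illustration):

```python
import numpy as np

# Toy corpus and randomly initialized parameters (assumed values for illustration).
corpus = "i like deep learning i like nlp i enjoy flying".split()
vocab = {w: i for i, w in enumerate(sorted(set(corpus)))}
V, d, m = len(vocab), 5, 2
U, W = np.random.randn(V, d), np.random.randn(V, d)

def log_p(o, c):
    scores = U @ W[c]
    # log softmax: scores[o] - logsumexp(scores)
    return scores[o] - np.log(np.exp(scores - scores.max()).sum()) - scores.max()

J = 0.0
for t, center in enumerate(corpus):
    for j in range(-m, m + 1):
        if j == 0 or not (0 <= t + j < len(corpus)):
            continue
        J += log_p(vocab[corpus[t + j]], vocab[center])
print("J(theta) =", J)
```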

Page 21

Class 1 Recap

Intro to ML prerequisite

• Word representations:

• Ontology-based

• Pros: polysemy, similarity metrics

• Cons: expensive, compositionality, granularity

• One-hot

• Pros: cheap, simple, scales, compositionality

• Cons: no similarity

• Embeddings:

• Cheap, simple, scales, compositionality, similarity

Page 22

Today

Intro to ML prerequisite

• Word2vec

• Efficiency:

• Hierarchical softmax

• Skipgram with negative sampling (assignment 1)

• Skipgram as matrix factorization

• Evaluation (GloVe)

Page 23

Word embeddings

“… that bank holds the mortgage on my home…”

1. Define a supervised learning task from raw text (no manual annotation!):

1. (x, y) = (bank, that)
2. (x, y) = (bank, holds)
3. (x, y) = (holds, bank)
4. (x, y) = (holds, the)
5. (x, y) = (the, holds)
6. (x, y) = (the, mortgage)
7. (x, y) = (mortgage, the)
8. (x, y) = (mortgage, on)
9. (x, y) = (on, mortgage)
10. (x, y) = (on, my)
11. (x, y) = (my, on)
12. (x, y) = (my, home)
…


Mikolov et al., 2013

Page 24

Word embeddings

2. Define model for output given input — p(“holds” | “bank”)

p_θ(o | c) = exp(u_o^T v_c) / Σ_{w=1}^{V} exp(u_w^T v_c)

• u: vector for “outside” word, v: vector for “center”, V: number of words in vocabulary, 𝛉: all parameters

• We don’t really need the distribution - only the representation!

J(θ) = log L(θ) = Σ_{t=1}^{T} Σ_{−m ≤ j ≤ m, j ≠ 0} log p_θ(w_{t+j} | w_t)

Intro to ML prerequisite

Page 25

Word embeddings

• What probabilities would maximize the objective?

Intro to ML prerequisite

Page 26

Intro to ML prerequisite

L(Θ) = ∏_{c,o} p(o | c)^{#(c,o)}

We can solve separately for each center word c:

L_c(Θ) = ∏_o p(o | c)^{#(c,o)}

Equivalently, solve:

J_c(Θ) = Σ_i #(c, o_i) log p(o_i | c)   s.t.   Σ_i p(o_i | c) = 1,   p(o_i | c) ≥ 0

Use Lagrange multipliers:

L(Θ, λ) = Σ_i #(c, o_i) log p(o_i | c) − λ (Σ_i p(o_i | c) − 1)

∇_{p(o_i|c)} L = #(c, o_i) / p(o_i | c) − λ = 0

p(o_i | c) = #(c, o_i) / λ

Σ_i p(o_i | c) = Σ_i #(c, o_i) / λ = 1   ⟹   λ = Σ_i #(c, o_i)

p(o_i | c) = #(c, o_i) / Σ_i #(c, o_i)
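A quick numeric sanity check of this result: with fixed counts #(c, o_i), the categorical distribution that maximizes the likelihood is the empirical frequency. The toy counts below are an assumption for illustration.

```python
import numpy as np

counts = np.array([3.0, 4.0, 2.0, 1.0])      # assumed toy counts #(c, o_i)
p_star = counts / counts.sum()               # empirical frequencies

def loglik(p):
    return float(counts @ np.log(p))

rng = np.random.default_rng(0)
for _ in range(5):
    q = rng.dirichlet(np.ones_like(counts))  # random alternative distribution
    assert loglik(p_star) >= loglik(q)       # empirical frequencies are never beaten
print("maximizer:", p_star)
```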

Page 27

Questions

• Intuitions:

• Why should similar words have similar vectors?

• Why do we have different parameters for the center word and the output word?

Page 28

Page 29

Gradient descent

3. How to find parameters that minimize the objective?

• Start at some point and move in the opposite direction of the gradient

Intro to ML prerequisite

Page 30

Gradient descent

f(x) = x⁴ + 3x³ + 2

f′(x) = 4x³ + 9x²
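A minimal gradient-descent sketch for this function; the starting point and step size α are assumptions chosen so the iterates reach the minimum.

```python
# Gradient descent on f(x) = x^4 + 3x^3 + 2, whose minimum is at x = -9/4.

def f(x):
    return x**4 + 3 * x**3 + 2

def f_prime(x):
    return 4 * x**3 + 9 * x**2

x, alpha = -1.0, 0.01           # assumed starting point and step size
for step in range(500):
    x = x - alpha * f_prime(x)  # move in the opposite direction of the gradient
print(x, f(x))                  # x approaches -2.25
```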

Intro to ML prerequisite

Page 31

Page 32

Gradient descent

• We want to minimize:

J(θ) = −Σ_{t=1}^{T} Σ_j log p_θ(w_{t+j} | w_t)

• Update rule:

θ_j^new = θ_j^old − α · ∂J(θ)/∂θ_j

θ^new = θ^old − α · ∇J(θ)

• α is a step size

θ ∈ R^{2Vd}

Intro to ML prerequisite

Page 33

Stochastic gradient descent

• For large corpora (billions of tokens) this update is very slow

• Sample a window t

• Update gradients based on that window

θ^new = θ^old − α · ∇J_t(θ)

Intro to ML prerequisite

Page 34

Deriving the gradient

• Mostly applications of the chain rule

• Let’s derive the gradient of a center word for a single output word

• You will do this again in the assignment (and more)

log p_θ(w_{t+j} | w_t)

Page 35

Gradient derivation

L(θ) = log p(o | c) = log [ exp(u_o^T v_c) / Σ_i exp(u_{o_i}^T v_c) ] = u_o^T v_c − log Σ_i exp(u_{o_i}^T v_c)

∇_{v_c} L(θ) = u_o − (1 / Σ_j exp(u_{o_j}^T v_c)) · Σ_i exp(u_{o_i}^T v_c) · u_{o_i}

= u_o − Σ_i [ exp(u_{o_i}^T v_c) / Σ_j exp(u_{o_j}^T v_c) ] · u_{o_i}

= u_o − Σ_i p(o_i | c) · u_{o_i} = u_o − E_{o_i ∼ p(o_i|c)}[u_{o_i}]

Page 36

Recap

• Goal: represent words with low-dimensional vectors

• Approach: Define a supervised learning problem from a corpus

• We defined the necessary components for skip-gram:

• Model (softmax over word labels for each word)

• Objective (minimize Negative Log Likelihood)

• Optimize with SGD

• We computed the gradient for some parameters by hand

Page 37

Computational problem

• Computing the partition function is too expensive

• Solution 1: hierarchical softmax (Morin and Bengio, 2005) reduces computation time to log|V| by constructing a binary tree over the vocabulary

• Solution 2: Change the objective

• skip-gram with negative sampling (home assignment 1)

Page 38

Hierarchical softmax

• p(“cat” | “dog”) = p(left at 1) × p(right at 2) × p(right at 5)
= (1 − p(right at 1)) × p(right at 2) × p(right at 5)

[Figure: binary tree over the vocabulary with internal nodes 1–7 and leaves he, she, and, cat, the, have, be, are]

p(cat | dog) = (1 − σ(o_1^T c_dog)) × σ(o_2^T c_dog) × σ(o_5^T c_dog)
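A minimal sketch of this path probability. The node vectors and the “dog” vector are random placeholders, and the path (left at 1, right at 2, right at 5) is the one assumed in the example above.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

d = 8
rng = np.random.default_rng(0)
node_vecs = {n: rng.normal(size=d) for n in [1, 2, 5]}   # internal nodes on the path to "cat"
c_dog = rng.normal(size=d)                               # center-word vector (assumed)

# path to "cat": left at node 1, right at node 2, right at node 5
p_cat_given_dog = (
    (1 - sigmoid(node_vecs[1] @ c_dog))
    * sigmoid(node_vecs[2] @ c_dog)
    * sigmoid(node_vecs[5] @ c_dog)
)
print(p_cat_given_dog)   # one sigmoid per binary decision, so cost is O(log |V|)
```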

Page 39

Hierarchical softmax

• How to construct the tree?

• Randomly (doesn’t work well but better than you’d think)

• Using external knowledge like WordNet

• Learn word representations somehow and then cluster

Page 40

Skip-gram with Negative Sampling

(x, y) = ((bank, holds), 1)
(x, y) = ((bank, table), 0)
(x, y) = ((bank, eat), 0)
(x, y) = ((holds, bank), 1)
(x, y) = ((holds, quickly), 0)
(x, y) = ((holds, which), 0)
(x, y) = ((the, mortgage), 1)
(x, y) = ((the, eat), 0)
(x, y) = ((the, who), 0)


What information is lost?

Page 41

Skip-gram with Negative Sampling

(x, y) = ((bank, holds), 1)
(x, y) = ((bank, table), 0)
(x, y) = ((bank, eat), 0)
(x, y) = ((holds, bank), 1)
(x, y) = ((holds, quickly), 0)
(x, y) = ((holds, which), 0)
(x, y) = ((the, mortgage), 1)
(x, y) = ((the, eat), 0)
(x, y) = ((the, who), 0)

What information is lost?

Σ_{o ∈ V} p(y = 1 | o, c) = ?

Page 42

Skip-gram with Negative Sampling

• Model:

p_θ(y = 1 | c, o) = 1 / (1 + exp(−u_o^T v_c)) = σ(u_o^T v_c)

p_θ(y = 0 | c, o) = 1 − σ(u_o^T v_c) = σ(−u_o^T v_c)

• Objective:

Σ_{t,j} [ log σ(u_{w_{t+j}}^T v_{w_t}) + Σ_{k ∼ p(w)} log σ(−u_{w^{(k)}}^T v_{w_t}) ]

• p(w) = U(w)^{3/4} / T

Intro to ML prerequisite
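A minimal sketch of the negative-sampling loss for a single (center, outside) pair. The vocabulary size, dimension, number of negatives k, and the unigram counts are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 50, 8, 5
U, W = rng.normal(size=(V, d)), rng.normal(size=(V, d))   # outside and center vectors
counts = rng.integers(1, 100, size=V)
noise = counts ** 0.75
noise = noise / noise.sum()                                # smoothed unigram noise distribution

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def sgns_loss(center, outside):
    negatives = rng.choice(V, size=k, p=noise)             # k negative samples
    pos = np.log(sigmoid(U[outside] @ W[center]))
    neg = np.log(sigmoid(-U[negatives] @ W[center])).sum()
    return -(pos + neg)                                     # negated objective, to minimize

print(sgns_loss(center=3, outside=17))
```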

Page 43

Summary

• We defined the three necessary components.

• Model (binary classification)

• Objective (maximum likelihood with negative sampling)

• Optimization method (SGD)

Page 44

Many variants

• CBOW: predict center word from context

• Defining context:

• How big is the window?

• Is it sequential or based on syntactic information?

• Different model for every context position?

• Use stop words?

• …

Page 45

Matrix factorization

Page 46

Matrix factorization

• Consider the word-context co-occurrence matrix for a corpus:

“I like deep learning. I like NLP. I enjoy flying.”

           I  like  enjoy  deep  learning  NLP  flying  .
I          0   2     1      0     0         0    0      0
like       2   0     0      1     0         1    0      0
enjoy      1   0     0      0     0         0    1      0
deep       0   1     0      0     1         0    0      0
learning   0   0     0      1     0         0    0      1
NLP        0   1     0      0     0         0    0      1
flying     0   0     1      0     0         0    0      1
.          0   0     0      0     1         1    1      0

Landauer and Dumais (1997)

Intro to ML prerequisite
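A minimal sketch that reproduces the matrix above. Window size 1 and per-sentence windows (the window does not cross sentence boundaries) are assumptions chosen to match the counts in the example.

```python
import numpy as np

sentences = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
idx = {w: i for i, w in enumerate(vocab)}

A = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in sentences:
    tokens = sent.split()
    for t in range(len(tokens) - 1):        # window size 1
        i, j = idx[tokens[t]], idx[tokens[t + 1]]
        A[i, j] += 1
        A[j, i] += 1

print(A)   # symmetric matrix matching the table on the slide
```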

Page 47

Matrix factorization

• Reconstruct the matrix from low-dimensional word-context representations.

• Minimizes:

Σ_{i,j} (A_{ij} − A^k_{ij})² = ||A − A_k||²

Page 48

Matrix factorization

Page 49

Relation to skip-gram

• The output of skip-gram can be viewed as factorizing a word-context matrix:

[Figure: M = V × U^T]

M ∈ R^{|V|×|V|},   V, U ∈ R^{|V|×d}

• Which matrix M does skip-gram decompose?

Levy and Goldberg, 2015

Page 50

Relation to skip-gram

#(c) = Σ_{o'} #(c, o')

#(o) = Σ_{c'} #(c', o)

T = Σ_{(c,o)} #(c, o)

#(o) / T: unigram probability of o

P_T: the unigram distribution, P_T(w) = c(w) / |D| = (c(w) · m) / (|D| · m) = #(o) / T

Page 51

Relation to skip-gram

• Re-write the objective:

L(θ) = Σ_{c,o} #(c, o) [ log σ(u_o^T v_c) + k · E_{o' ∼ P_T}[ log σ(−u_{o'}^T v_c) ] ]

(distribute)
= Σ_{c,o} #(c, o) log σ(u_o^T v_c) + Σ_{c,o} #(c, o) · k · E_{o' ∼ P_T}[ log σ(−u_{o'}^T v_c) ]

(the expectation is constant in o)
= Σ_{c,o} #(c, o) log σ(u_o^T v_c) + Σ_c #(c) · k · E_{o' ∼ P_T}[ log σ(−u_{o'}^T v_c) ]

(open the expectation)
= Σ_{c,o} #(c, o) log σ(u_o^T v_c) + Σ_c #(c) · k · Σ_{o'} (#(o') / T) · log σ(−u_{o'}^T v_c)

(gather terms)
= Σ_{c,o} [ #(c, o) log σ(u_o^T v_c) + #(c) · k · (#(o) / T) · log σ(−u_o^T v_c) ]

Page 52

Relation to skip-gram

• Let’s assume the dot products are independent of one another:

Let x = u_o^T v_c

ℓ(x) = #(c, o) log σ(x) + #(c) · k · (#(o) / T) · log σ(−x)

L(θ) = Σ_{c,o} ℓ(x)

∂ℓ(x)/∂x = #(c, o) σ(−x) − #(c) · k · (#(o) / T) · σ(x) = 0

x = log( (#(c, o) · T) / (#(c) · #(o)) · (1/k) )

x = log( p(c, o) / (p(c) · p(o)) ) − log k = PMI(c, o) − log k

Page 53

Relation to skip-gram

• Conclusion: skip-gram with negative sampling implicitly factorizes a “shifted” PMI matrix

• Many NLP methods factorize the PMI matrix with matrix decomposition methods to obtain dense vectors.
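A minimal sketch of building such a matrix from co-occurrence counts. This computes the shifted PMI clipped at zero (the positive variant often used in practice); the count matrix A and k = 5 are assumptions for illustration.

```python
import numpy as np

def shifted_ppmi(A, k=5):
    T = A.sum()
    p_co = A / T
    p_c = A.sum(axis=1, keepdims=True) / T
    p_o = A.sum(axis=0, keepdims=True) / T
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_co / (p_c * p_o))
    pmi[~np.isfinite(pmi)] = 0.0                 # zero out cells with no co-occurrence
    return np.maximum(pmi - np.log(k), 0.0)      # shift by log k and clip at 0

A = np.random.default_rng(0).poisson(1.0, size=(8, 8)).astype(float)  # stand-in for real counts
print(shifted_ppmi(A, k=5))
```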

Page 54

Evaluation

Page 55

Evaluation

• Intrinsic vs. extrinsic evaluation:

• Intrinsic: define some artificial task that tries to directly measure the quality of your learning algorithm (a bit of that in home assignment 1).

• Extrinsic: check whether your output is useful in a real NLP task

Page 56

Intrinsic evaluation

• Word analogies:

• Normalize all word vectors to unit norm

• man::woman <—> king::??

• a::b <—> c::d

d = argmax_i (x_b − x_a + x_c)^T x_i / ||x_b − x_a + x_c||
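A minimal sketch of this evaluation, excluding the three query words from the argmax. The toy embedding matrix below is random and only illustrates the mechanics; real evaluations use trained vectors.

```python
import numpy as np

def analogy(X, idx, a, b, c):
    X = X / np.linalg.norm(X, axis=1, keepdims=True)      # normalize all word vectors
    target = X[idx[b]] - X[idx[a]] + X[idx[c]]
    scores = X @ (target / np.linalg.norm(target))
    for w in (a, b, c):                                    # exclude the query words themselves
        scores[idx[w]] = -np.inf
    words = list(idx)
    return words[int(np.argmax(scores))]

words = ["man", "woman", "king", "queen", "apple"]
idx = {w: i for i, w in enumerate(words)}
X = np.random.default_rng(0).normal(size=(len(words), 50))  # random stand-in embeddings
print(analogy(X, idx, "man", "woman", "king"))              # ideally "queen" with real embeddings
```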

Page 57

Visualization

Page 58

Visualization

Page 59

Visualization

Page 60

GloVe

• An objective that attempts to create a semantic space with linear structure

Pennington et al., 2014

• Probability ratios are more important than probabilities

Page 61

GloVe

Pennington et al., 2014

• Try to find word embeddings such that (roughly):

(v_{c1} − v_{c2})^T u_o = P_{c1,o} / P_{c2,o}

where P_{c,o} is the probability of an output word o given a center word c.

• As an example:

v_ice − v_steam ≈ u_solid,   v_steam − v_ice ≈ u_gas

Page 62

Word analogies evaluation

Page 63

Human correlation intrinsic evaluation

word 1 word 2 human judgement

tiger cat 7.35

book paper 7.46

computer internet 7.58

plane car 5.77

stock phone 1.62

stock CD 1.31

stock jaguar 0.92

Page 64

Human correlation intrinsic evaluation

• Compute the Spearman rank correlation between human similarity judgments and model similarity predictions (WordSim-353):
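A minimal sketch of this evaluation: rank-correlate human ratings with model cosine similarities. The word pairs, ratings, and random stand-in embeddings are assumptions for illustration (real runs use the full WordSim-353 set and trained vectors).

```python
import numpy as np
from scipy.stats import spearmanr

pairs = [("tiger", "cat", 7.35), ("book", "paper", 7.46), ("stock", "jaguar", 0.92)]
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in {w for a, b, _ in pairs for w in (a, b)}}

cosine = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
human = [score for _, _, score in pairs]
model = [cosine(emb[a], emb[b]) for a, b, _ in pairs]

rho, _ = spearmanr(human, model)
print("Spearman correlation:", rho)
```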

Page 65

Extrinsic evaluation

• Task: named entity recognition. Find mentions of person, location, organization in text.

• Using good word representations might be useful

Page 66

Extrinsic evaluation

Page 67

Summary

• Words are central to language

• In most NLP systems some word representations are used

• Graph-based representations are difficult to manipulate and compose

• One-hot vectors are useful with enough data but lose all generalization information

• Word embeddings provide a compact way to encode word meaning and similarity (but what about inference relations?)

• Skip-gram with negative sampling is a popular approach for learning word embeddings by casting an unsupervised problem as a supervised problem

• Strongly related to classical matrix decomposition methods.

Page 68

Current Research

• Contextualized word representations

• Sentence representations

Page 69

Assignment 1

• Implement skip-gram with negative sampling

• There is ample literature if you want to consider this for a project

Page 70

Gradient checks

• This is the single-parameter case
• For parameter vectors, iterate over all parameters and compute the numerical gradient for each one

∂J(θ)/∂θ = lim_{ε→0} [ J(θ + ε) − J(θ − ε) ] / (2ε)
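A minimal sketch of such a gradient check, iterating over the coordinates of a parameter vector. The test objective f(θ) = Σ θ² (with known gradient 2θ) and ε are assumptions for illustration.

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-6):
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (f(theta + e) - f(theta - e)) / (2 * eps)   # centered difference
    return grad

f = lambda th: float(np.sum(th ** 2))            # toy objective with known gradient 2*theta
theta = np.random.default_rng(0).normal(size=5)
print(np.allclose(numerical_gradient(f, theta), 2 * theta, atol=1e-5))   # True
```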