
Linguistic Regularities in Sparse and Explicit Word Representations

Omer Levy and Yoav Goldberg

Bar-Ilan University, Israel

CoNLL 2014

Papers in ACL 2014*

[Pie chart: "Neural Networks & Word Embeddings" vs. "Other Topics"]

* Sampling error: +/- 100%

Neural Embeddings

• Dense vectors

• Each dimension is a latent feature

• Common software package: word2vec

Italy: (−7.35, 9.42, 0.88, …) ∈ ℝ^100

• “Magic”

king − man + woman = queen

(analogies)
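As a rough illustration of this lookup (a minimal sketch assuming a pretrained word2vec-format file; the path vectors.bin is a placeholder, not something from the talk), one could use gensim:

```python
# Minimal sketch, assuming a pretrained word2vec-format file is available.
# "vectors.bin" is a placeholder path, not part of the talk.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# king − man + woman ≈ queen: most_similar adds the "positive" vectors,
# subtracts the "negative" ones, and returns the nearest vocabulary words.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```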

Representing words as vectors is not new!

Explicit Representations (Distributional)

• Sparse vectors

• Each dimension is an explicit context

• Common association metric: PMI, PPMI

Italy: {Rome: 17, pasta: 5, Fiat: 2, …} ∈ ℝ^|Vocab|, |Vocab| ≈ 100,000   (see the PPMI sketch below)

• Does the same “magic” work for explicit representations too?

• Baroni et al. (2014) showed that embeddings outperform explicit representations, but…
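A minimal sketch of how such sparse, explicit PPMI vectors might be built from co-occurrence counts (the toy corpus and window size below are illustrative assumptions, not the paper's setup):

```python
# Minimal PPMI sketch: sparse word vectors whose dimensions are explicit contexts.
# The corpus and window size are toy assumptions.
from collections import Counter
import math

corpus = [["italy", "rome", "pasta"], ["italy", "fiat", "rome"]]
window = 2

word_counts, context_counts, pair_counts = Counter(), Counter(), Counter()
for sent in corpus:
    for i, w in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if i == j:
                continue
            c = sent[j]
            word_counts[w] += 1
            context_counts[c] += 1
            pair_counts[(w, c)] += 1

total = sum(pair_counts.values())

def ppmi_vector(word):
    """Sparse vector: explicit context -> max(0, PMI(word, context))."""
    vec = {}
    for (w, c), n_wc in pair_counts.items():
        if w != word:
            continue
        pmi = math.log(n_wc * total / (word_counts[w] * context_counts[c]))
        if pmi > 0:          # negative PMI values are clipped, i.e. left out
            vec[c] = round(pmi, 3)
    return vec

print(ppmi_vector("italy"))
```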

Questions

• Are analogies unique to neural embeddings?

Compare neural embeddings with explicit representations

• Why does vector arithmetic reveal analogies?

Unravel the mystery behind neural embeddings and their “magic”

Background

Mikolov et al. (2013a,b,c)

• Neural embeddings have interesting geometries

• These patterns capture "relational similarities"

• Can be used to solve analogies:

a is to a* as b is to b*      (man is to woman as king is to queen)

• Can be recovered by "simple" vector arithmetic over vectors in ℝ^n:

a − a* = b − b*      ⟹      b − a + a* = b*

• Examples:

king − man + woman = queen

Tokyo − Japan + France = Paris

best − good + strong = strongest

Are analogies unique to neural embeddings?

• Experiment: compare embeddings to explicit representations

• Learn different representations from the same corpus

• Evaluate with the same recovery method:

argmax_{b* ∈ V} cos(b*, b − a + a*)

Analogy Datasets

• 4 words per analogy: 𝑎 is to 𝑎∗ as 𝑏 is to 𝑏∗

• Given 3 words: 𝑎 is to 𝑎∗ as 𝑏 is to ?

• Guess the best-suiting b* from the entire vocabulary V

• Excluding the question words a, a*, b

• MSR: ~8000 syntactic analogies

• Google: ~19,000 syntactic and semantic analogies
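A compact sketch of this recovery method over a toy vocabulary (the vectors are random and purely illustrative; with real embeddings the hoped-for answer would be the analogy's fourth word):

```python
# Sketch of the recovery method argmax_{b*} cos(b*, b - a + a*)
# over a toy vocabulary; vocabulary and vectors are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "man", "woman", "queen", "apple"]
idx = {w: i for i, w in enumerate(vocab)}

# Unit-normalize rows so dot products equal cosine similarities.
E = rng.normal(size=(len(vocab), 50))
E /= np.linalg.norm(E, axis=1, keepdims=True)

def solve_analogy(a, a_star, b):
    """Return the word maximizing cos(b*, b - a + a*), excluding a, a*, b."""
    target = E[idx[b]] - E[idx[a]] + E[idx[a_star]]
    target /= np.linalg.norm(target)
    scores = E @ target
    for w in (a, a_star, b):          # exclude the three question words
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

print(solve_analogy("man", "woman", "king"))  # with real vectors: hopefully "queen"
```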

Embedding vs Explicit (Round 1)

Accuracy:
                MSR     Google
Embedding       54%     63%
Explicit        29%     45%

Many analogies recovered by explicit, but many more by embedding.

Why does vector arithmetic reveal analogies?

• We wish to find the closest b* to b − a + a*

• This is done with cosine similarity (all vectors are unit-normalized, so the two objectives below share the same argmax):

argmax_{b* ∈ V} cos(b*, b − a + a*)  =  argmax_{b* ∈ V} [ cos(b*, b) − cos(b*, a) + cos(b*, a*) ]

vector arithmetic = similarity arithmetic
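A quick numerical check of this equivalence (random unit vectors, purely illustrative; since the two scores differ only by the constant ‖b − a + a*‖, they share the same argmax):

```python
# Check that ranking by cos(x, b - a + a*) matches ranking by
# cos(x, b) - cos(x, a) + cos(x, a*) for unit-normalized vectors.
import numpy as np

rng = np.random.default_rng(1)

def unit(v):
    return v / np.linalg.norm(v)

a, a_star, b = (unit(rng.normal(size=50)) for _ in range(3))
candidates = np.array([unit(rng.normal(size=50)) for _ in range(1000)])

arithmetic = candidates @ unit(b - a + a_star)                      # cos(x, b - a + a*)
similarity = candidates @ b - candidates @ a + candidates @ a_star  # cos(x,b) - cos(x,a) + cos(x,a*)

assert np.argmax(arithmetic) == np.argmax(similarity)
# The two scores differ only by the constant ||b - a + a*||:
assert np.allclose(arithmetic * np.linalg.norm(b - a + a_star), similarity)
print("same argmax; scores proportional")
```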

Why does vector arithmetic reveal analogies?

• We wish to find the closest x to king − man + woman

• This is done with cosine similarity:

argmax_x cos(x, king − man + woman)  =  argmax_x [ cos(x, king) − cos(x, man) + cos(x, woman) ]

vector arithmetic = similarity arithmetic

cos(x, king): royal?      cos(x, woman): female?

What does each similarity term mean?

• Observe the joint features with explicit representations!

queen ∩ king:    uncrowned, majesty, second, …

queen ∩ woman:   Elizabeth, Katherine, impregnate, …

Can we do better?

Let’s look at some mistakes…

England − London + Baghdad = ?

Expected answer: Iraq

What we actually get: Mosul?

The Additive Objective

cos(Iraq, England) − cos(Iraq, London) + cos(Iraq, Baghdad)
      0.15         −       0.13        +       0.63          =  0.65

cos(Mosul, England) − cos(Mosul, London) + cos(Mosul, Baghdad)
      0.13          −       0.14         +       0.75         =  0.74

• Problem: one similarity might dominate the rest

• Much more prevalent in explicit representations

• Might explain why explicit underperformed

How can we do better?

• Instead of adding similarities, multiply them!

argmax_{b*}  cos(b*, b) · cos(b*, a*) / cos(b*, a)
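A small sketch of the multiplicative objective, checked against the cosine values from the Baghdad example above (the epsilon guarding division by zero is an added assumption, not shown on the slides):

```python
def additive(sim_b, sim_a, sim_a_star):
    """cos(b*, b) - cos(b*, a) + cos(b*, a*)"""
    return sim_b - sim_a + sim_a_star

def multiplicative(sim_b, sim_a, sim_a_star, eps=0.001):
    """cos(b*, b) * cos(b*, a*) / cos(b*, a); eps is an added assumption."""
    return sim_b * sim_a_star / (sim_a + eps)

# Cosine values from the slides: each candidate vs (England, London, Baghdad).
iraq  = dict(sim_b=0.15, sim_a=0.13, sim_a_star=0.63)
mosul = dict(sim_b=0.13, sim_a=0.14, sim_a_star=0.75)

print("additive:       Iraq %.2f  vs  Mosul %.2f" % (additive(**iraq), additive(**mosul)))
print("multiplicative: Iraq %.3f  vs  Mosul %.3f" % (multiplicative(**iraq), multiplicative(**mosul)))
# Addition lets the large Baghdad term dominate (Mosul 0.74 > Iraq 0.65);
# multiplication balances the terms, and Iraq comes out ahead.
```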

Embedding vs Explicit (Round 2)

Multiplication > Addition

Accuracy:
                      MSR     Google
Embedding    Add      54%     63%
Embedding    Mul      59%     67%
Explicit     Add      29%     45%
Explicit     Mul      57%     68%

Explicit is on-par with Embedding

Accuracy (multiplicative objective):
                MSR     Google
Embedding       59%     67%
Explicit        57%     68%

Explicit is on-par with Embedding

• Embeddings are not “magical”

• Embedding-based similarities have a more uniform distribution

• The additive objective performs better on smoother distributions

• The multiplicative objective overcomes this issue

Conclusion

• Are analogies unique to neural embeddings?

No! They occur in sparse and explicit representations as well.

• Why does vector arithmetic reveal analogies?

Because vector arithmetic is equivalent to similarity arithmetic.

• Can we do better?

Yes! The multiplicative objective is significantly better.

More Results and Analyses (in the paper)

• Evaluation on closed-vocabulary analogy questions (SemEval 2012)

• Experiments with a third objective function (PairDirection)

• Do different representations reveal the same analogies?

• Error analysis

• A feature-level interpretation of how word similarity reveals analogies

Thanks − for + listening = )