
Page 1:

More Distributional Semantics: New Models & Applications
CMSC 723 / LING 723 / INST 725
MARINE CARPUAT
[email protected]

Page 2:

Last week…

• Q: what does it mean to understand meaning?
• A: meaning is knowing when words are similar or not
• Topics
  – Word similarity
  – Thesaurus-based methods
  – Distributional word representations
  – Dimensionality reduction

Page 3:

Today

• New models for learning word representations
  • from “count”-based models (e.g., LSA)
  • to “prediction”-based models (e.g., word2vec)
  • … and back
• Beyond semantic similarity
  • Learning semantic relations between words

Page 4:

DISTRIBUTIONAL MODELS OF WORD MEANING

Page 5:

Distributional Approaches: Intuition

“You shall know a word by the company it keeps!” (Firth, 1957)

“Differences of meaning correlate with differences of distribution” (Harris, 1970)

Page 6:

Context Features

• Word co-occurrence within a window (see the sketch below)
• Grammatical relations
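
A minimal sketch (my own illustration, not from the slides) of the first kind of context feature: counting which words co-occur with each target word inside a small window. The tokenized sentence and the window size of 2 are arbitrary choices.

```python
from collections import Counter, defaultdict

def window_cooccurrence(tokens, window=2):
    """Map each word to a Counter over the words seen within +/- `window` positions."""
    contexts = defaultdict(Counter)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                contexts[w][tokens[j]] += 1
    return contexts

tokens = "the cat sat on the mat while the dog sat on the rug".split()
print(window_cooccurrence(tokens)["sat"])
# Counter({'the': 4, 'on': 2, 'cat': 1, 'dog': 1})
```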

Page 7:

Association Metric

• Commonly-used metric: Pointwise Mutual Information (PMI)
• Can be used as a feature value or by itself

$$\text{association}_{\text{PMI}}(w, f) = \log_2 \frac{P(w, f)}{P(w)\,P(f)}$$
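
A minimal sketch (not from the slides) of computing PMI from co-occurrence counts such as those produced by the window_cooccurrence example above. In practice one often uses positive PMI, clipping negative values to zero.

```python
import math

def pmi(counts):
    """counts: {word: Counter(context -> count)}. Returns {(word, context): PMI}."""
    total = sum(c for ctx in counts.values() for c in ctx.values())
    p_w = {w: sum(ctx.values()) / total for w, ctx in counts.items()}
    p_f = {}
    for ctx in counts.values():
        for f, c in ctx.items():
            p_f[f] = p_f.get(f, 0.0) + c / total
    return {
        (w, f): math.log2((c / total) / (p_w[w] * p_f[f]))
        for w, ctx in counts.items()
        for f, c in ctx.items()
    }
```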

Page 8:

Computing Similarity

• Semantic similarity boils down to computing some measure on context vectors
• Cosine similarity: borrowed from information retrieval

$$\text{sim}_{\text{cosine}}(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\;\sqrt{\sum_{i=1}^{N} w_i^2}}$$
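
A minimal sketch (not from the slides) of cosine similarity between two context vectors stored as plain dictionaries mapping features to weights; the example words and weights are made up.

```python
import math

def cosine(v, w):
    dot = sum(x * w.get(f, 0.0) for f, x in v.items())
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm_w = math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm_v * norm_w) if norm_v and norm_w else 0.0

apricot = {"sweet": 2.0, "fruit": 3.0, "tree": 1.0}
digital = {"data": 4.0, "computer": 2.0, "fruit": 0.5}
print(cosine(apricot, digital))  # low value: the two words share few contexts
```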

Page 9:

Dimensionality Reduction with Latent Semantic Analysis
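
An LSA-style sketch (my own illustration, not from the slides): take the truncated SVD of a word-by-context count matrix and keep only the top k dimensions as the new word vectors. The toy matrix is made up.

```python
import numpy as np

def lsa(X, k=2):
    """Return rank-k word vectors U_k * S_k from the SVD X = U S V^T."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * S[:k]

X = np.array([[2., 0., 1., 0.],
              [1., 0., 2., 0.],
              [0., 3., 0., 1.],
              [0., 1., 0., 2.]])
print(lsa(X, k=2).shape)  # (4, 2): each of the 4 words now has a 2-dimensional vector
```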

Page 10:

NEW DIRECTIONS: PREDICT VS. COUNT MODELS

Page 11:

Word vectors as a byproduct of language modeling

A Neural Probabilistic Language Model. Bengio et al., JMLR 2003

Page 12:
Page 13:

Using neural word representations in NLP

• word representations from neural LMs
  – aka distributed word representations
  – aka word embeddings
• How would you use these word vectors?
• Turian et al. [2010]
  – word representations as features consistently improve performance of
    • Named-Entity Recognition
    • Text chunking tasks
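
A minimal sketch (my own illustration of the idea, not Turian et al.'s code) of using pre-trained embeddings as features: represent each token by the concatenation of the embeddings in a small window around it, and feed that vector to any off-the-shelf NER or chunking classifier.

```python
import numpy as np

def token_features(tokens, embeddings, i, window=1):
    """Concatenate embeddings of tokens[i-window .. i+window] into one feature vector."""
    dim = len(next(iter(embeddings.values())))
    parts = []
    for j in range(i - window, i + window + 1):
        if 0 <= j < len(tokens) and tokens[j] in embeddings:
            parts.append(np.asarray(embeddings[tokens[j]]))
        else:
            parts.append(np.zeros(dim))  # sentence boundary or out-of-vocabulary token
    return np.concatenate(parts)

# `embeddings` would be a dict word -> vector loaded from any pre-trained model.
```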

Page 14:

Word2vec [Mikolov et al. 2013] introduces simpler models

https://code.google.com/p/word2vec
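
A minimal sketch of training skip-gram embeddings, assuming the gensim library (version 4 or later) rather than the original C tool linked above; the toy sentences are made up.

```python
from gensim.models import Word2Vec

sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
# sg=1 selects the skip-gram architecture; negative=5 enables negative sampling.
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, negative=5, min_count=1)
print(model.wv["cat"].shape)  # (50,)
```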

Page 15:

Word2vec claims

Useful representations for NLP applications

Can discover relations between words using vector arithmetic:
king – male + female = queen

Paper + tool received lots of attention even outside the NLP research community

Try it out at the “word2vec playground”: http://deeplearner.fz-qqq.net/
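
A minimal sketch of this kind of vector arithmetic, assuming gensim and its downloader; the pre-trained GloVe vectors used here are just a convenient stand-in for any set of word vectors, and the query uses the usual man/woman variant of the slide's example.

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")  # downloads pre-trained vectors on first use
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" typically appears near the top of the returned list.
```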

Page 16:

Demystifying the skip-gram model [Levy & Goldberg, 2014]

(figure: word embeddings and context-word embeddings)

Learn word vector parameters so as to maximize the probability of the training set D. Expensive!

http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
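
Written out, following the Goldberg & Levy note linked above (with $v_w$ a word vector and $v_c$ a context vector), the objective is

$$\arg\max_{\theta} \prod_{(w,c) \in D} p(c \mid w; \theta), \qquad p(c \mid w; \theta) = \frac{e^{v_c \cdot v_w}}{\sum_{c' \in C} e^{v_{c'} \cdot v_w}}$$

The denominator sums over the entire context vocabulary $C$, which is what makes this expensive.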

Page 17:

Toward the training objective for skip-gram

Problem: there is a trivial solution when v_c = v_w and v_c · v_w = K for all v_c, v_w, with K large.

http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf

Page 18:

Final training objective (negative sampling)

• D: word-context pairs observed in the data
• D′: word-context pairs not observed in the data (artificially generated negative samples)

http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
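
The negative-sampling objective, as derived in the note linked above (σ is the logistic sigmoid):

$$\arg\max_{\theta} \sum_{(w,c) \in D} \log \sigma(v_c \cdot v_w) \;+\; \sum_{(w,c) \in D'} \log \sigma(-v_c \cdot v_w)$$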

Page 19:

Skip-gram model [Mikolov et al. 2013]

• Predict context words given the current word (i.e., 2(n-1) classifiers for a context window of size n)
• Use negative samples at each position (see the sketch below)
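
A minimal sketch (not Mikolov et al.'s code) of generating the training pairs this implies: for each position, emit (center, context) positives plus a few sampled negative contexts. Real implementations sample negatives from a smoothed unigram distribution rather than uniformly.

```python
import random

def skipgram_pairs(tokens, window=2, negatives=2, vocab=None):
    """Return (center, context, label) triples: label 1 for observed pairs, 0 for negatives."""
    vocab = vocab or sorted(set(tokens))
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            pairs.append((center, tokens[j], 1))                 # pair from D
            for _ in range(negatives):
                pairs.append((center, random.choice(vocab), 0))  # pair from D'
    return pairs

print(skipgram_pairs("the cat sat on the mat".split())[:6])
```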

Page 20:
Page 21:

Don’t count, predict! [Baroni et al. 2014]

“This paper has presented the first systematic comparative evaluation of count and predict vectors. As seasoned distributional semanticists with thorough experience in developing and using count vectors, we set out to conduct this study because we were annoyed by the triumphalist overtones surrounding predict models, despite the almost complete lack of a proper comparison to count vectors.”

Page 22:

Don’t count, predict! [Baroni et al. 2014]

“Our secret wish was to discover that it is all hype, and count vectors are far superior to their predictive counterparts. […] Instead, we found that the predict models are so good that, while the triumphalist overtones still sound excessive, there are very good reasons to switch to the new architecture.”

Page 23:

Why does word2vec produce good word representations?

Levy & Goldberg, Apr 2014:

“Good question. We don’t really know.

The distributional hypothesis states that words in similar contexts have similar meanings. The objective above clearly tries to increase the quantity v_w·v_c for good word-context pairs, and decrease it for bad ones. Intuitively, this means that words that share many contexts will be similar to each other […]. This is, however, very hand-wavy.”

Page 24:

Learning skip-gram is almost equivalent to matrix factorization [Levy & Goldberg 2014]

http://www.cs.bgu.ac.il/~yoavg/publications/nips2014pmi.pdf
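
The central result of the linked paper: skip-gram with negative sampling, using k negative samples per observed pair, implicitly factorizes a word-context matrix whose cells hold the PMI shifted by a global constant, i.e., at the optimum

$$v_w \cdot v_c = \mathrm{PMI}(w, c) - \log k$$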

Page 25:

New directions: Summary

• There are alternative ways to learn distributional representations for word meaning
• Understanding >> Magic

Page 26:

PREDICTING SEMANTIC RELATIONS BETWEEN WORDS, BEYOND SIMILARITY

Slide credit: Peter Turney

Page 27:

Recognizing Textual Entailment

• Sample problem
  – Text: iTunes software has seen strong sales in Europe
  – Hypothesis: Strong sales for iTunes in Europe
  – Task: Does the Text entail the Hypothesis? Yes or no?

Page 28:

Recognizing Textual Entailment

• Sample problem
  – Task: Does the Text entail the Hypothesis? Yes or no?
• Has emerged as a core task for semantic analysis in NLP
  – subsumes many tasks: Paraphrase Detection, Question Answering, etc.
  – fully text based: does not require committing to a specific semantic representation

[Dagan et al. 2013]

Page 29:

Recognizing lexical entailment

• To recognize entailment between sentences, we must first recognize entailment between words
• Sample problem
  – Text: George was bitten by a dog
  – Hypothesis: George was attacked by an animal

Page 30:

Lexical entailment & semantic relations

• Synonymy: synonyms entail each other
  firm entails company
• is-a relations: hyponyms entail hypernyms
  automaker entails company
• part-whole relations: it depends
  government entails minister
  division does not entail company
• Entailment also covers other relations
  ocean entails water
  murder entails death

Page 31:

• We know how to build word vectors that represent word meaning
• How can we predict entailment using these vectors?

Page 32:

Approach 1: context inclusion hypothesis

• Hypothesis:
  – if a word a tends to occur in a subset of the contexts in which a word b occurs (b contextually includes a)
  – then a (the narrower term) tends to entail b (the broader term)
• Inspired by formal logic
• In practice
  – Design an asymmetric real-valued metric to compare word vectors

[Kotlerman, Dagan, et al. 2010]

Page 33:

Approach 1: the BalAPinc Metric

Complex hand-crafted metric!

Page 34:

Approach 2: context combination hypothesis

• Hypothesis:
  – The tendency of word a to entail word b is correlated with some learnable function of the contexts in which a occurs and the contexts in which b occurs
  – Some combinations of contexts tend to block entailment, others tend to allow entailment
• In practice
  – Binary prediction task
  – Supervised learning from labeled word pairs (see the sketch below)

[Baroni, Bernardi, Do and Shan, 2012]
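
A minimal sketch (my own illustration, not the authors' system) of Approach 2: represent a candidate pair (a, b) by concatenating the two word vectors and train a binary classifier on labeled pairs. Assumes scikit-learn and a dict `vec` mapping words to numpy vectors.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(a, b, vec):
    return np.concatenate([vec[a], vec[b]])

def train_entailment(pairs, labels, vec):
    """pairs: list of (a, b) word pairs; labels: 1 if a entails b, else 0."""
    X = np.stack([pair_features(a, b, vec) for a, b in pairs])
    return LogisticRegression(max_iter=1000).fit(X, labels)
```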

Page 35:

Approach 3: similarity differences hypothesis

• Hypothesis
  – The tendency of a to entail b is correlated with some learnable function of the differences in their similarities, sim(a, r) – sim(b, r), to a set of reference words r in R
  – Some differences tend to block entailment, and others tend to allow entailment
• In practice
  – Binary prediction task
  – Supervised learning from labeled word pairs + reference words (see the sketch below)

[Turney & Mohammad 2015]
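
A minimal sketch (not Turney & Mohammad's implementation) of the Approach 3 feature vector for a pair (a, b): the similarity differences sim(a, r) − sim(b, r) over a fixed list of reference words R. These features would then feed the same kind of supervised classifier as in Approach 2.

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def sim_diff_features(a, b, R, vec):
    """vec: dict word -> numpy vector; R: list of reference words."""
    return np.array([cos(vec[a], vec[r]) - cos(vec[b], vec[r]) for r in R])
```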

Page 36:

Approach 3: similarity differences hypothesis

Page 37:

Evaluation: test set 1/3 (KDSZ)

Page 38:

Evaluation: test set 2/3 (JMTH)

Page 39:

Evaluation: test set 3/3 (BBDS)

Page 40:

Evaluation [Turney & Mohammad 2015]

Page 41:

Lessons from the lexical entailment task

• The distributional hypothesis
  • can be refined and put to use in various ways
  • to detect relations between words beyond the concept of similarity
• Combination of unsupervised similarity + supervised learning is powerful

Page 42:

RECAP

Page 43:

Today

• A glimpse into recent research
• New models for learning word representations
  • from “count”-based models (e.g., LSA)
  • to “prediction”-based models (e.g., word2vec)
  • … and back
• Beyond semantic similarity
  • Learning lexical entailment

Next topics

• multiword expressions & predicate-argument structure

Page 44:

References

Don’t count, predict! [Baroni et al. 2014]
http://clic.cimec.unitn.it/marco/publications/acl2014/baroni-etal-countpredict-acl2014.pdf

Word2vec explained [Goldberg & Levy 2014]
http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf

Neural Word Embeddings as Implicit Matrix Factorization [Levy & Goldberg 2014]
http://www.cs.bgu.ac.il/~yoavg/publications/nips2014pmi.pdf

Experiments with Three Approaches to Recognizing Lexical Entailment [Turney & Mohammad 2015]
http://arxiv.org/abs/1401.8269