More Distributional Semantics: New Models & Applications
CMSC 723 / LING 723 / INST 725
MARINE CARPUAT
Last week…
• Q: what does it mean to understand meaning?
• A: meaning is knowing when words are similar or not
• Topics
– Word similarity
– Thesaurus-based methods
– Distributional word representations
– Dimensionality reduction
Today
• New models for learning word representations
  – from “count”-based models (e.g., LSA)
  – to “prediction”-based models (e.g., word2vec)
  – … and back
• Beyond semantic similarity
  – learning semantic relations between words
DISTRIBUTIONAL MODELS
OF WORD MEANING
Distributional Approaches:
Intuition
“You shall know a word by the company it keeps!”
(Firth, 1957)
“Difference of meaning correlates with difference of distribution” (Harris, 1970)
Context Features
• Word co-occurrence within a window (see the sketch below)
• Grammatical relations
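A minimal sketch of the first kind of context feature, assuming plain tokenized text: collect co-occurrence counts within a symmetric window. The function name and default window size are illustrative choices, not from the slides.

```python
from collections import Counter

def window_cooccurrences(tokens, window=2):
    """Count (word, context-word) pairs within a symmetric +/- `window` span."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

# e.g. window_cooccurrences("the cat sat on the mat".split())
```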
Association Metric
• Commonly-used metric: Pointwise Mutual Information
• Can be used as a feature value or by itself (sketch below)
$$\text{assoc}_{\text{PMI}}(w, f) \;=\; \log_2 \frac{P(w, f)}{P(w)\,P(f)}$$
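A direct sketch of the formula, estimating the probabilities from raw co-occurrence counts such as the Counter produced above:

```python
import math

def pmi(cooc, word, feature):
    """PMI(w, f) = log2( P(w, f) / (P(w) * P(f)) ), with probabilities
    estimated from a Counter of (word, feature) co-occurrence counts."""
    total = sum(cooc.values())
    p_wf = cooc[(word, feature)] / total
    p_w = sum(c for (w, _), c in cooc.items() if w == word) / total
    p_f = sum(c for (_, f), c in cooc.items() if f == feature) / total
    if p_wf == 0:
        return float("-inf")  # never co-occur; PPMI would clip this to 0
    return math.log2(p_wf / (p_w * p_f))
```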
Computing Similarity
• Semantic similarity boils down to
computing some measure on context
vectors
• Cosine similarity: borrowed from information retrieval (sketch below)
$$\text{sim}_{\text{cosine}}(\vec{v}, \vec{w}) \;=\; \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} \;=\; \frac{\sum_{i=1}^{N} v_i\, w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\;\sqrt{\sum_{i=1}^{N} w_i^2}}$$
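A minimal sketch directly mirroring the formula, for two plain lists of feature weights:

```python
import math

def cosine_similarity(v, w):
    """sim_cosine(v, w) = (v . w) / (|v| |w|)."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return dot / (norm_v * norm_w)

# e.g. cosine_similarity([1.0, 2.0, 0.0], [0.0, 2.0, 1.0])
```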
Dimensionality Reduction with
Latent Semantic Analysis
NEW DIRECTIONS:
PREDICT VS. COUNT MODELS
Word vectors as a
byproduct of language modeling
A Neural Probabilistic Language Model. Bengio et al., JMLR 2003
Using neural word representations in NLP
• word representations from neural LMs
  – aka distributed word representations
  – aka word embeddings
• How would you use these word vectors?
• Turian et al. [2010]: word representations as features consistently improve performance of
  – Named-Entity Recognition
  – Text chunking tasks
Word2vec [Mikolov et al. 2013]
introduces simpler models
https://code.google.com/p/word2vec
Word2vec claims
• Useful representations for NLP applications
• Can discover relations between words using vector arithmetic (see the gensim sketch below):
  king − man + woman ≈ queen
• Paper + tool received lots of attention even outside the NLP research community
• Try it out at the “word2vec playground”: http://deeplearner.fz-qqq.net/
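The analogy can be reproduced with off-the-shelf tooling. A hedged sketch using gensim, assuming a pretrained word2vec-format vector file; the path "vectors.bin" is a placeholder, not something from the slides:

```python
from gensim.models import KeyedVectors

# "vectors.bin" is a placeholder for any pretrained word2vec-format file,
# e.g. the GoogleNews vectors distributed with the original tool.
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# king - man + woman ~= queen, as a nearest-neighbor search
# around the combined vector
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```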
Demystifying the skip-gram model [Levy & Goldberg, 2014]
(diagram: word embeddings and context-word embeddings)
• Learn word vector parameters so as to maximize the probability of the training set D
• Expensive!!
http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
Toward the training objective
for skip-gram
Problem: trivial solution when v_c = v_w and v_c · v_w = K for all (v_c, v_w), with a large K
http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
Final training objective
(negative sampling)
• word-context pairs observed in the data, D
• word-context pairs not observed in the data, D′ (artificially generated; objective sketch below)
http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
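A minimal numpy sketch of this objective for a single (word, context) pair: push the dot product up for the observed context and down for the sampled negatives. Vector shapes and the negative-sampling scheme are left to the caller.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(v_w, v_c, negative_context_vectors):
    """log sigma(v_c . v_w) + sum over sampled c' in D' of log sigma(-v_c' . v_w):
    the observed pair is scored as positive, the sampled pairs as negatives."""
    positive = np.log(sigmoid(v_c @ v_w))
    negative = sum(np.log(sigmoid(-v_neg @ v_w)) for v_neg in negative_context_vectors)
    return positive + negative
```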
Skip-gram model
[Mikolov et al. 2013]
Predict context words given the current word
(i.e., 2(n−1) classifiers for a context window of size n)
Use negative samples at each position (pair-generation sketch below)
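A hedged sketch of how training examples could be generated. Real word2vec samples negatives from a smoothed unigram distribution; uniform sampling here is a simplification.

```python
import random

def skipgram_pairs(tokens, window=2, num_neg=5):
    """Yield (word, context, label): observed pairs labeled 1, plus num_neg
    uniformly sampled noise contexts labeled 0 for each observed pair."""
    vocab = list(set(tokens))
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            yield (word, tokens[j], 1)
            for _ in range(num_neg):
                yield (word, random.choice(vocab), 0)
```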
Don’t count, predict!
[Baroni et al. 2014]
“This paper has presented the first systematic
comparative evaluation of count and predict
vectors.
As seasoned distributional semanticists with
thorough experience in developing and using
count vectors, we set out to conduct this study
because we were annoyed by the triumphalist
overtones surrounding predict models, despite the
almost complete lack of a proper comparison to
count vectors.”
Don’t count, predict!
[Baroni et al. 2014]
“Our secret wish was to discover that it is all
hype, and count vectors are far superior to
their predictive counterparts.
[…] Instead, we found that the predict
models are so good that, while the
triumphalist overtones still sound excessive,
there are very good reasons to switch to the
new architecture.”
Why does word2vec produce
good word representations?
Levy & Goldberg, Apr 2014:
“Good question. We don’t really know.
The distributional hypothesis states that words in similar
contexts have similar meanings. The objective above clearly
tries to increase the quantity v_w.v_c for good word-context
pairs, and decrease it for bad ones. Intuitively, this means
that words that share many contexts will be similar to each
other […]. This is, however, very hand-wavy.”
Learning skip-gram is almost equivalent
to matrix factorization [Levy & Goldberg 2014]
http://www.cs.bgu.ac.il/~yoavg/publications/nips2014pmi.pdf
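Concretely, Levy & Goldberg show that skip-gram with k negative samples implicitly factorizes a word-context matrix of PMI values shifted by log k. A hedged sketch of the count-based counterpart: build the shifted positive PMI matrix and factorize it with truncated SVD. The dimension and the sqrt weighting of singular values are illustrative choices.

```python
import numpy as np

def sppmi_svd_vectors(cooc, k=5, dim=100):
    """cooc: dense (words x contexts) count matrix. Build the shifted
    positive PMI matrix max(PMI - log k, 0), then factorize it with SVD."""
    total = cooc.sum()
    p_w = cooc.sum(axis=1, keepdims=True) / total      # word marginals
    p_c = cooc.sum(axis=0, keepdims=True) / total      # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((cooc / total) / (p_w * p_c))
    sppmi = np.maximum(pmi - np.log(k), 0.0)           # -inf from zero counts -> 0
    sppmi = np.nan_to_num(sppmi)                       # 0/0 marginals -> 0
    U, S, _ = np.linalg.svd(sppmi, full_matrices=False)
    return U[:, :dim] * np.sqrt(S[:dim])               # one vector per word (row)
```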
New directions: Summary
• There are alternative ways to learn
distributional representations for word
meaning
• Understanding >> Magic
PREDICTING SEMANTIC
RELATIONS BETWEEN WORDS
BEYOND SIMILARITY
Slides credit: Peter Turney
Recognizing Textual Entailment
• Sample problem
– Text
iTunes software has seen strong sales in Europe
– Hypothesis
Strong sales for iTunes in Europe
– Task: does the Text entail the Hypothesis? Yes or No?
Recognizing Textual Entailment
• Sample problem
• Task: does the Text entail the Hypothesis? Yes or No?
• Has emerged as a core task for semantic
analysis in NLP
– subsumes many tasks: Paraphrase Detection,
Question Answering, etc.
– fully text based: does not require committing
to a specific semantic representation
[Dagan et al. 2013]
Recognizing lexical entailment
• To recognize entailment between
sentences, we must first recognize
entailment between words
• Sample problem
– Text
George was bitten by a dog
– Hypothesis
George was attacked by an animal
Lexical entailment & semantic relations
• Synonymy: synonyms entail each other
firm entails company
• is-a relations: hyponyms entail hypernyms
automaker entails company
• part-whole relations: it depends
government entails minister
division does not entail company
• entailment also covers other relations
ocean entails water
murder entails death
• We know how to build word vectors that
represent word meaning
• How can we predict entailment using
these vectors?
Approach 1: context inclusion hypothesis
• Hypothesis:
  – if a word a tends to occur in a subset of the contexts in which a word b occurs (b contextually includes a)
  – then a (the narrower term) tends to entail b (the broader term)
• Inspired by formal logic
• In practice
  – design an asymmetric real-valued metric to compare word vectors
[Kotlerman, Dagan, et al. 2010]
Approach 1: the BalAPinc Metric
Complex hand-crafted metric!
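BalAPinc combines an average-precision-style inclusion score with symmetric (LIN) similarity, and reproducing it faithfully is beyond a slide. As a hedged stand-in, here is a much simpler asymmetric inclusion measure in the same family (Weeds precision), over weighted context features; it is not BalAPinc itself.

```python
def inclusion_score(contexts_a, contexts_b):
    """Fraction of a's feature weight covered by b's features: high when
    a's contexts are (mostly) a subset of b's, suggesting a entails b.
    contexts_*: dict mapping feature -> weight (e.g. PPMI).
    A simplified stand-in for BalAPinc, not the metric itself."""
    covered = sum(weight for feat, weight in contexts_a.items() if feat in contexts_b)
    total = sum(contexts_a.values())
    return covered / total if total else 0.0
```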
Approach 2: context combination hypothesis
• Hypothesis:
  – the tendency of word a to entail word b is correlated with some learnable function of the contexts in which a occurs and the contexts in which b occurs
  – some combinations of contexts tend to block entailment, others tend to allow it
• In practice
  – binary prediction task
  – supervised learning from labeled word pairs (sketch below)
[Baroni, Bernardi, Do and Shan, 2012]
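A hedged sketch of this supervised setup, assuming pretrained vectors in a dict and scikit-learn. The original paper's classifier and feature details differ; concatenation plus logistic regression is an illustrative stand-in.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_entailment_classifier(word_pairs, labels, vectors):
    """word_pairs: [(a, b), ...]; labels: 1 if a entails b, else 0;
    vectors: dict word -> np.ndarray. Features combine both words'
    contexts by simple concatenation."""
    X = np.array([np.concatenate([vectors[a], vectors[b]]) for a, b in word_pairs])
    return LogisticRegression(max_iter=1000).fit(X, labels)
```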
Approach 3: similarity differences hypothesis
• Hypothesis
  – the tendency of a to entail b is correlated with some learnable function of the differences in their similarities, sim(a, r) − sim(b, r), to a set of reference words r ∈ R
  – some differences tend to block entailment, and others tend to allow it
• In practice
  – binary prediction task
  – supervised learning from labeled word pairs + reference words (sketch below)
[Turney & Mohammad 2015]
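A hedged sketch of the feature construction, reusing cosine similarity. The choice of reference words and the downstream classifier follow the same supervised recipe as Approach 2; the details here are illustrative, not Turney & Mohammad's exact setup.

```python
import numpy as np

def similarity_difference_features(a, b, reference_words, vectors):
    """One feature per reference word r: sim(a, r) - sim(b, r)."""
    def cos(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.array([cos(vectors[a], vectors[r]) - cos(vectors[b], vectors[r])
                     for r in reference_words])

# Feed these feature vectors to the same classifier setup as in Approach 2.
```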
Evaluation: test set 1/3 (KDSZ)
Evaluation: test set 2/3 (JMTH)
Evaluation: test set 3/3 (BBDS)
Evaluation
[Turney & Mohammad 2015]
Lessons from the lexical entailment task
• The distributional hypothesis
  – can be refined and put to use in various ways
  – to detect relations between words beyond the concept of similarity
• Combining unsupervised similarity with supervised learning is powerful
RECAP
Today
• A glimpse into recent research
• New models for learning word representations
  – from “count”-based models (e.g., LSA)
  – to “prediction”-based models (e.g., word2vec)
  – … and back
• Beyond semantic similarity
  – learning lexical entailment
Next topics
• multiword expressions & predicate argument
structure
References
• Don’t count, predict! [Baroni et al. 2014]
  http://clic.cimec.unitn.it/marco/publications/acl2014/baroni-etal-countpredict-acl2014.pdf
• word2vec Explained [Goldberg & Levy 2014]
  http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
• Neural Word Embeddings as Implicit Matrix Factorization [Levy & Goldberg 2014]
  http://www.cs.bgu.ac.il/~yoavg/publications/nips2014pmi.pdf
• Experiments with Three Approaches to Recognizing Lexical Entailment [Turney & Mohammad 2015]
  http://arxiv.org/abs/1401.8269