More Distributional Semantics: New Models & Applications
CMSC 723 / LING 723 / INST 725
MARINE CARPUAT
Last week…
• Q: what does it mean to understand meaning?
• A: meaning is knowing when words are similar or not
• Topics
– Word similarity
– Thesaurus-based methods
– Distributional word representations
– Dimensionality reduction
Today
• New models for learning word representations
  – from “count”-based models (e.g., LSA)
  – to “prediction”-based models (e.g., word2vec)
  – … and back
• Beyond semantic similarity
  – learning semantic relations between words
DISTRIBUTIONAL MODELS
OF WORD MEANING
Distributional Approaches:
Intuition
“You shall know a word by the company it keeps!”
(Firth, 1957)
“Difference of meaning correlates with difference of distribution” (Harris, 1970)
Context Features
• Word co-occurrence within a window (see the sketch below)
• Grammatical relations
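A minimal sketch of the first kind of context feature, assuming plain tokenized text: collect co-occurrence counts within a symmetric window. The function name and default window size are illustrative choices, not from the slides.

```python
from collections import Counter

def window_cooccurrences(tokens, window=2):
    """Count (word, context-word) pairs within a symmetric +/- `window` span."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

# e.g. window_cooccurrences("the cat sat on the mat".split())
```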
Association Metric
• Commonly-used metric: Pointwise Mutual Information
• Can be used as a feature value or by itself (sketch below)
$$\text{assoc}_{\text{PMI}}(w, f) \;=\; \log_2 \frac{P(w, f)}{P(w)\,P(f)}$$
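A direct sketch of the formula, estimating the probabilities from raw co-occurrence counts such as the Counter produced above:

```python
import math

def pmi(cooc, word, feature):
    """PMI(w, f) = log2( P(w, f) / (P(w) * P(f)) ), with probabilities
    estimated from a Counter of (word, feature) co-occurrence counts."""
    total = sum(cooc.values())
    p_wf = cooc[(word, feature)] / total
    p_w = sum(c for (w, _), c in cooc.items() if w == word) / total
    p_f = sum(c for (_, f), c in cooc.items() if f == feature) / total
    if p_wf == 0:
        return float("-inf")  # never co-occur; PPMI would clip this to 0
    return math.log2(p_wf / (p_w * p_f))
```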
Computing Similarity
• Semantic similarity boils down to
computing some measure on context
vectors
• Cosine similarity: borrowed from information retrieval (sketch below)
$$\text{sim}_{\text{cosine}}(\vec{v}, \vec{w}) \;=\; \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} \;=\; \frac{\sum_{i=1}^{N} v_i\, w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\;\sqrt{\sum_{i=1}^{N} w_i^2}}$$
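A minimal sketch directly mirroring the formula, for two plain lists of feature weights:

```python
import math

def cosine_similarity(v, w):
    """sim_cosine(v, w) = (v . w) / (|v| |w|)."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return dot / (norm_v * norm_w)

# e.g. cosine_similarity([1.0, 2.0, 0.0], [0.0, 2.0, 1.0])
```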
Dimensionality Reduction with
Latent Semantic Analysis
NEW DIRECTIONS:
PREDICT VS. COUNT MODELS
Word vectors as a
byproduct of language modeling
A Neural Probabilistic Language Model. Bengio et al., JMLR 2003
Using neural word representations in NLP
• word representations from neural LMs
  – aka distributed word representations
  – aka word embeddings
• How would you use these word vectors?
• Turian et al. [2010]: word representations as features consistently improve performance of
  – Named-Entity Recognition
  – Text chunking tasks
Word2vec [Mikolov et al. 2013]
introduces simpler models
https://code.google.com/p/word2vec
Word2vec claims
• Useful representations for NLP applications
• Can discover relations between words using vector arithmetic (see the gensim sketch below):
  king − man + woman ≈ queen
• Paper + tool received lots of attention even outside the NLP research community
• Try it out at the “word2vec playground”: http://deeplearner.fz-qqq.net/
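The analogy can be reproduced with off-the-shelf tooling. A hedged sketch using gensim, assuming a pretrained word2vec-format vector file; the path "vectors.bin" is a placeholder, not something from the slides:

```python
from gensim.models import KeyedVectors

# "vectors.bin" is a placeholder for any pretrained word2vec-format file,
# e.g. the GoogleNews vectors distributed with the original tool.
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# king - man + woman ~= queen, as a nearest-neighbor search
# around the combined vector
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```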
Demystifying the skip-gram model [Levy & Goldberg, 2014]
(diagram: word embeddings and context-word embeddings)
• Learn word vector parameters so as to maximize the probability of the training set D
• Expensive!!
http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
Toward the training objective
for skip-gram
Problem: trivial solution when v_c = v_w and v_c · v_w = K for all (v_c, v_w), with a large K
http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
Final training objective
(negative sampling)
• word-context pairs observed in the data, D
• word-context pairs not observed in the data, D′ (artificially generated; objective sketch below)
http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
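A minimal numpy sketch of this objective for a single (word, context) pair: push the dot product up for the observed context and down for the sampled negatives. Vector shapes and the negative-sampling scheme are left to the caller.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(v_w, v_c, negative_context_vectors):
    """log sigma(v_c . v_w) + sum over sampled c' in D' of log sigma(-v_c' . v_w):
    the observed pair is scored as positive, the sampled pairs as negatives."""
    positive = np.log(sigmoid(v_c @ v_w))
    negative = sum(np.log(sigmoid(-v_neg @ v_w)) for v_neg in negative_context_vectors)
    return positive + negative
```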
Skip-gram model
[Mikolov et al. 2013]
Predict context words given the current word
(i.e., 2(n−1) classifiers for a context window of size n)
Use negative samples at each position (pair-generation sketch below)
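A hedged sketch of how training examples could be generated. Real word2vec samples negatives from a smoothed unigram distribution; uniform sampling here is a simplification.

```python
import random

def skipgram_pairs(tokens, window=2, num_neg=5):
    """Yield (word, context, label): observed pairs labeled 1, plus num_neg
    uniformly sampled noise contexts labeled 0 for each observed pair."""
    vocab = list(set(tokens))
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            yield (word, tokens[j], 1)
            for _ in range(num_neg):
                yield (word, random.choice(vocab), 0)
```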
Don’t count, predict!
[Baroni et al. 2014]
“This paper has presented the first systematic
comparative evaluation of count and predict
vectors.
As seasoned distributional semanticists with
thorough experience in developing and using
count vectors, we set out to conduct this study
because we were annoyed by the triumphalist
overtones surrounding predict models, despite the
almost complete lack of a proper comparison to
count vectors.”
Don’t count, predict!
[Baroni et al. 2014]
“Our secret wish was to discover that it is all
hype, and count vectors are far superior to
their predictive counterparts.
[…] Instead, we found that the predict
models are so good that, while the
triumphalist overtones still sound excessive,
there are very good reasons to switch to the
new architecture.”
Why does word2vec produce
good word representations?
Levy & Goldberg, Apr 2014:
“Good question. We don’t really know.
The distributional hypothesis states that words in similar
contexts have similar meanings. The objective above clearly
tries to increase the quantity v_w.v_c for good word-context
pairs, and decrease it for bad ones. Intuitively, this means
that words that share many contexts will be similar to each
other […]. This is, however, very hand-wavy.”
Learning skip-gram is almost equivalent
to matrix factorization [Levy & Goldberg 2014]
http://www.cs.bgu.ac.il/~yoavg/publications/nips2014pmi.pdf
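Concretely, Levy & Goldberg show that skip-gram with k negative samples implicitly factorizes a word-context matrix of PMI values shifted by log k. A hedged sketch of the count-based counterpart: build the shifted positive PMI matrix and factorize it with truncated SVD. The dimension and the sqrt weighting of singular values are illustrative choices.

```python
import numpy as np

def sppmi_svd_vectors(cooc, k=5, dim=100):
    """cooc: dense (words x contexts) count matrix. Build the shifted
    positive PMI matrix max(PMI - log k, 0), then factorize it with SVD."""
    total = cooc.sum()
    p_w = cooc.sum(axis=1, keepdims=True) / total      # word marginals
    p_c = cooc.sum(axis=0, keepdims=True) / total      # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((cooc / total) / (p_w * p_c))
    sppmi = np.maximum(pmi - np.log(k), 0.0)           # -inf from zero counts -> 0
    sppmi = np.nan_to_num(sppmi)                       # 0/0 marginals -> 0
    U, S, _ = np.linalg.svd(sppmi, full_matrices=False)
    return U[:, :dim] * np.sqrt(S[:dim])               # one vector per word (row)
```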
New directions: Summary
• There are alternative ways to learn
distributional representations for word
meaning
• Understanding >> Magic
PREDICTING SEMANTIC
RELATIONS BETWEEN WORDS
BEYOND SIMILARITY
Slides credit: Peter Turney
Recognizing Textual Entailment
• Sample problem
– Text
iTunes software has seen strong sales in Europe
– Hypothesis
Strong sales for iTunes in Europe
– Task: does the Text entail the Hypothesis? Yes or No?
Recognizing Textual Entailment
• Sample problem
• Task: does the Text entail the Hypothesis? Yes or No?
• Has emerged as a core task for semantic
analysis in NLP
– subsumes many tasks: Paraphrase Detection,
Question Answering, etc.
– fully text based: does not require committing
to a specific semantic representation
[Dagan et al. 2013]
Recognizing lexical entailment
• To recognize entailment between
sentences, we must first recognize
entailment between words
• Sample problem
– Text
George was bitten by a dog
– Hypothesis
George was attacked by an animal
Lexical entailment & semantic relations
• Synonymy: synonyms entail each other
firm entails company
• is-a relations: hyponyms entail hypernyms
automaker entails company
• part-whole relations: it depends
government entails minister
division does not entail company
• entailment also covers other relations
ocean entails water
murder entails death
• We know how to build word vectors that
represent word meaning
• How can we predict entailment using
these vectors?
Approach 1: context inclusion hypothesis
• Hypothesis:
  – if a word a tends to occur in a subset of the contexts in which a word b occurs (b contextually includes a)
  – then a (the narrower term) tends to entail b (the broader term)
• Inspired by formal logic
• In practice
  – design an asymmetric real-valued metric to compare word vectors
[Kotlerman, Dagan, et al. 2010]
Approach 1: the BalAPinc Metric
Complex hand-crafted metric!
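BalAPinc combines an average-precision-style inclusion score with symmetric (LIN) similarity, and reproducing it faithfully is beyond a slide. As a hedged stand-in, here is a much simpler asymmetric inclusion measure in the same family (Weeds precision), over weighted context features; it is not BalAPinc itself.

```python
def inclusion_score(contexts_a, contexts_b):
    """Fraction of a's feature weight covered by b's features: high when
    a's contexts are (mostly) a subset of b's, suggesting a entails b.
    contexts_*: dict mapping feature -> weight (e.g. PPMI).
    A simplified stand-in for BalAPinc, not the metric itself."""
    covered = sum(weight for feat, weight in contexts_a.items() if feat in contexts_b)
    total = sum(contexts_a.values())
    return covered / total if total else 0.0
```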
Approach 2: context combination hypothesis
• Hypothesis:
  – the tendency of word a to entail word b is correlated with some learnable function of the contexts in which a occurs and the contexts in which b occurs
  – some combinations of contexts tend to block entailment, others tend to allow it
• In practice
  – binary prediction task
  – supervised learning from labeled word pairs (sketch below)
[Baroni, Bernardi, Do and Shan, 2012]
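A hedged sketch of this supervised setup, assuming pretrained vectors in a dict and scikit-learn. The original paper's classifier and feature details differ; concatenation plus logistic regression is an illustrative stand-in.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_entailment_classifier(word_pairs, labels, vectors):
    """word_pairs: [(a, b), ...]; labels: 1 if a entails b, else 0;
    vectors: dict word -> np.ndarray. Features combine both words'
    contexts by simple concatenation."""
    X = np.array([np.concatenate([vectors[a], vectors[b]]) for a, b in word_pairs])
    return LogisticRegression(max_iter=1000).fit(X, labels)
```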
Approach 3: similarity differences hypothesis
• Hypothesis
  – the tendency of a to entail b is correlated with some learnable function of the differences in their similarities, sim(a, r) − sim(b, r), to a set of reference words r ∈ R
  – some differences tend to block entailment, and others tend to allow it
• In practice
  – binary prediction task
  – supervised learning from labeled word pairs + reference words (sketch below)
[Turney & Mohammad 2015]
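A hedged sketch of the feature construction, reusing cosine similarity. The choice of reference words and the downstream classifier follow the same supervised recipe as Approach 2; the details here are illustrative, not Turney & Mohammad's exact setup.

```python
import numpy as np

def similarity_difference_features(a, b, reference_words, vectors):
    """One feature per reference word r: sim(a, r) - sim(b, r)."""
    def cos(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.array([cos(vectors[a], vectors[r]) - cos(vectors[b], vectors[r])
                     for r in reference_words])

# Feed these feature vectors to the same classifier setup as in Approach 2.
```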
Evaluation: test set 1/3 (KDSZ)
Evaluation: test set 2/3 (JMTH)
Evaluation: test set 3/3 (BBDS)
Evaluation
[Turney & Mohammad 2015]
Lessons from the lexical entailment task
• The distributional hypothesis
  – can be refined and put to use in various ways
  – to detect relations between words beyond the concept of similarity
• Combining unsupervised similarity with supervised learning is powerful
RECAP
Today
• A glimpse into recent research
• New models for learning word representations
  – from “count”-based models (e.g., LSA)
  – to “prediction”-based models (e.g., word2vec)
  – … and back
• Beyond semantic similarity
  – learning lexical entailment
Next topics
• multiword expressions & predicate argument
structure
References
• Don’t count, predict! [Baroni et al. 2014]
  http://clic.cimec.unitn.it/marco/publications/acl2014/baroni-etal-countpredict-acl2014.pdf
• word2vec Explained [Goldberg & Levy 2014]
  http://www.cs.bgu.ac.il/~yoavg/publications/negative-sampling.pdf
• Neural Word Embeddings as Implicit Matrix Factorization [Levy & Goldberg 2014]
  http://www.cs.bgu.ac.il/~yoavg/publications/nips2014pmi.pdf
• Experiments with Three Approaches to Recognizing Lexical Entailment [Turney & Mohammad 2015]
  http://arxiv.org/abs/1401.8269