Slides introducing the paper "Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors" by Marco Baroni, Georgiana Dinu and Germán Kruszewski (ACL 2014). Presented by Mamoru Komachi at the 6th summer camp of NLP, September 5th, 2014.
Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
Marco Baroni, Georgiana Dinu and Germán Kruszewski (ACL 2014)
(Tables are taken from the above-mentioned paper)
Presented by Mamoru Komachi <[email protected]>
The 6th summer camp of NLP, September 5th, 2014
Well-known Distributional Hypothesis; any problems so far?
“A word is characterized by the company it keeps.” (Firth, 1957)
Characterize a word by its context (vector)
Widely accepted in the NLP community
Zellig Harris (1909-1992)
(Source: http://www.ircs.upenn.edu/zellig/)
Count-vector-based distributional semantic approaches faced a new challenge (deep learning)
“Context-predicting models (more commonly known as embeddings or neural language models) are the new kids on the distributional semantics blocks.”
“[T]he literature is still lacking a systematic comparison of the predictive models with classic, count-vector-based distributional semantic approaches.”
“The results, …, show that the buzz is fully justified, as the context-predicting models obtain a thorough and resounding victory against their count-based counterparts.”
Background: Count models and predict models
Count models are the traditional, standard way to model distributional semantics
Collect context vectors for each word type
Context vectors = n words on the left and right (symmetric, n = 2 and 5, position independent)
Context scores are calculated by positive pointwise mutual information (PPMI) or local mutual information (log-likelihood ratio)
Reduce dimensionality to k (k = 200 … 500) by singular value decomposition or non-negative matrix factorization
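The counting pipeline can be sketched in a few lines of NumPy. This is a toy illustration on a hypothetical 3x3 co-occurrence matrix, not the paper's actual DISSECT pipeline; only the PPMI formula and truncated SVD are standard.

```python
import numpy as np

def ppmi(counts):
    """Positive pointwise mutual information over a word-by-context count matrix."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)   # word marginals
    col = counts.sum(axis=0, keepdims=True)   # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts * total) / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0              # zero counts give log(0) = -inf
    return np.maximum(pmi, 0.0)               # keep only positive associations

# toy word-by-context co-occurrence counts (hypothetical numbers)
counts = np.array([[4.0, 1.0, 0.0],
                   [1.0, 3.0, 2.0],
                   [0.0, 2.0, 5.0]])
weights = ppmi(counts)

# reduce to k dimensions with truncated SVD (k = 2 here; the paper uses 200-500)
u, s, _ = np.linalg.svd(weights, full_matrices=False)
k = 2
reduced = u[:, :k] * s[:k]
```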
Predict models are newer, training-based ways to model distributional semantics
Optimize context vectors for each word type
Context vectors = n words on the left and right (symmetric, n = 2 and 5, position independent) (Collobert et al., 2011)
Learn a model to predict a word given its context vectors
Can directly optimize the weights of a word’s context vector using supervised learning (but with no manual annotation, i.e. predict models use the same unannotated data as count models)
Mikolov et al. (2013): each word type is mapped to k dimensions (k = 200 … 500)
Collobert & Weston (2008) model: 100-dimensional vectors, trained on Wikipedia for two months (!)
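The prediction objective can be illustrated with a minimal skip-gram-with-negative-sampling loop in plain NumPy. This is a didactic sketch on a four-word toy corpus, not the actual word2vec or C&W implementation; the hyperparameters (dim, negatives, lr, epochs) are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_sgns(pairs, vocab_size, dim=10, negatives=2, lr=0.05, epochs=100):
    """Minimal skip-gram with negative sampling over (target, context) index pairs."""
    W = rng.normal(scale=0.1, size=(vocab_size, dim))  # target (word) vectors
    C = rng.normal(scale=0.1, size=(vocab_size, dim))  # context vectors
    for _ in range(epochs):
        for t, c in pairs:
            # positive update: raise the score of the observed (target, context) pair
            g = sigmoid(W[t] @ C[c]) - 1.0
            wt = W[t].copy()
            W[t] -= lr * g * C[c]
            C[c] -= lr * g * wt
            # negative updates: lower the score of randomly sampled noise pairs
            for n in rng.integers(0, vocab_size, size=negatives):
                g = sigmoid(W[t] @ C[n])
                wt = W[t].copy()
                W[t] -= lr * g * C[n]
                C[n] -= lr * g * wt
    return W, C

# toy "corpus": words 0/1 co-occur and words 2/3 co-occur
pairs = [(0, 1), (1, 0), (2, 3), (3, 2)] * 5
W, C = train_sgns(pairs, vocab_size=4)
```

After training, the score of an observed pair such as (0, 1) should be well above chance.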
Tasks: Lexical semantics
Training data and toolkits are freely available: easy to re-implement
Training data: ukWaC + English Wikipedia + British National Corpus
2.8 billion tokens (retain the top 300K most frequent words for target and context modeling)
Toolkits
Count model: DISSECT toolkit (authors’ software)
Predict model: word2vec, Collobert & Weston model
Benchmarks: 5 standard tasks in distributional semantic modeling
Semantic relatedness
Synonym detection
Concept categorization
Selectional preferences
Analogy
Semantic relatedness: rate the degree of semantic similarity between two words on a numerical scale
Evaluation
Compare the correlation between the average scores that human subjects assigned to the pairs and the cosines between the corresponding vectors under the count/predict models
Datasets
Rubenstein and Goodenough (1965): 65 noun pairs
WordSim353 (Finkelstein et al., 2002): 353 pairs
Agirre et al. (2009): split WordSim353 into similarity and relatedness subsets
MEN (Bruni et al., 2013): 1,000 word pairs
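The evaluation protocol (rank correlation between human ratings and model cosines) can be sketched as follows. The word pairs, ratings, and vectors here are invented for illustration, and a tie-free Spearman helper stands in for scipy.stats.spearmanr.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors (no tie handling)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / (np.linalg.norm(rx) * np.linalg.norm(ry)))

# hypothetical human ratings for three word pairs, and toy 2-D vectors
human = np.array([9.0, 6.5, 1.0])
vectors = {
    "car":  np.array([1.0, 0.1]), "automobile": np.array([0.9, 0.2]),
    "cup":  np.array([0.4, 0.8]), "mug":        np.array([0.5, 0.7]),
    "noon": np.array([0.0, 1.0]), "string":     np.array([1.0, 0.0]),
}
pairs = [("car", "automobile"), ("cup", "mug"), ("noon", "string")]
model = np.array([cosine(vectors[a], vectors[b]) for a, b in pairs])
rho = spearman(human, model)
```

In this toy case the model ranks the pairs exactly as the humans do, so the correlation is perfect.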
Synonym detection: given a target term, choose a word from 4 synonym candidates
Example
levied -> (imposed = correct, believed, requested, correlated)
Method
Compute the cosine of each candidate vector with the target, and pick the candidate with the largest cosine as the answer (an extensively tuned count model achieves 100% accuracy)
Dataset
TOEFL set (Landauer and Dumais, 1997): 80 multiple-choice questions that pair a target word with 4 synonym candidates
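The method reduces to one cosine comparison per candidate. A sketch of the TOEFL example with hypothetical toy vectors (not trained ones):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def choose_synonym(target, candidates, vectors):
    """Pick the candidate whose vector has the largest cosine with the target."""
    return max(candidates, key=lambda w: cosine(vectors[target], vectors[w]))

# toy vectors chosen so that "imposed" is closest to "levied"
vectors = {
    "levied":     np.array([0.9, 0.1, 0.0]),
    "imposed":    np.array([0.8, 0.2, 0.1]),
    "believed":   np.array([0.1, 0.9, 0.1]),
    "requested":  np.array([0.2, 0.1, 0.9]),
    "correlated": np.array([0.1, 0.5, 0.5]),
}
answer = choose_synonym("levied",
                        ["imposed", "believed", "requested", "correlated"],
                        vectors)
```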
Concept categorization: group a set of nominal concepts into natural categories
Examples
helicopters and motorcycles -> vehicle class
dogs and elephants -> mammal class
Method
Unsupervised clustering into n clusters (n is given by the gold data)
Datasets
Almuhareb-Poesio benchmark (2006): 402 concepts organized into 21 categories
ESSLLI 2008 Distributional Semantic Workshop shared-task set (Baroni et al., 2008): 44 concepts into 6 categories
Battig set (Baroni et al., 2010): 83 concepts into 10 categories
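The clustering step can be sketched with plain k-means plus a purity score. The paper uses an off-the-shelf clustering toolkit, so this NumPy stand-in and its toy 2-D "concepts" are only illustrative.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means; an illustrative stand-in for the toolkit used in the paper."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center, then move centers to cluster means
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def purity(labels, gold):
    """Fraction of items whose cluster's majority gold class matches their own."""
    correct = 0
    for j in set(labels):
        members = [g for l, g in zip(labels, gold) if l == j]
        correct += max(members.count(c) for c in set(members))
    return correct / len(gold)

# toy concepts: two 'vehicle' points and two 'mammal' points; k is given by the gold data
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
gold = ["vehicle", "vehicle", "mammal", "mammal"]
labels = kmeans(X, k=2)
score = purity(labels, gold)
```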
Selectional preferences: given a verb-noun pair, rate the typicality of the noun as a subject or object of the verb
Example
(eat, people) -> assign a high score for the subject relation, a low score for the object relation
Method
Take the 20 nouns most strongly associated with the verb, average their vectors to get a prototype vector, and then compute the cosine similarity of the candidate noun to that vector
Datasets
Padó (2007): 211 pairs
McRae et al. (1998): 100 pairs
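The prototype method can be sketched directly. The vectors below are invented for illustration; in the actual setup the associated nouns are extracted from the corpus, not hand-picked.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def typicality(noun_vec, associated_noun_vecs):
    """Score a candidate noun against the prototype (average) of the nouns
    most strongly associated with the verb in that slot."""
    prototype = np.mean(associated_noun_vecs, axis=0)
    return cosine(noun_vec, prototype)

# toy vectors for typical subjects of 'eat' (hypothetical data, not from the datasets)
subjects_of_eat = [np.array([0.9, 0.1]), np.array([0.8, 0.2])]
people = np.array([0.85, 0.15])   # should be a typical subject
stone = np.array([0.1, 0.9])      # should be an atypical subject
```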
Analogy: given a pair of words and a test word, find another word that instantiates the same relation
Examples
(brother : sister, grandson : X) -> X = granddaughter
(work : works, speak : X) -> X = speaks
Method
Subtract the first example term vector from the second, add the test term vector, and find the nearest neighbor to that vector (Mikolov et al., 2013)
Dataset
Mikolov et al. (2013): 9K semantic and 10.5K syntactic analogy questions
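The vector-offset method is one subtraction and addition plus a nearest-neighbor search. The toy vectors below are constructed so the analogy works (gender and generation as separate axes); excluding the query words from the search follows common practice for this benchmark.

```python
import numpy as np

def solve_analogy(a, b, c, vectors, exclude):
    """Offset method: answer = nearest neighbor of vec(b) - vec(a) + vec(c) by cosine."""
    target = vectors[b] - vectors[a] + vectors[c]
    best, best_cos = None, -2.0
    for w, v in vectors.items():
        if w in exclude:  # do not return one of the query words themselves
            continue
        cos = float(target @ v / (np.linalg.norm(target) * np.linalg.norm(v)))
        if cos > best_cos:
            best, best_cos = w, cos
    return best

# hypothetical vectors: axis 0/1 encode gender, axis 2 encodes generation
vectors = {
    "brother":       np.array([1.0, 0.0, 1.0]),
    "sister":        np.array([0.0, 1.0, 1.0]),
    "grandson":      np.array([1.0, 0.0, -1.0]),
    "granddaughter": np.array([0.0, 1.0, -1.0]),
}
answer = solve_analogy("brother", "sister", "grandson", vectors,
                       exclude={"brother", "sister", "grandson"})
```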
Experiments: 5 tasks of lexical semantics
Results and discussions
Lexical semantics
Results: Predict models outperform count models
Predict models are not so sensitive to the parameter settings
Observations
Count model
PMI is better than LLR
SVD outperforms NMF, but no compression at all (the full matrix) performs best
Predict model
Negative sampling outperforms the costly hierarchical softmax method
Subsampling frequent words seems to have a similar effect to PMI weighting in count models
Off-the-shelf C&W model
Poor performance (under investigation)
Discussions
Predict models obtained excellent results by trying only a few variations of the default settings, whereas count models need a thorough optimization of a large number of parameters to reach maximum performance
Predict models scale to large datasets and use only hundreds of dimensions, without intense tuning
Count models and predict models are complementary in the errors they make
State-of-the-art count models incorporate lexico-syntactic relations
The two could possibly be combined into a better unified model
Open questions
“Do the dimensions of predict models also encode latent semantic domains?”
“Do these models afford the same flexibility of count vectors in capturing linguistically rich contexts?”
“Does the structure of predict vectors mimic meaningful semantic relations?”
Not feature engineering but context engineering
How to encode syntactic, topical and functional information into context features is still under development
Whether certain properties of vectors reflect semantic relations in the expected way: e.g. whether the vectors of hypernyms “distributionally include” the vectors of hyponyms
Summary
Context-predicting models perform as well as the highly tuned classic count-vector models on a wide range of lexical semantics tasks
Best models:
Count model: window size = 2; scoring = PMI; no dimension reduction; 300K dimensions
Predict model: window size = 5; no hierarchical softmax; negative sampling; 400 dimensions
This suggests a promising new direction in computational semantics
Q: Is it true that count models and predict models look at the same information? (cont.) I heard that word2vec uses a sampling-based method to determine how far it looks for the context window.
A: Possibly not. Predict models weight near neighbors more heavily than count models do. However, it is not clear that this accounts for the difference in performance.
Q: Is there any black magic in tuning parameters, especially the step variable in dimension reduction?
A: No. It is possible that the reduced dimensionality n and the size of the context vectors k behave similarly in a given range, but this may be acceptable for the following two reasons:
In count models, dimensionality reduction doesn’t really matter, since no compression performs best.
From a development point of view, the size of the final model has a large impact on deployment, so comparing these two variables makes sense at least in practice.
Q: Why do predict models outperform count models? Is there any theoretical analysis?
A: The authors do not discuss the reason in the paper.
It may be that predict models abstract semantic relations, providing stepping stones for inferring semantic relatedness more concisely.
Predict models tune a large number of parameters, so it is not surprising that they achieve better performance than count models.
Q: Is there any comparison on a PP-attachment task? (cont.) I read a paper saying that word2vec features do not improve PP-attachment, unlike the SVO modeling task.
A: No. It is possible that PP-attachment may fail, since the setting of this paper uses only local context.