Slides introducing the paper "Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors" by Marco Baroni, Georgiana Dinu and Germán Kruszewski (ACL 2014). Presented by Mamoru Komachi at the 6th summer camp of NLP, September 5th, 2014.
Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors
Marco Baroni, Georgiana Dinu and Germán Kruszewski (ACL 2014)
(Tables are taken from the above-mentioned paper)
Presented by Mamoru Komachi <[email protected]>
The 6th summer camp of NLP, September 5th, 2014
Well-known Distributional Hypothesis; any problems so far?
“A word is characterized by the company it keeps.” (Firth, 1957)
Characterize a word by its context (vector)
Widely accepted in the NLP community
Zellig Harris (1909-1992)
(Source: http://www.ircs.upenn.edu/zellig/)
Count-vector-based distributional semantic approaches faced a new challenge (deep learning)
“Context-predicting models (more commonly known as embeddings or neural language models) are the new kids on the distributional semantics blocks.”
“[T]he literature is still lacking a systematic comparison of the predictive models with classic, count-vector-based distributional semantic approaches.”
“The results, …, show that the buzz is fully justified, as the context-predicting models obtain a thorough and resounding victory against their count-based counterparts.”
Background: Count models and predict models
Count models are the traditional, standard way to model distributional semantics
Collect context vectors for each word type
Context vectors = n words on the left and right (symmetric, n = 2 and 5, position independent)
Context scores are calculated by positive pointwise mutual information (PPMI) or local mutual information (log-likelihood ratio)
Reduce dimensionality to k (k = 200 … 500) by singular value decomposition or non-negative matrix factorization
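The counting pipeline can be sketched in a few lines of NumPy. This is a toy illustration on a hypothetical 3x3 co-occurrence matrix, not the paper's actual DISSECT pipeline; only the PPMI formula and truncated SVD are standard.

```python
import numpy as np

def ppmi(counts):
    """Positive pointwise mutual information over a word-by-context count matrix."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)   # word marginals
    col = counts.sum(axis=0, keepdims=True)   # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts * total) / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0              # zero counts give log(0) = -inf
    return np.maximum(pmi, 0.0)               # keep only positive associations

# toy word-by-context co-occurrence counts (hypothetical numbers)
counts = np.array([[4.0, 1.0, 0.0],
                   [1.0, 3.0, 2.0],
                   [0.0, 2.0, 5.0]])
weights = ppmi(counts)

# reduce to k dimensions with truncated SVD (k = 2 here; the paper uses 200-500)
u, s, _ = np.linalg.svd(weights, full_matrices=False)
k = 2
reduced = u[:, :k] * s[:k]
```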
Predict models are newer, training-based ways to model distributional semantics
Optimize context vectors for each word type
Context vectors = n words on the left and right (symmetric, n = 2 and 5, position independent) (Collobert et al., 2011)
Learn a model to predict a word given its context vectors
Can directly optimize the weights of a word’s context vector using supervised learning (but with no manual annotation, i.e. predict models use the same unannotated data as count models)
Mikolov et al. (2013): each word type is mapped to k dimensions (k = 200 … 500)
Collobert & Weston (2008) model: 100-dimensional vectors, trained on Wikipedia for two months (!)
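The prediction objective can be illustrated with a minimal skip-gram-with-negative-sampling loop in plain NumPy. This is a didactic sketch on a four-word toy corpus, not the actual word2vec or C&W implementation; the hyperparameters (dim, negatives, lr, epochs) are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_sgns(pairs, vocab_size, dim=10, negatives=2, lr=0.05, epochs=100):
    """Minimal skip-gram with negative sampling over (target, context) index pairs."""
    W = rng.normal(scale=0.1, size=(vocab_size, dim))  # target (word) vectors
    C = rng.normal(scale=0.1, size=(vocab_size, dim))  # context vectors
    for _ in range(epochs):
        for t, c in pairs:
            # positive update: raise the score of the observed (target, context) pair
            g = sigmoid(W[t] @ C[c]) - 1.0
            wt = W[t].copy()
            W[t] -= lr * g * C[c]
            C[c] -= lr * g * wt
            # negative updates: lower the score of randomly sampled noise pairs
            for n in rng.integers(0, vocab_size, size=negatives):
                g = sigmoid(W[t] @ C[n])
                wt = W[t].copy()
                W[t] -= lr * g * C[n]
                C[n] -= lr * g * wt
    return W, C

# toy "corpus": words 0/1 co-occur and words 2/3 co-occur
pairs = [(0, 1), (1, 0), (2, 3), (3, 2)] * 5
W, C = train_sgns(pairs, vocab_size=4)
```

After training, the score of an observed pair such as (0, 1) should be well above chance.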
Tasks: Lexical semantics
Training data and toolkits are freely available: easy to re-implement
Training data: ukWaC + English Wikipedia + British National Corpus
2.8 billion tokens (retain the top 300K most frequent words for target and context modeling)
Toolkits
Count model: DISSECT toolkit (authors’ software)
Predict model: word2vec, Collobert & Weston model
Benchmarks: 5 standard tasks in distributional semantic modeling
Semantic relatedness
Synonym detection
Concept categorization
Selectional preferences
Analogy
Semantic relatedness: rate the degree of semantic similarity between two words on a numerical scale
Evaluation
Compare the correlation between the average scores that human subjects assigned to the pairs and the cosines between the corresponding vectors under the count/predict models
Datasets
Rubenstein and Goodenough (1965): 65 noun pairs
WordSim353 (Finkelstein et al., 2002): 353 pairs
Agirre et al. (2009): split WordSim353 into similarity and relatedness subsets
MEN (Bruni et al., 2013): 1,000 word pairs
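The evaluation protocol (rank correlation between human ratings and model cosines) can be sketched as follows. The word pairs, ratings, and vectors here are invented for illustration, and a tie-free Spearman helper stands in for scipy.stats.spearmanr.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors (no tie handling)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / (np.linalg.norm(rx) * np.linalg.norm(ry)))

# hypothetical human ratings for three word pairs, and toy 2-D vectors
human = np.array([9.0, 6.5, 1.0])
vectors = {
    "car":  np.array([1.0, 0.1]), "automobile": np.array([0.9, 0.2]),
    "cup":  np.array([0.4, 0.8]), "mug":        np.array([0.5, 0.7]),
    "noon": np.array([0.0, 1.0]), "string":     np.array([1.0, 0.0]),
}
pairs = [("car", "automobile"), ("cup", "mug"), ("noon", "string")]
model = np.array([cosine(vectors[a], vectors[b]) for a, b in pairs])
rho = spearman(human, model)
```

In this toy case the model ranks the pairs exactly as the humans do, so the correlation is perfect.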
Synonym detection: given a target term, choose a word from 4 synonym candidates
Example
levied -> (imposed = correct, believed, requested, correlated)
Method
Compute the cosine of each candidate vector with the target, and pick the candidate with the largest cosine as the answer (an extensively tuned count model achieves 100% accuracy)
Dataset
TOEFL set (Landauer and Dumais, 1997): 80 multiple-choice questions that pair a target word with 4 synonym candidates
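The method reduces to one cosine comparison per candidate. A sketch of the TOEFL example with hypothetical toy vectors (not trained ones):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def choose_synonym(target, candidates, vectors):
    """Pick the candidate whose vector has the largest cosine with the target."""
    return max(candidates, key=lambda w: cosine(vectors[target], vectors[w]))

# toy vectors chosen so that "imposed" is closest to "levied"
vectors = {
    "levied":     np.array([0.9, 0.1, 0.0]),
    "imposed":    np.array([0.8, 0.2, 0.1]),
    "believed":   np.array([0.1, 0.9, 0.1]),
    "requested":  np.array([0.2, 0.1, 0.9]),
    "correlated": np.array([0.1, 0.5, 0.5]),
}
answer = choose_synonym("levied",
                        ["imposed", "believed", "requested", "correlated"],
                        vectors)
```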
Concept categorization: group a set of nominal concepts into natural categories
Examples
helicopters and motorcycles -> vehicle class
dogs and elephants -> mammal class
Method
Unsupervised clustering into n clusters (n is given by the gold data)
Datasets
Almuhareb-Poesio benchmark (2006): 402 concepts organized into 21 categories
ESSLLI 2008 Distributional Semantic Workshop shared-task set (Baroni et al., 2008): 44 concepts into 6 categories
Battig set (Baroni et al., 2010): 83 concepts into 10 categories
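The clustering step can be sketched with plain k-means plus a purity score. The paper uses an off-the-shelf clustering toolkit, so this NumPy stand-in and its toy 2-D "concepts" are only illustrative.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means; an illustrative stand-in for the toolkit used in the paper."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center, then move centers to cluster means
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def purity(labels, gold):
    """Fraction of items whose cluster's majority gold class matches their own."""
    correct = 0
    for j in set(labels):
        members = [g for l, g in zip(labels, gold) if l == j]
        correct += max(members.count(c) for c in set(members))
    return correct / len(gold)

# toy concepts: two 'vehicle' points and two 'mammal' points; k is given by the gold data
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
gold = ["vehicle", "vehicle", "mammal", "mammal"]
labels = kmeans(X, k=2)
score = purity(labels, gold)
```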
Selectional preferences: given a verb-noun pair, rate the typicality of the noun as a subject or object of the verb
Example
(eat, people) -> assign a high score for the subject relation, a low score for the object relation
Method
Take the 20 nouns most strongly associated with the verb, average their vectors to get a prototype vector, and then compute the cosine similarity of the candidate noun to that vector
Datasets
Padó (2007): 211 pairs
McRae et al. (1998): 100 pairs
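The prototype method can be sketched directly. The vectors below are invented for illustration; in the actual setup the associated nouns are extracted from the corpus, not hand-picked.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def typicality(noun_vec, associated_noun_vecs):
    """Score a candidate noun against the prototype (average) of the nouns
    most strongly associated with the verb in that slot."""
    prototype = np.mean(associated_noun_vecs, axis=0)
    return cosine(noun_vec, prototype)

# toy vectors for typical subjects of 'eat' (hypothetical data, not from the datasets)
subjects_of_eat = [np.array([0.9, 0.1]), np.array([0.8, 0.2])]
people = np.array([0.85, 0.15])   # should be a typical subject
stone = np.array([0.1, 0.9])      # should be an atypical subject
```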
Analogy: given a pair of words and a test word, find another word that instantiates the same relation
Examples
(brother : sister, grandson : X) -> X = granddaughter
(work : works, speak : X) -> X = speaks
Method
Subtract the first example term vector from the second, add the test term vector, and find the nearest neighbor to that vector (Mikolov et al., 2013)
Dataset
Mikolov et al. (2013): 9K semantic and 10.5K syntactic analogy questions
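The vector-offset method is one subtraction and addition plus a nearest-neighbor search. The toy vectors below are constructed so the analogy works (gender and generation as separate axes); excluding the query words from the search follows common practice for this benchmark.

```python
import numpy as np

def solve_analogy(a, b, c, vectors, exclude):
    """Offset method: answer = nearest neighbor of vec(b) - vec(a) + vec(c) by cosine."""
    target = vectors[b] - vectors[a] + vectors[c]
    best, best_cos = None, -2.0
    for w, v in vectors.items():
        if w in exclude:  # do not return one of the query words themselves
            continue
        cos = float(target @ v / (np.linalg.norm(target) * np.linalg.norm(v)))
        if cos > best_cos:
            best, best_cos = w, cos
    return best

# hypothetical vectors: axis 0/1 encode gender, axis 2 encodes generation
vectors = {
    "brother":       np.array([1.0, 0.0, 1.0]),
    "sister":        np.array([0.0, 1.0, 1.0]),
    "grandson":      np.array([1.0, 0.0, -1.0]),
    "granddaughter": np.array([0.0, 1.0, -1.0]),
}
answer = solve_analogy("brother", "sister", "grandson", vectors,
                       exclude={"brother", "sister", "grandson"})
```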
Experiments: 5 tasks of lexical semantics
Results and discussions
Lexical semantics
Results: Predict models outperform count models
Predict models are not so sensitive to the parameter settings
Observations
Count model
PMI is better than LLR
SVD outperforms NMF, but no compression at all (the full matrix) performs best
Predict model
Negative sampling outperforms the costly hierarchical softmax method
Subsampling frequent words seems to have a similar effect to PMI weighting in count models
Off-the-shelf C&W model
Poor performance (under investigation)
Discussions
Predict models obtained excellent results by trying only a few variations of the default settings, whereas count models need a thorough optimization of a large number of parameters to reach maximum performance
Predict models scale to large datasets and use only hundreds of dimensions, without intense tuning
Count models and predict models are complementary in the errors they make
State-of-the-art count models incorporate lexico-syntactic relations
The two could possibly be combined into a better unified model
Open questions
“Do the dimensions of predict models also encode latent semantic domains?”
“Do these models afford the same flexibility of count vectors in capturing linguistically rich contexts?”
“Does the structure of predict vectors mimic meaningful semantic relations?”
Not feature engineering but context engineering
How to encode syntactic, topical and functional information into context features is still under development
Whether certain properties of vectors reflect semantic relations in the expected way: e.g. whether the vectors of hypernyms “distributionally include” the vectors of hyponyms
Summary
Context-predicting models perform as well as the highly tuned classic count-vector models on a wide range of lexical semantics tasks
Best models:
Count model: window size = 2; scoring = PMI; no dimension reduction; 300K dimensions
Predict model: window size = 5; no hierarchical softmax; negative sampling; 400 dimensions
This suggests a promising new direction in computational semantics
Q: Is it true that count models and predict models look at the same information? (cont.) I heard that word2vec uses a sampling-based method to determine how far it looks for the context window.
A: Possibly not. Predict models weight near neighbors more heavily than count models do. However, it is not clear that this accounts for the difference in performance.
Q: Is there any black magic in tuning parameters, especially the step variable in dimension reduction?
A: No. It is possible that the reduced dimensionality n and the size of the context vectors k behave similarly in a given range, but this may be acceptable for the following two reasons:
In count models, dimensionality reduction doesn’t really matter, since no compression performs best.
From a development point of view, the size of the final model has a large impact on deployment, so comparing these two variables makes sense at least in practice.
Q: Why do predict models outperform count models? Is there any theoretical analysis?
A: The authors do not discuss the reason in the paper.
It may be that predict models abstract semantic relations, providing stepping stones for inferring semantic relatedness more concisely.
Predict models tune a large number of parameters, so it is not surprising that they achieve better performance than count models.
Q: Is there any comparison on a PP-attachment task? (cont.) I read a paper saying that word2vec features do not improve PP-attachment, unlike the SVO modeling task.
A: No. It is possible that PP-attachment may fail, since the setting of this paper uses only local context.