Page 1

A practical introduction to distributional semantics

PART I: Co-occurrence matrix models

Marco Baroni

Center for Mind/Brain Sciences, University of Trento

Symposium on Semantic Text Processing, Bar Ilan University, November 2014

Page 2

Acknowledging...

Georgiana Dinu

COMPOSES: COMPositional Operations in SEmantic Space

Page 3

The vastness of word meaning

Page 4

The distributional hypothesis
Harris, Charles and Miller, Firth, Wittgenstein? ...

The meaning of a word is (can be approximated by, learned from) the set of contexts in which it occurs in texts

We found a little, hairy wampimuk sleeping behind the tree

See also McDonald & Ramscar, CogSci 2001

Page 5

Distributional semantic models in a nutshell
“Co-occurrence matrix” models; see Yoav’s part for neural models

- Represent words through vectors recording their co-occurrence counts with context elements in a corpus
- (Optionally) apply a re-weighting scheme to the resulting co-occurrence matrix
- (Optionally) apply dimensionality reduction techniques to the co-occurrence matrix
- Measure geometric distance of word vectors in “distributional space” as proxy to semantic similarity/relatedness

Page 6

Co-occurrence

he curtains open and the moon shining in on the barely
ars and the cold , close moon " . And neither of the w
rough the night with the moon shining so brightly , it
made in the light of the moon . It all boils down , wr
surely under a crescent moon , thrilled by ice-white
sun , the seasons of the moon ? Home , alone , Jay pla
m is dazzling snow , the moon has risen full and cold
un and the temple of the moon , driving out of the hug
in the dark and now the moon rises , full and amber a
bird on the shape of the moon over the trees in front
But I could n’t see the moon or the stars , only the
rning , with a sliver of moon hanging among the stars
they love the sun , the moon and the stars . None of
the light of an enormous moon . The plash of flowing w
man ’s first step on the moon ; various exhibits , aer
the inevitable piece of moon rock . Housing The Airsh
oud obscured part of the moon . The Allied guns behind

Page 7

Extracting co-occurrence counts
Variations in context features

Documents as contexts:

        Doc1  Doc2  Doc3
stars     38    45     2

Word patterns as contexts:

        The nearest • to Earth    stories of • and their
stars                       12                        10

Dependency-parsed contexts (from "see bright shiny stars", with dobj and mod arcs):

        dobj←−−see    mod−−→bright    mod−−→shiny
stars           38              45             44

Page 8

Extracting co-occurrence counts
Variations in the definition of co-occurrence

E.g.: Co-occurrence with words, window of size 2, scaling by distance to target:

... two [intensely bright stars in the] night sky ...

        intensely  bright  in   the
stars         0.5       1   1   0.5
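To make the recipe concrete, here is a minimal Python sketch of window-based counting with 1/distance scaling as illustrated above; the helper name and the toy sentence are ours, and a real extractor would stream over a full corpus.

from collections import defaultdict

def cooccurrence_counts(tokens, window=2, scale_by_distance=True):
    # counts[target][context] accumulates (optionally distance-scaled) co-occurrence weights
    counts = defaultdict(lambda: defaultdict(float))
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            counts[target][tokens[j]] += (1.0 / abs(i - j)) if scale_by_distance else 1.0
    return counts

tokens = "two intensely bright stars in the night sky".split()
print(dict(cooccurrence_counts(tokens)["stars"]))
# {'intensely': 0.5, 'bright': 1.0, 'in': 1.0, 'the': 0.5}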

Page 9

Same corpus (BNC), different window sizes
Nearest neighbours of dog

2-word window: cat, horse, fox, pet, rabbit, pig, animal, mongrel, sheep, pigeon

30-word window: kennel, puppy, pet, bitch, terrier, rottweiler, canine, cat, to bark, Alsatian

Page 10

From co-occurrences to vectors

        bright  in  sky
stars        8  10    6
sun         10  15    4
dog          2  20    1

Page 11

Weighting

Re-weight the counts using corpus-level statistics to reflect co-occurrence significance

Positive Pointwise Mutual Information (PPMI)

PPMI(target, ctxt) = max(0, log [ P(target, ctxt) / (P(target) P(ctxt)) ])
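A minimal numpy sketch of the PPMI re-weighting defined above, applied to the toy stars/sun/dog counts from the earlier slide; the function name and the handling of zero counts are our choices.

import numpy as np

def ppmi(counts):
    # counts: (targets x contexts) matrix of raw co-occurrence counts
    total = counts.sum()
    p_tc = counts / total                              # P(target, ctxt)
    p_t = counts.sum(axis=1, keepdims=True) / total    # P(target)
    p_c = counts.sum(axis=0, keepdims=True) / total    # P(ctxt)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_tc / (p_t * p_c))
    pmi[~np.isfinite(pmi)] = 0.0                       # zero counts: undefined PMI -> 0
    return np.maximum(pmi, 0.0)                        # keep only positive associations

counts = np.array([[8., 10., 6.],     # stars: bright, in, sky
                   [10., 15., 4.],    # sun
                   [2., 20., 1.]])    # dog
print(np.round(ppmi(counts), 2))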

Page 12

Weighting

Adjusting raw co-occurrence counts:

        bright     in
stars      385  10788  ...   ← Counts
stars     43.6    5.3  ...   ← PPMI

Other weighting schemes:
- TF-IDF
- Local Mutual Information
- Dice

See Ch. 4 of J.R. Curran’s thesis (2004) and S. Evert’s thesis (2007) for surveys of weighting methods

Page 13

Dimensionality reduction

- Vector spaces often range from tens of thousands to millions of context dimensions
- Some of the methods to reduce dimensionality:
  - Select context features based on various relevance criteria
  - Random indexing
  - The following are also claimed to have a beneficial smoothing effect:
    - Singular Value Decomposition
    - Non-negative matrix factorization
    - Probabilistic Latent Semantic Analysis
    - Latent Dirichlet Allocation

Page 14

The SVD factorization

Image courtesy of Yoav
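A minimal sketch of how the factorization is typically used: take the top-k singular vectors of the (re-weighted) co-occurrence matrix and use the scaled left singular vectors as reduced word vectors. numpy's dense SVD is fine for a toy matrix; for realistic vocabularies one would use a sparse truncated solver such as scipy.sparse.linalg.svds. All names here are ours.

import numpy as np

def svd_reduce(X, k=2):
    # X ≈ U_k S_k V_k^T ; rows of U_k S_k serve as k-dimensional word vectors
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * S[:k]

X = np.array([[8., 10., 6.],
              [10., 15., 4.],
              [2., 20., 1.]])
print(svd_reduce(X, k=2))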

Page 15

Dimensionality reduction as “smoothing”

(Figure: words plotted along the “buy” and “sell” dimensions.)

Page 16

From geometry to similarity in meaning

(Figure: stars and sun plotted as vectors in a 2-dimensional space.)

Vectors

stars  2.5  2.1
sun    2.9  3.1

Cosine similarity

cos(x, y) = ⟨x, y⟩ / (‖x‖ ‖y‖) = Σ_{i=1..n} x_i y_i / ( √(Σ_{i=1..n} x_i²) · √(Σ_{i=1..n} y_i²) )

Other similarity measures: Euclidean Distance, Dice, Jaccard, Lin...
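The same cosine in a few lines of numpy, using the toy stars/sun vectors from this slide (names ours):

import numpy as np

def cosine(x, y):
    # cos(x, y) = <x, y> / (||x|| ||y||)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

stars = np.array([2.5, 2.1])
sun = np.array([2.9, 3.1])
print(round(cosine(stars, sun), 3))   # ≈ 0.993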

Page 17

Geometric neighbours ≈ semantic neighbours

rhino        fall         good       sing
woodpecker   rise         bad        dance
rhinoceros   increase     excellent  whistle
swan         fluctuation  superb     mime
whale        drop         poor       shout
ivory        decrease     improved   sound
plover       reduction    perfect    listen
elephant     logarithm    clever     recite
bear         decline      terrific   play
satin        cut          lucky      hear
sweatshirt   hike         smashing   hiss

Page 18

Benchmarks: Similarity/relatedness

E.g.: Rubenstein and Goodenough, WordSim-353, MEN, SimLex-999...

MEN

chapel   church      0.45
eat      strawberry  0.33
jump     salad       0.06
bikini   pizza       0.01

How: Measure correlation of model cosines with human similarity/relatedness judgments

Top MEN Spearman correlation for co-occurrence matrix models (Baroni et al. ACL 2014): 0.72
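A minimal sketch of this evaluation recipe: compute the model cosine for each word pair and correlate with the human scores. Data loading is omitted and all names are ours; scipy provides the Spearman correlation.

import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(word_vectors, pairs):
    # pairs: iterable of (word1, word2, human_score); word_vectors: dict word -> numpy array
    model_scores, gold_scores = [], []
    for w1, w2, gold in pairs:
        v1, v2 = word_vectors[w1], word_vectors[w2]
        model_scores.append(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
        gold_scores.append(gold)
    return spearmanr(model_scores, gold_scores).correlation

# e.g. evaluate_similarity(vecs, [("chapel", "church", 0.45), ("jump", "salad", 0.06)])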

Page 19

Benchmarks: Categorization

E.g.: Almuhareb/Poesio, ESSLLI 2008 Shared Task, Battig set

ESSLLI

VEHICLE      MAMMAL
helicopter   dog
motorcycle   elephant
car          cat

How: Feed model-produced similarity matrix to clustering algorithm, look at overlap between clusters and gold categories

Top ESSLLI cluster purity for co-occurrence matrix models (Baroni et al. ACL 2014): 0.84
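A sketch of the categorization benchmark. The slide feeds a similarity matrix to a clustering algorithm; as a simplification this version clusters the word vectors directly with k-means and then computes cluster purity. All names, and the use of scikit-learn, are our choices.

import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def cluster_purity(word_vectors, gold_labels):
    # gold_labels: dict word -> category; word_vectors: dict word -> numpy array
    words = sorted(gold_labels)
    X = np.vstack([word_vectors[w] for w in words])
    k = len(set(gold_labels.values()))
    assignment = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    correct = 0
    for cluster_id in set(assignment):
        members = [gold_labels[w] for w, c in zip(words, assignment) if c == cluster_id]
        correct += Counter(members).most_common(1)[0][1]   # size of the majority class
    return correct / len(words)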

Page 20

Benchmarks: Selectional preferences

E.g.: Ulrike Padó, Ken McRae et al.’s data sets

Padó

eat  villager  obj   1.7
eat  pizza     obj   6.8
eat  pizza     subj  1.1

How (Erk et al. CL 2010): 1) Create “prototype” argument vector by averaging vectors of nouns typically occurring as argument fillers (e.g., frequent objects of to eat); 2) measure cosine of target noun with prototype (e.g., cosine of villager vector with eat-object prototype vector); 3) correlate with human scores

Top Padó Spearman correlation for co-occurrence matrix models (Baroni et al. ACL 2014): 0.41
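A minimal sketch of the prototype-based recipe described above (Erk et al. CL 2010): average the vectors of typical fillers, then score a candidate noun by its cosine with that prototype. The filler list and all names are illustrative.

import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def selectional_fit(word_vectors, candidate, typical_fillers):
    # prototype = average vector of nouns typically filling the argument slot
    prototype = normalize(np.mean([word_vectors[w] for w in typical_fillers], axis=0))
    return float(np.dot(normalize(word_vectors[candidate]), prototype))

# e.g. selectional_fit(vecs, "villager", ["pizza", "bread", "meat", "apple"])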

Page 21

Selectional preferences
Examples from the Baroni/Lenci implementation

To kill...

object         cosine      with         cosine
kangaroo       0.51        hammer       0.26
person         0.45        stone        0.25
robot          0.15        brick        0.18
hate           0.11        smile        0.15
flower         0.11        flower       0.12
stone          0.05        antibiotic   0.12
fun            0.05        person       0.12
book           0.04        heroin       0.12
conversation   0.03        kindness     0.07
sympathy       0.01        graduation   0.04

Page 22

Benchmarks: Analogy

Method and data sets from Mikolov and collaborators

syntactic analogy      semantic analogy
work    speak          brother   grandson
works   speaks         sister    granddaughter

vec(speaks) ≈ vec(works) − vec(work) + vec(speak)

How: Response counts as hit only if the nearest neighbour (in a large vocabulary) of the vector obtained with the subtraction and addition operations above is the intended one

Top accuracy for co-occurrence matrix models (Baroni et al. ACL 2014): 0.49
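A minimal numpy sketch of this evaluation (often called 3CosAdd): build vec(b) − vec(a) + vec(c) and check whether its nearest neighbour in the vocabulary, excluding the three question words, is the expected answer. W is assumed to be a row-normalized |V| x d matrix; all names are ours.

import numpy as np

def analogy(W, words, w2i, a, b, c, topn=1):
    # a : b  ::  c : ?   ->   query = vec(b) - vec(a) + vec(c)
    query = W[w2i[b]] - W[w2i[a]] + W[w2i[c]]
    query /= np.linalg.norm(query)
    sims = W.dot(query)
    for w in (a, b, c):                 # never return the question words themselves
        sims[w2i[w]] = -np.inf
    return [words[i] for i in np.argsort(-sims)[:topn]]

# e.g. analogy(W, words, w2i, "work", "works", "speak") should return ["speaks"]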

Page 23

Distributional semantics: A general-purpose representation of lexical meaning
Baroni and Lenci 2010

- Similarity (cord-string vs. cord-smile)
- Synonymy (zenith-pinnacle)
- Concept categorization (car ISA vehicle; banana ISA fruit)
- Selectional preferences (eat topinambur vs. *eat sympathy)
- Analogy (mason is to stone like carpenter is to wood)
- Relation classification (exam-anxiety are in CAUSE-EFFECT relation)
- Qualia (TELIC ROLE of novel is to entertain)
- Salient properties (car-wheels, dog-barking)
- Argument alternations (John broke the vase - the vase broke, John minces the meat - *the meat minced)

Page 24

Practical recommendations
Mostly from Baroni et al. ACL 2014; see more evaluation work in the reading list below

- Narrow context windows are best (1, 2 words left and right)
- Full matrix better than dimensionality reduction
- PPMI weighting best
- Dimensionality reduction with SVD better than with NMF

Page 25

An example application
Bilingual lexicon/phrase table induction from monolingual resources

Saluja et al. (ACL 2014) obtain significant improvements in English-Urdu and English-Arabic BLEU scores using phrase tables enlarged with pairs induced by exploiting distributional similarity structure in source and target languages

Figure credit: Mikolov et al. 2013

Page 26

The infinity of sentence meaning

Page 27

Compositionality
The meaning of an utterance is a function of the meaning of its parts and their composition rules (Frege 1892)

Page 28

Compositional distributional semantics: What for?

Word meaning in context (Mitchell and Lapata ACL 2008)


Paraphrase detection (Blacoe and Lapata EMNLP 2012)

(Figure: phrase vectors plotted in a 2-dimensional space, dim 1 vs. dim 2:
“cookie dwarfs hop under the crimson planet”
“gingerbread gnomes dance under the red moon”
“red gnomes love gingerbread cookies”
“students eat cup noodles”)

Page 29

Compositional distributional semantics: How?

From: Simple functions

vec(very) + vec(good) + vec(movie) = vec(very good movie)

Mitchell and Lapata ACL 2008

To: Complex composition operations

Socher et al. EMNLP 2013
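A minimal sketch of the simple additive model on the left (Mitchell and Lapata 2008): the phrase vector is just the sum of its word vectors. Names are ours; Mitchell and Lapata also explore component-wise multiplication and weighted variants.

import numpy as np

def compose_additive(word_vectors, phrase):
    # phrase vector = sum of the word vectors
    return np.sum([word_vectors[w] for w in phrase.split()], axis=0)

# e.g. compose_additive(vecs, "very good movie")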

Page 30

Some references

- Classics:
  - Schütze’s 1997 CSLI book
  - Landauer and Dumais PsychRev 1997
  - Griffiths et al. PsychRev 2007
- Overviews:
  - Turney and Pantel JAIR 2010
  - Erk LLC 2012
  - Baroni LLC 2013
  - Clark, to appear in Handbook of Contemporary Semantics
- Evaluation:
  - Sahlgren’s 2006 thesis
  - Bullinaria and Levy BRM 2007, 2012
  - Baroni, Dinu and Kruszewski ACL 2014
  - Kiela and Clark CVSC 2014

Page 31

Fun with distributional semantics!

http://clic.cimec.unitn.it/infomap-query/

Page 32

Making Sense of Distributed (Neural) Semantics

Yoav Goldberg, [email protected]

Nov 2014

Page 33

From Distributional to Distributed Semantics

The new kid on the block
- Deep learning / neural networks
- “Distributed” word representations

- Feed text into neural-net. Get back “word embeddings”.
- Each word is represented as a low-dimensional vector.
- Vectors capture “semantics”

- word2vec (Mikolov et al.)

Page 34

From Distributional to Distributed Semantics

This part of the talk

- word2vec as a black box
- a peek inside the black box
- relation between word embeddings and the distributional representation
- tailoring word embeddings to your needs using word2vecf

Page 35

word2vec

Page 37

word2vec

- dog
  - cat, dogs, dachshund, rabbit, puppy, poodle, rottweiler, mixed-breed, doberman, pig
- sheep
  - cattle, goats, cows, chickens, sheeps, hogs, donkeys, herds, shorthorn, livestock
- november
  - october, december, april, june, february, july, september, january, august, march
- jerusalem
  - tiberias, jaffa, haifa, israel, palestine, nablus, damascus, katamon, ramla, safed
- teva
  - pfizer, schering-plough, novartis, astrazeneca, glaxosmithkline, sanofi-aventis, mylan, sanofi, genzyme, pharmacia

Page 38

Working with Dense Vectors

Word Similarity

- Similarity is calculated using cosine similarity:

  sim(~dog, ~cat) = (~dog · ~cat) / (‖~dog‖ ‖~cat‖)

- For normalized vectors (‖x‖ = 1), this is equivalent to a dot product:

  sim(~dog, ~cat) = ~dog · ~cat

- Normalize the vectors when loading them.

Page 39

Working with Dense Vectors

Finding the most similar words to ~dog

- Compute the similarity from word ~v to all other words.
- This is a single matrix-vector product: W · ~v⊤
- Result is a |V|-sized vector of similarities.
- Take the indices of the k highest values.
- FAST! For 180k words, d=300: ∼30ms

Page 43

Working with Dense Vectors

Most Similar Words, in python+numpy code

# W and words are numpy arrays; load_and_normalize_vectors is assumed to
# read the vectors from vecs.txt and L2-normalize each row of W.
W, words = load_and_normalize_vectors("vecs.txt")
w2i = {w: i for i, w in enumerate(words)}

dog = W[w2i['dog']]   # get the dog vector

sims = W.dot(dog)     # compute similarities (cosines, since rows are normalized)

most_similar_ids = sims.argsort()[-1:-10:-1]   # indices of the 9 highest similarities, best first
sim_words = words[most_similar_ids]

Page 44

Working with Dense Vectors

Similarity to a group of words

- “Find me words most similar to cat, dog and cow”.
- Calculate the pairwise similarities and sum them:

W · ~cat + W · ~dog + W · ~cow

- Now find the indices of the highest values as before.
- Matrix-vector products are wasteful. Better option:

W · ( ~cat + ~dog + ~cow)
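Continuing the numpy snippet from the earlier slide (W, words, w2i as defined there), the summed-query trick looks like this; the ranking is identical to summing the three separate similarity vectors because W · (cat + dog + cow) = W · cat + W · dog + W · cow.

query = W[w2i['cat']] + W[w2i['dog']] + W[w2i['cow']]   # one summed query vector
sims = W.dot(query)                                     # a single matrix-vector product
group_similar = words[sims.argsort()[-1:-11:-1]]        # top 10, most similar first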

Page 46

Working with dense word vectors can be very efficient.

But where do these vectors come from?

Page 48

How does word2vec work?

word2vec implements several different algorithms:

Two training methods

- Negative Sampling
- Hierarchical Softmax

Two context representations

- Continuous Bag of Words (CBOW)
- Skip-grams

We’ll focus on skip-grams with negative sampling

intuitions apply for other models as well

Page 50: A practical introduction to distributional semanticsclic.cimec.unitn.it/composes/materials/practical-ds-marco-yoav.pdf · A practical introduction to distributional semantics

How does word2vec work?

- Represent each word as a d-dimensional vector.
- Represent each context as a d-dimensional vector.
- Initialize all vectors to random weights.
- Arrange vectors in two matrices, W and C.

Page 51

How does word2vec work?
While more text:

- Extract a word window:
  A springer is [ a cow or heifer close to calving ] .
                 c1  c2  c3   w     c4    c5    c6

- w is the focus word vector (row in W).
- ci are the context word vectors (rows in C).

- Try setting the vector values such that:

σ(w · c1) + σ(w · c2) + σ(w · c3) + σ(w · c4) + σ(w · c5) + σ(w · c6)

is high

- Create a corrupt example by choosing a random word w′:
  [ a cow or comet close to calving ]
   c1  c2  c3   w′    c4   c5    c6

- Try setting the vector values such that:

σ(w′ · c1) + σ(w′ · c2) + σ(w′ · c3) + σ(w′ · c4) + σ(w′ · c5) + σ(w′ · c6)

is low
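A minimal numpy sketch of the update this slide describes: for every context in the window, push σ(w · ci) up for the true focus word and push σ(w′ · ci) down for the randomly chosen corrupt word. Function and argument names are ours, and real word2vec adds a unigram-based noise distribution, subsampling of frequent words, learning-rate decay, etc.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_window_update(W, C, true_id, corrupt_id, context_ids, lr=0.025):
    # W: focus-word vectors, C: context vectors (both |V| x d numpy arrays)
    for word_id, label in [(true_id, 1.0), (corrupt_id, 0.0)]:
        w = W[word_id].copy()
        grad_w = np.zeros_like(w)
        for c_id in context_ids:
            g = label - sigmoid(np.dot(w, C[c_id]))   # push sigma(w . c) towards the label
            grad_w += g * C[c_id]
            C[c_id] += lr * g * w
        W[word_id] += lr * grad_w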

Page 54

How does word2vec work?

The training procedure results in:
- w · c for good word-context pairs is high
- w · c for bad word-context pairs is low
- w · c for ok-ish word-context pairs is neither high nor low

As a result:
- Words that share many contexts get close to each other.
- Contexts that share many words get close to each other.

At the end, word2vec throws away C and returns W.

Page 55

Reinterpretation

Imagine we didn’t throw away C. Consider the product WC⊤

The result is a matrix M in which:
- Each row corresponds to a word.
- Each column corresponds to a context.
- Each cell corresponds to w · c, an association measure between a word and a context.

Page 57

Reinterpretation

Does this remind you of something?

Very similar to SVD over distributional representation:

Page 59

Relation between SVD and word2vec

SVD
- Begin with a word-context matrix.
- Approximate it with a product of low-rank (thin) matrices.
- Use the thin matrix as the word representation.

word2vec (skip-grams, negative sampling)

- Learn thin word and context matrices.
- These matrices can be thought of as approximating an implicit word-context matrix.
- In Levy and Goldberg (NIPS 2014) we show that this implicit matrix is related to the well-known PPMI matrix.
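The Levy and Goldberg result says that skip-gram with k negative samples implicitly factorizes a matrix whose cells are PMI(w, c) − log k. A minimal sketch of building that shifted positive PMI matrix from raw counts (names and the zero-count handling are our choices):

import numpy as np

def shifted_ppmi(counts, k=5):
    # max(PMI(w, c) - log k, 0): the association matrix implicitly factorized by SGNS
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0          # zero counts contribute nothing
    return np.maximum(pmi - np.log(k), 0.0)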

Page 60

Relation between SVD and word2vec

word2vec is a dimensionality reduction technique over an (implicit) word-context matrix.

Just like SVD.

With a few tricks (Levy, Goldberg and Dagan, in submission) we can get SVD to perform just as well as word2vec.

However, word2vec...

- ...works without building / storing the actual matrix in memory.
- ...is very fast to train, can use multiple threads.
- ...can easily scale to huge data and very large word and context vocabularies.

Page 62

Beyond word2vec

Page 63

Beyond word2vec

- word2vec is factorizing a word-context matrix.
- The content of this matrix affects the resulting similarities.
- word2vec allows you to specify a window size.
- But what about other types of contexts?

- Example: dependency contexts (Levy and Goldberg, ACL 2014)

Page 64

Australian scientist discovers star with telescope

Bag of Words (BoW) Context

Page 67

Australian scientist discovers star with telescope

Syntactic Dependency Context

(Dependency arcs: nsubj, dobj, prep_with)

Page 69

Embedding Similarity with Different Contexts

Target word: Hogwarts (Harry Potter’s school)

Bag of Words (k=5) neighbours: Dumbledore, hallows, half-blood, Malfoy, Snape (related to Harry Potter)
Dependency neighbours: Sunnydale, Collinwood, Calarts, Greendale, Millfield (schools)

Page 70

Embedding Similarity with Different Contexts

Target word: Turing (computer scientist)

Bag of Words (k=5) neighbours: nondeterministic, non-deterministic, computability, deterministic, finite-state (related to computability)
Dependency neighbours: Pauling, Hotelling, Heting, Lessing, Hamming (scientists)

Page 71

Online Demo!

Embedding Similarity with Different Contexts

Target word: dancing (dance gerund)

Bag of Words (k=5) neighbours: singing, dance, dances, dancers, tap-dancing (related to dance)
Dependency neighbours: singing, rapping, breakdancing, miming, busking (gerunds)

Page 72

Context matters

Choose the correct contexts for your application

- larger window sizes – more topical
- dependency relations – more functional
- only noun-adjective relations
- only verb-subject relations
- context: time of the current message
- context: user who wrote the message
- ...
- the sky is the limit

Page 76

Software

word2vecf
https://bitbucket.org/yoavgo/word2vecf

- Extension of word2vec.
- Allows saving the context matrix.
- Allows using arbitrary contexts.
- Input is a (large) file of word-context pairs.

Page 77

Software

hyperwords
https://bitbucket.org/omerlevy/hyperwords/

- Python library for working with either sparse or dense word vectors (similarity, analogies).
- Scripts for creating dense representations using word2vecf or SVD.
- Scripts for creating sparse distributional representations.

Page 78

Software

dissect
http://clic.cimec.unitn.it/composes/toolkit/

- Given vector representations of words...
- ...derive vector representations of phrases/sentences
- Implements various composition methods

Page 79

Summary
Distributional Semantics

- Words in similar contexts have similar meanings.
- Represent a word by the contexts it appears in.
- But what is a context?

Neural Models (word2vec)

- Represent each word as a dense, low-dimensional vector.
- Same intuitions as in distributional vector-space models.
- Efficient to run, scales well, modest memory requirement.
- Dense vectors are convenient to work with.
- Still helpful to think of the context types.

Software
- Build your own word representations.