Page 1

A practical introduction to distributional semantics

PART I: Co-occurrence matrix models

Marco Baroni

Center for Mind/Brain Sciences, University of Trento

Symposium on Semantic Text Processing, Bar Ilan University, November 2014

Page 2

Acknowledging...

Georgiana Dinu

COMPOSES: COMPositional Operations in SEmantic Space

Page 3

The vastness of word meaning

Page 4

The distributional hypothesis
Harris, Charles and Miller, Firth, Wittgenstein? ...

The meaning of a word is (can be approximated by, learned from) the set of contexts in which it occurs in texts

We found a little, hairy wampimuk sleeping behind the tree

See also McDonald & Ramscar, CogSci 2001

Page 5

Distributional semantic models in a nutshell
“Co-occurrence matrix” models; see Yoav’s part for neural models

- Represent words through vectors recording their co-occurrence counts with context elements in a corpus
- (Optionally) apply a re-weighting scheme to the resulting co-occurrence matrix
- (Optionally) apply dimensionality reduction techniques to the co-occurrence matrix
- Measure geometric distance of word vectors in “distributional space” as proxy to semantic similarity/relatedness

Page 6

Co-occurrence

he curtains open and the moon shining in on the barely
ars and the cold , close moon " . And neither of the w
rough the night with the moon shining so brightly , it
made in the light of the moon . It all boils down , wr
surely under a crescent moon , thrilled by ice-white
sun , the seasons of the moon ? Home , alone , Jay pla
m is dazzling snow , the moon has risen full and cold
un and the temple of the moon , driving out of the hug
in the dark and now the moon rises , full and amber a
bird on the shape of the moon over the trees in front
But I could n’t see the moon or the stars , only the
rning , with a sliver of moon hanging among the stars
they love the sun , the moon and the stars . None of
the light of an enormous moon . The plash of flowing w
man ’s first step on the moon ; various exhibits , aer
the inevitable piece of moon rock . Housing The Airsh
oud obscured part of the moon . The Allied guns behind

Page 7

Extracting co-occurrence counts
Variations in context features

Documents as contexts:

        Doc1  Doc2  Doc3
stars     38    45     2

Word patterns as contexts:

        The nearest • to Earth    stories of • and their
stars                       12                        10

Dependency-parsed contexts (from "see bright shiny stars", with dobj and mod arcs):

        dobj←−−see    mod−−→bright    mod−−→shiny
stars           38              45             44

Page 8

Extracting co-occurrence counts
Variations in the definition of co-occurrence

E.g.: Co-occurrence with words, window of size 2, scaling by distance to target:

... two [intensely bright stars in the] night sky ...

        intensely  bright  in   the
stars         0.5       1   1   0.5
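To make the recipe concrete, here is a minimal Python sketch of window-based counting with 1/distance scaling as illustrated above; the helper name and the toy sentence are ours, and a real extractor would stream over a full corpus.

from collections import defaultdict

def cooccurrence_counts(tokens, window=2, scale_by_distance=True):
    # counts[target][context] accumulates (optionally distance-scaled) co-occurrence weights
    counts = defaultdict(lambda: defaultdict(float))
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            counts[target][tokens[j]] += (1.0 / abs(i - j)) if scale_by_distance else 1.0
    return counts

tokens = "two intensely bright stars in the night sky".split()
print(dict(cooccurrence_counts(tokens)["stars"]))
# {'intensely': 0.5, 'bright': 1.0, 'in': 1.0, 'the': 0.5}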

Page 9

Same corpus (BNC), different window sizes
Nearest neighbours of dog

2-word window: cat, horse, fox, pet, rabbit, pig, animal, mongrel, sheep, pigeon

30-word window: kennel, puppy, pet, bitch, terrier, rottweiler, canine, cat, to bark, Alsatian

Page 10

From co-occurrences to vectors

        bright  in  sky
stars        8  10    6
sun         10  15    4
dog          2  20    1

Page 11

Weighting

Re-weight the counts using corpus-level statistics to reflect co-occurrence significance

Positive Pointwise Mutual Information (PPMI)

PPMI(target, ctxt) = max(0, log [ P(target, ctxt) / (P(target) P(ctxt)) ])
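A minimal numpy sketch of the PPMI re-weighting defined above, applied to the toy stars/sun/dog counts from the earlier slide; the function name and the handling of zero counts are our choices.

import numpy as np

def ppmi(counts):
    # counts: (targets x contexts) matrix of raw co-occurrence counts
    total = counts.sum()
    p_tc = counts / total                              # P(target, ctxt)
    p_t = counts.sum(axis=1, keepdims=True) / total    # P(target)
    p_c = counts.sum(axis=0, keepdims=True) / total    # P(ctxt)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_tc / (p_t * p_c))
    pmi[~np.isfinite(pmi)] = 0.0                       # zero counts: undefined PMI -> 0
    return np.maximum(pmi, 0.0)                        # keep only positive associations

counts = np.array([[8., 10., 6.],     # stars: bright, in, sky
                   [10., 15., 4.],    # sun
                   [2., 20., 1.]])    # dog
print(np.round(ppmi(counts), 2))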

Page 12

Weighting

Adjusting raw co-occurrence counts:

        bright     in
stars      385  10788  ...   ← Counts
stars     43.6    5.3  ...   ← PPMI

Other weighting schemes:
- TF-IDF
- Local Mutual Information
- Dice

See Ch. 4 of J.R. Curran’s thesis (2004) and S. Evert’s thesis (2007) for surveys of weighting methods

Page 13

Dimensionality reduction

- Vector spaces often range from tens of thousands to millions of context dimensions
- Some of the methods to reduce dimensionality:
  - Select context features based on various relevance criteria
  - Random indexing
  - The following are also claimed to have a beneficial smoothing effect:
    - Singular Value Decomposition
    - Non-negative matrix factorization
    - Probabilistic Latent Semantic Analysis
    - Latent Dirichlet Allocation

Page 14

The SVD factorization

Image courtesy of Yoav
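A minimal sketch of how the factorization is typically used: take the top-k singular vectors of the (re-weighted) co-occurrence matrix and use the scaled left singular vectors as reduced word vectors. numpy's dense SVD is fine for a toy matrix; for realistic vocabularies one would use a sparse truncated solver such as scipy.sparse.linalg.svds. All names here are ours.

import numpy as np

def svd_reduce(X, k=2):
    # X ≈ U_k S_k V_k^T ; rows of U_k S_k serve as k-dimensional word vectors
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * S[:k]

X = np.array([[8., 10., 6.],
              [10., 15., 4.],
              [2., 20., 1.]])
print(svd_reduce(X, k=2))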

Page 15

Dimensionality reduction as “smoothing”

(Figure: words plotted along the “buy” and “sell” dimensions.)

Page 16

From geometry to similarity in meaning

(Figure: stars and sun plotted as vectors in a 2-dimensional space.)

Vectors

stars  2.5  2.1
sun    2.9  3.1

Cosine similarity

cos(x, y) = ⟨x, y⟩ / (‖x‖ ‖y‖) = Σ_{i=1..n} x_i y_i / ( √(Σ_{i=1..n} x_i²) · √(Σ_{i=1..n} y_i²) )

Other similarity measures: Euclidean Distance, Dice, Jaccard, Lin...
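The same cosine in a few lines of numpy, using the toy stars/sun vectors from this slide (names ours):

import numpy as np

def cosine(x, y):
    # cos(x, y) = <x, y> / (||x|| ||y||)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

stars = np.array([2.5, 2.1])
sun = np.array([2.9, 3.1])
print(round(cosine(stars, sun), 3))   # ≈ 0.993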

Page 17

Geometric neighbours ≈ semantic neighbours

rhino        fall         good       sing
woodpecker   rise         bad        dance
rhinoceros   increase     excellent  whistle
swan         fluctuation  superb     mime
whale        drop         poor       shout
ivory        decrease     improved   sound
plover       reduction    perfect    listen
elephant     logarithm    clever     recite
bear         decline      terrific   play
satin        cut          lucky      hear
sweatshirt   hike         smashing   hiss

Page 18

Benchmarks: Similarity/relatedness

E.g.: Rubenstein and Goodenough, WordSim-353, MEN, SimLex-999...

MEN

chapel   church      0.45
eat      strawberry  0.33
jump     salad       0.06
bikini   pizza       0.01

How: Measure correlation of model cosines with human similarity/relatedness judgments

Top MEN Spearman correlation for co-occurrence matrix models (Baroni et al. ACL 2014): 0.72
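A minimal sketch of this evaluation recipe: compute the model cosine for each word pair and correlate with the human scores. Data loading is omitted and all names are ours; scipy provides the Spearman correlation.

import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(word_vectors, pairs):
    # pairs: iterable of (word1, word2, human_score); word_vectors: dict word -> numpy array
    model_scores, gold_scores = [], []
    for w1, w2, gold in pairs:
        v1, v2 = word_vectors[w1], word_vectors[w2]
        model_scores.append(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
        gold_scores.append(gold)
    return spearmanr(model_scores, gold_scores).correlation

# e.g. evaluate_similarity(vecs, [("chapel", "church", 0.45), ("jump", "salad", 0.06)])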

Page 19

Benchmarks: Categorization

E.g.: Almuhareb/Poesio, ESSLLI 2008 Shared Task, Battig set

ESSLLI

VEHICLE      MAMMAL
helicopter   dog
motorcycle   elephant
car          cat

How: Feed model-produced similarity matrix to clustering algorithm, look at overlap between clusters and gold categories

Top ESSLLI cluster purity for co-occurrence matrix models (Baroni et al. ACL 2014): 0.84
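A sketch of the categorization benchmark. The slide feeds a similarity matrix to a clustering algorithm; as a simplification this version clusters the word vectors directly with k-means and then computes cluster purity. All names, and the use of scikit-learn, are our choices.

import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def cluster_purity(word_vectors, gold_labels):
    # gold_labels: dict word -> category; word_vectors: dict word -> numpy array
    words = sorted(gold_labels)
    X = np.vstack([word_vectors[w] for w in words])
    k = len(set(gold_labels.values()))
    assignment = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    correct = 0
    for cluster_id in set(assignment):
        members = [gold_labels[w] for w, c in zip(words, assignment) if c == cluster_id]
        correct += Counter(members).most_common(1)[0][1]   # size of the majority class
    return correct / len(words)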

Page 20

Benchmarks: Selectional preferences

E.g.: Ulrike Padó, Ken McRae et al.’s data sets

Padó

eat  villager  obj   1.7
eat  pizza     obj   6.8
eat  pizza     subj  1.1

How (Erk et al. CL 2010): 1) Create “prototype” argument vector by averaging vectors of nouns typically occurring as argument fillers (e.g., frequent objects of to eat); 2) measure cosine of target noun with prototype (e.g., cosine of villager vector with eat-object prototype vector); 3) correlate with human scores

Top Padó Spearman correlation for co-occurrence matrix models (Baroni et al. ACL 2014): 0.41
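A minimal sketch of the prototype-based recipe described above (Erk et al. CL 2010): average the vectors of typical fillers, then score a candidate noun by its cosine with that prototype. The filler list and all names are illustrative.

import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def selectional_fit(word_vectors, candidate, typical_fillers):
    # prototype = average vector of nouns typically filling the argument slot
    prototype = normalize(np.mean([word_vectors[w] for w in typical_fillers], axis=0))
    return float(np.dot(normalize(word_vectors[candidate]), prototype))

# e.g. selectional_fit(vecs, "villager", ["pizza", "bread", "meat", "apple"])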

Page 21

Selectional preferences
Examples from the Baroni/Lenci implementation

To kill...

object         cosine      with         cosine
kangaroo       0.51        hammer       0.26
person         0.45        stone        0.25
robot          0.15        brick        0.18
hate           0.11        smile        0.15
flower         0.11        flower       0.12
stone          0.05        antibiotic   0.12
fun            0.05        person       0.12
book           0.04        heroin       0.12
conversation   0.03        kindness     0.07
sympathy       0.01        graduation   0.04

Page 22

Benchmarks: Analogy

Method and data sets from Mikolov and collaborators

syntactic analogy      semantic analogy
work    speak          brother   grandson
works   speaks         sister    granddaughter

vec(speaks) ≈ vec(works) − vec(work) + vec(speak)

How: Response counts as hit only if the nearest neighbour (in a large vocabulary) of the vector obtained with the subtraction and addition operations above is the intended one

Top accuracy for co-occurrence matrix models (Baroni et al. ACL 2014): 0.49
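A minimal numpy sketch of this evaluation (often called 3CosAdd): build vec(b) − vec(a) + vec(c) and check whether its nearest neighbour in the vocabulary, excluding the three question words, is the expected answer. W is assumed to be a row-normalized |V| x d matrix; all names are ours.

import numpy as np

def analogy(W, words, w2i, a, b, c, topn=1):
    # a : b  ::  c : ?   ->   query = vec(b) - vec(a) + vec(c)
    query = W[w2i[b]] - W[w2i[a]] + W[w2i[c]]
    query /= np.linalg.norm(query)
    sims = W.dot(query)
    for w in (a, b, c):                 # never return the question words themselves
        sims[w2i[w]] = -np.inf
    return [words[i] for i in np.argsort(-sims)[:topn]]

# e.g. analogy(W, words, w2i, "work", "works", "speak") should return ["speaks"]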

Page 23

Distributional semantics: A general-purpose representation of lexical meaning
Baroni and Lenci 2010

- Similarity (cord-string vs. cord-smile)
- Synonymy (zenith-pinnacle)
- Concept categorization (car ISA vehicle; banana ISA fruit)
- Selectional preferences (eat topinambur vs. *eat sympathy)
- Analogy (mason is to stone like carpenter is to wood)
- Relation classification (exam-anxiety are in CAUSE-EFFECT relation)
- Qualia (TELIC ROLE of novel is to entertain)
- Salient properties (car-wheels, dog-barking)
- Argument alternations (John broke the vase - the vase broke, John minces the meat - *the meat minced)

Page 24

Practical recommendations
Mostly from Baroni et al. ACL 2014; see more evaluation work in the reading list below

- Narrow context windows are best (1, 2 words left and right)
- Full matrix better than dimensionality reduction
- PPMI weighting best
- Dimensionality reduction with SVD better than with NMF

Page 25

An example application
Bilingual lexicon/phrase table induction from monolingual resources

Saluja et al. (ACL 2014) obtain significant improvements in English-Urdu and English-Arabic BLEU scores using phrase tables enlarged with pairs induced by exploiting distributional similarity structure in source and target languages

Figure credit: Mikolov et al. 2013

Page 26

The infinity of sentence meaning

Page 27

Compositionality
The meaning of an utterance is a function of the meaning of its parts and their composition rules (Frege 1892)

Page 28

Compositional distributional semantics: What for?

Word meaning in context (Mitchell and Lapata ACL 2008)


Paraphrase detection (Blacoe and Lapata EMNLP 2012)

(Figure: phrase vectors plotted in a 2-dimensional space, dim 1 vs. dim 2:
“cookie dwarfs hop under the crimson planet”
“gingerbread gnomes dance under the red moon”
“red gnomes love gingerbread cookies”
“students eat cup noodles”)

Page 29

Compositional distributional semantics: How?

From: Simple functions

vec(very) + vec(good) + vec(movie) = vec(very good movie)

Mitchell and Lapata ACL 2008

To: Complex composition operations

Socher et al. EMNLP 2013
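A minimal sketch of the simple additive model on the left (Mitchell and Lapata 2008): the phrase vector is just the sum of its word vectors. Names are ours; Mitchell and Lapata also explore component-wise multiplication and weighted variants.

import numpy as np

def compose_additive(word_vectors, phrase):
    # phrase vector = sum of the word vectors
    return np.sum([word_vectors[w] for w in phrase.split()], axis=0)

# e.g. compose_additive(vecs, "very good movie")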

Page 30

Some references

- Classics:
  - Schütze’s 1997 CSLI book
  - Landauer and Dumais PsychRev 1997
  - Griffiths et al. PsychRev 2007
- Overviews:
  - Turney and Pantel JAIR 2010
  - Erk LLC 2012
  - Baroni LLC 2013
  - Clark, to appear in Handbook of Contemporary Semantics
- Evaluation:
  - Sahlgren’s 2006 thesis
  - Bullinaria and Levy BRM 2007, 2012
  - Baroni, Dinu and Kruszewski ACL 2014
  - Kiela and Clark CVSC 2014

Page 31

Fun with distributional semantics!

http://clic.cimec.unitn.it/infomap-query/

Page 32

Making Sense of Distributed (Neural) Semantics

Yoav Goldberg, [email protected]

Nov 2014

Page 33

From Distributional to Distributed Semantics

The new kid on the block
- Deep learning / neural networks
- “Distributed” word representations

- Feed text into neural-net. Get back “word embeddings”.
- Each word is represented as a low-dimensional vector.
- Vectors capture “semantics”

- word2vec (Mikolov et al.)

Page 34

From Distributional to Distributed Semantics

This part of the talk

- word2vec as a black box
- a peek inside the black box
- relation between word embeddings and the distributional representation
- tailoring word embeddings to your needs using word2vecf

Page 35

word2vec

Page 37

word2vec

- dog
  - cat, dogs, dachshund, rabbit, puppy, poodle, rottweiler, mixed-breed, doberman, pig
- sheep
  - cattle, goats, cows, chickens, sheeps, hogs, donkeys, herds, shorthorn, livestock
- november
  - october, december, april, june, february, july, september, january, august, march
- jerusalem
  - tiberias, jaffa, haifa, israel, palestine, nablus, damascus, katamon, ramla, safed
- teva
  - pfizer, schering-plough, novartis, astrazeneca, glaxosmithkline, sanofi-aventis, mylan, sanofi, genzyme, pharmacia

Page 38

Working with Dense Vectors

Word Similarity

- Similarity is calculated using cosine similarity:

  sim(~dog, ~cat) = (~dog · ~cat) / (‖~dog‖ ‖~cat‖)

- For normalized vectors (‖x‖ = 1), this is equivalent to a dot product:

  sim(~dog, ~cat) = ~dog · ~cat

- Normalize the vectors when loading them.

Page 39

Working with Dense Vectors

Finding the most similar words to ~dog

- Compute the similarity from word ~v to all other words.
- This is a single matrix-vector product: W · ~v⊤
- Result is a |V|-sized vector of similarities.
- Take the indices of the k highest values.
- FAST! For 180k words, d=300: ∼30ms

Page 43

Working with Dense Vectors

Most Similar Words, in python+numpy code

# W and words are numpy arrays; load_and_normalize_vectors is assumed to
# read the vectors from vecs.txt and L2-normalize each row of W.
W, words = load_and_normalize_vectors("vecs.txt")
w2i = {w: i for i, w in enumerate(words)}

dog = W[w2i['dog']]   # get the dog vector

sims = W.dot(dog)     # compute similarities (cosines, since rows are normalized)

most_similar_ids = sims.argsort()[-1:-10:-1]   # indices of the 9 highest similarities, best first
sim_words = words[most_similar_ids]

Page 44

Working with Dense Vectors

Similarity to a group of words

- “Find me words most similar to cat, dog and cow”.
- Calculate the pairwise similarities and sum them:

W · ~cat + W · ~dog + W · ~cow

- Now find the indices of the highest values as before.
- Matrix-vector products are wasteful. Better option:

W · ( ~cat + ~dog + ~cow)
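Continuing the numpy snippet from the earlier slide (W, words, w2i as defined there), the summed-query trick looks like this; the ranking is identical to summing the three separate similarity vectors because W · (cat + dog + cow) = W · cat + W · dog + W · cow.

query = W[w2i['cat']] + W[w2i['dog']] + W[w2i['cow']]   # one summed query vector
sims = W.dot(query)                                     # a single matrix-vector product
group_similar = words[sims.argsort()[-1:-11:-1]]        # top 10, most similar first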

Page 46

Working with dense word vectors can be very efficient.

But where do these vectors come from?

Page 48

How does word2vec work?

word2vec implements several different algorithms:

Two training methods

- Negative Sampling
- Hierarchical Softmax

Two context representations

- Continuous Bag of Words (CBOW)
- Skip-grams

We’ll focus on skip-grams with negative sampling

intuitions apply for other models as well

Page 50: A practical introduction to distributional semanticsclic.cimec.unitn.it/composes/materials/practical-ds-marco-yoav.pdf · A practical introduction to distributional semantics

How does word2vec work?

- Represent each word as a d-dimensional vector.
- Represent each context as a d-dimensional vector.
- Initialize all vectors to random weights.
- Arrange vectors in two matrices, W and C.

Page 51

How does word2vec work?
While more text:

- Extract a word window:
  A springer is [ a cow or heifer close to calving ] .
                 c1  c2  c3   w     c4    c5    c6

- w is the focus word vector (row in W).
- ci are the context word vectors (rows in C).

- Try setting the vector values such that:

σ(w · c1) + σ(w · c2) + σ(w · c3) + σ(w · c4) + σ(w · c5) + σ(w · c6)

is high

- Create a corrupt example by choosing a random word w′:
  [ a cow or comet close to calving ]
   c1  c2  c3   w′    c4   c5    c6

- Try setting the vector values such that:

σ(w′ · c1) + σ(w′ · c2) + σ(w′ · c3) + σ(w′ · c4) + σ(w′ · c5) + σ(w′ · c6)

is low
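A minimal numpy sketch of the update this slide describes: for every context in the window, push σ(w · ci) up for the true focus word and push σ(w′ · ci) down for the randomly chosen corrupt word. Function and argument names are ours, and real word2vec adds a unigram-based noise distribution, subsampling of frequent words, learning-rate decay, etc.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_window_update(W, C, true_id, corrupt_id, context_ids, lr=0.025):
    # W: focus-word vectors, C: context vectors (both |V| x d numpy arrays)
    for word_id, label in [(true_id, 1.0), (corrupt_id, 0.0)]:
        w = W[word_id].copy()
        grad_w = np.zeros_like(w)
        for c_id in context_ids:
            g = label - sigmoid(np.dot(w, C[c_id]))   # push sigma(w . c) towards the label
            grad_w += g * C[c_id]
            C[c_id] += lr * g * w
        W[word_id] += lr * grad_w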

Page 54

How does word2vec work?

The training procedure results in:
- w · c for good word-context pairs is high
- w · c for bad word-context pairs is low
- w · c for ok-ish word-context pairs is neither high nor low

As a result:
- Words that share many contexts get close to each other.
- Contexts that share many words get close to each other.

At the end, word2vec throws away C and returns W.

Page 55

Reinterpretation

Imagine we didn’t throw away C. Consider the product WC⊤

The result is a matrix M in which:
- Each row corresponds to a word.
- Each column corresponds to a context.
- Each cell corresponds to w · c, an association measure between a word and a context.

Page 57

Reinterpretation

Does this remind you of something?

Very similar to SVD over distributional representation:

Page 59

Relation between SVD and word2vec

SVD
- Begin with a word-context matrix.
- Approximate it with a product of low-rank (thin) matrices.
- Use the thin matrix as the word representation.

word2vec (skip-grams, negative sampling)

- Learn thin word and context matrices.
- These matrices can be thought of as approximating an implicit word-context matrix.
- In Levy and Goldberg (NIPS 2014) we show that this implicit matrix is related to the well-known PPMI matrix.
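The Levy and Goldberg result says that skip-gram with k negative samples implicitly factorizes a matrix whose cells are PMI(w, c) − log k. A minimal sketch of building that shifted positive PMI matrix from raw counts (names and the zero-count handling are our choices):

import numpy as np

def shifted_ppmi(counts, k=5):
    # max(PMI(w, c) - log k, 0): the association matrix implicitly factorized by SGNS
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0          # zero counts contribute nothing
    return np.maximum(pmi - np.log(k), 0.0)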

Page 60

Relation between SVD and word2vec

word2vec is a dimensionality reduction technique over an (implicit) word-context matrix.

Just like SVD.

With a few tricks (Levy, Goldberg and Dagan, in submission) we can get SVD to perform just as well as word2vec.

However, word2vec...

- ...works without building / storing the actual matrix in memory.
- ...is very fast to train, can use multiple threads.
- ...can easily scale to huge data and very large word and context vocabularies.

Page 62

Beyond word2vec

Page 63

Beyond word2vec

- word2vec is factorizing a word-context matrix.
- The content of this matrix affects the resulting similarities.
- word2vec allows you to specify a window size.
- But what about other types of contexts?

- Example: dependency contexts (Levy and Goldberg, ACL 2014)

Page 64

Australian scientist discovers star with telescope

Bag of Words (BoW) Context

Page 67

Australian scientist discovers star with telescope

Syntactic Dependency Context

(Dependency arcs: nsubj, dobj, prep_with)

Page 69

Embedding Similarity with Different Contexts

Target word: Hogwarts (Harry Potter’s school)

Bag of Words (k=5) neighbours: Dumbledore, hallows, half-blood, Malfoy, Snape (related to Harry Potter)
Dependency neighbours: Sunnydale, Collinwood, Calarts, Greendale, Millfield (schools)

Page 70

Embedding Similarity with Different Contexts

Target word: Turing (computer scientist)

Bag of Words (k=5) neighbours: nondeterministic, non-deterministic, computability, deterministic, finite-state (related to computability)
Dependency neighbours: Pauling, Hotelling, Heting, Lessing, Hamming (scientists)

Page 71

Online Demo!

Embedding Similarity with Different Contexts

Target word: dancing (dance gerund)

Bag of Words (k=5) neighbours: singing, dance, dances, dancers, tap-dancing (related to dance)
Dependency neighbours: singing, rapping, breakdancing, miming, busking (gerunds)

Page 72

Context matters

Choose the correct contexts for your application

- larger window sizes – more topical
- dependency relations – more functional
- only noun-adjective relations
- only verb-subject relations
- context: time of the current message
- context: user who wrote the message
- ...
- the sky is the limit

Page 76

Software

word2vecf
https://bitbucket.org/yoavgo/word2vecf

- Extension of word2vec.
- Allows saving the context matrix.
- Allows using arbitrary contexts.
- Input is a (large) file of word-context pairs.

Page 77

Software

hyperwords
https://bitbucket.org/omerlevy/hyperwords/

- Python library for working with either sparse or dense word vectors (similarity, analogies).
- Scripts for creating dense representations using word2vecf or SVD.
- Scripts for creating sparse distributional representations.

Page 78

Software

dissect
http://clic.cimec.unitn.it/composes/toolkit/

- Given vector representations of words...
- ...derive vector representations of phrases/sentences
- Implements various composition methods

Page 79

Summary
Distributional Semantics

- Words in similar contexts have similar meanings.
- Represent a word by the contexts it appears in.
- But what is a context?

Neural Models (word2vec)

- Represent each word as a dense, low-dimensional vector.
- Same intuitions as in distributional vector-space models.
- Efficient to run, scales well, modest memory requirement.
- Dense vectors are convenient to work with.
- Still helpful to think of the context types.

Software
- Build your own word representations.