
Page 1: Clustering Word Senses Eneko Agirre, Oier Lopez de Lacalle IxA NLP group

Clustering Word Senses

Eneko Agirre, Oier Lopez de Lacalle

IxA NLP group
http://ixa.si.ehu.es

Eneko Agirre – IXA NLP group – University of the Basque Country GWNC 2004 - 2

Introduction: motivation

• The desired granularity of word sense distinctions is controversial
• Fine-grained word senses are unnecessary for some applications:
  MT: channel (tv, strait) is translated as kanal
  The Senseval-2 WSD competition also provides coarse-grained senses
• The desired sense groupings depend on the application:
  MT: same translation (language-pair dependent)
  IR: some related senses: metonymic, diathesis, specialization
  Dialogue (deeper NLP): in principle, all word senses, in order to do proper inferences
• WSD needs to be tuned, multiple senses returned: clustering of word senses

Introduction: a sample word

• Channel has 7 senses and 4 coarse-grained senses (Senseval-2):

  Mnemo.   Channel definitions
  water    4. channel -- (a deep and relatively narrow body of water that allows the best passage for vessels)
  passage  2. channel -- (a passage for water (or other fluids) to flow through)
  body     6. duct, epithelial duct, canal, channel -- (a bodily passage or tube lined with epithelial cells and conveying a secretion or other substance)
  groove   3. groove, channel -- (a long narrow furrow cut either by a natural process (such as erosion) or by a tool)
  tv       7. channel, television channel, TV channel -- (a television station and its programs)
  signals  1. channel, transmission channel -- (a path over which electrical signals can pass)
  comms    5. channel, communication channel, line -- ((often plural) a means of communication or access)

Introduction

• Work presented here: test the quality of 4 clustering methods:
  2 based on distributional similarity
  confusion matrix of Senseval-2 systems
  translation equivalences
• Result: hierarchical clusters
• Clustering algorithms: CLUTO toolkit
• Evaluation: Senseval-2 coarse-grained senses

Clustering toolkit used

• CLUTO (Karypis 2001)
• Possible inputs:
  a context vector for each word sense (from corpora)
  a similarity matrix (built from any source)
• A number of clustering parameters
• Output: hierarchical or flat clusters

Distributional similarity methods

• Hypothesis: two word senses are similar if they are used in similar contexts

1. Clustering directly over the examples
2. Clustering over similarity among topic signatures
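Both approaches boil down to clustering vectors by similarity. The paper uses the CLUTO toolkit for this; purely as an illustration, the following is a minimal sketch of average-link agglomerative clustering over cosine similarities, with made-up context vectors for three hypothetical senses of channel:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def average_link(vectors, k):
    """Agglomerative clustering: repeatedly merge the pair of clusters
    with the highest average pairwise similarity, down to k clusters."""
    clusters = [[s] for s in vectors]
    def avg_sim(c1, c2):
        return sum(cosine(vectors[a], vectors[b])
                   for a in c1 for b in c2) / (len(c1) * len(c2))
    while len(clusters) > k:
        i, j = max(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy context vectors (invented counts) for three senses of "channel"
vectors = {
    "tv":      Counter({"television": 5, "station": 3, "program": 2}),
    "signals": Counter({"transmission": 4, "signal": 3, "station": 1}),
    "water":   Counter({"river": 4, "vessel": 2, "narrow": 2}),
}
print(average_link(vectors, 2))  # tv and signals share context, water does not
```

The tv and signals senses end up in the same cluster because their context vectors overlap (both mention "station"), while the water sense shares no vocabulary with them.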

1. Clustering directly from examples

1. Take examples from tagged data (Senseval-2), OR retrieve sense-examples from the web:
   E.g. if we want examples of the first sense of channel, use examples of its monosemous synonym: transmission channel
   We use: synonyms, hypernyms, all hyponyms, siblings
   1000 snippets for each monosemous term from Google
   Resource freely available (contact us)
2. Cluster the examples as if they were documents

2. Clustering over similarity among TS

1. Retrieve the examples
2. Build topic signatures: a vector of the words in the context of a word sense, with high weights for distinguishing words:
   1. sense: channel, transmission_channel -- "a path over which electrical signals can pass"
      medium(3110.34) optic(2790.34) transmission(2547.13) electronic(1553.85) channel(1352.44) mass(1191.12) fiber(1070.28) public(831.41) fibre(716.95) communication(631.38) technology(368.66) system(363.39) datum(308.50) ...
3. Build a similarity matrix of the TS
4. Cluster

3. Confusion matrix method
• Hypothesis: sense A is similar to sense B if many WSD algorithms tag occurrences of A as B
• Implemented using results from all Senseval-2 systems

4. Translation similarity method
• Hypothesis: two word senses are similar if they are translated in the same way in a number of languages
• (Resnik & Yarowsky, 2000)
• Similarity matrix kindly provided by Chugur & Gonzalo (2002)

Experiment and results: by method

• Best results for distributional similarity: topic signatures from web data

  Method                     Purity
  Random                     0.748
  Confusion matrices         0.768
  Multilingual similarity    0.799
  TS Senseval (worst, best)  0.744, 0.806
  TS Web (worst, best)       0.764, 0.840

Word by word

noun       #senses  #clusters  #senseval  #web    purity
art        4        2          275        23391   0.750
authority  7        4          262        108560  0.571
bar        13       10         360        75792   0.769
bum        4        3          115        25655   1.000
chair      4        2          201        38057   0.750
channel    7        4          181        46493   0.714
child      4        2          189        70416   0.750
circuit    6        4          247        33754   0.833
day        9        5          427        223899  1.000
facility   5        2          172        17878   1.000
fatigue    4        3          116        8596    1.000
feeling    6        4          153        14569   1.000
hearth     3        2          93         10813   0.667
mouth      8        5          171        1585    0.833
nation     4        2          101        1149    1.000
nature     5        3          137        44242   0.600
post       8        5          201        55049   0.625
restraint  6        4          134        49905   0.667
sense      5        4          158        13688   0.800
stress     5        3          112        14528   0.800

Conclusions

• Meaningful hierarchical clusters
  For all WordNet nominal synsets (soon)
  Using web data and distributional similarity
  All data freely available (MEANING)

But...
• Are the clusters useful for the detection of relations (homonymy, metonymy, metaphor, ...) among word senses? Which clusters?
• Are the clusters useful for applications? WSD (ongoing work); MT, IR, CLIR, Dialogue. Which clusters?

Thank you!

An example of a Topic signature

http://ixa3.si.ehu.es/cgi-bin/signatureak/signaturecgi.cgi
Source: web examples using monosemous relatives

1. sense: channel, transmission_channel -- "a path over which electrical signals can pass"
   medium(3110.34) optic(2790.34) transmission(2547.13) electronic(1553.85) channel(1352.44) mass(1191.12) fiber(1070.28) public(831.41) fibre(716.95) communication(631.38) technology(368.66) system(363.39) datum(308.50)

5. sense: channel, communication_channel, line -- "(often plural) a means of communication or access"
   service(3360.26) postal(2503.25) communication(1868.81) mail(1402.33) communicate(1086.16) us(651.30) channel(479.36) communicating(340.82) united(196.55) protocol(170.02) music(165.93) london(162.61) drama(160.95)

7. sense: channel, television_channel, TV_channel -- "a television station and its programs"
   station(24288.54) television(13759.75) tv(13226.62) broadcast(1773.82) local(1115.18) radio(646.33) newspaper(333.57) affiliated(301.73) programming(283.02) pb(257.88) own(233.25) independent(230.88)

Experiment and results: an example

• Sample cluster built for channel:

• Entropy: 0.286, Purity: 0.714.
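Purity and entropy can be computed with the usual CLUTO-style definitions: purity is the weighted fraction of items falling in their cluster's majority class; entropy is the weighted, log-normalized class entropy of each cluster. The gold coarse classes and the cluster assignment for channel below are hypothetical, chosen only to illustrate the computation (they happen to reproduce the figures above):

```python
from math import log

def purity_entropy(clusters, gold, q):
    """CLUTO-style purity and entropy of a clustering.
    clusters: list of lists of items; gold: item -> class label;
    q: number of gold classes (used to normalize entropy)."""
    n = sum(len(c) for c in clusters)
    purity = entropy = 0.0
    for c in clusters:
        counts = {}
        for item in c:
            counts[gold[item]] = counts.get(gold[item], 0) + 1
        purity += max(counts.values()) / n          # majority class, weighted by size
        e = -sum((k / len(c)) * log(k / len(c))
                 for k in counts.values()) / log(q)  # normalized cluster entropy
        entropy += (len(c) / n) * e
    return purity, entropy

# Hypothetical grouping: 7 channel senses, 4 clusters, 4 gold coarse classes
gold = {"water": "phys", "passage": "phys", "groove": "phys",
        "body": "anat", "tv": "media", "signals": "comm", "comms": "comm"}
clusters = [["water", "passage"], ["groove", "body"],
            ["tv", "signals"], ["comms"]]
p, e = purity_entropy(clusters, gold, q=4)
print(round(p, 3), round(e, 3))  # -> 0.714 0.286
```

Five of the seven senses sit in their cluster's majority class, giving purity 5/7 = 0.714; the two mixed two-item clusters contribute the entropy of 2/7 = 0.286.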

1. Clustering directly from examples: retrieving sense-examples from the web

• Examples of word senses are scarce
• Alternatively, automatically acquire examples from corpora (or the web)
  In this paper we follow the monosemous relatives method (Leacock et al. 1998)
  E.g. if we want examples of the first sense of channel, use examples of its monosemous synonym: transmission channel
  We use: synonyms, hypernyms, all hyponyms, siblings
  1000 snippets for each monosemous term from Google
  Heuristics to extract partial or full meaningful sentences
• More details of the method in (Agirre et al. 2001)
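The relative-selection step can be sketched as a filter over candidate relatives; this toy version (the relations, words, and sense counts below are invented, and the real method works over WordNet and then queries Google for snippets) keeps only candidates whose word has exactly one sense:

```python
def monosemous_relatives(relatives, sense_count):
    """Keep only relatives (synonyms, hypernyms, hyponyms, siblings)
    that are monosemous, i.e. whose word has exactly one sense."""
    keep = []
    for relation, words in relatives.items():
        for w in words:
            if sense_count.get(w, 0) == 1:
                keep.append((relation, w))
    return keep

# Toy candidates for channel sense 1 (words and counts are made up)
relatives = {
    "synonyms":  ["transmission_channel"],
    "hypernyms": ["communication"],
    "hyponyms":  ["band", "uplink"],
    "siblings":  ["line"],
}
sense_count = {"transmission_channel": 1, "communication": 7,
               "band": 12, "uplink": 1, "line": 29}
print(monosemous_relatives(relatives, sense_count))
```

Each surviving monosemous term would then be sent to the search engine, and its snippets treated as examples of the original sense.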

2. Clustering over similarity among TS: building topic signatures

• Given a set of examples for each word sense...
• ... build a vector for each word sense: each word in the vocabulary is a dimension
• Steps:
  1. Get frequencies for each word in context
  2. Use χ² to assign a weight to each word/dimension, in contrast to the other word senses
  3. Filtering step

• More details of the method in (Agirre et al. 2001)
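Assuming the weight in step 2 is a χ² statistic (the exact weighting is in Agirre et al. 2001; the counts below are invented), the per-word score can be sketched as a 2x2 contingency test of word occurrence in the target sense's contexts versus the other senses' contexts:

```python
def chi2_weight(k_sense, n_sense, k_rest, n_rest):
    """Chi-square score for a 2x2 contingency table:
    word present/absent x target-sense contexts / other-sense contexts."""
    a, b = k_sense, n_sense - k_sense   # present / absent in target sense
    c, d = k_rest, n_rest - k_rest      # present / absent in other senses
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

# Toy counts: "transmission" appears in 40 of 100 sense-1 contexts
# but only in 5 of 300 contexts of the other senses
print(round(chi2_weight(40, 100, 5, 300), 2))  # -> 110.38
```

Words distributed evenly across senses score near zero, so the signature keeps only the words that discriminate the target sense.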

3. Confusion matrix method

• Hypothesis: sense A is similar to sense B if WSD algorithms tag occurrences of A as B

• Implemented using results from all Senseval-2 systems

• Algorithm to produce the similarity matrix:
  M = number of systems
  N(x) = number of occurrences of word sense x
  n(a,b) = number of times sense a is tagged as b
  confusion-similarity(a,b) = n(a,b) / (N(a) * M)

• Not symmetric
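Reading the formula as n(a,b) / (N(a) · M), a toy computation shows the asymmetry. The gold tags and the two systems' outputs below are hypothetical:

```python
def confusion_similarity(taggings, gold):
    """Similarity of sense pairs from system confusions.
    taggings: one list of predicted senses per system, aligned with gold.
    Returns sim[(a, b)] = n(a, b) / (N(a) * M); note this is not symmetric."""
    M = len(taggings)
    N = {}
    for g in gold:                       # N(x): gold occurrences of each sense
        N[g] = N.get(g, 0) + 1
    n = {}
    for preds in taggings:               # n(a, b): times sense a tagged as b
        for g, p in zip(gold, preds):
            n[(g, p)] = n.get((g, p), 0) + 1
    return {(a, b): c / (N[a] * M) for (a, b), c in n.items()}

# Hypothetical outputs of M = 2 systems on 4 occurrences of "channel"
gold = ["tv", "tv", "tv", "signals"]
taggings = [["tv", "signals", "tv", "signals"],   # system 1
            ["tv", "tv", "tv", "tv"]]             # system 2
sim = confusion_similarity(taggings, gold)
print(round(sim[("tv", "signals")], 3), sim[("signals", "tv")])  # -> 0.167 0.5
```

The asymmetry comes from the normalization by N(a): the rarer sense is confused with the frequent one proportionally more often.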

4. Translation similarity method

• Hypothesis: two word senses are similar if they are translated in the same way in a number of languages

• (Resnik & Yarowsky, 2000)
• Similarity matrix kindly provided by Chugur & Gonzalo (2002)
• Simplified algorithm:
  L = number of languages (= 4)
  n(a,b) = number of languages where a and b share a translation
  similarity(a,b) = n(a,b) / L
• The actual formula is more elaborate
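The simplified formula above is easy to sketch; the translations below are illustrative only (and, as the slide notes, the actual Chugur & Gonzalo formula is more elaborate):

```python
def translation_similarity(translations, a, b):
    """similarity(a, b) = number of languages where senses a and b
    share at least one translation, divided by the number of languages."""
    L = len(translations)
    shared = sum(1 for lang in translations
                 if set(translations[lang][a]) & set(translations[lang][b]))
    return shared / L

# Toy translations of two channel senses into L = 4 languages
translations = {
    "es": {"tv": ["canal"],  "signals": ["canal"]},
    "eu": {"tv": ["kate"],   "signals": ["kanal"]},
    "fr": {"tv": ["chaine"], "signals": ["canal"]},
    "it": {"tv": ["canale"], "signals": ["canale"]},
}
print(translation_similarity(translations, "tv", "signals"))  # 2 of 4 -> 0.5
```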

Previous work on WordNet clustering

Use of WordNet structure:
• Peters et al. 1998: WordNet hierarchy, try to identify systematic polysemy
• Tomuro 2001: WordNet hierarchy (MDL), try to identify systematic polysemy (60% precision against WordNet cousins, increase in inter-tagger agreement)

  Our proposal does not look for systematic polysemy. We get individual relations among word senses, e.g. television channel and transmission channel.

• Mihalcea & Moldovan 2001: heuristics on WordNet, WSD improvement (polysemy reduction 26%, error 2.1% in SemCor)

  Provides complementary information.

Previous work (continued)

• Resnik & Yarowsky 2000 (also Chugur & Gonzalo 2002): translations across different languages, improving evaluation metrics (very high correlation with Hector sense hierarchies).

  We only get 80% purity using (Chugur & Gonzalo). Unfortunately the dictionaries are rather different (Senseval-2 results dropped compared to Senseval-1). Difficult to scale to all words.

• Pantel & Lin (2002): induce word senses using soft clustering of word occurrences (overlap with WordNet over 60% precision)

  They use syntactic dependencies rather than bag-of-words vectors.

• Palmer et al. (submitted): criteria for grouping verb senses.
