Corpora and Statistical Methods Lecture 12staff.um.edu.mt/albert.gatt/teaching/dl/statLecture6b.pdfLecture 6 Word sense disambiguation Part 2 What are word senses? Cognitive definition:

Semantic similarity, vector space models and word-

sense disambiguation

Corpora and Statistical Methods

Lecture 6

Word sense disambiguation

Part 2

What are word senses? Cognitive definition:

mental representation of meaning

used in psychological experiments

relies on introspection (notoriously deceptive)

Dictionary-based definition:

adopt sense definitions in a dictionary

most frequently used resource is WordNet

WordNet Taxonomic representation of words (“concepts”)

Each word belongs to a synset, which contains near-synonyms

Each synset has a gloss

Words with multiple senses (polysemy) belong to multiple synsets

Synsets organised by hyponymy (IS-A) relations

How many senses?

Example: interest

pay 3% interest on a loan

showed interest in something

purchased an interest in a company.

the national interest…

have X’s best interest at heart

have an interest in word senses

The economy is run by business interests

Wordnet entry for interest (noun)1. a sense of concern with and curiosity about someone or something …

(Synonym: involvement)

2. the power of attracting or holding one’s interest… (Synonym: interestingness)

3. a reason for wanting something done ( Synonym: sake)

4. a fixed charge for borrowing money…

5. a diversion that occupies one’s time and thoughts … (Synonym: pastime)

6. a right or legal share of something; a financial involvement with something (Synonym: stake)

7. (usually plural) a social group whose members control some field of activity and who have common aims, (Synonym: interest group)

Some issues Are all these really distinct senses? Is WordNet too fine-

grained?

Would native speakers distinguish all these as different?

Cf. The distinction between sense ambiguity and underspecification (vagueness):

one could argue that there are fewer senses, but these are underspecified out of context

Translation equivalence

Many WSD applications rely on translation equivalence

Given: parallel corpus (e.g. English-German)

if word w in English has n translations in German, then each

translation represents a sense

e.g. German translations of interest:

Zins: financial charge (WN sense 4)

Anteil: stake in a company (WN sense 6)

Interesse: all other senses

Some terminology WSD Task: given an ambiguous word, find the intended sense

in context

Sense tagging: task of labelling words as belonging to one sense or another. needs some a priori characterisation of senses of each relevant word

Discrimination: distinguishes between occurrences of words based on senses not necessarily explicit labelling

Some more terminology

Two types of WSD task:

Lexical sample task: focuses on disambiguating a small set of

target words, using an inventory of the senses of those words.

All-words task: focuses on entire texts and a lexicon, where

every word in the text has to be disambiguated

Serious data sparseness problems!

Approaches to WSD All methods rely on training data. Basic idea: Given word w in context c learn how to predict sense s of w based on various features of w

Supervised learning: training data is labelled with correct senses can do sense tagging

Unsupervised learning: training data is unlabelled but many other knowledge sources used

cannot do sense tagging, since this requires a priori senses

Supervised learning Words in training data labelled with their senses She pays 3% interest/INTEREST-MONEY on the loan. He showed a lot of interest/INTEREST-CURIOSITY in the painting.

Similar to POS tagging given a corpus tagged with senses define features that indicate one sense over another learn a model that predicts the correct sense given the features

Features (e.g. plant) Neighbouring words: plant life

manufacturing plant

assembly plant

plant closure

plant species

Content words in a larger window animal

equipment

employee

automatic

Other features

Syntactically related words

e.g. object, subject….

Topic of the text

is it about SPORT? POLITICS?

Part-of-speech tag, surrounding part-of-speech tags

Some principles proposed (Yarowsky 1995)

One sense per discourse: typically, all occurrences of a word will have the same sense in the

same stretch of discourse (e.g. same document)

One sense per collocation: nearby words provide clues as to the sense, depending on the distance

and syntactic relationship

e.g. plant life: all (?) occurrences of plant+life will indicate the botanic sense of plant

Training data SENSEVAL

Shared Task competition

datasets available for WSD, among other things

annotated corpora in many languages

Pseudo-words

create training corpus by artificially conflating words

e.g. all occurrences of man and hammer with man-hammer

easy way to create training data

Multi-lingual parallel corpora

translated texts aligned at the sentence level

translation indicates sense

Data representation

Example sentence: An electric guitar and bass player stand off to

one side...

Target word: bass

Possible senses: fish, musical instrument...

Relevant features are represented as vectors, e.g.:

22111122 ,,,,,,, iiiiiiii POSwPOSwPOSwPOSw

VBstand,NN,player, CC,and,NN,guitar,

Supervised methods

Naïve Bayes Classifier

Identify the features (F)

e.g. surrounding words

other cues apart from surrounding context

Combine evidence from all features

Decision rule: decide on sense s’ iff

Example: drug. F = words in context

medication sense: price, prescription, pharmaceutical

illegal substance sense: alcohol, illicit, paraphernalia

)|()|'(:', FsPFsPsss kkk

Using Bayes’ rule

We usually don’t know P(sk|f) but we can compute from

training data: P(sk) (the prior) and P(f|sk)

P(f) can be eliminated because it is constant for all senses in the

corpus

)(

)()|()|(

fP

sPsfPfsP kk

k

)()|(maxarg

)(

)()|(maxarg

)|(maxarg

kks

kks

ksbest

sPsfP

fP

sPsfP

fsPs

k

k

k

The independence assumption

It’s called “naïve” because:

i.e. all features are assumed to be independent

Obviously, this is often not true. e.g. finding illicit in the context of drug may not be independent of finding pusher.

cf. our discussion of collocations!

Also, topics often constrain word choice.

n

j

kjk sfPsfP1

)|()|(

Training the naive Bayes classifier We need to compute:

P(s) for all senses s of w

P(f|s) for all features f

)(

),()(

wCount

wsCountsP k

k

)(

),()|(

k

kj

kjsCount

sfCountsfP

Information-theoretic measures Find the single, most informative feature to predict a sense.

E.g. using a parallel corpus:

prendre (FR) can translate as take or make

prendre une décision: make a decision

prendre une mesure: take a measure [to…]

Informative feature in this case: direct object

mesure indicates take

décision indicates make

Problem: need to identify the correct value of the feature that indicates a specific sense.

Brown et al’s algorithm1. Given: translations T of word w

2. Given: values X of a useful feature (e.g. mesure, décision as values of DO)

3. Step 1: random partition P of T

4. While improving, do:

create partition Q of X that maximises I(P;Q)

find a partition P of T that maximises I(P;Q)

comment: relies on mutual info to find clusters of translations mapping to clusters of feature values

Using dictionaries and thesauri Lesk (1986): one of the first to exploit dictionary

definitions the definition corresponding to a sense can contain words which

are good indicators for that sense

Method:1. Given: ambiguous word w with senses s1…sn with glosses g1…gn.

2. Given: the word w in context c

3. compute overlap between c & each gloss

4. select the maximally matching sense

Expanding a dictionary

Problem with Lesk:

often dictionary definitions don’t contain sufficient information

not all words in dictionary definitions are good informants

Solution: use a thesaurus with subject/topic categories

e.g. Roget’s thesaurus

http://www.gutenberg.org/cache/epub/22/pg22.html.utf8

Using topic categories Suppose every sense sk of word w has subject/topic tk

w can be disambiguated by identifying the words related to tk

in the thesaurus

Problems: general-purpose thesauri don’t list domain-specific topics

several potentially useful words can be left out

e.g. … Navratilova plays great tennis …

proper name here useful as indicator of topic SPORT

Expanding a thesaurus: Yarowsky 19921. Given: context c and topic t

2. For all contexts and topics, compute p(c|t) using Naïve Bayes by comparing words pertaining to t in the thesaurus with words in c

if p(c|t) > α, then assign topic t to context c

3. For all words in the vocabulary, update the list of contexts in which the word occurs. Assign topic t to each word in c

4. Finally, compute p(w|t) for all w in the vocabulary this gives the “strength of association” of w with t

Yarowsky 1992: some results

SENSE ROGET TOPICS ACCURACY

star

space object UNIVERSE 96%

celebrity ENTERTAINER 95%

shape INSIGNIA 82.%

sentence

punishment LEGAL_ACTION 99%

set of words GRAMMAR 98%

Bootstrapping

Yarowsky (1995) suggested the one sense per

discourse/collocation constraints.

Yarowsky’s method:

1. select the strongest collocational feature in a specific context

2. disambiguate based only on this feature

(similar to the information-theoretic method discussed earlier)

One sense per collocation

1. For each sense s of w, initialise F, the collocations found in s’sdictionary definition.

2. One sense per collocation:

identify the set of contexts containing collocates of s

for each sense s of w, update F to contain those collocates such that

for all s’ ≠ s

(where alpha is a threshold)

)|'(

)|(

fsP

fsP

One sense per discourse

3. For each document:

find the majority sense of w out of those found in previous step

assign all occurrences of w the majority sense

This is implemented as a post-processing step. Reduces

error rate by ca. 27%.

Unsupervised disambiguation

Preliminaries

Recall: unsupervised learning can do sense discrimination

not tagging

akin to clustering occurrences with the same sense

e.g. Brown et al 1991: cluster translations of a word

this is akin to clustering senses

Brown et al’s method Preliminary categorisation:

1. Set P(w|s) randomly for all words w and senses s of w.

2. Compute, for each context c of w the probability P(c|s) that the context was generated by sense s.

Use (1) and (2) as a preliminary estimate. Re-estimate iteratively to find best fit to the corpus.

Characteristics of unsupervised

disambiguation

Can adapt easily to new domains, not covered by a dictionary

or pre-labelled corpus

Very useful for information retrieval

If there are many senses (e.g. 20 senses for word w), the

algorithm will split contexts into fine-grained sets

NB: can go awry with infrequent senses

Some issues with WSD

The task definition The WSD task traditionally assumes that a word has one and

only one sense in a context. Is this true?

Kilgarriff (1993) argues that co-activation (one word displaying more than one sense) is frequent: this would bring competition to the licensed trade

competition = “act of competing”; “people/organisations who are competing”

Systematic polysemy

Not all senses are so easy to distinguish. E.g. competition in the

“agent competing” vs “act of competing” sense.

The polysemy here is systematic

Compare bank/bank where the senses are utterly distinct (and most

linguists wouldn’t consider this a case of polysemy, but homonymy)

Can translation equivalence help here?

depends if polysemy is systematic in all languages

Logical metonymy

Metonymy = usage of a word to stand for something else

e.g. the pen is mightier than the sword

pen = the press

Logical metonymy arises due to systematic polysemy

good cook vs. good book

enjoy the paper vs enjoy the cake

Should WSD distinguish these? How could they do this?

Which words/usages count?

Many proper names are identical to common nouns (cf.

Brown, Bush,…)

This presents a WSD algorithm with systematic ambiguity

and reduces performance.

Also, names are good indicators of senses of neighbouring

words.

But this requires a priori categorisation of names.

Brown’s green stance vs. the cook’s green curry

Documents

Corpora and Statistical Methods Lecture 12staff.um.edu.mt/albert.gatt/teaching/dl/statLecture6b.pdfLecture 6 Word sense disambiguation Part 2 What are word senses? Cognitive definition: