Word Sense Disambiguation, Asma Naseer (Slides from Dr. Mary P. Harper, http://min.ecn.purdue.edu/~ee669/)


Page 1: Word Sense Disambiguation

Word Sense Disambiguation

Asma Naseer

(Slides from Dr. Mary P. Harper, http://min.ecn.purdue.edu/~ee669/)

Page 2: Word Sense Disambiguation


Problem: many words have different meanings or senses, i.e., there is ambiguity about how they are to be specifically interpreted (e.g., differentiate).

Task: to determine which of the senses of an ambiguous word is invoked in a particular use of the word by looking at the context of its use.

Note: more often than not the different senses of a word are closely related.

Overview of the Problem

Page 3: Word Sense Disambiguation

Bank
◦ The rising ground bordering a lake, river, or sea
◦ An establishment for the custody, loan, exchange, or issue of money, for the extension of credit, and for facilitating the transmission of funds

Title
◦ Name/heading of a book, statue, work of art or music, etc.
◦ Material at the start of a film
◦ The right of legal ownership (of land)
◦ The document that is evidence of the right
◦ An appellation of respect attached to a person's name

Ambiguity Resolution

Page 4: Word Sense Disambiguation

The Urdu word وقت ("time") shows the same ambiguity. Its related words include عصر، زمانہ، ویلا، بیلا (time, era); فصل، موسم، رت (season); موقع، مہلت، فرصت (occasion, respite); قرن، سماں، جگ (epoch, age); حیات، زندگی، عمر (life, lifetime); بار، نوبت، دفعہ، باری، دور (turn, instance, cycle).

40 words in 12 groups (senses); see www.crulp.org/oud

Ambiguity Resolution

Page 5: Word Sense Disambiguation


Supervised Disambiguation: based on a labeled training set.

Dictionary-Based Disambiguation: based on lexical resources such as dictionaries and thesauri.

Unsupervised Disambiguation: based on unlabeled corpora.

Methodological Preliminaries

Page 6: Word Sense Disambiguation


Supervised versus Unsupervised Learning: in supervised learning (classification), the sense label of each word occurrence is provided in the training set, whereas in unsupervised learning (clustering) it is not.

Pseudowords: used to generate artificial evaluation data for comparison and improvement of text-processing algorithms, e.g., replace each occurrence of two words (e.g., bell and book) with a pseudoword (e.g., bell-book).

Upper and Lower Bounds on Performance: used to find out how well an algorithm performs relative to the difficulty of the task.
◦ Upper bound: human performance
◦ Lower bound: the simplest possible algorithm

Methodological Preliminaries

Page 7: Word Sense Disambiguation

Upper bound on performance: human performance (Gale et al., 1992a)
◦ Between 97% and 99% agreement overall
◦ For clearly distinct senses: 95% and higher
◦ For polysemous words: 65% to 70%

Lower bound on performance: the simplest possible algorithm, i.e., always choosing the most frequent sense
◦ With two equiprobable senses this baseline is only 50%; with a highly skewed sense distribution it can reach 90% or more.

Methodological Preliminaries

Page 8: Word Sense Disambiguation

Symbol              Meaning
w                   an ambiguous word
s1, …, sk, …, sK    senses of w
c1, …, ci, …, cI    contexts of w
v1, …, vj, …, vJ    words used as contextual features

Notations

Page 9: Word Sense Disambiguation

Training set: exemplars in which each occurrence of the ambiguous word w is annotated with a semantic label. This becomes a statistical classification problem: assign w some sense sk in context ci.

Approaches:
◦ Bayesian classification: the context of occurrence is treated as a bag of words without structure, but it integrates information from many words in a context window.
◦ Information theory: looks only at the most informative feature in the context, which may be sensitive to text structure.

Supervised Disambiguation

Page 10: Word Sense Disambiguation


Look at the words around an ambiguous word in a large context window.

Each content word contributes potentially useful information about which sense of the ambiguous word is likely to be used with it.

A Bayes classifier applies, when choosing a class, the decision rule that minimizes the probability of error. It simply combines the evidence from all features, assuming they are independent.

Bayes decision rule: decide s' if P(s'|c) > P(sk|c) for all sk ≠ s'
◦ Optimal because it minimizes the probability of error; for each individual case it selects the class with the highest conditional probability (and hence lowest error rate).

Supervised Disambiguation: Bayesian Classification

Page 11: Word Sense Disambiguation


P(sk|c) = (P(c|sk)/P(c)) × P(sk)

◦ P(sk) is the prior probability of sense sk, i.e., the probability of sk without any contextual information.

◦ When updating the prior with evidence from context (i.e., P(c|sk)/P(c)), we obtain the posterior probability P(sk|c).

◦ If all we want to do is select the correct class, we can ignore P(c). Also use logs to simplify computation.

Assign word w the sense:
s' = argmax_sk P(sk|c)
   = argmax_sk P(c|sk) × P(sk)
   = argmax_sk [log P(c|sk) + log P(sk)]

Supervised Disambiguation: Bayesian Classification

Page 12: Word Sense Disambiguation

Naïve Bayes assumption: P(c|sk) = Π_{vj in c} P(vj|sk), i.e., the contextual features are treated as independent given the sense.

Decision rule for Naïve Bayes: decide s' if s' = argmax_sk [log P(sk) + Σ_{vj in c} log P(vj|sk)]

Bayesian Classification

Page 13: Word Sense Disambiguation

Training:
for all senses sk of w do
  for all vj in the vocabulary do
    P(vj|sk) = C(vj, sk) / C(sk)
  end
end
for all senses sk of w do
  P(sk) = C(sk) / C(w)
end

Disambiguation:
for all senses sk of w do
  score(sk) = log P(sk)
  for all vj in the context window c do
    score(sk) = score(sk) + log P(vj|sk)
  end
end
choose s' = argmax_sk score(sk)

Bayesian Disambiguation Algorithm

Gale, Church, and Yarowsky (1992b; 1992c) reported 90% accuracy for six ambiguous nouns (duty, drug, land, language, position, and sentence).
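The training and disambiguation loops above translate almost directly into code. Below is a minimal Python sketch under a few assumptions not in the slides: contexts are pre-tokenized word lists, and Laplace smoothing (the alpha parameter) is added so that unseen word-sense pairs do not zero out a score; all names are illustrative.

```python
import math
from collections import Counter, defaultdict

def train(labeled_contexts, alpha=1.0):
    """labeled_contexts: (sense, [context words]) pairs for one ambiguous word w.
    Returns log P(sk) and Laplace-smoothed log P(vj|sk) tables."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, context in labeled_contexts:
        sense_counts[sense] += 1
        for v in context:
            word_counts[sense][v] += 1
            vocab.add(v)
    total = sum(sense_counts.values())
    log_prior = {s: math.log(c / total) for s, c in sense_counts.items()}
    log_lik = {}
    for s in sense_counts:
        denom = sum(word_counts[s].values()) + alpha * len(vocab)
        log_lik[s] = {v: math.log((word_counts[s][v] + alpha) / denom)
                      for v in vocab}
    return log_prior, log_lik

def disambiguate(context, log_prior, log_lik):
    """choose argmax_sk [log P(sk) + sum_{vj in c} log P(vj|sk)]."""
    def score(s):
        return log_prior[s] + sum(log_lik[s][v] for v in context if v in log_lik[s])
    return max(log_prior, key=score)
```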

Page 14: Word Sense Disambiguation

Bayes classifier:
◦ Uses information from all the words in the context window to disambiguate
◦ Relies on an independence assumption

The information-theoretic approach takes the opposite route:
◦ It finds a single contextual feature that reliably indicates the sense of the ambiguous word w.

An Information-Theoretic approach

Page 15: Word Sense Disambiguation

Brown et al. (1991) attempt to find a single contextual feature that reliably indicates which sense of an ambiguous word is being used.

The French verb prendre has two different readings that are selected by the word appearing in object position:
◦ with mesure: to take
◦ with décision: to make

The verb vouloir's reading is selected by tense:
◦ present: to want
◦ conditional: to like

Brown et al. use the Flip-Flop algorithm.

Supervised Disambiguation: An Information-Theoretic Approach

Page 16: Word Sense Disambiguation

The Flip-Flop algorithm works with:
◦ t1, …, tm: the translations of the ambiguous word
◦ x1, …, xn: the possible values of the indicator

It partitions both sets so as to maximize the mutual information
I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [p(x, y) / (p(x) p(y))]

An Information-Theoretic approach

Page 17: Word Sense Disambiguation

The algorithm works by alternately searching for partitions of the translations and of the indicator values that maximize their mutual information. The algorithm stops when the increase becomes insignificant.

Supervised Disambiguation: An Information-Theoretic Approach

Page 18: Word Sense Disambiguation


I(X; Y)=H(X)-H(X|Y)=H(Y)-H(Y|X), the mutual information between X and Y, is the reduction in uncertainty of one random variable due to knowing about another, or, in other words, the amount of information one random variable contains about another.

Mutual Information

[Figure: entropy diagram showing H(X) and H(Y) overlapping in I(X; Y), with the conditional entropies H(X|Y) and H(Y|X) and the joint entropy H(X,Y)]

Page 19: Word Sense Disambiguation

I(X; Y) is a symmetric, non-negative measure of the common information of two variables.

Some see it as a measure of dependence between two variables, but it is better to think of it as a measure of independence:
◦ I(X; Y) is 0 only when X and Y are independent: H(X|Y) = H(X)
◦ For two dependent variables, I grows not only with the degree of dependence but also with the entropy of the two variables.

H(X) = H(X) - H(X|X) = I(X; X), which is why entropy is also called self-information.

I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

Mutual Information (cont)

Page 20: Word Sense Disambiguation

t1, …, tm (translations of the ambiguous word); x1, …, xn (possible values of the indicator)

I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log [p(x, y) / (p(x) p(y))]

Mutual information increases monotonically in the Flip-Flop algorithm, so it is reasonable to stop when there is only an insignificant improvement.

The Flip-Flop Disambiguation Algorithm
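The following Python sketch mirrors this alternating search. It assumes the co-occurrence counts between translations and indicator values are given as a dictionary, and it brute-forces the bipartitions, which is feasible only for small sets like those in the prendre example that follows; Breiman et al.'s splitting theorem, mentioned later, is what makes a linear-time version possible. Names and the stopping tolerance are illustrative.

```python
import math
from itertools import combinations

def mutual_info(joint):
    """joint: dict (p_side, q_side) -> count.  Returns I(P; Q) in bits."""
    total = sum(joint.values())
    px, qy = {}, {}
    for (a, b), c in joint.items():
        px[a] = px.get(a, 0) + c
        qy[b] = qy.get(b, 0) + c
    return sum((c / total) * math.log2(c * total / (px[a] * qy[b]))
               for (a, b), c in joint.items() if c)

def bipartitions(items):
    """All two-way splits of items; items[0] is pinned to side 1 to skip mirrors."""
    first, rest = items[0], items[1:]
    for r in range(len(rest)):          # keep both sides non-empty
        for tail in combinations(rest, r):
            side1 = {first, *tail}
            yield side1, set(items) - side1

def project(counts, P1, Q1):
    """Collapse (translation, indicator) counts onto the partition labels."""
    joint = {}
    for (t, x), c in counts.items():
        key = (t in P1, x in Q1)
        joint[key] = joint.get(key, 0) + c
    return joint

def flip_flop(counts, ts, xs, init=None, tol=1e-9):
    """counts: dict (translation, indicator value) -> co-occurrence count."""
    P1 = set(init) if init else set(ts[:len(ts) // 2])  # initial partition of ts
    best = -1.0
    while True:
        Q1, _ = max(bipartitions(xs),
                    key=lambda q: mutual_info(project(counts, P1, q[0])))
        P1, _ = max(bipartitions(ts),
                    key=lambda p: mutual_info(project(counts, p[0], Q1)))
        mi = mutual_info(project(counts, P1, Q1))
        if mi - best < tol:             # stop on insignificant improvement
            return P1, Q1
        best = mi
```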

Page 21: Word Sense Disambiguation

Translate prendre based on its object:
◦ {t1, …, tm} = {take, make, rise, speak}
◦ {x1, …, xn} = {mesure, note, exemple, décision, parole}

prendre is translated as:
◦ take when occurring with the objects mesure, note, and exemple;
◦ otherwise as make, rise, or speak.

Initial partition:
◦ P1 = {take, rise}, P2 = {make, speak}

Choose the partition Q of the indicator values:
◦ Q1 = {mesure, note, exemple}, Q2 = {décision, parole}

But prendre la parole is translated as rise to speak: rise is in P1 while parole is in Q2, so repartition:
◦ P1 = {take}, P2 = {rise, make, speak}

Example
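Feeding the flip_flop sketch above some made-up counts for this example (the numbers are purely illustrative, not from Brown et al.) reproduces the partition the slide arrives at when started from P1 = {take, rise}:

```python
counts = {('take', 'mesure'): 5, ('take', 'note'): 4, ('take', 'exemple'): 3,
          ('make', 'décision'): 6, ('rise', 'parole'): 2, ('speak', 'parole'): 4}
P1, Q1 = flip_flop(counts,
                   ['take', 'make', 'rise', 'speak'],
                   ['mesure', 'note', 'exemple', 'décision', 'parole'],
                   init={'take', 'rise'})
print(P1, Q1)   # {'take'} {'mesure', 'note', 'exemple'}
```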

Page 22: Word Sense Disambiguation

An Information-Theoretic Approach

[Figure: Flip-Flop iterations on the prendre example. The translations {take, make, rise, speak} and the indicator values {mesure, note, exemple, décision, parole} are alternately repartitioned into {P1, P2} and {Q1, Q2}; prendre -> take, prendre la parole -> rise to speak.]

Page 23: Word Sense Disambiguation

The Flip-Flop algorithm is a linear-time algorithm based on Breiman et al.'s (1984) splitting theorem.
◦ Run the algorithm for all possible indicators and choose the indicator with the highest mutual information.
◦ Once the indicator and the partition of its values are determined, disambiguation is simple:
  For each occurrence of the ambiguous word, determine the value xi of the indicator.
  If xi is in Q1, assign sense 1; if xi is in Q2, assign sense 2.

Brown et al. (1991) obtained a 20% improvement in an MT system using this approach (translations used as senses).

Flip-Flop Algorithm

Page 24: Word Sense Disambiguation

If we have no information about the senses of specific instances of words, we can fall back on a general characterization of the senses provided by a lexicon.

We will look at three different methods:
◦ Disambiguation based on sense definitions in a dictionary (Lesk, 1986)
◦ Thesaurus-based disambiguation (Walker, 1987; Yarowsky, 1992)
◦ Disambiguation based on translations in a second-language corpus (Dagan and Itai, 1994)

One sense per discourse / one sense per collocation:
◦ Ambiguous words tend to be used with only one sense in a given discourse or with a given collocate.

Dictionary-Based Disambiguation

Page 25: Word Sense Disambiguation

Lesk (1986) uses the simple idea that a word's dictionary definitions are likely to be good indicators of the senses they define.

For example, the words in the definitions of the senses of cone (seed-bearing cone versus ice-cream cone) can be matched against the words in the definitions of all the words in its context.
◦ Let D1, D2, …, DK be the definitions of the senses s1, s2, …, sK of an ambiguous word w, each represented as the bag of words occurring in the definition.
◦ Let Evj be the dictionary definition(s) of a word vj occurring in the context c of w, represented as a bag of words: if sj1, sj2, …, sjL are the senses of vj, then Evj = ∪_jt Djt.

Sense Definition Disambiguation

Page 26: Word Sense Disambiguation

Disambiguate the ambiguous word by choosing the sense whose definition has the greatest overlap with the words occurring in its context. Overlap can be measured by counting common words or with other similarity measures.

comment: Given context c
for all senses sk of w do
  score(sk) = overlap(Dk, ∪_{vj in c} Evj)
end
choose s' = argmax_sk score(sk)

Sense Definition Disambiguation
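A minimal Python sketch of this simplified Lesk scorer, assuming the definitions have already been reduced to sets of words (a set intersection stands in for the overlap count); names are illustrative.

```python
def lesk(context, definitions, dictionary):
    """Pick the sense of w whose definition overlaps most with the
    definitions of the context words.
    definitions: sense sk -> set of words in Dk
    dictionary:  word vj  -> set of words in Evj (union of its sense definitions)"""
    # Union of Evj over the words vj in context c.
    context_defs = set()
    for vj in context:
        context_defs |= dictionary.get(vj, set())
    # Overlap measured by counting common words.
    return max(definitions, key=lambda sk: len(definitions[sk] & context_defs))
```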

Page 27: Word Sense Disambiguation

By itself, this method is insufficient for highly accurate word sense disambiguation; Lesk obtained accuracies between 50% and 70% on a sample of ambiguous words.

There are possible optimizations that improve the algorithm:
◦ Run several iterations of the algorithm on a text and, instead of the union of all definition words Evj for vj, use only the contextually appropriate definitions based on a prior iteration.
◦ Expand each word in the context c with synonyms from a thesaurus.

Sense Definition Disambiguation

Page 28: Word Sense Disambiguation

This approach exploits the semantic categorization provided by a thesaurus (e.g., Roget's) or a lexicon with subject categories (e.g., Longman's).

The basic idea is that semantic categories of the words in a context determine the semantic category of the context as a whole. This category, in turn, determines which word senses are used.

Thesaurus-Based Disambiguation

Page 29: Word Sense Disambiguation

Walker (1987): each word is assigned one or more subject codes in a dictionary, corresponding to its different meanings.
◦ If more than one subject code is found, assume that each code corresponds to a different word sense.
◦ Let t(sk) be the subject code of sense sk of word w in context c.
◦ Then w can be disambiguated by counting the number of words from the context c for which the thesaurus lists t(sk) as a possible subject code; select the sense whose subject code has the highest count.

Black (1988) achieved only moderate success on five ambiguous words with this approach (~50% accuracy).

Thesaurus-Based Disambiguation

Page 30: Word Sense Disambiguation

Walker's algorithm
comment: Given context c
for all senses sk of w do
  score(sk) = Σ_{vj in c} δ(t(sk), vj)
end
choose s' = argmax_sk score(sk)

Note that δ(t(sk), vj) = 1 iff t(sk) is one of the subject codes for vj, and 0 otherwise. The score is the number of context words compatible with the subject code of sk.

One problem with this algorithm is that a general categorization of words into topics may be inappropriate in a particular domain (e.g., mouse as a mammal versus an electronic device in the context of a computer manual).

Another problem is coverage: a name like Navratilova suggests the topic of sports, yet it appears in no lexicon.

Thesaurus-Based Disambiguation
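Walker's counting scheme is a few lines in Python. The sketch below assumes each sense maps to a single subject code and the thesaurus is a word-to-codes mapping; names are illustrative.

```python
def walker(context, t, subject_codes):
    """t: sense sk -> subject code t(sk)
    subject_codes: word vj -> set of subject codes the thesaurus lists for vj"""
    def score(sk):
        # Number of context words compatible with the subject code of sk.
        return sum(1 for vj in context if t[sk] in subject_codes.get(vj, ()))
    return max(t, key=score)
```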

Page 31: Word Sense Disambiguation

Dagan & Itai (1991; 1994) found that words can be disambiguated by looking at how they are translated in other languages.
◦ The first language is the one whose senses we wish to disambiguate.
◦ We must have a bilingual dictionary between the first and second languages and a corpus for the second (target) language.

Example: the word interest has two translations in German:
1. Beteiligung (legal share: "a 50% interest in the company")
2. Interesse (attention, concern: "her interest in Mathematics")

To disambiguate the word interest, we identify the phrase it occurs in and search a German corpus for instances of that phrase. If the phrase occurs with only one of the translations in German, we assign the corresponding sense whenever the word appears in that phrase.

Disambiguation Based on Translations in a Second-Language Corpus

Page 32: Word Sense Disambiguation

comment: Given context c in which w occurs in relation R(w, v)
for all senses sk of w do
  score(sk) = |{c' ∈ S | ∃ w' ∈ T(sk), ∃ v' ∈ T(v): R(w', v') ∈ c'}|
end
choose s' = argmax_sk score(sk)

S is the second-language corpus, T(sk) is the set of possible translations of sense sk, and T(v) is the set of possible translations of v.

The score of a sense is the number of times that one of its translations occurs with a translation of v in the second-language corpus.

Dagan & Itai’s Algorithm

Page 33: Word Sense Disambiguation

For example, the relation R could be 'is-object-of' to disambiguate interest: showed an interest -> Interesse zeigen (attention, concern) versus acquired an interest -> Beteiligung erwerben (legal share).

The algorithm of Dagan and Itai is more complex than shown here; it disambiguates only if the decision can be made reliably. They estimate the probability of an error and make a decision only when that probability is less than 10%.

For example, if a word w in the first language can be translated two ways in the second language within a given phrase (e.g., stand at w), and there are 10 instances of the first translation and 5 of the second, then the estimated probability of error when choosing the first is 5/(10+5) ≈ 0.33, so no decision is made.

Dagan & Itai’s Algorithm

Page 34: Word Sense Disambiguation

Yarowsky (1995) suggests that there are constraints between different occurrences of an ambiguous word within a corpus that can be exploited for disambiguation:
◦ One sense per discourse: the sense of a target word is highly consistent within any given document. For example, the word differentiate (calculus versus biology sense), once used one way in a discourse, is likely to continue being used that way.
◦ One sense per collocation: nearby words provide strong and consistent clues to the sense of a target word, conditional on relative distance, order, and syntactic relationship. Word senses are strongly correlated with certain contextual features, such as other words in the same phrase.

One Sense per Discourse, One Sense per Collocation

Page 35: Word Sense Disambiguation

comment: Initialization
for all senses sk of w do
  Fk = the set of collocations in sk's dictionary definition
end
for all senses sk of w do
  Ek = ∅
end

Fk contains the characteristic collocations of sk, initialized from the dictionary definition of sk or from another source.

Ek is the set of contexts of the ambiguous word w currently assigned to sk; it is initially empty.

Yarowsky’s (1995) Algorithm

Page 36: Word Sense Disambiguation

comment: One sense per collocation
while (at least one Ek changed during the last iteration) do
  for all senses sk of w do
    Ek = {ci | ∃ fm: fm ∈ ci ∧ fm ∈ Fk}
  end
  for all senses sk of w do
    Fk = {fm | ∀ n ≠ k: P(sk|fm) / P(sn|fm) > α}
  end
end

comment: One sense per discourse
for all documents dm do
  determine the majority sense sk of w in dm
  assign all occurrences of w in dm the sense sk
end

Yarowsky’s (1995) Algorithm
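A hedged Python sketch of the collocation bootstrap above (the one-sense-per-discourse pass is left out). The odds threshold alpha, the smoothing constant, and the iteration cap are all illustrative choices, and unlike the full algorithm this simplified loop may leave a context assigned to more than one sense.

```python
from collections import defaultdict

def yarowsky(contexts, seeds, alpha=2.0, smooth=0.5, max_iters=100):
    """contexts: list of feature sets, one per occurrence of the ambiguous word w.
    seeds: sense -> initial collocation set Fk (e.g., from dictionary definitions)."""
    F = {s: set(fs) for s, fs in seeds.items()}
    E = {s: set() for s in F}
    for _ in range(max_iters):
        changed = False
        # One sense per collocation: Ek = contexts containing a collocation in Fk.
        for s in F:
            new_e = {i for i, c in enumerate(contexts) if c & F[s]}
            changed |= new_e != E[s]
            E[s] = new_e
        if not changed:
            break
        # Re-derive Fk: keep features whose smoothed odds for sk against every
        # other sense exceed alpha; the seed collocations are always retained.
        counts = defaultdict(lambda: defaultdict(float))
        for s, idxs in E.items():
            for i in idxs:
                for f in contexts[i]:
                    counts[f][s] += 1
        for s in F:
            grown = {f for f, c in counts.items()
                     if all((c[s] + smooth) / (c[n] + smooth) > alpha
                            for n in F if n != s)}
            F[s] = set(seeds[s]) | grown
    return E, F
```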

Page 37: Word Sense Disambiguation

General dictionaries are less useful for domain-specific collections.

Completely unsupervised disambiguation is not possible for sense tagging, but sense discrimination can be performed in a completely unsupervised fashion.

Schutze proposed context-group discrimination for this purpose.

The EM algorithm is used for unsupervised disambiguation.

Unsupervised Disambiguation

Page 38: Word Sense Disambiguation

Without supporting tools such as dictionaries and thesauri, and in the absence of labeled text, we can simply cluster the contexts of an ambiguous word into a number of groups and discriminate between these groups without labeling them.

Context-group discrimination (Schutze, 1998):
◦ Clusters uses of an ambiguous word with no additional knowledge.
◦ For an ambiguous word w with senses s1, …, sk, …, sK, estimate P(vj|sk), the conditional probability of each word vj occurring in w's context when w is used with sense sk.

Unsupervised Disambiguation

Page 39: Word Sense Disambiguation

The probabilistic model is the same Bayesian model as the one used in Gale et al.'s Bayes classifier, except that each P(vj|sk) is estimated with the EM algorithm:
◦ Start with a random initialization of the parameters P(vj|sk).
◦ Compute, for each context ci of w, the probability P(ci|sk) that it was generated by sense sk.
◦ Use this preliminary categorization of the contexts as training data and re-estimate P(vj|sk) to maximize the likelihood of the data given the model.
◦ EM is guaranteed to increase the log-likelihood of the model given the data at each step; the algorithm therefore stops when the likelihood no longer increases significantly.

Schutze (1998)

Page 40: Word Sense Disambiguation

Initialize the parameters P(vj|sk) and P(sk) randomly.

Compute the log-likelihood of the corpus C given the model µ:
l(C|µ) = log Π_i Σ_k P(ci|sk) P(sk) = Σ_i log Σ_k P(ci|sk) P(sk)

E-step: for 1 ≤ k ≤ K and 1 ≤ i ≤ I, estimate h_ik, the posterior probability that sense sk generated context ci:
h_ik = P(ci|sk) P(sk) / Σ_{k'} P(ci|sk') P(sk')

Unsupervised Disambiguation

Page 41: Word Sense Disambiguation

M-step: re-estimate the parameters
P(vj|sk) = Σ_{i: vj ∈ ci} h_ik / Σ_j Σ_{i: vj ∈ ci} h_ik
P(sk) = Σ_i h_ik / I

Repeat the E- and M-steps while l(C|µ) improves.

Unsupervised Disambiguation
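A compact Python sketch of this EM loop under the Naïve Bayes model. The log-sum-exp step and the small smoothing constant are implementation choices, not part of the slides; contexts are assumed to be pre-tokenized word lists, and names are illustrative.

```python
import math, random

def em_wsd(contexts, vocab, K, iters=100, tol=1e-4, seed=0):
    """Unsupervised sense discrimination: returns P(sk) and P(vj|sk) tables."""
    rng = random.Random(seed)
    prior = [1.0 / K] * K
    p = []
    for _ in range(K):
        raw = {v: rng.random() + 0.01 for v in vocab}    # avoid zero probabilities
        z = sum(raw.values())
        p.append({v: x / z for v, x in raw.items()})
    prev_ll = -math.inf
    for _ in range(iters):
        # E-step: h[i][k] = posterior that sense k generated context i,
        # computed in log space with the log-sum-exp trick.
        h, ll = [], 0.0
        for c in contexts:
            logs = [math.log(prior[k]) +
                    sum(math.log(p[k][v]) for v in c if v in p[k])
                    for k in range(K)]
            m = max(logs)
            exps = [math.exp(x - m) for x in logs]
            z = sum(exps)
            h.append([e / z for e in exps])
            ll += m + math.log(z)
        if ll - prev_ll < tol:           # likelihood no longer improves
            break
        prev_ll = ll
        # M-step: re-estimate P(sk) and P(vj|sk) from the soft assignments.
        prior = [sum(hi[k] for hi in h) / len(contexts) for k in range(K)]
        for k in range(K):
            counts = {v: 1e-3 for v in vocab}            # tiny smoothing
            for hi, c in zip(h, contexts):
                for v in c:
                    counts[v] += hi[k]
            z = sum(counts.values())
            p[k] = {v: x / z for v, x in counts.items()}
    return prior, p
```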

Page 42: Word Sense Disambiguation

Once the model parameters are estimated, we can disambiguate occurrences of w by computing the probability of each sense based on the words vj occurring in the context. Schutze (1998) uses the Naïve Bayes decision rule:
◦ Decide s' if s' = argmax_sk [log P(sk) + Σ_{vj in c} log P(vj|sk)]

The granularity of senses can be chosen by running the algorithm over a range of values for the number of senses:
◦ The larger the number of senses, the better the model can explain the data.
◦ The relative increase in likelihood may help distinguish important senses from random variation.
◦ The number of senses could be made dependent on the amount of training data.
◦ Finer-grained distinctions can be obtained than in supervised approaches.

Works better for topic-dependent senses than for topic-independent ones.

Schutze (1998)

Page 43: Word Sense Disambiguation

If the disambiguation task is embedded in a task like translation, it is easy to evaluate in the context of that application. This leads to application-oriented notions of sense.

Direct, application-independent evaluation of disambiguation accuracy is more difficult. It would be easier if there were standard evaluation sets; the Senseval project is addressing this need.

Researchers need to evaluate their algorithms on a representative sample of ambiguous words.

Word Sense Disambiguation Evaluation

Page 44: Word Sense Disambiguation

The type of information used in disambiguation affects the notion of sense used:
◦ Co-occurrence (bag-of-words model): topical sense
◦ Relational information (e.g., subject, object)
◦ Other grammatical information (e.g., part of speech)
◦ Collocations (one sense per collocation)
◦ Discourse (one sense per discourse segment): how much context is needed to determine sense?
◦ Combinations of the above

Different types of information may be more useful for different parts of speech (e.g., verb meaning is affected by its complements, while nouns are more affected by wider context).

Factors Influencing the Notion of Sense