Using WordNet to Disambiguate Word Senses
by
Ying Liu
Electrical and Computer Engineering
Acknowledgements
I would like to first thank Prof. Peter Scheuermann without whose constant guidance,
support and encouragement, this work would not have been possible. I would also like to
thank Bin Chen who gladly discussed various issues related to my work with me. This
work is the result of many insightful discussions that I have had with Prof. Scheuermann
who inspired me all through and also guided me as and when required. I would also like
to thank the members of the Database System group for their friendship and help. They
are Shayan Zaidi, Mehmet Sayal, Olga Shumsky, and Chris Fernandes.
Further, I would like to thank Dr. Ellen M. Voorhees for her suggestions. Finally, I would
like to thank my parents, Zongli Liu and Huifang Xu, who have guided me all through my
life. I would like to thank them for all the love, encouragement and virtues that I
received while growing up.
Introduction
  1.1 Motivation
  1.2 Contribution
  1.3 Organization
Background Knowledge
  2.1 WordNet™
  2.2 Part-of-Speech Taggers
  2.3 Stemming
  2.4 Stopwords
Work related to Word Sense Disambiguation
  3.1 Survey of Approaches to Word Sense Disambiguation
    3.1.1 Knowledge Based
    3.1.2 Corpus Based
      Disambiguated Corpora
      Raw Corpora
    3.1.3 Hybrid Approaches
Using Hood Algorithm to Disambiguate Word Senses
  4.1 Converting WordNet into Relational Database
  4.2 Algorithm
    4.2.1 Hood Construction
    4.2.2 Word Sense Disambiguation
Experiments
  5.1 Part-of-Speech Tagged Brown Corpus
  5.2 Flow of Experiment
  5.3 Quality of Results
  5.4 Result Analysis
Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work and Application
References
Appendix A: Definition of Tables
CHAPTER 1 Introduction

Text retrieval, also known as document or information retrieval, is concerned with
locating natural language documents whose contents satisfy a user's information need.
Unfortunately, the Internet today contains billions of documents, many of which have no
abstracts or even titles. Therefore, there is considerable interest in developing
techniques that automatically index full-text documents and provide access to
heterogeneous collections of full-text documents.
1.1 Motivation
Search engines are valuable tools that help users find desired documents. Whenever a user
submits a set of query keywords, documents that contain some or all of the keywords are
returned. However, these search engines cannot answer all queries well. For
instance, most web users have experienced the frustration of being returned a large
number of web pages and having to wade through many unrelated pages to
identify the useful ones. Sometimes not only is the number of returned documents large, but
the categories identified by the search engine are also irrelevant.
Let’s look at an example. If a computer hardware engineer wants to search for documents
related to "board", Yahoo! returns the following categories:
Figure 1-1: Yahoo! Category Matches (1 - 8 of 2394)

Only the first 8 of the 2394 matches are listed here. The results are organized in a
hierarchical structure: in the first row, “Recreation” is the top category, “Games” is the
second-level category, …, and “Board Games” is the category or web page that contains the
keyword “board”. There are only 2 categories related to circuit_board. Obviously, it is a
big burden for the computer hardware engineer to sift out the documents that he is really
interested in from such a large number of categories; remember that within each
category there may be numerous web sites. Although the user is only interested in the
“circuit_board” meaning of “board”, the search engine returns all the documents that
contain “board”. To explain why search engines sometimes cannot generate satisfactory
categories, we will review how their hierarchical classification structures are generated.
Most hierarchical categories or classes employed by search engines were either manually
set up or automatically constructed by data clustering algorithms. Since the class
hierarchies generated by clustering algorithms lack semantic information, it is likely that
they perform poorly when the number of query terms is small or a query term has more
than one meaning. Although manually constructed hierarchies of classes normally have
higher accuracy, there are also a number of problems with them. First, manually
constructed classes are not concept oriented, that is, for each keyword more than one
class can have the keyword as name or label. For example, there are multiple classes of
“board” in Figure 1-1. Consequently, users have to explore a huge number of categories
in order to identify the desired pages. Secondly, since the hierarchies are often
maintained by a group of people, over time the update procedure is prone to conflicting
classification criteria.
To overcome the disadvantages caused by manually constructed classes, an algorithm
that constructs a hierarchical classification model based on keywords and their
relationships from thesauri is proposed. Specifically, each class corresponds to one
concept since in human memory different keywords are used to represent different
objects, ideas, or activities. The topics of documents in each class are similar. The
hierarchical structure is maintained via “IS-A” or “PART-OF” relationships between
classes, e.g., class “homer” is “PART-OF” class “baseball”, hence class “baseball” is a
super class of “homer”. The advantages of such a novel class hierarchy can be
summarized as follows:
1. Each class name corresponds to one word (actually a concept, or keyword sense),
suitable for keyword-based queries.
2. The relationships between classes are semantically defined by the thesauri; therefore,
the hierarchy is much more stable than traditional hierarchical classes.
With this thesaurus-based hierarchy, documents are then mapped to the class hierarchy.
During the mapping, a threshold min_sim [28] is employed to determine whether a
document and a class are similar to each other or not. After documents are mapped to the
class hierarchy, class representative vectors [28] are adjusted to reflect the topics of the
documents. Next, documents are re-mapped by using the adjusted class representative
vectors. This re-mapping may be iterated a number of times. Then classes
that contain too few documents are removed by a hierarchy refinement procedure [28].
The resulting class hierarchy and document mapping form the final hierarchical
classification.
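The mapping loop described above can be sketched as follows. This is a minimal illustration only: the cosine measure, the tiny vectors in the usage example, and the particular min_sim value are our own assumptions, not the actual procedure of [28].

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def map_documents(docs, class_vectors, min_sim=0.5, rounds=2, min_docs=1):
    """Iteratively map documents to classes and re-centre the class vectors."""
    mapping = {}
    for _ in range(rounds):
        # Map each document to every class it resembles closely enough.
        mapping = {d: [c for c, cv in class_vectors.items()
                       if cosine(dv, cv) >= min_sim]
                   for d, dv in docs.items()}
        # Adjust each class representative toward its mapped documents.
        for c in list(class_vectors):
            members = [docs[d] for d, cs in mapping.items() if c in cs]
            if members:
                merged = {}
                for dv in members:
                    for t, w in dv.items():
                        merged[t] = merged.get(t, 0.0) + w / len(members)
                class_vectors[c] = merged
    # Refinement: drop classes that attracted too few documents.
    kept = {c for c in class_vectors
            if sum(c in cs for cs in mapping.values()) >= min_docs}
    return {d: [c for c in cs if c in kept] for d, cs in mapping.items()}
```

For instance, a document weighted toward "board" and "circuit" would map to a hypothetical "circuit_board" class rather than a "board_game" class once min_sim is set high enough to exclude the weaker match.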
Fortunately, WordNet, an electronic dictionary developed at Princeton University, is a
concept-based dictionary whose lexical relations include “IS-A” and “PART-OF”. It is
used as the framework for the proposed hierarchical classification. Assuming that the
class hierarchy is already constructed, what we need to do is classify documents into
their appropriate classes. Polysemy, which is defined as a single word form having more
than one meaning, causes false classification. For example, if we failed to tell which
meaning of “board” is used in a given situation, we would probably map the document to
a wrong class. Synonymy, which is defined as multiple words having the same meaning, causes
true conceptual mappings to be missed. Therefore, the critical step of classification is to
recognize synonyms and to detect which of a word's meanings is used, for each word in
each document. For example, if we failed to recognize that “notebook” and “laptop” mean the
same thing, all the documents that use “notebook” in place of “laptop” would be
left out of the class “laptop”.
The issue is how to automatically detect polysemes and synonyms. In principle, polysemes
and synonyms can be handled by assigning the different senses of a word to different
concept identifiers and assigning the same concept identifier to synonyms. In practice,
this requires procedures that not only are able to recognize synonyms but also can detect
uses of different senses of a word.
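The idea of concept identifiers can be illustrated with a toy table; all the words, sense labels, and identifier values below are invented for illustration.

```python
# Toy concept table: every sense of a word gets its own concept identifier,
# and synonymous senses of different words share one identifier.
CONCEPTS = {
    ("board", "circuit"): 101,    # circuit-board sense of "board"
    ("card", "circuit"): 101,     # synonymous sense of "card"
    ("board", "committee"): 102,  # governing-board sense of "board"
    ("board", "timber"): 103,     # sawn-timber sense of "board"
    ("plank", "timber"): 103,     # "plank" shares this concept
}

def concept_id(word, sense_hint):
    """Map a (word, disambiguated sense) pair to its concept identifier."""
    return CONCEPTS.get((word, sense_hint))
```

Because "board" and "plank" share identifier 103, documents using either word land in the same concept, while the two senses of "board" stay apart under 101 and 102.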
1.2 Contribution
In this report, we implemented the disambiguation algorithm introduced by Ellen M.
Voorhees in her paper Using WordNet to Disambiguate Word Senses for Text Retrieval
[5]. This algorithm automatically detects and resolves the senses of the polysemous
nouns occurring in the texts of documents and queries. Each word processed by this
technique is mapped to the unique concept that represents the meaning intended in its
context. However, Voorhees did not apply this idea to full text documents. We applied
the algorithm to a set of documents, the Brown Corpus, one of the most widely used
document collections in a variety of fields. Finally, we tested the effectiveness of this
automatic disambiguation algorithm by comparing its output with the manual
disambiguation results offered by Princeton University. Our experiments verified Dr.
Voorhees’ conclusion in her paper [5] that this algorithm is not sufficient to reliably
select the correct sense of a noun from the set of fine sense distinctions in WordNet.
1.3 Organization
The remainder of this report is organized as follows. Chapter 2 gives some background
on text retrieval and WordNet. Chapter 3 surveys the work done in the area of sense
disambiguation. Chapter 4 explains the algorithm introduced by Dr. Voorhees in detail:
the first section explains the hood construction part of the algorithm, and the second
section explains the word sense disambiguation part. Chapter 5 presents our
experimental results, together with a qualitative analysis of the algorithm. Chapter 6
draws conclusions from this work. Finally, we comment on the future work that can be
explored in this area and its potential applications.
CHAPTER 2 Background Knowledge
In our work, we apply the algorithm to the Brown Corpus. We downloaded the part-of-
speech tagged Brown Corpus from the University of Pennsylvania. First, we remove all
the tags and all non-noun words, since most of the semantics is carried by nouns [2].
Secondly, we convert each word to its stem with Porter's algorithm. Thirdly, we remove
all words that are not in WordNet or that appear in the stopword list. Thus, after these
three processing steps every document in the corpus is represented only by its valid
nouns. Finally, our algorithm is run and the results are analyzed.
To help the reader follow our work, this chapter gives some of the background
knowledge involved in it.
2.1 WordNet™
WordNet is a manually-constructed lexical system developed by George Miller and his
colleagues at the Cognitive Science Laboratory at Princeton University [12]. Originating
from a project whose goal was to produce a dictionary that could be searched
conceptually instead of only alphabetically, WordNet evolved into a system that reflects
current psycholinguistic theories about how humans organize their lexical memories.
In WordNet, the basic building block is the synset, or synonym set, consisting of all the
words that express a given concept. Senses are manually classified into synsets; within
each synset the senses, although belonging to different words, denote the same meaning.
For example, “board” has several senses, and so does “plank”. Each of the two words
has a sense that means “a stout length of sawn timber; made in a wide variety of sizes
and used for many purposes”. The synset corresponding to this sense is composed of
“board” and “plank”. In this example, the given senses of “plank” and “board” are
synonymous and form one synset. Because all synonymous senses are grouped into one
synset and all different senses of the same word are separated into different synsets, there
are no synonymous or polysemous synsets. Hence, every synset represents a lexicalized
concept.
There are four main divisions in WordNet, one each for nouns, verbs, adjectives and
adverbs. Within a division, synsets are organized by the lexical relations defined on them.
For nouns, the only division of WordNet used in our work, the lexical relations include
the “IS-A” and “PART-OF” relations. For example, Figure 2-1 shows the hierarchy
relating the eight different senses of the noun “board”. The synsets with the heavy
border are the actual senses of “board”, and the remaining synsets are either ancestors or
descendants of one of the senses. The synsets {group, grouping} and {entity, thing} in
Figure 2-1 are examples of heads of the hierarchies. Other heads include {act,
human_action, human_activity}, {abstraction}, {possession} and {psychological_feature}.
Figure 2-1: The IS-A hierarchy for eight different senses of the noun “board”.
[Figure not reproduced here: the diagram shows the synsets for the eight senses of
“board” (among them {board, plank, deal} and {circuit_board, circuit_card, board,
card}), drawn with heavy borders, together with their ancestors and descendants up to
the hierarchy heads {entity, thing} and {group, grouping}.]
WordNet 1.6 (2000), the version of WordNet used in this work, contains 94473 words
and 116314 senses in the noun division. Because synsets contain only strict synonyms,
the majority of synsets are quite small. Similarly, the average number of senses per word
is close to one. These figures seem to suggest that polysemy and synonymy occur too
infrequently to be a problem, but they are misleading. The more frequently a word is
used, the more polysemous it tends to be [13]. The more common words also tend to
appear in the larger synsets. Thus, it is precisely those nouns that actually get used in
documents that are most likely to have many senses and synonyms.
2.2 Part-of-Speech Taggers
In addition to structural and bibliographic information, many corpora are annotated with
linguistic knowledge. The most basic and common form this annotation takes is marking
up the words in the corpus with their part-of-speech tags. This adds value to the corpus
because, for example, searches can be performed not only on the word-forms as strings
but also on whether they belong to a certain linguistic category. Such tags are typically
taken to be atomic labels attached to words, denoting the part of speech of the word,
together with shallow morphosyntactic information, e.g. they specify the word as a
proper singular noun, or a plural comparative adjective. For English and other Western
European languages, for which most such annotated corpora have been produced, the
tagset size ranges from about forty to several hundred distinct categories [8]. For
example, since "happy" is an adjective, it is tagged with "JJ", the tag representing
adjectives, as shown below; "one-of-a-kind" and "run-of-the-mill" are tagged the same
way. Every word in every document is tagged in this way:
happy/JJ
one-of-a-kind/JJ
run-of-the-mill/JJ

2.3 Stemming
Stemming is a technique for reducing words to their grammatical roots. A stem is the
portion of a word which is left after the removal of its affixes (i.e., prefixes and suffixes).
A typical example of a stem is the word connect which is the stem for variants connected,
connecting, connection, and connections. Stems are thought to be useful because they
reduce variants of the same root word to a common concept. Furthermore, stemming has
the effect of reducing the size of the indexing structure because the number of distinct
words is reduced.
The best known algorithm for stemming is Porter's algorithm [9], introduced by
M.F. Porter. The program is given an explicit list of suffixes and, for each suffix, the
criterion under which it may be removed from a word to leave a valid stem. The main
merits of the present program are that it is small, fast and reasonably simple while the
success rate is reasonably good. It is quite realistic to apply it to every word in a large file
of continuous text.
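The flavor of such a suffix-stripping program can be shown with a drastically simplified sketch; the few rules below are our own invention and capture only the spirit, not the content, of Porter's rule list, which has many more rules and measure-based conditions.

```python
# Each rule is a (suffix, replacement, minimum-stem-length) triple;
# the first applicable rule wins, mimicking an explicit suffix list
# with a removal criterion attached to each suffix.
RULES = [
    ("ions", "", 4),
    ("ing", "", 3),
    ("ed", "", 3),
    ("ion", "", 4),
    ("s", "", 3),
]

def stem(word):
    """Strip the first matching suffix whose removal leaves a long-enough stem."""
    for suffix, repl, min_len in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            return word[: len(word) - len(suffix)] + repl
    return word
```

With these rules the variants connected, connecting, connection, and connections all reduce to the stem connect, as in the example above.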
2.4 Stopwords
Words which are too frequent among the documents are not good discriminators. In fact,
a word which occurs in 80% of the documents in a collection is useless for the purposes
of retrieval or classification. Such words are frequently referred to as stopwords and
should be filtered out. Articles, prepositions and conjunctions, such as “an”, “against”
and “and”, are natural candidates for a stopword list. Removal of stopwords can not only
improve the accuracy of retrieval or classification, but also reduce the size of the
document.
CHAPTER 3 Work related to Word Sense Disambiguation

One of the first problems that is encountered by any natural language processing system
is that of lexical ambiguity, be it syntactic or semantic. The resolution of a word's
syntactic ambiguity has largely been solved in language processing by part-of-speech
taggers which predict the syntactic category of words in text with high levels of accuracy
(for example[14]). The problem of resolving semantic ambiguity is generally known
as word sense disambiguation and has proved to be more difficult than syntactic
disambiguation.
The problem is that words often have more than one meaning, sometimes fairly similar
and sometimes completely different. The meaning of a word in a particular usage can
only be determined by examining its context. This is, in general, a trivial task for the
human language processing system. However, the task has proved difficult for
computers, and some have believed it would never be solved.
However, there have been several advances in word sense disambiguation and we are
now at a stage where lexical ambiguity in text can be resolved with a reasonable degree
of accuracy.
3.1 Survey of Approaches to Word Sense Disambiguation
It is useful to distinguish some different approaches to the word sense disambiguation
problem. In general we can categorize all approaches to the problem into one of three
general strategies: knowledge based, corpus based and hybrid. We shall now go on to
look at each of these three strategies in turn.
3.1.1 Knowledge Based
Under this approach disambiguation is carried out using information from an explicit
lexicon or knowledge base. The lexicon may be a machine readable dictionary, thesaurus
or it may be hand-crafted. This is one of the most popular approaches to word sense
disambiguation and, amongst others, work has been done using existing lexical
knowledge sources such as WordNet [16,17,18,19,5], LDOCE [20,21], and Roget's
International Thesaurus [22].
The information in these resources has been used in several ways, for example Wilks and
Stevenson [23], Harley and Glennon [24] and McRoy [25] all use large lexicons
(generally machine readable dictionaries) and the information associated with the senses
(such as part-of-speech tags, topical guides and selectional preferences) to indicate the
correct sense. The word sense disambiguation algorithm in our work introduced by
Voorhees [5] takes advantage of WordNet and part-of-speech tags. Another approach is
to treat the text as an unordered bag of words where similarity measures are calculated by
looking at the semantic similarity (as measured from the knowledge source) between all
the words in the window regardless of their positions, as was used by Yarowsky [22].
3.1.2 Corpus Based
This approach attempts to disambiguate words using information which is gained by
training on some corpus, rather than taking it directly from an explicit knowledge source.
This training can be carried out on either a disambiguated or raw corpus, where a
disambiguated corpus is one where the semantics of each polysemous lexical item is
marked and a raw corpus one without such marking.
Disambiguated Corpora
This set of techniques requires a training corpus which has already been disambiguated.
In general, a machine learning algorithm of some kind is applied to certain features
extracted from the corpus and used to form a representation of each of the senses. This
representation can then be applied to new instances in order to disambiguate them.
Different researchers have made use of different sets of features; for example, [15] used
local collocates such as the first noun to the left and right, the second word to the left
and right, and so on.
The general problem with these methods is their reliance on disambiguated corpora
which are expensive and difficult to obtain. This has meant that many of these algorithms
have been tested on very small numbers of different words, often as few as 10.
Raw Corpora
It is often difficult to obtain appropriate lexical resources and we have already noted the
difficulty in obtaining disambiguated text for supervised disambiguation. This lack of
resources has led several researchers to explore the use of raw corpora to perform
unsupervised disambiguation. It should be noted that unsupervised disambiguation cannot
actually label specific terms as referring to a specific concept: that would require more
information than is available. What unsupervised disambiguation can achieve is word
sense discrimination, which clusters the instances of a word into distinct categories
without giving those categories labels from a lexicon (such as WordNet synsets).
3.1.3 Hybrid Approaches
These approaches can be properly classified as neither knowledge based nor corpus
based, but use parts of both. A good example is Luk's system [26], which uses the
textual definitions of senses from a machine readable dictionary to identify relations
between senses. He then uses a corpus to calculate mutual information scores between
these related senses in order to discover the most useful information. This allowed Luk to
produce a system which used the information in lexical resources as a way of reducing
the amount of text needed in the training corpus.
CHAPTER 4 Using Hood Algorithm to Disambiguate Word Senses
In this chapter we present our implementation of the algorithm of Voorhees [5] with the
help of WordNet. It is based on the idea that a set of words occurring together in context
will determine appropriate senses for one another despite each individual word being
multiply ambiguous. A common example of this effect ([27]) is the set of nouns base,
bat, glove and hit. While most of these words have several senses, when taken together
the intent is clearly the game of baseball. To exploit this idea automatically, a set of
categories representing the different senses of words needs to be defined. Once such
categories are defined, the number of words in the text that have senses that belong to a
given category is counted. The senses that correspond to the categories with the largest
counts are selected to be the intended sense of the ambiguous words. Obviously, the
category definitions are a critical component.
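The counting idea can be sketched as follows. The sense inventory below is invented for illustration; the actual algorithm derives its categories (hoods) from WordNet, as described later in this chapter.

```python
from collections import Counter

# Toy sense inventory: each word maps to its candidate (sense, category) pairs.
SENSES = {
    "base":  [("base/baseball", "baseball"), ("base/chemistry", "chemistry")],
    "bat":   [("bat/baseball", "baseball"), ("bat/animal", "zoology")],
    "glove": [("glove/baseball", "baseball"), ("glove/clothing", "clothing")],
    "hit":   [("hit/baseball", "baseball"), ("hit/crime", "crime")],
}

def disambiguate(words):
    """Select, for each word, the sense whose category scores highest in context."""
    # Count how many words in the text have a sense in each category.
    votes = Counter(cat for w in words for _, cat in SENSES.get(w, []))
    chosen = {}
    for w in words:
        senses = SENSES.get(w, [])
        if senses:
            # The category with the largest count selects the intended sense.
            chosen[w] = max(senses, key=lambda sc: votes[sc[1]])[0]
    return chosen
```

Taken together, base, bat, glove and hit each contribute a vote to the "baseball" category, so that category outvotes chemistry, zoology, clothing and crime, and the baseball sense is chosen for every word.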
4.1 Converting WordNet into Relational Database
WordNet, a dictionary system built by the Cognitive Science Laboratory of Princeton
University, is stored in flat files, not in a database. To make the implementation easy
and obtain good performance, we converted WordNet into a relational database. The four
relations created for WordNet are shown in Table 4-1 to Table 4-4 (for detailed
definitions of the tables, refer to Appendix A). Each of the four relation schemas is in
third normal form. Relation synsets has 66025 distinct records, relation words has 94473
distinct records, relation synset_word has 116314 distinct records and relation
synset_relations has 86348 distinct records.
1. synsets(synset_id, category, hierarchy, meaning)

synset_id ⎯ a unique decimal integer which represents a synset in WordNet.
category ⎯ a one-character code indicating the synset type. For example, n indicates noun.
hierarchy ⎯ the hierarchy the synset belongs to. In WordNet, the hierarchies for nouns range from 3 to 28.
meaning ⎯ the definition (gloss) for the synset.

This table contains the basic information of each synset in WordNet. However, only the
attribute synset_id is used in our work. Example tuples are shown in Table 4-1:
10947841 is the synset_id for a synset in WordNet; “n” means that this synset is in the
noun division of WordNet (WordNet also has a verb division); 28 means this synset
belongs to hierarchy 28; “meaning” is the gloss for this synset.

synset_id  category  hierarchy  meaning
10947841   n         28         a period of the year marked by special events or activities in some field
6171035    n         14         a committee having supervisory powers

Table 4-1: Relation Definition for “synsets” and Tuples
2. synset_relations(synset_id1, synset_id2, rel_str)

synset_id1 ⎯ a child synset of synset_id2. This table does not store any relationship other than parent-child.
synset_id2 ⎯ a parent synset of synset_id1.
rel_str ⎯ the actual symbol in WordNet that describes the relationship:
  ~: synset_id1 is a hypernym of synset_id2 (synset_id1 is a superordinate <generic> of synset_id2 <specific>)
  @: synset_id1 is a hyponym of synset_id2 (synset_id1 is a subordinate <specific> of synset_id2 <generic>)
  %: synset_id1 is a holonym of synset_id2 (synset_id2 is part of synset_id1)
  #: synset_id1 is a meronym of synset_id2 (synset_id1 is part of synset_id2)
  #p: synset_id1 is part of synset_id2
  #m: synset_id1 is a member of synset_id2
  #s: synset_id1 is the stuff that synset_id2 is made from
  =: synset_id1 has an attribute synset_id2 (synset_id2 is an adjective)
  !: synset_id1 and synset_id2 are antonyms (not stored in this table)

Although there are many kinds of relationships, we can treat all of them as child-parent
relationships. Each synset can have multiple parents or multiple children. Each pair of
child and parent is a tuple in this table. Example tuples are shown in Table 4-2. In this
example, the synsets numbered 10947841 and 10986027 are two children of synset
10843624; “@” means that synset 10947841 is a subordinate of synset 10843624, and
likewise 10986027. This relation is frequently used in our work: we depend on the
parent-child relationship to find the ancestors of a given synset.

synset_id1  synset_id2  rel_str
10947841    10843624    @
10986027    10843624    @
826095      6643534     #p
946936      945703      #m

Table 4-2: Relation Definition for “synset_relations” and Tuples

3. words(word_id, word)
word_id ⎯ a unique decimal integer for each meaning of each word in WordNet.
word ⎯ the word, one of whose meanings is numbered word_id.

Each word in WordNet may have multiple meanings; therefore, every meaning is
assigned a unique identification number word_id. Example tuples are shown in Table
4-3: 92685 is the word_id for one of the 3 meanings of “season”; “board” has 9
meanings, one of which is numbered 11872 and another 11875. This relation is also
frequently used in our work: we depend on it to find all the synsets a given word belongs
to.

word_id  word
92685    season
11875    board
11872    board

Table 4-3: Relation Definition for “words” and Tuples
4. synset_word(synset_id, word_id)

synset_id ⎯ defined in relation synsets.
word_id ⎯ defined in relation words.

This table is the connection between synsets and words. One synset may consist of more
than one word_id, while each word_id is assigned to only one synset. This guarantees
that all different meanings of the same word are separated into different synsets; in other
words, there are no synonymous or polysemous synsets. Hence, every synset represents
a lexicalized concept. Example tuples are shown in Table 4-4. The word “board” is one
of the members of the synset numbered 6171035 because one of its meanings, numbered
11872, is very close in meaning to this synset.

synset_id  word_id
10947841   92685
2581069    11875
6171035    11872

Table 4-4: Relation Definition for “synset_word” and Tuples
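A minimal sketch of the conversion, using SQLite for concreteness (the column types are our own choice; the report does not prescribe a particular DBMS), together with an ancestor query over the parent-child relation:

```python
import sqlite3

# Create the four relations and load the example tuples from Table 4-2.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE synsets (synset_id INTEGER PRIMARY KEY, category TEXT,
                      hierarchy INTEGER, meaning TEXT);
CREATE TABLE synset_relations (synset_id1 INTEGER, synset_id2 INTEGER,
                               rel_str TEXT);
CREATE TABLE words (word_id INTEGER PRIMARY KEY, word TEXT);
CREATE TABLE synset_word (synset_id INTEGER, word_id INTEGER);
""")
db.executemany("INSERT INTO synset_relations VALUES (?, ?, ?)",
               [(10947841, 10843624, "@"), (10986027, 10843624, "@")])

def ancestors(synset_id):
    """Collect all ancestors of a synset by walking the parent-child relation."""
    found, frontier = [], [synset_id]
    while frontier:
        for (parent,) in db.execute(
                "SELECT synset_id2 FROM synset_relations WHERE synset_id1 = ?",
                (frontier.pop(),)):
            found.append(parent)
            frontier.append(parent)
    return found
```

With the Table 4-2 tuples loaded, ancestors(10947841) walks upward through synset_relations and returns the parent synset 10843624.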
All the major information of noun division in WordNet is stored in two files: noun.dat
and noun.idx. The data format in noun.dat is as follows:

synset_id hierarchy category w_cnt word lex_id [word lex_id...] p_cnt [pointer_symbol synset_id pos source/target] | gloss

NOTE:
w_cnt ⎯ number of words in the synset.
lex_id ⎯ a one-digit hexadecimal integer that uniquely identifies a meaning of this word. It usually starts with 0.
p_cnt ⎯ number of pointers from this synset to other synsets.
pointer_symbol ⎯ refer to the definition of relation synset_relations.
pos ⎯ syntactic category, n for noun.
source/target ⎯ a value of 0000 means that pointer_symbol represents a semantic relation between the current (source) synset and the target synset.
For example, in the record below, 10947841 is the synset_id; 28 is the hierarchy; “n”
means noun; 01 indicates there is only one word in this synset; “season” is the word in
this synset; 2 indicates that this meaning is the second meaning of the word “season”;
015 indicates synset 10947841 has 15 pointers to other synsets; one of the 15 target
synsets is its parent synset 10843624, indicated by the relation string “@”, while another
target is its child synset 10946877, indicated by “~”, as are the other 13 targets; the last
part is the definition and example sentences.
10947841 28 n 01 season 2 015 @ 10843624 n 0000 ~ 10946877 n 0000 ~ 10946979 n 0000 ~ 10947079 n 0000 ~ 10948329 n 0000 ~ 10948599 n 0000 ~ 10948843 n 0000 ~ 10948943 n 0000 ~ 10949098 n 0000 ~ 10949208 n 0000 ~ 10949304 n 0000 ~ 10949396 n 0000 ~ 10949521 n 0000 ~ 10949615 n 0000 ~ 10950901 n 0000 | a period of the year marked by special events or activities in some field; "he celebrated his 10th season with the ballet company" or "she always looked forward to the avocado season"
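The record layout above can be parsed mechanically. The sketch below is one way to do it in Python, assuming whitespace-separated fields and a "|" before the gloss as shown; the sample line is an abbreviated version of the "season" record with p_cnt reduced to 002 so only two pointers are listed. w_cnt is read as hexadecimal, following the WordNet file-format documentation (for values this small the base makes no difference).

```python
def parse_noun_dat_line(line):
    """Split one noun.dat record into its fields."""
    data, _, gloss = line.partition("|")
    tokens = data.split()
    rec = {"synset_id": int(tokens[0]),
           "hierarchy": int(tokens[1]),
           "category": tokens[2]}
    w_cnt = int(tokens[3], 16)            # word count is hexadecimal
    pos = 4
    rec["words"] = []
    for _ in range(w_cnt):                # (word, lex_id) pairs
        rec["words"].append((tokens[pos], tokens[pos + 1]))
        pos += 2
    p_cnt = int(tokens[pos])              # number of pointers
    pos += 1
    rec["pointers"] = []
    for _ in range(p_cnt):                # (symbol, target id, pos, source/target)
        rec["pointers"].append((tokens[pos], int(tokens[pos + 1]),
                                tokens[pos + 2], tokens[pos + 3]))
        pos += 4
    rec["gloss"] = gloss.strip()
    return rec

# Abbreviated "season" record (p_cnt cut down to 002 for the example)
line = ("10947841 28 n 01 season 2 002 @ 10843624 n 0000 "
        "~ 10946877 n 0000 | a period of the year marked by special events")
rec = parse_noun_dat_line(line)
print(rec["words"], rec["pointers"][0])
```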
On the other hand, the data format in noun.idx is as follows:
word pos poly_cnt p_cnt [pointer_symbol...] sense_cnt tagsense_cnt
synset_id [synset_id...]
NOTE:
pos ⎯ syntactic category, "n" for noun.
poly_cnt ⎯ number of different meanings (polysemy) the current word has in WordNet. This is the same value as sense_cnt, but is retained for historical reasons.
p_cnt ⎯ number of different types of pointers the current word has in all synsets containing it.
pointer_symbol ⎯ refer to the definition for relation synset_relations.
sense_cnt ⎯ number of different meanings the current word has in WordNet.
tagsense_cnt ⎯ number of meanings of the current word that are ranked according to their frequency of occurrence in semantic concordance texts.
synset_id ⎯ each synset_id in the list corresponds to a different meaning of the current word in WordNet.
For example, in the record below, "seat" is the word; "n" means noun; the first 6 indicates "seat" has 6 senses; 5 means
"seat" has 5 different types of pointers (@, ~, #m, #p, %p) in all the 6 synsets containing
it; the second 6 is the sense_cnt, again telling that "seat" is in 6 synsets, and the 5 after it is the
tagsense_cnt; finally, the synsets containing "seat" are listed one by one.
seat n 6 5 @ ~ #m #p %p 6 5 06368526 04306560 03294261 03293673 06368745 03294658
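The index records can be parsed the same way. A sketch, assuming the whitespace-separated layout described above:

```python
def parse_noun_idx_line(line):
    """Split one noun.idx record into its fields."""
    t = line.split()
    word, pos_tag = t[0], t[1]
    poly_cnt, p_cnt = int(t[2]), int(t[3])
    pointers = t[4:4 + p_cnt]             # the distinct pointer symbols
    i = 4 + p_cnt
    sense_cnt, tagsense_cnt = int(t[i]), int(t[i + 1])
    synset_ids = [int(s) for s in t[i + 2:i + 2 + sense_cnt]]
    return {"word": word, "pointers": pointers,
            "sense_cnt": sense_cnt, "synset_ids": synset_ids}

rec = parse_noun_idx_line(
    "seat n 6 5 @ ~ #m #p %p 6 5 06368526 04306560 03294261 03293673 "
    "06368745 03294658")
print(rec["word"], rec["sense_cnt"])
```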
Pseudo-code 1 shows the steps to convert the data in the two flat files into a relational
database. The for loop from line 1 to line 8 extracts data from noun.dat to construct table
"synsets". The second for loop, from line 9 to line 21, extracts data from noun.dat again to
construct table "synset_relations". The inner loop generates a separate tuple for every
pair of child and parent. That means that if a synset has multiple pointers to other synsets,
there are multiple tuples for it, representing the multiple-children or multiple-parents
relationship.
Then, the code from line 22 to line 32 extracts data from noun.idx to construct tables
"words" and "synset_word". The inner loop generates a separate tuple for every sense in
"words" and "synset_word".
Pseudo-code 1 build_wordnet()
------------------------------------------------------------------------------------------------------------
1 for each line in noun.dat
2 synset_id ← retrieve synset_id
3 hierarchy ← retrieve hierarchy
4 category ← retrieve category
5 skip the next items until gloss
6 meaning ← retrieve gloss
7 insert tuple (synset_id, hierarchy, category, meaning) into table synsets
8 end
9 for each line in noun.dat
10 synset_id ← retrieve synset_id
11 skip the next item until p_cnt
12 num_pointers ← retrieve p_cnt
13 for each pointer
14 relationStr ← retrieve pointer_symbol
15 relationsynset_id ← retrieve synset_id
16 if (synset_id is the parent of relationsynset_id)
17 insert tuple (relationsynset_id, synset_id, relationStr) into table synset_relations
18 else
19 insert tuple (synset_id, relationsynset_id, relationStr) into table synset_relations
20 end
21 end
22 for each line in noun.idx
23 word ← retrieve word
24 skip the next items until sense_cnt
25 numSenses ← retrieve sense_cnt
26 for each sense
27 generate a unique id word_id for this sense
28 insert tuple (word_id, word) into table words
29 synset_id ← retrieve synset_id
30 insert tuple (synset_id, word_id) into table synset_word
31 end
32 end
---------------------------------------------------------------------------------------------------------------------------------
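As a concrete counterpart to Pseudo-code 1, here is a condensed sketch that loads a couple of already-parsed records into the four tables, using SQLite rather than Oracle; the sample records are hypothetical stand-ins for parsed noun.dat/noun.idx lines, and the parsing itself is omitted.

```python
import sqlite3

# Hypothetical parsed records, loosely based on the "season" example above.
dat_records = [
    # (synset_id, hierarchy, category, [(pointer_symbol, target_id)], gloss)
    (10947841, 28, "n", [("~", 10946877)], "a period of the year"),
    (10946877, 28, "n", [("@", 10947841)], "a child synset of season"),
]
idx_records = [
    # (word, [synset_id, ...]) -- one synset per sense
    ("season", [10947841]),
]

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE synsets (synset_id, hierarchy, category, meaning)")
cur.execute("CREATE TABLE synset_relations (synset_id1, synset_id2, rel_str)")
cur.execute("CREATE TABLE words (word_id, word)")
cur.execute("CREATE TABLE synset_word (synset_id, word_id)")

for synset_id, hierarchy, category, pointers, gloss in dat_records:
    cur.execute("INSERT INTO synsets VALUES (?,?,?,?)",
                (synset_id, hierarchy, category, gloss))
    for rel_str, target_id in pointers:
        # "~" marks a child target, "@" a parent; synset_id1 is always the child
        if rel_str == "~":
            cur.execute("INSERT INTO synset_relations VALUES (?,?,?)",
                        (target_id, synset_id, rel_str))
        else:
            cur.execute("INSERT INTO synset_relations VALUES (?,?,?)",
                        (synset_id, target_id, rel_str))

next_word_id = 1
for word, synset_ids in idx_records:
    for synset_id in synset_ids:          # a fresh word_id for every sense
        cur.execute("INSERT INTO words VALUES (?,?)", (next_word_id, word))
        cur.execute("INSERT INTO synset_word VALUES (?,?)",
                    (synset_id, next_word_id))
        next_word_id += 1
```

Because WordNet lists every edge in both endpoints' pointer lists, the same (child, parent) pair arrives twice here, once as "~" and once as "@"; the real schema's primary key on (SYNSET_ID1, SYNSET_ID2) would collapse such duplicates.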
4.2 Algorithm
4.2.1 Hood Construction
Using each separate hierarchy as a category is well defined but too coarse grained. For
example, in Figure 2-1 seven of the eight senses of board are in the {entity, thing}
hierarchy. Similarly, using individual synsets is well defined but too fine grained.
Therefore, this algorithm is intended to define an appropriate middle-level category, the
hood. To define the hood of a given synset, s, consider the set of synsets and the
hyponymy links in WordNet as the set of vertices and directed edges of a graph. Then the
hood of s is the largest connected subgraph that contains s, contains only descendants of
an ancestor of s, and contains no synset that has a descendant that includes another
instance of a member of s as a member. A hood is represented by the synset that is the
root of the hood. In other words, shown in Figure 4-1, assume synset s consists of k
words w(1), w(2), w(3)…w(k), p(1), p(2), p(3)…p(n) are n ancestors of s, where p(m) is a
father of p(m-1). p(m) (m is a number in 1…n) has a descendent synset which also
includes w(j) (j is a number in 1…k)as a member and p(m) is the closest one with this
feature to s . So, p(m-1) is one of the root(s) of the hood(s) of s, as shown in Case 1. If m
is 1, s itself is the root, as shown in Case 2. If no such m is found, the root of this
WordNet hierarchy r is the root of the hood of s, as shown in Case 3. If s itself has a
descendent synset that includes w(j) (j is a number in 1…k) as a member, there is no hood
in WordNet for s, as shown in case 4. Because some synsets have more than one parent,
synsets can have more than one hood. A synset has no hood if the same word is a
member of both the synset and one of its descendents. For example, in Figure 2-1 the
hood of the synset for committee sense of board is rooted at the synset {group, grouping}
(and thus the hood for that sense is the entire hierarchy in which it occurs) because no
other synset containing "board" in this hierarchy (Case 3), the hood for the circuit_board
sense of board is rooted at {circuit, closed_circuit} because the synset
{electrical_device} has a descendent synset {control_panel, display_panel, panel,
board} containing "board" (Case 1), and the hood for the panel sense of board is rooted at
the synset itself because its direct parent synset {electrical_device} has a descendent
synset {circuit_board, circuit_card, board, card} containing "board" (Case 2).
Pseudo-code 2 shows the steps to find the root(s) of the hood(s) for a given synset. The
input for this procedure is a given synset_id, s. The output is the synset_id(s) of the
root(s) of hood(s) for s. The code from line 1 to line 10 retrieves all the synsets that
have at least one member word in common with s and saves them in
a hashtable, synset_id_hashtable. From line 11 to line 22, we get all the ancestors of
every synset in synset_id_hashtable and keep them in another hashtable,
all_ancestors_hashtable. From line 23 to line 43, we find the ancestors of s one
[Figure 4-1: Root of Hood(s) of Synset s ⎯ the four cases: Case 1, the root is p(m-1); Case 2, the root is s itself; Case 3, the root is the hierarchy root r; Case 4, s has no hood.]
by one from the closest to the farthest. Whenever an ancestor a is in
all_ancestors_hashtable (in other words, whenever a has a descendant that includes another
instance of a member of s as a member), its child on the path from s to a is a root of
the hood(s) of s. In our work, we apply the find_hood_root(s) procedure to all the 66025
synsets in WordNet. The output is stored in hood_root.txt for further computation.
Pseudo-code 2 find_hood_root(s)
---------------------------------------------------------------------------------------------------------------------------------
1 word_id_set ← π word_id (σsynset_id=s (synset_word))
2 for each word_id in word_id_set
3 word_set ← π word (σ word_id=word_id ( words))
4 for each word in word_set
5 all_word_id_set ← π word_id (σ word=word ( words))
6 end
7 end
8 for each word_id in all_word_id_set
9 synset_id_hashtable ← π synset_id (σ word_id=word_id ( synset_word))
10 end
11 for each synset_id in synset_id_hashtable except s
12 current_id_hashtable ← synset_id
13 while (current_id_hashtable is not empty)
14 for each synset_id in current_id_hashtable
15 parent_id_hashtable ← π synset_id2 (σ synset_id1=synset_id ( synset_relations))
16 end
17 clear current_id_hashtable
18 copy parent_id_hashtable to current_id_hashtable
19 copy parent_id_hashtable to all_ancestors_hashtable
20 clear parent_id_hashtable
21 end
22 end
23 if (s is in all_ancestors_hashtable)
24 s has no hood in WordNet
25 else
26 current_id_hashtable ← s
27 while (current_id_hashtable is not empty)
28 for each current_synset_id in current_id_hashtable
29 parent_id_hashtable ← π synset_id2 (σ synset_id1=current_synset_id (
synset_relations))
30 for each parent_synset_id in parent_id_hashtable
31 if (parent_synset_id is in all_ancestors_hashtable)
32 root_found ← true
33 root_set ← current_synset_id
34 remove parent_synset_id from parent_id_hashtable
35 break
36 end
37 end
38 clear current_id_hashtable
39 copy parent_id_hashtable to current_id_hashtable
40 clear parent_id_hashtable
41 end
42 if (root_found is false)
43 root_set ← root of this entire hierarchy in WordNet
------------------------------------------------------------------------------------------------------------
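To make the hood computation tangible, the following is a self-contained sketch of the same idea on a toy, single-parent hierarchy modeled on the "board" example. The synset ids, member words, and links here are hypothetical; real WordNet synsets can have several parents, which this simplification ignores, so it returns a single root rather than a set.

```python
# Toy hierarchy (hypothetical ids): 1 is the root of the hierarchy.
members = {
    1: {"entity"},
    2: {"device"},
    3: {"circuit"},
    4: {"circuit_board", "board"},
    5: {"panel", "board"},
}
parent = {2: 1, 3: 2, 4: 3, 5: 2}        # child -> parent links

def ancestors(s):
    """Ancestors of s, ordered from the closest to the farthest."""
    out = []
    while s in parent:
        s = parent[s]
        out.append(s)
    return out

def find_hood_root(s):
    """Return the root of the hood of s, or None if s has no hood (Case 4)."""
    # Every other synset sharing a member word with s ...
    rivals = [t for t in members if t != s and members[t] & members[s]]
    # ... contributes its ancestors to the "blocked" territory.
    blocked = set()
    for t in rivals:
        blocked.update(ancestors(t))
    if s in blocked:                     # a descendant of s repeats a word of s
        return None
    root = s
    for a in ancestors(s):
        if a in blocked:                 # first ancestor covering a rival sense:
            return root                  # the hood root is its child on the path
        root = a
    return root                          # no blocked ancestor: whole hierarchy
```

Here find_hood_root(4) returns 3: the hood of the {circuit_board, board} synset is rooted at {circuit}, one step below the first ancestor ({device}) that also covers the rival {panel, board} sense (Case 1). find_hood_root(5) returns 5 itself (Case 2), and find_hood_root(3), whose word is unambiguous, returns the hierarchy root 1 (Case 3).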
Figure 4-2: The IS-A hierarchy for eight different senses of the noun "board". [Figure omitted; it shows the synsets, with their synset_ids, for the eight noun senses of "board"; refer also to Figure 2-1.]

Let's take synset 2443613 {circuit_board, circuit_card, board, card} as an example
(Figure 4-2, refer to Figure 2-1). All the 9 synsets (6171035, 2493245, 2303171,
2572848, 2581069, 5621336, 10836071, 2443613, 3461955) for "board" are stored in
synset_id_hashtable, as well as those synsets of "circuit_board", "circuit_card" and
"card"; all_ancestors_hashtable contains 6172564, 6020493, 6088087, 2625239,
2560468, 3447223, 2729592, etc., but none of 2443096, 2482181, 3173212 is in it,
because each of these is an ancestor only of {circuit_board, circuit_card, board, card}
and not of any other synset containing "circuit_board", "circuit_card",
"board", or "card". When we follow the parent-child relationship to find ancestors for
2443613, we finally stop at 2625239, because 2625239 is the parent of synset 2493245
{control_panel, display_panel, panel, board} and hence an ancestor of another synset
containing "board". Therefore, the root of the hood for 2443613 is its child on the path, synset 2443096.
4.2.2 Word Sense Disambiguation
After hoods for each synset in WordNet are constructed, they can be used to select the
sense of an ambiguous word in a given text-document. The senses of the nouns in a text-
document of a given collection are selected by the following two-stage process. A
marking procedure that visits synsets and maintains a count of the number of times each
synset is visited is fundamental to both stages. Given a word, the procedure finds all
instances of the word in (the noun portion of) WordNet. For each identified synset, the
procedure follows the IS-A links up to the root of the hierarchy incrementing a counter at
each synset it visits. In the first stage the marking procedure is called once for each
occurrence of a content word (i.e., a word that is not a stop word) in all of the documents
in the collection. The number of times the procedure was called and found the word in
WordNet is also maintained. This produces a set of global counts (relative to this
particular collection) at each synset. In the second stage, the marking procedure is called
once for each occurrence of a content word in an individual text (document or query).
Again the number of times the procedure was called and found the word in WordNet for
the individual text is maintained. This produces a set of local counts at the synsets. Given
the local and global counts, a sense for a particular ambiguous word contained within the
text that generated the locals is selected as follows:
• The difference

                   # local visits       # global visits
      difference = ----------------  -  -----------------
                   # local calls        # global calls

  is computed at the root of the hood for each sense of the word. If a
  sense does not have a hood or if the local count at its hood root is less than two,
  that difference is set to zero. If a sense has multiple hoods, that difference is set to
  the largest difference over the set of hoods.
• The sense corresponding to the hood root with the largest positive difference is
selected as the sense of the word in the text. If no sense has a positive difference,
no WordNet sense is chosen for the word.
Pseudo-code 3 shows the steps to disambiguate the sense of every word in a document.
Pseudo-code 3 disambiguation()
---------------------------------------------------------------------------------------------------------------------------------
global_counts()
For each document in the document collection
local_counts(document)
Load words in this document into word_in_doc_hashtable
Remove stopwords from word_in_doc_hashtable
Remove words that are not in WordNet noun division
For each word in word_in_doc_hashtable
difference(word)
end
end
---------------------------------------------------------------------------------------------------------------------------------
Pseudo-code for global_counts()
---------------------------------------------------------------------------------------------------------------------------------
For each word in the document collection
if (word is not a stopword and word is in WordNet noun division)
marking(word)
#_of_global_calls is incremented by 1
end
---------------------------------------------------------------------------------------------------------------------------------
Pseudo-code for local_counts(document)
---------------------------------------------------------------------------------------------------------------------------------
For each word in this document
if (word is not a stopword and word is in WordNet noun division)
marking(word)
#_of_local_calls is incremented by 1
end
---------------------------------------------------------------------------------------------------------------------------------
Pseudo-code for marking(word)
---------------------------------------------------------------------------------------------------------------------------------
Find all the synset(s) that contain the word and save them in synset_id_hashtable
For each synset in synset_id_hashtable
Find all its ancestors and save in ancestors_hashtable
For each synset in ancestors_hashtable
Increment its counter by 1
end
end
---------------------------------------------------------------------------------------------------------------------------------
Pseudo-code for difference (word)
---------------------------------------------------------------------------------------------------------------------------------
Find all the synset(s) that contain this word and save them in synset_id_hashtable
For each synset in synset_id_hashtable
Find the root(s) of the hood(s) of this synset
if this synset has no hood at all
max_diff = 0
else
For each root
Calculate the diff with that formula described above
Compare diff with max_diff and keep the max_diff
end
end
The true sense of this word in this document is the synset whose hood root has the max_diff.
------------------------------------------------------------------------------------------------------------
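The marking and difference procedures can be sketched end to end on a toy hierarchy (hypothetical synset ids, words, and texts; the hood roots are given directly rather than computed with find_hood_root):

```python
from collections import Counter

# Toy data: 1 is the hierarchy root; "board" is the ambiguous word.
parent = {2: 1, 3: 1, 4: 2, 5: 3}            # child -> parent links
synsets_of = {"board": [4, 5], "wood": [3], "meeting": [2]}
hood_root = {4: 2, 5: 3}                     # sense -> root of its hood

def marking(word, counts):
    """Visit every synset on the path from each sense of `word` up to the
    root, incrementing a counter at each synset visited."""
    for s in synsets_of[word]:
        while True:
            counts[s] += 1
            if s not in parent:
                break
            s = parent[s]

def count_text(words):
    """Return (visit counts, number of marking calls) for a word sequence."""
    counts, calls = Counter(), 0
    for w in words:
        if w in synsets_of:                  # the word was found in WordNet
            marking(w, counts)
            calls += 1
    return counts, calls

collection = ["board", "wood", "wood", "wood", "meeting"]  # all documents
document = ["board", "meeting", "meeting"]                 # one document

g_counts, g_calls = count_text(collection)   # stage 1: global counts
l_counts, l_calls = count_text(document)     # stage 2: local counts

def select_sense(word):
    best, best_diff = None, 0.0
    for s in synsets_of[word]:
        r = hood_root.get(s)
        if r is None or l_counts[r] < 2:     # no hood, or too little evidence
            continue
        diff = l_counts[r] / l_calls - g_counts[r] / g_calls
        if diff > best_diff:                 # keep the largest positive diff
            best, best_diff = s, diff
    return best                              # None if no positive difference

print(select_sense("board"))   # → 4
```

In this toy run the document's repeated "meeting" occurrences inflate the local count at synset 2, the hood root of sense 4 of "board", so that sense wins with a difference of 3/3 - 2/5 = 0.6, while sense 5 is dropped for lack of local evidence.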
The idea behind this disambiguation procedure is to select senses from the areas of the
WordNet hierarchies in which document-induced (local) activity is greater than the
expected (global) activity. The hood construct is designed to provide a point of
comparison that is broad enough to encompass markings from several different words yet
narrow enough to distinguish among senses.
CHAPTER 5 Experiments
In this chapter I shall describe my experiment, which verifies the effectiveness of the hood
algorithm for word sense disambiguation. This experiment is performed on the part-of-
speech tagged Brown Corpus. The flow of the experiment will be described in detail, and I
will report the results of my experiment and analyze their quality.
5.1 Part-of-Speech Tagged Brown Corpus
The Brown Corpus consists of 1,014,312 words of running text of edited English prose
printed in the United States during the calendar year 1961. So far as it has been possible
to determine, the writers were native speakers of American English. This Corpus is
divided into 500 samples of 2000+ words each. Each sample begins at the beginning of a
sentence but not necessarily of a paragraph or other larger division, and each ends at the
first sentence ending after 2000 words. The samples represent a wide range of styles and
varieties of prose. Samples were chosen for their representative quality rather than for
any subjectively determined excellence. A corpus is intended to be "a collection of
naturally occurring language text, chosen to characterize a state or variety of a language"
(Sinclair, 1991). As such, very few of the so-called corpora used in current natural
language processing and speech recognition work deserve the name. For English, the
only true corpus that is widely available is the Brown Corpus. It has been extensively
used for natural language processing work.
A sentence in natural language text is usually composed of nouns, pronouns, articles,
verbs, adjectives, adverbs and connectives. While the words in each grammatical class
are used with a particular purpose, it can be argued that most of the semantics is carried
by noun words. Thus, nouns can be taken out through the systematic elimination of verbs,
adjectives, adverbs, connectives, articles and pronouns.
Therefore, in this experiment, we make use of the part-of-speech tagged Brown Corpus
provided by the Treebank Project, Computer and Information Science Department,
University of Pennsylvania. This document set consists of 479 tagged documents, in which
every word is tagged with its linguistic category.
5.2 Flow of Experiment
Figure 5-1 shows the steps of my experiment. First of all, I convert WordNet from flat files
(noun.dat and noun.idx) to a relational database. Tables are created and all the data
contained in noun.dat and noun.idx are loaded into these tables (see Pseudo-code 1).
Then, for each synset in WordNet, the root(s) of the hood(s) is found and saved in
hood_root.txt. In parallel, for each part-of-speech tagged document in the
Brown Corpus, such as a01, all the tags and non-nouns in a01 are removed and the result
is saved in a01_noun. Second, a01_noun is processed by the stemming algorithm; after
this step, all the words remaining in a01_noun_stem are stems of the words in a01.
Finally, a01_noun_stem is processed by Dr. Voorhees' disambiguation algorithm. The
final result is saved in disambiguation_result_a01, where each word is mapped to
a unique synset that represents the sense in which this word is used in this context.
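The tag-and-non-noun removal step can be sketched as below, assuming Penn-Treebank-style word/TAG tokens; the exact tag set of the tagged Brown Corpus may differ in detail, and keeping only the common-noun tags NN and NNS is an illustrative choice.

```python
NOUN_TAGS = {"NN", "NNS"}   # common-noun tags (illustrative subset)

def extract_nouns(tagged_text):
    """Keep only the nouns from "word/TAG" tokens, dropping the tags."""
    nouns = []
    for token in tagged_text.split():
        word, sep, tag = token.rpartition("/")
        if sep and tag in NOUN_TAGS:
            nouns.append(word.lower())
    return nouns

print(extract_nouns("The/DT board/NN met/VBD in/IN two/CD sessions/NNS ./."))
# → ['board', 'sessions']
```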
1. Convert WordNet from the files noun.dat and noun.idx to a relational database (see Pseudo-code 1).
2. For each synset_id in WordNet, find the root(s) of the hood(s) (see Pseudo-code 2) and save them in the file hood_root.txt.
3. For each tagged document (e.g. a01.txt), remove tags and non-nouns, generating a derivative file of nouns without tags (e.g. a01_noun.txt).
4. Apply the stemming algorithm on each such file, producing the file of stemmed nouns (e.g. a01_noun_stem.txt).
5. Disambiguate each word in this file (see Pseudo-code 3); in the disambiguation result file, each word is mapped to a unique synset.

Figure 5-1 Steps of Experiment
5.3 Quality of Results
The results shown in Table 5-1 are for 50 documents randomly chosen from the Brown
Corpus and processed as shown in Figure 5-1. Since
WordNet provides semantically tagged Brown Corpus files, I compare my results with
the manually identified results.
               # of words assigned the same synset as the manually identified one
Hit Rate = -----------------------------------------------------------------------
                              # of words in the stemmed file

    Hit Rate                       <15%   15%-20%   20%-25%   25%-30%   30%-35%   >40%
    # of docs with this hit rate     1       7        12        16        14       0

Table 5-1 Hit Rate of Experiment for Voorhees' Algorithm.
From this table, we can see that the hit rate is not as high as we expected. No document
scores higher than 40%, and most fall between 15% and 35%. This suggests that Dr. Voorhees'
disambiguation algorithm is not effective at automatically disambiguating word
senses.
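The hit-rate computation itself is a simple ratio; a minimal sketch with hypothetical sense assignments:

```python
def hit_rate(auto, manual):
    """Fraction of words whose automatically selected synset matches
    the manually identified one."""
    hits = sum(1 for w in auto if auto[w] == manual.get(w))
    return hits / len(auto)

# Hypothetical synset assignments for four stemmed words of one document
auto   = {"board": 4, "season": 7, "seat": 2, "wood": 9}
manual = {"board": 4, "season": 7, "seat": 5, "wood": 3}
print(hit_rate(auto, manual))   # → 0.5  (2 of the 4 words match)
```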
5.4 Result Analysis
So far we can say the algorithm does not work well for disambiguating word senses. The
reasons are as follows:
1. Although most of the semantics is carried by the nouns, the verbs, adjectives, and
adverbs are important factors that can help determine the appropriate sense of an
ambiguous word.
2. A word may be used multiple times in one document, with each occurrence
using a different sense. But in this algorithm, each word is mapped to a
unique sense.
3. The stemming algorithm, Porter's algorithm, has bugs. For example, family is
stemmed to famili, and times is stemmed to tim.
4. The part-of-speech tagger used for the Brown Corpus separates words connected with
an underscore. For example, school_board is a single word, but it is
separated into the two words school and board by the tagger. Thus, the sense of
school or board will never hit the manually identified word sense of
school_board.
CHAPTER 6
Conclusion and Future Work

6.1 Conclusion
In this report, I implemented the disambiguation technique introduced by Dr. Ellen M.
Voorhees in her paper Using WordNet to Disambiguate Word Senses for Text Retrieval
([5]). This algorithm was tested on 50 part-of-speech tagged documents from Brown
Corpus. Processed by this technique, each word in any document is mapped to a unique
sense. Further, the effectiveness of this automatic disambiguation algorithm was tested by
comparing with the manual disambiguation results provided by Princeton University.
The results verified Dr. Voorhees' conclusion in her paper that this algorithm is not
sufficient to reliably select the correct sense of a noun from the set of fine sense
distinctions in WordNet.
6.2 Future Work and Application
The algorithm introduced in this report is an automatic technique for disambiguating word
senses. If the four causes of the low hit rate were overcome, its effectiveness would
be much better, which would be a great help for information retrieval: if a retrieval system
indexed documents by the senses of the words they contain and the appropriate senses in the
query could be identified, then irrelevant documents containing query words used in
a different sense would not be retrieved. This technique is also useful for machine
translation, because ambiguity in the source language must be resolved before
correct translation.
Another important area of application is hierarchical text classification. As far as
I know, the hierarchical text classification algorithm has only been applied to a small set of
documents that were manually disambiguated by the WordNet group at Princeton University. The
test size is too small to tell whether it is a good text classification algorithm. Therefore,
assuming the automatic word sense disambiguation technique works well, an unlimited
number of documents can be fed to the hierarchical text classification algorithm, with each
word mapped to a corresponding class according to the sense it truly uses in the
document. This will provide researchers with enough evidence to assess the benefits of
hierarchical text classification.
References

[1] Christiane Fellbaum. WordNet: An Electronic Lexical Database. MIT Press,
1998.
[2] Ricardo Baeza-Yates, Berthier Ribeiro-Neto. Modern Information Retrieval.
ACM Press, 1999.
[3] Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques.
Academic Press, 2001.
[4] Richard K. Belew. Find Out About: Search Engine Technology from a Cognitive
Perspective. Cambridge Univ. Press, 2000.
[5] Ellen M. Voorhees. Using WordNet to Disambiguate Word Senses for Text
Retrieval. SIGIR 1993: 171-180.
[6] Mark Sanderson. Word Sense Disambiguation and Information Retrieval.
Proceedings of the 17th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, pages 142-151, June 1994.
[7] Robert Krovetz, W. Bruce Croft. Lexical Ambiguity And Information Retrieval.
ACM Transactions on Information Systems, 10(2):115-141, April 1992.
[8] Eric Brill. A Simple Rule-Based Part of Speech Tagger. Proceedings of the Third
Annual Conference on Applied Natural Language Processing, ACL. 1992.
[9] M. F. Porter. An Algorithm for Suffix Stripping, 1980.
http://telemat.det.unifi.it/book/2001/wchange/download/stem_porter.html.
[10] The EAGLES Lexicon Interest Group. Word Sense Disambiguation. May 1998.
http://www.ilc.pi.cnr.it/EAGLES96/rep2/node39.html.
[11] Geoffrey Towell, Ellen M. Voorhees. Disambiguating Highly Ambiguous Words.
Computational Linguistics, 24(1):125-146, 1998.
[12] George Miller. Special Issue, WordNet: An on-line lexical database. International
Journal of Lexicography, 3(4), 1990.
[13] G.K. Zipf. The meaning-frequency relationship of words. Journal of General
Psychology, 3:251-256, 1945.
[14] E. Brill. Transformation-based error-driven learning and natural language
processing: A case study in part of speech tagging. Computational Linguistics,
21(4):543-566, December 1995.
[15] P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer. Word sense
disambiguation using statistical methods. In Proceedings of the 29th Meeting of
the Association for Computational Linguistics (ACL-91), pages 264-270,
Berkeley, CA, 1991.
[16] Agirre, E. and Rigau, G. Word sense disambiguation using conceptual density. In
Proceedings of COLING, 1996.
[17] Ribas-95 Ribas, F. On learning more appropriate selectional restrictions. In
Proceedings of the Seventh Conference of the European Chapter of the
Association for Computational Linguistics, pages 112-118, 1995.
[18] Richardson, R. and Smeaton, A. Using wordnet in a knowledge-based approach
to information retrieval. In Proceedings of the BCS-IRSG Colloquium, Crewe,
1995.
[19] Sussna, M. Word sense disambiguation for free-text indexing using a massive
semantic network. Proceedings of the 2nd International Conference on
Information and Knowledge Management. Arlington, Virginia, USA, 1993.
[20] Cowie, J. and Lehnert, W. Information extraction. Communications of the ACM,
39(1):80-91, 1996.
[21] J. Guthrie, L. Guthrie, Y. Wilks and H. Aidinejad. Subject-Dependent
Cooccurrence and Word Sense Disambiguation, ACL-91, pp. 146-152,1991.
[22] D. Yarowsky. Word-sense disambiguation using statistical models of Roget's
categories trained on large corpora. In Proceedings of the 14th International
Conference on Computational Linguistics (COLING-92), pages 454-460, Nantes,
France, 1992.
[23] Y. Wilks and M. Stevenson. The Grammar of Sense: using part-of-speech tags as
a first step in semantic disambiguation. Journal of Natural Language Engineering,
4(3), 1997.
[24] A. Harley and D. Glennon. Sense tagging in action: Combining different tests
with additive weights. In Proceedings of the SIGLEX Workshop ``Tagging Text
with Lexical Semantics'', pages 74-78. Association for Computational Linguistics,
Washington, D.C., 1997.
[25] S. McRoy. Using multiple knowledge sources for word sense disambiguation.
Computational Linguistics, 18(1):1-30, 1992.
[26] A. Luk. Statistical sense disambiguation with relatively small corpora using
dictionary definitions. In Proceedings of the 33rd Meeting of the Association for
Computational Linguistics (ACL-95), pages 181-188, Cambridge, MA, 1995.
[27] Gerard Salton and Michael E. Lesk. Information analysis and dictionary
construction. In Gerard Salton, editor, The SMART Retrieval System:
Experiments in Automatic Document Processing, chapter 6, pages 115-142.
Prentice-Hall, Inc. Englewood Cliffs, New Jersey, 1971.
[28] Bin Chen. Sampling and Text Classification Techniques for Data Mining. Ph.D.
Thesis, ECE dept., Northwestern University, 2001.
[29] Beatrice Santorini. Part-of-Speech Tagging Guidelines for the Penn Treebank
Project. June, 1990.
[30] W. N. Francis, H. Kucera. Brown Corpus Manual.
http://www.hit.uib.no/icame/brown/bcm.html, 1979.
Appendix A: Definition of Tables

-- ORACLE WORDNET TABLES

-- NOTE: CATEGORY is the category the synset belongs to.
--   1 Noun
--   2 Adjective
--   4 Verb
--   8 Adverb
-- NOTE: HIERARCHY is the hierarchy the synset belongs to.
--   In WordNet, the hierarchies for nouns range from 3 to 28 (second column of noun.dat).
--   Hierarchy 0 defaults to "unknown" or "undefined".
CREATE TABLE SYNSETS (
    SYNSET_ID int CONSTRAINT nn_synsets_synset_id NOT NULL,
    CATEGORY int CONSTRAINT nn_synsets_category NOT NULL,
    HIERARCHY int CONSTRAINT nn_synsets_hierarchy NOT NULL,
    MEANING varchar2(4000) NULL,
    CONSTRAINT pk_synsets PRIMARY KEY (SYNSET_ID)
);
CREATE INDEX idx_synsets_hierarchy on SYNSETS (hierarchy);

-- NOTE: REL_STR indicates the relation of ID1 with regard to ID2.
--   We require that ID1 is always a child of ID2. This table does not store
--   any relationship other than parent-child.
--   REL_STR is the actual symbol in WordNet that describes the relationship.
--   ~:  ID1 is a hypernym of ID2 (ID1 is a superordinate <generic> of ID2 <specific>)
--   @:  ID1 is a hyponym of ID2 (ID1 is a subordinate <specific> of ID2 <generic>)
--   %:  ID1 is a holonym of ID2 (ID2 is part of ID1)
--   #:  ID1 is a meronym of ID2 (ID1 is part of ID2)
--   #p: ID1 is part of ID2
--   #m: ID1 is a member of ID2
--   #s: ID1 is the stuff that ID2 is made from
--   =:  ID1 has an attribute ID2 (ID2 is an adjective)
--   !:  ID1 and ID2 are antonyms (not stored in this table)
CREATE TABLE SYNSET_RELATIONS (
    SYNSET_ID1 int CONSTRAINT nn_synset_rel_synset_id1 NOT NULL,
    SYNSET_ID2 int CONSTRAINT nn_synset_rel_synset_id2 NOT NULL,
    REL_STR varchar2(10) CONSTRAINT nn_synset_rel_rel_str NOT NULL,
    CONSTRAINT pk_synset_rel PRIMARY KEY (SYNSET_ID1, SYNSET_ID2),
    CONSTRAINT fk_synset_rel_synset_id1 FOREIGN KEY (SYNSET_ID1)
        REFERENCES SYNSETS (SYNSET_ID) ON DELETE CASCADE,
    CONSTRAINT fk_synset_rel_synset_id2 FOREIGN KEY (SYNSET_ID2)
        REFERENCES SYNSETS (SYNSET_ID) ON DELETE CASCADE
);
CREATE INDEX idx_synset_rel_synset_id2 on SYNSET_RELATIONS (synset_id2);

CREATE TABLE WORDS (
    WORD_ID int CONSTRAINT nn_words_word_id NOT NULL,
    WORD varchar2(200) CONSTRAINT nn_words_word NOT NULL,
    CONSTRAINT pk_words PRIMARY KEY (WORD_ID)
);

CREATE TABLE SYNSET_WORD (
    SYNSET_ID int CONSTRAINT nn_sw_synset_id NOT NULL,
    WORD_ID int CONSTRAINT nn_sw_word_id NOT NULL,
    CONSTRAINT pk_sw PRIMARY KEY (SYNSET_ID, WORD_ID),
    CONSTRAINT fk_sw_synset_id FOREIGN KEY (SYNSET_ID)
        REFERENCES SYNSETS (SYNSET_ID) ON DELETE CASCADE,
    CONSTRAINT fk_sw_word_id FOREIGN KEY (WORD_ID)
        REFERENCES WORDS (WORD_ID) ON DELETE CASCADE
);
CREATE INDEX idx_sw_word_id on SYNSET_WORD (word_id);