Using WordNet to Disambiguate Word Senses
by
Ying Liu
Electrical and Computer Engineering
Acknowledgements
I would like to first thank Prof. Peter Scheuermann without whose constant guidance,
support and encouragement, this work would not have been possible. I would also like to
thank Bin Chen who gladly discussed various issues related to my work with me. This
work is the result of many insightful discussions that I have had with Prof. Scheuermann
who inspired me all through and also guided me as and when required. I would also like
to thank the members of the Database System group for their friendship and help. They
are Shayan Zaidi, Mehmet Sayal, Olga Shumsky, and Chris Fernandes.
Further, I would like to thank Dr. Ellen M. Voorhees for her suggestions. Finally, I would
like to thank my parents, Zongli Liu and Huifang Xu, who have guided me all through my
life. I would like to thank them for all the love, encouragement and virtues that I
received while growing up.
Introduction
  1.1 Motivation
  1.2 Contribution
  1.3 Organization
Background Knowledge
  2.1 WordNet™
  2.2 Part-of-Speech Taggers
  2.3 Stemming
  2.4 Stopwords
Work related to Word Sense Disambiguation
  3.1 Survey of Approaches to Word Sense Disambiguation
    3.1.1 Knowledge Based
    3.1.2 Corpus Based
      Disambiguated Corpora
      Raw Corpora
    3.1.3 Hybrid Approaches
Using Hood Algorithm to Disambiguate Word Senses
  4.1 Converting WordNet into Relational Database
  4.2 Algorithm
    4.2.1 Hood Construction
    4.2.2 Word Sense Disambiguation
Experiments
  5.1 Part-of-Speech Tagged Brown Corpus
  5.2 Flow of Experiment
  5.3 Quality of Results
  5.4 Result Analysis
Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work and Application
References
Appendix A: Definition of Tables
CHAPTER 1 Introduction

Text retrieval, also known as document or information retrieval, is concerned with
locating natural language documents whose contents satisfy a user's information need.
Unfortunately, the Internet today contains billions of documents, many of which have no
abstracts or even titles. Therefore, there is considerable interest in developing
techniques that automatically index full-text documents and provide access to
heterogeneous collections of full-text documents.
1.1 Motivation
Search engines are valuable tools that help users find desired documents. Whenever a user
submits a set of query keywords, documents that contain some or all of the keywords are
returned. However, these search engines cannot answer all queries well. For
instance, most web users have experienced the frustration of being returned a large
number of web pages and having to wade through many unrelated pages to
identify the useful ones. Sometimes not only is the number of returned documents large, but
the categories identified by the search engine are also irrelevant.
Let’s look at an example. If a computer hardware engineer wants to search for documents
related to "board", Yahoo! returns the following categories:
Figure 1-1: Yahoo! Category Matches (1 - 8 of 2394)

Only the first 8 of the 2394 matches are listed here. The results are organized in a
hierarchical structure: in the first row, “Recreation” is the top category, “Games” is the
second-level category, …, and “Board Games” is the category or web page that contains the
keyword “board”. There are only 2 categories related to circuit_board. Obviously, it is a
big burden for the computer hardware engineer to sift out the documents that he is really
interested in from such a large number of categories; remember that within each
category there may be numerous web sites. Although the user is only interested in the
“circuit_board” meaning of “board”, the search engine returns all the documents that
contain “board”. To explain why search engines sometimes cannot generate satisfactory
categories, we will review how their hierarchical classification structures are generated.
Most hierarchical categories or classes employed by search engines were either manually
set up or automatically constructed by data clustering algorithms. Since the class
hierarchies generated by clustering algorithms lack semantic information, it is likely that
they perform poorly when the number of query terms is small or a query term has more
than one meaning. Although manually constructed hierarchies of classes normally have
higher accuracy, there are also a number of problems with them. First, manually
constructed classes are not concept oriented, that is, for each keyword more than one
class can have the keyword as name or label. For example, there are multiple classes of
“board” in Figure 1-1. Consequently, users have to explore a huge number of categories
in order to identify the desired pages. Secondly, since the hierarchies are often
maintained by a group of people, over time the update procedure is prone to conflicting
classification criteria.
To overcome the disadvantages caused by manually constructed classes, an algorithm
that constructs a hierarchical classification model based on keywords and their
relationships from thesauri is proposed. Specifically, each class corresponds to one
concept since in human memory different keywords are used to represent different
objects, ideas, or activities. The topics of documents in each class are similar. The
hierarchical structure is maintained via “IS-A” or “PART-OF” relationships between
classes, e.g., class “homer” is “PART-OF” class “baseball”, hence class “baseball” is a
super class of “homer”. The advantages of such a novel class hierarchy can be
summarized as follows:
1. Each class name corresponds to one word (actually a concept, or keyword sense),
suitable for keyword-based queries.
2. The relationships between classes are semantically defined by the thesauri; therefore,
the hierarchy is much more stable than traditional hierarchical classes.
With this thesaurus-based hierarchy, documents are then mapped to the class hierarchy.
During the mapping, a threshold min_sim [28] is employed to determine whether a
document and a class are similar to each other or not. After documents are mapped to the
class hierarchy, class representative vectors [28] are adjusted to reflect the topics of the
documents. Next, documents are re-mapped by using the adjusted class representative
vectors. This re-mapping may be iterated a number of times. Then classes
that contain too few documents are removed by a hierarchy refinement procedure [28].
The resulting class hierarchy and document mapping form the final hierarchical
classification.
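The mapping loop described above can be sketched as follows. This is a minimal illustration only: the cosine measure, the tiny vectors in the usage example, and the particular min_sim value are our own assumptions, not the actual procedure of [28].

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def map_documents(docs, class_vectors, min_sim=0.5, rounds=2, min_docs=1):
    """Iteratively map documents to classes and re-centre the class vectors."""
    mapping = {}
    for _ in range(rounds):
        # Map each document to every class it resembles closely enough.
        mapping = {d: [c for c, cv in class_vectors.items()
                       if cosine(dv, cv) >= min_sim]
                   for d, dv in docs.items()}
        # Adjust each class representative toward its mapped documents.
        for c in list(class_vectors):
            members = [docs[d] for d, cs in mapping.items() if c in cs]
            if members:
                merged = {}
                for dv in members:
                    for t, w in dv.items():
                        merged[t] = merged.get(t, 0.0) + w / len(members)
                class_vectors[c] = merged
    # Refinement: drop classes that attracted too few documents.
    kept = {c for c in class_vectors
            if sum(c in cs for cs in mapping.values()) >= min_docs}
    return {d: [c for c in cs if c in kept] for d, cs in mapping.items()}
```

For instance, a document weighted toward "board" and "circuit" would map to a hypothetical "circuit_board" class rather than a "board_game" class once min_sim is set high enough to exclude the weaker match.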
Fortunately, WordNet, an electronic dictionary developed at Princeton University, is a
concept-based dictionary whose lexical relations include “IS-A” and “PART-OF”. It is
used as the framework for the proposed hierarchical classification. Assuming that the
class hierarchy is already constructed, what we need to do is classify documents into
their appropriate classes. Polysemy, which is defined as a single word form having more
than one meaning, causes false classification. For example, if we failed to tell which
meaning of “board” is used in a given situation, we would probably map the document to
a wrong class. Synonymy, which is defined as multiple words having the same meaning, causes
true conceptual mappings to be missed. Therefore, the critical step of classification is to
recognize synonyms and to detect which of a word's meanings is used, for each word in
each document. For example, if we failed to recognize that “notebook” and “laptop” mean the
same thing, all the documents that use “notebook” in place of “laptop” would be
left out of the class “laptop”.
The issue is how to automatically detect polysemes and synonyms. In principle, polysemes
and synonyms can be handled by assigning the different senses of a word to different
concept identifiers and assigning the same concept identifier to synonyms. In practice,
this requires procedures that not only are able to recognize synonyms but also can detect
uses of different senses of a word.
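The idea of concept identifiers can be illustrated with a toy table; all the words, sense labels, and identifier values below are invented for illustration.

```python
# Toy concept table: every sense of a word gets its own concept identifier,
# and synonymous senses of different words share one identifier.
CONCEPTS = {
    ("board", "circuit"): 101,    # circuit-board sense of "board"
    ("card", "circuit"): 101,     # synonymous sense of "card"
    ("board", "committee"): 102,  # governing-board sense of "board"
    ("board", "timber"): 103,     # sawn-timber sense of "board"
    ("plank", "timber"): 103,     # "plank" shares this concept
}

def concept_id(word, sense_hint):
    """Map a (word, disambiguated sense) pair to its concept identifier."""
    return CONCEPTS.get((word, sense_hint))
```

Because "board" and "plank" share identifier 103, documents using either word land in the same concept, while the two senses of "board" stay apart under 101 and 102.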
1.2 Contribution
In this report, we implemented the disambiguation algorithm introduced by Ellen M.
Voorhees in her paper Using WordNet to Disambiguate Word Senses for Text Retrieval
[5]. This algorithm automatically detects and resolves the senses of the polysemous
nouns occurring in the texts of documents and queries. Each word processed by this
technique is mapped to the unique concept that represents the meaning intended in its
context. However, Voorhees did not apply this idea to full text documents. We applied
the algorithm to a set of documents, the Brown Corpus, one of the most widely used
document collections in a variety of fields. Finally, we tested the effectiveness of this
automatic disambiguation algorithm by comparing its output with the manual
disambiguation results offered by Princeton University. Our experiments verified Dr.
Voorhees’ conclusion in her paper [5] that this algorithm is not sufficient to reliably
select the correct sense of a noun from the set of fine sense distinctions in WordNet.
1.3 Organization
The remainder of this report is organized as follows. Chapter 2 gives some background
on text retrieval and WordNet. Chapter 3 surveys the work done in the area of sense
disambiguation. Chapter 4 explains the algorithm introduced by Dr. Voorhees in detail:
the first section explains the hood construction part of the algorithm, and the second
section explains the word sense disambiguation part. Chapter 5 presents our
experimental results, together with a qualitative analysis of the algorithm. Chapter 6
draws conclusions from this work. Finally, we comment on the future work that can be
explored in this area and its potential applications.
CHAPTER 2 Background Knowledge
In our work, we apply the algorithm to the Brown Corpus. We downloaded the part-of-
speech tagged Brown Corpus from the University of Pennsylvania. First, we remove all
the tags and all non-noun words, since most of the semantics is carried by nouns [2].
Secondly, we convert each word to its stem with Porter's algorithm. Thirdly, we remove
all words that are not in WordNet or that appear in the stopword list. Thus, after these
three processing steps every document in the corpus is represented only by its valid
nouns. Finally, our algorithm is run and the results are analyzed.
To help the reader follow our work, this chapter gives some of the background
knowledge involved in it.
2.1 WordNet™
WordNet is a manually-constructed lexical system developed by George Miller and his
colleagues at the Cognitive Science Laboratory at Princeton University [12]. Originating
from a project whose goal was to produce a dictionary that could be searched
conceptually instead of only alphabetically, WordNet evolved into a system that reflects
current psycholinguistic theories about how humans organize their lexical memories.
In WordNet, the basic building block is the synset, or synonym set, consisting of all the
words that express a given concept. Senses are manually classified into synsets; within
each synset the senses, although belonging to different words, denote the same meaning.
For example, “board” has several senses, and so does “plank”. Each of the two words
has a sense that means “a stout length of sawn timber; made in a wide variety of sizes
and used for many purposes”. The synset corresponding to this sense is composed of
“board” and “plank”. In this example, the given senses of “plank” and “board” are
synonymous and form one synset. Because all synonymous senses are grouped into one
synset and all different senses of the same word are separated into different synsets, there
are no synonymous or polysemous synsets. Hence, every synset represents a lexicalized
concept.
There are four main divisions in WordNet, one each for nouns, verbs, adjectives and
adverbs. Within a division, synsets are organized by the lexical relations defined on them.
For nouns, the only division of WordNet used in our work, the lexical relations include
the “IS-A” and “PART-OF” relations. For example, Figure 2-1 shows the hierarchy
relating the eight different senses of the noun “board”. The synsets with the heavy
border are the actual senses of “board”, and the remaining synsets are either ancestors or
descendants of one of the senses. The synsets {group, grouping} and {entity, thing} in
Figure 2-1 are examples of heads of the hierarchies. Other heads include {act,
human_action, human_activity}, {abstraction}, {possession} and {psychological_feature}.
Figure 2-1: The IS-A hierarchy for eight different senses of the noun “board”.
[Figure not reproduced here: the diagram shows the synsets for the eight senses of
“board” (among them {board, plank, deal} and {circuit_board, circuit_card, board,
card}), drawn with heavy borders, together with their ancestors and descendants up to
the hierarchy heads {entity, thing} and {group, grouping}.]
WordNet 1.6 (2000), the version of WordNet used in this work, contains 94473 words
and 116314 senses in the noun division. Because synsets contain only strict synonyms,
the majority of synsets are quite small. Similarly, the average number of senses per word
is close to one. These figures seem to suggest that polysemy and synonymy occur too
infrequently to be a problem, but they are misleading. The more frequently a word is
used, the more polysemous it tends to be [13]. The more common words also tend to
appear in the larger synsets. Thus, it is precisely those nouns that actually get used in
documents that are most likely to have many senses and synonyms.
2.2 Part-of-Speech Taggers
In addition to structural and bibliographic information, many corpora are annotated with
linguistic knowledge. The most basic and common form this annotation takes is marking
up the words in the corpus with their part-of-speech tags. This adds value to the corpus
because, for example, searches can be performed not only on the word-forms as strings
but also on whether they belong to a certain linguistic category. Such tags are typically
taken to be atomic labels attached to words, denoting the part of speech of the word,
together with shallow morphosyntactic information, e.g. they specify the word as a
proper singular noun, or a plural comparative adjective. For English and other Western
European languages, for which most such annotated corpora have been produced, the
tagset size ranges from about forty to several hundred distinct categories [8]. For
example, since "happy" is an adjective, it is tagged with "JJ", the tag representing
adjectives, as shown below; "one-of-a-kind" and "run-of-the-mill" are tagged the same
way. Every word in every document is tagged in this way:
happy/JJ
one-of-a-kind/JJ
run-of-the-mill/JJ

2.3 Stemming
Stemming is a technique for reducing words to their grammatical roots. A stem is the
portion of a word which is left after the removal of its affixes (i.e., prefixes and suffixes).
A typical example of a stem is the word connect which is the stem for variants connected,
connecting, connection, and connections. Stems are thought to be useful because they
reduce variants of the same root word to a common concept. Furthermore, stemming has
the effect of reducing the size of the indexing structure because the number of distinct
words is reduced.
The best known algorithm for stemming is Porter's algorithm [9], introduced by
M.F. Porter. The program is given an explicit list of suffixes and, for each suffix, the
criterion under which it may be removed from a word to leave a valid stem. The main
merits of the present program are that it is small, fast and reasonably simple while the
success rate is reasonably good. It is quite realistic to apply it to every word in a large file
of continuous text.
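The flavor of such a suffix-stripping program can be shown with a drastically simplified sketch; the few rules below are our own invention and capture only the spirit, not the content, of Porter's rule list, which has many more rules and measure-based conditions.

```python
# Each rule is a (suffix, replacement, minimum-stem-length) triple;
# the first applicable rule wins, mimicking an explicit suffix list
# with a removal criterion attached to each suffix.
RULES = [
    ("ions", "", 4),
    ("ing", "", 3),
    ("ed", "", 3),
    ("ion", "", 4),
    ("s", "", 3),
]

def stem(word):
    """Strip the first matching suffix whose removal leaves a long-enough stem."""
    for suffix, repl, min_len in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            return word[: len(word) - len(suffix)] + repl
    return word
```

With these rules the variants connected, connecting, connection, and connections all reduce to the stem connect, as in the example above.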
2.4 Stopwords
Words which are too frequent among the documents are not good discriminators. In fact,
a word which occurs in 80% of the documents in a collection is useless for the purposes
of retrieval or classification. Such words are frequently referred to as stopwords and
should be filtered out. Articles, prepositions and conjunctions, such as “an”, “against”
and “and”, are natural candidates for a stopword list. Removal of stopwords can not only
improve the accuracy of retrieval or classification, but also reduce the size of the
document.
CHAPTER 3 Work related to Word Sense Disambiguation

One of the first problems that is encountered by any natural language processing system
is that of lexical ambiguity, be it syntactic or semantic. The resolution of a word's
syntactic ambiguity has largely been solved in language processing by part-of-speech
taggers which predict the syntactic category of words in text with high levels of accuracy
(for example[14]). The problem of resolving semantic ambiguity is generally known
as word sense disambiguation and has proved to be more difficult than syntactic
disambiguation.
The problem is that words often have more than one meaning, sometimes fairly similar
and sometimes completely different. The meaning of a word in a particular usage can
only be determined by examining its context. This is, in general, a trivial task for the
human language processing system. However, the task has proved difficult for
computers, and some have believed it would never be solved.
However, there have been several advances in word sense disambiguation and we are
now at a stage where lexical ambiguity in text can be resolved with a reasonable degree
of accuracy.
3.1 Survey of Approaches to Word Sense Disambiguation
It is useful to distinguish some different approaches to the word sense disambiguation
problem. In general we can categorize all approaches to the problem into one of three
general strategies: knowledge based, corpus based and hybrid. We shall now go on to
look at each of these three strategies in turn.
3.1.1 Knowledge Based
Under this approach disambiguation is carried out using information from an explicit
lexicon or knowledge base. The lexicon may be a machine readable dictionary, thesaurus
or it may be hand-crafted. This is one of the most popular approaches to word sense
disambiguation and, amongst others, work has been done using existing lexical
knowledge sources such as WordNet [16,17,18,19,5], LDOCE [20,21], and Roget's
International Thesaurus [22].
The information in these resources has been used in several ways, for example Wilks and
Stevenson [23], Harley and Glennon [24] and McRoy [25] all use large lexicons
(generally machine readable dictionaries) and the information associated with the senses
(such as part-of-speech tags, topical guides and selectional preferences) to indicate the
correct sense. The word sense disambiguation algorithm in our work introduced by
Voorhees [5] takes advantage of WordNet and part-of-speech tags. Another approach is
to treat the text as an unordered bag of words where similarity measures are calculated by
looking at the semantic similarity (as measured from the knowledge source) between all
the words in the window regardless of their positions, as was used by Yarowsky [22].
3.1.2 Corpus Based
This approach attempts to disambiguate words using information which is gained by
training on some corpus, rather than taking it directly from an explicit knowledge source.
This training can be carried out on either a disambiguated or raw corpus, where a
disambiguated corpus is one where the semantics of each polysemous lexical item is
marked and a raw corpus one without such marking.
Disambiguated Corpora
This set of techniques requires a training corpus which has already been disambiguated.
In general, a machine learning algorithm of some kind is applied to certain features
extracted from the corpus and used to form a representation of each of the senses. This
representation can then be applied to new instances in order to disambiguate them.
Different researchers have made use of different sets of features; for example, [15] used
local collocates such as the first noun to the left and right, the second word to the left
and right, and so on.
The general problem with these methods is their reliance on disambiguated corpora
which are expensive and difficult to obtain. This has meant that many of these algorithms
have been tested on very small numbers of different words, often as few as 10.
Raw Corpora
It is often difficult to obtain appropriate lexical resources and we have already noted the
difficulty in obtaining disambiguated text for supervised disambiguation. This lack of
resources has led several researchers to explore the use of raw corpora to perform
unsupervised disambiguation. It should be noted that unsupervised disambiguation cannot
actually label specific terms as referring to a specific concept: that would require more
information than is available. What unsupervised disambiguation can achieve is word
sense discrimination, which clusters the instances of a word into distinct categories
without giving those categories labels from a lexicon (such as WordNet synsets).
3.1.3 Hybrid Approaches
These approaches can be properly classified as neither knowledge based nor corpus
based, but use parts of both. A good example is Luk's system [26], which uses the
textual definitions of senses from a machine readable dictionary to identify relations
between senses. He then uses a corpus to calculate mutual information scores between
these related senses in order to discover the most useful information. This allowed Luk to
produce a system which used the information in lexical resources as a way of reducing
the amount of text needed in the training corpus.
CHAPTER 4 Using Hood Algorithm to Disambiguate Word Senses
In this chapter we present our implementation of the algorithm of Voorhees [5] with the
help of WordNet. It is based on the idea that a set of words occurring together in context
will determine appropriate senses for one another despite each individual word being
multiply ambiguous. A common example of this effect ([27]) is the set of nouns base,
bat, glove and hit. While most of these words have several senses, when taken together
the intent is clearly the game of baseball. To exploit this idea automatically, a set of
categories representing the different senses of words needs to be defined. Once such
categories are defined, the number of words in the text that have senses that belong to a
given category is counted. The senses that correspond to the categories with the largest
counts are selected to be the intended sense of the ambiguous words. Obviously, the
category definitions are a critical component.
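The counting idea can be sketched as follows. The sense inventory below is invented for illustration; the actual algorithm derives its categories (hoods) from WordNet, as described later in this chapter.

```python
from collections import Counter

# Toy sense inventory: each word maps to its candidate (sense, category) pairs.
SENSES = {
    "base":  [("base/baseball", "baseball"), ("base/chemistry", "chemistry")],
    "bat":   [("bat/baseball", "baseball"), ("bat/animal", "zoology")],
    "glove": [("glove/baseball", "baseball"), ("glove/clothing", "clothing")],
    "hit":   [("hit/baseball", "baseball"), ("hit/crime", "crime")],
}

def disambiguate(words):
    """Select, for each word, the sense whose category scores highest in context."""
    # Count how many words in the text have a sense in each category.
    votes = Counter(cat for w in words for _, cat in SENSES.get(w, []))
    chosen = {}
    for w in words:
        senses = SENSES.get(w, [])
        if senses:
            # The category with the largest count selects the intended sense.
            chosen[w] = max(senses, key=lambda sc: votes[sc[1]])[0]
    return chosen
```

Taken together, base, bat, glove and hit each contribute a vote to the "baseball" category, so that category outvotes chemistry, zoology, clothing and crime, and the baseball sense is chosen for every word.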
4.1 Converting WordNet into Relational Database
WordNet, a dictionary system built by the Cognitive Science Laboratory of Princeton
University, is stored in flat files, not in a database. To make the implementation easy
and obtain good performance, we converted WordNet into a relational database. The four
relations created for WordNet are shown in Table 4-1 to Table 4-4 (for detailed
definitions of the tables, refer to Appendix A). Each of the four relation schemas is in
third normal form. Relation synsets has 66025 distinct records, relation words has 94473
distinct records, relation synset_word has 116314 distinct records and relation
synset_relations has 86348 distinct records.
1. synsets(synset_id, category, hierarchy, meaning)

synset_id ⎯ a unique decimal integer which represents a synset in WordNet.
category ⎯ a one-character code indicating the synset type. For example, n indicates noun.
hierarchy ⎯ the hierarchy the synset belongs to. In WordNet, the hierarchies for nouns range from 3 to 28.
meaning ⎯ the definition (gloss) for the synset.

This table contains the basic information of each synset in WordNet. However, only the
attribute synset_id is used in our work. Example tuples are shown in Table 4-1:
10947841 is the synset_id for a synset in WordNet; “n” means that this synset is in the
noun division of WordNet (WordNet also has a verb division); 28 means this synset
belongs to hierarchy 28; “meaning” is the gloss for this synset.

synset_id  category  hierarchy  meaning
10947841   n         28         a period of the year marked by special events or activities in some field
6171035    n         14         a committee having supervisory powers

Table 4-1: Relation Definition for “synsets” and Tuples
2. synset_relations(synset_id1, synset_id2, rel_str)

synset_id1 ⎯ a child synset of synset_id2. This table does not store any relationship other than parent-child.
synset_id2 ⎯ a parent synset of synset_id1.
rel_str ⎯ the actual symbol in WordNet that describes the relationship:
  ~: synset_id1 is a hypernym of synset_id2 (synset_id1 is a superordinate <generic> of synset_id2 <specific>)
  @: synset_id1 is a hyponym of synset_id2 (synset_id1 is a subordinate <specific> of synset_id2 <generic>)
  %: synset_id1 is a holonym of synset_id2 (synset_id2 is part of synset_id1)
  #: synset_id1 is a meronym of synset_id2 (synset_id1 is part of synset_id2)
  #p: synset_id1 is part of synset_id2
  #m: synset_id1 is a member of synset_id2
  #s: synset_id1 is the stuff that synset_id2 is made from
  =: synset_id1 has an attribute synset_id2 (synset_id2 is an adjective)
  !: synset_id1 and synset_id2 are antonyms (not stored in this table)

Although there are many kinds of relationships, we can treat all of them as child-parent
relationships. Each synset can have multiple parents or multiple children. Each pair of
child and parent is a tuple in this table. Example tuples are shown in Table 4-2. In this
example, the synsets numbered 10947841 and 10986027 are two children of synset
10843624; “@” means that synset 10947841 is a subordinate of synset 10843624, and
likewise 10986027. This relation is frequently used in our work: we depend on the
parent-child relationship to find the ancestors of a given synset.

synset_id1  synset_id2  rel_str
10947841    10843624    @
10986027    10843624    @
826095      6643534     #p
946936      945703      #m

Table 4-2: Relation Definition for “synset_relations” and Tuples

3. words(word_id, word)
word_id ⎯ a unique decimal integer for each meaning of each word in WordNet.
word ⎯ the word, one of whose meanings is numbered word_id.

Each word in WordNet may have multiple meanings; therefore, every meaning is
assigned a unique identification number word_id. Example tuples are shown in Table
4-3: 92685 is the word_id for one of the 3 meanings of “season”; “board” has 9
meanings, one of which is numbered 11872 and another 11875. This relation is also
frequently used in our work: we depend on it to find all the synsets a given word belongs
to.

word_id  word
92685    season
11875    board
11872    board

Table 4-3: Relation Definition for “words” and Tuples
4. synset_word(synset_id, word_id)

synset_id ⎯ defined in relation synsets.
word_id ⎯ defined in relation words.

This table is the connection between synsets and words. One synset may consist of more
than one word_id, while each word_id is assigned to only one synset. This guarantees
that all different meanings of the same word are separated into different synsets; in other
words, there are no synonymous or polysemous synsets. Hence, every synset represents
a lexicalized concept. Example tuples are shown in Table 4-4. The word “board” is one
of the members of the synset numbered 6171035 because one of its meanings, numbered
11872, is very close in meaning to this synset.

synset_id  word_id
10947841   92685
2581069    11875
6171035    11872

Table 4-4: Relation Definition for “synset_word” and Tuples
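A minimal sketch of the conversion, using SQLite for concreteness (the column types are our own choice; the report does not prescribe a particular DBMS), together with an ancestor query over the parent-child relation:

```python
import sqlite3

# Create the four relations and load the example tuples from Table 4-2.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE synsets (synset_id INTEGER PRIMARY KEY, category TEXT,
                      hierarchy INTEGER, meaning TEXT);
CREATE TABLE synset_relations (synset_id1 INTEGER, synset_id2 INTEGER,
                               rel_str TEXT);
CREATE TABLE words (word_id INTEGER PRIMARY KEY, word TEXT);
CREATE TABLE synset_word (synset_id INTEGER, word_id INTEGER);
""")
db.executemany("INSERT INTO synset_relations VALUES (?, ?, ?)",
               [(10947841, 10843624, "@"), (10986027, 10843624, "@")])

def ancestors(synset_id):
    """Collect all ancestors of a synset by walking the parent-child relation."""
    found, frontier = [], [synset_id]
    while frontier:
        for (parent,) in db.execute(
                "SELECT synset_id2 FROM synset_relations WHERE synset_id1 = ?",
                (frontier.pop(),)):
            found.append(parent)
            frontier.append(parent)
    return found
```

With the Table 4-2 tuples loaded, ancestors(10947841) walks upward through synset_relations and returns the parent synset 10843624.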
All the major information of noun division in WordNet is stored in two files: noun.dat
and noun.idx. The data format in noun.dat is as follows:

synset_id hierarchy category w_cnt word lex_id [word lex_id...] p_cnt [pointer_symbol synset_id pos source/target] | gloss

NOTE:
w_cnt ⎯ number of words in the synset.
lex_id ⎯ a one-digit hexadecimal integer that uniquely identifies a meaning of this word. It usually starts with 0.
p_cnt ⎯ number of pointers from this synset to other synsets.
pointer_symbol ⎯ refer to the definition of relation synset_relations.
pos ⎯ syntactic category, n for noun.
source/target ⎯ a value of 0000 means that pointer_symbol represents a semantic relation between the current (source) synset and the target synset.
For example, in the record below, 10947841 is the synset_id; 28 is the hierarchy; “n”
means noun; 01 indicates there is only one word in this synset; “season” is the word in
this synset; 2 indicates that this meaning is the second meaning of the word “season”;
015 indicates synset 10947841 has 15 pointers to other synsets; one of the 15 target
synsets is its parent synset 10843624, indicated by the relation string “@”, while another
target is its child synset 10946877, indicated by “~”, as are the other 13 targets; the last
part is the definition and example sentences.
10947841 28 n 01 season 2 015 @ 10843624 n 0000 ~ 10946877 n 0000 ~ 10946979 n 0000 ~ 10947079 n 0000 ~ 10948329 n 0000 ~ 10948599 n 0000 ~ 10948843 n 0000 ~ 10948943 n 0000 ~ 10949098 n 0000 ~ 10949208 n 0000 ~ 10949304 n 0000 ~ 10949396 n 0000 ~ 10949521 n 0000 ~ 10949615 n 0000 ~ 10950901 n 0000 | a period of the year marked by special events or activities in some field; "he celebrated his 10th season with the ballet company" or "she always looked forward to the avocado season"
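The record layout above can be parsed mechanically. The sketch below is one way to do it in Python, assuming whitespace-separated fields and a "|" before the gloss as shown; the sample line is an abbreviated version of the "season" record with p_cnt reduced to 002 so only two pointers are listed. w_cnt is read as hexadecimal, following the WordNet file-format documentation (for values this small the base makes no difference).

```python
def parse_noun_dat_line(line):
    """Split one noun.dat record into its fields."""
    data, _, gloss = line.partition("|")
    tokens = data.split()
    rec = {"synset_id": int(tokens[0]),
           "hierarchy": int(tokens[1]),
           "category": tokens[2]}
    w_cnt = int(tokens[3], 16)            # word count is hexadecimal
    pos = 4
    rec["words"] = []
    for _ in range(w_cnt):                # (word, lex_id) pairs
        rec["words"].append((tokens[pos], tokens[pos + 1]))
        pos += 2
    p_cnt = int(tokens[pos])              # number of pointers
    pos += 1
    rec["pointers"] = []
    for _ in range(p_cnt):                # (symbol, target id, pos, source/target)
        rec["pointers"].append((tokens[pos], int(tokens[pos + 1]),
                                tokens[pos + 2], tokens[pos + 3]))
        pos += 4
    rec["gloss"] = gloss.strip()
    return rec

# Abbreviated "season" record (p_cnt cut down to 002 for the example)
line = ("10947841 28 n 01 season 2 002 @ 10843624 n 0000 "
        "~ 10946877 n 0000 | a period of the year marked by special events")
rec = parse_noun_dat_line(line)
print(rec["words"], rec["pointers"][0])
```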
On the other hand, the data format in noun.idx is as follows:
word pos poly_cnt p_cnt [pointer_symbol...] sense_cnt tagsense_cnt
synset_id [synset_id...]
NOTE:
pos ⎯ syntactic category, "n" for noun.
poly_cnt ⎯ number of different meanings (polysemy) the current word has in WordNet. This is the same value as sense_cnt, but is retained for historical reasons.
p_cnt ⎯ number of different types of pointers the current word has in all synsets containing it.
pointer_symbol ⎯ refer to the definition for relation synset_relations.
sense_cnt ⎯ number of different meanings the current word has in WordNet.
tagsense_cnt ⎯ number of meanings of the current word that are ranked according to their frequency of occurrence in semantic concordance texts.
synset_id ⎯ each synset_id in the list corresponds to a different meaning of the current word in WordNet.
For example, in the record below, "seat" is the word; "n" means noun; the first 6 indicates "seat" has 6 senses; 5 means
"seat" has 5 different types of pointers (@, ~, #m, #p, %p) in all the 6 synsets containing
it; the second 6 is the sense_cnt, again telling that "seat" is in 6 synsets, and the 5 after it is the
tagsense_cnt; finally, the synsets containing "seat" are listed one by one.
seat n 6 5 @ ~ #m #p %p 6 5 06368526 04306560 03294261 03293673 06368745 03294658
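The index records can be parsed the same way. A sketch, assuming the whitespace-separated layout described above:

```python
def parse_noun_idx_line(line):
    """Split one noun.idx record into its fields."""
    t = line.split()
    word, pos_tag = t[0], t[1]
    poly_cnt, p_cnt = int(t[2]), int(t[3])
    pointers = t[4:4 + p_cnt]             # the distinct pointer symbols
    i = 4 + p_cnt
    sense_cnt, tagsense_cnt = int(t[i]), int(t[i + 1])
    synset_ids = [int(s) for s in t[i + 2:i + 2 + sense_cnt]]
    return {"word": word, "pointers": pointers,
            "sense_cnt": sense_cnt, "synset_ids": synset_ids}

rec = parse_noun_idx_line(
    "seat n 6 5 @ ~ #m #p %p 6 5 06368526 04306560 03294261 03293673 "
    "06368745 03294658")
print(rec["word"], rec["sense_cnt"])
```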
Pseudo-code 1 shows the steps to convert the data in the two flat files into a relational
database. The for loop from line 1 to line 8 extracts data from noun.dat to construct table
"synsets". The second for loop, from line 9 to line 21, extracts data from noun.dat again to
construct table "synset_relations". The inner loop generates a separate tuple for every
pair of child and parent. That means that if a synset has multiple pointers to other synsets,
there are multiple tuples for it, representing the multiple-children or multiple-parents
relationship.
Then, the code from line 22 to line 32 extracts data from noun.idx to construct tables
"words" and "synset_word". The inner loop generates a separate tuple for every sense in
"words" and "synset_word".
Pseudo-code 1 build_wordnet()
------------------------------------------------------------------------------------------------------------
1 for each line in noun.dat
2 synset_id ← retrieve synset_id
3 hierarchy ← retrieve hierarchy
4 category ← retrieve category
5 skip the next items until gloss
6 meaning ← retrieve gloss
7 insert tuple (synset_id, hierarchy, category, meaning) into table synsets
8 end
9 for each line in noun.dat
10 synset_id ← retrieve synset_id
11 skip the next item until p_cnt
12 num_pointers ← retrieve p_cnt
13 for each pointer
14 relationStr ← retrieve pointer_symbol
15 relationsynset_id ← retrieve synset_id
16 if (synset_id is the parent of relationsynset_id)
17 insert tuple (relationsynset_id, synset_id, relationStr) into table synset_relations
18 else
19 insert tuple (synset_id, relationsynset_id, relationStr) into table synset_relations
20 end
21 end
22 for each line in noun.idx
23 word ← retrieve word
24 skip the next items until sense_cnt
25 numSenses ← retrieve sense_cnt
26 for each sense
27 generate a unique id word_id for this sense
28 insert tuple (word_id, word) into table words
29 synset_id ← retrieve synset_id
30 insert tuple (synset_id, word_id) into table synset_word
31 end
32 end
---------------------------------------------------------------------------------------------------------------------------------
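As a concrete counterpart to Pseudo-code 1, here is a condensed sketch that loads a couple of already-parsed records into the four tables, using SQLite rather than Oracle; the sample records are hypothetical stand-ins for parsed noun.dat/noun.idx lines, and the parsing itself is omitted.

```python
import sqlite3

# Hypothetical parsed records, loosely based on the "season" example above.
dat_records = [
    # (synset_id, hierarchy, category, [(pointer_symbol, target_id)], gloss)
    (10947841, 28, "n", [("~", 10946877)], "a period of the year"),
    (10946877, 28, "n", [("@", 10947841)], "a child synset of season"),
]
idx_records = [
    # (word, [synset_id, ...]) -- one synset per sense
    ("season", [10947841]),
]

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE synsets (synset_id, hierarchy, category, meaning)")
cur.execute("CREATE TABLE synset_relations (synset_id1, synset_id2, rel_str)")
cur.execute("CREATE TABLE words (word_id, word)")
cur.execute("CREATE TABLE synset_word (synset_id, word_id)")

for synset_id, hierarchy, category, pointers, gloss in dat_records:
    cur.execute("INSERT INTO synsets VALUES (?,?,?,?)",
                (synset_id, hierarchy, category, gloss))
    for rel_str, target_id in pointers:
        # "~" marks a child target, "@" a parent; synset_id1 is always the child
        if rel_str == "~":
            cur.execute("INSERT INTO synset_relations VALUES (?,?,?)",
                        (target_id, synset_id, rel_str))
        else:
            cur.execute("INSERT INTO synset_relations VALUES (?,?,?)",
                        (synset_id, target_id, rel_str))

next_word_id = 1
for word, synset_ids in idx_records:
    for synset_id in synset_ids:          # a fresh word_id for every sense
        cur.execute("INSERT INTO words VALUES (?,?)", (next_word_id, word))
        cur.execute("INSERT INTO synset_word VALUES (?,?)",
                    (synset_id, next_word_id))
        next_word_id += 1
```

Because WordNet lists every edge in both endpoints' pointer lists, the same (child, parent) pair arrives twice here, once as "~" and once as "@"; the real schema's primary key on (SYNSET_ID1, SYNSET_ID2) would collapse such duplicates.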
4.2 Algorithm
4.2.1 Hood Construction
Using each separate hierarchy as a category is well defined but too coarse grained. For
example, in Figure 2-1 seven of the eight senses of board are in the {entity, thing}
hierarchy. Similarly, using individual synsets is well defined but too fine grained.
Therefore, this algorithm is intended to define an appropriate middle-level category, the
hood. To define the hood of a given synset, s, consider the set of synsets and the
hyponymy links in WordNet as the set of vertices and directed edges of a graph. Then the
hood of s is the largest connected subgraph that contains s, contains only descendants of
an ancestor of s, and contains no synset that has a descendant that includes another
instance of a member of s as a member. A hood is represented by the synset that is the
root of the hood. In other words, shown in Figure 4-1, assume synset s consists of k
words w(1), w(2), w(3)…w(k), p(1), p(2), p(3)…p(n) are n ancestors of s, where p(m) is a
father of p(m-1). p(m) (m is a number in 1…n) has a descendent synset which also
includes w(j) (j is a number in 1…k)as a member and p(m) is the closest one with this
feature to s . So, p(m-1) is one of the root(s) of the hood(s) of s, as shown in Case 1. If m
is 1, s itself is the root, as shown in Case 2. If no such m is found, the root of this
WordNet hierarchy r is the root of the hood of s, as shown in Case 3. If s itself has a
descendent synset that includes w(j) (j is a number in 1…k) as a member, there is no hood
in WordNet for s, as shown in case 4. Because some synsets have more than one parent,
synsets can have more than one hood. A synset has no hood if the same word is a
member of both the synset and one of its descendents. For example, in Figure 2-1 the
hood of the synset for committee sense of board is rooted at the synset {group, grouping}
(and thus the hood for that sense is the entire hierarchy in which it occurs) because no
other synset containing "board" in this hierarchy (Case 3), the hood for the circuit_board
sense of board is rooted at {circuit, closed_circuit} because the synset
{electrical_device} has a descendent synset {control_panel, display_panel, panel,
board} containing "board" (Case 1), and the hood for the panel sense of board is rooted at
the synset itself because its direct parent synset {electrical_device} has a descendent
synset {circuit_board, circuit_card, board, card} containing "board" (Case 2).
Pseudo-code 2 shows the steps to find the root(s) of the hood(s) for a given synset. The
input for this procedure is a given synset_id, s. The output is the synset_id(s) of the
root(s) of hood(s) for s. The code from line 1 to line 10 retrieves all the synsets that
have at least one member word in common with s and saves them in
a hashtable, synset_id_hashtable. From line 11 to line 22, we get all the ancestors of
every synset in synset_id_hashtable and keep them in another hashtable,
all_ancestors_hashtable. From line 23 to line 43, we find the ancestors of s one
[Figure 4-1: Root of Hood(s) of Synset s ⎯ the four cases: Case 1, the root is p(m-1); Case 2, the root is s itself; Case 3, the root is the hierarchy root r; Case 4, s has no hood.]
by one from the closest to the farthest. Whenever an ancestor a is in
all_ancestors_hashtable (in other words, whenever a has a descendant that includes another
instance of a member of s as a member), its child on the path from s to a is a root of
the hood(s) of s. In our work, we apply the find_hood_root(s) procedure to all the 66025
synsets in WordNet. The output is stored in hood_root.txt for further computation.
Pseudo-code 2 find_hood_root(s)
---------------------------------------------------------------------------------------------------------------------------------
1 word_id_set ← π word_id (σsynset_id=s (synset_word))
2 for each word_id in word_id_set
3 word_set ← π word (σ word_id=word_id ( words))
4 for each word in word_set
5 all_word_id_set ← π word_id (σ word=word ( words))
6 end
7 end
8 for each word_id in all_word_id_set
9 synset_id_hashtable ← π synset_id (σ word_id=word_id ( synset_word))
10 end
11 for each synset_id in synset_id_hashtable except s
12 current_id_hashtable ← synset_id
13 while (current_id_hashtable is not empty)
14 for each synset_id in current_id_hashtable
15 parent_id_hashtable ← π synset_id2 (σ synset_id1=synset_id ( synset_relations))
16 end
17 clear current_id_hashtable
18 copy parent_id_hashtable to current_id_hashtable
19 copy parent_id_hashtable to all_ancestors_hashtable
20 clear parent_id_hashtable
21 end
22 end
23 if (s is in all_ancestors_hashtable)
24 s has no hood in WordNet
25 else
26 current_id_hashtable ← s
27 while (current_id_hashtable is not empty)
28 for each current_synset_id in current_id_hashtable
29 parent_id_hashtable ← π synset_id2 (σ synset_id1=current_synset_id (
synset_relations))
30 for each parent_synset_id in parent_id_hashtable
31 if (parent_synset_id is in all_ancestors_hashtable)
32 root_found ← true
33 root_set ← current_synset_id
34 remove parent_synset_id from parent_id_hashtable
35 break
36 end
37 end
38 clear current_id_hashtable
39 copy parent_id_hashtable to current_id_hashtable
40 clear parent_id_hashtable
41 end
42 if (root_found is false)
43 root_set ← root of this entire hierarchy in WordNet
------------------------------------------------------------------------------------------------------------
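To make the hood computation tangible, the following is a self-contained sketch of the same idea on a toy, single-parent hierarchy modeled on the "board" example. The synset ids, member words, and links here are hypothetical; real WordNet synsets can have several parents, which this simplification ignores, so it returns a single root rather than a set.

```python
# Toy hierarchy (hypothetical ids): 1 is the root of the hierarchy.
members = {
    1: {"entity"},
    2: {"device"},
    3: {"circuit"},
    4: {"circuit_board", "board"},
    5: {"panel", "board"},
}
parent = {2: 1, 3: 2, 4: 3, 5: 2}        # child -> parent links

def ancestors(s):
    """Ancestors of s, ordered from the closest to the farthest."""
    out = []
    while s in parent:
        s = parent[s]
        out.append(s)
    return out

def find_hood_root(s):
    """Return the root of the hood of s, or None if s has no hood (Case 4)."""
    # Every other synset sharing a member word with s ...
    rivals = [t for t in members if t != s and members[t] & members[s]]
    # ... contributes its ancestors to the "blocked" territory.
    blocked = set()
    for t in rivals:
        blocked.update(ancestors(t))
    if s in blocked:                     # a descendant of s repeats a word of s
        return None
    root = s
    for a in ancestors(s):
        if a in blocked:                 # first ancestor covering a rival sense:
            return root                  # the hood root is its child on the path
        root = a
    return root                          # no blocked ancestor: whole hierarchy
```

Here find_hood_root(4) returns 3: the hood of the {circuit_board, board} synset is rooted at {circuit}, one step below the first ancestor ({device}) that also covers the rival {panel, board} sense (Case 1). find_hood_root(5) returns 5 itself (Case 2), and find_hood_root(3), whose word is unambiguous, returns the hierarchy root 1 (Case 3).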
Figure 4-2: The IS-A hierarchy for eight different senses of the noun "board". [Figure omitted; it shows the synsets, with their synset_ids, for the eight noun senses of "board"; refer also to Figure 2-1.]

Let's take synset 2443613 {circuit_board, circuit_card, board, card} as an example
(Figure 4-2, refer to Figure 2-1). All the 9 synsets (6171035, 2493245, 2303171,
2572848, 2581069, 5621336, 10836071, 2443613, 3461955) for "board" are stored in
synset_id_hashtable, as well as those synsets of "circuit_board", "circuit_card" and
"card"; all_ancestors_hashtable contains 6172564, 6020493, 6088087, 2625239,
2560468, 3447223, 2729592, etc., but none of 2443096, 2482181, 3173212 is in it,
because each of these is an ancestor only of {circuit_board, circuit_card, board, card}
and not of any other synset containing "circuit_board", "circuit_card",
"board", or "card". When we follow the parent-child relationship to find ancestors for
2443613, we finally stop at 2625239, because 2625239 is the parent of synset 2493245
{control_panel, display_panel, panel, board} and hence an ancestor of another synset
containing "board". Therefore, the root of the hood for 2443613 is its child on the path, synset 2443096.
4.2.2 Word Sense Disambiguation
After hoods for each synset in WordNet are constructed, they can be used to select the
sense of an ambiguous word in a given text-document. The senses of the nouns in a text-
document of a given collection are selected by the following two-stage process. A
marking procedure that visits synsets and maintains a count of the number of times each
synset is visited is fundamental to both stages. Given a word, the procedure finds all
instances of the word in (the noun portion of) WordNet. For each identified synset, the
procedure follows the IS-A links up to the root of the hierarchy incrementing a counter at
each synset it visits. In the first stage the marking procedure is called once for each
occurrence of a content word (i.e., a word that is not a stop word) in all of the documents
in the collection. The number of times the procedure was called and found the word in
WordNet is also maintained. This produces a set of global counts (relative to this
particular collection) at each synset. In the second stage, the marking procedure is called
once for each occurrence of a content word in an individual text (document or query).
Again the number of times the procedure was called and found the word in WordNet for
the individual text is maintained. This produces a set of local counts at the synsets. Given
the local and global counts, a sense for a particular ambiguous word contained within the
text that generated the locals is selected as follows:
• The difference

                   # local visits       # global visits
      difference = ----------------  -  -----------------
                   # local calls        # global calls

  is computed at the root of the hood for each sense of the word. If a
  sense does not have a hood or if the local count at its hood root is less than two,
  that difference is set to zero. If a sense has multiple hoods, that difference is set to
  the largest difference over the set of hoods.
• The sense corresponding to the hood root with the largest positive difference is
selected as the sense of the word in the text. If no sense has a positive difference,
no WordNet sense is chosen for the word.
Pseudo-code 3 shows the steps to disambiguate the sense of every word in a document.
Pseudo-code 3 disambiguation()
---------------------------------------------------------------------------------------------------------------------------------
global_counts()
For each document in the document collection
local_counts(document)
Load words in this document into word_in_doc_hashtable
Remove stopwords from word_in_doc_hashtable
Remove words that are not in WordNet noun division
For each word in word_in_doc_hashtable
difference(word)
end
end
---------------------------------------------------------------------------------------------------------------------------------
Pseudo-code for global_counts()
---------------------------------------------------------------------------------------------------------------------------------
For each word in the document collection
if (word is not a stopword and word is in WordNet noun division)
marking(word)
#_of_global_calls is incremented by 1
end
---------------------------------------------------------------------------------------------------------------------------------
Pseudo-code for local_counts(document)
---------------------------------------------------------------------------------------------------------------------------------
For each word in this document
if (word is not a stopword and word is in WordNet noun division)
marking(word)
#_of_local_calls is incremented by 1
end
---------------------------------------------------------------------------------------------------------------------------------
Pseudo-code for marking(word)
---------------------------------------------------------------------------------------------------------------------------------
Find all the synset(s) that contain the word and save them in synset_id_hashtable
For each synset in synset_id_hashtable
Find all its ancestors and save in ancestors_hashtable
For each synset in ancestors_hashtable
Increment its counter by 1
end
end
---------------------------------------------------------------------------------------------------------------------------------
Pseudo-code for difference (word)
---------------------------------------------------------------------------------------------------------------------------------
Find all the synset(s) that contain this word and save them in synset_id_hashtable
For each synset in synset_id_hashtable
Find the root(s) of the hood(s) of this synset
if this synset has no hood at all
max_diff = 0
else
For each root
Calculate the diff with that formula described above
Compare diff with max_diff and keep the max_diff
end
end
The true sense of this word in this document is the synset whose hood root has the max_diff.
------------------------------------------------------------------------------------------------------------
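The marking and difference procedures can be sketched end to end on a toy hierarchy (hypothetical synset ids, words, and texts; the hood roots are given directly rather than computed with find_hood_root):

```python
from collections import Counter

# Toy data: 1 is the hierarchy root; "board" is the ambiguous word.
parent = {2: 1, 3: 1, 4: 2, 5: 3}            # child -> parent links
synsets_of = {"board": [4, 5], "wood": [3], "meeting": [2]}
hood_root = {4: 2, 5: 3}                     # sense -> root of its hood

def marking(word, counts):
    """Visit every synset on the path from each sense of `word` up to the
    root, incrementing a counter at each synset visited."""
    for s in synsets_of[word]:
        while True:
            counts[s] += 1
            if s not in parent:
                break
            s = parent[s]

def count_text(words):
    """Return (visit counts, number of marking calls) for a word sequence."""
    counts, calls = Counter(), 0
    for w in words:
        if w in synsets_of:                  # the word was found in WordNet
            marking(w, counts)
            calls += 1
    return counts, calls

collection = ["board", "wood", "wood", "wood", "meeting"]  # all documents
document = ["board", "meeting", "meeting"]                 # one document

g_counts, g_calls = count_text(collection)   # stage 1: global counts
l_counts, l_calls = count_text(document)     # stage 2: local counts

def select_sense(word):
    best, best_diff = None, 0.0
    for s in synsets_of[word]:
        r = hood_root.get(s)
        if r is None or l_counts[r] < 2:     # no hood, or too little evidence
            continue
        diff = l_counts[r] / l_calls - g_counts[r] / g_calls
        if diff > best_diff:                 # keep the largest positive diff
            best, best_diff = s, diff
    return best                              # None if no positive difference

print(select_sense("board"))   # → 4
```

In this toy run the document's repeated "meeting" occurrences inflate the local count at synset 2, the hood root of sense 4 of "board", so that sense wins with a difference of 3/3 - 2/5 = 0.6, while sense 5 is dropped for lack of local evidence.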
The idea behind this disambiguation procedure is to select senses from the areas of the
WordNet hierarchies in which document-induced (local) activity is greater than the
expected (global) activity. The hood construct is designed to provide a point of
comparison that is broad enough to encompass markings from several different words yet
narrow enough to distinguish among senses.
CHAPTER 5 Experiments
In this chapter I shall describe my experiment, which verifies the effectiveness of the hood
algorithm for word sense disambiguation. This experiment is performed on the part-of-
speech tagged Brown Corpus. The flow of the experiment will be described in detail, and I
will report the results of my experiment and analyze their quality.
5.1 Part-of-Speech Tagged Brown Corpus
The Brown Corpus consists of 1,014,312 words of running text of edited English prose
printed in the United States during the calendar year 1961. So far as it has been possible
to determine, the writers were native speakers of American English. This Corpus is
divided into 500 samples of 2000+ words each. Each sample begins at the beginning of a
sentence but not necessarily of a paragraph or other larger division, and each ends at the
first sentence ending after 2000 words. The samples represent a wide range of styles and
varieties of prose. Samples were chosen for their representative quality rather than for
any subjectively determined excellence. A corpus is intended to be "a collection of
naturally occurring language text, chosen to characterize a state or variety of a language"
(Sinclair, 1991). As such, very few of the so-called corpora used in current natural
language processing and speech recognition work deserve the name. For English, the
only true corpus that is widely available is the Brown Corpus. It has been extensively
used for natural language processing work.
A sentence in natural language text is usually composed of nouns, pronouns, articles,
verbs, adjectives, adverbs and connectives. While the words in each grammatical class
are used with a particular purpose, it can be argued that most of the semantics is carried
by noun words. Thus, nouns can be taken out through the systematic elimination of verbs,
adjectives, adverbs, connectives, articles and pronouns.
Therefore, in this experiment, we make use of the part-of-speech tagged Brown Corpus
provided by the Treebank Project, Computer and Information Science Department,
University of Pennsylvania. This document set consists of 479 tagged documents, in which
every word is tagged with its linguistic category.
5.2 Flow of Experiment
Figure 5-1 shows the steps of my experiment. First of all, I convert WordNet from flat files
(noun.dat and noun.idx) to a relational database. Tables are created and all the data
contained in noun.dat and noun.idx are loaded into these tables (see Pseudo-code 1).
Then, for each synset in WordNet, the root(s) of the hood(s) is found and saved in
hood_root.txt. In parallel, for each part-of-speech tagged document in the
Brown Corpus, such as a01, all the tags and non-nouns in a01 are removed and the result
is saved in a01_noun. Second, a01_noun is processed by the stemming algorithm; after
this step, all the words remaining in a01_noun_stem are stems of the words in a01.
Finally, a01_noun_stem is processed by Dr. Voorhees' disambiguation algorithm. The
final result is saved in disambiguation_result_a01, where each word is mapped to
a unique synset that represents the sense in which this word is used in this context.
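The tag-and-non-noun removal step can be sketched as below, assuming Penn-Treebank-style word/TAG tokens; the exact tag set of the tagged Brown Corpus may differ in detail, and keeping only the common-noun tags NN and NNS is an illustrative choice.

```python
NOUN_TAGS = {"NN", "NNS"}   # common-noun tags (illustrative subset)

def extract_nouns(tagged_text):
    """Keep only the nouns from "word/TAG" tokens, dropping the tags."""
    nouns = []
    for token in tagged_text.split():
        word, sep, tag = token.rpartition("/")
        if sep and tag in NOUN_TAGS:
            nouns.append(word.lower())
    return nouns

print(extract_nouns("The/DT board/NN met/VBD in/IN two/CD sessions/NNS ./."))
# → ['board', 'sessions']
```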
1. Convert WordNet from the files noun.dat and noun.idx to a relational database (see Pseudo-code 1).
2. For each synset_id in WordNet, find the root(s) of the hood(s) (see Pseudo-code 2) and save them in the file hood_root.txt.
3. For each tagged document (e.g. a01.txt), remove tags and non-nouns, generating a derivative file of nouns without tags (e.g. a01_noun.txt).
4. Apply the stemming algorithm on each such file, producing the file of stemmed nouns (e.g. a01_noun_stem.txt).
5. Disambiguate each word in this file (see Pseudo-code 3); in the disambiguation result file, each word is mapped to a unique synset.

Figure 5-1 Steps of Experiment
5.3 Quality of Results
The results shown in Table 5-1 are for 50 documents randomly chosen from the Brown
Corpus and processed as shown in Figure 5-1. Since
WordNet provides semantically tagged Brown Corpus files, I compare my results with
the manually identified results.
               # of words assigned the same synset as the manually identified one
Hit Rate = -----------------------------------------------------------------------
                              # of words in the stemmed file

    Hit Rate                       <15%   15%-20%   20%-25%   25%-30%   30%-35%   >40%
    # of docs with this hit rate     1       7        12        16        14       0

Table 5-1 Hit Rate of Experiment for Voorhees' Algorithm.
From this table, we can see that the hit rate is not as high as we expected. No document
scores higher than 40%, and most fall between 15% and 35%. This suggests that Dr. Voorhees'
disambiguation algorithm is not effective at automatically disambiguating word
senses.
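The hit-rate computation itself is a simple ratio; a minimal sketch with hypothetical sense assignments:

```python
def hit_rate(auto, manual):
    """Fraction of words whose automatically selected synset matches
    the manually identified one."""
    hits = sum(1 for w in auto if auto[w] == manual.get(w))
    return hits / len(auto)

# Hypothetical synset assignments for four stemmed words of one document
auto   = {"board": 4, "season": 7, "seat": 2, "wood": 9}
manual = {"board": 4, "season": 7, "seat": 5, "wood": 3}
print(hit_rate(auto, manual))   # → 0.5  (2 of the 4 words match)
```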
5.4 Result Analysis
So far we can say the algorithm does not work well for disambiguating word senses. The
reasons are as follows:
1. Although most of the semantics is carried by the nouns, the verbs, adjectives, and
adverbs are important factors that can help determine the appropriate sense of an
ambiguous word.
2. A word may be used multiple times in one document, with each occurrence
using a different sense. But in this algorithm, each word is mapped to a
unique sense.
3. The stemming algorithm, Porter's algorithm, has bugs. For example, family is
stemmed to famili, and times is stemmed to tim.
4. The part-of-speech tagger used for the Brown Corpus separates words connected with
an underscore. For example, school_board is a single word, but it is
separated into the two words school and board by the tagger. Thus, the sense of
school or board will never hit the manually identified word sense of
school_board.
CHAPTER 6
Conclusion and Future Work

6.1 Conclusion
In this report, I implemented the disambiguation technique introduced by Dr. Ellen M.
Voorhees in her paper Using WordNet to Disambiguate Word Senses for Text Retrieval
([5]). This algorithm was tested on 50 part-of-speech tagged documents from Brown
Corpus. Processed by this technique, each word in any document is mapped to a unique
sense. Further, the effectiveness of this automatic disambiguation algorithm was tested by
comparing with the manual disambiguation results provided by Princeton University.
The results verified Dr. Voorhees' conclusion in her paper that this algorithm is not
sufficient to reliably select the correct sense of a noun from the set of fine sense
distinctions in WordNet.
6.2 Future Work and Application
The algorithm introduced in this report is an automatic technique for disambiguating word
senses. If the four causes of the low hit rate were overcome, its effectiveness would
be much better, which would be a great help for information retrieval: if a retrieval system
indexed documents by the senses of the words they contain and the appropriate senses in the
query could be identified, then irrelevant documents containing query words used in
a different sense would not be retrieved. This technique is also useful for machine
translation, because ambiguity in the source language must be resolved before
correct translation.
Another important area of application is hierarchical text classification. As far as
I know, the hierarchical text classification algorithm has only been applied to a small set of
documents that were manually disambiguated by the WordNet group at Princeton University. The
test size is too small to tell whether it is a good text classification algorithm. Therefore,
assuming the automatic word sense disambiguation technique works well, an unlimited
number of documents can be fed to the hierarchical text classification algorithm, with each
word mapped to a corresponding class according to the sense it truly uses in the
document. This will provide researchers with enough evidence to assess the benefits of
hierarchical text classification.
References

[1] Christiane Fellbaum. WordNet: An Electronic Lexical Database. MIT Press,
1998.
[2] Ricardo Baeza-Yates, Berthier Ribeiro-Neto. Modern Information Retrieval.
ACM Press, 1999.
[3] Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques.
Academic Press, 2001.
[4] Richard K. Belew. Find Out About: Search Engine Technology from a Cognitive
Perspective. Cambridge Univ. Press, 2000.
[5] Ellen M. Voorhees. Using WordNet to Disambiguate Word Senses for Text
Retrieval. SIGIR 1993: 171-180.
[6] Mark Sanderson. Word Sense Disambiguation and Information Retrieval.
Proceedings of the 17th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, pages 142-151, June 1994.
[7] Robert Krovetz, W. Bruce Croft. Lexical Ambiguity And Information Retrieval.
ACM Transactions on Information Systems, 10(2):115-141, April 1992.
[8] Eric Brill. A Simple Rule-Based Part of Speech Tagger. Proceedings of the Third
Annual Conference on Applied Natural Language Processing, ACL. 1992.
[9] M. F. Porter. An Algorithm for Suffix Stripping, 1980.
http://telemat.det.unifi.it/book/2001/wchange/download/stem_porter.html.
[10] The EAGLES Lexicon Interest Group. Word Sense Disambiguation. May 1998.
http://www.ilc.pi.cnr.it/EAGLES96/rep2/node39.html.
[11] Geoffrey Towell, Ellen M. Voorhees. Disambiguating Highly Ambiguous Words.
Computational Linguistics, 24(1):125-146, 1998.
[12] George Miller. Special Issue, WordNet: An on-line lexical database. International
Journal of Lexicography, 3(4), 1990.
[13] G.K. Zipf. The meaning-frequency relationship of words. Journal of General
Psychology, 3:251-256, 1945.
[14] E. Brill. Transformation-based error-driven learning and natural language
processing: A case study in part of speech tagging. Computational Linguistics,
21(4):543-566, December 1995.
[15] P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer. Word sense
disambiguation using statistical methods. In Proceedings of the 29th Meeting of
the Association for Computational Linguistics (ACL-91), pages 264-270,
Berkeley, CA, 1991.
[16] Agirre, E. and Rigau, G. Word sense disambiguation using conceptual density. In
Proceedings of COLING, 1996.
[17] Ribas-95 Ribas, F. On learning more appropriate selectional restrictions. In
Proceedings of the Seventh Conference of the European Chapter of the
Association for Computational Linguistics, pages 112-118, 1995.
[18] Richardson, R. and Smeaton, A. Using wordnet in a knowledge-based approach
to information retrieval. In Proceedings of the BCS-IRSG Colloquium, Crewe,
1995.
[19] Sussna, M. Word sense disambiguation for free-text indexing using a massive
semantic network. Proceedings of the 2nd International Conference on
Information and Knowledge Management. Arlington, Virginia, USA, 1993.
[20] Cowie, J. and Lehnert, W. Information extraction. Communications of the ACM,
39(1):80-91, 1996.
[21] J. Guthrie, L. Guthrie, Y. Wilks and H. Aidinejad. Subject-Dependent
Cooccurrence and Word Sense Disambiguation, ACL-91, pp. 146-152,1991.
[22] D. Yarowsky. Word-sense disambiguation using statistical models of Roget's
categories trained on large corpora. In Proceedings of the 14th International
Conference on Computational Linguistics (COLING-92), pages 454-460, Nantes,
France, 1992.
[23] Y. Wilks and M. Stevenson. The Grammar of Sense: using part-of-speech tags as
a first step in semantic disambiguation. Journal of Natural Language Engineering,
4(3), 1997.
[24] A. Harley and D. Glennon. Sense tagging in action: Combining different tests
with additive weights. In Proceedings of the SIGLEX Workshop ``Tagging Text
with Lexical Semantics'', pages 74-78. Association for Computational Linguistics,
Washington, D.C., 1997.
[25] S. McRoy. Using multiple knowledge sources for word sense disambiguation.
Computational Linguistics, 18(1):1-30, 1992.
[26] A. Luk. Statistical sense disambiguation with relatively small corpora using
dictionary definitions. In Proceedings of the 33rd Meeting of the Association for
Computational Linguistics (ACL-95), pages 181-188, Cambridge, MA, 1995.
[27] Gerard Salton and Michael E. Lesk. Information analysis and dictionary
construction. In Gerard Salton, editor, The SMART Retrieval System:
Experiments in Automatic Document Processing, chapter 6, pages 115-142.
Prentice-Hall, Inc. Englewood Cliffs, New Jersey, 1971.
[28] Bin Chen. Sampling and Text Classification Techniques for Data Mining. Ph.D.
Thesis, ECE dept., Northwestern University, 2001.
[29] Beatrice Santorini. Part-of-Speech Tagging Guidelines for the Penn Treebank
Project. June, 1990.
[30] W. N. Francis, H. Kucera. Brown Corpus Manual.
http://www.hit.uib.no/icame/brown/bcm.html, 1979.
Appendix A: Definition of Tables

-- ORACLE WORDNET TABLES

-- NOTE: CATEGORY is the category the synset belongs to.
--   1 Noun
--   2 Adjective
--   4 Verb
--   8 Adverb
-- NOTE: HIERARCHY is the hierarchy the synset belongs to.
--   In WordNet, the hierarchies for nouns range from 3 to 28 (second column of noun.dat).
--   Hierarchy 0 defaults to "unknown" or "undefined".
CREATE TABLE SYNSETS (
    SYNSET_ID int CONSTRAINT nn_synsets_synset_id NOT NULL,
    CATEGORY int CONSTRAINT nn_synsets_category NOT NULL,
    HIERARCHY int CONSTRAINT nn_synsets_hierarchy NOT NULL,
    MEANING varchar2(4000) NULL,
    CONSTRAINT pk_synsets PRIMARY KEY (SYNSET_ID)
);
CREATE INDEX idx_synsets_hierarchy on SYNSETS (hierarchy);

-- NOTE: REL_STR indicates the relation of ID1 with regard to ID2.
--   We require that ID1 is always a child of ID2. This table does not store
--   any relationship other than parent-child.
--   REL_STR is the actual symbol in WordNet that describes the relationship.
--   ~:  ID1 is a hypernym of ID2 (ID1 is a superordinate <generic> of ID2 <specific>)
--   @:  ID1 is a hyponym of ID2 (ID1 is a subordinate <specific> of ID2 <generic>)
--   %:  ID1 is a holonym of ID2 (ID2 is part of ID1)
--   #:  ID1 is a meronym of ID2 (ID1 is part of ID2)
--   #p: ID1 is part of ID2
--   #m: ID1 is a member of ID2
--   #s: ID1 is the stuff that ID2 is made from
--   =:  ID1 has an attribute ID2 (ID2 is an adjective)
--   !:  ID1 and ID2 are antonyms (not stored in this table)
CREATE TABLE SYNSET_RELATIONS (
    SYNSET_ID1 int CONSTRAINT nn_synset_rel_synset_id1 NOT NULL,
    SYNSET_ID2 int CONSTRAINT nn_synset_rel_synset_id2 NOT NULL,
    REL_STR varchar2(10) CONSTRAINT nn_synset_rel_rel_str NOT NULL,
    CONSTRAINT pk_synset_rel PRIMARY KEY (SYNSET_ID1, SYNSET_ID2),
    CONSTRAINT fk_synset_rel_synset_id1 FOREIGN KEY (SYNSET_ID1)
        REFERENCES SYNSETS (SYNSET_ID) ON DELETE CASCADE,
    CONSTRAINT fk_synset_rel_synset_id2 FOREIGN KEY (SYNSET_ID2)
        REFERENCES SYNSETS (SYNSET_ID) ON DELETE CASCADE
);
CREATE INDEX idx_synset_rel_synset_id2 on SYNSET_RELATIONS (synset_id2);

CREATE TABLE WORDS (
    WORD_ID int CONSTRAINT nn_words_word_id NOT NULL,
    WORD varchar2(200) CONSTRAINT nn_words_word NOT NULL,
    CONSTRAINT pk_words PRIMARY KEY (WORD_ID)
);

CREATE TABLE SYNSET_WORD (
    SYNSET_ID int CONSTRAINT nn_sw_synset_id NOT NULL,
    WORD_ID int CONSTRAINT nn_sw_word_id NOT NULL,
    CONSTRAINT pk_sw PRIMARY KEY (SYNSET_ID, WORD_ID),
    CONSTRAINT fk_sw_synset_id FOREIGN KEY (SYNSET_ID)
        REFERENCES SYNSETS (SYNSET_ID) ON DELETE CASCADE,
    CONSTRAINT fk_sw_word_id FOREIGN KEY (WORD_ID)
        REFERENCES WORDS (WORD_ID) ON DELETE CASCADE
);
CREATE INDEX idx_sw_word_id on SYNSET_WORD (word_id);