
A New Conceptual Approach to Document Indexing

Simona Barresi, Samia Nefti-Meziani, Yacine Rezgui
Informatics Research Institute, University of Salford, Salford, United Kingdom
[email protected], [email protected], [email protected]

Abstract— This paper presents a new conceptual indexing technique intended to overcome the major problems resulting from the use of Term Frequency (TF) based approaches. To resolve the semantic problems related to TF approaches, the proposed technique disambiguates the words contained in a document and creates a list of superordinates based on an external knowledge source. In order to reduce the dimension of the document vector, the final set of index values is created by extracting a set of common concepts, shared by multiple related words, from the list of hypernyms. Subsequently, a weight is assigned to each concept index by considering its position in the knowledge source's hierarchical tree (i.e. its distance from the substituted words) and its number of occurrences. By applying the proposed technique, we were able to disambiguate words within different contexts, extrapolate concepts from documents, assign appropriate normalised weights, and significantly reduce the vector dimension.

Keywords- Indexing; Semantic; Ontology

I. INTRODUCTION

Over the last few decades, the increasing amount of digital information available, as well as the advances in data collection and analysis, has led to the need for better disambiguation, indexing, and categorization techniques that allow efficient and effective information retrieval (IR). Traditionally, lexical matching methods have been used to retrieve information from documents: terms in the query were matched against the terms contained in the documents. However, approaches that rely solely on lexical matching are often considered inadequate and likely to lead to poor retrieval performance [11].

The shortcomings of these approaches to IR lie in their lack of semantic consideration and, consequently, their inability to deal with synonyms (i.e. different words expressing the same concept) and polysemes (i.e. the same word expressing different concepts). Being unable to handle synonyms and polysemes, these approaches are likely to fail to retrieve relevant documents that do not contain any of the query terms, and might also retrieve irrelevant documents in the case of polysemes.

To overcome these problems, several retrieval methods go beyond the mere lexical matching of terms in items by trying to extrapolate the concepts described in the documents and to match them against the concepts expressed in the query. The indexing process performed in these approaches maps the item into a completely different representation, called concept indexing, which uses concepts as the basis for the final set of index values [12].

This type of approach to IR and document categorization is, of course, much more challenging, and over the years it has led to the development of a multitude of document indexing and categorization techniques [1], [8], [11], [15], [19]. In this respect, the ability to properly disambiguate the words contained in documents is essential, as it provides the basis for proper indexing and categorization techniques, as well as the foundation for any concept extrapolation method.

Most indexing formalisms make use of term frequency approaches to compute the weight to be assigned to index terms. These approaches represent a document by using its TF vector: a document D is represented by counting the number of times each term occurs within it, so the higher the weight of a term inside the vector, the higher the assumed importance of the term in the document (a minimal sketch of this representation is given after the list below). However, the use of indexing techniques based on TF approaches presents the following drawbacks:

• TF approaches generally fail to differentiate the degree of semantic importance of each term, and assign weights without distinguishing between semantically important and unimportant words within the document. This can lead to a failure in capturing the topics of the documents.

• TF algorithms do not consider synonyms, polysemy, etc. [10], [12]. Therefore, document and centroid vectors might contain terms that act as synonyms of the topic they describe [11]. This can easily happen if various terms are used interchangeably to describe the main topic.
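For illustration, the following is a minimal sketch of such a TF vector (not part of the original paper); the tokenisation and the toy stop-word list are simplifying assumptions.

```python
from collections import Counter

# Toy stop-word list; a real system would use a fuller list plus stemming.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "as"}

def tf_vector(document: str) -> Counter:
    """Represent a document by its raw term frequencies (the TF weights)."""
    terms = (t.strip(".,;:!?").lower() for t in document.split())
    return Counter(t for t in terms if t and t not in STOP_WORDS)

vector = tf_vector("Diamonds and pearls are precious gems; diamonds are often used as solitaires.")
print(vector["diamonds"], vector["pearls"])   # 2 1 -- 'diamonds' is weighted twice as heavily
```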

Another problem, frequently encountered in document indexing, is related to the high dimensionality of the data used to represent documents. This often leads to high computational costs and to an inefficient document representation when non-relevant terms are included. To tackle this problem, various approaches based on dimensionality reduction techniques have been increasingly investigated.


In simple terms, these techniques aim at mapping documents into a lower-dimensional space by explicitly considering the dependencies in the original data [6].

This paper proposes a new indexing technique, which can be used to cluster collections of documents while taking into account semantic and dimensionality reduction aspects. The proposed technique aims to overcome some of the major problems related to TF based approaches, such as their lack of consideration for synonyms and their usual failure to differentiate the degree of semantic importance of terms when assigning weights. The proposed technique also considers the drawbacks deriving from the high dimensionality of document vectors and suggests a concept-based approach to document indexing and categorization, which is expected to considerably reduce the number of features in the document vector. To overcome the problems posed by TF based approaches, our method extracts the common concepts shared by the words contained in the documents, and uses these common concepts to create semantic indexes and weights.

II. BACKGROUND

Various studies have been carried out in the area of conceptual document indexing; this section summarises some of the most relevant approaches and introduces the major points that distinguish these approaches from our technique.

Scott and Matwin [16] proposed a technique based on the "Ripper" algorithm [5], which is adapted to deal with the high dimensionality of classifying texts represented as sets of bags-of-words. The technique comprises the following steps: (a) the Brill tagger [2] identifies the part of speech of each word in the corpus; (b) WordNet is used to look up each noun and verb in the corpus and to retrieve a global listing of all synonym and hypernym synsets; infrequent synsets are discarded (i.e. synsets having a frequency of less than 0.05N, where N is the number of documents in the corpus), and the remaining synsets are used as the feature set; (c) the density of each synset is then calculated as the number of occurrences of the synset divided by the number of words in the document.
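The following is our own rough reconstruction of steps (b) and (c), using NLTK's WordNet interface rather than the tools used by the authors; reading the 0.05N cut-off as a document-frequency threshold is an assumption on our part.

```python
from collections import Counter
from nltk.corpus import wordnet as wn   # assumes the NLTK WordNet data is installed

def hypernym_density_features(tagged_docs, min_doc_ratio=0.05):
    """tagged_docs: one list of (word, pos) pairs per document, pos in {'n', 'v'}.
    Returns one {synset: density} dict per document (our reading of steps (b)-(c))."""
    per_doc = []
    for doc in tagged_docs:
        counts = Counter()
        for word, pos in doc:
            for syn in wn.synsets(word, pos=pos):   # every sense is kept: no disambiguation
                related = {syn}
                for path in syn.hypernym_paths():   # the synset plus all of its hypernyms
                    related.update(path)
                counts.update(related)
        per_doc.append((counts, max(len(doc), 1)))

    # Discard infrequent synsets; "frequency < 0.05 N" is read here as a document
    # frequency below 0.05 * number of documents (an assumption on our part).
    doc_freq = Counter()
    for counts, _ in per_doc:
        doc_freq.update(counts.keys())
    features = {s for s, df in doc_freq.items() if df >= min_doc_ratio * len(tagged_docs)}

    # Step (c): density = occurrences of the synset / number of words in the document.
    return [{s: counts[s] / n_words for s in features} for counts, n_words in per_doc]
```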

Scott and Matwin [16] make use of a parameter h (height) to control the level of generalization of the synsets. This parameter is needed to ensure that the common hypernym (i.e. the one shared by related words) is not too high in the WordNet tree; otherwise, this common hypernym could correspond to a concept that is too abstract (e.g. entity) and consequently not useful for indexing and classification purposes. If the value of h is too high the learner will suffer from overgeneralization, whereas if it is too low it will not generalize effectively. The authors suggest that the ideal value of h depends on several factors: topic, terminology, style and level of speech, as well as the WordNet structure [4].

It is important to draw attention to the fact that in the Scott and Matwin [16] technique no disambiguation is carried out before retrieving the global list of synonym and hypernym synsets, or during the calculation of hypernym density. All the senses returned by WordNet are considered equally correct and incorporated in the feature set. Scott and Matwin [16] also pointed out some other possible sources of error in their technique; these include errors related to the tagger, to the incompleteness of WordNet, and to the shallowness of WordNet's semantic hierarchy in some specific domains. The results achieved by the use of this technique were not optimal: the lack of proper disambiguation and the inability to learn accurately in a highly dimensional space are thought to be among the most likely causes of failure [7].

Rajman et al. [15] used the EDR Electronic Dictionary to associate concepts with documents' keywords. Similarly to Scott and Matwin [16], this approach does not really disambiguate words: the documents are first separated into topically homogeneous segments, then all the possible concepts associated with the segments' keywords are triggered, and a sub-hierarchy covering all of the triggered concepts is selected. Possible sets of concepts are obtained by considering the different cuts in the sub-hierarchy: the presented algorithm explores these cuts, scores them, and uses the scores to select the best cut to represent the document. A set of more than 200 short abstracts, manually annotated with keywords extracted from the abstracts, was used as a sample. These abstracts were manually associated with two different sets of concepts: one corresponding to simple keywords, and another corresponding to compound keywords. However, in the evaluation of the proposed algorithm, only concepts of the former type were considered, while compound keywords were reduced to their elementary constituents and associated with their corresponding concepts [15].

Finally, the technique proposed by Baziz et al. [1] detects mono- and multi-word WordNet concepts in both documents and queries, and then uses these concepts as a conceptual indexing space. Terms that are not present in WordNet are also added to the final expanded document and used during the searching phase to compare documents and queries. The two major steps involved in this technique are concept detection and concept weighting. Concepts are detected by mapping a document onto WordNet and extracting the mono- and multi-word terms (adjacent words are used to find multi-word concepts) that appear both in the document and in WordNet. An expansion method is then applied to mono-sense terms, which selects all their synonyms and one hypernym; this method is intended to avoid the disambiguation problem, although the authors acknowledge that a more complex expansion method could deliver better results. All of the extracted single- and multi-word concepts are then weighted using a TF-IDF based approach [1].
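As a reminder of the weighting scheme this builds on, a generic TF-IDF sketch is given below (not the exact variant used in [1]); the documents are assumed here to be already reduced to lists of extracted concept strings.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of documents, each already reduced to a list of concept strings.
    Returns one {concept: tf-idf weight} dict per document."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    n_docs = len(docs)
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({c: tf[c] * math.log(n_docs / doc_freq[c]) for c in tf})
    return weighted
```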

The above techniques demonstrate that the basic idea behind concept indexing remains generally the same: deriving concepts from documents' words and using these concepts as index values to represent the documents.


However, over the years, techniques focusing on concept indexing, although very promising, have generated numerous indexing algorithms producing the most disparate results. This reveals the need for further research and investigation on such an important topic. The technique presented in this paper proposes an alternative approach to concept indexing, highlighting the importance of disambiguation and of reducing the number of index values used to represent items. In particular, the approaches discussed above differ from our technique in terms of the relevance given to the semantic relationships between the words contained in a document, the disambiguation results, the algorithm employed to derive common concepts from a document's words, and the dimension of the final document vector.

III. WORDNET

In the present work we used WordNet (version 2.0) as a knowledge source to derive the various meanings of words (i.e. senses) and the lowest superordinates (lso). WordNet is an electronic lexico-semantic database; it was chosen because of its scope and free availability. WordNet organises nouns, verbs, adverbs, and adjectives into synonym sets (i.e. synsets) representing core lexicalized concepts, so that each sense of the same word belongs to a different synset. Synsets are connected via a variety of relations, such as synonymy, antonymy, hyponymy, etc.

WordNet can be graphically represented as a tree. The lexical tree can be built by following trails of superordinate terms. For example, using the symbol '@•' to indicate the is-a relation, the trail connecting the terms researcher, scientist, and person can be indicated as: researcher @• scientist @• person. In this way the lexical database can be considered a hierarchical network of terms, consisting of more specific terms at the lower levels and a few generic terms at the top [9]. Each term in WordNet can be considered as a node, and the various trails as the paths connecting these nodes.1 To ensure the existence of at least one path connecting any two nodes, a virtual top node connecting all the hierarchies (i.e. the root node) is also included in the database. Due to its scope, WordNet has been increasingly used to determine the similarity between terms, and various similarity/relatedness measures based on WordNet are reported in the literature. The terms similarity measure and relatedness measure are often used interchangeably to cover all of the measures that quantify the similarity, relatedness, and distance between two lexemes.

1 Paths are drawn differently depending on the type of relation existing between the nodes.

A variety of measures based on WordNet have been proposed in the literature to compute the similarity, relatedness, and distance between words; some of these are Leacock-Chodorow, Hirst-St-Onge, Jiang & Conrath, and Lin [14], [3]. The technique presented in this paper could easily be adapted to use any of these similarity measures. However, as demonstrated by the various evaluations of WordNet similarity measures reported in the literature [3], [13], these measures are highly context dependent, showing considerable differences in their performance depending on the circumstances in which they are evaluated, on the evaluation method, and on the input data used. Therefore, our approach does not rely on any of the above similarity measures but uses a simple node counting method to determine the most similar senses between terms and to then select their lowest superordinates. It is also important to stress that the method proposed in this paper is not dependent on a specific knowledge source (i.e. WordNet) but can be used in conjunction with any generic or domain-specific taxonomy/ontology.
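As an illustration (ours, not the paper's), NLTK's WordNet interface can print such superordinate trails and count the nodes separating two terms; the printed trail and the distance of 1 reflect WordNet 3.0 as shipped with NLTK, whereas the paper uses version 2.0.

```python
from nltk.corpus import wordnet as wn   # assumes the NLTK WordNet data is installed

# Follow a trail of superordinate (is-a) links from 'researcher' towards the root.
# We simply take the first noun sense returned by WordNet here; choosing the right
# sense is exactly the disambiguation problem addressed in Section IV.
researcher = wn.synsets('researcher', pos='n')[0]
for path in researcher.hypernym_paths():
    print(' @-> '.join(s.name() for s in path))
# Typically prints a trail running entity -> ... -> person -> scientist -> researcher.

# Node counting between two terms: the number of is-a edges on the shortest path.
scientist = wn.synsets('scientist', pos='n')[0]
print(researcher.shortest_path_distance(scientist))  # expected 1 (direct hyponym)
```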

IV. METHODOLOGY

The methodology adopted in the proposed technique comprises four major steps: disambiguation; concept extrapolation; detection of common concepts; weighting of common concepts. Each of these steps is described here by means of definitions and examples. The technique is based on the generally accepted assumption that the words contained in a document are used to convey common concepts, which can in turn be used to represent the document itself. In other words, a document D can be represented as a vector of m keywords (Wi), defined by the following set:

$D = \{W_1, W_2, W_3, W_4, \ldots, W_m\}$   (1)

where each keyword Wi (i=1,…,m) has n possible senses and WiSk (k=1,…,n) represents the different meanings that Wi can express.

In order to disambiguate and assign the proper sense to each keyword, we compare all of the keyword senses by taking two keywords at a time (Table 1) and selecting the two keywords' senses that are the most semantically related.

Table 1. Disambiguation process

Pairs of keywords in document D:
(W1Sk, W2Sk)
(W1Sk, W3Sk)  (W2Sk, W3Sk)
(W1Sk, W4Sk)  (W2Sk, W4Sk)  (W3Sk, W4Sk)
(W1Sk, W5Sk)  (W2Sk, W5Sk)  (W3Sk, W5Sk)  (W4Sk, W5Sk)
...
(W1Sk, WmSk)  (W2Sk, WmSk)  (W3Sk, WmSk)  (W4Sk, WmSk)  (W5Sk, WmSk)  ...

In other words, we compute the semantic relatedness between two words for each of the words' senses, and we select the most related senses as in Definition 1.

Definition 1 (Most related senses): The selection of the most related senses between two keywords is achieved by considering the number of nodes present in the paths that connect each of the two keywords’ senses, and selecting the two words’ senses residing on the shortest path (i.e. the path with the smallest number of nodes).

The proposed technique assumes that, between two keywords, the senses residing on the shortest path are the most related. The path P between two word senses can be calculated, by considering the depth dep of their lowest superordinate (i.e. LSO), as follows:

$P(W_iS_k, W_jS_k) = dep(W_iS_k) + dep(W_jS_k) - 2\,dep(LSO(W_iS_k, W_jS_k)) + 1$   (2)

Example 1: In general terms, consider two words W1 and W2, with W1 having only one possible meaning (i.e. W1S1) and W2 having two possible meanings (i.e. W2S1 and W2S2), as depicted in Figure 1. The most related senses are identified by applying the above formula and selecting the senses on the shortest path, in this case (W1S1, W2S1), as calculated below:

$P(W_1S_1, W_2S_1) = 3 + 3 - (2 \times 2) + 1 = 3$
$P(W_1S_1, W_2S_2) = 3 + 4 - (2 \times 1) + 1 = 6$

Fig. 1. Related words' senses
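One possible reading of Definition 1 and equation (2) in code is sketched below (ours, not the authors' implementation, which adapted WordNet::Similarity); dep is approximated here by the synset's minimum depth, which with multiple inheritance may differ from the depth measured along the path through the chosen LSO.

```python
from itertools import combinations
from nltk.corpus import wordnet as wn   # assumes the NLTK WordNet data is installed

def path_nodes(s1, s2):
    """Equation (2): P = dep(s1) + dep(s2) - 2*dep(LSO(s1, s2)) + 1, i.e. the number
    of nodes on the path joining the two senses through their lowest superordinate."""
    lsos = s1.lowest_common_hypernyms(s2)
    if not lsos:
        return float('inf')
    lso = max(lsos, key=lambda s: s.min_depth())    # deepest common superordinate
    return s1.min_depth() + s2.min_depth() - 2 * lso.min_depth() + 1

def most_related_senses(word_x, word_y):
    """Definition 1: keep the pair of noun senses that lies on the shortest path."""
    pairs = [(sx, sy) for sx in wn.synsets(word_x, pos='n')
                      for sy in wn.synsets(word_y, pos='n')]
    return min(pairs, key=lambda p: path_nodes(*p)) if pairs else None

# Pairwise comparison of a document's keywords, as in Table 1:
keywords = ['diamond', 'pearl', 'gem', 'solitaire']
for w_x, w_y in combinations(keywords, 2):
    print(w_x, w_y, most_related_senses(w_x, w_y))
```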

Definition 2 (LSO sets): After identifying all the related keywords, and considering only nouns in the knowledge source (i.e. WordNet), the sets LSO(x, y), composed of the lowest superordinates of each pair of keywords (x, y), are generated. To control the level of generalization in these sets, we make use of a parameter h, as proposed in [16], so that:

$LSO(x, y) = \{\, lso_i(x, y) : 1 \le i \le h \,\}$   (3)

Definition 3 (List of Common LSOs (CLSO)): Once the LSO sets have been generated for all the pairs of keywords (x, y) in the document (see Table 2), all those lowest superordinates that are common to more than one LSO set are stored separately in the list CLSO.

Example 3: For practical reasons this example only considers the keyword combinations involving W1S1 and makes use of the arbitrary lowest superordinates depicted in Figure 2. The lowest superordinates appearing in the LSO sets of more than one pair of keywords (i.e. the common lsos, or CLSO) are listed below Table 2.

Fig. 2. Keywords and lsos hierarchical position

Table 2. LSO(x, y) sets (h = 3)

Pair of keywords (x, y)    LSO1    LSO2    LSO3
(W1S1, W2S1)               LSOa    LSOb    LSOc
(W1S1, W3S4)               LSOb    LSOc    LSOd
(W1S1, W4S2)               LSOd    -       -
(W1S1, W5S4)               LSOd    -       -

Therefore, the list of CLSO = {LSOb, LSOc, LSOd}.

Definition 4 (Pair of keywords substitution): Each pair of keywords is substituted by the first of its h lowest superordinates (i.e. LSO1, LSO2, LSO3) that is included in the list of CLSO. In case none of its lowest superordinates belongs to the list, the pair of keywords is substituted by its first lowest superordinate (i.e. LSO1).

Example 4: By following the results obtained in Table 2 and applying Definition 4, we can proceed to substitute each pair of keywords with the selected concept (i.e. lowest superordinate), as shown in Table 3:

Table 3. Keywords substitution

Pair of keywords (x, y)    Substituted by:
(W1S1, W2S1)               LSOb
(W1S1, W3S4)               LSOb
(W1S1, W4S2)               LSOd
(W1S1, W5S4)               LSOd
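Definitions 2-4 can be sketched as follows (our own construction): for each pair of already-disambiguated senses, the h lowest superordinates are taken to be the lowest common hypernym followed by its ancestors along one is-a trail, which is the pattern shown later in Tables 6 and 7.

```python
from collections import Counter

def lso_chain(s1, s2, h=3):
    """Definition 2: the h lowest superordinates shared by a pair of senses, read here
    as their lowest common hypernym followed by its ancestors (one is-a trail upward)."""
    lsos = s1.lowest_common_hypernyms(s2)
    if not lsos:
        return []
    node = max(lsos, key=lambda s: s.min_depth())
    chain = []
    while node is not None and len(chain) < h:
        chain.append(node)
        hypers = node.hypernyms()
        node = hypers[0] if hypers else None
    return chain

def substitute_pairs(sense_pairs, h=3):
    """Definitions 3 and 4: build the CLSO list and substitute every pair of senses."""
    lso_sets = {pair: lso_chain(*pair, h=h) for pair in sense_pairs}
    counts = Counter(lso for chain in lso_sets.values() for lso in set(chain))
    clso = {lso for lso, n in counts.items() if n > 1}     # lsos shared by >1 LSO set
    substitutions = {}
    for pair, chain in lso_sets.items():
        shared = [lso for lso in chain if lso in clso]
        substitutions[pair] = shared[0] if shared else (chain[0] if chain else None)
    return substitutions, clso
```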

To estimate the semantic relation between two keywords and the concept substituting them, we assign to each concept a score as indicated in Definition 5.

Definition 5 (Score of concepts): A score is assigned to the selected lowest superordinate, calculated by considering its position in the taxonomy's hierarchical tree relative to the keywords' positions (i.e. its distance from the substituted keywords). The score is meant to provide a degree of relatedness between the concept and the substituted keywords.

Taking the maximum relatedness score to be equal to 1 when the selected concept coincides with the substituted keywords (i.e. when the keywords are synonyms), and considering LSO(x, y) to be the concept selected to substitute the considered pair of keywords, the score is inversely proportional to $P(W_iS_k, W_jS_k)$.

Example 5: By using the arbitrary hierarchical positions of the substituting concepts (see Figure 2), we can assign a score to each concept and rewrite Table 3 as follows (Table 4).

Table 4. LSO scores

Pair of keywords    Substituted by:    Path concept-keywords    Score(LSO)
(W1S1, W2S1)        LSOb               5                        0.20
(W1S1, W3S4)        LSOb               5                        0.20
(W1S1, W4S2)        LSOd               7                        0.14
(W1S1, W5S4)        LSOd               8                        0.12

As can be seen in Table 4, a score of 0.14 is given to LSOd when substituting (W1S1, W4S2), whereas LSOd has a score of 0.12 when substituting (W1S1, W5S4), since the distance between LSOd and (W1S1, W4S2) is lower than the distance between LSOd and (W1S1, W5S4). Therefore, in the latter case LSOd is to be considered less representative of the substituted keywords.

Based on Definition 5, we can now assign a collective weight to concepts as described in Definition 6. This represents the importance of each concept within a document.

Definition 6 (weighting of concepts): the collective weight of a concept (wCi) is defined as the sum of the individual scores of all the instances of the concept Ci in the document.

$w_{C_i} = \sum_{j=1}^{n_{C_i}} Score_j(C_i)$   (4)

where $Score_j(C_i)$ represents the individual score of the j-th instance of $C_i$, and $n_{C_i}$ is the number of instances of $C_i$ in the document.

In order to avoid differentiation between long and short documents, the computed concept weight is normalised by the number of pairs of keywords considered.

Example 6: Considering the data in Table 4, we have two concepts: C1 = LSOb and C2 = LSOd. By applying Definition 6 to the results illustrated in Table 4, we obtain a collective weight of 0.40 for C1 and of 0.26 for C2. After applying document length normalization (there are 4 pairs of keywords), these values become 0.10 for C1 and 0.06 for C2, as summarised in Table 5.

Table 5. Concepts representation and weight

Ci    Equal to    Substituting keywords            wCi
C1    LSOb        (W1S1, W2S1), (W1S1, W3S4)       0.10
C2    LSOd        (W1S1, W4S2), (W1S1, W5S4)       0.06
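A sketch of Definitions 5 and 6 follows (again ours, not the paper's implementation): consistent with the worked values in Tables 4 and 8, the score of a substituting concept is taken as the inverse of the node count of the path joining the two keywords through that concept, and the collective weights are normalised by the number of keyword pairs. As before, min_depth() is an approximation of the depth used in the paper.

```python
from collections import defaultdict

def concept_score(s1, s2, concept):
    """Definition 5: the score is the inverse of the number of nodes on the path that
    joins the two substituted senses through the substituting concept, so it equals 1
    when the concept coincides with both keywords (synonyms)."""
    path = s1.min_depth() + s2.min_depth() - 2 * concept.min_depth() + 1
    return 1.0 / max(path, 1)

def concept_weights(substitutions):
    """Definition 6: sum the scores of all instances of each concept, then normalise by
    the number of keyword pairs so that long and short documents remain comparable."""
    weights = defaultdict(float)
    for (s1, s2), concept in substitutions.items():
        if concept is not None:
            weights[concept] += concept_score(s1, s2, concept)
    n_pairs = max(len(substitutions), 1)
    return {concept: w / n_pairs for concept, w in weights.items()}
```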

V. EMPIRICAL EXAMPLE

For practical reasons, in order to demonstrate how our technique can be used to achieve word sense disambiguation, concept extrapolation, and vector dimensionality reduction, we apply the technique described above to the following sentences:

Sentence 1: Diamonds and pearls are precious gems often used as solitaires.
Sentence 2: Diamonds win over clubs but not over spades.

After removing stop words and suffixes and by only considering nouns in the document, we can represent the two sentences by the following vectors:

Sentence 1 = {diamond, pearl, gem, solitaire}
Sentence 2 = {diamond, club, spade}

It is evident that the above words are very ambiguous as they can have more than one meaning. For instance, solitaire can be related to precious stones but it could also be a card game; club could be a type of playing card but also an association, a weapon, or a nightclub. Also, the same word, diamonds, is used in both sentences with a different meaning. The aim of our method is to ‘understand’ the context in which these ambiguous words are used and to extrapolate the concepts out of these sentences, so that the sentences can be subsequently listed under an appropriate category.

Firstly, the words (i.e. features) in each given sentence vector are disambiguated by computing the shortest path between each pair of features2. By applying Definition 1 and using WordNet as a knowledge source, we obtain the following most related senses (Sk) for each feature:

Sentence 1 = (diamondS1, pearlS1, gemS5, solitaireS1)
Sentence 2 = (diamondS3, clubS6, spadeS1)

Considering h = 3, the three lowest superordinates are found for each pair of keywords, as reported in Table 6 and Table 7.

Table 6. Sentence 1 - pairs of keywords' LSOs

Pair of keywords              LSO1     LSO2        LSO3
(diamondS1, pearlS1)          jewel    jewellery   adornment
(diamondS1, gemS5)            jewel    jewellery   adornment
(diamondS1, solitaireS1)      jewel    jewellery   adornment
(pearlS1, gemS5)              jewel    jewellery   adornment
(pearlS1, solitaireS1)        jewel    jewellery   adornment
(gemS5, solitaireS1)          jewel    jewellery   adornment

Table 7. Sentence 2 - pairs of keywords' LSOs

Pair of keywords          LSO1           LSO2    LSO3
(diamondS3, clubS6)       playing_card   card    paper
(diamondS3, spadeS1)      playing_card   card    paper
(clubS6, spadeS1)         playing_card   card    paper

2 The WordNet::Similarity package was adapted in order to implement the proposed algorithm (http://search.cpan.org/dist/WordNet-Similarity).

Subsequently, by applying Definition 2 and Definition 3, the list of CLSO is computed for both sentences:

Sentence 1 list of CLSO = (jewel, jewellery, adornment)
Sentence 2 list of CLSO = (playing_card, card, paper)

It is important to note that in this example the pairs of keywords in each sentence share the same lowest superordinates. Therefore, all the pairs of keywords are substituted by their first lso (see Definition 4). However, different scores are assigned to the substituting concept due to the different lengths of the paths connecting the concept with the substituted keywords.

Table 8. Sentence 1 - keywords substitution and scores

Pair of keywords              Substituted by concept:    Path concept-keywords    Score(LSOi)
(diamondS1, pearlS1)          jewel                      3                        0.33
(diamondS1, gemS5)            jewel                      2                        0.50
(diamondS1, solitaireS1)      jewel                      3                        0.33
(pearlS1, gemS5)              jewel                      2                        0.50
(pearlS1, solitaireS1)        jewel                      3                        0.33
(gemS5, solitaireS1)          jewel                      2                        0.50

Table 9. Sentence 2 - keywords substitution and scores

Pair of keywords          Substituted by concept:    Path concept-keywords    Score(LSOi)
(diamondS3, clubS6)       playing_card               3                        0.33
(diamondS3, spadeS1)      playing_card               3                        0.33
(clubS6, spadeS1)         playing_card               3                        0.33

As shown in Table 8, the same concept jewel is assigned different scores depending on the semantic relationship between the concept and the substituted keywords. All of the keywords in Table 8, apart from gem, are immediate hyponyms (subordinates) of the concept jewel; the keyword gem and jewel are instead synonyms. Consequently, a greater score (i.e. 0.50) is assigned to this concept when one of the two considered keywords is gem.

Subsequently, collective weights are calculated as per Definition 6 and normalised by the number of pairs of keywords in each sentence. From the previous tables, Sentence 1 is now reduced to the unique concept jewel, with a normalised weight equal to 0.41 (i.e. 2.46/6), while Sentence 2 is reduced to the unique concept playing_card, with a weight equal to 0.33 (i.e. 0.99/3).

The two sentences can now be represented by the following vectors, each consisting of a single concept instead of the initial keywords:

Sentence 1 = {jewel 0.41}
Sentence 2 = {playing_card 0.33}

It is necessary to note that, when applying this technique to real documents, a variety of concepts will be used as index values to represent each document, and the concepts' weights will then provide an indication of the importance of each concept within the considered document.
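Pulling the earlier sketches together, the empirical example can be reproduced approximately as follows; most_related_senses, substitute_pairs and concept_weights are the helpers defined in our sketches above, and the exact senses, concepts and weights obtained will depend on the WordNet version (the paper uses 2.0, while current NLTK releases ship 3.0).

```python
from itertools import combinations

def index_document(keywords, h=3):
    """Disambiguate the keywords pairwise, substitute each pair by a common
    superordinate, and return the weighted concept vector of the document."""
    sense_pairs = []
    for w_x, w_y in combinations(keywords, 2):
        senses = most_related_senses(w_x, w_y)              # Definition 1 (sketch above)
        if senses:
            sense_pairs.append(senses)
    substitutions, _ = substitute_pairs(sense_pairs, h=h)   # Definitions 2-4
    return concept_weights(substitutions)                   # Definitions 5-6

sentence_1 = ['diamond', 'pearl', 'gem', 'solitaire']
sentence_2 = ['diamond', 'club', 'spade']
print(index_document(sentence_1))   # expected to be dominated by a jewel-like synset
print(index_document(sentence_2))   # expected to be dominated by a playing-card synset
```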

VI. CONCLUSION

This paper has described a new concept indexing technique aimed at overcoming the major problems encountered in the use of TF based approaches. By applying the proposed technique, we were able to disambiguate words within different contexts, extrapolate concepts from documents, and assign appropriate normalised weights. The proposed technique also allows for a significant reduction of the vector dimension. We are currently testing our technique on a wider collection of experimental data, considering document abstracts from various domains, in order to perform full clustering. The results from these tests will also facilitate the establishment of thresholds, to be used during the indexing of each document, to discard low-frequency and high-frequency concepts from the feature sets.

Since it relies on an external knowledge source, the method presented in this paper produces results that are strictly influenced by the scope and completeness of the chosen knowledge source: concepts will not be found for words that are not present in it. However, this drawback should be considered inherent to the knowledge source itself rather than a weakness of the proposed indexing technique. Accordingly, the proposed method is expected to perform better when applied to a specific domain, as the probability of retrieving overly generic concepts is higher when a generic knowledge source, with a small number of paths and nodes, is used. Consequently, greater performance will be achieved when specialised documents are used as input and a domain-specific taxonomy/ontology is used as the knowledge source.

REFERENCES

[1] Baziz, M., Boughanem, M., Aussenac-Gilles, N.: Evaluating a Conceptual Indexing Method by Utilizing WordNet. In: Accessing Multilingual Information Repositories: 6th Workshop of the Cross-Language Evaluation Forum, CLEF 2005, Revised Selected Papers, Vienna, Austria, 21/09/05-23/09/05, Peters, C., Gey, F.C., Gonzalo, J., Jones, G.J.F. (Eds.), Lecture Notes in Computer Science, Vol. 4022 (2005)

[2] Brill, E.: A Simple Rule-Based Part of Speech Tagger. Proceedings of the Third Conference on Applied Natural Language Processing. ACL (1992) 152-155

[3] Budanitsky, A., Hirst, G.: Semantic Distance in WordNet: An Experimental, Application-Oriented Evaluation of Five Measures. In: Workshop on WordNet and Other Lexical Resources, Second Meeting of the North American Chapter of the Association for Computational Linguistics (2001) 29-34

[4] Chaves, R.P.: WordNet and Automated Text Summarization. In: Proceedings of NLPRS-01, 6th Natural Language Processing Pacific Rim Symposium, Tokyo, Japan (2001)


[5] Cohen, W.W.: Fast Effective Rule Induction. In: Proceedings of ICML-95, Lake Tahoe, California (1995)

[6] Fodor, I.K.: A survey of dimension reduction techniques. LLNL technical report, June (2002). UCRL-ID-148494.

[7] Gómez Hidalgo, J.M., Cortizo Pérez, J.C., Puertas Sanz, E., Ruíz Leyva, M.: Concept Indexing for Automated Text Categorization. In: Proceedings of 9th International Conference on Applications of Natural Language to Information Systems, NLDB 2004, Lecture Notes in Computer Science, Vol. 3136, Springer (2004) 195-206

[8] Mihalcea, R., Moldovan, D.: Semantic Indexing Using WordNet Senses. In: Proceedings of the ACL Workshop on IR & NLP, Hong Kong (2000)

[9] Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J.: Introduction to WordNet: An On-line Lexical Database. International Journal of Lexicography 3 (4) (1990), 235-244

[10] Kang, B.Y., Lee, S.J.: Document Indexing: A Concept-Based Approach to Term Weight Estimation. Information Processing and Management 41 (2005), 1065-1080

[11] Karypis, G., Han, E.H.: Concept Indexing: A Fast Dimensionality Reduction Algorithm with Applications to Document Retrieval and Categorization. Technical Report TR-00-016, Department of Computer Science, University of Minnesota, Minneapolis (2000)

[12] Kowalski, G.: Information Retrieval Systems: Theory and Implementation. Kluwer Academic Publishers, Boston (1997)

[13] Patwardhan, S., Banerjee, S., Pedersen, T.: Using Measures of Semantic Relatedness for Word Sense Disambiguation. In: Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City. Springer-Verlag (2003) 241-257

[14] Pedersen, T., Banerjee, S., Patwardhan, S.: Maximizing Semantic Relatedness to Perform Word Sense Disambiguation. In: University of Minnesota Supercomputing Institute Research Report UMSI 2005/25 March, (2005)

[15] Rajman, M., Andrews, P., Del Mar Perez Almenta, M., Seydoux, F.: Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy. In Proceedings of the Applied Stochastic Models and Data Analysis (ASMDA 2005). ENST Bretagne, France (2005) 98-105

[16] Scott, S., Matwin, S.: Text Classification Using WordNet Hypernyms. In: Proceedings of the COLING/ACL Workshop on Usage of WordNet in Natural Language Processing Systems (1998) 45-51
