
Cybernetics and Systems Analysis, Vol. 47, No. 4, July, 2011

A METHOD FOR THE COMPUTATION OF THE SEMANTIC SIMILARITY AND RELATEDNESS BETWEEN NATURAL LANGUAGE WORDS

A. V. Anisimov,a† O. O. Marchenko,a‡ and V. K. Kysenkoa††

UDC 681.3

Abstract. This paper develops methods for calculating the semantic similarity (closeness)-relatedness of natural language words. The concept of semantic relatedness allows one to construct algorithmic models for context-linguistic analysis with a view to solving problems such as word sense disambiguation, named entity recognition, natural language text analysis, etc. A new algorithm is proposed for estimating the semantic distance between natural language words. This method is a weighted modification of the well-known Lesk approach based on the lexical intersection of glossary entries.

Keywords: computer linguistics, semantic analysis of natural language texts, semantic similarity-relatedness of words, semantic ambiguity of words.

INTRODUCTION

A key element in computer simulation of natural language processes is the possibility to determine the semantic closeness (similarity), i.e., the semantic distance between concepts, that is often specified on the graph of concepts (notions) of an ontological knowledge base. The computation of semantic distances is widely used in many problems of computational linguistics such as automatic abstracting and annotation of texts, word sense disambiguation, anaphora analysis, indexing and search, and machine translation.

In a natural language, there are a number of classical problems of considerable complexity for the majority of tasks of computer linguistics, namely, polysemy, homonymy, anaphoric references, pronouns, and other language phenomena whose computer processing is impossible without some semantic analysis and semantic interpretation of a text. The essence of the polysemy and homonymy problems is that the same words denote sets of different concepts (for example, the word bank has different semantic meanings such as a financial institution and a riverside). The context in which a given word is located suggests the meaning in which it is used. To take into account the influence of the context and to determine the actual meaning of some word, a computer system should find, for each meaning of this word, an estimate of its semantic closeness to the meanings of the words adjacent to it in the text. This is solved by the application of a function for computing the semantic closeness and relatedness of concepts.

In computer linguistics, the anaphora problem is as follows: the same entity in a text is mentioned using different words or names; a particular case of an anaphora is a pronoun. For each pronoun, there can be a wide variety of candidates for replacement (antecedents), i.e., noun groups that are located earlier in the text and that are denoted by this pronoun. One can determine the candidate that is the correct antecedent by substituting each of them for the pronoun (anaphora) and computing the degree of correspondence of the context of the candidate for replacement with the context of the pronoun (anaphora). Such a correspondence is also found with the help of a function for computing the semantic closeness and relatedness of concepts.

1060-0396/11/4704-0515 © 2011 Springer Science+Business Media, Inc.

aTaras Shevchenko National University, Kiev, Ukraine, [email protected]; [email protected]; ††[email protected]. Translated from Kibernetika i Sistemnyi Analiz, No. 4, pp. 18–27, July–August 2011. Original article submitted March 10, 2011.


A semantic closeness (similarity) relationship is not merely a synonymy relationship, since the meanings of concepts can be closely related but not identical. The presence of many other relationships motivates the following refinement of semantic relatedness: engine and car are connected through the whole-part relationship, and cold and hot are antonyms. At the same time, it is difficult to establish a direct relationship between many words (for example, winter and blizzard), but, despite this, one can see that they are explicitly semantically related.

Semantic closeness and semantic relatedness relationships differ from one another. Whereas boat and launch are semantically close concepts, engine and fuel are semantically related concepts but are not similar in meaning.

Semantic closeness and semantic relatedness are relationships traditionally defined on the semantic graph of an ontological knowledge base. The determination of the presence of some relationship between concepts is realized by checking the existence of semantic relationships between nodes that contain the corresponding concepts in an ontological network. Such a check is often reduced to the problem of searching for the shortest path between concept vertices of the graph of the knowledge base. The construction of the path is followed by the stage of its analysis and interpretation, whose purpose is the determination of the semantic meaning of the found path, i.e., the type of the semantic relationship that exists between these concepts and the depth of this relationship.

There is also another approach to the determination of an estimate for the semantic closeness (similarity)-relatedness of concepts, which is proposed in [1]. Methods of this line of investigation compute the intersection of the lexical composition of the definition entries for two input concepts: the more words belong to this intersection, the more related these concepts are.

In this article, a new method for the determination of the semantic relatedness of concepts is proposed. It is suggested that it is more expedient not only to compute the intersection of the sets of lexemes of the two thesaurus entries that define the two input concepts but also to take into account the position of each word within the entry defining a concept. To this end, a thesaurus entry should be structured by partitioning it into zones with different priority levels, for example, “name,” “definition,” “links to other terms,” and “descriptive part.” Depending on the zone to which a meaningful word belongs, a definite priority weight is assigned to it. Thus, instead of a simple set of lexemes of the text defining a concept, a set of subsets of terms is considered in which each subset has its own weight. We propose to compute and analyze not the intersection of two lexical sets of the texts defining the input concepts but the intersection of structured “multilevel” sets. This allows one to consider all variants of pairwise intersections of subsets from the first and second sets and to take into account delicate nuances of the lexical organization of the texts, for example, the number of common words in the names of the first and second concepts (the priority weight of such an intersection is highest), the number of common words in the definition of the first concept and in the name of the second one (as is obvious, its weight should be less than that of the previous one), the number of common words in the definition of the first concept and in the descriptive part of the entry of the second concept (its weight decreases still further), etc. Analyzing all possible variants of multilevel intersections and selecting an optimal weight for each variant, one can construct a qualitatively new efficient estimate of the semantic closeness-relatedness of natural language words.

MODERN METHODS OF COMPUTATION OF SEMANTIC CLOSENESS

Let us consider existing methods for semantic distance computation. Since the beginning of the 1980s, several heuristic methods have been developed.

The choice of source data, i.e., the foundation for the computation of semantic closeness, is very important. The linguistic knowledge bases WordNet and ConceptNet are mostly used in investigations; Wikipedia and Google Search are also involved. The most important results have been obtained using WordNet and Wikipedia [2–4].

One class of methods is based on the computation of the distance $\delta(c_1, c_2)$ between two concepts (nodes) $c_1$ and $c_2$ in some taxonomy (WordNet or the category tree of Wikipedia). For example, the shortest path between the two corresponding vertices in this taxonomy can be used. One of the first such metrics, proposed in [5], is as follows:

$$\delta(c_1, c_2) = \frac{1}{N_p},$$

where $N_p$ is the number of vertices in the shortest path connecting the nodes $c_1$ and $c_2$. A noted drawback of this metric is the nonuniformity of the depths of taxonomy concepts. In [6], the following normalized version of this method is presented that takes into account the height of the taxonomy being used:

$$\delta(c_1, c_2) = -\log\frac{N_p}{2D},$$

where $D$ is the maximal depth of the taxonomy tree.

One more method is described in [7]. The proposed algorithm takes into account $\mathrm{LSO}(c_1, c_2)$, i.e., the depth of the lowest superordinate of the two nodes of the taxonomy graph that correspond to the concepts $c_1$ and $c_2$:

$$\delta(c_1, c_2) = -\log\frac{\mathrm{depth}(\mathrm{LSO}(c_1, c_2))}{\mathrm{depth}(c_1) + \mathrm{depth}(c_2)},$$

where $\mathrm{depth}(x)$ is the distance from the taxonomy root to a node $x$.
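
To make the path-based measures above concrete, the following Scala sketch computes the inverse-path metric and its normalized logarithmic variant over a taxonomy represented as an in-memory adjacency map. The object and method names (PathMetrics, pathLength, etc.) are illustrative and not part of the cited systems.

import scala.collection.mutable

object PathMetrics {
  // Taxonomy as an undirected adjacency map; nodes are concept names.
  type Graph = Map[String, Set[String]]

  // Number of vertices on the shortest path between two nodes (BFS), or None if disconnected.
  def pathLength(g: Graph, from: String, to: String): Option[Int] = {
    val visited = mutable.Set(from)
    val queue = mutable.Queue(from -> 1)
    var found: Option[Int] = None
    while (found.isEmpty && queue.nonEmpty) {
      val (node, len) = queue.dequeue()
      if (node == to) found = Some(len)
      else for (next <- g.getOrElse(node, Set.empty) if !visited(next)) {
        visited += next
        queue.enqueue(next -> (len + 1))
      }
    }
    found
  }

  // delta(c1, c2) = 1 / Np  (the metric from [5] as reproduced above).
  def inversePath(g: Graph, c1: String, c2: String): Option[Double] =
    pathLength(g, c1, c2).map(np => 1.0 / np)

  // delta(c1, c2) = -log(Np / 2D)  (the normalized variant from [6]).
  def normalizedPath(g: Graph, c1: String, c2: String, maxDepth: Int): Option[Double] =
    pathLength(g, c1, c2).map(np => -math.log(np.toDouble / (2.0 * maxDepth)))
}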

In [8], Wikipedia was first used for the calculation of semantic distance. The WikiRelate! method applies the above-mentioned metrics to the Wikipedia category tree.

Another class of algorithms was initiated by M. Lesk [1]. He constructed an algorithm based on the idea that close concepts are defined with the help of similar collections of words. As the semantic distance between concepts, the ratio of the number of identical words in the definitions of the concepts to the total number of words in the two definitions is used.

In the past five years, several methods based on Wikipedia have been developed whose accuracy was unattainable earlier. In [9], the Wikipedia Link-Based Measure (WLM) method is proposed for the computation of semantic closeness on the basis of links between pages. Its main idea is the assumption that a concept (represented here by some Wikipedia entry) is rather exactly described by its incoming and outgoing links. Each link has a weight determined by the frequency of its occurrence among all pages of the encyclopedia. Thus, a vector of links corresponds to each entry. The weight of a link is computed using the well-known TF-IDF formula. The distance between entries is found as the cosine distance between the vectors of entry weights.
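
The core of WLM, as described above, thus reduces to a cosine similarity between sparse weighted link vectors. A minimal sketch of that computation is given below; the type alias and the IDF-style weighting helper are illustrative assumptions, not the API of [9].

object LinkVectors {
  // A link vector: target page -> weight.
  type LinkVector = Map[String, Double]

  // An IDF-style link weight: the rarer a page is as a link target, the higher the weight.
  def linkWeight(pagesLinkingToTarget: Int, totalPages: Int): Double =
    math.log(totalPages.toDouble / pagesLinkingToTarget)

  // Cosine similarity between two sparse link vectors.
  def cosine(a: LinkVector, b: LinkVector): Double = {
    val dot = a.iterator.map { case (k, v) => v * b.getOrElse(k, 0.0) }.sum
    val normA = math.sqrt(a.values.map(x => x * x).sum)
    val normB = math.sqrt(b.values.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }
}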

One of the most efficient methods is Explicit Semantic Analysis (ESA), described in [4]. In contrast to the well-known Latent Semantic Analysis (LSA) algorithm, in which implicit relationships between the texts of entries are determined, in this method a concept is represented in explicit form as a weighted sum of terms obtained from Wikipedia. A given concept is projected into the space of Wikipedia entry vectors. Thus, the semantic closeness is determined as the cosine distance between the vectors projected into the space of Wikipedia entries.

In [10], the WikiWalk method is presented that applies the technique of random walks on a graph. Two types of graphs are considered, constructed with the help of WordNet and Wikipedia. This method uses the algorithm called Personalized PageRank: a certain particle randomly wanders over graph nodes (in the case of Wikipedia, over its entries) and passes to a new page with some probability. Thus, each graph node is characterized by a vector of probabilities of transitions to other pages (a teleportation vector). Such a vector turns out to be a unique characteristic of a Wikipedia page (and, together with it, of the corresponding described concept). The semantic closeness is computed as the distance between the teleportation vectors of the corresponding pages.

A METHOD FOR THE COMPUTATION OF SEMANTIC CLOSENESS-RELATEDNESS

The source data used in this paper is the free Internet encyclopedia Wikipedia. At present, the English-language, Russian-language, and Ukrainian-language Wikipedias contain more than 3.5 million, more than 600 thousand, and more than 250 thousand entries, respectively. This large number of entries is provided by the openness of the project: each user can create, correct, and supplement entries. Owing to moderation, this does not decrease the quality of entry texts, since practically each change is checked by a user or a group of users that have proved their competence. A very important factor is also the possibility of downloading a complete local copy of Wikipedia. However, this encyclopedia has definite drawbacks. Some entries are not completely objective; for example, an author can introduce his personal opinion concerning some question. One more drawback is the insufficient rigor of the entry description format, which considerably complicates the development of a program for analyzing the texts of the encyclopedia. The Internet encyclopedia Wikipedia is a unique and valuable but not formalized data source.

The Wikipedia structure has a number of properties that can be used in computing semantic closeness. These properties can model some types of lexical relations between words.


• Synonymy. It is determined by means of redirection pages. As a rule, the contents of such entries consist of a string of the form “#REDIRECT <page name>.” For example, the entry auto refers to the page car.
• Homonymy. It is specified by special pages with a list of possible meanings of a concept. For example, the page note contains links to different meanings of this word, for example, musical symbol, diplomatic note, financial bond, model of tape recorders, and river name. In this work, pages of this type are used for word sense disambiguation, in particular, for obtaining the list of possible meanings of a term.
• Cross references. They are represented by links to other Wikipedia entries. For example, the entry water contains links to the following entries: chemical substance, liquid, ice, snow, steam, solvent, ocean, river, life, weather, climate, etc. Such links show interrelations between concepts.

The proposed method, called Estimated Weighted Overlap (EWO), is a development of the above-mentioned Lesk approach. That method proceeds from the assumption that close concepts are described (or defined) using similar collections of words, i.e., the number of common words in dictionary definitions can show the semantic closeness of the corresponding two concepts. The proposed functional-structural generalization of the Lesk method is based on the idea that a text reflects a semantic ordering between words. Some words are more important than others depending on their position in the text. For example, a word from the name of an entry (or from the definition of a term) usually has greater significance than a word from the end of the text. To introduce such a distinction, the proposed algorithm assigns to each word of an entry text a weight corresponding to the significance of the word. The weights of words are computed from the following features: the name of the entry contains the given word; the word belongs to the definition of the corresponding concept; the word belongs to the first section of the entry; the word is a link to another entry; all other words.

We assume that the two words to be estimated are given as input to the algorithm. It first fetches the corresponding entries from Wikipedia. Then the texts of the entries are divided into words. Next, the algorithm eliminates words from the stop-list. The stop-list contains words that do not carry a large semantic load, namely, prepositions, conjunctions, pronouns, words in general use, etc. At the next step, the algorithm divides the sets of words into subsets corresponding to the prescribed features. For example, for the features described above, the sets are as follows: $L_1$ are words from the name; $L_2$ are words from the concept definition; $L_3$ are words from the first section; $L_4$ are words representing cross references; $L_5$ are all other words. At the same time, if some word $w$ belongs to $L_i$, then it is eliminated from $L_j$ for any $j > i$.
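
A minimal Scala sketch of this partitioning step is given below. The Entry fields, the tiny stop-list, and the tokenization are illustrative assumptions; the point is only that every significant word is assigned to the highest-priority zone in which it occurs.

object EntryZones {
  // Hypothetical representation of a fetched entry; the field names are illustrative.
  case class Entry(name: String, definition: String, firstSection: String,
                   crossReferences: Set[String], fullText: String)

  // A deliberately tiny stop-list; a real one would be much larger.
  private val stopList = Set("a", "an", "the", "of", "in", "and", "or", "to", "is", "it")

  private def words(text: String): Set[String] =
    text.toLowerCase.split("[^\\p{L}]+").filter(w => w.nonEmpty && !stopList(w)).toSet

  // Partition the significant words of an entry into zones L1..L5.
  // A word is kept only in the highest-priority (lowest-index) zone it occurs in.
  def zones(e: Entry): Vector[Set[String]] = {
    val raw = Vector(
      words(e.name),                    // L1: words from the entry name
      words(e.definition),              // L2: words from the concept definition
      words(e.firstSection),            // L3: words from the first section
      e.crossReferences.flatMap(words), // L4: words from cross references
      words(e.fullText)                 // L5: all other significant words
    )
    raw.zipWithIndex.map { case (zone, i) =>
      zone -- raw.take(i).reduceOption(_ ++ _).getOrElse(Set.empty)
    }
  }
}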

This method proposes to take into account the structure of entries rather than to simply analyze the intersection of two lexical collections of Wikipedia entries for the two input concepts. If the names and definitions of the concepts contain common terms, then the intersection of the lexemes of the names and definitions must have a much larger significance weight than the intersection over the entire rest of the entry body. It is proposed to divide the significant words of both entries according to the corresponding features into groups $L_1^1, \ldots, L_n^1$ and $L_1^2, \ldots, L_n^2$ and then to count the pairwise intersections $L_i^1 \cap L_j^2$. In the case being considered, the number of features is equal to five (the number of features can be different in another realization of the algorithm). For each possible intersection, the corresponding priority weight is determined, which is maximal for the intersection of terms from the names $L_1^1 \cap L_1^2$ and minimal for common terms from the descriptive parts of the entries $L_5^1 \cap L_5^2$. An intermediate weight is assigned to an intermediate variant, for example, when some terms are used in the definition of the first concept but appear descriptively at the end of the entry for the second concept (the intersection $L_2^1 \cap L_5^2$).

A weight $w_i$ is assigned to each word from the $i$th set $L_i$. Based on the sets $L_i^1$ and $L_j^2$, a matrix $D$ is constructed in which an element $D[i, j]$ is equal to the number of common words in $L_i^1$ and $L_j^2$, i.e., $|L_i^1 \cap L_j^2|$, multiplied by the weight $w_{ij} = w_i \cdot w_j$. We assume that the semantic closeness is equal to the normalized sum of the elements of the matrix $D$.

The algorithm performs the following actions.

1. For the two concepts $c_1$ and $c_2$, retrieve the entries $t_1$ and $t_2$ defining these concepts. Extract all words from the entries $t_1$ and $t_2$. Denote the sets of words by $T_1$ and $T_2$, respectively.

2. Eliminate the words belonging to the stop-list from $T_1$ and $T_2$.

3. Divide the sets $T_1$ and $T_2$ into subsets $L_1^1, \ldots, L_n^1$ and $L_1^2, \ldots, L_n^2$ according to the given features, where $n$ is the number of features.

4. Construct the matrix $D$ of the form

$$D = \begin{pmatrix} w_{11}\,|L_1^1 \cap L_1^2| & \cdots & w_{1n}\,|L_1^1 \cap L_n^2| \\ \vdots & \ddots & \vdots \\ w_{n1}\,|L_n^1 \cap L_1^2| & \cdots & w_{nn}\,|L_n^1 \cap L_n^2| \end{pmatrix}.$$


5. Compute the semantic closeness value as the normalized sum

$$\mathrm{EWO}(c_1, c_2) = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} D[i, j]}{\sum_{i=1}^{n} w_i\,(|L_i^1| + |L_i^2|)}.$$
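
Assuming the two entries have already been partitioned into zone sets (for example, with the EntryZones sketch above), steps 4 and 5 can be sketched in Scala as follows; the per-zone weights passed in are illustrative and would in practice come from the annealing procedure described below.

object EWO {
  // Semantic closeness of two concepts given their zone partitions (each a vector of n word
  // sets) and per-zone weights w(0), ..., w(n-1).
  def closeness(l1: Vector[Set[String]], l2: Vector[Set[String]], w: Vector[Double]): Double = {
    val n = w.length
    require(l1.length == n && l2.length == n)

    // D[i][j] = w_i * w_j * |L_i^1 intersect L_j^2|   (step 4)
    val d = Array.tabulate(n, n)((i, j) => w(i) * w(j) * (l1(i) & l2(j)).size)

    // Normalized sum of the elements of D              (step 5)
    val norm = (0 until n).map(i => w(i) * (l1(i).size + l2(i).size)).sum
    if (norm == 0.0) 0.0 else d.map(_.sum).sum / norm
  }
}

For example, EWO.closeness(EntryZones.zones(entry1), EntryZones.zones(entry2), Vector(5.0, 4.0, 3.0, 2.0, 1.0)) would score two fetched entries with hand-picked weights that favor name and definition overlaps.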

The procedure for obtaining the weights $w_i$ on the basis of the simulated annealing algorithm (a method for discrete global optimization) is described below in detail.

WORD SENSE DISAMBIGUATION

Some concepts can have identical spellings but different meanings. For example, the word jaguar can denote an animal from the cat family and a British car model. Thus, one should correctly choose a meaning (and an entry from Wikipedia) depending on the second word of a pair. For example, if the pair of words <jaguar; lion> is given as input to the algorithm, then jaguar must be considered as a big cat, and if the pair <jaguar; Mercedes> is given, then it must be interpreted as a car model. An algorithm has been developed for resolving such ambiguities.

As above, a pair of words is given as input to the algorithm. For both words, we obtain a list of possible candidate entries (meanings). Then, for each pair of meanings in which the first meaning belongs to one list and the second belongs to the other list, the semantic closeness value is computed, and the pair with the largest value is chosen. More formally, the algorithm is written as follows.

1. Obtain the lists of meanings for both words as follows:
• retrieve the list of entries with a name of the form <word> (refinement) from the index;
• (additionally) retrieve the list of possible meanings from the page with the description of ambiguities.

2. For each pair of entries, calculate the semantic closeness (similarity)-relatedness value.

3. Choose the pair with the highest semantic closeness.

In practical implementations, this process can be optimized as follows: use only the first sections of entries instead of the intersection of their complete texts. This optimization considerably reduces the computational complexity of the process but does not noticeably affect the accuracy of the computations.
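
A sketch of this selection step in Scala is shown below. The candidates lookup and the closeness function are supplied by the caller (for instance, the EWO sketch above restricted to first sections), so nothing here is tied to a particular index format.

object Disambiguation {
  // Choose, among all candidate meanings of two words, the pair with the maximal closeness.
  // T stands for whatever representation of an entry (meaning) is used.
  def resolve[T](word1: String, word2: String,
                 candidates: String => Seq[T],
                 closeness: (T, T) => Double): Option[(T, T, Double)] = {
    val scored = for {
      e1 <- candidates(word1)
      e2 <- candidates(word2)
    } yield (e1, e2, closeness(e1, e2))
    if (scored.isEmpty) None else Some(scored.maxBy(_._3))
  }
}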

ESTIMATION OF WEIGHTS

For estimating the weights $w_{ij}$, the simulated annealing method [11] is used, i.e., a probabilistic heuristic technique for solving global optimization problems. This method operates with points in a decision space. In the case being considered, a point is a vector consisting of five weights that correspond to the chosen features. At each iteration of the algorithm, one point is stored, namely, the current one, which can be changed according to a definite probabilistic rule. The structure of the pseudocode [12] of this algorithm for the maximization of a function $F(x)$ is as follows.

1. Randomly choose an initial point $x_0$.

2. Put $x_{\mathrm{best}} = x_0$.

3. While $i < k$, perform the following steps:
• randomly choose a point $x$ among the neighbors of the point $x_i$;
• if $F(x) > F(x_{\mathrm{best}})$, then $x_{\mathrm{best}} = x$;
• if $F(x) > F(x_i)$, then $x_{i+1} = x$;
• if $\mathrm{rnd} < e^{(F(x) - F(x_i))/t_i}$, then $x_{i+1} = x$.

4. Return $x_{\mathrm{best}}$.

Here, rnd is a random number between 0 and 1, and the parameter $t_i$ denotes an element of some decreasing sequence. These values are called annealing temperatures. On the whole, this method is similar to the gradient descent method, but the use of a probabilistic law prevents the algorithm from “sticking” at points of local maxima. This property helps one to obtain better results.
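
A compact Scala version of this pseudocode, specialized to maximizing a function of a weight vector, is sketched below. The neighborhood (perturbing one coordinate), the step size, and the geometric cooling schedule are illustrative choices that the article does not prescribe.

import scala.util.Random

object Annealing {
  // Simulated annealing maximization over a weight vector, following the pseudocode above.
  def maximize(f: Vector[Double] => Double,
               x0: Vector[Double],
               steps: Int = 1000,
               t0: Double = 1.0,
               cooling: Double = 0.995,
               rng: Random = new Random()): Vector[Double] = {
    var current = x0
    var best = x0
    var t = t0
    for (_ <- 0 until steps) {
      // Neighbor: perturb one randomly chosen coordinate.
      val i = rng.nextInt(current.length)
      val candidate = current.updated(i, current(i) + (rng.nextDouble() - 0.5) * 0.2)

      if (f(candidate) > f(best)) best = candidate
      if (f(candidate) > f(current)) current = candidate
      else if (rng.nextDouble() < math.exp((f(candidate) - f(current)) / t)) current = candidate

      t *= cooling // decreasing sequence of annealing temperatures
    }
    best
  }
}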


The Spearman rank correlation coefficient is used as the function to be maximized. The solution search space is the space of vectors whose dimension is equal to the number of features used in the algorithm, i.e., each coordinate of such a vector corresponds to the weight of some feature. To estimate the weights, a small training base was created that consists of pairs of words belonging to the main classes of semantic closeness-relatedness relationships such as very close concepts, absolutely independent concepts, words with many meanings, etc. The optimizing procedure was executed several times, and the weights were chosen that provide the maximal correlation with the training base.
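
The objective maximized by the annealing procedure is thus the Spearman coefficient between the algorithm's scores and the expert scores over the training pairs. A minimal sketch of its computation (using average ranks for ties) could look as follows.

object Spearman {
  // Average ranks (1-based); tied values share the mean of their positions.
  private def ranks(xs: Seq[Double]): Seq[Double] = {
    val sorted = xs.zipWithIndex.sortBy(_._1)
    val rank = new Array[Double](xs.length)
    var i = 0
    while (i < sorted.length) {
      var j = i
      while (j + 1 < sorted.length && sorted(j + 1)._1 == sorted(i)._1) j += 1
      val avg = (i + j + 2) / 2.0 // mean of the 1-based positions i+1 .. j+1
      for (k <- i to j) rank(sorted(k)._2) = avg
      i = j + 1
    }
    rank.toSeq
  }

  // Spearman rank correlation = Pearson correlation of the rank sequences.
  def correlation(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.length == ys.length && xs.nonEmpty)
    val (rx, ry) = (ranks(xs), ranks(ys))
    val (mx, my) = (rx.sum / rx.length, ry.sum / ry.length)
    val cov = rx.zip(ry).map { case (a, b) => (a - mx) * (b - my) }.sum
    val sx = math.sqrt(rx.map(a => (a - mx) * (a - mx)).sum)
    val sy = math.sqrt(ry.map(b => (b - my) * (b - my)).sum)
    cov / (sx * sy)
  }
}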

SOFTWARE IMPLEMENTATION

A software implementation of the proposed method has been developed. The program is written in the Scala programming language [13, 14], which is a modern, well-designed language convenient for creating text-processing software. The current Scala implementation compiles source code into bytecode for the Java Virtual Machine (JVM). This property allows one to execute the program on all operating systems supported by the JVM (for example, Windows, GNU/Linux, and MacOS X). As source data, a local copy of Wikipedia downloaded from the web site of the project is used. The total archive size is very large (more than 5.5 GB) and, hence, to realize an efficient fast search for entries, preprocessing was performed. We note that block archiving is used for the creation of the archive. This allows one to partition this large archive into a set of small (about 1 MB) archives and to create a search index for them. A single XML file (about 25 GB) that contains all Wikipedia entries is located inside the archive. For retrieving entries from this file, a parser was developed that takes into account the block structure of the archive and can process large amounts of data. In general, the mentioned preprocessing can be described as follows.

1. For each entry from the local Wikipedia copy:
• retrieve its name and text;
• eliminate parts of the text that are insignificant for the algorithm, for example, links to external resources, comments, and image descriptions;
• store the name and the processed entry text as a text file;
• add the pair <entry name; name of the text file in which the contents are stored> to the database.

2. After processing all Wikipedia entries, create a database index for the field “entry name.”
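
The per-entry part of this preprocessing can be sketched in Scala as follows. The cleanup regexes are rough illustrations rather than a full wiki-markup parser, and the in-memory map stands in for the <entry name; file name> table that the article stores in a database.

import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets
import scala.collection.mutable

object Preprocess {
  // Illustrative cleanup: drop HTML comments, image descriptions, and external links.
  def cleanText(wikiText: String): String =
    wikiText
      .replaceAll("(?s)<!--.*?-->", "")                       // HTML comments
      .replaceAll("\\[\\[(?:File|Image):[^\\]]*\\]\\]", "")   // image descriptions
      .replaceAll("\\[https?://\\S+[^\\]]*\\]", "")           // links to external resources
      .trim

  // Stand-in for the "entry name -> text file" table kept in the database.
  val index = mutable.Map.empty[String, String]

  def storeEntry(name: String, wikiText: String, dir: String, fileId: Long): Unit = {
    val fileName = s"entry-$fileId.txt"
    val path = Paths.get(dir, fileName)
    Files.write(path, cleanText(wikiText).getBytes(StandardCharsets.UTF_8))
    index(name) = fileName
  }
}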

Thus, entries are stored in conventional text files. As the database, the modern nonrelational document-oriented database MongoDB is used, which, according to the results of a set of tests, is considered one of the most performant. Also of importance is the possibility of searching the database with regular expressions, which is actively used in word sense disambiguation. The size of the final database equals 1.5 GB. On the whole, this approach to data storage has allowed us to achieve a very high speed of searching for and retrieving entries.

To optimize the weight parameters (to implement the simulated annealing method), a separate application was developed. The interaction of the optimizer with the program is realized with the help of configuration files. The optimizer returns an answer in the form of a vector of real numbers, i.e., the weight parameters of the algorithm that provide the highest correlation with the training set.

The program for computing semantic distance is provided with console and graphical interfaces. The graphical interface allows one to enter pairs of words for estimating semantic closeness in interactive mode. This interface is more user-friendly and, in addition to the estimate itself, allows for browsing a large amount of additional information such as entry texts, lists of candidate entries, weights of words, etc. The console interface can be more easily called from other programs and is controlled with the help of command-line parameters. We also plan to develop a separate downloadable library for better integration with third-party applications.

For testing algorithms for the computation of semantic closeness-relatedness, the collection of weighted pairs of words Finkelstein WordSimilarity-353 [15] is frequently used. It contains 353 pairs of words scored by experts. Each pair is scored by a real number between 0 and 10. As an estimate of the operation of the proposed algorithm, the Spearman rank correlation coefficient was used. The correlation coefficients between the values computed by the proposed algorithm and the estimates from Finkelstein WordSimilarity-353 are equal to 0.63, 0.68, and 0.74, respectively, for the following three modes:
• without word sense disambiguation;
• with partial word sense disambiguation (candidates are entries whose names are of the form <word> (<refinement>));
• with complete word sense disambiguation (candidates are obtained from the entry lists of ambiguities; as a rule, the names of these entries are of the form <word> (disambiguation)).


These values demonstrate a substantial improvement of results owing to the use of word sense disambiguation. For comparison with some other methods, the diagram presented in Fig. 1 was constructed, which reflects the results of measurements for various algorithms for computing semantic closeness. The diagram contains estimates obtained by the following methods:
• the RND method returning a random value for a pair of words;
• methods based on the search for a path in a graph, namely, the shortest path (PATH) method, the Leacock–Chodorov (LCH) method, the Wu–Palmer (WUP) method, and the Resnik (RES) method [8, 16];
• the WLM method [9];
• the ESA method [4, 9];
• the EWO method.

The software implementation of the EWO method shows high efficiency, namely, it estimates 20–100 pairs of words per second. Some results obtained by the program for computing estimates of the semantic closeness-relatedness of words from a test sample are presented in Table 1.

CONCLUSIONS

In this article, a new efficient method for the computation of the semantic closeness-relatedness between natural language words is described. The presented algorithm is a modification of the well-known Lesk approach. It is based on the positional structurization of the texts of glossary entries, which provides the assignment of a priority weight to each significant term depending on its position in the entry text and makes it possible to compute lexical intersections of different levels with different priority weights. In this case, nuances of the lexical structure of the entries containing definitions of concepts are taken into account rather than a simple intersection of the words of two texts. The Internet encyclopedia Wikipedia is used as the source data for computations. The simulated annealing method is used for the determination of the weight parameters.

The described method has demonstrated a high degree of correlation with test data. Thus, the proposed algorithm demonstrates results at the level of the best modern methods and, at the same time, is transparent and intuitive. A software implementation of the method has been developed whose high operating speed allows one to use it in solving various problems of computer linguistics.

There are several ways of improving the estimation quality, namely,
• addition of new factors to the weight model;
• integration with other techniques for calculating semantic closeness with a view to constructing a composite estimate.

The efficiency can be increased, for example, by developing a parallel version of the program. This will allow using modern multiprocessor and multicore computing systems.


TABLE 1. Estimates of the Semantic Closeness-Relatedness of Word Pairs from the Test Sample

Word 1      Word 2        Expert   Algorithm
car         automobile     8.94     9.99
magician    wizard         9.02     6.93
glass       magician       2.08     1.1
money       currency       9.04     5.67
noon        string         0.54     0.82
FBI         fingerprint    6.94     4.05
tiger       cat            7.35     4.13
tiger       tiger         10       10
book        paper          7.46     4.44
computer    keyboard       7.62     4.38
computer    internet       7.58     4.04
physics     chemistry      7.35     4.28
drink       ear            1.31     1.13

Fig. 1. Correlation with the expert estimates for the compared methods (vertical axis: correlation).

The program for computing the semantic closeness-relatedness between natural language words has been developed within the framework of a family of multipurpose applied systems for the semantic analysis and semantic processing of text documents.

REFERENCES

1. M. Lesk, “Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone,” in: Proc. of the 5th Annu. Intern. Conf. on Syst. Document. SIGDOC’86, ACM, New York (1986), pp. 24–26.
2. S. Wubben, “Using free link structure to calculate semantic relatedness,” ILK Research Group Technical Report Series No. 08-01, Tilburg Univ., Tilburg (2008).
3. S. P. Ponzetto and M. Strube, “Knowledge derived from Wikipedia for computing semantic relatedness,” J. Artif. Intell. Res., No. 30, 181–212 (2007).
4. E. Gabrilovich and S. Markovitch, “Computing semantic relatedness using Wikipedia-based explicit semantic analysis,” in: Proc. 20th Intern. Joint Conf. on Artif. Intell. (Hyderabad, 2007), Morgan Kaufmann, San Francisco (2007), pp. 1606–1611.
5. P. Resnik, “Using information content to evaluate semantic similarity in a taxonomy,” in: Proc. Intern. Joint Conf. on Artif. Intell. (Montreal, 1995), Morgan Kaufmann, San Francisco (1995), pp. 448–453.
6. C. Leacock, M. Chodorow, and G. A. Miller, “Using corpus statistics and WordNet relations for sense identification,” Comput. Ling., 24, No. 1, 147–165 (1998).
7. Z. Wu and M. Palmer, “Verb semantics and lexical selection,” in: Proc. 32nd Annu. Meet. of the Assoc. for Comput. Ling. (Las Cruces, 1994), Morgan Kaufmann, San Francisco (1994), pp. 133–138.
8. M. Strube and S. P. Ponzetto, “WikiRelate! Computing semantic relatedness using Wikipedia,” in: Proc. 21st Nat. Conf. on Artif. Intell., AAAI, Boston, MA (2006), pp. 1419–1424.
9. D. Milne and I. H. Witten, “An effective, low-cost measure of semantic relatedness obtained from Wikipedia links,” in: Proc. 1st AAAI Workshop on Wikipedia and Artif. Intell. (CIKM’2008) (Chicago, 2008), AAAI Press, Menlo Park, CA (2008).
10. E. Yeh, D. Ramage, C. D. Manning, et al., “WikiWalk: Random walks on Wikipedia for semantic relatedness,” in: ACL-IJCNLP TextGraphs-4 Workshop, Singapore (2009).
11. S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by simulated annealing,” Science, New Series, No. 220, 671–680 (1983).
12. S. Luke, Essentials of Metaheuristics (2009), http://cs.gmu.edu/~sean/book/metaheuristics/.
13. M. Odersky, Scala by Example, Progr. Meth. Lab., EPFL, Lausanne (2009).
14. M. Odersky, L. Spoon, and B. Venners, Programming in Scala, Artima Press, Mountain View (2008).
15. L. Finkelstein, E. Gabrilovich, Y. Matias, et al., “Placing search in context: The concept revisited,” ACM Trans. Inform. Systems, 20, No. 1, 116–131 (2002).
16. T. Pedersen, S. Patwardhan, and J. Michelizzi, “WordNet::Similarity - Measuring the relatedness of concepts,” in: Proc. 19th Nat. Conf. on Artif. Intell. (San Jose, 2004), Springer, Berlin (2004), pp. 1024–1025.
