
Int. J. Man-Machine Studies (1976) 8, 711-716

On fuzziness in information retrieval

C. V. NEGOITA

Department of Cybernetics, ASE, Bucharest, Romania

AND

P. FLONDOR

Institute of Mathematics, Bucharest, Romania

(Received 30 May 1976 and in revised form 23 August 1976)

The IR systems are faced with a need to manage fuzziness and not merely to react to fuzziness. The indexing process is viewed as representing the set X of information items by fuzzy subsets $\mathcal{F}(Y)$ of the descriptor set Y. A fuzzy assignment is modelled as a system $(X, Y, f : X \to \mathcal{F}(Y))$, where $f(x)(y)$ is the link between the information item x and the descriptor y. The grade of importance of a subset of descriptors is expressed by a fuzzy measure $\mu : \mathcal{P}(Y) \to [0,1]$. In this way syntax estimation is made possible by using a fuzzy integral $\int f(x)(y) \circ \mu$. The information items are ranked according to a new application $X \to [0,1]$ induced by the global measure $\mu$. Finally, a descriptor is viewed as a fuzzy set.

1. Introduction

The objective of this paper is to introduce a set of models which are, or promise to be, of general value in information retrieval theory. The study of fuzziness is essential to a proper understanding of man-to-machine and machine-to-man communication. The IR systems are faced with a need to manage fuzziness and not merely to react to fuzziness (Negoita & Ralescu, 1975).

In this paper we have carried out an investigation into the possibility that the indexing process is so presented by a descriptor assignment model as to make this a useful basis for a proper understanding of the retrieval process. The retrieval process is considered as a matching process detecting the resemblance or the similarity between requests and actual information objects. Therefore, in the context we shall consider, information retrieval has two major facets: language for description and search strategy for retrieval.

In indexing, we are replacing the content with a few simple descriptors, and these descriptors will serve as the basis for making decisions on what information objects we want to retrieve. The skill with which these descriptors are used plays a great role in determining retrieval effectiveness. Consequently, we are concerned with syntax, i.e. a set of rules for combining words with meanings not expressible in the basic vocabulary. We call these combinations "coalitions", extending the descriptive capability of a vocabulary, and use a fuzzy measure to describe the importance of such coalitions.

For some given descriptions of the information objects modelled as subjective assignments we consider another subjectivity which belongs to the user side and is expressed as a measure of the context.


2. Indexing

The principal objective of this section will be the formal statement of what we shall mean by the "descriptor assignment problem".

Since the search criteria are based on the content of an information object, it becomes necessary to use content identifiers, such as a set of descriptors attached to each object, normally chosen from a controlled list of allowable terms. The preparation of these sets is normally referred to as indexing.

Many different approaches to the problem of providing models for indexing have been explored by computer technologists, and a few have been tested by actual IR systems. It would be tedious and for our purposes unrewarding to attempt to describe even the most successful of these individually. Let us instead discuss the general approach to the problem, which can be recognized in practically every system thus far reported.

Let us consider a set X of objects, a set Y of descriptors and a map $f : X \to \mathcal{P}(Y)$, where $f(x)$ is the subset of those descriptors assigned to the object x. We can consider also $g : Y \to \mathcal{P}(X)$, where $g(y)$ is the subset of those objects having the descriptor y. It is obvious that f and g are, in fact, applications obtained from a relation defined on the Cartesian product $X \times Y$, namely, the relation "the object x has the descriptor y". Generally, the sets X and Y are considered to be finite. In this case, the relationship between objects and descriptors may conveniently be represented in the form of a matrix. For instance, the columns can represent a vocabulary of descriptors and the rows can represent objects indexed into a retrieval system. To enter an object into the system we assign it to a class, by tagging it with appropriate values. This assignment is indicated in each cell of the matrix. Thus, the indexing process is the representation of objects by subsets of Y.
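As an illustration only, the matrix form of a deterministic assignment can be sketched in Python. The object and descriptor names are hypothetical and are not taken from the paper; the maps f and g are read off the rows and columns of the matrix.

```python
# Hypothetical toy data: three objects (rows) and four descriptors (columns).
objects = ["doc1", "doc2", "doc3"]
descriptors = ["fuzzy", "retrieval", "indexing", "measure"]

# F[i][j] == 1 means "object i has descriptor j".
F = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]


def f(x):
    """Subset of descriptors assigned to object x (one row of the matrix)."""
    i = objects.index(x)
    return {y for j, y in enumerate(descriptors) if F[i][j] == 1}


def g(y):
    """Subset of objects having descriptor y (one column of the matrix)."""
    j = descriptors.index(y)
    return {x for i, x in enumerate(objects) if F[i][j] == 1}
```

Both maps carry the same information as the matrix, which is why either of them can serve as the indexing record.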

So far we have considered in our model sets and relations. We are ready now to note that an assignment is an arrow in the category Rel, whose objects are sets and whose morphisms are relations between sets. The model of a deterministic assignment can be, therefore, studied in the framework of this categorical embedment. If we denote by $\mathrm{id}_X$ the identity morphism $X \to X : x \mapsto x$, we have an abstraction of the identity relation.

Identity is always understood as a binary relation in a certain set of objects. The content of this relation depends on the situation in which we are considering these objects, or on the observer, who passes judgement on the identity of objects from his chosen point of view. Identity turns out to be synonymous with interchangeability.

If the identity of objects signifies their complete interchangeability, then their resemblance means their only partial interchangeability.

Such a relation is given, for instance, by the coincidence of certain descriptors. We say "$x_1, x_2$ are partially indistinguishable if they have at least one common descriptor", and write $x_1 R x_2 \iff f(x_1) \cap f(x_2) \neq \emptyset$. If $f(x) \neq \emptyset$ for each $x \in X$, the relation R is called a tolerance, that is, a relation which is reflexive and symmetric. Indeed, every object resembles itself, and two objects resemble each other independently of the order in which we consider them. Transitivity is by no means obligatory. The deterministic model, however, is not well suited for dealing with real systems because it fails to come to grips with the reality of the fuzziness of descriptor assignment. Thus, we shall consider this fact in greater detail.
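The three properties of the resemblance relation can be checked mechanically on a small example. The sketch below uses a hypothetical assignment, chosen so that the relation is reflexive and symmetric but not transitive.

```python
from itertools import product

# Hypothetical descriptor assignment; every object has a non-empty set.
f = {
    "a": {"fuzzy", "retrieval"},
    "b": {"retrieval", "indexing"},
    "c": {"indexing", "measure"},
}


def resembles(x1, x2):
    """x1 R x2 holds when the two objects share at least one descriptor."""
    return bool(f[x1] & f[x2])


X = list(f)
reflexive = all(resembles(x, x) for x in X)
symmetric = all(resembles(x, y) == resembles(y, x) for x, y in product(X, X))
transitive = all(
    resembles(x, z)
    for x, y, z in product(X, X, X)
    if resembles(x, y) and resembles(y, z)
)
```

Here a resembles b and b resembles c, yet a and c share no descriptor, so the relation is a tolerance without being an equivalence.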

We begin with a fundamental assumption concerning the existence of decision situations, the essential nature of which may conceivably be captured within the framework of a new model.

Let us consider again a set X of objects, a set Y of descriptors and a map. This time the map will be $f : X \to \mathcal{F}(Y)$ or $g : Y \to \mathcal{F}(X)$, where $\mathcal{F}(X)$ is the set of all fuzzy subsets of X, $f(x)(y)$ is the grade of membership of the descriptor y to the object x and $g(y)(x)$ is the grade of membership of the object x to the descriptor y. It is evident that f and g are, in fact, applications obtained from a fuzzy relation defined on the Cartesian product $X \times Y$. Generally, the sets X and Y are considered to be finite. Once again, the relationship between objects and descriptors may conveniently be represented in the form of a matrix. To enter an object into the system we assign it to fuzzy sets, by tagging it with membership degrees. In this context, to index is to represent objects by fuzzy subsets of the set Y (Negoita, 1973).

We have considered in our new model, sets and fuzzy relations. It is precisely for this reason that a fuzzy assignment can be viewed as an arrow in the category FRel, whose objects are sets and whose morphisms are fuzzy relations between sets.

The strength of the model for fuzzy assignment lies in the fact that we can largely exploit the resemblance relation. From the beginning we shall stress that the condition of reflexivity can be replaced by a weaker one. A fuzzy resemblance relation may be defined as a mapping $t : X \times X \to [0,1]$ such that $t(x,x) > 0$ and $t(x,y) = t(y,x)$.

In order to illustrate this idea in a concrete setting, it may be well to consider a series of examples in which a fuzzy resemblance is given in various ways.

For instance, bearing in mind the idea that two objects must have at least one common descriptor in order to resemble each other, we can introduce the relation

$$t(x_1, x_2) = \bigvee_{y \in Y} [f(x_1)(y) \wedge f(x_2)(y)].$$
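A minimal sketch of this sup-min relation on a finite descriptor set follows. The membership grades are hypothetical, and each object is assumed to carry a grade (possibly zero) for every descriptor.

```python
# Hypothetical fuzzy assignment, stored as membership grades per descriptor.
f = {
    "x1": {"fuzzy": 0.9, "retrieval": 0.4, "indexing": 0.0},
    "x2": {"fuzzy": 0.3, "retrieval": 0.8, "indexing": 0.5},
    "x3": {"fuzzy": 0.0, "retrieval": 0.0, "indexing": 0.7},
}


def t(a, b):
    """Largest grade to which some descriptor belongs to both a and b."""
    return max(min(f[a][y], f[b][y]) for y in f[a])
```

With these grades, `t("x1", "x2")` is 0.4, `t("x2", "x3")` is 0.5 and `t("x1", "x3")` is 0.0. The value `t("x1", "x1")` is 0.9 rather than 1, which shows why reflexivity is weakened to the condition t(x,x) > 0.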

Let us now examine the case where the relation is: two objects have as many common descriptors as possible. Even though we consider the deterministic model $(X, Y, f : X \to \mathcal{P}(Y))$, if we denote by $\mathrm{card} : \mathcal{P}(Y) \to \mathbb{N}$ the map assigning to every part of Y the number of its elements and by $b : \mathbb{R}_+ \to [0,1]$ the fuzzy set of big numbers, then

$$h(x_1, x_2) = b\{\mathrm{card}[f(x_1) \cap f(x_2)]\}$$

is a fuzzy relation on X. In the same way, if $(X, Y, f : X \to \mathcal{F}(Y))$ is a fuzzy assignment model and $b' : \mathcal{F}(Y) \to [0,1]$ a fuzzy subset, then $h(x_1, x_2) = b'[f(x_1) \cap f(x_2)]$ is a fuzzy relation.

Our overall position at this moment may be briefly recapitulated as follows. A deterministic descriptor assignment is a mapping

$$F : X \times Y \to \{0,1\}$$

with the property that $\forall x \in X$, $\exists y \in Y$ with $F(x,y) = 1$, meaning that all the objects can be characterized by Y. F is a relation and $F(x,y) = 1$ means "x has a descriptor y". A descriptor assignment induces a mapping

$$f : X \to \mathcal{P}(Y)$$

defined as $f(x) = \{ y \in Y \mid F(x,y) = 1 \}$. Clearly, $\emptyset \neq f(x)$. We denote by $(F, X, Y)$ the model of descriptor assignment.

The model (F,X,Y) induces a set of relations:

(a) "two objects $a, b \in X$ have at least one common descriptor", defined by $a R b$ if $f(a) \cap f(b) \neq \emptyset$ (clearly, this relation is a tolerance);

(b) $X \times X \to \mathbb{N}$ defined as $(a,b) \mapsto \mathrm{card}[f(a) \cap f(b)]$, the number of common descriptors;

(c) "as many common descriptors as possible", defined as $R(a,b) = H[\mathrm{card}(f(a) \cap f(b))]$.
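Relations (b) and (c) can be sketched as follows. The paper does not specify the membership function of the fuzzy set of big numbers, so the linear ramp `big` and its `saturation` parameter are assumptions made purely for illustration, as is the sample assignment.

```python
# Hypothetical crisp descriptor assignment.
f = {
    "a": {"fuzzy", "retrieval", "indexing", "measure"},
    "b": {"fuzzy", "retrieval", "indexing"},
    "c": {"measure"},
}


def common(a, b):
    """Relation (b): number of descriptors shared by a and b."""
    return len(f[a] & f[b])


def big(n, saturation=4):
    """Assumed grade of n in the fuzzy set of 'big numbers'.

    A linear ramp that reaches 1 at `saturation`; any non-decreasing
    map into [0, 1] could be substituted.
    """
    return min(n / saturation, 1.0)


def R(a, b):
    """Relation (c): degree to which a and b share many descriptors."""
    return big(common(a, b))
```

Under this assumed ramp, objects a and b (three common descriptors) resemble each other to degree 0.75, while a and c (one common descriptor) do so only to degree 0.25.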

A fuzzy descriptor assignment is a mapping

$$\tilde{F} : X \times Y \to [0,1]$$

with the property that $\forall x \in X$, $\exists y \in Y$ with $\tilde{F}(x,y) > 0$. Therefore, $\tilde{F}$ is a fuzzy relation. A fuzzy descriptor assignment induces $\tilde{f} : X \to \mathcal{F}(Y)$, and $\tilde{F}(x,y)$ means the membership degree of descriptor y to object x.

The mapping $\tilde{f} : X \to \mathcal{F}(Y)$ is defined as $\tilde{f}(x)(y) = \tilde{F}(x,y)$. Clearly, $\emptyset \neq \tilde{f}(x)$. We denote by $(\tilde{F}, X, Y)$ a model of fuzzy descriptor assignment. Such a model induces a relation "two objects have at least one common descriptor" given by

$$\rho(a,b) = \bigvee_{y \in Y} [\tilde{f}(a)(y) \wedge \tilde{f}(b)(y)]$$

with properties

$$\rho(a,a) > 0, \qquad \rho(a,b) = \rho(b,a).$$

We shall consider now, a cluster as a number of similar objects. This means that members within each cluster are sufficiently alike to justify ignoring the individual differences between them.

Let us consider a fuzzy measure $g : \mathcal{P}(Y) \to [0,1]$ defined on Y. Then, a classification

$$\tilde{\rho} : X \to [0,1]$$

is a global evaluation procedure defined as an assignment

$$a \mapsto \int_Y \tilde{f}(a) \circ g,$$

where $\int_Y$ is the fuzzy integral as defined by Sugeno (1974).
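For a finite descriptor set the fuzzy integral can be evaluated directly. The sketch below uses the usual sup-min form of the Sugeno integral; the measure grades and the membership grades are hypothetical values chosen for illustration, and the function names are ours rather than the paper's.

```python
def sugeno_integral(h, measure):
    """Sugeno integral of h (descriptor -> grade) w.r.t. a fuzzy measure.

    `measure` takes a frozenset of descriptors and returns its grade.
    On a finite set the supremum over levels reduces to a maximum over
    the nested sets made of the i descriptors with the largest grades.
    """
    ranked = sorted(h, key=h.get, reverse=True)
    best = 0.0
    for i, y in enumerate(ranked, start=1):
        top = frozenset(ranked[:i])
        best = max(best, min(h[y], measure(top)))
    return best


# Hypothetical fuzzy measure on three descriptors (monotone under inclusion).
GRADES = {
    frozenset(): 0.0,
    frozenset({"fuzzy"}): 0.3,
    frozenset({"retrieval"}): 0.4,
    frozenset({"indexing"}): 0.2,
    frozenset({"fuzzy", "retrieval"}): 0.9,
    frozenset({"fuzzy", "indexing"}): 0.5,
    frozenset({"retrieval", "indexing"}): 0.6,
    frozenset({"fuzzy", "retrieval", "indexing"}): 1.0,
}


def g(subset):
    """Grade of importance of a set of descriptors."""
    return GRADES[frozenset(subset)]


# Hypothetical fuzzy descriptor assignment for two objects.
f = {
    "a": {"fuzzy": 0.8, "retrieval": 0.7, "indexing": 0.1},
    "b": {"fuzzy": 0.2, "retrieval": 0.9, "indexing": 0.6},
}

classification = {x: sugeno_integral(f[x], g) for x in f}
```

Object a is classified at 0.7 because the pair {fuzzy, retrieval}, which it satisfies to degree 0.7, is rated 0.9 by the measure; object b reaches 0.6 through the pair {retrieval, indexing}.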

3. Search strategies

Information is retrieved in response to a request. An IR system is therefore defined as a purposeful, and, then, subjective recovery. We shall now consider the problems of formulating a statement of information needs and the use of this statement to search for the desired information. This statement is commonly called a request. We define a request to be a word in a language. Sets of descriptors provide great flexibility of subject description. Because there is no need of structural relationship between descriptors, they can be added to or deleted from the vocabulary at will, making the language highly adaptive to subject matter changes.

A request is the basis for an evaluation by computing a relation (identity, resemblance). Using a resemblance relation, the IR system is then one which compares the request with the description of stored objects and ranks all the objects.

Let us consider the model $(F, X, Y)$. We shall define $X_1 = X \cup \{q\}$, where q is the request. In this way a new model $(F_1, X_1, Y)$ is constructed, with $F_1|_{X \times Y} = F$.

The problem now is to find an $a \in X$ as close to q as possible.

Let us consider the classification $\tilde{\rho} : X_1 \to [0,1]$. The problem is solved by choosing the a that maximizes

$$d[\tilde{\rho}(a), \tilde{\rho}(q)].$$
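The selection step can be sketched as below. The paper does not define d, so the comparison `closeness` (equal to 1 when two grades coincide) is an assumption, and the classification values are hypothetical numbers standing in for the result of the global evaluation.

```python
def closeness(u, v):
    """Assumed comparison d on [0, 1]; larger means more alike."""
    return 1.0 - abs(u - v)


def retrieve(classification, request="q"):
    """Rank stored objects by d applied to their grade and the request's."""
    target = classification[request]
    stored = [a for a in classification if a != request]
    return sorted(
        stored,
        key=lambda a: closeness(classification[a], target),
        reverse=True,
    )


# Hypothetical classification values on the extended set containing q.
grades = {"a": 0.7, "b": 0.6, "c": 0.2, "q": 0.62}
ranking = retrieve(grades)
```

The first element of `ranking` is the object that maximizes d; returning the whole ordering rather than a single object matches the ranking behaviour described earlier in this section.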

It should be observed at this point that ambiguity and consequent retrieval of irrelevant objects, can arise as a result of ignoring relational information.

Various descriptors modify one another, greatly enlarging the number of distinct statements that can be made. This is nicely illustrated by the example of the ambiguity between blind Venetians and Venetian blinds, which occurs through failing to note when one descriptor is used to qualify another. The human brain takes advantage of the context. The coalition is used to avoid ambiguity. Almost any common word can lose its straightforward character by finding itself in company that brings out some hidden weakness. The human brain, guided by experience, has developed a capacity to cope with problems of this type.

The first step in our model will be then to define the grade of importance of every coalition and to investigate the consequences of this line of thinking. A possible way to evaluate this importance is to use the fuzzy measure (Sugeno, 1974).

Let us consider a set X of objects, a set Y of descriptors, and a map $f : X \to \mathcal{F}(Y)$. This is our model of fuzzy assignment (indexing), where $f(x)(y)$ is the link between the object x and the property y. The next step is to consider a subjective measure $\mu : \mathcal{P}(Y) \to [0,1]$ expressing the grade of importance of a subset of descriptors. In this way we have introduced a tool for subjective evaluation of the context. The measure $\mu$ ranks the groups of descriptors according to syntax.
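A subjective measure of this kind can be tabulated coalition by coalition. The sketch below checks the standard conditions on a fuzzy measure over a finite set, which the text does not restate: the empty set has grade 0, the whole set has grade 1, and the grade never decreases when descriptors are added. The grades for the two-descriptor vocabulary are hypothetical.

```python
from itertools import combinations


def powerset(items):
    """All subsets of `items`, each as a frozenset."""
    items = list(items)
    return [
        frozenset(c)
        for r in range(len(items) + 1)
        for c in combinations(items, r)
    ]


def is_fuzzy_measure(mu, Y):
    """Check the boundary values and monotonicity under set inclusion."""
    subsets = powerset(Y)
    if mu[frozenset()] != 0 or mu[frozenset(Y)] != 1:
        return False
    return all(mu[a] <= mu[b] for a in subsets for b in subsets if a <= b)


# Hypothetical grades: the coalition matters far more than either word alone.
Y = {"venetian", "blind"}
mu = {
    frozenset(): 0.0,
    frozenset({"venetian"}): 0.2,
    frozenset({"blind"}): 0.3,
    frozenset(Y): 1.0,
}
```

The measure is not required to be additive: here the coalition {venetian, blind} is graded 1.0 although its members are graded 0.2 and 0.3 separately, which is how the importance of a coalition is expressed.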

Bearing in mind that $f(x) \in \mathcal{F}(Y)$, we are ready now to introduce the concept of syntax estimation as a functional with monotonicity defined by using fuzzy measures, i.e. the number

$$F_x = \int_Y f(x)(y) \circ \mu = \sup_{\alpha \in [0,1]} [\alpha \wedge \mu(Y \cap X_\alpha)], \quad \text{where } X_\alpha = \{ y \mid f(x)(y) \geq \alpha \}.$$

In this way we have obtained a new application

$$F : X \to [0,1], \qquad F(x) = F_x,$$

induced by a global criterion, the measure $\mu$.

Let us consider again a set X of information objects and a set Y of descriptors. We can make statements about an element $x \in X$ of the form: the result of performing a test $y \in Y$ on $x \in X$ is $v \in V$. The basic predicates, out of which the descriptions of information objects are built, consist of primitive statements of the form $y(x) = v$. The statement

$$y_1(x_i) = v_1 \wedge \dots \wedge y_n(x_i) = v_n$$

may be considered to describe the object $x_i$.

So far we have discussed two cases, $V = \{0,1\}$ and $V = [0,1]$. Now, we shall focus our attention on another case, when V is the set of fuzzy sets. For instance, the statement "colour(x) = red" is a primitive statement where "colour" is a descriptor and "red" is a value. Then, the vocabulary can be thought of as the set of values, and a key-word is a family of values.

We can consider the descriptors $y \in Y$ as fuzzy sets $f_j : S_j \to [0,1]$, where the key-word $S_j$ is a particular subset of descriptors, $S_j \subset Y$. For instance:

field = (mathematics, chemistry, physics, art),
language = (English, French, Russian, German),
year = (1900, 1901, ...),
pages = (100, 120, ...),

$f_3 : S_3 \to [0,1]$ means "new", $f_4 : S_4 \to [0,1]$ means "big".

Let $\mu$ be a fuzzy measure on $\{1,2,3,4\}$. A query is a vector $(f_1, f_2, f_3, f_4)$. The answer of the system is an $x \in X$ which maximizes

$$\int f(x) \circ \mu,$$

where f is a function resulting from $f_1, \dots, f_4$.

So far it has been supposed that the user has a detailed knowledge of the vocabulary. Unfortunately this is not the case. The process of request formulation is complex and depends on particular attributes of the requestor, such as: his knowledge of the content of the store, his familiarity with the topic matter being searched, his personal preferences for vocabulary and style, and so on.

This is the reason that a relevance feedback is used to modify the original request within a learning process. Relevance feedback means that the requestor asks a question, gets its response and retreats to ponder the output. He then has the opportunity to return and ask a new question.

4. Conclusions

The concept of the fuzzy set is found to be of both theoretical and practical significance in the information retrieval field. This was first noted by Negoita (1973) who developed a relevance theory. In this paper his results were complemented by modelling the indexing process as a fuzzy assignment in order to exploit the resemblance relations for search strategies. The fuzzy integral introduced by Sugeno (1974) was used for syntax estimation. Subjectivity which belongs to the user is expressed as a measure of the context. This is believed to be a versatile area for further research and is currently under investigation.

References

NEGOITA, C. V. (1973). On the notion of relevance in information retrieval. Kybernetes, 2, 161.

NEGOITA, C. V. & RALESCU, D. A. (1975). Applications of Fuzzy Sets to Systems Analysis. Basel: Birkhäuser Verlag.

SUGENO, M. (1974). Theory of fuzzy integrals and its applications. Ph.D. Thesis. Tokyo Institute of Technology.