26
Strong similarity measures for ordered sets of documents in information retrieval L. Egghe a,b, * , C. Michel c a LUC, Universitaire Campus, B-3590 Diepenbeek, Belgium b UIA, Universiteitsplein 1, B-2610 Antwerpen (Wilrijk), Belgium c CEM-GRESIC, MSHA, D.U. Bordeaux III, Esplanade des Antilles, F-33607, Pessac Cedex, France Received 6 December 2000; accepted 3 October 2001 Abstract A general method is presented to construct ordered similarity measures (OS-measures), i.e., similarity measures for ordered sets of documents (as, e.g., being the result of an IR-process), based on classical, well- known similarity measures for ordinary sets (measures such as Jaccard, Dice, Cosine or overlap measures). To this extent, we first present a review of these measures and their relationships. The method given here to construct OS-measures extends the one given by Michel in a previous paper so that it becomes applicable on any pair of ordered sets. Concrete expressions of this method, applied to the classical similarity measures, are given. Some of these measures are then tested in the IR-system Profil-Doc. The engine SPIRIT Ó extracts ranked document sets in three different contexts, each for 550 requests. The practical usability of the OS-measures is then discussed based on these experiments. Ó 2002 Elsevier Science Ltd. All rights reserved. 1. Introduction Similarity measures for sets of objects (such as documents) are well known and well used in the IR literature. They have become standard tools that are featuring in any good monograph on IR. Relatively recent monographs dealing with these measures are Boyce, Meadow, and Kraft (1995), Tague-Sutcliffe (1995), Grossman and Frieder (1998) and Losee (1998). Of course the ‘‘standard’’ books on IR, Salton and Mc Gill (1987) and Van Rijsbergen (1979) should also be mentioned here. * Corresponding author. E-mail addresses: [email protected] (L. Egghe), [email protected] (C. Michel). 0306-4573/02/$ - see front matter Ó 2002 Elsevier Science Ltd. All rights reserved. PII:S0306-4573(01)00051-6 Information Processing and Management 38 (2002) 823–848 www.elsevier.com/locate/infoproman

Strong similarity measures for ordered sets of documents in information retrieval

  • Upload
    l-egghe

  • View
    212

  • Download
    0

Embed Size (px)

Citation preview

Strong similarity measures for ordered sets of documentsin information retrieval

L. Egghe a,b,*, C. Michel c

a LUC, Universitaire Campus, B-3590 Diepenbeek, Belgiumb UIA, Universiteitsplein 1, B-2610 Antwerpen (Wilrijk), Belgium

c CEM-GRESIC, MSHA, D.U. Bordeaux III, Esplanade des Antilles, F-33607, Pessac Cedex, France

Received 6 December 2000; accepted 3 October 2001

Abstract

A general method is presented to construct ordered similarity measures (OS-measures), i.e., similaritymeasures for ordered sets of documents (as, e.g., being the result of an IR-process), based on classical, well-known similarity measures for ordinary sets (measures such as Jaccard, Dice, Cosine or overlap measures).To this extent, we first present a review of these measures and their relationships.

The method given here to construct OS-measures extends the one given by Michel in a previous paper sothat it becomes applicable on any pair of ordered sets. Concrete expressions of this method, applied to theclassical similarity measures, are given.

Some of these measures are then tested in the IR-system Profil-Doc. The engine SPIRIT� extracts rankeddocument sets in three different contexts, each for 550 requests. The practical usability of the OS-measuresis then discussed based on these experiments.� 2002 Elsevier Science Ltd. All rights reserved.

1. Introduction

Similarity measures for sets of objects (such as documents) are well known and well used in theIR literature. They have become standard tools that are featuring in any good monograph on IR.Relatively recent monographs dealing with these measures are Boyce, Meadow, and Kraft (1995),Tague-Sutcliffe (1995), Grossman and Frieder (1998) and Losee (1998). Of course the ‘‘standard’’books on IR, Salton and Mc Gill (1987) and Van Rijsbergen (1979) should also be mentionedhere.

* Corresponding author.

E-mail addresses: [email protected] (L. Egghe), [email protected] (C. Michel).

0306-4573/02/$ - see front matter � 2002 Elsevier Science Ltd. All rights reserved.

PII: S0306-4573(01)00051-6

Information Processing and Management 38 (2002) 823–848www.elsevier.com/locate/infoproman

In general one has a ‘‘universe’’ X from which subsets A;B; . . . are generated. In IR, X usually isa database of documents and the subsets A;B; . . . are the result of an IR-process that started afterpresenting a query to the system. For any A;B � X, a similarity measure D gives a positivenumber DðA;BÞ that expresses the ‘‘degree’’ of similarity between the two sets. Further on we willgo into the properties that good similarity measures should have but we can already understandthat DðA;BÞ should reach its minimal value (usually 0) if A \ B ¼ ; and should reach its maximalvalue if A ¼ B 6¼ ;. This maximal value usually is 1 if D is normalized. If this is the case, then1� D is a dissimilarity measure. In this paper we will restrict ourselves to the study of similaritymeasures.

Examples of important similarity measures are (A;B � X; A;B 6¼ ;; jAj denotes the cardinalityof A).

Jaccard’s index J:

JðA;BÞ ¼ jA \ BjjA [ Bj : ð1Þ

Recall R and Precision P:

RðA;BÞ ¼ jA \ BjjBj ; ð2Þ

P ðA;BÞ ¼ jA \ BjjAj : ð3Þ

Note that here the roles of A and B cannot be interchanged. In IR, A ¼ ret, the set of retrieveddocuments and B ¼ rel, the set of relevant documents. For this reason, R and P are non-sym-metric (RðA;BÞ 6¼ RðB;AÞ and PðA;BÞ 6¼ P ðB;AÞ). Note that J is symmetric. In most IR cases, Rand P are related: the R–P curves (cf. Van Rijsbergen, 1979; Salton & Mc Gill, 1987) are de-creasing. An ideal search, however, should have high values of R and P. There is a similaritymeasure that expresses this, namely the harmonic average of R and P:

E ¼ 21

R

��þ 1

P

�: ð4Þ

It is readily seen that

EðA;BÞ ¼ 2jA \ BjjAj þ jBj ; ð5Þ

which is also known as Dice’s index. Note that E is symmetric. Eqs. (4) and (5) can be generalized.(Generalized) Dice index: For a 2 0; 1½, we define

Ea ¼ 1aP

��þ 1� a

R

�; ð6Þ

which reduces to (4) for a ¼ 1=2. Now

EaðA;BÞ ¼jA \ Bj

ajAj þ ð1� aÞjBj : ð7Þ

824 L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848

Formula (6) appears in Salton and Mc Gill (1987), Van Rijsbergen (1979) and Tague-Sutcliffe(1995) but in the form 1� Ea. It is not logical to mix similarity and dissimilarity measures; so weuse Ea instead of 1� Ea. In Tague, one also finds the (apparently) to (4) related

11

R

��þ 1

P� 1

�: ð8Þ

However, as is readily seen from (2) and (3), (8) is nothing else than J, the Jaccard index.Cosine measure: If we take the geometric average of R and P, we obtain another symmetric

similarity measure:

Cos ¼ffiffiffiffiffiffiRPp

¼ jA \ BjffiffiffiffiffiffiffiffiffiffiffiffijAjjBj

p : ð9Þ

The name Cos comes from cosine and will be explained later, when applying (9) for vectors.In Boyce et al. (1995) one can find another measure that can be formulated (in our notation) as

follows:

jA \ BjffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffijAj2 þ jBj2

q : ð10Þ

They call it the cosine coefficient. In view of the above (9) and the explanation with vectors (tofollow) we cannot accept this name for (10). In addition, all measures encountered so far arenormalized (i.e., their maximal value, obtained if A ¼ B 6¼ ;, is 1). This is not the case in (10). Thenormalized form of (10) is

NðA;BÞ ¼ffiffiffi2p jA \ Bjffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

jAj2 þ jBj2q : ð11Þ

We will investigate this measure further on (it will be shown that N has good qualities). Note thatN is symmetric.Overlap measures O1 and O2: In Boyce et al. (1995) and Van Rijsbergen (1979) one finds the

following overlap measure:

O1ðA;BÞ ¼jA \ Bj

minðjAj; jBjÞ : ð12Þ

We did not find a reference for the next measure:

O2ðA;BÞ ¼jA \ Bj

maxðjAj; jBjÞ ð13Þ

but O2 will turn out to have better properties than O1. Furthermore, max is as good as min to beused; therefore O2 is worth to be studied.

It follows from elementary calculations that

J 6O2 6E6N 6Cos6O1 ð14Þ

L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848 825

and

O2 6R; P ;Ea 6O1: ð15ÞIn the following section we will study general properties of these measures, i.e., properties that

good similarity measures should have. This will lead to the definition of weak and strong simi-larity measures.

Section 3 presents a general method of constructing similarity measures for ordered sets (or-dered similarity measures or OS-measures) from similarity measures for ordinary sets as definedabove. It turns out that the method can be applied to most of the measures defined above: J ;E (infact all Ea), Cos, N and O2. The problem of constructing OS-measures for the other similaritymeasures ðR; P ;O1Þ is left as an open problem; the authors intend to study this in the near future.The obtained general method is then applied to concrete strong similarity measures and concreteformulae are given.

Section 4 is then devoted to experiments. Using the IR-system Profil-Doc and the engineSPIRIT�, we obtain ranked document sets for 550 requests in three different contexts, which arethen compared (using OS-measures) with a standard neutral ordered set. Graphs are constructedand qualitative conclusions on the used OS-measures are given.

1.1. General note on the interpretation of these similarity measures for the comparison of vectors

All of the above measures compare couples of sets. As such this is very interesting and ap-plicable in IR. However, there is an important part of IR that considers documents and queries asvectors; each coordinate (which we will restrict here to have values 1 or 0) expresses the fact that acertain keyword does or does not belong to the document or the query. Let us limit our attentionto documents. In this setting a document is then ‘‘replaced’’ by its vector ~dd ¼ ðd1; . . . ; dnÞ, whereeach di 2 f0; 1g; i ¼ 1; . . . ; n. The above measures can be used to measure the similarity betweentwo such vectors ~dd ¼ ðdiÞi¼1;...;n and ~dd 0 ¼ ðd 0iÞi¼1;...;n. This goes as follows: define

D ¼ fi 2 f1; . . . ; ngkdi ¼ 1g; ð16ÞD0 ¼ fi 2 f1; . . . ; ngkd 0i ¼ 1g: ð17Þ

Then

jD \ D0j ¼ h~dd;~dd 0i ¼Xni¼1

did 0i ; ð18Þ

the classical inproduct of ~dd and ~dd 0. Also

jDj ¼Xni¼1jdij2 ¼ k~ddk2; ð19Þ

jD0j ¼Xni¼1jd 0i j

2 ¼ k~dd 0k2; ð20Þ

the squares of the norms of ~dd and ~dd 0. Finally

jD [ D0j ¼ k~ddk2 þ k~dd 0k2 � h~dd;~dd 0i ð21Þ

826 L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848

as is readily seen. Since only (18)–(21) are needed in the formulae (1)–(3), (5), (7), (9) and (11)–(13), we are able to interpret all of the above measures as similarity measures for vectors. Oneexample yields (9):

jD \ D0jffiffiffiffiffiffiffiffiffiffiffiffiffijDkD0j

p ¼ h~dd;~dd 0ik~ddk � k~dd 0k

; ð22Þ

which is the well-known formula for cosðd~dd;~dd 0~dd;~dd 0 Þ, the cosine of the angle between the vectors ~dd and~dd 0. That is why in (9), we used the term Cos for the geometric mean of R and P. Note that thisinterpretation also shows that N (formula (11)) cannot be called a cosine function.

For the rest of this paper, however, we will limit ourselves to similarities for sets (and orderedsets).

2. General properties of similarity measures (on ordinary sets)

Let D be any real-valued function on 2X � 2X ð2X ¼ the set of all subsets of XÞ, where DðA;BÞis defined for all A;B � X; A;B 6¼ ;. For normalization reasons we require

ðD1Þ 06DðA;BÞ6 1: ð23ÞFor the maximal value, DðA;BÞ ¼ 1, we require the highest possible similarity between A and B,

i.e., A ¼ B 6¼ ; (and vice versa):

ðD2Þ DðA;BÞ ¼ 1 () A ¼ B 6¼ ;: ð24ÞAlthough this is true for most of the measures discussed in the previous section, some measures donot satisfy it because they have a different purpose, e.g., measuring overlap. If we look at R; P orO1 we see that the following property is valid:

ðD02Þ DðA;BÞ ¼ 1 () A � B or B � A: ð240 ÞðD02Þ is the weaker one since ðD2Þ ) ðD02Þ obviously but, since it is a typical ‘‘overlap prop-

erty’’ and since overlap is an important topic, we will keep both properties ðD2Þ and ðD02Þ in ourstudy.

For the minimal value DðA;BÞ ¼ 0, the situation is simpler:

ðD3Þ DðA;BÞ ¼ 0 () A \ B ¼ ;: ð25ÞAll of the measures defined in the previous section are of the form (w1;w2: functions)

DðA;BÞ ¼ w1ðjA \ BjÞw2ðjAj; jBj; jA [ BjÞ ð26Þ

and satisfy

ðD4Þ w1 is a strictly increasing function:

Optionally, one might also require symmetry:

ðD5Þ DðA;BÞ ¼ DðB;AÞ: ð27ÞAll measures of the previous section, except R and P, satisfy this.

L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848 827

In view of the above discussion we define:

Definition 2.1. If D is a real-valued function on 2X � 2X which satisfies (D1), (D2), (D3), (D4), wesay that D is a strong similarity measure.

Definition 2.2. If D is a real-valued function on 2X � 2X which satisfies ðD1Þ; ðD02Þ; ðD3Þ; ðD4Þ,we say that D is a weak similarity measure.

The measures J, E (in fact all Ea), Cos, N, O2 are strong similarity measures as is readily seen.The measures R, P, O1 are weak similarity measures. In this paper O1 is the only symmetric weaksimilarity measure that is not strong, showing that fðD1Þ; ðD02Þ; ðD3Þ; ðD4Þ; ðD5Þg does notimply ðD2Þ.

Note also that strong similarity measures need not be symmetric: the measures Ea ða 6¼ 1=2Þ arestrong similarity measures that are non-symmetric. Hence fðD1Þ; ðD2Þ; ðD3Þ; ðD4Þg does notimply ðD5Þ.

Of course, if D is any of the above measures and if f : ½0; 1 ! ½0; 1 is a strictly increasingfunction such that f ð0Þ ¼ 0 and f ð1Þ ¼ 1, then the measure f � D, defined as

ðf � DÞðA;BÞ ¼ f ðDðA;BÞÞ ð28Þhas the same properties as D. This simple remark will play a crucial role in the construction ofsimilarity measures for ordered sets: in several cases it will only be possible to construct an OS-measure from a similarity measure D as above when using functions f as above – see the followingsection.

All definitions above are given for measures D : ðA;BÞ ! DðA;BÞ for A;B 6¼ ;. If A or B is ; wewill define DðA;BÞ ¼ 0, for all measures D.

3. Ordered similarity measures (OS-measures)

3.1. Statement of the problem

An ordered set is (cf. Michel, 2000) an infinite chain C ¼ ðCiÞi2N [where Ci � X for all i 2 N andwhere Ci \ Cj ¼ ; 8i 6¼ j] such that the elements within one Ci are unordered but for elements indifferent Cis we have: d 2 Ci; d 0 2 Cj then d < d 0 () i < j. Such an ordered set can be depicted asin Fig. 1.

Possibly Ci ¼ ;. This will always be the case from a certain i on, if X is finite. This is always soin practise but, since jXj can be any high number (in other words, jXj is unbounded), the set-upwith infinite chains will prove to be very useful. Let us denote by C the set of all possible chains(ordered sets) as above.

Fig. 1. Visualization of an ordered set C ¼ ðCiÞi2N, where – symbolizes the order induced on the sets Ci via the order

< between the documents.

828 L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848

This is a very general set-up, comprising the most general cases in IR: it comprises the extremes

(i) The unordered case: C1 6¼ ;; Ci ¼ ; 8iP 2.(ii) The total linear case: all Cis are singletons (or are empty).

The unordered case refers to ‘‘classical’’ IR (e.g., boolean retrieval) and the total linear caserefers to a ranked list of documents (as, e.g., given by browsing machines in WWW). The generalsituation is needed, e.g., in the case where one gives, say, four keywords. C1 is then the unorderedset of documents that contains all four of these keywords. Every document in C1 is ordered beforeany document in C2 where C2 contains the documents that have exactly three of the four keywordsin their indexing. These documents, in turn, are ranked before all documents in C3, the unorderedset of documents that have exactly two of the four keywords in their indexing. Documents in C3

are ranked before all documents in C4, the unordered set of documents that have exactly one ofthe four keywords in their indexing. Finally, C4 is ranked before C5 containing the documents inX n

S4i¼1 Ci, i.e., those with none of the four keywords in their indexing (and Ci ¼ ; 8iP 6). One

could also stop with C4 (making Ci ¼ ; 8iP 5) since documents, not containing any of the re-quested key words, should not occur in a ranked output.

Ordered similarity (OS) measures compare two such chains in such a way that some ‘‘natural’’properties are satisfied. We will formulate them as was done in Michel (2000), but will call them(in view of the above discussions): strong ordered similarity measures. Their properties are: denoteby Q the function, acting on couples ðC;C0Þ; C ¼ ðCiÞi2N; C0 ¼ ðC0jÞj2N; C;C0 2 C. Q must satisfyðQ0Þ If C and C0 reduce to the unordered case (cf. (i) above), Q must be a good strong simi-larity measure for unordered sets (i.e., satisfy ðD1Þ; ðD2Þ; ðD3Þ; ðD4Þ).ðQ1Þ 06QðC;C0Þ6 1 for all C;C0 2 C.ðQ2Þ QðC;C0Þ ¼ 1() C ¼ C0 and no Ci or C0j is empty: i.e., 8i; 8j; Ci 6¼ ;; C0j 6¼ ;:ðQ3Þ QðC;C0Þ ¼ 0() C \ C0 ¼ ;. Here

C \ C0 ¼[1i¼1

Ci

!\

[1j¼1

C0j

!:

ðQ4Þ Let i; j 2 N; i 6¼ j. Let CðiÞ;C0ðjÞ be ordered sets in C such that Ck \ C0l ¼ ; 8k; l 2 N ex-cept for k ¼ i and l ¼ j. If we let i and j vary (but not the unordered sets Ci and C0j) thenQðCðiÞ;C0ðjÞÞ is strictly decreasing in j > i (i fixed) and in i > j (j fixed).ðQ5Þ The same as ðQ4Þ but now for i ¼ j: now QðCðiÞ;C0ðiÞÞ strictly decreases in i 2 N.The first four properties are clear from the unordered case; the last two properties express the

fact that sets Ci; C0j for i, j high have a smaller impact in the comparison of C and C0 than setsCi; C0j for i; j small.

Of course, a strong OS-measure may or may not be symmetric. If so, we have the propertyðQ6Þ QðC;C0Þ ¼ QðC0;CÞ 8C;C0 2 C.Michel (2000) was able to construct OS-measures as follows. Let D be any similarity measure

(for unordered sets). Then form

QðC;C0Þ ¼X1i¼1

X1j¼1

DðCi;C0jÞuði; jÞ; ð29Þ

L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848 829

where u is a certain weighting function. The problem is to make sure that (29) satisfies theproperties ðQiÞ ði ¼ 0; . . . ; 5Þ above. The main problem here is the fact that (29) can have scoresbeyond 1. To avoid this Michel (2000) has put a stringent condition on the type of ordered setsthat can be compared. For instance, in Michel (2000) one requires, given C ¼ ðCiÞ and C0 ¼ ðC0jÞ,that for every i, there exists a j such that Ci \ C0j 6¼ ;. This condition is, however, not always validin practice. We therefore improve Michel’s method so as to be usable for comparing any twoC;C0 2 C. Point is that a solution in the form of (29) is not always possible anymore. The OS-measure (29) is therefore modified so that we obtain a generally valid solution.

3.2. General theorem on the construction of strong OS-measures

Theorem 3.2.1. Let D be any strong similarity measure (on unordered sets). Define

QðC;C0Þ ¼X1i¼1

X1j¼1

f ðDðCi;C0jÞÞuði; jÞ; ð30Þ

where f is a strictly increasing function, f ð0Þ ¼ 0; f ð1Þ ¼ 1; 06 f 6 1, such thatX1j¼1

f ðDðCi;C0jÞÞ6 1 8i 2 N; ð31Þ

X1i¼1

f ðDðCi;C0jÞÞ6 1 8j 2 N ð32Þ

and where u satisfies: u > 0 and

(i) uði; jÞ strictly decreases in jP i (i fixed),(ii) uði; jÞ strictly decreases in iP j (j fixed),(iii) uði; iÞ strictly decreases in i,(iv)

P1i¼1 uði; iÞ ¼ 1,

(v) uði; jÞ6 minðuði; iÞ;uðj; jÞÞ 8i; j 2 N.

Then Q is a good strong OS-measure, i.e., satisfies ðQ0Þ; ðQ1Þ; ðQ2Þ; ðQ3Þ; ðQ4Þ; ðQ5Þ (and, if D issymmetric, then Q also satisfies ðQ6Þ if u is symmetric).

The proof is rather long and technical and is, therefore, given in Appendix A. This theorem willonly be important if functions u and f can be found such that (31), (32), (i)–(v) are satisfied. Theconstruction of f is non-trivial and dependent on the concrete D (i.e., J, E, Cos, . . .) and will bestudied in Section 3.3. For u, there is no problem: take, e.g.,

uði; jÞ ¼ 31

2i1

2j1

2ji�jj: ð33Þ

This function is decreasing for high ranks i and j as well as for high differences between theranks i and j, a natural property. Note that (33) equals the easier

uði; jÞ ¼ 3

22maxði;jÞ ð34Þ

as is readily seen. It is also readily seen that this u satisfies all five properties (i)–(v).

830 L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848

That u is a decreasing power of the ranks i and j is good and better than, e.g., a linear decrease.Indeed, there should be a larger difference in comparing C1 with C02 than, e.g., C101 with C0102 forC1 ¼ C101 and C02 ¼ C0102. This is also linked with the sensation law of Weber–Fechner stating thatthe sensation is proportional to the logarithm of the stimulus (see also Egghe, 1994; Egghe &Rousseau, 1990).

We will now proceed by the construction of concrete strong OS-measures, given one of thestrong similarity measures (for unordered sets) J, E, general Ea; Cos; N ; O2. For the moment it isnot clear at all that a proper function f, satisfying the properties as described in Theorem 3.2.1,can be constructed, for each of these strong similarity measures. Of course, any measure thatyields a good f will yield a good Q, i.e., a usable strong OS-measure.

3.3. Strong OS-measures derived from strong similarity measures for ordinary sets

3.3.1. JaccardIn view of Theorem 3.2.1 we must find a suitable f which can be used for J. Let C;C0 2

C; C ¼ ðCiÞi2N; C0 ¼ ðC0jÞj2N. In view of (31) and (32) we estimate 8i; j 2 N:

JðCi;C0jÞ ¼jCi \ C0jjjCi [ C0jj

6

jCi \ C0jjjCij

: ð35Þ

If we fix i and let j vary, we can write

jCi \ C0jj ¼ ai;jjCij; ð36Þwhere 06 ai;j 6 1 andX1

j¼1ai;j 6 1; ð37Þ

since the ai;js denote fractions of the C0js in Ci (fixed). Eqs. (36) in (35) yields

JðCi;C0jÞ6 ai;j ð38Þand hence (37) gives (31) for f the identical function: f ðxÞ ¼ x. In the same way we have

JðCi;C0jÞ ¼jCi \ C0jjjCi [ C0jj

6

jCi \ C0jjjC0jj

¼ a0i;j; ð39Þ

where 06 a0i;j 6 1 andX1i¼1

a0i;j 6 1; ð40Þ

since the a0i;js denote fractions of the Cis in C0j (fixed). Hence also (32) is proved for f the identicalfunction. Of course f ðxÞ ¼ x also satisfies f ð0Þ ¼ 0; f ð1Þ ¼ 1; 06 f 6 1 and f strictly increasing.

Conclusion:

QJðC;C0Þ ¼X1i¼1

X1j¼1

JðCi;C0jÞuði; jÞ ð41Þ

(with u, e.g., as in (33), (34)) is a good strong OS-measure.

L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848 831

3.3.2. DiceNow, for C;C0 2 C, we investigate

EðCi;C0jÞ ¼2jCi \ C0jjjCij þ jC0jj

: ð42Þ

With the same definition of ai;j as above we have:

2jCi \ C0jjjCij þ jC0jj

6

2jCi \ C0jjjCij þ jCi \ C0jj

¼ 2ai;jjCijjCij þ ai;jjCij

¼ 2ai;j

1þ ai;j: ð43Þ

We are now in search of a strictly increasing function f, 06 f 6 1; f ð0Þ ¼ 0; f ð1Þ ¼ 1 such that(43) transforms into ai;j, the numbers for which (37) is true. We hence have the equation

f2y

1þ y

� �¼ y ¼ f ðxÞ

yielding

y ¼ f ðxÞ ¼ x2� x

; ð44Þ

which indeed satisfies all the requirements. Hence with this f and since it increases we have, by (37)and (43), that (31) is valid for D ¼ E. In the same way we obtain

EðCi;C0jÞ ¼2jCi \ C0jjjCij þ jC0jj

6

2jCi \ C0jjjCi \ C0jj þ jC0jj

¼2a0i;jjC0jj

a0i;jjC0jj þ jC0jj¼

2a0i;ja0i;j þ 1

; ð45Þ

with a0i;j satisfying (40). Since (45) is the same expression as (43) we see that the same function f asin (44) will yield (32). We are now running into a surprise concerning the OS-measure derivedfrom E in this way: indeed

f ðEðCi;C0jÞÞ ¼ 2jCi \ C0jjjCij þ jC0jj

( )( ,2

(�

2jCi \ C0jjjCij þ jC0jj

))¼jCi \ C0jjjCi [ C0jj

¼ JðCi;C0jÞ ð46Þ

showing that, although we started from E, we refound the OS-measure derived from J (Section3.3.1). In other words: QE ¼ QJ . Although we did not find a new OS-measure, we consider thisresult as a remarkable link between E and J: by the properties of f (44), we can say that E and Jare very similar measures.

The most intricate case is the case of the generalized Dice indices Ea ða 6¼ 1=2Þ. Note that thesemeasures are not symmetrical and this causes trouble in finding the right function f. Nevertheless,we arrive at new strong OS-measures.

3.3.3. Generalized DiceLet a 2 0; 1½nf1=2g be fixed. We have

EaðCi;C0jÞ ¼jCi \ C0jj

ajCij þ ð1� aÞjC0jj6

jCi \ C0jjajCij þ ð1� aÞjCi \ C0jj

¼ ai;jjCijajCij þ ð1� aÞai;jjCij

¼ ai;j

aþ ð1� aÞai;j; ð47Þ

832 L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848

with ai;j as in (37). Transforming (47) into ai;j gives the function

y ¼ f1ðxÞ ¼ax

1� ð1� aÞx ð48Þ

for which (31) is satisfied (for f ¼ f1). Note indeed that 06 f1 6 1; f1ð0Þ ¼ 0; f1ð1Þ ¼ 1 and f1 isstrictly increasing.

Likewise we obtain

EaðCi;C0jÞ ¼a0i;j

aa0i;j þ ð1� aÞ : ð49Þ

Transforming this into a0i;j gives the function

y ¼ f2ðxÞ ¼ð1� aÞx1� ax

ð50Þ

for which (32) is satisfied. Both functions f1; f2 satisfy fið0Þ ¼ 0; fið1Þ ¼ 1; 06 fi 6 1; fi strictlyincreasing ði ¼ 1; 2Þ but we need one f in Theorem 3.2.1.

Hence we take

f ¼ minðf1; f2Þ; ð51Þwhich satisfies all the requirements, notwithstanding the fact that the Eas are non-symmetricða 6¼ 1=2Þ. In order to obtain workable expressions for our strong OS-measures that we obtainedwe need to find a concrete expression for f. Note however that

f1 < f2 iff 0 < a <1

2;

f2 < f1 iff1

2< a < 1

as is readily seen. So we obtain the following good strong OS-measures. Derived from Ea with 0 <a < 1=2 we have

QEaðC;C0Þ ¼X1i¼1

X1j¼1

f1ðEaðCi;C0jÞÞuði; jÞ: ð52Þ

Here

f1ðEaðCi;C0jÞÞ ¼ajCi \ C0jj

ajCij þ ð1� aÞjC0jj � ð1� aÞjCi \ C0jj¼

ajCi \ C0jjajCij þ ð1� aÞjC0j n Cij

ð53Þ

as is readily seen. If 1=2 < a < 1, we have

QEaðC;C0Þ ¼X1i¼1

X1j¼1

f2ðEaðCi;C0jÞÞuði; jÞ; ð54Þ

where

f2ðEaðCi;C0jÞÞ ¼ð1� aÞjCi \ C0jj

ajCij þ ð1� aÞjC0jj � ajCi \ C0jj¼

ð1� aÞjCi \ C0jjajCi n C0jj þ ð1� aÞjC0jj

: ð55Þ

All these measures are new and non-trivial.

L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848 833

3.3.4. CosineWe will now search for the OS-measure related to Cos via (30). Now

CosðCi;C0jÞ ¼jCi \ C0jjffiffiffiffiffiffiffiffiffiffiffiffiffiffiffijCijjC0jj

q : ð56Þ

Now we have

jCi \ C0jjffiffiffiffiffiffiffiffiffiffiffiffiffiffijCikC0jj

q 6

jCi \ C0jjffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffijCikCi \ C0jj

q ¼ ai;jjCijffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffijCijai;jjCij

p ¼ ffiffiffiffiffiffiai;jp

; ð57Þ

where ai;j is as before. Now we have the requirement ai;j 6¼ 0 (in order to have the validity of (57))but if ai;j ¼ 0 then CosðCi;C0jÞ ¼ 0 as is

ffiffiffiffiffiffiai;jp

. So we can use (57) also for ai;j ¼ 0. Transforming(57) into ai;j, yields the function

y ¼ f ðxÞ ¼ x2 ð58Þobviously. By symmetry (or an analogous argument) this function also works for a0i;j. Since f ð0Þ ¼0; f ð1Þ ¼ 1; 06 f 6 1 and since f increases strictly on ½0; 1 we have that the next measure Q is agood strong OS-measure (derived from Cos):

QCosðC;C0Þ ¼X1i¼1

X1j¼1

jCi \ Cjj2

jCijjC0jjuði; jÞ; ð59Þ

a new measure.

3.3.5. The measure NHere

NðCi;C0jÞ ¼ffiffiffi2p jCi \ C0jjffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

jCij2 þ jC0jj2

q : ð60Þ

Now

NðCi;C0jÞ6ffiffiffi2p jCi \ C0jjffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

jCij2 þ jCi \ C0jj2

q ¼ffiffiffi2p ai;jjCijffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

jCij2 þ a2i;jjCij2

q ¼ffiffiffi2p ai;jffiffiffiffiffiffiffiffiffiffiffiffiffiffi

1þ a2i;j

q ð61Þ

with ai;j as before. Turning (61) into ai;j leads to the function

y ¼ f ðxÞ ¼ xffiffiffiffiffiffiffiffiffiffiffiffiffi2� x2p ; ð62Þ

which is strictly increasing, f ð0Þ ¼ 0; f ð1Þ ¼ 1; 06 f 6 1 on ½0; 1 . The same is true for a0i;j. Wehave now the good strong OS-measure

QNðC;C0Þ ¼X1i¼1

X1j¼1

f ðNðCi;C0jÞÞuði; jÞ; ð63Þ

834 L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848

where

f ðNðCi;C0jÞÞ ¼jCi \ C0jjffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

jCij2 þ jC0jj2 � jCi \ C0jj

2q ð64Þ

as is readily seen.

3.3.6. The overlap measure O2

O2ðCi;C0jÞ ¼jCi \ C0jj

maxðjCij; jC0jjÞ6

jCi \ C0jjjCij

¼ ai;j ð65Þ

and the same for a0i;j. Hence f can be taken f ðxÞ ¼ x. We have now the good strong OS-measure

QO2ðC;C0Þ ¼

X1i¼1

X1j¼1

jCi \ C0jjmaxðjCij; jC0jjÞ

uði; jÞ: ð66Þ

Since R; P and O1 are not strong similarity measures, they cannot be treated here as above.These weak similarity measures will be studied in another paper.

Note: As it follows from the above constructions, QO2is the largest of all the OS-measures that

are constructed here. Indeed, denoting QD for any of the QJ ¼ QE;QEa ;QCos;QN , we have that

f ðDðCi;C0jÞÞ6 minðai;j; a0i;jÞ ¼

jCi \ C0jjmaxðjCij; jC0jjÞ

; ð67Þ

hence QD 6QO2, for all the above D. Note the difference with (14).

The following section investigates some of these measures in a practical context.

4. Experimentation

4.1. Presentation of the context

The experimentation’s aim is to compare different OS-measures. In order to test it (as in Michel,2000), we have constructed a corpus of 550 queries and put them to the personalized filteringinformation retrieval system, Profil-Doc, developed by the laboratory RECODOC – see Michel(1999). Profil-Doc has the particularity of making a filtering of information regarding the profileof the user (Lain�ee-Cruzel, Lafouge, Lardy, & Ben Abdallah, 1996). SPIRIT� 1, the motor used toextract documents from the plain text database, uses weights to present answer documents intoclasses, with these classes being ranked in order of relevance to the user (see Fluhr, 1997).

Three different user profiles (called T1, T2 and T3) are tested with the 550 queries. In order tomeasure profiles’ effectiveness we compare, for each query, the personalized answer with a neutral

1 SPIRIT (Syntactic and Probabilistic Indexing and Retrieval Information System) is a commercial product of

T.GID. Searches about SPIRIT are made according to the CEA-DIST (Atomic Energy Commission – Scientific and

Technique Information Direction) – http://www.dist.cea.fr/.

L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848 835

one, i.e., given by the system without any filtering. In Fig. 2 we can see the number of documents(on the Y-axis) given by the system for each answer (number of answers on the X-axis). For eachcurve, values are presented by a decreasing number of documents for T1. So query numberscannot be directly compared.

We can see that the neutral process, T2 filtering and T3 filtering give the same variety of an-swers relating to the number of documents, indeed it varies in a regular way from 1 to 371, 366and 276, respectively. T1 filtering gives more specialized answers (indeed none of them have morethan 40 documents) but as before, we can observe a regular distribution. So the number ofdocuments per answer is not a bias in the experimentation.

OS-measures, presented in the following section, will be used to calculate, query per query, thesimilarity degree between neutral and personalized answers.

4.2. Tested OS-measures

We choose to compare the Jaccard and Cosine strong OS-measures defined previously in Eqs.(41) and (59) by (we used finite sums since this is always the case in practice):

QJðC;C0Þ ¼Xmi¼1

Xm0j¼1

JðCi;C0jÞuði; jÞ;

Fig. 2. Number of documents per query.

836 L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848

QCosðC;C0Þ ¼Xmi¼1

Xm0j¼1

jCi \ C0jj2

jCijjC0jjuði; jÞ:

As we have said before, it is possible to choose for u any function verifying the conditions (i)–(v) of Theorem 3.2.1.

We have shown in Michel (2000) that the linear function ulði; jÞ defined as follows is correct

ulði; jÞ : ½1;m � ½1;m0 ! Rþ; ulði; jÞ ¼ dm0ðiðji� jj þ 1ÞÞ � dm0ðjðji� jj þ 1ÞÞ; ð68Þwhere dm0 : ½1;m2

0 ! Rþ; m0 ¼ maxðm;m0Þ

dm0ðnÞ ¼

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi6m3

0

6m40 � 6m3

0 þ 8m20 � 3m0 þ 1

s1

�� n� 1

m20

�: ð69Þ

We have seen in Eq. (33) that a power function like u1ði; jÞ ¼ 3=ð2i2j2ji�jjÞ is good too. Fornormalization reasons with finite sums, we now apply a normalization constant equalling4m0=ð4m0 � 1Þ. So, the second weight function is defined by:

up : ½1;m � ½1;m0 ! Rþ; upði; jÞ ¼4m0

4m0 � 1

3

2i2j2ji�jj: ð70Þ

We hence now have four strong concrete OS-measures defined as follows:

JXl ðC;C0Þ ¼

Xmi¼1

Xm0j¼1

JðCi;C0jÞulði; jÞ; ð71Þ

JXp ðC;C0Þ ¼

Xmi¼1

Xm0j¼1

JðCi;C0jÞupði; jÞ; ð72Þ

CosXl ðC;C0Þ ¼Xmi¼1

Xm0j¼1

jCi \ C0jj2

jCijjC0jjulði; jÞ; ð73Þ

CosXp ðC;C0Þ ¼Xmi¼1

Xm0j¼1

jCi \ C0jj2

jCikC0jjupði; jÞ: ð74Þ

The measure QRec defined in (75) is derived from the Recall,

QRecðC;C0Þ ¼Xmi¼1

Xm0j¼1

jCi \ C0jjjCij

uði; jÞ: ð75Þ

Recall has been found to be a weak similarity measure, hence not suited in our framework. Butit is sufficient for a ‘‘simple’’ OS-measure as defined in Michel (2000). We choose to test it only forexperimentation purpose. The corresponding formulae are:

RXl ðC;C0Þ ¼

Xmi¼1

Xm0j¼1

RðCi;C0jÞulði; jÞ; ð76Þ

RXp ðC;C0Þ ¼

Xmi¼1

Xm0j¼1

RðCi;C0jÞupði; jÞ: ð77Þ

L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848 837

4.3. Analysis

We will make three types of comparisons in order to estimate:

• the impact of the weight function u;• the impact of the similarity indicators (J, Cos and R) and• the impact of the query type (specialized or general queries).

The impact of the weight function u is estimated by comparing JXl and CosXl with their cor-

responding means Jm0and Cosm0

defined as

Jm0¼Pm

i¼1Pm0

j¼1 JðCi;C0jÞm0

and

Cosm0¼Pm

i¼1Pm0

j¼1 CosðCi;C0jÞm0

:

The means are considered as measures with neutral weight function. In order to make thecurves readable, we have ranked the queries according to increasing values of JX

l and CosXl (Figs.3–8).

The impact of the similarity indicator J, Cos and R is estimated by comparing JXl ; CosXl ; RX

l ,and JX

p ; CosXp ; RXp . In order to make the curves readable, we have ranked the queries according to

the increasing values of JXp (Figs. 9–11).

The impact of the query type (specialized or general queries) is estimated by the shape of thecurve when we rank the results of JX

l as a function of the increasing number of documents peranswer (Fig. 12), the increasing number of classes per answer (Fig. 13) and the increasing numberof documents per class per answer (Fig. 14).

Fig. 3. T1 – JX1 ; Jm0

.

838 L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848

4.4. Results

4.4.1. Impact of weight functionFor each user’s interrogation context T1, T2 and T3, we compare the OS JX

l and CosXl withtheir corresponding non-ordered rank measure Jm0

and Cosm0. We have chosen not to present the

comparison of ðJXp ; Jm0

Þ and ðCosXp ; Cosm0Þ because we compare ðJX

l ; CosXl Þ and ðJXp ; CosXp Þ in

the following section. In all figures, the queries on the X-axis are represented according to theincreasing values of JX

l or CosXl .Bold curves relate to the OS linear measures (JX

l and CosXl ). Thin curves relate to the corre-sponding mean ðJm0

; Cosm0Þ. We notice very different shapes of curves between the T1, T2 and T3

Fig. 4. T1 – CosX1 ;Cosm0.

Fig. 5. T2 – JX1 ; Jm0

.

L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848 839

contexts. Indeed, when we rank the results of T1 by increasing values of JXl we can observe an

exponential type function. The same operation in T2 results in a logarithmic type function and inT3 results in an S-shaped function. These differences show that the systems to be compared givereally different and personalized results.

Globally, the bold and thin curves have the same shape. In all cases, generally, the thin curve isabove the bold one. The interpretation is the same as in Michel (2000):

The Jm0curve slightly above the OS curve is a consequence of the choice of the decreasing

function dm0 . Punctually, we can observe values of Jm0below the curve. This phenomenon is

explained by the calculation of the average of the index of Jaccard on m0, the maximum ofthe classes of the compared sets. If m and m0 are the numbers of classes of each compared

Fig. 7. T3 – JX1 ; Jm0

.

Fig. 6. T2 – CosX1 ;Cosm0.

840 L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848

answer and if there is a strong difference between m and m0, then Jm0could be lower than the

corresponding OS value. The further the curves Jm0and OS will move away, the more im-

portant the filtering will be in the scheduling of the documents.

By comparing the figures presented on the same line (Figs. 3 and 4, Figs. 5 and 6, Figs. 7 and 8),we can remark that the values of J and Cos look very similar. There are three zones:

• Zone 1 (called head of the curve) is when Jm0and JX

l are identical to 0. This phenomenon musthave been visible in curves 3 and 6. It did not because of technical reasons we have had to sup-press some queries.

Fig. 9. T1 – JX1 ;Cos

X1 ;R

X1 and JX

p ;CosXp ;R

Xp .

Fig. 8. T3 – CosX1 ;Cosm0.

L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848 841

• Zone 2 (called heart of curve) is when Jm0(respectively, Cosm0

) takes extremely variable valuescompared to JX

l (respectively, CosXl ).• Zone 3 (called tail of the curve) is when Jm0

and JXl (or Cosm0

and CosXl ) are identical to 1.

We can observe that the tail of curve is unchanged from the Jaccard case ðJXl ; Jm0

Þ to the cosinecase ðCosXl ; Cosm0

Þ. On the contrary, in the heart of the curve (zone 2) the thin curve of Cosm0has

much more amplitude than Jm0. We can say that, without any weight function, the cosine measure

seems to be less stable than the Jaccard.

Fig. 11. T3 – JX1 ;Cos

X1 ;R

X1 and JX

p ;CosXp ;R

Xp .

Fig. 10. T2 – JX1 ;Cos

X1 ;R

X1 and JX

p ;CosXp ;R

Xp .

842 L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848

4.4.2. Impact of the classical similarity indicatorIn the three following figures we compare, for T1, T2 and T3, the OS JX

l ; CosXl ; RXl and

JXp ; CosXp ; RX

p . In order to make the curves readable, we have ranked the queries according toincreasing values of JX

p (Figs. 9–11).In the three figures we can remark that for a given weight function and irrespective of the

indicator used (Cosine, Jaccard or Recall) the results of OS measures are strictly identical for eachof the 550 queries. Indeed, we have JX

l CosXl RXl (represented by the thin curve) and

JXp CosXp RX

p (bold curve).This very remarkable result shows that the weight function suppresses the initial effect of the

Jaccard, Cosine or Recall measure.Secondly, we can remark that, in each figure, thin and bold curves’ shapes are identical. Indeed,

as above, with T1 we can observe an exponential function (Fig. 9), with T2 there is a logarithmic

Fig. 12. T2 – JX1 , and JX

p ranking by increasing number of documents per answer.

Fig. 13. T2 – JX1 , and JX

p ranking by increasing number of class per answers.

L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848 843

function (Fig. 10) and with T3 we have an S-shaped function (Fig. 11). This regularity shows thatthe linear and power indicator act in the same way but not on the same degree.

Finally we can say that in each system, the thin curve is always above the bold one in thebeginning, goes through it in a particular point and remains below until the end of the values.When the answers are very different (similarity near 0), the power weight is more precise than thelinear one. On the contrary, when the answers are very similar, the linear function is more precise.We did not find any interpretation of the intersection point.

4.4.3. Impact of the query typeWe have some quantitative information about the answers, for example, the number of doc-

uments they have or the number of classes they have.Let us suppose first that the query type is a function of the number of documents that the

corresponding answer gives: a lot of documents means that the query is very general; only fewdocuments means that the query is specialized. In order to observe if this fact has an impact on thesimilarity measure, we rank the results of JX

l (thin curve) and JXp (bold curve) as a function of the

increasing number of documents per answer. In Fig. 12, the curves are drawn with results of user’sT2. We cannot observe any regularity. Some queries have very particular results, giving abruptdecreasing values. There is nothing particular characterizing these queries. For different reasons,linked with the particular profile T2 of the user, the personal answers are very different to theneutral ones. In the case of T1 and T3 (not represented with figures) the same phenomenon isobserved for other queries.

We know that in the answers, the documents are grouped into classes. The more the querieshave terms the more the answers may have classes. So, if we suppose that general queries have fewterms, answers with little number of classes must be supposed to be more general. In order to testif the number of classes has an impact on the OS, we have ranked the answers as a function of thenumber of classes (Fig. 13). But, as before, there is no regularity.

The last and final test is to rank the queries as a function of the number of documents per class.But in this case too, no regularity is observed (Fig. 14).

Fig. 14. T2 – JX1 , and JX

p ranking by increasing number of document/class per answers.

844 L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848

5. Conclusion

In Michel (2000) we have presented some first results on OS-measures. The present workcompletes it by presenting a more general treatment of OS-measures. Indeed, in Section 2 we makea distinction between two types of OS-measures: strong and weak ones. In Section 3, a method isproposed to construct strong OS-measures. It is usable in more general contexts than the onepresented in Michel (2000) because it works for comparing any two ordered sets whereas Michel(2000) needs some restrictive compatibility condition on them. We propose in Section 3 concretegood strong OS-measures derived from strong similarity measures of Jaccard, Dice (from which werefind the strong OS-measures derived from Jaccard), Generalized Dice, Cosine, N, Overlap O2.We remark that we did not have to take into account the case of R; P and O1 because they are notstrong similarity measures. These weak similarity measures will be studied in another paper.

We choose to test two strong OS-measures derived from Jaccard and Cosine, and one simpleOS-measure derived from Recall. The construction of concrete OS-measures is made with a linearand a power weight function. We made no hypothesis upon the similarity indicator. We made onehypothesis on the weight function: according to the law of Weber–Fechner, a power weightfunction is supposed to be better than a linear one to reflect the rank difference between orderedsets.

First results show that without any weight function the Jaccard indicator seems to be morestable than the Cosine one. This result may be linked with the fact that Spirit�, the software use inthe experimentation, is a weighted boolean system. The second very remarkable result is that theweight function (linear or power) suppresses, for both strong and simple OS results, the initialeffect of the Jaccard, Cosine or Recall indicator. Comparative tests on the linear and power weightfunction show that they act in the same way but not on the same degree: there is not one betterthan the other. Indeed, when the answers are very different (similarity near 0), the power weight ismore precise than the linear one. On the contrary, when the answers are very similar, the linearfunction is more precise. Nothing has been remarked on a possible link between the type of query(general or specialized query) and the OS characteristics.

Some complementary experimentations with the Overlap measure O2 and the generalized Dicecould be interesting in order to know if these phenomena are general to all strong OS-measures. Inthis case, it would be interesting to know if it is really important to find some ‘‘weakest’’ weightfunction, or if (and in this case why) in the ordered set cases, the rank is the more importantcriterion. These will be studied in another paper.

Appendix A

Proof of theorem 3.2.1.

ðQ0Þ:If C ¼ ðC1; ;; . . . ; ;; . . .Þ and C0 ¼ ðC01; ;; . . . ; ;; . . .Þ, symbolizing the unordered case, then (30)reduces to

QðC;C0Þ ¼ f ðDðC1;C01ÞÞ;

L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848 845

hence the measure of f � D. We noted already (in Section 2) that f � D is a good strong similaritymeasure if D is, because of the properties of f. We leave the easy proof to the reader.

ðQ1Þ:for all C;C0 2 C:

06QðC;C0Þ ¼X1i¼1

X1j¼1

f ðDðCi;C0jÞÞuði; jÞ6X1i¼1

X1j¼1

f ðDðCi;C0jÞÞuði; iÞ6X1i¼1

uði; iÞ ¼ 1;

using (31), (iv) and (v). Note that, instead of (31), we could have used (32) as well. For the proof ofðQ2Þ we will, however, need both (31) and (32).

ðQ2Þ:Let QðC;C0Þ ¼ 1 for C ¼ ðCiÞi2N 2 C and C0 ¼ ðC0jÞj2N 2 C. HenceX1

i¼1

X1j¼1

f ðDðCi;C0jÞÞuði; jÞ ¼ 1: ðA:1Þ

Suppose 9j0 > i0 such that f ðDðCi0 ;C0j0ÞÞ 6¼ 0.

Then, by (i)X1i¼1

X1j¼1

f ðDðCi;C0jÞÞuði; jÞ <X1i¼1

X1j¼1

f ðDðCi;C0jÞÞuði; iÞ6X1i¼1

uði; iÞ ¼ 1;

by (31) and (iv), contradicting (A.1). Suppose 9i0 > j0 such that f ðDðCi0 ;C0j0ÞÞ 6¼ 0. Then,

by (ii)X1i¼1

X1j¼1

f ðDðCi;C0jÞÞuði; jÞ <X1i¼1

X1j¼1

f ðDðCi;C0jÞÞuðj; jÞ

¼X1j¼1

X1i¼1

f ðDðCi;C0jÞÞuðj; jÞ6X1j¼1

uðj; jÞ ¼ 1;

by (32) and (iv), again contradicting (A.1). Hence

QðC;C0Þ ¼X1i¼1

f ðDðCi;C0iÞÞuði; iÞ ¼ 1: ðA:2Þ

If there exists an i 2 N such that f ðDðCi;C0iÞÞ < 1, then by (iv) and the fact that u > 0:

QðC;C0Þ <X1i¼1

uði; iÞ ¼ 1;

again a contradiction. Hence f ðDðCi;C0iÞÞ ¼ 1 8i 2 N. By the properties of f we have DðCi;C0iÞ ¼ 1and since D is a strong similarity measure (hence satisfying (D2)) we have that Ci ¼ C0i 8i 2 N.Hence C ¼ C0. Conversely, if C ¼ C0, we have that Ci \ C0j ¼ ; 8i; j; i 6¼ j and Ci ¼ C0i 8i 2 N.Hence (by the properties of D and f):

846 L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848

QðC;C0Þ ¼X1i¼1

X1j¼1

f ðDðCi;C0jÞÞuði; jÞ ¼X1i¼1

f ðDðCi;C0iÞÞuði; iÞ

¼X1i¼1

f ðDðCi;CiÞÞuði; iÞ ¼X1i¼1

uði; iÞ ¼ 1:

ðQ3Þ:

QðC;C0Þ ¼X1i¼1

X1j¼1

f ðDðCi;C0jÞÞuði; jÞ ¼ 0

iff f ðDðCi;C0jÞÞ ¼ 0 8i; j 2 N (since u > 0). By the fact that f strictly decreases and f ð0Þ ¼ 0 we seethat DðCi;C0jÞ ¼ 0. Hence ðD3Þ implies Ci \ C0j ¼ ; 8i; j 2 N. Hence

8i 2 N : Ci \[1j¼1

C0j

!¼ ;;

hence [1i¼1

Ci

!\

[1j¼1

C0j

!¼ ;;

hence C \ C0 ¼ ;.ðQ4Þ:For CðiÞ;C0ðjÞ 2 C as in ðQ4Þ we have

QðCðiÞ;C0ðjÞÞ ¼ f ðDðCi;C0jÞÞuði; jÞ: ðA:3Þ

As given, f ðDðCi;C0jÞÞ is constant in i and j (the sets remain the same; only their ranks arevariable). Hence (Q4) follows from (i) and (ii).

ðQ5Þ:For CðiÞ;C0ðiÞ 2 C as above we have

QðCðiÞ;C0ðiÞÞ ¼ f ðDðCi;C0iÞÞuði; iÞ:Again, f ðDðCi;C0iÞÞ is independent of i. Hence ðQ5Þ follows from (iii).Of course, if D is symmetric and u is too, then (30) implies that Q is symmetric too. �

References

Boyce, B. R., Meadow, C. T., & Kraft, D. H. (1995).Measurement in information science. New York: Academic Press.

Egghe, L. (1994). A theory of continuous rates and applications to the theory of growth and obsolescence rates.

Information Processing and Management, 30(2), 279–292.

Egghe, L., & Rousseau, R. (1990). Introduction to informetrics. Quantitative methods in library, documentation and

information science. Amsterdam: Elsevier.

L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848 847

Fluhr, C. (1997). SPIRIT.W3: A distributed cross-lingual indexing and search engine. In Proceedings of the INET 97

‘‘The seventh annual conference of the internet society’’, June 24–27, 1997, Kuala Lumpur, Malaysia.

Grossman, D. A., & Frieder, O. (1998). Information retrieval algorithms and heuristics. Boston, MA: Kluwer Academic

Publishers.

Lain�ee-Cruzel, S., Lafouge, T., Lardy, J. P., & Ben Abdallah, N. (1996). Improving information retrieval by combining

user profile and document segmentation. Information Processing and Management, 32(3), 305–315.

Losee, R. M. (1998). Text retrieval and filtering. Analytic models of performance. Boston, MA: Kluwer Academic

Publishers.

Michel, C. (1999). Evaluation de syst�eemes de recherche d’information, comportant une fonctionnalit�ee de filtrage, par

des mesures endog�eenes. R�eealisation et evaluation d’un prototype de syst�eeme de recherche d’information avec filtre

selon les profils des utilisateurs. Ph.D. Thesis. University Lyon II, January 6, 1999, 322 p.

Michel, C. (2000). Ordered similarity measures taking into account the rank of documents. Information Processing and

Management, 37(4), 603–622.

Salton, G., & Mc Gill, M. J. (1987). Introduction to modern information retrieval. New York: Mc Graw-Hill.

Tague-Sutcliffe, J. (1995). Measuring information. An information services perspective. New York: Academic Press.

Van Rijsbergen, C. J. (1979). Information retrieval (2nd ed.). London: Butterworths.

848 L. Egghe, C. Michel / Information Processing and Management 38 (2002) 823–848