

Statistical Measure for Assisting the Multilingual Information Retrieval System Evaluation Using Normalized Distance Measure

1 Sidige Chetana, 2 Pothula Sujatha, 3 Raju Korra

School of Engineering and Technology, Department of Computer Science, Pondicherry University, Puducherry, India.
1 [email protected], 2 [email protected], 3 [email protected]

Abstract— A multilingual information retrieval (MLIR) system retrieves relevant information from multiple languages in response to a user query expressed in a single source language. The effectiveness of multilingual information retrieval is usually measured with Mean Average Precision (MAP). A key characteristic of multilingual information retrieval is that the score list produced for one language cannot be compared directly with the score list of another language; MAP does not take this characteristic into account. We propose a new metric, the Normalized Distance Measure (NDM), for measuring the effectiveness of MLIR systems. NDM takes these MLIR-specific features into account. Our analysis shows that the NDM metric gives credit to MLIR systems that retrieve highly relevant multilingual documents.

Keywords— Average Distance Measure (ADM), Normalized Distance-based Performance Measure (NDPM), Discounted Cumulated Gain (DCG), Multilingual Information Retrieval, Merging Techniques.

I. INTRODUCTION

Information retrieval (IR) is the identification of those documents in a large document collection that are relevant to an explicitly stated query. The goal of an IR system is to find the documents that are relevant to a natural language query within a collection of natural language documents. Information retrieval uses retrieval models to compute the similarity between the query and the documents; common retrieval models include the binary retrieval model, the vector space model, and the probabilistic model.

Cross-language information retrieval (CLIR) is the setting in which a user searches a set of documents written in one language using a query expressed in another language. Matching is performed between the translated query and each document in the other language. There are three main translation approaches in CLIR: machine translation, bilingual machine-readable dictionaries, and parallel or comparable corpora-based methods.

If the translations drawn from bilingual resources contain terms irrelevant to the original query, the effectiveness of the search is diminished. Translation disambiguation is therefore desirable, so that only relevant terms are selected, automatically or semi-automatically, from the set of candidate translations. Sophisticated methods have been explored in CLIR for translation disambiguation, including part-of-speech (POS) tags, parallel corpora, co-occurrence statistics in the target corpus, and query expansion techniques. CLIR systems also face language barrier issues, such as how to merge a ranked list that contains documents in different languages [2].

Due to the existence of multicultural communities, users increasingly face multilingualism. Systems in which a user searches a multilingual document collection with a query expressed in a single language are termed MLIR systems. The goal of an MLIR system is to let users access more information, and access it more easily and quickly, than monolingual systems allow. First, the incoming query is translated into the target languages; second, the information obtained from the different languages is integrated into one single ranked list. This task is more complicated than simple bilingual CLIR. The weight assigned to each document, its retrieval status value (RSV), is calculated not only from the relevance of the document and the IR model used; the rest of the monolingual corpus to which the document belongs is also a determining factor.

There are two architectures in MLIR: centralized MLIR and distributed MLIR. Our proposed metric is applied to distributed MLIR. The two types of multilingual information retrieval methods are query translation and document translation. Since document translation causes more complications than query translation, our proposal applies query translation.

Standard information retrieval metrics such as MAP are commonly used to evaluate MLIR. To measure MLIR performance correctly, we need to consider MLIR-specific features such as translation (the language barrier) and the merging methods. Our new metric is based on the concept of the ADM metric, and the drawbacks of ADM are overcome in the proposed formula.

The rest of the paper is arranged as follows: Section II discusses previous research on information retrieval metrics, the drawbacks of earlier metrics, and the relevant MLIR features. Section III proposes a new metric for evaluating MLIR. Section IV presents experiments that demonstrate the effectiveness of the proposed metric. Section V concludes this work.


II. RELATED WORK

Basically, there are two architectures in MLIR: centralized and distributed [12]. In a centralized architecture, the document collections in different languages are viewed as a single document collection and are indexed in one huge index file. The advantage of a centralized architecture is that it avoids the merging problem. One problem of the centralized architecture is that the weights of index terms become over-weighted; as a result, a centralized architecture tends to prefer documents from small document collections. The second architecture is distributed MLIR. In a distributed architecture, documents in different languages are indexed and retrieved separately: a standard bilingual search is performed for each language, and several ranked document lists are generated, one per run. In addition to the language translation issues, obtaining a single ranked list that contains documents in different languages drawn from several text collections is also a critical problem.

The two types of multilingual information retrieval methods are query translation and document translation [2]. Retrieval based on document translation may have an advantage over retrieval based on query translation, because translating long documents can preserve semantic meaning more accurately than translating short queries. A general and simple search strategy in MLIR is query translation: the user query is translated into each language of the information sources and a monolingual search is run in each source. It is then necessary to obtain a single ranked document list by merging the individual ranked lists, which are in different languages. This issue is known as the merging strategy problem or the collection fusion problem. The results merging problem in MLIR is more complicated than merging in monolingual environments, since the language barrier has to be crossed in order to rank documents in different languages against each other.

A. Merging Strategies

Round-robin merging: This approach is based on the idea that document scores are not comparable across the collections [11].

Raw score merging: This approach is based on the idea that document scores are comparable across the collections [11].

Normalized score merging: This approach divides each score by the maximum score of the topic on the current list [10], [11].

CombSUM merging: In this approach the result scores from each language are first normalized; then the scores of documents that occur in multiple collections are summed, and the results are ranked according to their new joint score [8].

CORI merging: CORI is a weighted-score approach in which the weights are based on the document's score (RSV) and on collection ranking information [8].

Logistic regression or rank-based scoring: Ranks are converted into scores by assuming a mathematical relationship between a document's rank and its probability of relevance [13].
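To make the round-robin and normalized-score strategies concrete, the following Python sketch shows one way they could be implemented for per-language result lists; the function names and the data layout are illustrative assumptions, not taken from the cited systems.

```python
# A minimal sketch of two merging strategies (illustrative only; names and
# data layout are assumptions, not taken from the cited CLEF systems).

def normalized_score_merge(ranked_lists):
    """ranked_lists: {language: [(doc_id, score), ...] sorted by score desc}.
    Each score is divided by the maximum score of its own list, then all
    lists are pooled and re-sorted by the normalized score."""
    merged = []
    for lang, docs in ranked_lists.items():
        if not docs:
            continue
        max_score = max(score for _, score in docs)
        for doc_id, score in docs:
            merged.append((doc_id, lang, score / max_score))
    return sorted(merged, key=lambda item: item[2], reverse=True)

def round_robin_merge(ranked_lists):
    """Interleave the per-language lists one document at a time, ignoring
    scores, since scores are assumed not comparable across collections."""
    merged, lists = [], [list(docs) for docs in ranked_lists.values()]
    while any(lists):
        for docs in lists:
            if docs:
                merged.append(docs.pop(0))
    return merged

runs = {
    "en": [("en_3", 12.4), ("en_7", 9.1)],
    "fr": [("fr_2", 4.2), ("fr_9", 3.9)],
}
print(normalized_score_merge(runs))
print(round_robin_merge(runs))
```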

A user-centered approach is a trustworthy and reliable approach, although it causes problems in terms of lack of control over variables. Relevant and non-relevant are two dichotomous decisions, and four cases arise when binary relevance is combined with binary retrieval. A performance measurement can be examined through the agreement or disagreement between the user's and the system's rankings. By taking into account the algorithmic rank position and the relevance values assigned to the retrieved objects, one can exploit two parameters: 1) the ranked order, and 2) the user's interpretation of the ranked documents.

B. Existing IR Metrics

Discounted Cumulated Gain (DCG): Cumulated gain scans the ranked output from the top and accumulates a gain for every relevant document. The deeper the rank, the smaller the share of the document's gain that is added to the discounted cumulated gain.
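As a concrete illustration (not part of the original experiments), a common formulation of DCG following [4] can be sketched in Python; the discount base b = 2 is an assumed parameter.

```python
import math

def dcg(gains, b=2):
    """Discounted cumulated gain of a ranked list of relevance gains,
    following the formulation in Jarvelin & Kekalainen [4]: gains at
    ranks of b or deeper are divided by log_b(rank)."""
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        total += gain if rank < b else gain / math.log(rank, b)
    return total

# Example: graded relevance gains for the top five retrieved documents.
print(dcg([3, 2, 3, 0, 1]))
```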

Normalized Distance-based Performance Measure (NDPM): NDPM compares the ranking order of two documents. The NDPM numerator compares the user's ranking order with the system's ranking order, and the denominator compares the user's ranking order with the reverse of the user's ranking [1], [5].

Average Distance Measure (ADM) [3]: ADM measures the average distance between the UREs (user relevance estimations) and the SREs (system relevance estimations) [2]. [3] also states the importance of ranking in performance measurement.
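A minimal sketch of ADM, assuming its usual definition in [3] (one minus the average absolute difference between user and system relevance estimations), might look like the following; the example score lists are the USER and MLIR1 rows that appear later in Table III.

```python
def adm(user_estimates, system_estimates):
    """Average Distance Measure [3]: 1 minus the mean absolute difference
    between the user relevance estimations (UREs) and the system relevance
    estimations (SREs), taken over the same set of documents."""
    assert len(user_estimates) == len(system_estimates)
    distance = sum(abs(u - s) for u, s in zip(user_estimates, system_estimates))
    return 1.0 - distance / len(user_estimates)

# Example with the USER and MLIR1 score lists that appear in Table III.
print(adm([0.9, 0.8, 0.7, 0.6, 0.5], [0.8, 0.7, 0.6, 0.5, 0.9]))  # 0.84
```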

III. PROPOSED METRIC

We propose a new metric, the Normalized Distance Measure (NDM), for measuring an MLIR system. A distributed MLIR system has three steps: translation of the queries, retrieval of documents from each collection, and merging of the individual results into a single list. NDM treats ranks, rather than raw scores, as the suitable measurement, because performance measurement works better with continuously varying relevance and retrieval values, and because document scores cannot be compared across languages owing to language-specific barriers such as different translation schemes and different numbers of documents in the heterogeneous language collections. Many retrieval models return the rank of the retrieved items, and the importance of rank order is taken from the DCG metric [4]. The final ranked list obtained from the MLIR system, containing the relevant documents of all the languages, is denoted R_MLIR. The ranked list obtained from the user is denoted R_USER. The Normalized Distance Measure measures the difference between the user's estimated ranked list and the final ranked list obtained from the MLIR system. The NDM value ranges from 0 to 1.

$$\mathrm{NDM} = 1 - \frac{\displaystyle\sum_{i=0}^{m} \frac{\left|R_{MLIR}(i) - R_{USER}(i)\right|}{R_{MLIR}(i) + \alpha}}{\displaystyle\sum_{i=0}^{m} \frac{\left|R_{Threshold}(i) - R_{USER}(i)\right|}{R_{Threshold}(i) + \alpha}} \qquad (1)$$

where i = {0, 1, 2, ..., m} and m is the total number of documents present in the user's result list. α denotes the interval between consecutive rankings, i.e., α = 1. In equation (1), the term R_MLIR(i) + α is the denominator of the penalty; the significance of α is seen in the cases where a relevant document is not retrieved (its rank is 0), since α keeps the penalty defined.



The NDM value of the best MLIR system is close to 1. The term R_MLIR(i) + α captures precision: as a relevant document moves toward the bottom of the result list, the importance given to its displacement decreases. Two rank differences can be numerically the same, for example 15 - 10 = 5 and 100 - 95 = 5, yet a document that the user places at the 10th position but the MLIR system places at the 15th position matters more than a document at the user's 95th position that the MLIR system places at the 100th position. This effect is what the divisor R_MLIR(i) + α encodes; thus the penalty increases as the MLIR system's rank moves farther away from the user ranking, and displacements near the top of the list weigh more. Table II tabulates the distance between the MLIR and USER rank systems using the term |R_MLIR(i) - R_USER(i)| / (R_MLIR(i) + α). Table I shows that a ranking system can be partitioned into six divisions.

TABLE I. SIX POSSIBLE DIVISIONS FOR RANKING SYSTEM

                 Exactly Relevant   High Relevant    Less Relevant
Retrieved        Rank, Rank         Rank, Rank + y   Rank + y, Rank
Non-Retrieved    X, X               X, Rank          Rank, X

The six cases are as follows:

a) R_MLIR(i) = R_USER(i)
b) R_MLIR(i) > R_USER(i)
c) R_MLIR(i) < R_USER(i)
d) R_MLIR(i) = 0, R_USER(i) = 0
e) R_MLIR(i) ≠ 0, R_USER(i) = 0
f) R_MLIR(i) = 0, R_USER(i) ≠ 0

In the first three cases the document is considered relevant by both the MLIR system and the user. In the last three cases the document is considered non-relevant by the MLIR system or by the user. Table II lists all the possible cases that arise when comparing two ranking systems; 'X' in Table I indicates that the document is not retrieved. In cases (a) and (d) the difference between the rankings is 0, since both rankings are the same; these cases appear in Table II as the diagonal of 0's. In cases (c) and (f) the difference R_USER(i) - R_MLIR(i) is positive, i.e., the MLIR system gives the document more importance (the high relevant case); this is represented below and to the left of the diagonal of Table II. In cases (b) and (e) the difference is negative, i.e., the MLIR system gives the document less importance (the less relevant case); this is represented above and to the right of the diagonal.

We can characterize a good MLIR system, but the worst MLIR system cannot be pinned down, because an MLIR system keeps getting worse as more irrelevant documents are treated as relevant. We therefore use a threshold MLIR system as the reference bad system. The denominator of equation (1) measures the difference between the user ranked list and the threshold MLIR system R_Threshold(i); the numerator measures the difference between the MLIR ranked list and the user ranked list.

The threshold case is built by considering a few rules:

• The first relevant document is placed as the last document in the threshold case, the second as the second-to-last document, and so on.
• The threshold case is the opposite of the best case. As the best case runs from the top-left corner to the bottom-right corner of Table II, the threshold case runs from the top-right corner to the bottom-left corner of Table II.

TABLE II. CALCULATION OF DISTANCE BETWEEN MLIR AND USER RANK SYSTEMS IN ALL SIX POSSIBILITIES

R_USER(i)  R_USER(i)+α | R_MLIR(i):    0     1     2     3     4     5
                        | R_MLIR(i)+α:  1     2     3     4     5     6
    0           1       |               0    0.5   0.67  0.75  0.8   0.83
    1           2       |               1     0    0.33  0.5   0.6   0.67
    2           3       |               2    0.5    0    0.25  0.4   0.5
    3           4       |               3     1    0.33   0    0.2   0.33
    4           5       |               4    1.5   0.67  0.25   0    0.17
    5           6       |               5     2     1    0.5   0.2    0

IV. EXPERIMENTAL RESULTS

The effectiveness of an Information Retrieval System (IRS) is based on relevance and retrieval. [2] states that precision and recall are highly sensitive to the thresholds chosen. The drawbacks of ADM are stated in [3]; these are corrected in NDM. [3] also states the importance of ranking in performance measurement.

TABLE III. DOCUMENT SCORES IN SIX MLIR SYSTEMS

        D1   D2   D3   D4   D5
USER   0.9  0.8  0.7  0.6  0.5
MLIR1  0.8  0.7  0.6  0.5  0.9
MLIR2  0.9  0.7  0.6  0.8  0.5
MLIR3  0.9  0.6  0.8  0.7  0.5
MLIR4  0.9  0.8  0.7  0.5  0.6
MLIR5  0.8  0.9  0.7  0.6  0.5
MLIR6  0.9  0.7  0.8  0.5  0.6

Table III contains the document score lists obtained from the user and from six MLIR systems that assign different document scores. Table IV shows the conversion of each score list into a rank list.



TABLE IV. RANK LISTS OF SIX MLIR SYSTEMS

        D1  D2  D3  D4  D5
USER     1   2   3   4   5
MLIR1    2   3   4   5   1
MLIR2    1   3   4   2   5
MLIR3    1   4   2   3   5
MLIR4    1   2   3   5   4
MLIR5    2   1   3   4   5
MLIR6    1   3   2   5   4
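The conversion from the score lists of Table III to the rank lists of Table IV is a sort by descending score; a small sketch of that step is shown below (the list layout is an illustrative assumption).

```python
def scores_to_ranks(scores):
    """Convert a list of document scores into ranks (1 = highest score),
    as done when turning Table III into Table IV."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, doc_index in enumerate(order, start=1):
        ranks[doc_index] = position
    return ranks

print(scores_to_ranks([0.8, 0.7, 0.6, 0.5, 0.9]))  # MLIR1 -> [2, 3, 4, 5, 1]
```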

The variations between each MLIR system's rank list and the user's rank list can be seen in Table IV. Precision and recall are not sensitive enough to expose the important differences between these systems, because precision and recall are not continuous. ADM and NDPM are continuous metrics, and the NDM metric is also continuous; we therefore compare NDM with ADM and NDPM in Table V.

TABLE V. COMPARE NDM WITH ADM AND NDPM

        ADM   NDPM  NDM
MLIR1  0.84  0.60  0.653
MLIR2  0.92  0.80  0.868
MLIR3  0.92  0.80  0.885
MLIR4  0.60  0.90  0.957
MLIR5  0.60  0.90  0.902
MLIR6  0.92  0.80  0.888

In MLIR5 the first two top-ranked documents are swapped, while in MLIR4 the last two ranked documents are swapped. The ordering of relevant documents is almost the same in MLIR4 and MLIR5, so their NDM values, 0.957 and 0.902, are close; yet NDM still separates the two systems, whereas ADM and NDPM assign them identical values and show no difference in performance. In MLIR2 and MLIR3 the 2nd, 3rd and 4th documents are interchanged among themselves. The NDM value for MLIR2 is lower than that for MLIR3 because MLIR2 promotes the user's 4th document to the 2nd position, and displacements near the top of the list incur larger penalties. MLIR6 gives good performance. MLIR1 gives the worst performance because the user's lowest-ranked document is placed at the top of the list, pushing every more relevant document down. The differences between NDM, NDPM, and ADM are shown in Fig. 1.
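For reference, the NDM column of Table V can be recomputed from the Table III scores using the ndm and scores_to_ranks sketches given earlier; under the α = 1 and mirror-image threshold assumptions, the recomputed values agree with Table V to within 0.001 (the small differences are presumably rounding).

```python
# Assumes the ndm() and scores_to_ranks() sketches from the earlier sections.
user_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
system_scores = {
    "MLIR1": [0.8, 0.7, 0.6, 0.5, 0.9], "MLIR2": [0.9, 0.7, 0.6, 0.8, 0.5],
    "MLIR3": [0.9, 0.6, 0.8, 0.7, 0.5], "MLIR4": [0.9, 0.8, 0.7, 0.5, 0.6],
    "MLIR5": [0.8, 0.9, 0.7, 0.6, 0.5], "MLIR6": [0.9, 0.7, 0.8, 0.5, 0.6],
}
user_ranks = scores_to_ranks(user_scores)
for name, scores in system_scores.items():
    print(name, round(ndm(scores_to_ranks(scores), user_ranks), 3))
# Output: MLIR1 0.654, MLIR2 0.869, MLIR3 0.885, MLIR4 0.957, MLIR5 0.902, MLIR6 0.888
```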

Fig. 1: The performance of NDM is compared with ADM and NDPM.

V. CONCLUSION

The proposed new MLIR metric is based on a rank schema, and six different cases in the ranking system have been explained. Our analysis shows that NDM performs better than ADM and NDPM. There may be cases where the NDM measure for an MLIR system crosses the threshold values; in such cases, instead of calculating NDM, it is enough to calculate the numerator alone, i.e., the distance between the MLIR and user rank systems, and the performance of an MLIR system is then better when this numerator is low.

REFERENCES

[1] B. Zhou and Y. Yao, "Evaluating information retrieval system performance based on user preference," Journal of Intelligent Information Systems, vol. 34, no. 3, pp. 227-248, June 2010.
[2] K. Kishida, "Technical issues of cross-language information retrieval: a review," Information Processing and Management, vol. 41, no. 3, pp. 433-455, May 2005.
[3] S. Mizzaro, "A new measure of retrieval effectiveness (or: What's wrong with precision and recall?)," in International Workshop on Information Retrieval, pp. 43-52, 2001.
[4] K. Järvelin and J. Kekäläinen, "Cumulated gain-based evaluation of IR techniques," ACM Transactions on Information Systems, vol. 20, no. 4, pp. 422-446, October 2002.
[5] Y. Y. Yao, "Measuring retrieval effectiveness based on user preference of documents," Journal of the American Society for Information Science, vol. 46, no. 2, pp. 133-145, March 1995.
[6] W. C. Lin and H. H. Chen, "Merging results by using predicted retrieval effectiveness," Lecture Notes in Computer Science, pp. 202-209, 2004.
[7] J. Savoy, "Combining multiple strategies for effective monolingual and cross-lingual retrieval," Information Retrieval Journal, vol. 7, no. 1-2, pp. 121-148, 2004.
[8] W. C. Lin and H. H. Chen, "Merging mechanisms in multilingual information retrieval," in C. Peters (Ed.), Working Notes for the CLEF 2002 Workshop, pp. 97-102, 2002.
[9] R. M. Aceves-Pérez, M. Montes-y-Gómez, L. Villaseñor-Pineda, and A. Ureña-López, "Two approaches for multilingual question answering: Merging passages vs. merging answers," International Journal of Computational Linguistics and Chinese Language Processing, vol. 13, no. 1, pp. 27-40, March 2008.
[10] F. Martínez-Santiago, M. Martín, and L. A. Ureña, "SINAI at CLEF 2002: Experiments with merging strategies," in C. Peters (Ed.), Proceedings of the CLEF 2002 Cross-Language Text Retrieval System Evaluation Campaign, Lecture Notes in Computer Science, pp. 103-110, 2002.
[11] E. Airio, H. Keskustalo, T. Hedlund, and A. Pirkola, "Multilingual experiments of UTA at CLEF 2003 - the impact of different merging strategies and word normalizing tools," CLEF 2003, Trondheim, Norway, 21-22 August 2003.
[12] W. C. Lin and H. H. Chen, "Merging mechanisms in multilingual information retrieval," in Advances in Cross-Language Information Retrieval: Third Workshop of the Cross-Language Evaluation Forum, CLEF 2002, Lecture Notes in Computer Science, vol. 2785, Rome, Italy, September 19-20, 2002, pp. 175-186.
[13] A. Le Calvé and J. Savoy, "Database merging strategy based on logistic regression," Information Processing and Management, vol. 36, pp. 341-359, May 2000.
