Transcript
Page 1: Evaluating Effectiveness of Information Retrieval System

Evaluating Effectiveness of Information Retrieval System

Jing He, Yue Lu
April 21, 2023

Page 2: Evaluating Effectiveness of Information Retrieval System

Outline

• System Oriented Evaluation
  – Cranfield Paradigm
  – Measures
  – Test Collection
  – Incomplete Relevance Judgment Evaluation

• User Oriented Evaluation

Page 3: Evaluating Effectiveness of Information Retrieval System

Cranfield Paradigm

• Established by [Cleverdon et al. 66]
• Test collection
  – Document collection
  – Topic set
  – Relevance judgment

• Measures

Page 4: Evaluating Effectiveness of Information Retrieval System

Measure: Binary Relevance

• Binary retrieval
  – precision and recall
• Ranked retrieval (see the code sketch after this list)
  – P-R curve
    • full information
    • the measures below summarize it
  – P@N: insensitive, only local information, does not average well
  – Average Precision (AP)
    • geometric interpretation
    • utility interpretation
  – R-precision: break-even point; approximates the area under the P-R curve
  – RR (reciprocal rank): appropriate for known-item search
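A minimal sketch of these ranking measures in Python, assuming the ranked list is given as binary relevance labels (1 = relevant, 0 = not relevant) and that the total number of relevant documents for the topic is known; the function names and the toy example are illustrative, not taken from any cited paper.

# Binary-relevance ranking measures over a ranked list of 0/1 labels.
def precision_at_n(rels, n):
    return sum(rels[:n]) / n

def average_precision(rels, num_rel):
    # Mean of the precision values at the ranks of the relevant documents.
    hits, total = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / i
    return total / num_rel if num_rel else 0.0

def r_precision(rels, num_rel):
    # Precision at rank R, where R is the number of relevant documents.
    return precision_at_n(rels, num_rel) if num_rel else 0.0

def reciprocal_rank(rels):
    for i, r in enumerate(rels, start=1):
        if r:
            return 1.0 / i
    return 0.0

ranking = [1, 0, 1, 0, 0, 1]          # toy ranked list
print(precision_at_n(ranking, 3))      # 0.666...
print(average_precision(ranking, 3))   # (1/1 + 2/3 + 3/6) / 3 ≈ 0.722
print(reciprocal_rank(ranking))        # 1.0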

Page 5: Evaluating Effectiveness of Information Retrieval System

Measure: Graded Relevance

• Discounted Cumulated Gain (DCG)
  – the discount function discountFun is log_b

    DG_i = discountFun(relevanceFun(d_i), i)
    DCG = sum_i DG_i

• RBP (Rank-Biased Precision)
  – assumes the user stops at each document with a fixed probability; p below is the complementary persistence (continue) probability

    RBP = (1 - p) * sum_i r_i * p^(i-1)
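A small sketch of both graded-relevance measures in Python; the gain function (gain = graded label) and the log base b = 2 are assumptions, not fixed by the slide, and the RBP parameter p is the persistence probability.

import math

def dcg(gains, b=2):
    # DCG = sum_i DG_i with DG_i = gain_i / log_b(i + 1), rank i starting at 1.
    return sum(g / math.log(i + 1, b) for i, g in enumerate(gains, start=1))

def rbp(rels, p=0.8):
    # RBP = (1 - p) * sum_i r_i * p^(i - 1), with r_i in [0, 1].
    return (1 - p) * sum(r * p ** (i - 1) for i, r in enumerate(rels, start=1))

graded = [3, 2, 0, 1]      # toy graded labels
binary = [1, 1, 0, 1]      # toy binary labels
print(dcg(graded))         # 3 + 2/log2(3) + 0 + 1/log2(5)
print(rbp(binary, p=0.8))  # 0.2 * (1 + 0.8 + 0 + 0.512) = 0.4624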

Page 6: Evaluating Effectiveness of Information Retrieval System

Measure: Topic Set Integration

• Arithmetic average
  – MAP, MRR, average P-R curve
  – P@n and DCG need to be normalized before averaging
• Geometric average
  – GMAP: emphasizes difficult topics
• Standardization [Webber et al. SIGIR08]
  – per-topic scores are averaged after being standardized against a normal distribution (see the sketch below)
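A sketch of the three aggregation schemes, assuming the per-topic scores of one system are held in a list; the GMAP epsilon and the reference statistics used for standardization are illustrative conventions, not mandated by the cited work.

import math

def mean_score(scores):
    # Arithmetic average over topics (e.g. MAP from per-topic AP values).
    return sum(scores) / len(scores)

def gmap(scores, eps=1e-5):
    # Geometric average; the epsilon keeps zero-score (hard) topics from
    # collapsing the product, which is what gives GMAP its focus on them.
    return math.exp(sum(math.log(s + eps) for s in scores) / len(scores))

def standardized_mean(scores, topic_means, topic_stdevs):
    # Per-topic z-scores against reference statistics (e.g. computed over a
    # set of runs), then averaged, in the spirit of Webber et al. SIGIR08.
    z = [(s - m) / sd for s, m, sd in zip(scores, topic_means, topic_stdevs)]
    return sum(z) / len(z)

ap_per_topic = [0.42, 0.05, 0.71, 0.0]
print(mean_score(ap_per_topic))  # MAP
print(gmap(ap_per_topic))        # GMAP, dominated by the hardest topic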

Page 7: Evaluating Effectiveness of Information Retrieval System

Measure: Compare Systems

• Difference in aggregated scores
  – depends on the number of topics, topic difficulty, etc.
• Significance tests
  – factors
    • null hypothesis: the two systems have identical performance
    • test statistic: the difference in aggregated scores
    • significance level
  – the t-test, randomization test, and bootstrap test agree with each other and are more powerful than the sign and Wilcoxon tests [Smucker et al. CIKM07, Cormack et al. SIGIR07] (a randomization-test sketch follows below)
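A minimal paired randomization (permutation) test over per-topic score differences, as a sketch of the kind of test compared in those papers; the two-sided criterion, the number of trials, and the toy scores are assumptions.

import random

def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    # Under the null hypothesis that the two systems are identical, each
    # topic's pair of scores is exchangeable, so we randomly flip the sign of
    # each per-topic difference and count how often the mean difference is at
    # least as extreme as the observed one (two-sided).
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(trials):
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(permuted)) / len(permuted) >= observed:
            hits += 1
    return hits / trials

ap_system1 = [0.31, 0.42, 0.10, 0.55, 0.27]
ap_system2 = [0.25, 0.40, 0.12, 0.48, 0.20]
print(randomization_test(ap_system1, ap_system2))  # estimated p-value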

Page 8: Evaluating Effectiveness of Information Retrieval System

Measure: How Good?

• Relationships
  – Correlation
    • correlation between the system performance rankings produced by different measures: use Kendall's τ or a variant (see the sketch below) [Yilmaz et al. SIGIR08]
    • all measures are highly correlated, especially AP, R-precision, and nDCG with a fair weight setting [Voorhees TREC99, Kekalainen IPM05]
  – Inference ability [Aslam et al. SIGIR05]
    • does the score under measure m1 determine the score under measure m2?
    • AP and R-precision can infer P@n, but the converse does not hold
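A sketch of the ranking-correlation comparison: score the same set of systems under two measures, rank them, and compute Kendall's τ between the two rankings. The scipy dependency and the toy scores are assumptions.

from scipy.stats import kendalltau

# Scores of the same five systems under two different measures (toy values).
map_scores = {"sysA": 0.31, "sysB": 0.27, "sysC": 0.40, "sysD": 0.22, "sysE": 0.35}
ndcg_scores = {"sysA": 0.52, "sysB": 0.49, "sysC": 0.61, "sysD": 0.45, "sysE": 0.55}

systems = sorted(map_scores)
tau, p_value = kendalltau([map_scores[s] for s in systems],
                          [ndcg_scores[s] for s in systems])
print(f"Kendall tau between the two system rankings: {tau:.3f} (p = {p_value:.3f})")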

Page 9: Evaluating Effectiveness of Information Retrieval System

Test Collection: Documents and Topics

• Document collection
  – newspaper, newswire, etc.; web page sets
  – 1 billion pages (25 TB) for the TREC09 Web track
  – but not much research on how to construct a document collection for IR evaluation
• Topic set
  – human-designed topics, or queries from search engine query logs
  – how to select discriminative topics?
    • [Mizzaro and Robertson SIGIR07] propose a method, but it can only be applied a posteriori

Page 10: Evaluating Effectiveness of Information Retrieval System

Test Collection: Relevance Judgments (1)

• Judgment agreement
  – inter-assessor agreement rates are low
  – system performance rankings are stable between the topic originator and experts on the topic, but not for other assessors [Bailey et al. SIGIR08]; rankings are also stable across TREC assessors [Voorhees and Harman 05]

Page 11: Evaluating Effectiveness of Information Retrieval System

Test Collection: Relevance Judgments (2)

• How to select documents to judge
  – pooling (introduced by [Jones et al. 76]); see the sketch below
  – limitations of pooling
    • biased toward the contributing systems
    • biased toward title words
    • not efficient enough
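A sketch of depth-k pooling for a single topic: the judgment pool is the union of the top-k documents from each contributing run. The run names and depth are illustrative.

def build_pool(runs, depth=100):
    # `runs` maps a run name to its ranked list of document ids for one topic;
    # the pool to be judged is the union of each run's top-`depth` documents.
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool

runs = {
    "run1": ["d3", "d7", "d1", "d9"],
    "run2": ["d7", "d2", "d3", "d8"],
}
print(sorted(build_pool(runs, depth=3)))  # ['d1', 'd2', 'd3', 'd7']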

Page 12: Evaluating Effectiveness of Information Retrieval System

Incomplete Relevance Judgment Evaluation

• Motivation
  – dynamic, growing collections vs. constant human labor
• Problems
  – do traditional measures still work?
  – how to select documents to judge?

Page 13: Evaluating Effectiveness of Information Retrieval System

Incomplete Problem: Measures

• Buckley and Voorhees's bpref [Buckley and Voorhees SIGIR04] (see the sketch below)
  – penalizes each relevant document by the number of judged irrelevant documents ranked above it
• Sakai's condensed measures [Sakai SIGIR07]
  – simply remove the unjudged documents before computing the standard measures
• Yilmaz and Aslam's infAP [Yilmaz et al. CIKM06, SIGIR08]
  – estimates average precision assuming the judged documents are a uniform random sample
• Results
  – infAP, the condensed measures, and nDCG are more robust than bpref when judgments are randomly sampled from the pool
  – infAP is more appropriate for estimating the absolute AP value
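A sketch of bpref in the spirit of Buckley and Voorhees, computed only over judged documents: R is the number of judged relevant documents, N the number of judged non-relevant documents, and unjudged documents are simply skipped. The qrels format and the toy ranking are illustrative.

def bpref(ranked_docs, qrels):
    # `qrels` maps each judged doc id to 1 (relevant) or 0 (non-relevant).
    R = sum(1 for v in qrels.values() if v == 1)
    N = sum(1 for v in qrels.values() if v == 0)
    if R == 0:
        return 0.0
    denom = min(R, N) if min(R, N) > 0 else 1
    nonrel_above = 0
    total = 0.0
    for doc in ranked_docs:
        if doc not in qrels:
            continue                     # unjudged: ignore entirely
        if qrels[doc] == 1:
            # Penalty grows with the judged non-relevant docs ranked above.
            total += 1.0 - min(nonrel_above, denom) / denom
        else:
            nonrel_above += 1
    return total / R

ranking = ["d1", "d2", "d3", "d4", "d5"]
qrels = {"d1": 0, "d2": 1, "d4": 1, "d5": 0}   # d3 is unjudged
print(bpref(ranking, qrels))                    # 0.5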

Page 14: Evaluating Effectiveness of Information Retrieval System

Incomplete Problem: Selecting Documents to Judge (1)

• Aslam's statAP [Aslam et al. SIGIR06, Allan et al. TREC07, Yilmaz et al. SIGIR08]
  – extension of infAP (which is based on uniform sampling)
  – uniform sampling finds too few relevant documents
  – stratified sampling
    • higher sampling probability for documents ranked highly by more retrieval systems (like voting); see the sketch below
  – AP is then estimated from the sampled judgments
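Not the statAP estimator itself, but a sketch of the voting idea behind the stratified sampling: documents that many runs rank highly get a higher inclusion probability, and counts over the judged sample are then corrected by the inverse inclusion probability (Horvitz-Thompson style). The weighting scheme, the budget scaling, and all names are illustrative assumptions.

import random

def inclusion_probs(runs, depth=100, budget=50):
    # Voting weight: a document gains weight the higher and the more often it
    # is ranked across runs; weights are scaled so the expected sample size is
    # roughly `budget`, then capped at 1.
    weights = {}
    for ranked_docs in runs.values():
        for rank, doc in enumerate(ranked_docs[:depth], start=1):
            weights[doc] = weights.get(doc, 0.0) + 1.0 / rank
    scale = budget / sum(weights.values())
    return {doc: min(1.0, w * scale) for doc, w in weights.items()}

def estimate_num_relevant(probs, judge, seed=0):
    # Include each pooled document independently with its inclusion
    # probability, judge only the sampled ones, and weight each relevant
    # judgment by 1 / probability.
    rng = random.Random(seed)
    estimate = 0.0
    for doc, p in probs.items():
        if rng.random() < p and judge(doc):
            estimate += 1.0 / p
    return estimate

runs = {"run1": ["d1", "d2", "d3"], "run2": ["d2", "d4", "d1"]}
probs = inclusion_probs(runs, depth=3, budget=2)
print(estimate_num_relevant(probs, judge=lambda d: d in {"d2", "d4"}))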

Page 15: Evaluating Effectiveness of Information Retrieval System

Incomplete Problem: Selecting Documents to Judge (2)

• Carterette's minimal test collection
  – select the most "discriminative" document to judge next
  – how to define "discriminative"?
    • by how much the bounds on the AP difference between two systems change once the relevance of this document is known
  – AP is then estimated from the judged documents

Page 16: Evaluating Effectiveness of Information Retrieval System

Incomplete Problem

• With incomplete judgments, it is more reliable to use more queries with fewer judgments per query

• statAP is more appropriate for estimating the absolute AP value

• Minimal test collection is more appropriate for discriminating between systems

Page 17: Evaluating Effectiveness of Information Retrieval System

User Oriented Evaluation: Alternative to Batch-Mode Evaluation

• Conduct user studies [Kagolovsky&Moehr 03]
  o actual users use the system and assess the quality of the search process and results
• Advantages:
  o allows us to see the actual utility of the system, and provides more interpretability in terms of its usefulness
• Deficiencies:
  o difficult to compare two systems reliably in the same context
  o expensive to invite many users to participate in the experiments

Page 18: Evaluating Effectiveness of Information Retrieval System

Criticism of Batch-Mode Evaluation [Kagolovsky&Moehr 03][Harter&Hert ARIST97]

• Expensive judgments
  o obtaining relevance judgments is time consuming
  o how to overcome? predict relevance from implicit information, which is easy to collect with real systems
• Judgment = user need?
  o judgments may not represent real users' information needs, so the evaluation results may not reflect the real utility of the system
  o does batch evaluation correlate well with user evaluation?

Page 19: Evaluating Effectiveness of Information Retrieval System

Expensive judgments (1)

• [Carterette&Jones 07NIPS]
  o predict the relevance score (nDCG) using clicks after an initial training phase
  o can identify the better of two rankings 82% of the time with no relevance judgments, and 94% of the time with only two judgments per query
• [Joachims 03TextMining]
  o compare two systems using click-through data on a mixed ranking list generated by interleaving the results of the two systems (see the sketch below)
  o the results closely followed the relevance judgments using P@n
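A sketch of the interleaving idea from the Joachims bullet: merge the two rankings into one list, remember which system contributed each displayed document, and credit clicks to the contributing system. This is a simple alternating interleave with illustrative names, not the exact balanced-interleaving procedure from the paper.

def interleave(ranking_a, ranking_b):
    # Alternate picks between the two rankings, skip documents already shown,
    # and record which system contributed each displayed document.
    merged, credit, seen = [], {}, set()
    a, b = list(ranking_a), list(ranking_b)
    sources = [("A", a), ("B", b)]
    turn = 0
    while a or b:
        label, ranking = sources[turn % 2]
        while ranking and ranking[0] in seen:
            ranking.pop(0)               # drop duplicates already displayed
        if ranking:
            doc = ranking.pop(0)
            merged.append(doc)
            credit[doc] = label
            seen.add(doc)
        turn += 1
    return merged, credit

def score_clicks(clicked_docs, credit):
    # Count clicks per contributing system; more clicks = preferred system.
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in credit:
            wins[credit[doc]] += 1
    return wins

merged, credit = interleave(["d1", "d2", "d3"], ["d2", "d4", "d1"])
print(merged)                              # ['d1', 'd2', 'd3', 'd4']
print(score_clicks(["d2", "d4"], credit))  # {'A': 0, 'B': 2}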

Page 20: Evaluating Effectiveness of Information Retrieval System

Expensive judgments (2)

• [Radlinski et al 08CIKM]
  o "absolute usage metrics" (such as clicks per query or the frequency of query reformulations) fail to reflect retrieval quality
  o "paired comparison tests" produce reliable predictions
• Summary
  o reliable pair-wise comparisons are available
  o reliable absolute prediction of relevance scores is still an open research question

Page 21: Evaluating Effectiveness of Information Retrieval System

Judgment = user need? (1)

• Negative correlation
  o [Hersh et al 00SIGIR] 24 users on 6 instance-recall tasks
  o [Turpin&Hersh 01SIGIR] 24 users on 6 QA tasks
  o neither study found a significant difference in user task effectiveness between systems with significantly different MAP
  o the small number of topics may explain why no correlation was detected
• Mixed correlation
  o [Turpin&Scholer 06SIGIR] two experiments on 50 queries:
  o one precision-based user task (finding the first relevant document)
  o one recall-based user task (number of relevant documents found in five minutes)
  o results: no significant relationship between system effectiveness and user effectiveness in the precision task, and a significant but weak relationship in the recall-based task

Page 22: Evaluating Effectiveness of Information Retrieval System

Judgment = user need? (2)

• Positive correlation
  o [Allan et al 05SIGIR] 33 users, 45 topics: differences in bpref (0.5-0.98) could result in significant differences in user effectiveness at retrieving faceted document passages
  o [Huffman&Hochster 07SIGIR] 7 participants, 200 Google queries: assessor satisfaction correlates fairly strongly with the relevance of the top three documents, measured with a version of nDCG
  o [Al-Maskari et al 08SIGIR] 56 users, a recall-based task on 56 queries over "good" and "bad" systems; the authors showed that user effectiveness (time consumed, relevant documents collected, queries issued, satisfaction, etc.) and system effectiveness (P@n, MAP) are highly correlated

Page 23: Evaluating Effectiveness of Information Retrieval System

Judgment = user need? (3)

• Summary
  o although batch-mode relevance evaluation has limitations, most recent studies show a high correlation between user evaluation and system evaluation based on relevance measures

