Current Developments in Information Retrieval Evaluation

Thomas Mandl, Information Science
University of Hildesheim

Tutorial @ ECIR, Toulouse, 6th Apr. 2009
Who am I?
• Assistant Professor at University of Hildesheim
• Studies at University of Regensburg, Germany and
University of Illinois at Urbana-Champaign, USA
• PhD on Neural Networks in IR from University of
Hildesheim
• Postdoc Thesis (Habilitation) 2006 on Quality in
Web IR from University of Hildesheim
• Research on IR
– Participant at CLEF since 2002
– Track Coordinator at CLEF since 2006
• Which system is better?
• Management approach?
Different Query Types – Different Evaluation
• Navigational
– In search of a homepage of company X
• Informational
– Yellow-Pages-Queries
– Question answering
– Ad-hoc (Searching everything concerning topic X)
"There must be some fundamental understanding of what it means to be good
and what it means to be better"
(Bollmann/Cherniavsky 1983, 3)
[Figure: model of the IR process — an author creates documents (the document corpus); indexing yields an object-attribute matrix; an information seeker formulates a query, which is indexed into a query representation; a similarity calculation between the two representations produces the result documents, which are the object of evaluation.]
Rough Outline
• Cranfield
• Metrics
• Topics
• Users
Overview
• Cranfield Paradigm
  – Introduction
  – Validity
• Evaluation Metrics
  – Binary relevance
  – Multi-level relevance
• Evaluation Initiatives
• Topic Specific Analysis
  – Results
  – Optimization
• User Studies
• Bonus:
  – Site Search Evaluation
  – Hands-on Activities
PART 1
Perspectives on the
Cranfield paradigm
Why evaluation?
• IR systems: numerous components, models and approaches
• It is not possible to predict the effectiveness for a certain collection
• No general preference for a model or a certain component has been proven
• The evaluation of effectiveness is crucial
• A holistic evaluation of retrieval processes is difficult
• Success and satisfaction of the users should be the ideal benchmark
Why evaluation?
• User satisfaction
  – Retrieved documents help to satisfy the user's information need
  – User interface
  – System reaction time
  – Adaptivity
• User-oriented evaluation is very complex and difficult
  – individual and subjective impacts
• Mostly evaluation of retrieval systems
  – User as "constant"
  – Replaced by prototypical users (experts)
  – Cranfield Paradigm of evaluation
Recall and Precision
$$\text{Recall} = \frac{\text{number of retrieved relevant documents}}{\text{number of relevant documents}}$$

$$\text{Precision} = \frac{\text{number of retrieved relevant documents}}{\text{number of retrieved documents}}$$
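To make the two definitions concrete, here is a minimal sketch in Python (my own illustration; the function name and document ids are invented, not part of the tutorial):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for a single query.

    retrieved: set of document ids returned by the system
    relevant:  set of document ids judged relevant
    """
    found_relevant = len(retrieved & relevant)
    precision = found_relevant / len(retrieved) if retrieved else 0.0
    recall = found_relevant / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of 5 retrieved documents are relevant, out of 6 relevant overall
p, r = precision_recall({1, 2, 3, 4, 5}, {2, 4, 5, 7, 8, 9})
print(p, r)  # 0.6 0.5
```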
• "The ability of the retrieval system to uncover
relevant documents is known as the recall
power of the system" (Lancaster 1968, 55)
• Which Retrieval model is the basis for Recall
and Precision?
Examples
CLEF year  Task          Topic language  Runs  Correlation
2000       Multilingual  English           21         0.26
2001       Bilingual     German             9         0.44
2001       Multilingual  German             5         0.19
2001       Bilingual     English            3         0.20
2001       Multilingual  English           17        -0.34
2002       Bilingual     German             4         0.33
2002       Multilingual  German             4         0.43
2002       Bilingual     English           51         0.40
2002       Monolingual   German            21         0.45
2002       Monolingual   Spanish           28         0.21
2003       Monolingual   German            30         0.37
2003       Monolingual   Spanish           38         0.39
2003       Monolingual   English           11         0.16
2002       Multilingual  English           32         0.29
2003       Bilingual     German            24         0.21
2003       Bilingual     English            8         0.41
2003       Multilingual  English           74         0.31
[Figure: recall-precision graph]
Determination of the "measuring points": precision is mostly taken at
recall levels 0.1, 0.2, 0.3, ...
-> arithmetic mean ->
Average Precision (AP)
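A sketch of this procedure for a single ranked result list (illustrative code, not from the tutorial; the 11 standard recall levels 0.0 to 1.0 are assumed):

```python
def eleven_point_ap(ranking, relevant):
    """Average of interpolated precision at recall 0.0, 0.1, ..., 1.0.

    ranking:  list of document ids in the order returned by the system
    relevant: set of document ids judged relevant for the topic
    """
    # (recall, precision) after each relevant document in the ranking
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    # interpolation: precision at level r = best precision at recall >= r
    levels = [i / 10 for i in range(11)]
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]
    return sum(interpolated) / len(levels)

# toy example: the two relevant documents appear at ranks 1 and 3
print(eleven_point_ap(["d1", "d2", "d3"], {"d1", "d3"}))
```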
Overview
• Cranfield Paradigm
– Introduction
– Validity
• Evaluation Metrics
– Binary relevance
– Multi-level relevance
• Evaluation initiatives
• View on single queries
– Analysis
– Topic specific Optimisation
• User studies
Relevance
• Situational relevance describes the (actual) utility of documents with respect to the information need
  – virtually impossible to capture
  – rather a theoretical construct
• Pertinence is the utility perceived by the user with respect to her/his information need
cf. Fuhr 2003
Relevance
• Objective relevance is the relation between the information need and the document, as judged by one or several neutral observers
  – Common basis of system evaluation!
  – How objective can this be?
• System relevance denotes the relevance of the document with respect to the formal query, as estimated by a system (= similarity); commonly called the retrieval value (English: Retrieval Status Value, RSV)
cf. Fuhr 2003
Estimation of the Recall
• Precision is directly evident for every user of an IR system
• Recall, however, is neither evident for the user nor is it possible to determine it precisely with reasonable effort
  – The number of relevant documents is unknown
  – This is especially problematic for information needs which aim at a high recall (e.g. patent novelty search)
cf. Fuhr 2003
Estimation of the Recall
• Pooling method (retrieval with several systems)
  – Apply several IR systems to the same set of documents and mix the results of the different systems
  – Mostly strong overlap in the answer sets of the different systems, so that the effort does not increase linearly with the number of analysed systems
cf. Fuhr 2003
Relevant / not relevant
• Binary relevance decisions are often criticised
• New metrics for multi-level relevance are being discussed
  – Binary judgments prevail
  – They often lead to similar results
  – More later
Evaluation
Cranfield Paradigm of evaluation in Information Retrieval
• To find objective evaluation standards for a comparison of systems
• Keep the conditions for comparison constant
• Systems work with the same document corpus, the same information needs and the same relevance judgments
• Abstraction from usage situation and context
Evaluation
Cranfield Paradigm of evaluation in Information Retrieval
• Objective relevance is judged by a neutral juror
• Relation between the expressed information need and the document
• No individual and subjective relevance assessment in a situational context
• Currently, the basis of all evaluation initiatives in
Information Retrieval (TREC, CLEF, NTCIR, INEX, ...)
TREC: Text REtrieval Conference
• "TREC is a new ballgame for IR research and
development" (Sparck Jones 1994)
• Evaluation initiative of the National Institute of
Standards and Technology (NIST) in the USA
• 1992: TREC-1 (Proceedings 1993)
Cross-Language Evaluation Forum (CLEF)
• EU funding: DELOS NoE for Digital Libraries (Mandl et al. @ CLEF 2003-2006)
• Research on evaluation
• System development
• Test environment
• Research on cross- and multilingual information retrieval systems
• Benchmarks
Example Topic
<top>
  <num>10.2452/89-GC</num>
  <title>Trade fairs in Lower Saxony</title>
  <desc>Documents reporting about industrial or cultural
  fairs in Lower Saxony.</desc>
  <narr>Relevant documents should contain information
  about trade or industrial fairs which take place in the
  German federal state of Lower Saxony, i.e. name, type and
  place of the fair. The capital of Lower Saxony is Hanover.
  Other cities include Braunschweig, Osnabrück, Oldenburg
  and Göttingen.</narr>
</top>
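To give a feel for how such topic files are processed, a hypothetical sketch with Python's standard ElementTree; the file name and the assumption that all <top> elements sit under one root element are mine:

```python
import xml.etree.ElementTree as ET

# hypothetical topic file holding a list of <top> elements under one root
root = ET.parse("geoclef_topics.xml").getroot()
for top in root.iter("top"):
    num = top.findtext("num").strip()
    title = top.findtext("title").strip()
    desc = top.findtext("desc").strip()
    print(num, "-", title)
```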
Objectives of evaluation initiatives
• To find consistent evaluation standards for
retrieval systems (Standardisation)
• To provide comparison between different
systems
• To advance further development of IR
systems
• To consider the needs of the community
• To advance the evaluation methodology
Procedure
• Test basis
  – objects (documents, ...)
  – queries (topics)
    • relevant information needs for potential users
    • consistent relevance assessment
• Time frame
  – Release of topics
  – Submission of results
  – Publication of results
Document collection
• Representative of a real-world task
  – Large
  – Diverse
• Often used: news agency and newspaper collections
Relevance Judgment
• Abstraction from the individual user and his context
• Consistent evaluation
• Objective jurors, who are not in the user's situation
• Objective conclusions about the relation in content between topic and document
Pooling Method
1. Jurors create topics
2. Systems provide the top 1000 documents for every topic
3. Pooling of all documents found at least once
4. Relevance assessment by jurors
Ellen Voorhees – CLEF 2001 Workshop
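A minimal sketch of the pooling step itself (illustrative; the run and document ids are invented, and a real campaign would read the submitted run files instead):

```python
def build_pool(runs, depth=1000):
    """Union of the top-`depth` documents of every submitted run, per topic.

    runs: list of dicts mapping topic id -> ranked list of document ids
    """
    pool = {}
    for run in runs:
        for topic, ranking in run.items():
            pool.setdefault(topic, set()).update(ranking[:depth])
    return pool  # these pooled documents are then judged by the jurors

run_a = {"89-GC": ["doc1", "doc2", "doc3"]}
run_b = {"89-GC": ["doc2", "doc4"]}
print(build_pool([run_a, run_b], depth=2))  # {'89-GC': {'doc1', 'doc2', 'doc4'}}
```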
Procedure
• Intellectual evaluation
  – relevant or not relevant
• Statistical analysis
Overview
• Cranfield Paradigm
  – Introduction
  – Validity
• Evaluation Metrics
  – Binary relevance
  – Multi-level relevance
• Evaluation Initiatives
• Topic Specific Analysis
  – Results
  – Optimization
• User Studies
How reliable is the
evaluation according to
the Cranfield-Paradigm?
GeoCLEF Monolingual English

[Figure: results; bilingual runs reach 76% of monolingual performance]
Relevance Assessment
Indirect information
• "foreign aid in Sub-Saharan Africa"
  – Is a document on the kidnapping of an aid worker relevant?
• "natural disasters in the Western USA"
  – Is a document on the insurance costs caused by a natural disaster relevant?
Interrater Reliability
• Isn’t relevance a rather subjective concept?
• Is there actually a consistency/agreement, if
several jurors evaluate the same set of
documents?
• Wouldn’t this lead to totally different
results?
• Asian approach?
Comparison
• Several sets of assessments could be created by independent jurors
• Several rankings of systems are created
  – using alternative sets of relevance assessments
• How strongly do they differ?
• How to compare rankings?
Comparison of rankings
• Rank correlation coefficients
  – Number of position changes (swaps)
  – Kendall's Tau
  – Spearman coefficient
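A sketch of Kendall's Tau computed directly from its definition via pairwise comparisons (my own illustration; system names are invented):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's Tau between two rankings of the same systems (no ties).

    rank_a, rank_b: dicts mapping system name -> rank position
    """
    concordant = discordant = 0
    for s, t in combinations(rank_a, 2):
        agree = (rank_a[s] - rank_a[t]) * (rank_b[s] - rank_b[t])
        if agree > 0:
            concordant += 1
        else:
            discordant += 1  # a "swap" between the two rankings
    return (concordant - discordant) / (concordant + discordant)

a = {"sysA": 1, "sysB": 2, "sysC": 3}
b = {"sysA": 1, "sysB": 3, "sysC": 2}
print(kendall_tau(a, b))  # 0.333...: one of three pairs is swapped
```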
Subjectivity of Jurors
• Topic Developer is the primary juror
• For TREC 4, the documents in the pool were
evaluated several times
– Evaluation by primary jurors
– 200 relevant (as far as available) + 200 randomly
chosen documents form the new pool
– Two more jurors evaluate the new pool
Buckley & Voorhees 2005, http://doi.acm.org/10.1145/290941.291017
Subjectivity of Jurors
• Results
  – Overlap: between 30% and 40%
  – Example: for Topic 219, no document was classified as relevant by any of the three jurors
  – But: less than 3% of the documents that were initially rated as not relevant were rated as relevant afterwards
Buckley & Voorhees 2005, http://doi.acm.org/10.1145/290941.291017
Changes of absolute values
Correlation between Rankings
Rankings rather robust
Subjectivity of Jurors
• Does the result depend on the jurors who evaluate the relevance of a document?
  – Actually, jurors evaluate differently
  – The absolute values for the system performance change
  – This does not have a large impact on the order of the systems
  – The comparison between the systems turns out very similarly, independent of the jurors
Buckley & Voorhees 2005
Expertise Level of Juror
• What is the effect of the knowledge of the juror?
• Experiment with three different groups of jurors
  – Gold: Topic developers, task experts
  – Silver: Task experts
  – Bronze: Non-experts
• Data: Enterprise Track at TREC
  – Three relevance levels
(Bailey et al. 2008)
Expertise Level of Juror
• Typical level of agreement
  – Between 18% and 58%
  – E.g. 24% of the highly relevant docs as judged by Bronze were judged irrelevant by Gold jurors
• Effect on system orderings
  – 0.96 and 0.94 correlation between Gold and Silver (for two measures)
  – Only 0.73 and 0.66 between Gold and Bronze
• Task knowledge is important!
(Bailey et al. 2008)
Cranfield
• Test basis
  – Objects (documents, ...)
    • Sufficient?
  – Queries (topics)
    • Sufficient queries?
    • Large or small differences?
  – Consistent relevance assessment
    • Sufficient judgments?
    • Differences between jurors
  – Systems
    • What if more systems had participated? -> Pooling method
Questions and Doubts
– Queries (topics)
  • Sufficient queries?
  • Large or small differences?
– Relevance assessment
  • Sufficient number of judgments?
  • What if ...
    – fewer topics were available?
    – fewer relevance judgments were available?
– Analysis
Variability
• OBSERVATION: The difference in the performance
between the systems is far smaller than between
the topics
Example GeoCLEF 2006
[Figure: Variance of the systems in GeoCLEF 2006 (Mono DE, Mono EN, Mono ES, Mono PT)] Mandl 2008
Example GeoCLEF 2006
[Figure: Variance of the topics in GeoCLEF 2006 (Mono DE, Mono EN, Mono ES, Mono PT)] Mandl 2008
Significance tests
• Descriptive statistics: description of distributions, e.g. mean and variance
• Inferential statistics: evaluation of results
• Question: Do the statistical values differ significantly from each other?
Result of Significance tests
• The systems A and B do not differ significantly from each other
• With 95% probability there is a difference, and system A is better than system B
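A hedged sketch of how such a test is typically run on per-topic average precision values, here with SciPy's paired t-test; the score lists are invented for illustration:

```python
from scipy import stats

# invented per-topic average precision scores for two systems
ap_system_a = [0.31, 0.12, 0.45, 0.08, 0.27, 0.33, 0.19, 0.40]
ap_system_b = [0.25, 0.10, 0.44, 0.09, 0.20, 0.30, 0.15, 0.38]

# paired test, since the same topics were run on both systems;
# stats.wilcoxon would be a non-parametric alternative
t, p = stats.ttest_rel(ap_system_a, ap_system_b)
print(f"t = {t:.2f}, p = {p:.3f}")
# if p < 0.05 we reject "no difference" at the 95% level
```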
GeoCLEF Monolingual English

[Figure: results; bilingual runs reach 76% of monolingual performance]
Mandl et al. 2009
Harman 2005, http://www.haifa.il.ibm.com/sigir05-qp/index.html
Variance of systems
[Figure: Average Precision of the five best runs (BKGeoD2, BKGeoD1, FUHddGYYYTD, FUHddGYYYTDN, FUHddGYYYMTDN) and their average for ten topics at GeoCLEF 2006 (monolingual German)] Mandl 2008
Swap Rate
• Starting point: original ranking of the systems
  – Leave out topics
  – Create a ranking on the remaining topics
  – How often is a "worse" system in front of a better one?
  – Error rate
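A simplified sketch of this idea (my own illustration, not the exact procedure of Buckley & Voorhees): repeatedly draw topic subsets and count how often a pair of systems is ordered differently than on the full topic set.

```python
import random
from itertools import combinations

def swap_rate(scores, subset_size, trials=1000):
    """Fraction of system pairs whose order on a random topic subset
    contradicts their order on the full topic set.

    scores: dict system -> dict topic -> average precision
    """
    topics = list(next(iter(scores.values())))
    def mean_ap(system, topic_set):
        return sum(scores[system][t] for t in topic_set) / len(topic_set)
    swaps = pairs = 0
    for _ in range(trials):
        subset = random.sample(topics, subset_size)
        for a, b in combinations(scores, 2):
            full = mean_ap(a, topics) - mean_ap(b, topics)
            part = mean_ap(a, subset) - mean_ap(b, subset)
            if full * part < 0:  # the "worse" system ends up in front
                swaps += 1
            pairs += 1
    return swaps / pairs
```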
Buckley & Voorhees 2000
Smaller Topic Sets
Sanderson & Zobel 2005
Number of topics
• Do the common 50 topics suffice to compare the systems?
• A certain difference has to exist between two systems to prove statistically that one system is better than the other
  – With 50 topics this difference lies below 5% (absolute)
  – Partially even significantly below 5%
Sanderson & Zobel 2005
Subset Analysis
• Select n out of m topics
  – Many combinations are possible
  – Repeat the selection
  – Create several rankings for each subset size
  – Calculate the average and show the corridor size
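A sketch of this subset analysis (illustrative code; it uses SciPy's kendalltau for the rank correlation, and the data layout is my own assumption):

```python
import random
from scipy.stats import kendalltau

def subset_correlation(scores, n, repeats=100):
    """Average Kendall correlation between the system ranking on a
    random n-topic subset and the ranking on all topics.

    scores: dict system -> dict topic -> average precision
    """
    systems = list(scores)
    topics = list(next(iter(scores.values())))
    def mean_aps(topic_set):
        return [sum(scores[s][t] for t in topic_set) / len(topic_set)
                for s in systems]
    full = mean_aps(topics)
    taus = []
    for _ in range(repeats):
        subset = random.sample(topics, n)
        tau, _ = kendalltau(full, mean_aps(subset))
        taus.append(tau)
    return sum(taus) / len(taus)

# plotting this value for n = 1 .. number of topics gives a curve like
# the one on the following slide
```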
Example Correlation
[Figure: Average correlation of the rankings for smaller subsets of topics with the ranking for all topics (Monolingual Spanish, GeoCLEF 2006)]
Mandl 2008
• Statistical tests overestimate the error rate
  – compared to swap-rate analysis
• Tendency:
  – More topics
  – Fewer relevance judgments per topic
  – -> higher reliability of the system comparison
Sanderson & Zobel 2005
Is Pooling sufficient?
• Do the systems find all relevant documents?
• Or are there many relevant documents beyond the evaluated pools that are never found?
  – The pool has to have a certain quality
  – Single systems do not contribute very much
  – Leaving out single systems does not essentially change the ranking of the other systems
  – The comparison remains robust
for TREC: Zobel (1998)
for CLEF: Braschler (2003)
Is Pooling sufficient?
• Incompleteness ...
– More topics are more important than an
exhaustive evaluation of the pools
– Higher statistical reliability with fewer judgments
Sanderson & Zobel 2005
Buckley & Voorhees 2004
Objection: Variability
• The difference in performance between the systems is substantially smaller than between the topics
  – Robust performance over all queries is more important than a high average performance
  – "Difficult" queries should receive a higher weight in the evaluation
  – -> robustness problem
Buckley & Voorhees 2005, Mandl 2006
Context
• Is it possible to transfer the results of
experiments to other situations?
– No
Buckley & Voorhees 2005
Recent Developments
• Million Query Track at TREC
• Suggestion for Targeted Relevance
Judgments
(Moffat, Webber & Zobel 2007)
(Carterette et al. 2008)