Current Developments in Information Retrieval Evaluation

Thomas Mandl, Information Science
University of Hildesheim

Tutorial @ ECIR, Toulouse, 6th Apr. 2009
Who am I?
• Assistant Professor at University of Hildesheim
• Studies at University of Regensburg, Germany and
University of Illinois at Urbana-Champaign, USA
• PhD on Neural Networks in IR from University of
Hildesheim
• Postdoc Thesis (Habilitation) 2006 on Quality in
Web IR from University of Hildesheim
• Research on IR
– Participant at CLEF since 2002
– Track Coordinator at CLEF since 2006
• Which system is better?
• Management approach?
Different Query Types – Different Evaluation
• Navigational
– In search of a homepage of company X
• Informational
– Yellow-Pages-Queries
– Question answering
– Ad-hoc (Searching everything concerning topic X)
"There must be some fundamental understanding of what it means to be good
and what it means to be better"
(Bollmann/Cherniavsky 1983, 3)
[Figure: model of the IR process — an author creates documents (the document corpus); indexing yields an object-attribute matrix; an information seeker formulates a query, which is indexed into a query representation; a similarity calculation between the two representations produces the result documents, which are the object of evaluation.]
Rough Outline
• Cranfield
• Metrics
• Topics
• Users
Overview
• Cranfield Paradigm
  – Introduction
  – Validity
• Evaluation Metrics
  – Binary relevance
  – Multi-level relevance
• Evaluation Initiatives
• Topic Specific Analysis
  – Results
  – Optimization
• User Studies
• Bonus:
  – Site Search Evaluation
  – Hands-on Activities
PART 1
Perspectives on the
Cranfield paradigm
Why evaluation?
• IR systems: numerous components, models and approaches
• It is not possible to predict the effectiveness for a certain collection
• No general preference for a model or a certain component has been proven
• The evaluation of effectiveness is crucial
• A holistic evaluation of retrieval processes is difficult
• Success and satisfaction of the users should be the ideal benchmark
Why evaluation?
• User satisfaction
  – Retrieved documents help to satisfy the user's information need
  – User interface
  – System reaction time
  – Adaptivity
• User-oriented evaluation is very complex and difficult
  – individual and subjective impacts
• Mostly evaluation of retrieval systems
  – User as "constant"
  – Replaced by prototypical users (experts)
  – Cranfield Paradigm of evaluation
Recall and Precision
$$\text{Recall} = \frac{\text{number of retrieved relevant documents}}{\text{number of relevant documents}}$$

$$\text{Precision} = \frac{\text{number of retrieved relevant documents}}{\text{number of retrieved documents}}$$
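To make the two definitions concrete, here is a minimal sketch in Python (my own illustration; the function name and document ids are invented, not part of the tutorial):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for a single query.

    retrieved: set of document ids returned by the system
    relevant:  set of document ids judged relevant
    """
    found_relevant = len(retrieved & relevant)
    precision = found_relevant / len(retrieved) if retrieved else 0.0
    recall = found_relevant / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of 5 retrieved documents are relevant, out of 6 relevant overall
p, r = precision_recall({1, 2, 3, 4, 5}, {2, 4, 5, 7, 8, 9})
print(p, r)  # 0.6 0.5
```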
• "The ability of the retrieval system to uncover
relevant documents is known as the recall
power of the system" (Lancaster 1968, 55)
• Which Retrieval model is the basis for Recall
and Precision?
Examples
CLEF year  Task          Topic language  Runs  Correlation
2000       Multilingual  English           21         0.26
2001       Bilingual     German             9         0.44
2001       Multilingual  German             5         0.19
2001       Bilingual     English            3         0.20
2001       Multilingual  English           17        -0.34
2002       Bilingual     German             4         0.33
2002       Multilingual  German             4         0.43
2002       Bilingual     English           51         0.40
2002       Monolingual   German            21         0.45
2002       Monolingual   Spanish           28         0.21
2003       Monolingual   German            30         0.37
2003       Monolingual   Spanish           38         0.39
2003       Monolingual   English           11         0.16
2002       Multilingual  English           32         0.29
2003       Bilingual     German            24         0.21
2003       Bilingual     English            8         0.41
2003       Multilingual  English           74         0.31
[Figure: recall-precision graph]
Determination of the "measuring points": precision is mostly taken at
recall levels 0.1, 0.2, 0.3, ...
-> arithmetic mean ->
Average Precision (AP)
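A sketch of this procedure for a single ranked result list (illustrative code, not from the tutorial; the 11 standard recall levels 0.0 to 1.0 are assumed):

```python
def eleven_point_ap(ranking, relevant):
    """Average of interpolated precision at recall 0.0, 0.1, ..., 1.0.

    ranking:  list of document ids in the order returned by the system
    relevant: set of document ids judged relevant for the topic
    """
    # (recall, precision) after each relevant document in the ranking
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    # interpolation: precision at level r = best precision at recall >= r
    levels = [i / 10 for i in range(11)]
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]
    return sum(interpolated) / len(levels)

# toy example: the two relevant documents appear at ranks 1 and 3
print(eleven_point_ap(["d1", "d2", "d3"], {"d1", "d3"}))
```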
Overview
• Cranfield Paradigm
– Introduction
– Validity
• Evaluation Metrics
– Binary relevance
– Multi-level relevance
• Evaluation initiatives
• View on single queries
– Analysis
– Topic specific Optimisation
• User studies
Relevance
• Situational relevance describes the (actual) utility of documents with respect to the information need
  – virtually impossible to capture
  – rather a theoretical construct
• Pertinence is the utility perceived by the user with respect to her/his information need
cf. Fuhr 2003
Relevance
• Objective relevance is the relation between the information need and the document, as judged by one or several neutral observers
  – Common basis of system evaluation!
  – How objective can this be?
• System relevance denotes the relevance of the document with respect to the formal query, as estimated by a system (= similarity); commonly called the retrieval value (English: Retrieval Status Value, RSV)
cf. Fuhr 2003
Estimation of the Recall
• Precision is directly evident for every user of an IR system
• Recall, however, is neither evident for the user nor is it possible to determine it precisely with reasonable effort
  – The number of relevant documents is unknown
  – This is especially problematic for information needs which aim at a high recall (e.g. patent novelty search)
cf. Fuhr 2003
Estimation of the Recall
• Pooling method (retrieval with several systems)
  – Apply several IR systems to the same set of documents and mix the results of the different systems
  – Mostly strong overlap in the answer sets of the different systems, so that the effort does not increase linearly with the number of analysed systems
cf. Fuhr 2003
Relevant / not relevant
• Binary relevance decisions are often criticised
• New metrics for multi-level relevance are being discussed
  – Binary judgments prevail
  – They often lead to similar results
  – More later
Evaluation
Cranfield Paradigm of evaluation in Information Retrieval
• To find objective evaluation standards for a comparison of systems
• Keep the conditions for comparison constant
• Systems work with the same document corpus, the same information needs and the same relevance judgments
• Abstraction from usage situation and context
Evaluation
Cranfield Paradigm of evaluation in Information Retrieval
• Objective relevance is judged by a neutral juror
• Relation between the expressed information need and the document
• No individual and subjective relevance assessment in a situational context
• Currently, the basis of all evaluation initiatives in
Information Retrieval (TREC, CLEF, NTCIR, INEX, ...)
TREC: Text REtrieval Conference
• "TREC is a new ballgame for IR research and
development" (Sparck Jones 1994)
• Evaluation initiative of the National Institute of
Standards and Technology (NIST) in the USA
• 1992: TREC-1 (Proceedings 1993)
Cross-Language Evaluation Forum (CLEF)
• EU funding: DELOS NoE for Digital Libraries (Mandl et al. @ CLEF 2003-2006)
• Research on evaluation
• System development
• Test environment
• Research on cross- and multilingual information retrieval systems
• Benchmarks
Example Topic
<top>
  <num>10.2452/89-GC</num>
  <title>Trade fairs in Lower Saxony</title>
  <desc>Documents reporting about industrial or cultural
  fairs in Lower Saxony.</desc>
  <narr>Relevant documents should contain information
  about trade or industrial fairs which take place in the
  German federal state of Lower Saxony, i.e. name, type and
  place of the fair. The capital of Lower Saxony is Hanover.
  Other cities include Braunschweig, Osnabrück, Oldenburg
  and Göttingen.</narr>
</top>
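To give a feel for how such topic files are processed, a hypothetical sketch with Python's standard ElementTree; the file name and the assumption that all <top> elements sit under one root element are mine:

```python
import xml.etree.ElementTree as ET

# hypothetical topic file holding a list of <top> elements under one root
root = ET.parse("geoclef_topics.xml").getroot()
for top in root.iter("top"):
    num = top.findtext("num").strip()
    title = top.findtext("title").strip()
    desc = top.findtext("desc").strip()
    print(num, "-", title)
```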
Objectives of evaluation initiatives
• To find consistent evaluation standards for
retrieval systems (Standardisation)
• To provide comparison between different
systems
• To advance further development of IR
systems
• To consider the needs of the community
• To advance the evaluation methodology
Procedure
• Test basis
  – objects (documents, ...)
  – queries (topics)
    • relevant information needs for potential users
    • consistent relevance assessment
• Time frame
  – Release of topics
  – Submission of results
  – Publication of results
Document collection
• Representative of a real-world task
  – Large
  – Diverse
• Often used: news agency and newspaper collections
Relevance Judgment
• Abstraction from the individual user and his context
• Consistent evaluation
• Objective jurors, who are not in the user's situation
• Objective conclusions about the relation in content between topic and document
Pooling Method
1. Jurors create topics
2. Systems provide the top 1000 documents for every topic
3. Pooling of all documents found at least once
4. Relevance assessment by jurors
Ellen Voorhees – CLEF 2001 Workshop
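A minimal sketch of the pooling step itself (illustrative; the run and document ids are invented, and a real campaign would read the submitted run files instead):

```python
def build_pool(runs, depth=1000):
    """Union of the top-`depth` documents of every submitted run, per topic.

    runs: list of dicts mapping topic id -> ranked list of document ids
    """
    pool = {}
    for run in runs:
        for topic, ranking in run.items():
            pool.setdefault(topic, set()).update(ranking[:depth])
    return pool  # these pooled documents are then judged by the jurors

run_a = {"89-GC": ["doc1", "doc2", "doc3"]}
run_b = {"89-GC": ["doc2", "doc4"]}
print(build_pool([run_a, run_b], depth=2))  # {'89-GC': {'doc1', 'doc2', 'doc4'}}
```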
Procedure
• Intellectual evaluation
  – relevant or not relevant
• Statistical analysis
Overview
• Cranfield Paradigm
  – Introduction
  – Validity
• Evaluation Metrics
  – Binary relevance
  – Multi-level relevance
• Evaluation Initiatives
• Topic Specific Analysis
  – Results
  – Optimization
• User Studies
How reliable is the
evaluation according to
the Cranfield-Paradigm?
GeoCLEF Monolingual English

[Figure: results; bilingual runs reach 76% of monolingual performance]
Relevance Assessment
Indirect information
• "foreign aid in Sub-Saharan Africa"
  – Is a document on the kidnapping of an aid worker relevant?
• "natural disasters in the Western USA"
  – Is a document on the insurance costs caused by a natural disaster relevant?
Interrater Reliability
• Isn’t relevance a rather subjective concept?
• Is there actually a consistency/agreement, if
several jurors evaluate the same set of
documents?
• Wouldn’t this lead to totally different
results?
• Asian approach?
Comparison
• Several sets of assessments could be created by independent jurors
• Several rankings of systems are created
  – using alternative sets of relevance assessments
• How strongly do they differ?
• How to compare rankings?
Comparison of rankings
• Rank correlation coefficients
  – Number of position changes (swaps)
  – Kendall's Tau
  – Spearman coefficient
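A sketch of Kendall's Tau computed directly from its definition via pairwise comparisons (my own illustration; system names are invented):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's Tau between two rankings of the same systems (no ties).

    rank_a, rank_b: dicts mapping system name -> rank position
    """
    concordant = discordant = 0
    for s, t in combinations(rank_a, 2):
        agree = (rank_a[s] - rank_a[t]) * (rank_b[s] - rank_b[t])
        if agree > 0:
            concordant += 1
        else:
            discordant += 1  # a "swap" between the two rankings
    return (concordant - discordant) / (concordant + discordant)

a = {"sysA": 1, "sysB": 2, "sysC": 3}
b = {"sysA": 1, "sysB": 3, "sysC": 2}
print(kendall_tau(a, b))  # 0.333...: one of three pairs is swapped
```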
Subjectivity of Jurors
• Topic Developer is the primary juror
• For TREC 4, the documents in the pool were
evaluated several times
– Evaluation by primary jurors
– 200 relevant (as far as available) + 200 randomly
chosen documents form the new pool
– Two more jurors evaluate the new pool
Buckley & Voorhees 2005, http://doi.acm.org/10.1145/290941.291017
Subjectivity of Jurors
• Results
  – Overlap: between 30% and 40%
  – Example: for Topic 219, no document was classified as relevant by any of the three jurors
  – But: less than 3% of the documents that were initially rated as not relevant were rated as relevant afterwards
Buckley & Voorhees 2005, http://doi.acm.org/10.1145/290941.291017
Changes of absolute values
Correlation between Rankings
Rankings rather robust
Subjectivity of Jurors
• Does the result depend on the jurors who evaluate the relevance of a document?
  – Actually, jurors evaluate differently
  – The absolute values for the system performance change
  – This does not have a large impact on the order of the systems
  – The comparison between the systems turns out very similarly, independent of the jurors
Buckley & Voorhees 2005
Expertise Level of Juror
• What is the effect of the knowledge of the juror?
• Experiment with three different groups of jurors
  – Gold: Topic developers, task experts
  – Silver: Task experts
  – Bronze: Non-experts
• Data: Enterprise Track at TREC
  – Three relevance levels
(Bailey et al. 2008)
Expertise Level of Juror
• Typical level of agreement
  – Between 18% and 58%
  – E.g. 24% of the highly relevant docs as judged by Bronze were judged irrelevant by Gold jurors
• Effect on system orderings
  – 0.96 and 0.94 correlation between Gold and Silver (for two measures)
  – Only 0.73 and 0.66 between Gold and Bronze
• Task knowledge is important!
(Bailey et al. 2008)
Cranfield
• Test basis
  – Objects (documents, ...)
    • Sufficient?
  – Queries (topics)
    • Sufficient queries?
    • Large or small differences?
  – Consistent relevance assessment
    • Sufficient judgments?
    • Differences between jurors
  – Systems
    • What if more systems had participated? -> Pooling method
Questions and Doubts
– Queries (topics)
  • Sufficient queries?
  • Large or small differences?
– Relevance assessment
  • Sufficient number of judgments?
  • What if ...
    – fewer topics were available?
    – fewer relevance judgments were available?
– Analysis
Variability
• OBSERVATION: The difference in the performance
between the systems is far smaller than between
the topics
Example GeoCLEF 2006
[Figure: Variance of the systems in GeoCLEF 2006 (Mono DE, Mono EN, Mono ES, Mono PT)] Mandl 2008
Example GeoCLEF 2006
[Figure: Variance of the topics in GeoCLEF 2006 (Mono DE, Mono EN, Mono ES, Mono PT)] Mandl 2008
Significance tests
• Descriptive statistics: description of distributions, e.g. mean and variance
• Inferential statistics: evaluation of results
• Question: Do the statistical values differ significantly from each other?
Result of Significance tests
• The systems A and B do not differ significantly from each other
• With 95% probability there is a difference, and system A is better than system B
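A hedged sketch of how such a test is typically run on per-topic average precision values, here with SciPy's paired t-test; the score lists are invented for illustration:

```python
from scipy import stats

# invented per-topic average precision scores for two systems
ap_system_a = [0.31, 0.12, 0.45, 0.08, 0.27, 0.33, 0.19, 0.40]
ap_system_b = [0.25, 0.10, 0.44, 0.09, 0.20, 0.30, 0.15, 0.38]

# paired test, since the same topics were run on both systems;
# stats.wilcoxon would be a non-parametric alternative
t, p = stats.ttest_rel(ap_system_a, ap_system_b)
print(f"t = {t:.2f}, p = {p:.3f}")
# if p < 0.05 we reject "no difference" at the 95% level
```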
GeoCLEF Monolingual English

[Figure: results; bilingual runs reach 76% of monolingual performance]
Mandl et al. 2009
Harman 2005, http://www.haifa.il.ibm.com/sigir05-qp/index.html
Variance of systems
[Figure: Average Precision of the five best runs (BKGeoD2, BKGeoD1, FUHddGYYYTD, FUHddGYYYTDN, FUHddGYYYMTDN) and their average for ten topics at GeoCLEF 2006 (monolingual German)] Mandl 2008
Swap Rate
• Starting point: original ranking of the systems
  – Leave out topics
  – Create a ranking on the remaining topics
  – How often is a "worse" system in front of a better one?
  – Error rate
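A simplified sketch of this idea (my own illustration, not the exact procedure of Buckley & Voorhees): repeatedly draw topic subsets and count how often a pair of systems is ordered differently than on the full topic set.

```python
import random
from itertools import combinations

def swap_rate(scores, subset_size, trials=1000):
    """Fraction of system pairs whose order on a random topic subset
    contradicts their order on the full topic set.

    scores: dict system -> dict topic -> average precision
    """
    topics = list(next(iter(scores.values())))
    def mean_ap(system, topic_set):
        return sum(scores[system][t] for t in topic_set) / len(topic_set)
    swaps = pairs = 0
    for _ in range(trials):
        subset = random.sample(topics, subset_size)
        for a, b in combinations(scores, 2):
            full = mean_ap(a, topics) - mean_ap(b, topics)
            part = mean_ap(a, subset) - mean_ap(b, subset)
            if full * part < 0:  # the "worse" system ends up in front
                swaps += 1
            pairs += 1
    return swaps / pairs
```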
Buckley & Voorhees 2000
Smaller Topic Sets
Sanderson & Zobel 2005
Number of topics
• Do the common 50 topics suffice to compare the systems?
• A certain difference has to exist between two systems to prove statistically that one system is better than the other
  – With 50 topics this difference lies below 5% (absolute)
  – Partially even significantly below 5%
Sanderson & Zobel 2005
Subset Analysis
• Select n out of m topics
  – Many combinations are possible
  – Repeat the selection
  – Create several rankings for each subset size
  – Calculate the average and show the corridor size
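A sketch of this subset analysis (illustrative code; it uses SciPy's kendalltau for the rank correlation, and the data layout is my own assumption):

```python
import random
from scipy.stats import kendalltau

def subset_correlation(scores, n, repeats=100):
    """Average Kendall correlation between the system ranking on a
    random n-topic subset and the ranking on all topics.

    scores: dict system -> dict topic -> average precision
    """
    systems = list(scores)
    topics = list(next(iter(scores.values())))
    def mean_aps(topic_set):
        return [sum(scores[s][t] for t in topic_set) / len(topic_set)
                for s in systems]
    full = mean_aps(topics)
    taus = []
    for _ in range(repeats):
        subset = random.sample(topics, n)
        tau, _ = kendalltau(full, mean_aps(subset))
        taus.append(tau)
    return sum(taus) / len(taus)

# plotting this value for n = 1 .. number of topics gives a curve like
# the one on the following slide
```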
Example Correlation
[Figure: Average correlation of the rankings for smaller subsets of topics with the ranking for all topics (Monolingual Spanish, GeoCLEF 2006)]
Mandl 2008
• Statistical tests overestimate the error rate
  – compared to swap-rate analysis
• Tendency:
  – More topics
  – Fewer relevance judgments per topic
  – -> higher reliability of the system comparison
Sanderson & Zobel 2005
Is Pooling sufficient?
• Do the systems find all relevant documents?
• Or are there many relevant documents beyond the evaluated pools that are never found?
  – The pool has to have a certain quality
  – Single systems do not contribute very much
  – Leaving out single systems does not essentially change the ranking of the other systems
  – The comparison remains robust
for TREC: Zobel (1998)
for CLEF: Braschler (2003)
Is Pooling sufficient?
• Incompleteness ...
– More topics are more important than an
exhaustive evaluation of the pools
– Higher statistical reliability with fewer judgments
Sanderson & Zobel 2005
Buckley & Voorhees 2004
Objection: Variability
• The difference in performance between the systems is substantially smaller than between the topics
  – Robust performance over all queries is more important than a high average performance
  – "Difficult" queries should receive a higher weight in the evaluation
  – -> robustness problem
Buckley & Voorhees 2005, Mandl 2006
Context
• Is it possible to transfer the results of
experiments to other situations?
– No
Buckley & Voorhees 2005
Recent Developments
• Million Query Track at TREC
• Suggestion for Targeted Relevance
Judgments
(Moffat, Webber & Zobel 2007)
(Carterette et al. 2008)