Outline
o What to evaluate in IR
o Test collection methodology
  - Document, information need, query, relevance
  - TREC
o Precision and recall
  - Average precision, interpolated, mean average precision (MAP)
  - P@r, R-Precision, MRR
  - E and F measures
o Other measures (DCG, bpref)
o Significance testing
o Large-scale evaluation (web search & clicks)
o Evaluating classifiers
2
Information Retrieval = IR IR vs. Search
Evaluation in general versus evaluation in IR
o Evaluating a system in computer science is often concerned with time and space → system performance
o With large collections of documents, system performance is still very important
o However, in IR, we care a lot about retrieval performance: are the retrieved documents “relevant” to a “user information need”?
4
Why do we need to evaluate an IR system?
o The user wants to find recipes about “couscous” as cooked in various countries
o The user tries two IR systems
o How can we say which one is better?
5
Acknowledgements
6
These slides were based on
- Evaluation lecture @ QMUL; Thomas Roelleke & Mounia Lalmas
- Lecture 8: Evaluation @ Stanford University; Pandu Nayak & Prabhakar Raghavan
- Retrieval Evaluation @ University of Virginia; Hongning Wang
- Lectures 11 and 12 on Evaluation @ Berkeley; Ray Larson
- Evaluation of Information Retrieval Systems @ Penn State University; Lee Giles
o Information Retrieval, 2nd edition, C.J. van Rijsbergen (1979)
o Introduction to Information Retrieval, C.D. Manning, P. Raghavan & H. Schuetze (2008)
o Modern Information Retrieval: The Concepts and Technology behind Search, 2nd edition; R. Baeza-Yates & B. Ribeiro-Neto (2011)
What to evaluate in IR
o coverage of the collection: extent to which the system includes relevant material
o time lag (efficiency): average interval between the time a query is submitted and the answer is given
o presentation of the output
o effort involved by user in obtaining answers to a query
o recall of the system: proportion of relevant documents retrieved
o precision of the system: proportion of the retrieved documents that are actually relevant
7
o coverage has to do with the quality of the collection
o efficiency in terms of speed, memory usage, etc.
o presentation has to do with interface and visualisation issues
o effort has to do with user issues, e.g. user satisfaction
o recall and precision have to do with retrieval effectiveness, or effectiveness for short → system-oriented evaluation
8
What to evaluate in IR
System-oriented evaluation
o Measuring effectiveness has been the predominant concern in IR evaluation
o Test collection methodology
  - Benchmark (dataset) upon which effectiveness is measured and compared
  - The dataset states, for a given query, which documents are relevant
o Metrics to measure effectiveness
  - Precision and recall, and variants
  - E and F measures
  - Others (DCG, bpref)
9
Test Collection methodology
o Compare retrieval performance using a test collection
  - Document collection, that is the documents themselves. The document collection depends on the task, e.g. evaluating web retrieval requires a collection of HTML documents.
  - Queries, which simulate real user information needs.
  - Relevance judgements, stating for a query the relevant documents.
o To compare the performance of two techniques:
  - each technique is used to answer the queries
  - results (set or ranked list) are compared using some effectiveness measure
  - the most common measures are precision and recall
o Usually use multiple measures to get different views of performance
o Usually test with multiple collections, as performance can be collection dependent
10
Information need, query and relevance
o The information need is translated into a query
o Relevance is assessed relative to the information need, not the query
  - Information need: I am looking for information on what are the best places to go on holiday near the beach and play tennis
  - Query: tennis beach holiday
  - Evaluate whether the document addresses the information need, not whether it has the three words “tennis”, “beach” and “holiday”
Sec. 8.1
11
Relevance … as defined in system-oriented evaluation
o A document is relevant if it “has significant and demonstrable bearing on the matter at hand”.
o There are common assumptions about the nature of relevance in system-centred evaluation:
  - Objectivity: everybody agrees on whether a document is relevant or not to a query
  - Topicality: relevance is about whether the document is about the topic expressed in the query
  - Binary nature: either a document is relevant or not
  - Independence: the fact that a document is relevant to a query has no effect on the relevance of another document for that same query
12
Relevance is difficult to define satisfactorily
o A document is relevant within the context of a query
  - Who judges the relevance? → humans are not very consistent (see next slide)
  - Is the document useful? → Utility
  - Judgment on whether a document is relevant or not depends on more than document and query
o With real collections, we never know the full set of relevant documents
o Retrieval model incorporates notion of relevance
  - Satisfiability of a logical expression in Boolean model
  - P(relevance | query, document) in BIRM
  - Similarity to query in VSM
  - P(query generated | document model) in LM
13
Kappa measure for inter-judge relevance agreement
o Kappa measure
  - Agreement measure among judges (assessing document relevance)
  - Designed for categorical judgments (relevant or not)
o Kappa = [ P(A) – P(E) ] / [ 1 – P(E) ]
o P(A) – proportion of time judges agree
o P(E) – what agreement would be by chance
o Kappa = 0 for chance agreement, 1 for total agreement
Sec. 8.5
14
Kappa Measure: Example

Number of documents assessed   Judge 1        Judge 2
300                            Relevant       Relevant        (judges agree)
70                             Non-relevant   Non-relevant    (judges agree)
20                             Relevant       Non-relevant    (judges disagree)
10                             Non-relevant   Relevant        (judges disagree)
Sec. 8.5
15
Kappa measure: Example
P(A) = 370/400 = 0.925
P(non-relevant) = (10+20+70+70)/800 = 0.2125
P(relevant) = (10+20+300+300)/800 = 0.7875
P(E) = 0.2125^2 + 0.7875^2 = 0.665
Kappa = (0.925 – 0.665)/(1 – 0.665) = 0.776

Kappa > 0.8 = good agreement
0.67 < Kappa < 0.8 → “tentative conclusions”
For > 2 judges → average pairwise kappas
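The calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `kappa_pooled` and its argument names are made up here, and chance agreement is estimated from the pooled judgments of both judges, as in the worked example.

```python
def kappa_pooled(both_rel, both_nonrel, only_j1_rel, only_j2_rel):
    """Kappa for two judges, with P(E) from pooled marginals (hypothetical helper)."""
    total = both_rel + both_nonrel + only_j1_rel + only_j2_rel
    p_agree = (both_rel + both_nonrel) / total          # P(A)
    # Each document contributes two judgments, one per judge.
    rel_votes = 2 * both_rel + only_j1_rel + only_j2_rel
    p_rel = rel_votes / (2 * total)
    p_nonrel = 1 - p_rel
    p_chance = p_rel ** 2 + p_nonrel ** 2               # P(E)
    return (p_agree - p_chance) / (1 - p_chance)

kappa = kappa_pooled(both_rel=300, both_nonrel=70, only_j1_rel=20, only_j2_rel=10)
print(round(kappa, 3))
```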
Sec. 8.5
16
Impact of inter-judge agreement on IR systems comparisons
o Impact on absolute effectiveness performance measure can be significant (0.32 vs 0.39)
o But little impact on ranking of different systems or relative effectiveness performance
o If we just want to know if IR system A is better than IR system B → test collection methodology gives reliable comparison
Sec. 8.5
17
Find the relevant documents in the collection
o Did the IR system find all relevant documents?
o To answer accurately, we need complete judgments
  - i.e., “yes,” “no,” or some score for every query-document pair
o For small test collections, we can review all documents for all queries
o Not practical for large or even medium-sized collections
  - TREC collections have millions of documents
o Pooling method
o Click-based evaluation in web search (later in the lecture)
18
Test collection creation
o Manual method:
  - Every document in the collection is judged against every query by one of several judges (human assessors)
  - This is feasible for small document collections.
o Pooling method (used for large document collections):
  - The queries are run against several IR systems first
  - The top documents (for example, the top 100) retrieved by each system are pooled together
  - The pool is then judged for relevance (by human assessors)
  - This is what TREC does
o Query logs (web search) → see later about “evaluation with clicks”
19
Sample test collections (ad hoc retrieval)
Characteristics                Cranfield   CACM     ISI     West       TREC2
Collection size (docs)         1400        3204     1460    11953      742611
Collection size (MB)           1.5         2.3      2.2     254        2162
Year created                   1968        1983     1983    1990       1991
Unique stems                   8226        5493     5448    196707     1040415
Stem occurrences               123200      117578   98304   21798833   243800000
Mean document length (words)   88          36.5     67.3    1823       328
Number of queries              225         50       35      44         100

Max within document frequency: 27, 27, 1309 (only three values given)
20
ad hoc retrieval: query, document, ranking
CIS
o 1239 documents about cystic fibrosis from MEDLINE collection
o Fields: author, title, source, major and minor subjects, abstracts, references and citations
o 100 queries, developed by relevance judges
o Unusual features:
  - 4 judges per document per query (3 experts, 1 medical bibliographer)
  - 3 levels of relevance (0-2)
  - Combined relevance on scale of 0-8
222 2 221 2 211 2 111 2 222 1 221 1 211 1 111 1 000 0
21
Added so we do not forget history
CACM
o 3204 articles on computer science from CACM, 1958 - 1979
o Fields: author, date, word stems for titles and abstracts, categories, direct referencing, bibliography coupling, number of co-citations for each pair of articles
o 52 queries, each with 2 Boolean formulations
o Unusual features:
  - Citation links to other documents, so often used for hypertext-type experiments
22
Added so we do not forget history
TREC
o Text REtrieval Conference/Competition
  - http://trec.nist.gov
  - Run by NIST (National Institute of Standards & Technology)
o Collections: > Terabytes
o Datasets
  - Newswire & full text news (AP, WSJ, Ziff, FT)
  - Government documents (Federal Register, Congressional Record)
  - Radio Transcripts (FBIS)
  - Web “subsets”
  - …
23
Queries & relevance judgments at TREC
o Queries devised and judged by “information specialists” → TREC Topics
o Relevance judgments done only for those documents retrieved and not the entire collection!
  - E.g. merge the top 100 retrieved documents from the systems experimented with (TREC participants)
  - Pooling method
25
Example (excerpt) of a TREC document
<doc>
<docno> WSJ880406-0090 </docno>
<hl> AT&T Unveils Services to Upgrade Phone Networks Under Global Plan </hl>
<author> Janet Guyon (WSJ Staff) </author>
<dateline> New York </dateline>
<text>
American Telephone & Telegraph Co. introduced the rest of a new generation of phone services with broad ...
</text>
</doc>
26
Example (excerpt) of a TREC topic
<top>
<num> Number: 168
<title> Topic: Financing AMTRAK
<desc> Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK)
<nar> Narrative: A relevant document must provide information on the government's responsibility to make AMTRAK an economically viable entity. It could also discuss the privatisation of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.
</top>
27
TREC legacy
o Pros:
  - made research systems scale to large collections (pre-WWW)
  - allows for controlled comparisons
o Cons:
  - emphasis on high recall, often unrealistic for what most users want → but recall-oriented search exists (patent retrieval, e-discovery)
  - very long queries, unrealistic → systems optimized for long queries and hence perform worse for shorter, more realistic queries
  - focus on batch ranking (one-off result) rather than interaction (but the session track was introduced to evaluate a “user search session”)
28
Other evaluation forums

o CLEF (Cross-Language Evaluation Forum)
o NTCIR (NII Testbeds and Community for Information access Research)
o FIRE (Forum for Information Retrieval Evaluation)
o INEX (The Initiative for the Evaluation of XML retrieval)
29
Effectiveness
o We recall that the goal of an IR system is to retrieve as many relevant documents as possible and as few non-relevant documents as possible.
o Evaluating the above consists of a comparative evaluation of technical performance of IR system(s):
  - In traditional IR, technical performance means the effectiveness of the IR system: the ability of the IR system to retrieve relevant documents and suppress non-relevant documents
  - Effectiveness is measured by the combination of recall and precision
30
Intuition behind precision and recall
o Collection of 10,000 documents, 50 relevant to a given topic

o Ideal system finds these 50 documents and rejects all others

o An actual system might identify 25 documents, of which 20 are relevant and 5 are on other topics

Precision: 20/25 = 0.8 (80% of the retrieved documents are relevant)

Recall: 20/50 = 0.4 (40% of the relevant documents are found)
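As a minimal sketch of the same arithmetic with sets, assuming documents are identified by integers (the ids and the function name `precision_recall` are invented for illustration):

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    hits = len(retrieved & relevant)                 # relevant documents retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ids: documents 0..49 are the 50 relevant ones.
relevant = set(range(50))
# The system returns 20 relevant documents and 5 off-topic ones.
retrieved = set(range(20)) | set(range(100, 105))

p, r = precision_recall(retrieved, relevant)
print(p, r)
```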
31
Measuring Precision and Recall Precision is easy to measure:
o Look at each document retrieved and decide whether it is relevant or not
o In previous example, only the 25 documents that are found need to be examined
Recall is difficult to measure:
o To know all relevant items, we must go through entire collection, looking at every document to decide if it is relevant or not
o In previous example, all 10,000 documents must be examined! → remember the pooling method at TREC
32
Recall / Precision
[Figure: Venn diagram of the document collection, showing the retrieved set, the relevant set, and their overlap (retrieved and relevant)]
Knowing which documents are relevant to which queries comes from the test collection
For a given query, the document collection can be divided into three sets: the set of retrieved documents, the set of relevant documents, and the rest of the documents.
33
Recall / Precision In the ideal case, the set of retrieved documents is equal to the set of relevant documents. However, in most cases, the two sets will be different. This difference is formally measured with precision and recall.
34
\[ \text{precision} = \frac{\text{number of relevant documents retrieved}}{\text{number of documents retrieved}} \]
\[ \text{recall} = \frac{\text{number of relevant documents retrieved}}{\text{number of relevant documents}} \]
Retrieved vs. Relevant Documents
Relevant
Very high precision, very low recall
retrieved
35
A high precision rate is achieved by returning only documents that we know for sure are relevant → Is this a good idea?
Retrieved vs. Relevant Documents
Relevant
High recall, but low precision
retrieved
36
100% recall can be achieved by returning all documents in the collection → This is for sure a bad idea!
Retrieved vs. Relevant Documents
Relevant
Very low precision, very low recall (0 for both)
retrieved
37
Total failure!
Retrieved vs. Relevant Documents
Relevant
High precision, high recall
retrieved
38
The perfect scenario!
Recall and Precision
The above two measures do not take into account where the relevant documents are retrieved, that is, at which rank (crucial since the output of most IR systems is a ranked list of documents). This is very important because an effective IR system should not only retrieve as many relevant documents as possible and as few non-relevant documents as possible, but should also retrieve relevant documents before the non-relevant ones.
39
Recall and Precision
o Let us assume that for a given query, the following 10 documents are relevant:
  {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
o Now suppose that the following documents are retrieved for that query:

rank  doc   precision  recall      rank  doc   precision  recall
1     d123  1/1        1/10        8     d129
2     d84                          9     d187
3     d56   2/3        2/10        10    d25   4/10       4/10
4     d6                           11    d48
5     d8                           12    d250
6     d9    3/6        3/10        13    d113
7     d511                         14    d3    5/14       5/10

For each relevant document (the rows with precision and recall values; in red bold on the slide), we calculate the precision value and the recall value. For example, for d56, we have 3 retrieved documents, and 2 among them are relevant, so the precision is 2/3. We have 2 of the relevant documents so far retrieved (the total number of relevant documents being 10), so recall is 2/10.
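The precision and recall values of this example can be reproduced with a short sketch. The function name `recall_precision_points` is an invention for illustration; exact fractions are used so that the values match the ones in the example (note that 3/6 is printed in reduced form).

```python
from fractions import Fraction

def recall_precision_points(ranking, relevant):
    """Return one (recall, precision) pair per relevant document in the ranking."""
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((Fraction(hits, len(relevant)), Fraction(hits, rank)))
    return points

relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511",
           "d129", "d187", "d25", "d48", "d250", "d113", "d3"]

points = recall_precision_points(ranking, relevant)
for recall, precision in points:
    print(f"recall {float(recall):.0%}   precision {float(precision):.2%}")
```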
40
Recall and Precision
o For each query, we obtain pairs of recall and precision values
  - In our example, we would obtain (1/10, 1/1) (2/10, 2/3) (3/10, 3/6) (4/10, 4/10) (5/10, 5/14) … which are usually expressed in %: (10%, 100%) (20%, 66.66%) (30%, 50%) (40%, 40%) (50%, 35.71%) …
  - This can be read for instance: at 20% recall, we have 66.66% precision; at 50% recall, we have 35.71% precision
The pairs of values are plotted into a graph, which has the following curve
[Figure: precision (%) plotted against recall (%), both axes from 10 to 100]
41
Recall and Precision
o We have shown how to derive the recall and precision curve for a given query
o Now we describe how, by doing the above for all queries, the effectiveness of an IR system is evaluated and thus compared to other IR systems.
o Note that we can also compare different versions of the same system (e.g. with different parameters). The idea here is to find the best version of the IR system.
42
The complete methodology
For each IR system / IR system version
1. For each query in the test collection
   a. We first run the query against the system to obtain a ranked list of retrieved documents
   b. We use the ranking and relevance judgements to calculate recall/precision pairs
2. Then we average recall / precision values across all queries, to obtain an overall measure of the effectiveness
43
Averaging across queries
o Hard to compare precision and recall graphs or tables for individual queries (too much data)
  - Need to average over many queries
o Two main types of averaging
  - Macro-average: each query is a point in the average
  - Micro-average: each relevant document is a point in the average
  - Macro is mostly used (all queries count equally)
44
(Macro) Interpolated average precision
o Average precision at standard recall points
o For a given query, compute precision and recall point for every relevant document
o Interpolate precision at standard recall levels
  - 11-pt is usually 100%, 90%, 80%, ..., 10%, 0%
o Average over all queries to get average precision at each recall level
45
Interpolation
[Figure: precision against recall (0 to 100), showing observed values and interpolated values]
Observed recall values often do not coincide with the standard recall levels (10%, 20%, …). We therefore need to interpolate to obtain precision values at the standard recall levels.
For example, an observed recall value of 25% is assigned to the nearest standard recall level on its right, that is 30%.
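A small sketch of 11-point interpolation follows. It assumes the convention described in Manning, Raghavan & Schuetze (2008), where the interpolated precision at a standard recall level is the highest precision observed at any recall greater than or equal to that level; the slide's figure may use a slightly different rule. The function name `interpolate_11pt` is hypothetical, and the observed points are those of the earlier 14-document example.

```python
from fractions import Fraction

def interpolate_11pt(points):
    """points: (recall, precision) pairs for one query, as Fractions.

    Returns precision at recall levels 0%, 10%, ..., 100%.
    Assumed rule: max precision at any observed recall >= the level.
    """
    levels = [Fraction(i, 10) for i in range(11)]
    return [max((p for r, p in points if r >= level), default=Fraction(0))
            for level in levels]

observed = [(Fraction(1, 10), Fraction(1, 1)), (Fraction(2, 10), Fraction(2, 3)),
            (Fraction(3, 10), Fraction(3, 6)), (Fraction(4, 10), Fraction(4, 10)),
            (Fraction(5, 10), Fraction(5, 14))]

interpolated = interpolate_11pt(observed)
for i, p in enumerate(interpolated):
    print(f"recall {i * 10:3d}%   precision {float(p):.2%}")
```

Levels beyond the last observed recall value (here 60% and above) get precision 0, because the remaining relevant documents were never retrieved.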
46
Interpolated average precision
[Figure: precision (0 to 100) against recall (0 to 100) at the standard recall levels, with curves for query 1, query 2 and their average]
We have precision values at standard recall values for two queries. The precision values for query 1 are higher than those for query 2. This means that the effectiveness of the IR system is better for query 1 than for query 2. We can plot the average of the two queries.
47
Averaging
The same information can be displayed in a table.
48
Recall in %   Precision in %
              Query 1   Query 2   Average
10            80        60        70
20            80        50        65
30            60        40        50
40            60        30        45
50            40        25        32.5
60            40        20        30
70            30        15        22.5
80            30        10        20
90            20        5         12.5
100           20        5         12.5
Comparison of systems
[Figure: precision (0 to 100) against recall (0 to 100), with curves for system 1 and system 2]
We can now compare IR systems / system versions. For example, here we see that at low recall, system 2 is better than system 1, but this changes from recall value 30%, etc. It is common to calculate an average precision value across all recall levels, so as to have a single value to compare.
49
Averaging across averages
o Average interpolated recall levels to get single result
  - Called “interpolated average precision”
  - Not used much anymore; “mean average precision” more common
  - Values at specific interpolated points still commonly used
o Mean average precision (MAP)
  - (“Average average precision” sounds weird)
  - Average precision over all relevant documents, non-interpolated
  - Rewards systems that retrieve relevant documents quickly (highly ranked)
50
Mean Average Precision
Consider rank position of each relevant document (n) for given query: r1, r2, … rn
Compute precision@r (denoted P@r) for each r1, r2, … rn
Average precision = average of P@r for given query
MAP is Average Precision across multiple queries
\[ \frac{1}{3} \cdot \left( \frac{1}{1} + \frac{2}{3} + \frac{3}{5} \right) \approx 0.76 \]
51
Mean Average Precision (MAP)
52
average precision query 1 (AP) = (1.0 + 0.67 + 0.5 + 0.44 + 0.5)/5 = 0.62
average precision query 2 (AP) = (0.5 + 0.4 + 0.43)/3 = 0.44
mean average precision (MAP) = (0.62 + 0.44)/2 = 0.53
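The figures for the two queries are not reproduced here, so the sketch below assumes rank positions that yield exactly the precision values listed above: relevant documents at ranks 1, 3, 6, 9, 10 for query 1 and at ranks 2, 5, 7 for query 2. The function name `average_precision` and the optional `num_relevant` argument are illustrative choices.

```python
def average_precision(relevant_ranks, num_relevant=None):
    """Average of P@r over the ranks r at which relevant documents appear.

    num_relevant: total number of relevant documents for the query; relevant
    documents that were never retrieved then contribute a precision of zero.
    """
    ranks = sorted(relevant_ranks)
    total = num_relevant if num_relevant is not None else len(ranks)
    # The i-th relevant document (1-based) found at rank r gives P@r = i / r.
    return sum((i + 1) / rank for i, rank in enumerate(ranks)) / total

ap1 = average_precision([1, 3, 6, 9, 10])   # assumed ranks for query 1
ap2 = average_precision([2, 5, 7])          # assumed ranks for query 2
mean_ap = (ap1 + ap2) / 2
print(round(ap1, 2), round(ap2, 2), round(mean_ap, 2))
```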
More about mean average precision (MAP)
o If a relevant document is not retrieved, precision corresponding to that relevant document is zero
o Most commonly used measure in research papers … with issues
o Not so good for web search evaluation (precision oriented) - MAP assumes user is interested in finding many relevant documents
53
TREC (trec_eval) evaluation results
Recall Level Precision Average

Recall   Precision
0.0      0.61
0.1      0.45
…        …
1.0      0.003

Average precision over all relevant documents, non-interpolated (MAP): 0.23
54
Average precision per query
[Figure: difference in average precision (from -1.0 to 1.0) per topic id (200, 201, 202, 203, 204, …)]
55
A system may perform badly for some information needs (MAP = 0.1) and excellently on others (MAP = 0.7) → it is often the case that the variance in performance of the same system across queries is much greater than the variance of different systems on the same query
There are easy information needs and hard ones!
Rank-based measures
o Binary relevance
  - Mean Average Precision (MAP)
  - P@r
  - R-Precision
  - Mean Reciprocal Rank (MRR)
  - bpref
o Multiple levels of relevance
  - Normalized Discounted Cumulative Gain (NDCG)
56
P@r or Precision @ rank r
Set a rank threshold r
Compute % relevant documents in top r
Ignores documents ranked lower than r
P@3 = 2/3
P@4 = 2/4
P@5 = 3/5
actual performance as a user might see it
often used in web retrieval
used at fixed rank values: P@5, P@10
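A minimal sketch of P@r, assuming a ranking encoded as a list of binary relevance flags. The ranking `[1, 0, 1, 0, 1]` is an assumption chosen to be consistent with the three values above, since the original figure is not available.

```python
def precision_at(relevance, r):
    """Fraction of the top r results that are relevant.

    relevance: list of 0/1 flags in ranked order. The denominator is always r,
    so a ranking shorter than r is penalised for the missing results.
    """
    return sum(relevance[:r]) / r

ranking = [1, 0, 1, 0, 1]   # assumed: relevant documents at ranks 1, 3 and 5
for r in (3, 4, 5):
    print(f"P@{r} = {precision_at(ranking, r):.2f}")
```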
57
Note the slight difference with P@r in slide 51
R-Precision
o Precision after R documents are retrieved
o R = number of relevant documents for the topic
o De-emphasizes exact ranking of retrieved relevant documents, which can be useful for topics with a large number of relevant documents
o Perfect system could score 1.0
o Average R-precision
  - Example: 2 topics, with 50 and 10 relevant documents respectively.
  - Assume the IR system returns 17 relevant documents in the top 50 documents for the 1st topic and 7 relevant documents in the top 10 for the 2nd topic
  - Average R-precision for this IR system is (17/50 + 7/10) / 2 = 0.52
58
Mean Reciprocal Rank (MRR)
o Suppose there is only one relevant document
o Scenarios: known-item search, navigational queries, looking for a fact
o Search duration → rank of the answer measures the user's effort in finding that one and only document
Consider rank position, r, of first relevant document
\[ \text{Reciprocal Rank score} = \frac{1}{r} \]
MRR is the mean RR across multiple queries
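A short sketch of reciprocal rank and MRR. The three rankings are invented purely for illustration, and scoring a query with no relevant document retrieved as 0 is a common convention assumed here rather than something stated on the slide.

```python
def reciprocal_rank(relevance):
    """1 / rank of the first relevant result; 0.0 if none is retrieved (assumed convention)."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Mean of the reciprocal ranks over several queries."""
    return sum(reciprocal_rank(run) for run in runs) / len(runs)

# Hypothetical results for three queries (1 = relevant, 0 = not relevant).
runs = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
mrr = mean_reciprocal_rank(runs)   # (1/3 + 1 + 1/2) / 3
print(round(mrr, 3))
```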
59
E-measure
o Used to emphasize precision (or recall)
  - Essentially a weighted average of precision and recall
  - Large α increases importance of precision
\[ E = 1 - \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} \]
o Can be transformed by α = 1/(β² + 1), leading to
\[ E = 1 - \frac{(\beta^2 + 1) P R}{\beta^2 P + R} \]
  - When β = 1 (α = 1/2), equal importance of precision and recall
  - Normalised symmetric difference of retrieved and relevant sets
60
Symmetric Difference and E
A is the retrieved set of documents
B is the set of relevant documents
P = |A∩B|/|A|
R = |A∩B|/|B|
A⊗B (the symmetric difference) is the shaded area
|A⊗B| = |A∪B| - |A∩B| = |A| + |B| - 2|A∩B|
E(β=1) = 1 - 2PR/(P+R) = (P+R-2PR)/(P+R) = … = |A⊗B| / (|A|+|B|), the normalised symmetric difference
61
F measure
o F = 1-E often used
  - Good results mean larger values of F
  - “F1” measure is popular: F with β=1
  - particularly popular for evaluating classification approaches
\[ F_\beta = 1 - E = \frac{(\beta^2 + 1) P R}{\beta^2 P + R} \]
\[ F_1 = \frac{2 P R}{P + R} = \frac{1}{\frac{1}{2}\left(\frac{1}{R} + \frac{1}{P}\right)} \]
F1 is the harmonic mean of P and R
Harmonic mean is a conservative average
62
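As a sketch, the F and E measures can be computed directly from the β form of the formula. The function names are illustrative, and returning 0 when both P and R are 0 is an assumed convention to avoid a division by zero. The example reuses P = 0.8 and R = 0.4 from the earlier intuition slide.

```python
def f_measure(p, r, beta=1.0):
    """F_beta = (beta^2 + 1) P R / (beta^2 P + R); 0.0 when P = R = 0 (assumed)."""
    if p == 0 and r == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)

def e_measure(p, r, beta=1.0):
    """E = 1 - F."""
    return 1 - f_measure(p, r, beta)

p, r = 0.8, 0.4
f1 = f_measure(p, r)
print(round(f1, 3), round((p + r) / 2, 3))   # harmonic mean vs arithmetic mean
```

The harmonic mean (about 0.53) sits below the arithmetic mean (0.6), which is what makes it a conservative average: a low value of either P or R pulls F1 down.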
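F measure, geometric interpretation
A is the retrieved set of documents
B is the set of relevant documents
P = |A∩B|/|A|
R = |A∩B|/|B|
[Figure: Venn diagram of A, B and A∩B]
\[ F_{\beta=1} = \frac{2PR}{P+R} = \frac{2|A \cap B|^2}{|A|\,|B|} \Big/ \left( |A \cap B| \left( \frac{1}{|A|} + \frac{1}{|B|} \right) \right) = \frac{2|A \cap B|}{|A| + |B|} \]
63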
Relation to Contingency Table

                            Document is relevant   Document is NOT relevant
Document is retrieved       a                      b
Document is NOT retrieved   c                      d

Accuracy: (a + d)/(a + b + c + d)
Precision: a/(a + b)
Recall: a/(a + c)

Why is accuracy not much used in IR in large document collections?
- Most documents are NOT relevant
- Most documents are NOT retrieved
- The large value of d inflates the accuracy value
64
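To illustrate how accuracy is inflated, the sketch below plugs the numbers from the earlier intuition slide (10,000 documents, 50 relevant, 25 retrieved of which 20 are relevant) into the contingency table. The function name is hypothetical.

```python
def contingency_measures(a, b, c, d):
    """a: retrieved & relevant, b: retrieved & non-relevant,
    c: not retrieved & relevant, d: not retrieved & non-relevant."""
    accuracy = (a + d) / (a + b + c + d)
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    return accuracy, precision, recall

# 20 relevant retrieved, 5 non-relevant retrieved, 30 relevant missed, rest untouched.
acc, prec, rec = contingency_measures(a=20, b=5, c=30, d=9945)
print(acc, prec, rec)

# A system that retrieves nothing at all still looks almost perfect on accuracy.
print(contingency_measures(a=0, b=0, c=50, d=9950))
```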
Discounted Cumulative Gain (DCG)
o Popular measure for evaluating web search
o Two assumptions:
  - Highly relevant documents are more useful than marginally relevant documents
  - The lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined
66
Discounted Cumulative Gain (DCG)
o Uses graded relevance as a measure of usefulness, or gain, from examining a document
o Gain is accumulated starting at the top of the ranking and can be reduced, or discounted, at lower ranks
o Typical discount is 1/log(rank)
  - With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3
67
Summarize a Ranking with DCG
o Relevance judgments in a scale of [0,r] with r>2
o Cumulative Gain (CG) at rank n
  - Let the ratings of the n documents be r1, r2, …rn (in ranked order)
  - CG = r1+r2+…+rn
o Discounted Cumulative Gain (DCG) at rank n
  - DCG = r1 + r2/log2(2) + r3/log2(3) + … + rn/log2(n) (We may use any base for the logarithm)
\[ DCG_n = rel_1 + \sum_{i=2}^{n} \frac{rel_i}{\log_2 i} \]
68
DCG Example
o 10 ranked documents judged on 0-3 relevance scale:
  3, 2, 3, 0, 0, 1, 2, 2, 3, 0
o discounted gain:
  3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0 = 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0
o discounted cumulative gain (DCG):
  3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61
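The running DCG values of this example can be reproduced with a few lines, as a sketch using a base-2 logarithm and no discount at rank 1. The function name `dcg` is chosen here for illustration.

```python
from math import log2

def dcg(gains):
    """DCG of a ranked list of graded relevance values (rank 1 is undiscounted)."""
    return sum(g if i == 1 else g / log2(i)
               for i, g in enumerate(gains, start=1))

gains = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
running = [round(dcg(gains[:n]), 2) for n in range(1, len(gains) + 1)]
print(running)
```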
69
[Figure: plot against rank (horizontal axis 0 to 12, vertical axis 0 to 15)]
Summarize a Ranking with NDCG
o Normalized Discounted Cumulative Gain (NDCG) at rank n
  - Normalize DCG at rank n by the DCG value at rank n of the ideal ranking
  - Ideal ranking would first return the documents with the highest relevance level, then the next highest relevance level, and so on (we get Max DCG)
o Normalization useful for contrasting queries with varying numbers of relevant documents
o NDCG popular in evaluating web search
\[ NDCG = \frac{DCG}{MaxDCG} \]
70
NDCG Example
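4 documents: d1, d2, d3, d4

rank i   Ideal system (IS)       System 1 (S1)           System 2 (S2)
         document order   r_i    document order   r_i    document order   r_i
1        d4               2      d3               2      d3               2
2        d3               2      d4               2      d2               1
3        d2               1      d2               1      d4               2
4        d1               0      d1               0      d1               0

NDCG_IS = 1.00   NDCG_S1 = 1.00   NDCG_S2 = 0.9203

\[ DCG_{IS} = 2 + \left( \frac{2}{\log_2 2} + \frac{1}{\log_2 3} + \frac{0}{\log_2 4} \right) = 4.6309 \]
\[ DCG_{S1} = 2 + \left( \frac{2}{\log_2 2} + \frac{1}{\log_2 3} + \frac{0}{\log_2 4} \right) = 4.6309 \]
\[ DCG_{S2} = 2 + \left( \frac{1}{\log_2 2} + \frac{2}{\log_2 3} + \frac{0}{\log_2 4} \right) = 4.2619 \]
\[ MaxDCG = DCG_{IS} = 4.6309 \]
71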
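A sketch that reproduces the NDCG values of this example. It assumes that every judged document appears in the ranking, so the ideal ordering can be obtained by sorting the same relevance values in decreasing order; the function names are illustrative.

```python
from math import log2

def dcg(gains):
    """DCG with a base-2 log discount and no discount at rank 1."""
    return sum(g if i == 1 else g / log2(i)
               for i, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG divided by the DCG of the ideal (sorted) ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

system_1 = [2, 2, 1, 0]   # d3, d4, d2, d1
system_2 = [2, 1, 2, 0]   # d3, d2, d4, d1
print(round(dcg(system_2), 4), round(ndcg(system_1), 4), round(ndcg(system_2), 4))
```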
Problem with the test collection methodology
o Building larger test collections along with complete relevance judgments is difficult or impossible
  - requires assessor time, which is very expensive
  - requires many diverse retrieval “runs”
o Recall is difficult if not impossible to get correctly, as there is no way we can find all the relevant documents for each query
o Precision at top n often not stable enough
o Issue:
  - Non-judged documents are assumed non-relevant
  - Can we reuse the test collection later on?
72
bpref measure
o Binary preference-based measure
  - Introduced in 2004
  - Unlike MAP, P@10, and recall and precision, only uses information from judged documents
o A function of how frequently relevant documents are retrieved before non-relevant documents.
\[ \mathrm{bpref} = \frac{1}{R} \sum_{r} \left( 1 - \frac{|n \text{ ranked higher than } r|}{R} \right) \]
R is the number of judged relevant documents, r is a relevant retrieved document, and n is a member of the first R irrelevant retrieved documents. Non-judged documents are ignored.
73
bpref measure
o When comparing systems over test collections with complete judgments, MAP and bpref are reported to be equivalent
o With incomplete judgments, bpref is shown to be more stable
  - We look at what happens when we use fewer queries, more queries
  - We look at what happens when we swap documents in the ranking
74
bpref - Example
Retrieved result set with D2 and D5 being relevant:
D1 D2 D3 (not judged) D4 -------- D5 D6 D7 D8 D9 D10
R=2
bpref = 1/2 [1 - (1/2)] = 0.25 (D5 contributes 1 - 2/2 = 0)
75
bpref - Example
Retrieved result set with D2, D5 and D7 being relevant:
D1 D2 D3 (not judged) D4 (not judged) D5 D6 D7 D8 ---------- D9 D10
R=3
bpref = 1/3 [(1 - 1/3) + (1 - 1/3) + (1 - 2/3)] = 5/9 ≈ 0.56
76
bpref - Example
Retrieved result set with D2, D4, D6 and D9 being relevant:
D1 D2 D3 D4 D6 D7 D8 ---------- D9 D10
R=4
bpref = 1/4 [(1 - 1/4) + (1 - 2/4) + (1 - 2/4)] = 0.4375 (D9 contributes 1 - 4/4 = 0)
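The three examples can be checked with a small sketch of bpref. The encoding of judgments (True for relevant, False for judged non-relevant, None for not judged) and the function name are assumptions made for illustration; the count of non-relevant documents ranked above a relevant one is capped at R, following the definition of n as a member of the first R non-relevant retrieved documents.

```python
def bpref(judgments, num_relevant):
    """judgments: ranked list of True (relevant), False (judged non-relevant)
    or None (not judged). num_relevant: R, the number of judged relevant documents."""
    nonrel_seen, total = 0, 0.0
    for judged in judgments:
        if judged is None:
            continue                      # non-judged documents are ignored
        if judged:
            total += 1 - min(nonrel_seen, num_relevant) / num_relevant
        else:
            nonrel_seen += 1
    return total / num_relevant

T, F, U = True, False, None
example_1 = [F, T, U, F, T, F, F, F, F, F]   # D2, D5 relevant; D3 not judged
example_2 = [F, T, U, U, T, F, T, F, F, F]   # D2, D5, D7 relevant; D3, D4 not judged
example_3 = [F, T, F, T, T, F, F, T, F]      # D1 D2 D3 D4 D6 D7 D8 D9 D10
print(bpref(example_1, 2), round(bpref(example_2, 3), 4), bpref(example_3, 4))
```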
77
Evaluating interaction with the IR systems
o Empirical data involving human users is time consuming to gather and difficult to draw universal conclusions from
o Evaluation metrics for user interaction (interface)
  - Time required to learn the system
  - Time to achieve goals on benchmark tasks
  - Error rates
  - Retention of the use of the interface over time
  - User satisfaction
78
Why significance testing
o System A beats System B on one query
  - Is it just a lucky query for System A?
  - Maybe System B does better on some other query?
  - Needs as many queries as possible
    Empirical research suggests 25 is the minimum needed
    TREC tracks generally aim for at least 50 queries
o Systems A and B identical on all but one query
  - If System A beats System B by enough on that one query, the average will make A look better than B
    As above, this could just be a lucky break for System A
  - Need A to beat B frequently to believe A is really better
o System A is only 0.00001% better than System B
  - Even if true in all queries, does it mean much?
o Significance testing considers those issues
79
Significance tests
o Are observed differences statistically significant?
  - Make use of statistics
o Generally we cannot make assumptions about underlying distribution
  - Most significance tests do make such assumptions
o Significance tests are easier to do on single-valued effectiveness measures (MAP, bpref)
o Example: Sign test
  - Does not require that data be normally distributed
  - For techniques A and B, compare average precision for each pair of results generated by queries in the test collection
  - If difference is large enough, count as + or -, otherwise ignore
  - Use number of +’s and the number of significant differences to determine significance level
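A minimal sketch of a two-sided sign test, using an exact binomial tail under the null hypothesis that + and - are equally likely. The function name, the `min_diff` threshold parameter and the per-query average precision values are all hypothetical.

```python
from math import comb

def sign_test(scores_a, scores_b, min_diff=0.0):
    """Return (#plus, #minus, two-sided p-value) for paired per-query scores.

    Differences no larger than min_diff in absolute value are ignored.
    """
    plus = sum(1 for a, b in zip(scores_a, scores_b) if a - b > min_diff)
    minus = sum(1 for a, b in zip(scores_a, scores_b) if b - a > min_diff)
    n = plus + minus
    if n == 0:
        return plus, minus, 1.0
    k = max(plus, minus)
    # Probability of k or more successes out of n fair coin flips, doubled.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return plus, minus, min(1.0, 2 * tail)

# Hypothetical average precision per query for techniques A and B.
ap_a = [0.62, 0.44, 0.71, 0.30, 0.55, 0.48, 0.80, 0.25, 0.66, 0.52]
ap_b = [0.50, 0.40, 0.60, 0.35, 0.45, 0.41, 0.70, 0.20, 0.58, 0.47]
plus, minus, p_value = sign_test(ap_a, ap_b)
print(plus, minus, round(p_value, 4))
```

With 9 wins out of 10 for A, the p-value is below the usual 0.05 threshold, so the difference would be reported as significant for these made-up scores.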
80
Measures for large-scale systems … web search
o Typical user behavior in web search shows a preference for high precision
o Graded scales of relevance seem more useful than binary → NDCG
o Recall difficult to measure on the web
  - Often use precision at top k, such as k=5, k=10, …
o . . . or measures that reward you more for getting rank 1 right than for getting rank 10 right → NDCG
o Use non-relevance-based datasets such as click-through data (query logs)
o A/B testing
81
A/B testing
o Test a single new “innovation”
o Have most users use the old system
o Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation
o Evaluate with an “automatic” measure like click-through rates
o Now we can directly see if the innovation does improve retrieval performance (e.g. click-through rate)
o Probably the evaluation methodology that large search engines trust most
Sec. 8.6.3
82
Bias in where users click
# of clicks received
Strong position bias, so absolute click rates unreliable
83
Relative vs absolute ratings
Hard to conclude Result1 > Result3 Probably can conclude Result3 > Result2
User click sequence
pairwise relative rating instead of individual rating
Assess in terms of conformance with historical pairwise preferences recorded from user clicks
84
Comparing two rankings via clicks and interleaving method (Joachims, 2002)
Query: [support vector machines]

System A            System B
Kernel machines     Kernel machines
SVM-light           SVMs
Lucent SVM demo     Intro to SVMs
Royal Holl. SVM     Archives of SVM
SVM software        SVM-light
SVM tutorial        SVM software
85
Interleave the two rankings and remove duplicates
[Figure: the top results of the two rankings (Kernel machines, SVM-light, Lucent SVM demo, Royal Holl. SVM; Kernel machines, SVMs, Intro to SVMs, Archives of SVM, SVM-light) merged into a single interleaved list]
86
Count user clicks
87
Clicks
Rankings credited for each clicked result: A, B; A; A
Ranking A: 3   Ranking B: 1
→ System A is better than System B
88
Evaluation of classifiers
o Focus on measuring effectiveness rather than efficiency
o We recall that:
  - Effectiveness is the ability to make the right classification decision
  - Efficiency is concerned with time and space requirements
89
Evaluation of classifiers
o After a classifier is constructed using a training set, the effectiveness is evaluated using a test set
o For each category ci, we calculate the following sets: - TPi: true positives - FPi: false positives - TNi: true negatives - FNi: false negatives
90
True and false positives with respect to a category
o TPi: true positives with respect to category ci
  - the set of documents that both the classifier and the previous judgments (as recorded in the test set) classify under ci
o FPi: false positives with respect to category ci
  - the set of documents that the classifier classifies under ci, but the test set indicates that they do not belong to ci
91
True and false negatives with respect to a category
o TNi: true negatives with respect to category ci
  - both the classifier and the test set agree that the documents in TNi do not belong to ci
o FNi: false negatives with respect to category ci
  - the classifier does not classify the documents in FNi under ci, but the test set indicates that they should be classified under ci
92
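Evaluation measures for classifiers
o Precision with respect to category ci
\[ P_i = \frac{TP_i}{TP_i + FP_i} \]
o Recall with respect to category ci
\[ R_i = \frac{TP_i}{TP_i + FN_i} \]
[Figure: the set “Classified ci” (what the classifier returns) overlapping the set “Test Class ci” (what it should return), giving the regions TPi, FPi, FNi and TNi]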
93
Evaluation measures for classifiers
o For obtaining estimates of precision and recall in the collection as a whole, two different methods may be adopted:
  - Micro-averaging: counts for true positives, false positives and false negatives for all categories are first summed up; precision and recall are then calculated using the global values
  - Macro-averaging: average of precision (recall) for individual categories
94
Micro- vs macro-averaging
o microaveraging and macroaveraging may give quite different results, if the different categories have very different generality
o e.g. the ability of a classifier to behave well also on categories with low generality (i.e. categories with few positive training instances) will be emphasized by macroaveraging
o choice depends on the application
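The difference between the two averaging methods can be sketched with two hypothetical categories, one frequent and well classified, one rare and poorly classified. The counts and the function name `micro_macro` are invented for illustration.

```python
def micro_macro(counts):
    """counts: list of (TP, FP, FN) tuples, one per category.

    Returns (micro precision, micro recall, macro precision, macro recall).
    """
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    # Micro: sum the counts over categories first, then compute the measures.
    micro_p, micro_r = tp / (tp + fp), tp / (tp + fn)
    # Macro: compute the measures per category, then average them.
    macro_p = sum(t / (t + f) for t, f, _ in counts) / len(counts)
    macro_r = sum(t / (t + n) for t, _, n in counts) / len(counts)
    return micro_p, micro_r, macro_p, macro_r

# Hypothetical counts: a high-generality category and a low-generality one.
counts = [(90, 10, 10), (1, 1, 9)]
micro_p, micro_r, macro_p, macro_r = micro_macro(counts)
print(round(micro_p, 3), round(micro_r, 3), round(macro_p, 3), round(macro_r, 3))
```

The micro-averaged values are dominated by the frequent category, while the macro-averaged values drop sharply because the rare category counts as much as the frequent one.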
Conclusions … a few words
o Here we focused solely on system-oriented evaluation. We should not forget about user-oriented evaluation.
o Here we focused on batch-style evaluation. We should not forget that search is part of a bigger task.
o In the end, it is all about making the users “happy”. We should not forget about long-term engagement.
o Lots of work and research has looked beyond precision and recall, in terms of validations, extensions or alternatives.
o Lots of work, such as “significance testing”, so that we can be sure that IR system A is indeed better than IR system B.
o Here we focused on “document” and text. We should not forget multimedia, mobile, social media, etc., where evaluating effectiveness may mean something a bit different.
95