A new measure of retrieval effectiveness (Or: What’s wrong with precision and recall)
Stefano Mizzaro
Department of Mathematics and Computer Science
University of Udine
[email protected]
http://www.dimi.uniud.it/~mizzaro
S. Mizzaro - A new measure of retrieval effectiveness (Or: What's wrong...) 2/16
Outline
Introduction: Measures of retrieval effectiveness... motivation for...
...a new measure: Average Distance Measure (ADM)
Discussion
– Theoretical and practical adequacy of ADM
– ADM vs. precision and recall
– Problems with P & R
Conclusions and future work
From binary to continuous relevance & retrieval
[Figure: the documents database partitioned by binary relevance (relevant / not relevant) and binary retrieval (retrieved / not retrieved) [Salton & McGill, 84], next to continuous axes running from “less” to “more” relevant and from “less” to “more” retrieved.]
Continuous relevance & retrieval
[Figure: the URE-SRE plane, both axes from 0 to 1.0 with 0.5 marked; SRE runs from “less” to “more” retrieved, URE from “less” to “more” relevant.]
• SRE = System Relevance Estimate (aka RSV)
• URE = User Relevance Estimate
Thresholds on URE & SRE: why?
[Figure: the URE-SRE plane (both axes from 0 to 1.0) cut by thresholds at 0.5 into four regions labelled “Retrieved & relevant?”, “Nonretrieved & relevant?”, “Nonretrieved & nonrelevant?”, “Retrieved & nonrelevant?”.]
P = RetRel / (RetRel + RetNRel)
R = RetRel / (RetRel + NRetRel)
... and historical reasons
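The four regions and the two formulas above can be sketched in Python as follows. This is only an illustration: the function name `precision_recall`, the use of `>=` at the 0.5 thresholds, the convention of returning 0.0 for an empty denominator, and the sample values are all assumptions, not something the slides specify.

```python
def precision_recall(sre, ure, sre_threshold=0.5, ure_threshold=0.5):
    """Dichotomize SRE/URE values and compute (P, R).

    Assumption: a document is "retrieved" when SRE >= sre_threshold
    and "relevant" when URE >= ure_threshold.
    """
    ret_rel = ret_nrel = nret_rel = 0
    for s, u in zip(sre, ure):
        retrieved = s >= sre_threshold
        relevant = u >= ure_threshold
        if retrieved and relevant:
            ret_rel += 1
        elif retrieved:
            ret_nrel += 1
        elif relevant:
            nret_rel += 1
    # Assumed convention: report 0.0 when nothing is retrieved / relevant.
    p = ret_rel / (ret_rel + ret_nrel) if ret_rel + ret_nrel else 0.0
    r = ret_rel / (ret_rel + nret_rel) if ret_rel + nret_rel else 0.0
    return p, r


# Hypothetical SRE and URE values for four documents.
sre = [0.9, 0.7, 0.2, 0.6]
ure = [0.8, 0.3, 0.9, 0.6]
p, r = precision_recall(sre, ure)
print(p, r)
```

Here two documents are retrieved and relevant, one is retrieved but not relevant, and one is relevant but not retrieved, so both P and R come out as 2/3.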
Average Distance Measure (ADM)
SRE: $SRE_q(d_i)$
URE: $URE_q(d_i)$
ADM = average “distance” between URE and SRE values
$$ADM_q = 1 - \frac{\sum_{d_i \in D} \left| SRE_q(d_i) - URE_q(d_i) \right|}{|D|}$$
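A minimal Python sketch of this definition, reading ADM for a query as one minus the mean absolute difference between the SRE and URE values over the documents in D (a reading that agrees with the ADM values in the example table of the deck). The function name `adm` and the sample values are assumptions made for illustration.

```python
def adm(sre, ure):
    """ADM = 1 - mean(|SRE(d) - URE(d)|) over the documents d.

    `sre` and `ure` are equal-length sequences of values in [0, 1],
    aligned so that position i refers to the same document.
    """
    if len(sre) != len(ure) or len(sre) == 0:
        raise ValueError("need two equal-length, non-empty sequences")
    return 1 - sum(abs(s - u) for s, u in zip(sre, ure)) / len(sre)


# Hypothetical values: one document estimated exactly, one missed completely.
print(adm([0.5, 1.0], [0.5, 0.0]))
```

With SRE and URE both in [0, 1], the result is also in [0, 1], and it equals 1 only when every document is estimated exactly.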
ADM: graphical representation
[Figure: the URE-SRE plane, both axes from 0 to 1.0, with the “Exactly evaluated” documents marked.]
ADM: An example
[Figure: two URE-SRE plots (axes from 0 to 1.0) showing the example documents.]
Docs.   d1    d2    d3    ADM
IRS1    0.9   0.5   0.2   0.9
IRS2    1.0   0.6   0.3   0.8
IRS3    0.8   0.4   1.0   0.7
URE     0.8   0.4   0.1
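The ADM column can be checked by hand. The arithmetic below assumes that ADM is one minus the mean absolute difference between SRE and URE over the three documents, with URE = (0.8, 0.4, 0.1) for (d1, d2, d3) and the SRE values of each system read from the table:

```latex
\begin{align*}
ADM(\mathrm{IRS1}) &= 1 - \frac{|0.9-0.8| + |0.5-0.4| + |0.2-0.1|}{3} = 1 - \frac{0.3}{3} = 0.9 \\
ADM(\mathrm{IRS2}) &= 1 - \frac{|1.0-0.8| + |0.6-0.4| + |0.3-0.1|}{3} = 1 - \frac{0.6}{3} = 0.8 \\
ADM(\mathrm{IRS3}) &= 1 - \frac{|0.8-0.8| + |0.4-0.4| + |1.0-0.1|}{3} = 1 - \frac{0.9}{3} = 0.7
\end{align*}
```

IRS3 is exact on d1 and d2 but badly over-estimates d3, and that single large error costs it more than the uniform small errors of IRS1 and IRS2.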
Adequacy of ADM
One single number
Allows complete ordering of different performances
...
ADM vs. P & R
– No hyper-sensitiveness to small variations close to borders
– No lack of sensitiveness to big variations inside “equivalence” regions
– Wrong thresholds
Hyper-sensitiveness: Three very similar IRSs
[Figure: the URE-SRE plane (axes from 0 to 1.0) with thresholds at 0.5 and ticks at 0.49 and 0.5.]
        P      R     E      ADM
IRS1    0.67   1     0.84   0.83
IRS2    1      0.5   0.75   0.83
IRS3    0.5    0.5   0.5    0.826
P, R, E: unstable; ADM: stable
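The effect can be reproduced with a small Python sketch. The data below is made up for illustration and is not the data behind the slide: three hypothetical systems differ only by 0.02 on documents that sit next to the 0.5 threshold. The helper names, the `>=` threshold convention, and the system labels are likewise assumptions.

```python
def adm(sre, ure):
    """One minus the mean absolute SRE-URE difference."""
    return 1 - sum(abs(s - u) for s, u in zip(sre, ure)) / len(sre)


def precision_recall(sre, ure, threshold=0.5):
    """P and R after thresholding (assumes something is retrieved and relevant)."""
    retrieved = [s >= threshold for s in sre]
    relevant = [u >= threshold for u in ure]
    ret_rel = sum(a and b for a, b in zip(retrieved, relevant))
    return ret_rel / sum(retrieved), ret_rel / sum(relevant)


# Hypothetical URE values: two documents lie just beside the threshold.
ure = [0.8, 0.51, 0.49]
systems = {
    "IRS_A": [0.8, 0.51, 0.51],
    "IRS_B": [0.8, 0.49, 0.49],
    "IRS_C": [0.8, 0.49, 0.51],
}

results = {}
for name, sre in systems.items():
    p, r = precision_recall(sre, ure)
    results[name] = (p, r, adm(sre, ure))
    print(f"{name}: P={p:.2f} R={r:.2f} ADM={results[name][2]:.3f}")
```

With these values P swings between 0.5 and 1 and R between 0.5 and 1, while the three ADM values stay within 0.01 of each other.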
Lack of sensitiveness: two very different IRSs
[Figure: the URE-SRE plane, axes from 0 to 1.0 with thresholds at 0.5.]
        P   R   E   ADM
IRS1    1   1   1   1
IRS2    1   1   1   0.5
P, R, E: stable; ADM: unstable
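The opposite case can be sketched the same way. Again the data is hypothetical: one system estimates both documents exactly, the other places them barely on the correct side of the 0.5 threshold. The helper names and the `>=` threshold convention are assumptions.

```python
def adm(sre, ure):
    """One minus the mean absolute SRE-URE difference."""
    return 1 - sum(abs(s - u) for s, u in zip(sre, ure)) / len(sre)


def precision_recall(sre, ure, threshold=0.5):
    """P and R after thresholding (assumes something is retrieved and relevant)."""
    retrieved = [s >= threshold for s in sre]
    relevant = [u >= threshold for u in ure]
    ret_rel = sum(a and b for a, b in zip(retrieved, relevant))
    return ret_rel / sum(retrieved), ret_rel / sum(relevant)


ure = [1.0, 0.0]            # hypothetical: one fully relevant, one not at all
exact_sre = [1.0, 0.0]      # estimates both documents exactly
barely_sre = [0.51, 0.49]   # right side of the threshold, but far from URE

pr_exact = precision_recall(exact_sre, ure)
pr_barely = precision_recall(barely_sre, ure)
adm_exact = adm(exact_sre, ure)
adm_barely = adm(barely_sre, ure)
print(pr_exact, adm_exact)
print(pr_barely, adm_barely)
```

Both systems get P = R = 1 because thresholding discards how far each estimate is from the user's judgment; ADM keeps that information and drops from 1 to about 0.51.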
Again on the thresholds...
[Figure: the URE-SRE plane (both axes from 0 to 1.0) cut by thresholds at 0.5 into four regions labelled “Retrieved & relevant?”, “Nonretrieved & relevant?”, “Nonretrieved & nonrelevant?”, “Retrieved & nonrelevant?”.]
The “right” thresholds
[Figure: the URE-SRE plane (axes from 0 to 1.0, 0.5 marked) divided into three regions: OverEvaluated, UnderEvaluated, and Correctly Evaluated.]
E = CE / (OE + UE)
ADM in practice
How to get URE values? Either
– asking the judge(s) to directly express continuous relevance judgments (feasible, literature evidence), or
– averaging dichotomous/discrete relevance judgments
UREs for all the documents in the database? Impossible!!
– Sampling (that takes place with P & R too, anyway)
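The second option, averaging dichotomous judgments, could look like the sketch below. The function name `ure_from_judgments`, the dictionary layout, and the judgments themselves are hypothetical; the slides do not say how many judges are used or how their judgments are combined beyond "averaging".

```python
from statistics import mean


def ure_from_judgments(judgments):
    """Map each document to the mean of its judgments.

    `judgments` maps a document id to a list of judgments, each either
    0/1 (dichotomous) or a discrete grade already scaled to [0, 1].
    """
    return {doc: mean(votes) for doc, votes in judgments.items()}


# Hypothetical binary judgments from four judges on three sampled documents.
judgments = {
    "d1": [1, 1, 1, 0],
    "d2": [0, 1, 0, 0],
    "d3": [0, 0, 0, 0],
}
ure = ure_from_judgments(judgments)
print(ure)
```

The resulting values (0.75, 0.25, and 0 here) can then be used as the URE side of the ADM computation for the sampled documents.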
Conclusions
ADM, a new measure of retrieval effectiveness
– Adequacy
– Improvements w.r.t. P & R: avoids hyper-sensitiveness and lack of sensitiveness
– Practical usability (continuous relevance judgments, sampling)
Very preliminary work
Future work
Theoretical variations and improvements
– Standard deviation in place of the difference of absolute values?
– Which sampling?
Re-examine the data of some evaluation experiments (any volunteers?)
Using ADM in real life