A new measure of retrieval effectiveness (Or: What’s wrong with precision and recall)

Stefano Mizzaro

Department of Mathematics and Computer Science, University of Udine

[email protected] http://www.dimi.uniud.it/~mizzaro


Outline

Introduction: Measures of retrieval effectiveness... motivation for...

...a new measure: Average Distance Measure (ADM)

Discussion
– Theoretical and practical adequacy of ADM
– ADM vs. precision and recall
– Problems with P & R

Conclusions and future work


From binary to continuous relevance & retrieval

[Figure: the documents database split into Retrieved / Not retrieved and Relevant / Not relevant [Salton & McGill, 84]; the binary categories are relaxed into “less” / “more” retrieved and “less” / “more” relevant.]


Continuous relevance & retrieval

[Figure: URE vs. SRE plane; both axes run from 0 to 1.0 with a tick at 0.5, going from “less” relevant to “more” relevant.]

• SRE = System Relevance Estimate (aka RSV)

• URE = User Relevance Estimate


Thresholds on URE & SRE: why?

[Figure: URE vs. SRE plane, both axes from 0 to 1.0 (“less” relevant to “more” relevant), cut at 0.5 into four regions: Retrieved & relevant? Nonretrieved & relevant? Nonretrieved & nonrelevant? Retrieved & nonrelevant?]

P = RetRel / (RetRel + RetNRel)

R = RetRel / (RetRel + NRetRel)

... and historical reasons
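A minimal sketch of how the four regions, and then P and R, could be computed from continuous values. The function names, the `>=` convention, the default threshold of 0.5 on both axes, and the sample values are assumptions made for illustration; the slides do not fix any of them.

```python
def region_counts(sre, ure, sre_threshold=0.5, ure_threshold=0.5):
    """Count documents in the four regions cut out by the two thresholds.

    Assumed convention: a document is "retrieved" when SRE >= sre_threshold
    and "relevant" when URE >= ure_threshold.
    """
    counts = {"RetRel": 0, "RetNRel": 0, "NRetRel": 0, "NRetNRel": 0}
    for s, u in zip(sre, ure):
        retrieved_part = "Ret" if s >= sre_threshold else "NRet"
        relevant_part = "Rel" if u >= ure_threshold else "NRel"
        counts[retrieved_part + relevant_part] += 1
    return counts


def precision_recall(counts):
    """P = RetRel / (RetRel + RetNRel), R = RetRel / (RetRel + NRetRel)."""
    retrieved = counts["RetRel"] + counts["RetNRel"]
    relevant = counts["RetRel"] + counts["NRetRel"]
    # Returning 0.0 for an empty denominator is a choice made in this sketch.
    precision = counts["RetRel"] / retrieved if retrieved else 0.0
    recall = counts["RetRel"] / relevant if relevant else 0.0
    return precision, recall


# Hypothetical values for four documents, one per region.
ure = [0.9, 0.7, 0.3, 0.1]
sre = [0.8, 0.2, 0.6, 0.4]
counts = region_counts(sre, ure)
p, r = precision_recall(counts)
print(counts, p, r)
```

Every document inside a region counts the same, however far it lies from the thresholds.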


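Average Distance Measure (ADM)

ADM = average “distance” between URE and SRE values, for a query q over the document set D:

$$\mathrm{ADM}_q = 1 - \frac{\sum_{d_i \in D} \left| \mathrm{SRE}_q(d_i) - \mathrm{URE}_q(d_i) \right|}{|D|}$$

SRE: $\mathrm{SRE}_q(d_i)$

URE: $\mathrm{URE}_q(d_i)$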
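A short sketch of the computation, assuming ADM is one minus the mean absolute difference between the SRE and URE values of the documents (this reading reproduces the ADM values in the example slide). The function name and the sample values are hypothetical.

```python
def average_distance_measure(sre, ure):
    """ADM = 1 - mean absolute difference between SRE and URE values."""
    if len(sre) != len(ure) or not ure:
        raise ValueError("sre and ure must be non-empty and of equal length")
    total_distance = sum(abs(s - u) for s, u in zip(sre, ure))
    return 1 - total_distance / len(ure)


# Hypothetical values for three documents.
ure = [1.0, 0.5, 0.0]
sre = [0.5, 0.5, 0.5]
# Distances are 0.5, 0 and 0.5, so ADM is 1 - 1/3, about 0.667.
print(average_distance_measure(sre, ure))
```

No threshold appears anywhere: each document contributes in proportion to how far the system's estimate is from the user's.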


ADM: graphical representation

[Figure: URE vs. SRE plane, both axes from 0 to 1.0, with the label “Exactly evaluated”.]


ADM: An example

[Figure: two URE vs. SRE plots, both axes from 0 to 1.0, with ticks at the URE values (0.1, 0.4, 0.8) and SRE values (0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9, 1.0) of the example.]

Docs.   d1    d2    d3    ADM
IRS1    0.9   0.5   0.2   0.9
IRS2    1.0   0.6   0.3   0.8
IRS3    0.8   0.4   1.0   0.7
URE     0.8   0.4   0.1
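The ADM values of the example can be checked by hand, assuming ADM is one minus the mean absolute difference between SRE and URE over the three documents d1, d2, d3 (URE values 0.8, 0.4, 0.1):

```latex
\begin{aligned}
\mathrm{ADM}(\mathrm{IRS1}) &= 1 - \tfrac{|0.9-0.8| + |0.5-0.4| + |0.2-0.1|}{3} = 1 - \tfrac{0.3}{3} = 0.9\\
\mathrm{ADM}(\mathrm{IRS2}) &= 1 - \tfrac{|1.0-0.8| + |0.6-0.4| + |0.3-0.1|}{3} = 1 - \tfrac{0.6}{3} = 0.8\\
\mathrm{ADM}(\mathrm{IRS3}) &= 1 - \tfrac{|0.8-0.8| + |0.4-0.4| + |1.0-0.1|}{3} = 1 - \tfrac{0.9}{3} = 0.7
\end{aligned}
```

IRS3 estimates two documents exactly but misses d3 by 0.9, and that single large error costs more than the uniform small errors of IRS1 and IRS2.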


Adequacy of ADM

One single number

Allows complete ordering of different performances...

ADM vs. P & R

– No hyper-sensitiveness to small variations close to borders

– No lack of sensitiveness to big variations inside “equivalence” regions

– Wrong thresholds


Hyper-sensitiveness: Three very similar IRSs

[Figure: URE vs. SRE plane, both axes from 0 to 1.0, with ticks at 0.49 and 0.5 on each axis.]

        P      R     E      ADM
IRS1    0.67   1     0.84   0.83
IRS2    1      0.5   0.75   0.83
IRS3    0.5    0.5   0.5    0.826

P, R, E: unstable

ADM: stable
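The effect can be reproduced with a small sketch. The data below are hypothetical (they are not the points plotted on the slide), and the `>=` convention with a 0.5 threshold on both URE and SRE is an assumption.

```python
def precision_recall(sre, ure, threshold=0.5):
    """P and R after thresholding both estimates (assumed: >= threshold)."""
    retrieved = [s >= threshold for s in sre]
    relevant = [u >= threshold for u in ure]
    ret_rel = sum(1 for ret, rel in zip(retrieved, relevant) if ret and rel)
    precision = ret_rel / sum(retrieved) if any(retrieved) else 0.0
    recall = ret_rel / sum(relevant) if any(relevant) else 0.0
    return precision, recall


def adm(sre, ure):
    """One minus the mean absolute difference between SRE and URE."""
    return 1 - sum(abs(s - u) for s, u in zip(sre, ure)) / len(ure)


# Hypothetical data: two documents sit right at the 0.5 border.
ure = [0.8, 0.5, 0.49]
system_a = [0.8, 0.5, 0.49]  # agrees with the user exactly
system_b = [0.8, 0.49, 0.5]  # two SREs moved by only 0.01

for name, sre in [("A", system_a), ("B", system_b)]:
    p, r = precision_recall(sre, ure)
    print(name, p, r, round(adm(sre, ure), 4))
```

Under these assumptions, system A gets P = R = 1 while system B drops to P = R = 0.5, although the two differ by 0.01 on two documents; their ADM values differ by less than 0.01.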


Lack of sensitiveness: two very different IRSs

[Figure: URE vs. SRE plane, both axes from 0 to 1.0 with a tick at 0.5.]

        P    R    E    ADM
IRS1    1    1    1    1
IRS2    1    1    1    0.5

P, R, E: stable

ADM: unstable
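A worked example of the same effect, using hypothetical numbers rather than the points on the slide. Take two documents with URE values 1.0 and 0.0, a 0.5 threshold on both axes, system A with SRE values (1.0, 0.0) and system B with SRE values (0.6, 0.4). Both systems retrieve only the first document, which is also the only relevant one, so P and R cannot tell them apart:

```latex
\begin{aligned}
P_A = P_B &= \tfrac{1}{1} = 1, \qquad R_A = R_B = \tfrac{1}{1} = 1\\
\mathrm{ADM}_A &= 1 - \tfrac{|1.0-1.0| + |0.0-0.0|}{2} = 1\\
\mathrm{ADM}_B &= 1 - \tfrac{|0.6-1.0| + |0.4-0.0|}{2} = 1 - \tfrac{0.8}{2} = 0.6
\end{aligned}
```

ADM separates the two systems because it registers movement inside a region, which thresholded measures ignore.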


Again on the thresholds...

[Figure: URE vs. SRE plane, both axes from 0 to 1.0 (“less” relevant to “more” relevant), cut at 0.5 into four regions: Retrieved & relevant? Nonretrieved & relevant? Nonretrieved & nonrelevant? Retrieved & nonrelevant?]


The “right” thresholds

[Figure: URE vs. SRE plane, both axes from 0 to 1.0 with a tick at 0.5 (“less” relevant to “more” relevant), divided into three regions: OverEvaluated (OE), UnderEvaluated (UE), and Correctly Evaluated (CE).]

E = CE / (OE + UE)


ADM in practice

How to get URE values? Either

– asking the judge(s) to directly express continuous relevance judgments (feasible, literature evidence), or

– averaging dichotomous/discrete relevance judgments

UREs for all the documents in the database? Impossible!!
– Sampling (that takes place with P & R too, anyway)
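A sketch of the second option together with sampling. The function names, the judgments, and the system scores are hypothetical. Simple random sampling is only one possible choice, since the slides leave the sampling scheme open, and ADM is taken to be one minus the mean absolute SRE/URE difference.

```python
import random


def ure_from_judgments(judgments):
    """Average dichotomous judgments (0 or 1) from several judges into one URE."""
    return sum(judgments) / len(judgments)


def sample_adm(sre_by_doc, ure_by_doc, sample_size, seed=0):
    """ADM over a simple random sample of the judged documents."""
    rng = random.Random(seed)
    doc_ids = rng.sample(sorted(ure_by_doc), sample_size)
    distance = sum(abs(sre_by_doc[d] - ure_by_doc[d]) for d in doc_ids)
    return 1 - distance / sample_size


# Hypothetical binary judgments from four judges per document.
binary_judgments = {"d1": [1, 1, 1, 0], "d2": [1, 0, 0, 0], "d3": [0, 0, 0, 0]}
ure_by_doc = {d: ure_from_judgments(j) for d, j in binary_judgments.items()}
sre_by_doc = {"d1": 0.75, "d2": 0.5, "d3": 0.25}  # hypothetical system scores

print(ure_by_doc)
print(sample_adm(sre_by_doc, ure_by_doc, sample_size=3))
```

With only a few judges the averaged URE takes few distinct values (here multiples of 0.25), so the granularity of the resulting ADM depends on how many judgments are collected per document.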


Conclusions

ADM, a new measure of retrieval effectiveness
– Adequacy
– Improvements w.r.t. P & R: avoids hyper-sensitiveness and lack of sensitiveness
– Practical usability (continuous relevance judgments, sampling)

Very preliminary work


Future work

Theoretical variations and improvements
– Standard deviation in place of the absolute value of the difference?
– Which sampling?

Re-examine the data of some evaluation experiments (any volunteers?)

Using ADM in real life