A new measure of retrieval effectiveness (Or: What’s wrong with precision and recall)
Stefano Mizzaro
Department of Mathematics and Computer Science
University of Udine
[email protected]
http://www.dimi.uniud.it/~mizzaro



S. Mizzaro - A new measure of retrieval effectiveness (Or: What's wrong...) 2/16

Outline

Introduction: measures of retrieval effectiveness... motivation for...

...a new measure: Average Distance Measure (ADM)

Discussion
– Theoretical and practical adequacy of ADM
– ADM vs. precision and recall
– Problems with P & R

Conclusions and future work


From binary to continuous relevance & retrieval

[Figure: the documents database partitioned by binary retrieval (retrieved / not retrieved) and binary relevance (relevant / not relevant) [Salton & McGill, 84], generalized to continuous scales running from “less” to “more” retrieved and from “less” to “more” relevant.]


Continuous relevance & retrieval

[Figure: unit square with axes URE and SRE, each running from 0 to 1.0 with a mark at 0.5. Retrieval ranges from “less” to “more” retrieved; relevance ranges from “less” to “more” relevant.]

• SRE = System Relevance Estimate (aka RSV)

• URE = User Relevance Estimate


Thresholds on URE & SRE: why?

[Figure: the URE/SRE unit square split by thresholds (drawn at 0.5 on each axis) into four regions, each labeled with a question mark: retrieved & relevant? nonretrieved & relevant? nonretrieved & nonrelevant? retrieved & nonrelevant?]

P = RetRel / (RetRel + RetNRel)
R = RetRel / (RetRel + NRetRel)

... and historical reasons
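The thresholding step behind P and R can be sketched in a few lines of Python. This is only an illustration: the function name `precision_recall`, the 0.5 thresholds, the `>=` convention at the border, the 0.0 returned for an empty denominator, and the sample values are all assumptions, not part of the slides.

```python
from typing import Sequence, Tuple


def precision_recall(
    ure: Sequence[float],
    sre: Sequence[float],
    ure_threshold: float = 0.5,  # assumed: URE at or above this counts as "relevant"
    sre_threshold: float = 0.5,  # assumed: SRE at or above this counts as "retrieved"
) -> Tuple[float, float]:
    """Binarize continuous URE/SRE values, then compute P and R."""
    ret_rel = ret_nrel = nret_rel = 0
    for u, s in zip(ure, sre):
        relevant = u >= ure_threshold
        retrieved = s >= sre_threshold
        if retrieved and relevant:
            ret_rel += 1
        elif retrieved:
            ret_nrel += 1
        elif relevant:
            nret_rel += 1
    # P = RetRel / (RetRel + RetNRel); R = RetRel / (RetRel + NRetRel)
    precision = ret_rel / (ret_rel + ret_nrel) if (ret_rel + ret_nrel) else 0.0
    recall = ret_rel / (ret_rel + nret_rel) if (ret_rel + nret_rel) else 0.0
    return precision, recall


# Hypothetical URE/SRE values for four documents.
ure = [0.9, 0.7, 0.2, 0.6]
sre = [0.8, 0.3, 0.6, 0.55]
p, r = precision_recall(ure, sre)
print(p, r)
```

Once the thresholds are applied, a document at 0.51 and one at 1.0 are indistinguishable, which is the information loss the slide questions.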


Average Distance Measure (ADM)

SRE: SRE_q(d_i)

URE: URE_q(d_i)

ADM = average “distance” between URE and SRE values

\[ ADM_q = 1 - \frac{\sum_{d_i \in D} \left| SRE_q(d_i) - URE_q(d_i) \right|}{|D|} \]
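A minimal Python sketch of the measure, assuming ADM is one minus the mean absolute difference between SRE and URE over the document set D for a query q (this reading is consistent with the ADM values in the worked example on a later slide). The function name `adm` and the sample values are hypothetical.

```python
from typing import Sequence


def adm(ure: Sequence[float], sre: Sequence[float]) -> float:
    """Average Distance Measure for one query.

    Assumed form: 1 - (sum over documents of |SRE - URE|) / |D|,
    with both estimates in [0, 1].
    """
    if len(ure) != len(sre):
        raise ValueError("ure and sre must cover the same documents")
    if not ure:
        raise ValueError("at least one document is required")
    total_distance = sum(abs(s - u) for u, s in zip(ure, sre))
    return 1.0 - total_distance / len(ure)


# Hypothetical estimates for three documents.
ure = [1.0, 0.5, 0.0]
perfect = adm(ure, [1.0, 0.5, 0.0])   # the system agrees with the user everywhere
inverted = adm(ure, [0.0, 0.5, 1.0])  # the system reverses the two extreme documents
print(perfect, inverted)
```

Under this form, a system that reproduces every URE scores 1, and the score drops linearly with the average disagreement.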


ADM: graphical representation

[Figure: URE/SRE unit square (axes from 0 to 1.0) with the label “Exactly evaluated”.]


ADM: An example

[Figure: the documents' URE and SRE values plotted in the URE/SRE unit square.]

Docs.   d1    d2    d3    ADM
IRS1    0.9   0.5   0.2   0.9
IRS2    1.0   0.6   0.3   0.8
IRS3    0.8   0.4   1.0   0.7
URE     0.8   0.4   0.1
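The ADM column can be checked by hand. The arithmetic below reads the table as URE = (0.8, 0.4, 0.1) for d1, d2, d3, and assumes ADM is one minus the mean absolute difference between SRE and URE; under that assumption it reproduces the three values on the slide.

```latex
\begin{align*}
ADM(\mathrm{IRS1}) &= 1 - \frac{|0.9-0.8| + |0.5-0.4| + |0.2-0.1|}{3} = 1 - \frac{0.3}{3} = 0.9 \\
ADM(\mathrm{IRS2}) &= 1 - \frac{|1.0-0.8| + |0.6-0.4| + |0.3-0.1|}{3} = 1 - \frac{0.6}{3} = 0.8 \\
ADM(\mathrm{IRS3}) &= 1 - \frac{|0.8-0.8| + |0.4-0.4| + |1.0-0.1|}{3} = 1 - \frac{0.9}{3} = 0.7
\end{align*}
```

IRS3 is exact on two documents but badly over-evaluates d3, and that single large error is enough to place it below the two systems that are slightly off everywhere.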


Adequacy of ADM

One single number
Allows complete ordering of different performances
...

ADM vs. P & R
– No hyper-sensitiveness to small variations close to borders
– No lack of sensitiveness to big variations inside “equivalence” regions
– Wrong thresholds


Hyper-sensitiveness: Three very similar IRSs

[Figure: URE/SRE unit square with marks at 0.49 and 0.5 on each axis; the plotted values lie close to the threshold borders.]

        P      R     E      ADM
IRS1    0.67   1     0.84   0.83
IRS2    1      0.5   0.75   0.83
IRS3    0.5    0.5   0.5    0.826

P, R, E: unstable
ADM: stable
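The effect can be reproduced with a small Python sketch. Everything here is illustrative: the helper names, the 0.5 thresholds with a `>=` convention, the assumed ADM form (one minus the mean absolute difference), and the four hypothetical documents are assumptions, not the data behind the slide.

```python
def thresholded_pr(ure, sre, threshold=0.5):
    """Precision and recall after binarizing URE and SRE at an assumed threshold."""
    pairs = [(u >= threshold, s >= threshold) for u, s in zip(ure, sre)]
    ret_rel = sum(1 for rel, ret in pairs if rel and ret)
    retrieved = sum(1 for _, ret in pairs if ret)
    relevant = sum(1 for rel, _ in pairs if rel)
    precision = ret_rel / retrieved if retrieved else 0.0
    recall = ret_rel / relevant if relevant else 0.0
    return precision, recall


def adm(ure, sre):
    """Assumed ADM form: 1 - mean absolute difference between SRE and URE."""
    return 1.0 - sum(abs(s - u) for u, s in zip(ure, sre)) / len(ure)


# Hypothetical documents sitting right next to the 0.5 borders.
ure = [0.51, 0.51, 0.49, 0.49]
sre_a = [0.51, 0.51, 0.49, 0.49]  # system A: matches the user exactly
sre_b = [0.49, 0.51, 0.51, 0.49]  # system B: two estimates moved by 0.02

pr_a, adm_a = thresholded_pr(ure, sre_a), adm(ure, sre_a)
pr_b, adm_b = thresholded_pr(ure, sre_b), adm(ure, sre_b)
print(pr_a, adm_a)
print(pr_b, adm_b)
```

With these values, P and R fall from 1 to 0.5 between the two systems, while ADM moves only from 1 to about 0.99.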


Lack of sensitiveness: two very different IRSs

[Figure: URE/SRE unit square with marks at 0.5 on each axis.]

        P    R    E    ADM
IRS1    1    1    1    1
IRS2    1    1    1    0.5

P, R, E: stable
ADM: unstable


Again on the thresholds...

[Figure: the URE/SRE unit square split by thresholds (drawn at 0.5 on each axis) into four regions, each labeled with a question mark: retrieved & relevant? nonretrieved & relevant? nonretrieved & nonrelevant? retrieved & nonrelevant?]


The “right” thresholds

[Figure: URE/SRE unit square divided into three regions: OverEvaluated, UnderEvaluated, and Correctly Evaluated.]

E = CE / (OE + UE)


ADM in practice

How to get URE values? Either
– asking the judge(s) to directly express continuous relevance judgments (feasible, literature evidence), or
– averaging dichotomous/discrete relevance judgments

UREs for all the documents in the database? Impossible!!
– Sampling (that takes place with P & R too, anyway)
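The second option, averaging dichotomous judgments, might look like the Python sketch below. The slides do not say how judgments are combined, so the equal-weight mean, the function name `ure_from_judgments`, and the sample judgments are all assumptions.

```python
from typing import Dict, List


def ure_from_judgments(judgments: Dict[str, List[int]]) -> Dict[str, float]:
    """Turn per-judge 0/1 relevance judgments into a continuous URE per document.

    Assumption: every judge counts equally, so the URE is the plain mean.
    Documents with no judgments are skipped rather than given a default.
    """
    return {
        doc_id: sum(votes) / len(votes)
        for doc_id, votes in judgments.items()
        if votes
    }


# Hypothetical judgments from four judges (1 = relevant, 0 = not relevant).
judgments = {
    "d1": [1, 1, 1, 0],
    "d2": [0, 1, 0, 0],
    "d3": [0, 0, 0, 0],
    "d4": [],  # unjudged document, e.g. outside the sample
}
ure = ure_from_judgments(judgments)
print(ure)
```

The same averaging would work for discrete graded judgments, provided the grades are first rescaled to the [0, 1] range.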


Conclusions

ADM, a new measure of retrieval effectiveness
– Adequacy
– Improvements w.r.t. P & R: avoids hyper-sensitiveness and lack of sensitiveness
– Practical usability (continuous relevance judgments, sampling)

Very preliminary work


Future work

Theoretical variations and improvements
– Standard deviation in place of the absolute value of the difference?
– Which sampling?

Re-examine the data of some evaluation experiments (any volunteers?)

Using ADM in real life