IR Evaluation (college assignment)

Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan

Lecture 8: Evaluation

Measures for a search engine

How fast does it index?
  Number of documents/hour
  (Average document size)
How fast does it search?
  Latency as a function of index size
Expressiveness of query language
  Ability to express complex information needs
  Speed on complex queries
Uncluttered UI
Is it free?

Sec. 8.6

Measures for a search engine

All of the preceding criteria are measurable:
  we can quantify speed/size
  we can make expressiveness precise
The key measure: user happiness. What is this?
  Speed of response / size of index are factors
  But blindingly fast, useless answers won't make a user happy
Need a way of quantifying user happiness

Sec. 8.6

Happiness: elusive to measure

Most common proxy: relevance of search results
But how do you measure relevance?
We will detail a methodology here, then examine its issues.
Relevance measurement requires 3 elements:
  1. A benchmark document collection
  2. A benchmark suite of queries
  3. A usually binary assessment of either Relevant or Nonrelevant for each query and each document
     (Some work uses more-than-binary assessments, but binary is the standard.)

Sec. 8.1

Evaluating an IR system

Note: the information need is translated into a query.
Relevance is assessed relative to the information need, not the query.
E.g., Information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
Query: wine red white heart attack effective
Evaluate whether the doc addresses the information need, not whether it has these words.

Sec. 8.1

Standard relevance benchmarks

TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years.
Reuters and other benchmark doc collections are also used.
Retrieval tasks are specified, sometimes as queries.
Human experts mark, for each query and for each doc, Relevant or Nonrelevant (or at least for the subset of docs that some system returned for that query).

Sec. 8.2

Unranked retrieval evaluation: Precision and Recall

Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)
Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)

                  Relevant   Nonrelevant
  Retrieved       tp         fp
  Not Retrieved   fn         tn

Precision P = tp/(tp + fp)
Recall    R = tp/(tp + fn)

Sec. 8.3
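
As a quick illustration (not part of the original slides), the two formulas translate directly into code; the helper below is a minimal Python sketch and its name is illustrative:

  def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
      """Precision and recall from the contingency-table counts above."""
      precision = tp / (tp + fp) if (tp + fp) else 0.0
      recall = tp / (tp + fn) if (tp + fn) else 0.0
      return precision, recall

  # e.g. 3 relevant retrieved, 1 nonrelevant retrieved, 3 relevant missed:
  print(precision_recall(tp=3, fp=1, fn=3))  # (0.75, 0.5)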

Unranked retrieval evaluation: Precision and Recall

Precision: the ability to retrieve top-ranked documents that are mostly relevant.
Recall: the ability of the search to find all of the relevant items in the corpus.

Sec. 8.3

Precision/Recall

You can get high recall (but low precision) by retrieving all docs for all queries!
Recall is a non-decreasing function of the number of docs retrieved.
In a good system, precision decreases as either the number of docs retrieved or recall increases.
  This is not a theorem, but a result with strong empirical confirmation.

Sec. 8.3

Trade-off between Recall and Precision

[Figure: precision (y-axis) vs. recall (x-axis). The ideal is the upper-right corner. A system at high precision but low recall returns relevant documents but misses many useful ones too; a system at high recall but low precision returns most relevant documents but includes lots of junk.]

F-Measure

One measure of performance that takes into account both recall and precision.
Harmonic mean of recall and precision:

  F = 2PR / (P + R) = 2 / (1/R + 1/P)

Compared to the arithmetic mean, both P and R need to be high for the harmonic mean to be high.
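
A minimal sketch of the harmonic-mean computation (the function name is illustrative, not from the slides):

  def f_measure(precision: float, recall: float) -> float:
      """Harmonic mean of precision and recall."""
      if precision + recall == 0:
          return 0.0
      return 2 * precision * recall / (precision + recall)

  print(f_measure(0.75, 0.5))  # 0.6; the arithmetic mean would be 0.625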

E Measure (parameterized F Measure)

A variant of the F measure that allows weighting emphasis on precision over recall:

  E = (β² + 1)PR / (β²P + R)

The value of β controls the trade-off:
  β = 1: Equally weight precision and recall (E = F).
  β > 1: Weight recall more.
  β < 1: Weight precision more.

Sec. 8.3
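
A sketch of the parameterized measure, assuming the formula as reconstructed above (the name e_measure is illustrative):

  def e_measure(precision: float, recall: float, beta: float = 1.0) -> float:
      """Parameterized F: beta > 1 weights recall more, beta < 1 weights precision more."""
      b2 = beta * beta
      denom = b2 * precision + recall
      if denom == 0:
          return 0.0
      return (b2 + 1) * precision * recall / denom

  print(e_measure(0.75, 0.5, beta=1.0))  # 0.6, same as F
  print(e_measure(0.75, 0.5, beta=2.0))  # ~0.536, pulled toward the (lower) recall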

Ranked retrieval evaluation: Precision and Recall

The same definitions of precision and recall apply, but for a ranked list they are computed at each cut-off point in the ranking rather than for a single retrieved set (see the next slide).

Sec. 8.3

Computing Recall/Precision Points

For a given query, produce the ranked list of retrievals.
Adjusting a threshold on this ranked list produces different sets of retrieved documents, and therefore different recall/precision measures.
Mark each document in the ranked list that is relevant according to the gold standard.
Compute a recall/precision pair for each position in the ranked list that contains a relevant document.
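
This procedure can be sketched in a few lines of Python; recall_precision_points and its inputs are illustrative names, not from the slides:

  def recall_precision_points(relevant_flags, total_relevant):
      """(recall, precision) at each rank that holds a relevant document.

      relevant_flags: one boolean per ranked position (True = relevant per the
      gold standard); total_relevant: number of relevant docs for the query.
      """
      points, hits = [], 0
      for rank, is_relevant in enumerate(relevant_flags, start=1):
          if is_relevant:
              hits += 1
              points.append((hits / total_relevant, hits / rank))
      return points

  # Example 1 below: relevant docs at ranks 1, 2, 4, 6 and 13, with 6 relevant in total
  flags1 = [i in (1, 2, 4, 6, 13) for i in range(1, 15)]
  print(recall_precision_points(flags1, total_relevant=6))
  # ~[(0.17, 1.0), (0.33, 1.0), (0.5, 0.75), (0.67, 0.67), (0.83, 0.38)]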

Example 1

Let total # of relevant docs = 6. Check each new recall point:

  n   doc #   relevant
  1   588     x          R = 1/6 = 0.167;  P = 1/1 = 1
  2   589     x          R = 2/6 = 0.333;  P = 2/2 = 1
  3   576
  4   590     x          R = 3/6 = 0.5;    P = 3/4 = 0.75
  5   986
  6   592     x          R = 4/6 = 0.667;  P = 4/6 = 0.667
  7   984
  8   988
  9   578
  10  985
  11  103
  12  591
  13  772     x          R = 5/6 = 0.833;  P = 5/13 = 0.38
  14  990

One relevant document is missing, so we never reach 100% recall.

Example 2

Let total # of relevant docs = 6. Check each new recall point:

  n   doc #   relevant
  1   588     x          R = 1/6 = 0.167;  P = 1/1 = 1
  2   576
  3   589     x          R = 2/6 = 0.333;  P = 2/3 = 0.667
  4   342
  5   590     x          R = 3/6 = 0.5;    P = 3/5 = 0.6
  6   717
  7   984
  8   772     x          R = 4/6 = 0.667;  P = 4/8 = 0.5
  9   321     x          R = 5/6 = 0.833;  P = 5/9 = 0.556
  10  498
  11  113
  12  628
  13  772
  14  592     x          R = 6/6 = 1.0;    P = 6/14 = 0.429
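
For reference, the hypothetical helper sketched earlier reproduces Example 2's points as well:

  flags2 = [i in (1, 3, 5, 8, 9, 14) for i in range(1, 15)]
  print(recall_precision_points(flags2, total_relevant=6))
  # ~[(0.17, 1.0), (0.33, 0.67), (0.5, 0.6), (0.67, 0.5), (0.83, 0.56), (1.0, 0.43)]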

Interpolating a Recall/Precision Curve

Interpolate a precision value for each standard recall level:
  r_j ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}
  r_0 = 0.0, r_1 = 0.1, ..., r_10 = 1.0
The interpolated precision at the j-th standard recall level is the maximum known precision at any recall level between the j-th and (j+1)-th level:

  P(r_j) = max P(r)  for  r_j ≤ r ≤ r_(j+1)
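
One way to sketch this in Python, reusing the (recall, precision) points from the earlier helper. Note this sketch uses the common convention of taking the maximum precision at any measured recall ≥ r_j; it agrees with the slide's formula whenever the interval [r_j, r_(j+1)] contains a measured point, and carries precision leftward otherwise:

  def interpolate(points, levels=None):
      """Interpolated precision at the 11 standard recall levels."""
      if levels is None:
          levels = [j / 10 for j in range(11)]
      result = []
      for r_j in levels:
          candidates = [p for r, p in points if r >= r_j]
          result.append(max(candidates) if candidates else 0.0)
      return result

  print(interpolate(recall_precision_points(flags1, 6)))
  # ~[1.0, 1.0, 1.0, 1.0, 0.75, 0.75, 0.67, 0.38, 0.38, 0.0, 0.0]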

Example 1 (interpolated)

[Figure: interpolated recall/precision curve for Example 1; recall on the x-axis and precision on the y-axis, both from 0 to 1.]

Example 2 (interpolated)

[Figure: interpolated recall/precision curve for Example 2; recall on the x-axis and precision on the y-axis, both from 0 to 1.]

Compare Two or More Systems

The curve closest to the upper right-hand corner of the graph indicates the best performance.

[Figure: recall/precision curves for two systems, NoStem and Stem, with recall on the x-axis (0.1 to 1) and precision on the y-axis (0 to 1).]

Sample RP Curve for CF Corpus

[Figure: recall/precision curve for the CF corpus.]

R-Precision

Precision at the R-th position in the ranking of results for a query that has R relevant documents.

  n   doc #   relevant
  1   588     x
  2   589     x
  3   576
  4   590     x
  5   986
  6   592     x
  7   984
  8   988
  9   578
  10  985
  11  103
  12  591
  13  772     x
  14  990

R = # of relevant docs = 6
R-Precision = 4/6 = 0.67
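
R-precision is a one-liner once the relevance flags are known; again a minimal illustrative sketch:

  def r_precision(relevant_flags, total_relevant):
      """Precision at rank R, where R = number of relevant docs for the query."""
      return sum(relevant_flags[:total_relevant]) / total_relevant

  # The slide's example: relevant docs at ranks 1, 2, 4, 6, 13 and R = 6
  print(r_precision(flags1, 6))  # 0.666..., i.e. 4 of the top 6 are relevant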

Mean Average Precision (MAP)

Average Precision: the average of the precision values at the points at which each relevant document is retrieved (a relevant document that is never retrieved contributes a precision of 0).
  Ex1: (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633
  Ex2: (1 + 0.667 + 0.6 + 0.5 + 0.556 + 0.429)/6 = 0.625
Mean Average Precision: the average of the average precision values over a set of queries.
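
A sketch of average precision and MAP along these lines (names illustrative); a never-retrieved relevant document adds nothing to the sum but still counts in the denominator:

  def average_precision(relevant_flags, total_relevant):
      """Mean of precision values at the ranks where relevant docs appear."""
      hits, precision_sum = 0, 0.0
      for rank, is_relevant in enumerate(relevant_flags, start=1):
          if is_relevant:
              hits += 1
              precision_sum += hits / rank
      return precision_sum / total_relevant

  def mean_average_precision(per_query):
      """MAP over a set of queries given as (flags, total_relevant) pairs."""
      return sum(average_precision(f, n) for f, n in per_query) / len(per_query)

  ex1 = ([i in (1, 2, 4, 6, 13) for i in range(1, 15)], 6)
  ex2 = ([i in (1, 3, 5, 8, 9, 14) for i in range(1, 15)], 6)
  print(average_precision(*ex1))             # ~0.633
  print(average_precision(*ex2))             # ~0.625
  print(mean_average_precision([ex1, ex2]))  # ~0.629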
