SIGIR 2005 Relevance Information: A Loss of Entropy but a Gain for IDF? Arjen P. de Vries [email protected] Thomas Roelleke, [email protected]


Motivation

• How should relevance information be incorporated in systems using TF*IDF term weighting?

– TF*IDF combines frequent occurrence with term discriminativeness

– Adding relevance information to a retrieval system corresponds to a loss of entropy; how does this affect IDF (the measure of term discriminativeness)?

Overview

• PART I

– IDF, relevance, and BIR

• PART II

– Alternative estimation for IDF

IDF

• A robust summary statistic of term occurrence that helps identify ‘good’ terms

• Follows naturally from the Binary Independence Retrieval model (BIR)

– The ranking that results from the situation without relevance information

– Related to the occurrence probability P(t|C)

$$\mathrm{IDF}(t,C) = -\log \frac{n(t,C)}{|C|}$$
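As a minimal sketch of the definition above, the following computes IDF(t,C) from document frequencies. The toy collection, the function names `idf` and `doc_freq`, and the choice of the natural logarithm are illustrative assumptions, not part of the slides.

```python
import math

def idf(n_t: int, n_docs: int) -> float:
    """IDF(t, C) = -log(n(t, C) / |C|), natural logarithm (an assumption)."""
    if not 0 < n_t <= n_docs:
        raise ValueError("need 0 < n_t <= n_docs")
    return -math.log(n_t / n_docs)

def doc_freq(term: str, docs: list) -> int:
    """n(t, C): number of documents in which the term is present."""
    return sum(1 for d in docs if term in d)

# Hypothetical toy collection: each document is reduced to its set of terms,
# i.e. only binary term presence/absence is recorded.
collection = [
    {"entropy", "idf", "relevance"},
    {"idf", "retrieval"},
    {"retrieval", "model"},
    {"retrieval", "poisson"},
]

for term in ("entropy", "idf", "retrieval"):
    n = doc_freq(term, collection)
    print(term, n, round(idf(n, len(collection)), 3))
```

Rare terms get a high value and a term that occurs in every document gets 0, which is why IDF can serve as a measure of discriminativeness.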

BIR

$$P(d|q,r) = \prod_{t \in q} P(x_t|q,r)$$

where x_t denotes binary term presence/absence

$$O(r|d,q) = \frac{P(r|d,q)}{P(\neg r|d,q)} = \frac{P(d|q,r)}{P(d|q,\neg r)} \cdot O(r|q)$$

• Rank documents by their probability of relevance

• Using odds of relevance avoids estimation of some terms without affecting the ranking

BIR

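$$\log O(r|d,q) = \sum_{t \in q} x_t h_t + Q$$

where Q denotes the remaining terms, which do not depend on the document, and

$$h_t = \log \frac{a_t (1 - b_t)}{b_t (1 - a_t)}$$

$$a_t = P(X_t = 1|q,r)$$

$$b_t = P(X_t = 1|q,\neg r)$$

Without relevance information,

h(t,C) = – log n/(N – n)

(Almost) the discriminativeness of term t in collection C!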
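The term weight can be sketched as below. The slide states only the result for the case without relevance information; the sketch assumes the usual way to reach it, namely a constant a_t = 0.5 and b_t = n/N. The function names and the scoring helper are hypothetical.

```python
import math

def bir_weight(a_t: float, b_t: float) -> float:
    """h_t = log( a_t (1 - b_t) / (b_t (1 - a_t)) )."""
    return math.log(a_t * (1.0 - b_t) / (b_t * (1.0 - a_t)))

def bir_weight_no_relevance(n: int, N: int) -> float:
    """Assumption: a_t = 0.5 and b_t = n / N, which gives -log(n / (N - n))."""
    return bir_weight(0.5, n / N)

def bir_score(query_terms: set, doc_terms: set, weights: dict) -> float:
    """Sum h_t over query terms present in the document (x_t = 1).
    The document-independent remainder is dropped; it does not change the ranking."""
    return sum(weights[t] for t in query_terms if t in doc_terms)

# Hypothetical document frequencies in a collection of N = 1000 documents.
N = 1000
df = {"entropy": 10, "retrieval": 400}
weights = {t: bir_weight_no_relevance(n, N) for t, n in df.items()}
print(weights)
print(bir_score({"entropy", "retrieval"}, {"entropy", "model"}, weights))
```

For small n, N – n is close to N, so the weight is close to – log n/N; this is the sense in which h(t,C) is "almost" IDF.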

BIR and IDF

• View IDF as term statistic in a set of documents, R or ¬R

• Then, the BIR probability estimation known as F4 corresponds to IDF(t,¬R) – IDF(t,R) + IDF(¬t,R) – IDF(¬t,¬R)

• IDF(t,R) can be interpreted as the discriminativeness of term presence among the relevant documents, etc.
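The correspondence above can be checked numerically. The sketch below uses the unsmoothed estimates a_t = r/R and b_t = (n – r)/(N – R); the 0.5 corrections usually included in F4 are left out so that the identity holds exactly. The symbols r, R, n, N (relevant documents containing t, relevant documents, documents containing t, all documents) and the counts are assumptions for illustration.

```python
import math

def idf(count: int, set_size: int) -> float:
    """IDF of an event that holds for `count` of the `set_size` documents."""
    return -math.log(count / set_size)

def weight_from_idf(r: int, R: int, n: int, N: int) -> float:
    """IDF(t,notR) - IDF(t,R) + IDF(not t,R) - IDF(not t,notR)."""
    not_R = N - R                       # size of the non-relevant set
    idf_t_R = idf(r, R)                 # term present, relevant set
    idf_t_notR = idf(n - r, not_R)      # term present, non-relevant set
    idf_nott_R = idf(R - r, R)          # term absent, relevant set
    idf_nott_notR = idf(not_R - (n - r), not_R)  # term absent, non-relevant set
    return idf_t_notR - idf_t_R + idf_nott_R - idf_nott_notR

def weight_direct(r: int, R: int, n: int, N: int) -> float:
    """BIR weight log( a (1 - b) / (b (1 - a)) ) with unsmoothed estimates."""
    a = r / R
    b = (n - r) / (N - R)
    return math.log(a * (1.0 - b) / (b * (1.0 - a)))

# Hypothetical counts: 1000 documents, 20 relevant, term in 50 documents,
# 10 of which are relevant.
print(weight_from_idf(10, 20, 50, 1000), weight_direct(10, 20, 50, 1000))
```

Both functions return the same value, so the BIR weight can be assembled from four IDF statistics computed over R and ¬R.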

BIR and IDF

• In practice, the ‘complement method’ gives ¬R = C\R ≈ C, so, usually, updating IDF under relevance information corresponds to subtracting IDF(t,R)!

• The BIR modifies h(t,C) most strongly for terms that are rare in the relevant set, because such terms do not help identify good documents

Implication for TF*IDF systems

• A system using IDF(t,C) uses presence weighting only, assuming that the term t occurs in all relevant documents (such that IDF(t,R) = – log R/R = 0)

• Systems using TF*IDF term weighting can incorporate relevance feedback (RFB) in accordance with the binary independence retrieval model

Estimation IDF

• Recall that IDF(t,C) = – log P(t|C), where P(t|C) is the occurrence probability of t in C.

– Assuming events d are disjoint and exhaustive,

$$P(t) = \sum_d P(t|d)\,P(d)$$

and, with uniform P(d) = 1/N and binary P(t|d), we obtain P(t|C) = n/N

• Q: Is this the best method for estimation?

– Notice that, in the BIR formulation, sets R and ¬R have very different cardinality…

Estimation TF

• For TF weights, we know that P(t|d) estimated by a Poisson approximation (e.g., applied in BM25) or by lifting (e.g., applied in Inquery) leads to superior retrieval results

• Motivation for this different estimate is to better handle the influence of varying document lengths
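As an illustration of such a length-aware estimate, the sketch below uses the saturating form tf/(tf + K) that is commonly given as BM25's TF component, with K scaled by document length. The function name and the defaults k1 = 1.2 and b = 0.75 are illustrative assumptions, not values taken from the slides, and the (k1 + 1) numerator factor used in some BM25 formulations is omitted so that the value stays below 1.

```python
def bm25_tf(tf: float, doc_len: float, avg_doc_len: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """Saturating TF weight tf / (tf + K), with K normalised by document length."""
    K = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    return tf / (tf + K)

# Same term frequency, documents of average and of double the average length.
print(bm25_tf(3, 100, 100))
print(bm25_tf(3, 200, 100))
```

The same number of occurrences counts for less in a longer document, which is the length effect the bullet above refers to.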

Poisson Estimate

• The ‘Poisson estimate’

$$P(t|C) = \frac{n(t,C)}{K + n(t,C)}$$

approximates the (Poisson-based) probability that term t occurs at least once

$$P(n \geq 1) = 1 - P(n = 0) = 1 - e^{-\lambda}$$
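A quick numerical comparison of the two quantities, as a sketch: the slides do not state how the Poisson mean is chosen, so the code assumes λ = n/K, and the value K = 1000 and the function names are hypothetical.

```python
import math

def estimate(n: float, K: float) -> float:
    """The estimate n / (K + n)."""
    return n / (K + n)

def poisson_at_least_once(lam: float) -> float:
    """P(n >= 1) = 1 - P(n = 0) = 1 - exp(-lambda) for a Poisson variable."""
    return 1.0 - math.exp(-lam)

K = 1000.0  # hypothetical constant
for n in (1, 10, 100, 1000, 10000):
    est = estimate(n, K)
    poi = poisson_at_least_once(n / K)  # assumption: lambda = n / K
    print(n, round(est, 5), round(poi, 5), round(abs(poi - est), 5))
```

Under this assumption the two values agree closely when n is small relative to K and drift apart as n grows.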

Poisson vs. Estimate vs. n/N

[Figure: comparison of Poisson, Estimate, and n/N, with the difference |Poisson – Estimate|; again, for small n]

Experimental Setup

• Ad-hoc retrieval

– TREC-7 and TREC-8 (topics 351-400)

– No stemming

• Routing

– LA Times articles for training (1989/1990)

– Remainder for testing (1991-1994)

• BM25 constants:

$$k_1 = 1.2, \quad k_3 = 1000, \quad b = 0.7627$$

Results: IDF vs. IDFp

• IDF

         T      TD     TDN
TREC-7   0.124  0.095  0.041
TREC-8   0.136  0.111  0.064

• IDFp

         T      TD     TDN
TREC-7   0.133  0.143  0.127
TREC-8   0.136  0.158  0.137

IDF vs. IDFp

• For the short T queries, the user carefully selects the terms that are most discriminative with respect to relevance

• The longer TD and TDN queries, however, also contain noisy, non-discriminative terms

IDF vs. IDFp

• IDFp orders terms with respect to their discriminativeness in the same order as IDF, but reduces the influence of the non-discriminative terms on the ranking

– Differentiate more between rare terms, and less between frequent terms

• As a result, the effect of the Poisson-based estimation is much stronger for the longer queries
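The ordering and spacing claims can be illustrated with a small sketch. It assumes IDFp = – log n/(K + n), takes K = N/10 from the results reported in this talk, and uses a hypothetical collection size and hypothetical document frequencies.

```python
import math

def idf(n: float, N: float) -> float:
    """Traditional estimate: -log(n / N)."""
    return -math.log(n / N)

def idf_p(n: float, K: float) -> float:
    """Assumed form of IDFp: -log(n / (K + n))."""
    return -math.log(n / (K + n))

def gaps(values: list) -> list:
    """Differences between consecutive values."""
    return [values[i] - values[i + 1] for i in range(len(values) - 1)]

N = 1_000_000   # hypothetical collection size
K = N / 10
doc_freqs = [10, 100, 100_000, 1_000_000]  # from rare to very frequent

idf_values = [idf(n, N) for n in doc_freqs]
idf_p_values = [idf_p(n, K) for n in doc_freqs]

for n, v, vp in zip(doc_freqs, idf_values, idf_p_values):
    print(n, round(v, 3), round(vp, 3))
print([round(g, 3) for g in gaps(idf_values)])
print([round(g, 3) for g in gaps(idf_p_values)])
```

Both weightings rank the terms identically, but a tenfold change in document frequency moves IDFp far less among frequent terms than among rare ones, whereas it moves IDF by the same amount everywhere.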

TF*IDF vs. TF*IDFp

• Estimation with IDFp results in better mean average precision than the ‘traditional’ estimate

• Strong emphasis on discriminative-ness (Poisson approximation IDFp using large values of K) improves effectiveness

• Best overall performance for K=N/10

Routing experiment

• The TF*IDFp results without feedback are better than all TF*IDF results

• But, the TF*IDFp results without feedback are also better than all TF*IDFp results with feedback

• Finally, the TF*IDF results improve only marginally with feedback

• Is the LA Times training data not representative?

Conclusions

• PART I

– IDF and the Binary Independence Retrieval model are very closely related

– Relevance information can be incorporated in TF*IDF by revising IDF

• PART II

– Different estimation of the occurrence probability in IDF leads to improved retrieval effectiveness

Open Questions

• Can we derive the choice for K=N/10 analytically?

• Is the observed improvement in effectiveness really due to a better (frequentist) model of the occurrence probability, or is it a qualitative argument for informative-ness?

• More questions in the audience?!
