SIGIR 2005
Relevance Information: A Loss of Entropy but a Gain for IDF?
Arjen P. de [email protected]
Thomas Roelleke, [email protected]
Motivation
• How should relevance information be incorporated in systems using TF*IDF term weighting?
– TF*IDF combines frequent occurrence with term discriminativeness
– Adding relevance information to a retrieval system corresponds to a loss of entropy; how does this affect IDF (the measure of term discriminativeness)?
IDF
• A robust summary statistic of term occurrence that helps identify ‘good’ terms
• Follows naturally from the Binary Independence Retrieval model (BIR)
– The ranking that results from the situation without relevance information
– Related to the occurrence probability P(t|C)
$\mathrm{IDF}(t,C) = -\log \frac{n(t,C)}{|C|}$
(binary term presence/absence)
BIR
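$P(d \mid q, r) = \prod_{t \in q} P(x_t \mid q, r)$
$O(r \mid d, q) = \frac{P(r \mid d, q)}{P(\bar{r} \mid d, q)} = \frac{P(d \mid q, r)}{P(d \mid q, \bar{r})} \cdot O(r \mid q)$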
• Rank documents by their probability of relevance
• Using odds of relevance avoids estimation of some terms without affecting the ranking
BIR
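$\log O(r \mid d, q) = \sum_{t \in q} x_t \, h_t \; (+\, Q)$
$h_t = \log \left( \frac{a_t}{1 - a_t} \cdot \frac{1 - b_t}{b_t} \right)$
$a_t = P(X_t = 1 \mid q, r)$
$b_t = P(X_t = 1 \mid q, \bar{r})$
Without relevance information, h(t,C) = – log n/(N – n)
(Almost) the discriminativeness of term t in collection C!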
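A minimal sketch of the BIR term weight, assuming a_t and b_t are given as probabilities strictly between 0 and 1. The function names are hypothetical, and setting a_t = 0.5 for the case without relevance information is an assumption made here (any constant a_t only shifts all weights by the same amount); with b_t = n/N it reproduces h(t,C) = – log n/(N – n).

```python
import math


def bir_weight(a_t: float, b_t: float) -> float:
    """h_t = log( a_t/(1-a_t) * (1-b_t)/b_t ), for 0 < a_t, b_t < 1."""
    return math.log(a_t / (1.0 - a_t)) + math.log((1.0 - b_t) / b_t)


def bir_weight_no_relevance(n: int, N: int) -> float:
    """No relevance information: assume a_t = 0.5 and b_t = n/N (0 < n < N)."""
    return bir_weight(0.5, n / N)


N, n = 1000, 10                      # illustrative collection size and document frequency
h = bir_weight_no_relevance(n, N)
closed_form = -math.log(n / (N - n))  # h(t,C) as stated on the slide
plain_idf = -math.log(n / N)          # IDF(t,C), close to h for rare terms

print(h, closed_form, plain_idf)
```

For rare terms N – n is close to N, which is why h(t,C) is "almost" IDF(t,C).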
BIR and IDF
• View IDF as term statistic in a set of documents, R or ¬R
• Then, the BIR probability estimation known as F4 corresponds to IDF(t,¬R) – IDF(t,R) + IDF(¬t,R) – IDF(¬t,¬R)
• IDF(t,R) can be interpreted as the discriminativeness of term presence among the relevant documents, etc.
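The correspondence can be checked numerically. The sketch below assumes the unsmoothed form of the weight, with a_t = r/R and b_t = (n – r)/(N – R), where r is the number of relevant documents containing t; the counts and function names are illustrative, not taken from the slides.

```python
import math


def idf(count: int, total: int) -> float:
    """IDF of an event observed in `count` of `total` documents: -log(count/total)."""
    return -math.log(count / total)


def f4_weight(N: int, n: int, R: int, r: int) -> float:
    """Unsmoothed weight log( [r/(R-r)] / [(n-r)/(N-n-R+r)] ); all four cells must be > 0."""
    return math.log((r / (R - r)) / ((n - r) / (N - n - R + r)))


def idf_decomposition(N: int, n: int, R: int, r: int) -> float:
    """IDF(t,notR) - IDF(t,R) + IDF(not t,R) - IDF(not t,notR)."""
    non_rel = N - R          # size of the non-relevant set
    t_non_rel = n - r        # non-relevant documents containing t
    return (idf(t_non_rel, non_rel) - idf(r, R)
            + idf(R - r, R) - idf(non_rel - t_non_rel, non_rel))


N, n, R, r = 10_000, 200, 50, 30     # made-up counts for illustration
lhs = f4_weight(N, n, R, r)
rhs = idf_decomposition(N, n, R, r)
print(lhs, rhs)
```

Both values agree up to floating-point error, since the four IDF terms are the logarithms of a_t, 1 – a_t, b_t and 1 – b_t with the signs of the weight formula.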
BIR and IDF
• In practice, the ‘complement method’ gives ¬R = C\R ≈ C, so, usually, updating IDF under relevance information corresponds to subtracting IDF(t,R)!
• The BIR modifies h(t,C) more significantly for those terms that are rare in the relevant set, since such terms do not help identify good documents
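A small sketch of this update, under the assumptions that ¬R ≈ C, that only term presence is weighted, and that the term occurs in at least one relevant document (r ≥ 1) so the logarithm is defined. The helper name idf_with_feedback and all counts are hypothetical.

```python
import math


def idf(count: int, total: int) -> float:
    """-log(count/total), for count >= 1."""
    return -math.log(count / total)


def idf_with_feedback(n: int, N: int, r: int, R: int) -> float:
    """IDF(t,C) - IDF(t,R): presence weight after relevance feedback (assumes r >= 1)."""
    return idf(n, N) - idf(r, R)


N, R = 100_000, 20       # collection size and number of known relevant documents
n = 500                  # document frequency of the term in the collection

in_all_relevant = idf_with_feedback(n, N, r=20, R=R)  # IDF(t,R) = 0, weight unchanged
in_few_relevant = idf_with_feedback(n, N, r=2, R=R)   # rare in R, weight reduced

print(idf(n, N), in_all_relevant, in_few_relevant)
```

The term that appears in every relevant document keeps its collection IDF, while the term that is rare among the relevant documents loses weight.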
Implication for TF*IDF systems
• A system using IDF(t,C) uses presence weighting only, assuming that the term t occurs in all relevant documents (such that IDF(t,R) = – log R/R = 0)
• Systems using TF*IDF term weighting can incorporate relevance feedback (RFB) in accordance with the binary independence retrieval model
Estimation IDF
• Recall that IDF(t,C) = – log P(t|C), where P(t|C) is the occurrence probability of t in C.
– Assuming events d are disjoint and exhaustive,
$P(t) = \sum_{d} P(t \mid d) \, P(d)$
we obtain P(t|C) = n/N
• Q: Is this the best method for estimation?
– Notice that, in the BIR formulation, sets R and ¬R have very different cardinality…
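A toy illustration of how the sum over documents yields n/N, assuming a uniform prior P(d) = 1/N and a binary P(t|d) that is 1 when t occurs in d and 0 otherwise. The four-document collection is invented for the example.

```python
def p_term(term: str, docs: list[set[str]]) -> float:
    """P(t) = sum_d P(t|d) P(d), with P(d) = 1/N and binary P(t|d)."""
    p_d = 1.0 / len(docs)
    return sum((1.0 if term in d else 0.0) * p_d for d in docs)


# Hypothetical collection: each document is reduced to its set of terms.
docs = [
    {"entropy", "idf"},
    {"idf", "poisson"},
    {"relevance"},
    {"idf"},
]

n = sum(1 for d in docs if "idf" in d)   # document frequency of "idf"
print(p_term("idf", docs), n / len(docs))
```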
Estimation TF
• For TF weights, estimating P(t|d) by a Poisson approximation (e.g., applied in BM25) or by lifting (e.g., applied in Inquery) is known to lead to superior retrieval results
• The motivation for this different estimate is to better handle the influence of varying document lengths
Poisson Estimate
• The ‘Poisson estimate’
$P(t \mid C) = \frac{n(t,C)}{K + n(t,C)}$
approximates the (Poisson-based) probability that term t occurs at least once
$P(n \geq 1) = 1 - P(n = 0) = 1 - e^{-\lambda}$
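A minimal sketch of the two occurrence-probability estimates and the resulting IDF variants. Taking the Poisson mean as n/K is an assumption made here for the comparison (the slide does not state it), and the function names and counts are illustrative.

```python
import math


def p_classic(n: int, N: int) -> float:
    """Traditional estimate P(t|C) = n/N."""
    return n / N


def p_poisson_estimate(n: int, K: float) -> float:
    """'Poisson estimate' P(t|C) = n / (K + n)."""
    return n / (K + n)


def idf_p(n: int, K: float) -> float:
    """IDFp: IDF computed from the Poisson estimate."""
    return -math.log(p_poisson_estimate(n, K))


n, K = 10, 1000.0
estimate = p_poisson_estimate(n, K)
# Assumption: Poisson mean n/K, so P(at least one occurrence) = 1 - exp(-n/K).
at_least_once = 1.0 - math.exp(-n / K)

print(estimate, at_least_once, idf_p(n, K))
```

For small n/K both quantities are close to n/K, which is where the approximation is tightest.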
Experimental Setup
• Ad-hoc retrieval
– TREC-7 and TREC-8 (topics 351-400)
– No stemming
• Routing
– LA Times articles for training (1989/1990)
– Remainder for testing (1991-1994)
• BM25 constants: $k_1 = 1.2$, $k_3 = 1000$, $b = 0.7627$
Results: IDF vs. IDFp
• IDF
          T      TD     TDN
TREC-7    0.124  0.095  0.041
TREC-8    0.136  0.111  0.064
• IDFp
          T      TD     TDN
TREC-7    0.133  0.143  0.127
TREC-8    0.136  0.158  0.137
IDF vs. IDFp
• For the short T queries, the user carefully selects the most discriminative terms with respect to relevance
• The longer TD and TDN queries, however, also contain noisy, non-discriminative terms
IDF vs. IDFp
• IDFp orders terms with respect to their discriminativeness in the same order as IDF, but reduces the influence of the non-discriminative terms on the ranking
– Differentiate more between rare terms, and less between frequent terms
• As a result, the effect of the Poisson-based estimation is much stronger for the longer queries
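The ordering and spread of the two weights can be compared on a few document frequencies. This is a sketch under assumed values: a collection of one million documents, K = N/10, and four hand-picked document frequencies; none of these numbers come from the experiments.

```python
import math


def idf(n: int, N: int) -> float:
    return -math.log(n / N)


def idf_p(n: int, K: float) -> float:
    return -math.log(n / (K + n))


N = 1_000_000
K = N / 10
dfs = [1, 10, 100_000, 500_000]      # two rare and two frequent terms

classic = [idf(n, N) for n in dfs]
poisson = [idf_p(n, K) for n in dfs]

# Both weights decrease with document frequency, so the term order is identical.
same_order = (sorted(dfs, key=lambda n: idf(n, N))
              == sorted(dfs, key=lambda n: idf_p(n, K)))

rare_gap_classic = classic[0] - classic[1]
rare_gap_poisson = poisson[0] - poisson[1]
freq_gap_classic = classic[2] - classic[3]
freq_gap_poisson = poisson[2] - poisson[3]

print(same_order)
print(rare_gap_classic, rare_gap_poisson)
print(freq_gap_classic, freq_gap_poisson)
```

With these values the gap between the two rare terms is almost the same under both weights, while the gap between the two frequent terms shrinks under IDFp, so frequent terms contribute less to the ranking.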
TF*IDF vs. TF*IDFp
• Estimation with IDFp results in better mean average precision than the ‘traditional’ estimate
• Strong emphasis on discriminative-ness (Poisson approximation IDFp using large values of K) improves effectiveness
• Best overall performance for K=N/10
Routing experiment
• The TF*IDFp results without feedback are better than all TF*IDF results
• But, the TF*IDFp results without feedback are also better than all TF*IDFp results with feedback
• Finally, the TF*IDF results improve only marginally with feedback
• LA Times training data not representative?
Conclusions
• PART I
– IDF and the Binary Independence Retrieval model are very closely related
– Relevance information can be incorporated in TF*IDF by revising IDF
• PART II
– Different estimation of the occurrence probability in IDF leads to improved retrieval effectiveness