LANGUAGE MODELS FOR RELEVANCE FEEDBACK
2003. 07. 14
Lee Won Hee
2
Abstract
The language modeling approach to IR treats the query as a random event and ranks documents according to the likelihood of generating the query: users are assumed to have a prototypical document in mind and to choose query terms accordingly. Inferences about the semantic content of documents do not need to be made, resulting in a conceptually simpler model.
3
1. Introduction
The language modeling approach to IR, developed by Ponte and Croft (1998)
- Query: a random event generated according to a probability distribution
- Document similarity: estimate a model of the term generation probabilities of the query terms for each document, then rank the documents according to the probability of generating the query

The main advantages of the language modeling approach
- Document boundaries are not predefined: the document-level statistics tf and idf are used
- Uncertainty is modeled by probabilities: well suited to noisy data such as OCR text and automatically recognized speech transcripts, and to relevance feedback or document routing
4
2. The Language Modeling Approach to IR
The query generation probability is estimated starting from the maximum likelihood estimate of the probability of term t in document d:

\hat{P}_{ml}(t \mid M_d) = \frac{tf(t,d)}{dl_d}

- tf(t,d): the raw term frequency of term t in document d
- dl_d: the total number of tokens in document d
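The maximum likelihood estimate above can be sketched in a few lines of Python (the example document and the function name are illustrative, not from the slides):

```python
from collections import Counter

def p_ml(term, doc_tokens):
    """Maximum likelihood estimate: P_ml(t | M_d) = tf(t, d) / dl_d."""
    tf = Counter(doc_tokens)           # tf(t, d): raw term frequencies
    return tf[term] / len(doc_tokens)  # dl_d: total tokens in the document

doc = "the cat sat on the mat".split()
print(p_ml("the", doc))  # 2 / 6 = 0.333...
```

Note that any term absent from the document gets probability zero, which is exactly the problem the next slide addresses.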
5
2.1 Insufficient Data
Two problems with the maximum likelihood estimator

We do not wish to assign a probability of zero to a document that is missing one or more of the query terms
- If a user included several synonyms in the query, a document missing even one of them would not be retrieved
- A more reasonable fallback is the collection distribution:

\hat{P}(t) = \frac{cf_t}{cs}

- cf_t: the raw count of term t in the collection
- cs: the raw collection size, i.e. the total number of tokens in the collection

We only have a document-sized sample from each document's distribution, so the variation in the raw counts may partly be accounted for by randomness
6
2.2 Averaging
The mean probability estimate of t in the documents containing it
- circumvents the problem of insufficient data
- some risk: if the mean were used by itself, there would be no distinction between documents with different term frequencies

\hat{P}_{avg}(t) = \frac{\sum_{d:\, tf(t,d) > 0} \hat{P}_{ml}(t \mid M_d)}{df_t}

Combining the two estimates using the geometric distribution (Ghosh et al., 1983)
- robustness of estimation; minimizes the risk

\hat{R}_{t,d} = \left(\frac{1.0}{1.0 + \bar{f}_t}\right) \times \left(\frac{\bar{f}_t}{1.0 + \bar{f}_t}\right)^{tf(t,d)}

- df_t: the document frequency of t
- \bar{f}_t: the mean term frequency of term t in the documents containing it
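A possible reading of the averaged estimate and the geometric risk function in code, assuming documents are plain token lists (all function names are hypothetical):

```python
from collections import Counter

def collection_stats(docs):
    """df_t (document frequency) and f_bar_t (mean term frequency of t
    in the documents that contain it) for a list of tokenized documents."""
    tfs = [Counter(d) for d in docs]
    df, total = Counter(), Counter()
    for tf in tfs:
        for t, c in tf.items():
            df[t] += 1
            total[t] += c
    fbar = {t: total[t] / df[t] for t in df}
    return tfs, df, fbar

def p_avg(term, docs):
    """Mean ML estimate of the term over the documents containing it."""
    tfs, df, _ = collection_stats(docs)
    s = sum(tf[term] / sum(tf.values()) for tf in tfs if tf[term] > 0)
    return s / df[term]

def risk(term, tf_in_doc, fbar):
    """Geometric-distribution risk R_{t,d}: largest at low term frequency,
    shrinking as tf(t, d) grows."""
    f = fbar[term]
    return (1.0 / (1.0 + f)) * (f / (1.0 + f)) ** tf_in_doc
```

The risk term decides how far to back off from the raw in-document estimate toward the collection-wide average.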
7
2.3 Combining the Two Estimates
The estimate of the probability of producing the query for a given document model:

\hat{P}(Q \mid M_d) = \prod_{t \in Q} \hat{p}(t,d) \times \prod_{t \notin Q} \left(1.0 - \hat{p}(t,d)\right)

where

\hat{p}(t,d) = \begin{cases} \hat{P}_{ml}(t,d)^{(1.0 - \hat{R}_{t,d})} \times \hat{P}_{avg}(t)^{\hat{R}_{t,d}} & \text{if } tf(t,d) > 0 \\[4pt] \dfrac{cf_t}{cs} & \text{otherwise} \end{cases}

- first term: the probability of producing the terms in the query
- second term: the probability of not producing other terms; such terms are better discriminators of the document
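Putting the three estimates together, the scoring formula can be sketched end-to-end; this is a minimal reading of the slides (computed in log space for numerical stability, which is my addition):

```python
import math
from collections import Counter

def query_likelihood(query, docs, d_index):
    """Sketch of the query generation probability:
    P(Q|M_d) = prod_{t in Q} p(t,d) * prod_{t not in Q} (1 - p(t,d)),
    with p(t,d) = P_ml(t,d)^(1-R) * P_avg(t)^R when tf(t,d) > 0,
    and cf_t / cs otherwise. Returns log P(Q|M_d)."""
    tfs = [Counter(d) for d in docs]
    cs = sum(len(d) for d in docs)                    # collection size
    cf = Counter(t for d in docs for t in d)          # collection frequencies
    df = Counter(t for tf in tfs for t in tf)         # document frequencies
    fbar = {t: sum(tf[t] for tf in tfs) / df[t] for t in df}
    avg = {t: sum(tf[t] / sum(tf.values()) for tf in tfs if tf[t] > 0) / df[t]
           for t in df}
    tf_d, dl = tfs[d_index], len(docs[d_index])

    def p(t):
        if tf_d[t] > 0:
            r = (1.0 / (1.0 + fbar[t])) * (fbar[t] / (1.0 + fbar[t])) ** tf_d[t]
            return (tf_d[t] / dl) ** (1.0 - r) * avg[t] ** r
        return cf[t] / cs

    log_p = sum(math.log(p(t)) for t in query)
    log_p += sum(math.log(1.0 - p(t)) for t in df if t not in query)
    return log_p
```

A document containing a query term should score above one that lacks it, which is easy to check on a toy collection.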
8
3. Related Work
1. The Harper and van Rijsbergen model
2. The Rocchio method
3. The INQUERY model
4. Exponential models
9
3.1 The Harper and van Rijsbergen Model (1978)
Goal: obtain better estimates of the probability of relevance of a document given the query

An approximation of the dependence between query terms was defined by the authors by means of a maximal spanning tree
- each node of the tree: a single query term
- the edges between nodes: weighted by a measure of term dependency
- the tree spans all of the nodes and maximizes the expected mutual information:

I(x_i, x_j) = P(x_i, x_j) \log \frac{P(x_i, x_j)}{P(x_i) \, P(x_j)}

- P(x_i, x_j): the probability of terms x_i and x_j occurring together
- P(x_i), P(x_j): the probabilities of terms x_i and x_j occurring in a relevant document
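The edge weight follows directly from the three probabilities; a minimal sketch (the function name is mine):

```python
import math

def emim(p_ij, p_i, p_j):
    """Expected mutual information weight for a spanning-tree edge:
    I = P(x_i, x_j) * log( P(x_i, x_j) / (P(x_i) * P(x_j)) ).
    Zero when the joint probability is zero (the term contributes nothing)."""
    if p_ij == 0.0:
        return 0.0
    return p_ij * math.log(p_ij / (p_i * p_j))
```

Independent terms (where P(x_i, x_j) = P(x_i)P(x_j)) get weight zero; positively associated terms get positive weight, so the maximal spanning tree links the most dependent term pairs.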
10
3.2 The Rocchio Method (1971)
The Rocchio method provides a mechanism for the selection and weighting of expansion terms
- can be used to rank the terms in the judged documents
- the top N can then be added to the query and weighted
- a reasonable solution to the problem of relevance feedback that works very well in practice
- the optimal values of \alpha, \beta, \gamma are determined empirically

w'(t) = \alpha \, w_q(t) + \beta \, \frac{1.0}{|R|} \sum_{r \in R} w_r(t) - \gamma \, \frac{1.0}{|\bar{R}|} \sum_{r \in \bar{R}} w_r(t)

- \alpha: the weight assigned to the original query terms
- \beta: the weight assigned for occurring in relevant documents
- \gamma: the weight assigned for occurring in non-relevant documents
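The update rule is easy to sketch over sparse term-weight dictionaries; the coefficient defaults below are illustrative only, since the slide says the optimal values are determined empirically:

```python
from collections import defaultdict

def rocchio(query_w, rel_docs_w, nonrel_docs_w,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio term re-weighting: alpha * query vector
    + beta * mean of relevant doc vectors
    - gamma * mean of non-relevant doc vectors."""
    w = defaultdict(float)
    for t, v in query_w.items():
        w[t] += alpha * v
    for doc in rel_docs_w:
        for t, v in doc.items():
            w[t] += beta * v / len(rel_docs_w)
    for doc in nonrel_docs_w:
        for t, v in doc.items():
            w[t] -= gamma * v / len(nonrel_docs_w)
    return dict(w)
```

Ranking the resulting weights and keeping the top N gives the expansion terms described above.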
11
3.3 The INQUERY Model(1/2)
INQUERY inference network (Turtle, 1991)
- document portion: computed in advance
- query portion: computed at retrieval time

Document Network
- document nodes: d1...di
- text nodes: t1...tj
- concept representation nodes: r1...rk

Query Network
- query concepts: c1...cm
- queries: q1, q2
- information need: I

Uncertainty is due to differences in word sense

Figure 3.1 Example inference network
12
3.3 The INQUERY Model(2/2)
Relevance Feedback
- Implementation of theoretical relevance feedback was done by Haines (1996)
- Annotated query network: proposition nodes k1, k2; observed relevance judgment nodes j1, j2
- AND nodes are required for an annotation to have an effect on the score

The drawback of this technique
- It requires inferences of considerable complexity
- Relevance judgment: two additional layers of inference and several new propositions are required

Figure 3.3 Annotated query network
13
3.4 Exponential Models
An approach to predicting topic shifts in text using exponential models (Beeferman et al., 1997)
- The model uses ratios of long-range and short-range language models to predict useful terms
- Topic shift: when the long-range language model is no longer able to predict the next word better than the short-range language model

L = \log \frac{P_l(x)}{P_s(x)}

- P_l(x): the probability of seeing word x given the context of the last 500 words
- P_s(x): the probability of seeing word x given the two previous words
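The signal itself is just a log ratio of the two model probabilities; a trivial sketch (the function name is mine, and the probabilities are assumed to come from externally trained models):

```python
import math

def topic_shift_signal(p_long, p_short):
    """log(Pl(x) / Ps(x)): positive while the long-range model predicts
    the next word better than the short-range model; a sustained drop
    below zero suggests a topic shift."""
    return math.log(p_long / p_short)
```

In the relevance-feedback setting, words with a high positive signal are the topical, long-range-predictable terms worth adding to a query.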
14
4. Query Expansion in the Language Modeling Approach
Assumption of this approach
- Users can choose query terms that are likely to occur in documents in which they would be interested

This approach has been developed into a ranking formula by means of probabilistic language models
15
4.1 Interactive Retrieval with Relevance Feedback
Relevance Feedback
- A small number of documents are judged relevant by the user
- The relevance of all the remaining documents is unknown to the system
16
4.2 Document Routing
Document Routing
- The task is to choose terms associated with documents of interest and to avoid those associated with other documents
- A training collection is available with a large number of relevance judgments, both positive and negative, for a particular query

Ratio Method
- Can utilize this additional information by estimating probabilities for both sets
17
4.3 The Ratio Method
The Ratio Method predicts useful terms
- Terms are scored according to their probability of occurrence under the relevant document models relative to the collection
- Terms are ranked by this ratio, and the top N are added to the initial query

L_t = \sum_{d \in R} \log \frac{\hat{P}(t \mid M_d)}{cf_t / cs}

- R: the set of relevant documents
- \hat{P}(t \mid M_d): the probability of term t given the document model for d
- cf_t: the raw count of term t in the collection
- cs: the raw collection size
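One way to read this ranking in code: sum the log ratio over the judged relevant documents and keep the top N terms. Only terms actually present in a relevant document are scored, to keep the log finite (that restriction is my assumption, as is every name below):

```python
import math
from collections import Counter

def rank_expansion_terms(rel_docs, all_docs, top_n=5):
    """Score each term by sum over judged-relevant documents of
    log( P_ml(t|M_d) / (cf_t / cs) ); return the top N terms."""
    cs = sum(len(d) for d in all_docs)            # collection size
    cf = Counter(t for d in all_docs for t in d)  # collection frequencies
    score = Counter()
    for d in rel_docs:
        tf, dl = Counter(d), len(d)
        for t, c in tf.items():
            score[t] += math.log((c / dl) / (cf[t] / cs))
    return [t for t, _ in score.most_common(top_n)]
```

Terms concentrated in the relevant documents get positive scores; terms no more frequent there than in the collection at large score near zero and are not selected.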
18
4.4 Evaluation
Results are measured using recall and precision:

Recall = \frac{|R \cap r|}{|R|} = p(r \mid R)

Precision = \frac{|R \cap r|}{|r|} = p(R \mid r)

Fallout = \frac{|\bar{R} \cap r|}{|\bar{R}|} = p(r \mid \bar{R})

- R: the relevant set; \bar{R}: the non-relevant set
- r: the retrieved set; \bar{r}: the non-retrieved set
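The three set-based measures are direct to compute from document-ID sets (function and argument names are mine):

```python
def evaluate(relevant, retrieved, collection):
    """Recall, precision, and fallout from the relevant set R,
    the retrieved set r, and the full collection."""
    rel, ret = set(relevant), set(retrieved)
    recall = len(rel & ret) / len(rel)        # |R ∩ r| / |R|
    precision = len(rel & ret) / len(ret)     # |R ∩ r| / |r|
    nonrel = set(collection) - rel            # the non-relevant set
    fallout = len(nonrel & ret) / len(nonrel) # |R̄ ∩ r| / |R̄|
    return recall, precision, fallout
```

For example, retrieving {1, 3} when {1, 2} is relevant in a five-document collection gives recall 0.5, precision 0.5, and fallout 1/3.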
19
4.5 Experiments(1/2)
Comparison of the Rocchio method vs. the language modeling approach
- Language model: log ratio of the probability in the judged relevant set
- Rocchio: the weighting function was tf.idf, with no negative feedback (\gamma = 0)
- The language modeling approach works well
20
4.5 Experiments(2/2)
21
4.6 Information Routing Ratio Methods With More Data
Ratio 1
- The log ratio of each term's probability in the judged relevant documents vs. its collection probability:

L_t = \sum_{d \in R} \log \frac{\hat{P}(t \mid M_d)}{cf_t / cs}

Ratio 2
- The log ratio of the average probability in judged relevant documents vs. the average probability in judged non-relevant documents:

L_t = \log \frac{avg(\hat{P}(t \mid M_d) : d \in R)}{avg(\hat{P}(t \mid M_d) : d \in \bar{R})} = \log \frac{\sum_{d \in R} \hat{P}(t \mid M_d) \, / \, |R|}{\sum_{d \in \bar{R}} \hat{P}(t \mid M_d) \, / \, |\bar{R}|}

Result
- The language modeling approach is a good model for retrieval
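Ratio 2 can be sketched with ML document models; the small epsilon floor below keeps the log finite when a term is absent from one side, which is my assumption rather than part of the slide's formula:

```python
import math
from collections import Counter

def ratio2(term, rel_docs, nonrel_docs, eps=1e-9):
    """log( avg P_ml(t|M_d) over judged relevant docs
          / avg P_ml(t|M_d) over judged non-relevant docs ).
    eps is an illustrative smoothing floor, not from the slides."""
    def avg_p(docs):
        return sum(Counter(d)[term] / len(d) for d in docs) / len(docs)
    return math.log((avg_p(rel_docs) + eps) / (avg_p(nonrel_docs) + eps))
```

Terms concentrated in the relevant set score strongly positive, terms concentrated in the non-relevant set strongly negative, and terms spread evenly across both score near zero.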
22
5. Query Term Weighting
Probability estimation
- Maximum likelihood probability
- The average probability (combined via a geometric risk function)

Risk function
- The current risk function treats all terms equally
- The change will be to mix the estimation: a useless term or stop word is assigned an equal probability estimate for every document, so it has no effect on the ranking

User-specified language models
- Queries: a specific type of text produced by the user
- Term weights: equivalent to the generation probabilities of the query model