LANGUAGE MODELS FOR RELEVANCE FEEDBACK
2003. 07. 14
Lee Won Hee
2
Abstract
The language modeling approach to IR treats the query as a random event and ranks documents according to the likelihood of generating the query: users are assumed to have a prototypical document in mind and to choose query terms accordingly. Inferences about the semantic content of documents do not need to be made, resulting in a conceptually simpler model.
3
1. Introduction
The language modeling approach to IR, developed by Ponte and Croft (1998)
- Query: a random event generated according to a probability distribution
- Document similarity: estimate a model of the term generation probabilities of the query terms for each document, then rank the documents according to the probability of generating the query

The main advantages of the language modeling approach
- Document boundaries are not predefined: the document-level statistics tf and idf are used
- Uncertainty is modeled by probabilities: well suited to noisy data such as OCR text and automatically recognized speech transcripts, and to relevance feedback or document routing
4
2. The Language Modeling Approach to IR
The query generation probability is estimated starting from the maximum likelihood estimate of the probability of term t in document d:

\hat{P}_{ml}(t \mid M_d) = \frac{tf(t,d)}{dl_d}

- tf(t,d): the raw term frequency of term t in document d
- dl_d: the total number of tokens in document d
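The maximum likelihood estimate above can be sketched in a few lines of Python (the example document and the function name are illustrative, not from the slides):

```python
from collections import Counter

def p_ml(term, doc_tokens):
    """Maximum likelihood estimate: P_ml(t | M_d) = tf(t, d) / dl_d."""
    tf = Counter(doc_tokens)           # tf(t, d): raw term frequencies
    return tf[term] / len(doc_tokens)  # dl_d: total tokens in the document

doc = "the cat sat on the mat".split()
print(p_ml("the", doc))  # 2 / 6 = 0.333...
```

Note that any term absent from the document gets probability zero, which is exactly the problem the next slide addresses.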
5
2.1 Insufficient Data
Two problems with the maximum likelihood estimator

We do not wish to assign a probability of zero to a document that is missing one or more of the query terms
- If a user included several synonyms in the query, a document missing even one of them would not be retrieved
- A more reasonable fallback is the collection distribution:

\hat{P}(t) = \frac{cf_t}{cs}

- cf_t: the raw count of term t in the collection
- cs: the raw collection size, i.e. the total number of tokens in the collection

We only have a document-sized sample from each document's distribution, so the variation in the raw counts may partly be accounted for by randomness
6
2.2 Averaging
The mean probability estimate of t in the documents containing it
- circumvents the problem of insufficient data
- some risk: if the mean were used by itself, there would be no distinction between documents with different term frequencies

\hat{P}_{avg}(t) = \frac{\sum_{d:\, tf(t,d) > 0} \hat{P}_{ml}(t \mid M_d)}{df_t}

Combining the two estimates using the geometric distribution (Ghosh et al., 1983)
- robustness of estimation; minimizes the risk

\hat{R}_{t,d} = \left(\frac{1.0}{1.0 + \bar{f}_t}\right) \times \left(\frac{\bar{f}_t}{1.0 + \bar{f}_t}\right)^{tf(t,d)}

- df_t: the document frequency of t
- \bar{f}_t: the mean term frequency of term t in the documents containing it
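A possible reading of the averaged estimate and the geometric risk function in code, assuming documents are plain token lists (all function names are hypothetical):

```python
from collections import Counter

def collection_stats(docs):
    """df_t (document frequency) and f_bar_t (mean term frequency of t
    in the documents that contain it) for a list of tokenized documents."""
    tfs = [Counter(d) for d in docs]
    df, total = Counter(), Counter()
    for tf in tfs:
        for t, c in tf.items():
            df[t] += 1
            total[t] += c
    fbar = {t: total[t] / df[t] for t in df}
    return tfs, df, fbar

def p_avg(term, docs):
    """Mean ML estimate of the term over the documents containing it."""
    tfs, df, _ = collection_stats(docs)
    s = sum(tf[term] / sum(tf.values()) for tf in tfs if tf[term] > 0)
    return s / df[term]

def risk(term, tf_in_doc, fbar):
    """Geometric-distribution risk R_{t,d}: largest at low term frequency,
    shrinking as tf(t, d) grows."""
    f = fbar[term]
    return (1.0 / (1.0 + f)) * (f / (1.0 + f)) ** tf_in_doc
```

The risk term decides how far to back off from the raw in-document estimate toward the collection-wide average.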
7
2.3 Combining the Two Estimates
The estimate of the probability of producing the query for a given document model:

\hat{P}(Q \mid M_d) = \prod_{t \in Q} \hat{p}(t,d) \times \prod_{t \notin Q} \left(1.0 - \hat{p}(t,d)\right)

where

\hat{p}(t,d) = \begin{cases} \hat{P}_{ml}(t,d)^{(1.0 - \hat{R}_{t,d})} \times \hat{P}_{avg}(t)^{\hat{R}_{t,d}} & \text{if } tf(t,d) > 0 \\[4pt] \dfrac{cf_t}{cs} & \text{otherwise} \end{cases}

- first term: the probability of producing the terms in the query
- second term: the probability of not producing other terms; such terms are better discriminators of the document
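Putting the three estimates together, the scoring formula can be sketched end-to-end; this is a minimal reading of the slides (computed in log space for numerical stability, which is my addition):

```python
import math
from collections import Counter

def query_likelihood(query, docs, d_index):
    """Sketch of the query generation probability:
    P(Q|M_d) = prod_{t in Q} p(t,d) * prod_{t not in Q} (1 - p(t,d)),
    with p(t,d) = P_ml(t,d)^(1-R) * P_avg(t)^R when tf(t,d) > 0,
    and cf_t / cs otherwise. Returns log P(Q|M_d)."""
    tfs = [Counter(d) for d in docs]
    cs = sum(len(d) for d in docs)                    # collection size
    cf = Counter(t for d in docs for t in d)          # collection frequencies
    df = Counter(t for tf in tfs for t in tf)         # document frequencies
    fbar = {t: sum(tf[t] for tf in tfs) / df[t] for t in df}
    avg = {t: sum(tf[t] / sum(tf.values()) for tf in tfs if tf[t] > 0) / df[t]
           for t in df}
    tf_d, dl = tfs[d_index], len(docs[d_index])

    def p(t):
        if tf_d[t] > 0:
            r = (1.0 / (1.0 + fbar[t])) * (fbar[t] / (1.0 + fbar[t])) ** tf_d[t]
            return (tf_d[t] / dl) ** (1.0 - r) * avg[t] ** r
        return cf[t] / cs

    log_p = sum(math.log(p(t)) for t in query)
    log_p += sum(math.log(1.0 - p(t)) for t in df if t not in query)
    return log_p
```

A document containing a query term should score above one that lacks it, which is easy to check on a toy collection.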
8
3. Related Work
1. The Harper and van Rijsbergen model
2. The Rocchio method
3. The INQUERY model
4. Exponential models
9
3.1 The Harper and van Rijsbergen Model (1978)
Goal: obtain better estimates of the probability of relevance of a document given the query

An approximation of the dependence between query terms was defined by the authors by means of a maximal spanning tree
- each node of the tree: a single query term
- the edges between nodes: weighted by a measure of term dependency
- the tree spans all of the nodes and maximizes the expected mutual information:

I(x_i, x_j) = P(x_i, x_j) \log \frac{P(x_i, x_j)}{P(x_i) \, P(x_j)}

- P(x_i, x_j): the probability of terms x_i and x_j occurring together
- P(x_i), P(x_j): the probabilities of terms x_i and x_j occurring in a relevant document
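The edge weight follows directly from the three probabilities; a minimal sketch (the function name is mine):

```python
import math

def emim(p_ij, p_i, p_j):
    """Expected mutual information weight for a spanning-tree edge:
    I = P(x_i, x_j) * log( P(x_i, x_j) / (P(x_i) * P(x_j)) ).
    Zero when the joint probability is zero (the term contributes nothing)."""
    if p_ij == 0.0:
        return 0.0
    return p_ij * math.log(p_ij / (p_i * p_j))
```

Independent terms (where P(x_i, x_j) = P(x_i)P(x_j)) get weight zero; positively associated terms get positive weight, so the maximal spanning tree links the most dependent term pairs.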
10
3.2 The Rocchio Method (1971)
The Rocchio method provides a mechanism for the selection and weighting of expansion terms
- can be used to rank the terms in the judged documents
- the top N can then be added to the query and weighted
- a reasonable solution to the problem of relevance feedback that works very well in practice
- the optimal values of \alpha, \beta, \gamma are determined empirically

w'(t) = \alpha \, w_q(t) + \beta \, \frac{1.0}{|R|} \sum_{r \in R} w_r(t) - \gamma \, \frac{1.0}{|\bar{R}|} \sum_{r \in \bar{R}} w_r(t)

- \alpha: the weight assigned to the original query terms
- \beta: the weight assigned for occurring in relevant documents
- \gamma: the weight assigned for occurring in non-relevant documents
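The update rule is easy to sketch over sparse term-weight dictionaries; the coefficient defaults below are illustrative only, since the slide says the optimal values are determined empirically:

```python
from collections import defaultdict

def rocchio(query_w, rel_docs_w, nonrel_docs_w,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio term re-weighting: alpha * query vector
    + beta * mean of relevant doc vectors
    - gamma * mean of non-relevant doc vectors."""
    w = defaultdict(float)
    for t, v in query_w.items():
        w[t] += alpha * v
    for doc in rel_docs_w:
        for t, v in doc.items():
            w[t] += beta * v / len(rel_docs_w)
    for doc in nonrel_docs_w:
        for t, v in doc.items():
            w[t] -= gamma * v / len(nonrel_docs_w)
    return dict(w)
```

Ranking the resulting weights and keeping the top N gives the expansion terms described above.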
11
3.3 The INQUERY Model(1/2)
INQUERY inference network (Turtle, 1991)
- document portion: computed in advance
- query portion: computed at retrieval time

Document Network
- document nodes: d1...di
- text nodes: t1...tj
- concept representation nodes: r1...rk

Query Network
- query concepts: c1...cm
- queries: q1, q2
- information need: I

Uncertainty is due to differences in word sense

Figure 3.1 Example inference network
12
3.3 The INQUERY Model(2/2)
Relevance Feedback
- Implementation of theoretical relevance feedback was done by Haines (1996)
- Annotated query network: proposition nodes k1, k2; observed relevance judgment nodes j1, j2
- AND nodes are required for an annotation to have an effect on the score

The drawback of this technique
- It requires inferences of considerable complexity
- Relevance judgment: two additional layers of inference and several new propositions are required

Figure 3.3 Annotated query network
13
3.4 Exponential Models
An approach to predicting topic shifts in text using exponential models (Beeferman et al., 1997)
- The model uses ratios of long-range and short-range language models to predict useful terms
- Topic shift: when the long-range language model is no longer able to predict the next word better than the short-range language model

L = \log \frac{P_l(x)}{P_s(x)}

- P_l(x): the probability of seeing word x given the context of the last 500 words
- P_s(x): the probability of seeing word x given the two previous words
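The signal itself is just a log ratio of the two model probabilities; a trivial sketch (the function name is mine, and the probabilities are assumed to come from externally trained models):

```python
import math

def topic_shift_signal(p_long, p_short):
    """log(Pl(x) / Ps(x)): positive while the long-range model predicts
    the next word better than the short-range model; a sustained drop
    below zero suggests a topic shift."""
    return math.log(p_long / p_short)
```

In the relevance-feedback setting, words with a high positive signal are the topical, long-range-predictable terms worth adding to a query.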
14
4. Query Expansion in the Language Modeling Approach
Assumption of this approach
- Users can choose query terms that are likely to occur in documents in which they would be interested

This approach has been developed into a ranking formula by means of probabilistic language models
15
4.1 Interactive Retrieval with Relevance Feedback
Relevance Feedback
- A small number of documents are judged relevant by the user
- The relevance of all the remaining documents is unknown to the system
16
4.2 Document Routing
Document Routing
- The task is to choose terms associated with documents of interest and to avoid those associated with other documents
- A training collection is available with a large number of relevance judgments, both positive and negative, for a particular query

Ratio Method
- Can utilize this additional information by estimating probabilities for both sets
17
4.3 The Ratio Method
The Ratio Method predicts useful terms
- Terms are scored according to their probability of occurrence under the relevant document models relative to the collection
- Terms are ranked by this ratio, and the top N are added to the initial query

L_t = \sum_{d \in R} \log \frac{\hat{P}(t \mid M_d)}{cf_t / cs}

- R: the set of relevant documents
- \hat{P}(t \mid M_d): the probability of term t given the document model for d
- cf_t: the raw count of term t in the collection
- cs: the raw collection size
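One way to read this ranking in code: sum the log ratio over the judged relevant documents and keep the top N terms. Only terms actually present in a relevant document are scored, to keep the log finite (that restriction is my assumption, as is every name below):

```python
import math
from collections import Counter

def rank_expansion_terms(rel_docs, all_docs, top_n=5):
    """Score each term by sum over judged-relevant documents of
    log( P_ml(t|M_d) / (cf_t / cs) ); return the top N terms."""
    cs = sum(len(d) for d in all_docs)            # collection size
    cf = Counter(t for d in all_docs for t in d)  # collection frequencies
    score = Counter()
    for d in rel_docs:
        tf, dl = Counter(d), len(d)
        for t, c in tf.items():
            score[t] += math.log((c / dl) / (cf[t] / cs))
    return [t for t, _ in score.most_common(top_n)]
```

Terms concentrated in the relevant documents get positive scores; terms no more frequent there than in the collection at large score near zero and are not selected.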
18
4.4 Evaluation
Results are measured using recall and precision:

Recall = \frac{|R \cap r|}{|R|} = p(r \mid R)

Precision = \frac{|R \cap r|}{|r|} = p(R \mid r)

Fallout = \frac{|\bar{R} \cap r|}{|\bar{R}|} = p(r \mid \bar{R})

- R: the relevant set; \bar{R}: the non-relevant set
- r: the retrieved set; \bar{r}: the non-retrieved set
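The three set-based measures are direct to compute from document-ID sets (function and argument names are mine):

```python
def evaluate(relevant, retrieved, collection):
    """Recall, precision, and fallout from the relevant set R,
    the retrieved set r, and the full collection."""
    rel, ret = set(relevant), set(retrieved)
    recall = len(rel & ret) / len(rel)        # |R ∩ r| / |R|
    precision = len(rel & ret) / len(ret)     # |R ∩ r| / |r|
    nonrel = set(collection) - rel            # the non-relevant set
    fallout = len(nonrel & ret) / len(nonrel) # |R̄ ∩ r| / |R̄|
    return recall, precision, fallout
```

For example, retrieving {1, 3} when {1, 2} is relevant in a five-document collection gives recall 0.5, precision 0.5, and fallout 1/3.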
19
4.5 Experiments(1/2)
Comparison of the Rocchio method vs. the language modeling approach
- Language model: log ratio of the probability in the judged relevant set
- Rocchio: the weighting function was tf.idf, with no negative feedback (\gamma = 0)
- The language modeling approach works well
20
4.5 Experiments(2/2)
21
4.6 Information Routing Ratio Methods With More Data
Ratio 1
- The log ratio of each term's probability in the judged relevant documents vs. its collection probability:

L_t = \sum_{d \in R} \log \frac{\hat{P}(t \mid M_d)}{cf_t / cs}

Ratio 2
- The log ratio of the average probability in judged relevant documents vs. the average probability in judged non-relevant documents:

L_t = \log \frac{avg(\hat{P}(t \mid M_d) : d \in R)}{avg(\hat{P}(t \mid M_d) : d \in \bar{R})} = \log \frac{\sum_{d \in R} \hat{P}(t \mid M_d) \, / \, |R|}{\sum_{d \in \bar{R}} \hat{P}(t \mid M_d) \, / \, |\bar{R}|}

Result
- The language modeling approach is a good model for retrieval
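Ratio 2 can be sketched with ML document models; the small epsilon floor below keeps the log finite when a term is absent from one side, which is my assumption rather than part of the slide's formula:

```python
import math
from collections import Counter

def ratio2(term, rel_docs, nonrel_docs, eps=1e-9):
    """log( avg P_ml(t|M_d) over judged relevant docs
          / avg P_ml(t|M_d) over judged non-relevant docs ).
    eps is an illustrative smoothing floor, not from the slides."""
    def avg_p(docs):
        return sum(Counter(d)[term] / len(d) for d in docs) / len(docs)
    return math.log((avg_p(rel_docs) + eps) / (avg_p(nonrel_docs) + eps))
```

Terms concentrated in the relevant set score strongly positive, terms concentrated in the non-relevant set strongly negative, and terms spread evenly across both score near zero.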
22
5. Query Term Weighting
Probability estimation
- Maximum likelihood probability
- The average probability (combined via a geometric risk function)

Risk function
- The current risk function treats all terms equally
- The change will be to mix the estimation: a useless term or stop word is assigned an equal probability estimate for every document, so it has no effect on the ranking

User-specified language models
- Queries: a specific type of text produced by the user
- Term weights: equivalent to the generation probabilities of the query model