C.Watters csci6403 1
Probabilistic Retrieval Model
Classification Problem
• For each query assume
– R = set of relevant docs
– NR = set of nonrelevant docs
• For each document, then: what is the probability that it belongs to one set or the other?
• Retrieve dj if P(dj is rel) > P(dj is not rel)
Bayes Theorem
• Probability based on related occurrences
• So P(R|di) is the probability that a doc is relevant given that it has been retrieved
• Ex. P(H|E): the probability that it is July (hypothesis H) given that it is hot (evidence E)
              P(E|H) * P(H)        (prob it is hot given it is July)
  P(H|E) = --------------------
           Σi P(E|Hi) * P(Hi)      (summed over all months: prob it is hot given it is Jan, etc.)
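The month/temperature example can be worked through numerically. The probabilities below are invented for illustration; only the shape of the calculation matches the formula above.

```python
# Bayes' theorem on the July/hot example, with made-up numbers.
# Hypotheses H_i: the twelve months, uniform prior P(H_i) = 1/12.
# Event E = "it is hot"; P(E|month) is assumed for illustration.
p_hot_given_month = {"Jul": 0.8, "Aug": 0.7, "Jun": 0.6, "Jan": 0.05}
# treat the remaining 8 months as moderately unlikely to be hot
for m in ["Feb", "Mar", "Apr", "May", "Sep", "Oct", "Nov", "Dec"]:
    p_hot_given_month[m] = 0.2

prior = 1 / 12  # P(H_i), uniform over months

# denominator: total probability of "hot" = sum_i P(E|H_i) * P(H_i)
p_hot = sum(p * prior for p in p_hot_given_month.values())

# P(July | hot) = P(hot | July) * P(July) / P(hot)
p_july_given_hot = p_hot_given_month["Jul"] * prior / p_hot
print(round(p_july_given_hot, 3))  # 0.213
```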
Assumption
• The distribution of the keywords of interest differs between the relevant docs and the nonrelevant docs
• Also known as the cluster hypothesis
Example
• Getting visas for immigration to Australia and migration within the borders requires a two-week entry permit….
• The long-range migration pattern of geese, interestingly enough, does not include the southern Pacific ….
How to estimate these probabilities???
• Assume relevance depends only on query and document representation (keywords)
• Compute the odds of a given doc being relevant to a given query:

      P(dj rel to q)
   --------------------
    P(dj not rel to q)
• Use this to rank documents
Similarity as Odds
• Sim(dj,q) = P(dj is rel) / P(dj is not rel)
• Using Bayes' theorem we get
• Sim(dj,q) = [P(dj|R) * P(R)] / [P(dj|NR) * P(NR)]
Move from docs to terms
• Assuming independence of terms
• P(ki|R) is the probability that a relevant doc contains the term ki
• Remember that any term may also occur in NR docs; note that P(ki|R) + P(¬ki|R) = 1
• Sim(dj,q) ~ Σi wi,q * wi,j * ( log [ P(ki|R) / (1 − P(ki|R)) ] + log [ (1 − P(ki|NR)) / P(ki|NR) ] )
• This GIVES us a RANK
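A minimal sketch of this ranking formula in Python, assuming binary document and query weights (wi,q = wi,j = 1 for terms present) and invented term probabilities:

```python
import math

def term_relevance_weight(p_ki_R, p_ki_NR):
    """Contribution of term ki to the rank score: log-odds of ki
    occurring in relevant docs, plus log-odds of ki being absent
    from nonrelevant docs (the two log terms on the slide)."""
    return (math.log(p_ki_R / (1 - p_ki_R))
            + math.log((1 - p_ki_NR) / p_ki_NR))

def sim(doc_terms, query_terms, p_R, p_NR):
    """Binary weights: a term contributes only if it appears in
    both the document and the query."""
    return sum(term_relevance_weight(p_R[t], p_NR[t])
               for t in query_terms & doc_terms)

# hypothetical probabilities for two query terms
p_R = {"migration": 0.7, "visa": 0.5}
p_NR = {"migration": 0.3, "visa": 0.1}
score = sim({"migration", "visa", "geese"}, {"migration", "visa"}, p_R, p_NR)
```

A term that is equally likely in relevant and nonrelevant docs (like "visa" with p_R = 0.5 here) contributes only through its rarity in the nonrelevant set.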
OK now what?
• Work with keywords with binary weights (0 or 1)
• Query is a set of keywords
• Doc is a set of keywords
• Need P(ki|R), the probability that a keyword occurs in one of the relevant docs
Getting Started
1. Assume P(ki|R) is constant over all ki
   = 0.5 (even odds) for any given doc
   We are looking for terms that do not fit this assumption!
2. Assume P(ki|NR) = ni / N
   i.e. based on the distribution of terms over the whole collection (ni docs contain ki, out of N docs total)
Finding P(ki)
1. First, retrieve an initial set of docs and take the retrieved set V as the guess for R
   Vi is the subset of V containing keyword ki
   We need to improve our guesses for P(ki|R) and P(ki|NR)
2. So, use the distribution of ki in the docs in V:
   P(ki|R) = Vi / V
3. Assume that docs not retrieved are not relevant:
   P(ki|NR) = (ni − Vi) / (N − V)
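The estimation steps above can be sketched as follows, with docs represented as plain sets of terms. Note that in practice these raw ratios are usually smoothed (e.g. adding 0.5 to the numerators) so that no probability is exactly 0 or 1, which would break the log-odds formula; the sketch keeps the slide's formulas as written.

```python
# n[t] = number of docs in the collection containing term t; N docs total.
# V = list of docs retrieved in the first round (assumed relevant);
# each doc is a set of terms.

def initial_estimates(query_terms, n, N):
    p_R = {t: 0.5 for t in query_terms}        # step 1: even odds
    p_NR = {t: n[t] / N for t in query_terms}  # step 2: overall distribution
    return p_R, p_NR

def reestimate(query_terms, V, n, N):
    # refine both estimates from the retrieved set V
    V_size = len(V)
    p_R, p_NR = {}, {}
    for t in query_terms:
        V_t = sum(1 for doc in V if t in doc)   # Vi: retrieved docs with t
        p_R[t] = V_t / V_size                   # P(ki|R)  = Vi / V
        p_NR[t] = (n[t] - V_t) / (N - V_size)   # P(ki|NR) = (ni - Vi) / (N - V)
    return p_R, p_NR
```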
Now
• Use the new probabilities to rerank the docs
• And try again
• This can be done without human judgement, BUT it helps to get real feedback at step 1
Good and Bad News
• Advantages
– Ranking scheme
• Disadvantages
– Making the initial guess to get Vi
– Binary weights
– Independence of terms
– Computation
Relevance Feedback
Relevance Feedback
• Problem
– Queries average just 2.2 terms and have no (explicit) structure
• Example (relevance feedback)
• Manual
– Add terms
– Remove terms
– Adjust the weights if possible
– Add/remove operators
What can we do automatically?
• Change the query based on documents retrieved
• Change the query based on user preferences
• Change the query based on user history
• Change the query based on a community of users
Hypothesis
• A better query can be discovered by analyzing the features of relevant and nonrelevant items
Feedback and VSM
• Q0 = (q1, q2, … qt), where qi is the weight of query term ki
• Q0 generates hit list H0
• Q' = (q1', q2', … qt'), where qi' is the altered weight of query term ki
• Add a term to the query by increasing its weight to w > 0
• Drop a term by decreasing its weight to w = 0
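As a small illustration (term names invented), representing the query as a term-to-weight map makes adding and dropping terms purely a matter of weight changes:

```python
# Query as a term -> weight mapping; altering weights adds/drops terms.
Q0 = {"migration": 1.0, "australia": 1.0}

Q1 = dict(Q0)           # altered query Q'
Q1["visa"] = 0.8        # add a term: give it weight w > 0
Q1["australia"] = 0.0   # drop a term: set its weight to 0

# only terms with positive weight take part in matching
active = {t for t, w in Q1.items() if w > 0}
```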
VSM View
• Move the query vector, in the t-dimensional term space, from an area of lower density to an area of higher density of close (relevant) documents
Optimal Query and VSM
• Given
  Sim(Dj,Q) = Σi dij * qi
• The optimal query is then (Dj is the term vector of doc j):

  Qopt = (1/|R|) Σ{Dj ∈ R} Dj/|Dj| − (1/(N−|R|)) Σ{Dj ∉ R} Dj/|Dj|

• |Dj| is the Euclidean vector length
Feedback from relevant Docs retrieved
• Keep the original query
• Replace the sums over all docs with sums over the known relevant (R') and known nonrelevant (NR') docs retrieved:

  Q1 = α Q0 + β Σ{Dj ∈ R'} Dj/|Dj| − γ Σ{Dj ∈ NR'} Dj/|Dj|

  Qi+1 = α Qi + β Σ{Dj ∈ R'} Dj/|Dj| − γ Σ{Dj ∈ NR'} Dj/|Dj|
Example
• Q' = αQ + βR − γNR
• Q0 = (5, 0, 3, 0, 1)
• Relevant: D1 = (2, 1, 2, 0, 0)
• Nonrelevant: D2 = (1, 0, 0, 0, 2)
• α = 1, β = .5, γ = .25
• Q1 = (5,0,3,0,1) + .5(2,1,2,0,0) − .25(1,0,0,0,2)
     = (5.75, .5, 4, 0, .5)
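The arithmetic above can be checked with a few lines of Python:

```python
# Rocchio update Q' = a*Q + b*R - c*NR, componentwise over term vectors.
alpha, beta, gamma = 1.0, 0.5, 0.25
Q0 = [5, 0, 3, 0, 1]
D1 = [2, 1, 2, 0, 0]   # relevant doc
D2 = [1, 0, 0, 0, 2]   # nonrelevant doc

Q1 = [alpha * q + beta * r - gamma * nr
      for q, r, nr in zip(Q0, D1, D2)]
print(Q1)  # [5.75, 0.5, 4.0, 0.0, 0.5]
```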
Variations
• Don't normalize by the number of judged docs
• Use only the highest-ranked nonrelevant docs
– Effective with few judged docs
• Rocchio: choose β and γ = 1 when there are many judged docs
• Expanding by all terms is effective; expanding by only the most highly weighted terms is not!
Relevance Feedback for Boolean
• Examine terms in relevant docs
• Discover conjuncts (t1 AND t2)
– Phrase detection
– Persistent co-occurrences (box car)
• Discover disjuncts (t1 OR t3)
– Thesaurus
– Occasional co-occurrences (auto car)
– Terms that co-occur with common friends (auto & car, car & sedan)
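One way to surface such co-occurrences is simply to count term pairs across the judged-relevant docs; the documents and terms below are invented:

```python
from collections import Counter
from itertools import combinations

# Judged-relevant docs as sets of terms (hypothetical data).
relevant_docs = [
    {"auto", "car", "engine"},
    {"car", "sedan", "engine"},
    {"auto", "car", "sedan"},
]

# Count every unordered term pair within each relevant doc.
pair_counts = Counter()
for doc in relevant_docs:
    for pair in combinations(sorted(doc), 2):
        pair_counts[pair] += 1

# Pairs seen in many relevant docs are candidates for conjuncts
# (persistent co-occurrence) or thesaurus-style disjuncts.
frequent_pairs = {p for p, c in pair_counts.items() if c >= 2}
```

Persistent pairs suggest AND-conjuncts or phrases; terms that share frequent partners (auto and sedan both pairing with car) suggest OR-disjuncts.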
Relevance Feedback Summary
• Can be very effective
• Need a reasonable number of judged docs
– Results are unpredictable with < 5 judged docs
• Can be used with both VSM and Boolean
• Requires either direct input from users or monitoring of user behaviour (reading time, printing, saving, etc.)