
Probabilistic Retrieval Model


Classification Problem

• For each query assume
  – R = set of relevant docs
  – NR = set of nonrelevant docs

• For each document, what is the probability that it belongs to one set or the other?

• Retrieve dj if P(dj is rel) > P(dj is not rel)


Bayes Theorem

• Probability of one event computed from related (conditional) occurrences

• So P(R|dj) is the probability that a doc is relevant given that it has been retrieved

• Ex. P(H|E): the probability it is July (hypothesis) given that it is hot (event)

                 P(E|H) * P(H)          (prob it's hot given it is July)
  P(H|E) =  ------------------------
             Σi P(E|Hi) * P(Hi)         (prob it's hot given it is Jan, etc., over all months)
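To make the July/hot example concrete, here is a minimal Python sketch of the calculation. The prior and likelihood values are invented purely for illustration, and the hypothesis space is restricted to three months.

```python
# A minimal sketch of the July/hot example. The probabilities below are
# invented for illustration, and the hypothesis space is limited to three
# months so the denominator stays short.

priors = {"Jul": 1 / 3, "Jan": 1 / 3, "Apr": 1 / 3}      # P(Hi): prior for each month
likelihood = {"Jul": 0.80, "Jan": 0.05, "Apr": 0.30}     # P(E|Hi): prob it's hot in that month

# Bayes: P(H|E) = P(E|H) * P(H) / sum_i P(E|Hi) * P(Hi)
evidence = sum(likelihood[m] * priors[m] for m in priors)
posterior_july = likelihood["Jul"] * priors["Jul"] / evidence
print(f"P(July | hot) = {posterior_july:.2f}")           # ~0.70 with these numbers
```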


Assumption

• The distribution of keywords of interest is different in the relevant docs vs. the nonrelevant docs

• Also known as the cluster hypothesis


• Getting visas for immigration to Australia and migration within the borders requires a two-week entry permit….

• The long-range migration pattern of geese, interestingly enough, does not include the southern Pacific ….


How to estimate these probabilities???

• Assume relevance depends only on query and document representation (keywords)

• Computing the odds of a given doc being relevant to a given query:

        P(dj rel to q)
      -------------------
      P(dj not rel to q)

• Use this to rank documents


Similarity as Odds

• Sim(dj,q) = P(dj is rel) / P(dj is not rel)

• Using Bayes we get

• Sim(dj,q) = [ P(dj|R) * P(R) ] / [ P(dj|NR) * P(NR) ]

• P(R) and P(NR) are the same for every doc, so they do not change the ranking


Move from docs to terms

• Assuming independence of terms

• P(ki|R) is the probability that a relevant doc contains the term ki

• Remember that a term need not occur in every relevant doc (it may also occur in NR docs), so P(ki|R) + P(¬ki|R) = 1, where ¬ki means the term is absent

• Sim(dj,q) ~ Σi wi,q * wi,j * ( log [ P(ki|R) / (1 − P(ki|R)) ] + log [ (1 − P(ki|NR)) / P(ki|NR) ] )

• GIVES us RANK
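As a concrete illustration of this ranking formula, here is a minimal Python sketch assuming binary (0/1) weights and already-estimated probabilities; the toy collection and all probability values are invented.

```python
import math

# A minimal sketch of the log-odds ranking formula above, assuming binary
# (0/1) term weights. Docs and probabilities are invented for illustration.

def term_weight(p_rel, p_nonrel):
    # log-odds weight of a term: log[p/(1-p)] + log[(1-q)/q]
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

def sim(doc_terms, query_terms, p_rel, p_nonrel):
    # sum the weights of the terms shared by the doc and the query
    return sum(term_weight(p_rel[k], p_nonrel[k]) for k in query_terms & doc_terms)

query = {"migration", "visa"}
docs = {"d1": {"visa", "migration", "australia"},
        "d2": {"migration", "geese", "pacific"}}

# Hypothetical estimates of P(ki|R) and P(ki|NR) for each query term.
p_rel = {"migration": 0.8, "visa": 0.6}
p_nonrel = {"migration": 0.4, "visa": 0.1}

ranking = sorted(docs, key=lambda d: sim(docs[d], query, p_rel, p_nonrel), reverse=True)
print(ranking)   # docs ordered by decreasing log-odds score: ['d1', 'd2']
```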


OK now what?

• Work with keywords with weights 0 and 1

• Query is a set of keywords

• Doc is a set of keywords

• Need P(ki|R)

• Prob that a keyword occurs in one of the relevant docs


Getting Started

1. Assume P(ki|R) is constant over all ki

   = 0.5 (even odds) for any given doc

   We are looking for terms that do not fit this!

2. Assume P(ki|NR) = ni / N

   i.e. based on the distribution of terms overall (ni = number of docs containing ki, N = total number of docs)


Finding P(ki)

1. First, retrieve an initial set of docs; call the retrieved set V and treat it as our estimate of R

   Vi is the subset of V containing keyword ki

   We need to improve our guesses for P(ki|R) & P(ki|NR)

2. So, use the distribution of ki in the docs in V:

   P(ki|R) = |Vi| / |V|

3. Assume that docs not retrieved are not relevant:

   P(ki|NR) = (ni − |Vi|) / (N − |V|)
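Putting the bootstrap estimates and the reestimation step together, here is a minimal Python sketch; the toy collection, the top-2 cutoff used to form V, and the 0.5 smoothing added to keep probabilities away from 0 and 1 are all assumptions for illustration.

```python
import math

# A minimal sketch of the bootstrap-and-reestimate procedure, with docs as
# sets of keywords. The collection, the top-2 cutoff for V, and the 0.5
# smoothing are assumptions, not taken from the slides.

def bir_weight(p_rel, p_nonrel):
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

def rank(docs, query, p_rel, p_nonrel):
    def score(terms):
        return sum(bir_weight(p_rel[k], p_nonrel[k]) for k in query & terms)
    return sorted(docs, key=lambda d: score(docs[d]), reverse=True)

docs = {"d1": {"visa", "migration"}, "d2": {"migration", "geese"},
        "d3": {"visa", "migration", "permit"}, "d4": {"geese", "pacific"},
        "d5": {"pacific", "ocean"}}
query = {"visa", "migration"}
N = len(docs)

# Getting started: P(ki|R) = 0.5, P(ki|NR) = ni / N.
n = {k: sum(k in terms for terms in docs.values()) for k in query}
p_rel = {k: 0.5 for k in query}
p_nonrel = {k: n[k] / N for k in query}

# Step 1: initial retrieval; take the top 2 ranked docs as V (assumed relevant).
V = rank(docs, query, p_rel, p_nonrel)[:2]
Vi = {k: sum(k in docs[d] for d in V) for k in query}

# Steps 2-3: reestimate from V, smoothing by 0.5 so no probability hits 0 or 1.
p_rel = {k: (Vi[k] + 0.5) / (len(V) + 1) for k in query}
p_nonrel = {k: (n[k] - Vi[k] + 0.5) / (N - len(V) + 1) for k in query}

print(rank(docs, query, p_rel, p_nonrel))   # reranked docs
```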


Now

• Use new probs to rerank docs

• And try again

• This can be done without human judgement BUT it helps to get real feedback at step 1


Good and Bad News

• Advantages
  – Ranking scheme

• Disadvantages
  – Making the initial guess to get Vi
  – Binary weights
  – Independence of terms
  – Computation


Relevance Feedback



• Problem
  – Queries average only about 2.2 terms and have no (explicit) structure

• Example (relevance feedback)

• Manual
  – Add terms
  – Remove terms
  – Adjust the weights if possible
  – Add/remove operators


What can we do automatically?

• ????

• Change query based on documents retrieved

• Change query based on user preferences

• Change query based on user history

• Change query based on community of users


Hypothesis

• A better query can be discovered by analyzing the features in relevant and in nonrelevant items


Feedback and VSM

• Q0 = (q1, q2, …, qt), where qi is the weight of query term i

• Q0 generates an initial set of hits H0

• Q' = (q1', q2', …, qt'), where qi' is the altered weight of query term i

• Add a term to the query by raising its weight from 0 to some w > 0

• Drop a term by lowering its weight to 0
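A minimal sketch of these weight adjustments, assuming a 5-term vocabulary and the query weights used in the later example; which terms get raised or dropped here is purely illustrative.

```python
# A minimal sketch of adjusting query-term weights in a vector model. The
# 5-term vocabulary and which terms get raised/dropped are illustrative.

q0 = [5.0, 0.0, 3.0, 0.0, 1.0]   # Q0: index i holds weight qi

q_prime = list(q0)
q_prime[1] = 0.5                 # add term 2: raise its weight from 0 to some w > 0
q_prime[4] = 0.0                 # drop term 5: lower its weight to 0

print(q_prime)                   # Q' = [5.0, 0.5, 3.0, 0.0, 0.0]
```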


VSM View

• Move the query vector in the t-dimensional term space from an area of lower density to an area of higher density of close (relevant) documents


Optimal Query and VSM

• Given

  Sim(Dj, Q) = Σi dij · qi

• The optimal query is then (Dj is the term vector of doc j):

  Qopt = (1/|R|) Σ(Dj ∈ R) Dj / |Dj|  −  (1/(N − |R|)) Σ(Dj ∉ R) Dj / |Dj|

• |Dj| is the Euclidean vector length


Feedback from relevant Docs retrieved

• Keep original query

• Replace the sums with sums over the known relevant docs (R') and known nonrelevant docs (NR')

• Q1 = Q0 + (β/|R'|) Σ(Dj ∈ R') Dj / |Dj|  −  (γ/|NR'|) Σ(Dj ∈ NR') Dj / |Dj|

• Qi+1 = α Qi + (β/|R'|) Σ(Dj ∈ R') Dj / |Dj|  −  (γ/|NR'|) Σ(Dj ∈ NR') Dj / |Dj|


Example

• Q' = αQ + βR − γNR

• Q0 = (5, 0, 3, 0, 1)

• Relevant: D1 = (2, 1, 2, 0, 0)

• Nonrelevant: D2 = (1, 0, 0, 0, 2)

• α = 1, β = .5, γ = .25

• Q1 = (5, 0, 3, 0, 1) + .5 (2, 1, 2, 0, 0) − .25 (1, 0, 0, 0, 2)

     = (5.75, .5, 4, 0, .5)
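A minimal Python sketch that reproduces the arithmetic of this example:

```python
# A minimal sketch of the worked example above (vectors over a 5-term vocabulary).

alpha, beta, gamma = 1.0, 0.5, 0.25

q0 = [5, 0, 3, 0, 1]             # original query Q0
d1 = [2, 1, 2, 0, 0]             # judged relevant
d2 = [1, 0, 0, 0, 2]             # judged nonrelevant

# Q1 = alpha*Q0 + beta*D1 - gamma*D2, component by component.
q1 = [alpha * q + beta * r - gamma * nr for q, r, nr in zip(q0, d1, d2)]
print(q1)                        # [5.75, 0.5, 4.0, 0.0, 0.5]
```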


Variations

• Don't normalize by the number of judged docs

• Use only the highest-ranked nonrelevant docs
  – Effective with few judged docs

• Rocchio: choose β and γ = 1 for many judged docs

• Expanding by all terms is effective

• Expanding by only the most highly weighted terms is not!


Relevance Feedback for Boolean

• Examine terms in relevant docs

• Discover conjuncts (t1 and t2)
  – Phrase detection
  – Persistent co-occurrences (box car)

• Discover co-occurrences (t1 or t3)
  – Thesaurus
  – Occasional co-occurrences (auto, car)
  – Co-occur with friends (auto & car, car & sedan)
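One way to mine such co-occurrences automatically is to count term pairs inside the judged-relevant docs; the sketch below does this in Python with an invented toy collection and an arbitrary count threshold.

```python
from collections import Counter
from itertools import combinations

# A minimal sketch of mining term pairs from judged-relevant docs, as a way
# to suggest conjuncts/disjuncts for a Boolean query. The docs (as keyword
# sets) and the count threshold are illustrative assumptions.

relevant_docs = [
    {"auto", "car", "engine"},
    {"car", "sedan", "engine"},
    {"auto", "car", "sedan"},
]

pair_counts = Counter()
for terms in relevant_docs:
    pair_counts.update(combinations(sorted(terms), 2))

# Pairs that co-occur persistently are candidate conjuncts (t1 and t2);
# pairs that rarely co-occur but share many partners suggest disjuncts.
threshold = 2
candidates = [pair for pair, c in pair_counts.items() if c >= threshold]
print(candidates)   # [('auto', 'car'), ('car', 'engine'), ('car', 'sedan')]
```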


Relevance Feedback Summary

• Can be very effective

• Need a reasonable number of judged docs
  – Results are unpredictable with fewer than 5 judged docs

• Can be used with both VSM and Boolean

• Requires either direct input from users or monitoring (time spent, printing, saving, etc.)