C.Watters csci6403 1
Probabilistic Retrieval Model
Classification Problem
• For each query assume
– R = set of relevant docs
– NR = set of nonrelevant docs
• For each document, then: what is the probability that it belongs to one set or the other?
• Retrieve dj if P(dj is rel) > P(dj is not rel)
Bayes Theorem
• Probability based on related occurrences
• So P(R|di) is the probability that a doc is relevant given that it has been retrieved
• Ex. P(H|E): the probability that it is July (hypothesis H) given that it is hot (evidence E)
              P(E|H) * P(H)        (prob it is hot given it is July)
  P(H|E) = --------------------
           Σi P(E|Hi) * P(Hi)      (summed over all months: prob it is hot given it is Jan, etc.)
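The month/temperature example can be worked through numerically. The probabilities below are invented for illustration; only the shape of the calculation matches the formula above.

```python
# Bayes' theorem on the July/hot example, with made-up numbers.
# Hypotheses H_i: the twelve months, uniform prior P(H_i) = 1/12.
# Event E = "it is hot"; P(E|month) is assumed for illustration.
p_hot_given_month = {"Jul": 0.8, "Aug": 0.7, "Jun": 0.6, "Jan": 0.05}
# treat the remaining 8 months as moderately unlikely to be hot
for m in ["Feb", "Mar", "Apr", "May", "Sep", "Oct", "Nov", "Dec"]:
    p_hot_given_month[m] = 0.2

prior = 1 / 12  # P(H_i), uniform over months

# denominator: total probability of "hot" = sum_i P(E|H_i) * P(H_i)
p_hot = sum(p * prior for p in p_hot_given_month.values())

# P(July | hot) = P(hot | July) * P(July) / P(hot)
p_july_given_hot = p_hot_given_month["Jul"] * prior / p_hot
print(round(p_july_given_hot, 3))  # 0.213
```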
Assumption
• The distribution of the keywords of interest differs between the relevant docs and the nonrelevant docs
• Also known as the cluster hypothesis
Example
• Getting visas for immigration to Australia and migration within the borders requires a two-week entry permit….
• The long-range migration pattern of geese, interestingly enough, does not include the southern Pacific ….
How to estimate these probabilities???
• Assume relevance depends only on query and document representation (keywords)
• Compute the odds of a given doc being relevant to a given query:

      P(dj rel to q)
   --------------------
    P(dj not rel to q)
• Use this to rank documents
Similarity as Odds
• Sim(dj,q) = P(dj is rel) / P(dj is not rel)
• Using Bayes' theorem we get
• Sim(dj,q) = [P(dj|R) * P(R)] / [P(dj|NR) * P(NR)]
Move from docs to terms
• Assuming independence of terms
• P(ki|R) is the probability that a relevant doc contains the term ki
• Remember that any term may also occur in NR docs; note that P(ki|R) + P(¬ki|R) = 1
• Sim(dj,q) ~ Σi wi,q * wi,j * ( log [ P(ki|R) / (1 − P(ki|R)) ] + log [ (1 − P(ki|NR)) / P(ki|NR) ] )
• This GIVES us a RANK
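A minimal sketch of this ranking formula in Python, assuming binary document and query weights (wi,q = wi,j = 1 for terms present) and invented term probabilities:

```python
import math

def term_relevance_weight(p_ki_R, p_ki_NR):
    """Contribution of term ki to the rank score: log-odds of ki
    occurring in relevant docs, plus log-odds of ki being absent
    from nonrelevant docs (the two log terms on the slide)."""
    return (math.log(p_ki_R / (1 - p_ki_R))
            + math.log((1 - p_ki_NR) / p_ki_NR))

def sim(doc_terms, query_terms, p_R, p_NR):
    """Binary weights: a term contributes only if it appears in
    both the document and the query."""
    return sum(term_relevance_weight(p_R[t], p_NR[t])
               for t in query_terms & doc_terms)

# hypothetical probabilities for two query terms
p_R = {"migration": 0.7, "visa": 0.5}
p_NR = {"migration": 0.3, "visa": 0.1}
score = sim({"migration", "visa", "geese"}, {"migration", "visa"}, p_R, p_NR)
```

A term that is equally likely in relevant and nonrelevant docs (like "visa" with p_R = 0.5 here) contributes only through its rarity in the nonrelevant set.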
OK now what?
• Work with keywords with binary weights (0 or 1)
• Query is a set of keywords
• Doc is a set of keywords
• Need P(ki|R), the probability that a keyword occurs in one of the relevant docs
Getting Started
1. Assume P(ki|R) is constant over all ki
   = 0.5 (even odds) for any given doc
   We are looking for terms that do not fit this assumption!
2. Assume P(ki|NR) = ni / N
   i.e. based on the distribution of terms over the whole collection (ni docs contain ki, out of N docs total)
Finding P(ki)
1. First, retrieve an initial set of docs and take the retrieved set V as the guess for R
   Vi is the subset of V containing keyword ki
   We need to improve our guesses for P(ki|R) and P(ki|NR)
2. So, use the distribution of ki in the docs in V:
   P(ki|R) = Vi / V
3. Assume that docs not retrieved are not relevant:
   P(ki|NR) = (ni − Vi) / (N − V)
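The estimation steps above can be sketched as follows, with docs represented as plain sets of terms. Note that in practice these raw ratios are usually smoothed (e.g. adding 0.5 to the numerators) so that no probability is exactly 0 or 1, which would break the log-odds formula; the sketch keeps the slide's formulas as written.

```python
# n[t] = number of docs in the collection containing term t; N docs total.
# V = list of docs retrieved in the first round (assumed relevant);
# each doc is a set of terms.

def initial_estimates(query_terms, n, N):
    p_R = {t: 0.5 for t in query_terms}        # step 1: even odds
    p_NR = {t: n[t] / N for t in query_terms}  # step 2: overall distribution
    return p_R, p_NR

def reestimate(query_terms, V, n, N):
    # refine both estimates from the retrieved set V
    V_size = len(V)
    p_R, p_NR = {}, {}
    for t in query_terms:
        V_t = sum(1 for doc in V if t in doc)   # Vi: retrieved docs with t
        p_R[t] = V_t / V_size                   # P(ki|R)  = Vi / V
        p_NR[t] = (n[t] - V_t) / (N - V_size)   # P(ki|NR) = (ni - Vi) / (N - V)
    return p_R, p_NR
```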
Now
• Use the new probabilities to rerank the docs
• And try again
• This can be done without human judgement, BUT it helps to get real feedback at step 1
Good and Bad News
• Advantages
– Ranking scheme
• Disadvantages
– Making the initial guess to get Vi
– Binary weights
– Independence of terms
– Computation
Relevance Feedback
Relevance Feedback
• Problem
– Queries average just 2.2 terms and have no (explicit) structure
• Example (relevance feedback)
• Manual
– Add terms
– Remove terms
– Adjust the weights if possible
– Add/remove operators
What can we do automatically?
• Change the query based on documents retrieved
• Change the query based on user preferences
• Change the query based on user history
• Change the query based on a community of users
Hypothesis
• A better query can be discovered by analyzing the features of relevant and nonrelevant items
Feedback and VSM
• Q0 = (q1, q2, … qt), where qi is the weight of query term ki
• Q0 generates hit list H0
• Q' = (q1', q2', … qt'), where qi' is the altered weight of query term ki
• Add a term to the query by increasing its weight to w > 0
• Drop a term by decreasing its weight to w = 0
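As a small illustration (term names invented), representing the query as a term-to-weight map makes adding and dropping terms purely a matter of weight changes:

```python
# Query as a term -> weight mapping; altering weights adds/drops terms.
Q0 = {"migration": 1.0, "australia": 1.0}

Q1 = dict(Q0)           # altered query Q'
Q1["visa"] = 0.8        # add a term: give it weight w > 0
Q1["australia"] = 0.0   # drop a term: set its weight to 0

# only terms with positive weight take part in matching
active = {t for t, w in Q1.items() if w > 0}
```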
VSM View
• Move the query vector, in the t-dimensional term space, from an area of lower density to an area of higher density of close (relevant) documents
Optimal Query and VSM
• Given
  Sim(Dj,Q) = Σi dij * qi
• The optimal query is then (Dj is the term vector of doc j):

  Qopt = (1/|R|) Σ{Dj ∈ R} Dj/|Dj| − (1/(N−|R|)) Σ{Dj ∉ R} Dj/|Dj|

• |Dj| is the Euclidean vector length
Feedback from relevant Docs retrieved
• Keep the original query
• Replace the sums over all docs with sums over the known relevant (R') and known nonrelevant (NR') docs retrieved:

  Q1 = α Q0 + β Σ{Dj ∈ R'} Dj/|Dj| − γ Σ{Dj ∈ NR'} Dj/|Dj|

  Qi+1 = α Qi + β Σ{Dj ∈ R'} Dj/|Dj| − γ Σ{Dj ∈ NR'} Dj/|Dj|
Example
• Q' = αQ + βR − γNR
• Q0 = (5, 0, 3, 0, 1)
• Relevant: D1 = (2, 1, 2, 0, 0)
• Nonrelevant: D2 = (1, 0, 0, 0, 2)
• α = 1, β = .5, γ = .25
• Q1 = (5,0,3,0,1) + .5(2,1,2,0,0) − .25(1,0,0,0,2)
     = (5.75, .5, 4, 0, .5)
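The arithmetic above can be checked with a few lines of Python:

```python
# Rocchio update Q' = a*Q + b*R - c*NR, componentwise over term vectors.
alpha, beta, gamma = 1.0, 0.5, 0.25
Q0 = [5, 0, 3, 0, 1]
D1 = [2, 1, 2, 0, 0]   # relevant doc
D2 = [1, 0, 0, 0, 2]   # nonrelevant doc

Q1 = [alpha * q + beta * r - gamma * nr
      for q, r, nr in zip(Q0, D1, D2)]
print(Q1)  # [5.75, 0.5, 4.0, 0.0, 0.5]
```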
Variations
• Don't normalize by the number of judged docs
• Use only the highest-ranked nonrelevant docs
– Effective with few judged docs
• Rocchio: choose β and γ = 1 when there are many judged docs
• Expanding by all terms is effective; expanding by only the most highly weighted terms is not!
Relevance Feedback for Boolean
• Examine terms in relevant docs
• Discover conjuncts (t1 AND t2)
– Phrase detection
– Persistent co-occurrences (box car)
• Discover disjuncts (t1 OR t3)
– Thesaurus
– Occasional co-occurrences (auto car)
– Terms that co-occur with common friends (auto & car, car & sedan)
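One way to surface such co-occurrences is simply to count term pairs across the judged-relevant docs; the documents and terms below are invented:

```python
from collections import Counter
from itertools import combinations

# Judged-relevant docs as sets of terms (hypothetical data).
relevant_docs = [
    {"auto", "car", "engine"},
    {"car", "sedan", "engine"},
    {"auto", "car", "sedan"},
]

# Count every unordered term pair within each relevant doc.
pair_counts = Counter()
for doc in relevant_docs:
    for pair in combinations(sorted(doc), 2):
        pair_counts[pair] += 1

# Pairs seen in many relevant docs are candidates for conjuncts
# (persistent co-occurrence) or thesaurus-style disjuncts.
frequent_pairs = {p for p, c in pair_counts.items() if c >= 2}
```

Persistent pairs suggest AND-conjuncts or phrases; terms that share frequent partners (auto and sedan both pairing with car) suggest OR-disjuncts.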
Relevance Feedback Summary
• Can be very effective
• Need a reasonable number of judged docs
– Results are unpredictable with < 5 judged docs
• Can be used with both VSM and Boolean
• Requires either direct input from users or monitoring of user behaviour (reading time, printing, saving, etc.)