Document ranking
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
The big fight: find the best ranking...
Ranking: Google vs Google.cn
Document ranking
Text-based Ranking (1st generation)
Reading 6.2 and 6.3
Similarity between binary vectors
Documents are binary vectors X, Y in {0,1}^D
Score: overlap measure
term | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony | 1 | 1 | 0 | 0 | 0 | 1
Brutus | 1 | 1 | 0 | 1 | 0 | 0
Caesar | 1 | 1 | 0 | 1 | 1 | 1
Calpurnia | 0 | 1 | 0 | 0 | 0 | 0
Cleopatra | 1 | 0 | 0 | 0 | 0 | 0
mercy | 1 | 0 | 1 | 1 | 1 | 1
worser | 1 | 0 | 1 | 1 | 1 | 0
Overlap(X, Y) = |X ∩ Y|. What's wrong?
Normalization
Dice coefficient (wrt avg #terms): 2|X ∩ Y| / (|X| + |Y|). Its associated distance does NOT satisfy the triangle inequality.
Jaccard coefficient (wrt possible terms): |X ∩ Y| / |X ∪ Y|. Its associated distance (1 − Jaccard) DOES satisfy the triangle inequality.
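A minimal Python sketch of the three set-based scores just introduced, on two hypothetical toy documents represented as term sets:

```python
# Minimal sketch: set-based similarity between two documents,
# each represented as the set of terms it contains (binary vector).

def overlap(x: set, y: set) -> int:
    return len(x & y)                            # raw overlap |X ∩ Y|

def dice(x: set, y: set) -> float:
    return 2 * len(x & y) / (len(x) + len(y))    # 2|X ∩ Y| / (|X| + |Y|)

def jaccard(x: set, y: set) -> float:
    return len(x & y) / len(x | y)               # |X ∩ Y| / |X ∪ Y|

# Hypothetical toy documents
d1 = {"antony", "brutus", "caesar", "cleopatra"}
d2 = {"antony", "brutus", "caesar", "calpurnia"}
print(overlap(d1, d2), dice(d1, d2), jaccard(d1, d2))   # 3, 0.75, 0.6
```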
What's wrong in doc-similarity?
Overlap matching doesn't consider:
Term frequency in a document: if a doc talks more about t, then t should be weighted more.
Term scarcity in the collection: "of" is far commoner than "baby bed", so rare terms should weigh more.
Length of documents: scores should be normalized by document length.
A famous “weight”: tf-idf
$$w_{t,d} = tf_{t,d} \times \log\frac{n}{n_t}$$
where tf_{t,d} = number of occurrences of term t in doc d, n_t = #docs containing term t, and n = #docs in the indexed collection. The factor idf_t = log(n / n_t) is the inverse document frequency of t.
term | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth
Antony | 13.1 | 11.4 | 0.0 | 0.0 | 0.0 | 0.0
Brutus | 3.0 | 8.3 | 0.0 | 1.0 | 0.0 | 0.0
Caesar | 2.3 | 2.3 | 0.0 | 0.5 | 0.3 | 0.3
Calpurnia | 0.0 | 11.2 | 0.0 | 0.0 | 0.0 | 0.0
Cleopatra | 17.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
mercy | 0.5 | 0.0 | 0.7 | 0.9 | 0.9 | 0.3
worser | 1.2 | 0.0 | 0.6 | 0.6 | 0.6 | 0.0
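A minimal sketch of the tf-idf weighting above on a toy corpus (the documents and variable names here are illustrative, not from the slides):

```python
import math
from collections import Counter

# Toy corpus: each document is a list of terms (illustrative only).
docs = [["antony", "brutus", "caesar", "caesar"],
        ["brutus", "caesar", "calpurnia"],
        ["mercy", "worser", "caesar"]]

n = len(docs)                                  # #docs in the collection
df = Counter(t for d in docs for t in set(d))  # n_t = #docs containing t

def tf_idf(term: str, doc: list) -> float:
    tf = doc.count(term)                       # occurrences of term in doc
    return tf * math.log(n / df[term]) if tf else 0.0

print(tf_idf("caesar", docs[0]))   # 2 * log(3/3) = 0.0 (term appears in every doc)
print(tf_idf("antony", docs[0]))   # 1 * log(3/1) ≈ 1.10
```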
Vector Space model
Why distance is a bad idea
Sec. 6.3
A graphical example
Postulate: Documents that are "close together" in the vector space talk about the same things. But Euclidean distance is sensitive to vector length!
[Figure: documents d1-d5 plotted in a space of terms t1, t2, t3]
cos(v, w) = (v · w) / (||v|| · ||w||)
The user query is a very short doc
Easy to Spam
Sophisticated algorithms are needed to find the top-k docs for a query Q
cosine(query,document)
$$\cos(q,d) = \frac{q \cdot d}{\|q\|\,\|d\|} = \frac{\sum_{i=1}^{D} q_i\, d_i}{\sqrt{\sum_{i=1}^{D} q_i^2}\,\sqrt{\sum_{i=1}^{D} d_i^2}}$$
The numerator is the dot product of q and d; the denominator is the product of their Euclidean lengths.
q_i is the tf-idf weight of term i in the query; d_i is the tf-idf weight of term i in the document.
cos(q,d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.
Sec. 6.3
Cos for length-normalized vectors
For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):
for q, d length-normalized.
$$\cos(q,d) = q \cdot d = \sum_{i=1}^{D} q_i\, d_i$$
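A minimal Python sketch of the cosine score, assuming dense tf-idf vectors given as plain lists (the example vectors are illustrative):

```python
import math

def cosine(q: list, d: list) -> float:
    """cos(q, d) = (q · d) / (||q|| * ||d||)."""
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0

# If the vectors are already length-normalized, the cosine is just the dot product.
q = [0.0, 0.7, 0.7]
d = [0.6, 0.8, 0.0]
print(cosine(q, d))   # ≈ 0.57
```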
Cosine similarity among 3 docs
term | SaS | PaP | WH
affection | 115 | 58 | 20
jealous | 10 | 7 | 11
gossip | 2 | 0 | 6
wuthering | 0 | 0 | 38
How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?
Term frequencies (counts)
Note: To simplify this example, we don’t do idf weighting.
$$w_{t,d} = \begin{cases} 1 + \log_{10} tf_{t,d} & \text{if } tf_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}$$
3 documents example contd.
Log frequency weighting
term | SaS | PaP | WH
affection | 3.06 | 2.76 | 2.30
jealous | 2.00 | 1.85 | 2.04
gossip | 1.30 | 0 | 1.78
wuthering | 0 | 0 | 2.58
After length normalization
term | SaS | PaP | WH
affection | 0.789 | 0.832 | 0.524
jealous | 0.515 | 0.555 | 0.465
gossip | 0.335 | 0 | 0.405
wuthering | 0 | 0 | 0.588
cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
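A minimal Python sketch that reproduces the numbers above (log-frequency weighting, length normalization, cosine):

```python
import math

counts = {  # raw term frequencies from the table above
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2, "wuthering": 0},
    "PaP": {"affection": 58,  "jealous": 7,  "gossip": 0, "wuthering": 0},
    "WH":  {"affection": 20,  "jealous": 11, "gossip": 6, "wuthering": 38},
}

def log_weight(tf):                    # w = 1 + log10(tf) if tf > 0, else 0
    return 1 + math.log10(tf) if tf > 0 else 0.0

def normalize(vec):                    # divide each weight by the Euclidean length
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}

vecs = {doc: normalize({t: log_weight(tf) for t, tf in tfs.items()})
        for doc, tfs in counts.items()}

def cos(a, b):                         # dot product of the normalized vectors
    return sum(vecs[a][t] * vecs[b][t] for t in vecs[a])

print(round(cos("SaS", "PaP"), 2))     # 0.94
print(round(cos("SaS", "WH"), 2))      # 0.79
print(round(cos("PaP", "WH"), 2))      # 0.69
```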
Storage
For every term t, we store its idf in memory in terms of n_t, which is simply the length of its posting list (and is needed anyway).
For every docID d in the posting list of term t, we store its frequency tf_{t,d}, which is typically small and thus can be stored with unary/gamma codes.
Sec. 7.1.2
$$w_{t,d} = tf_{t,d} \times \log(n / n_t)$$
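A minimal sketch of Elias gamma coding, one way such small tf values can be stored compactly (illustrative; the slides only name the code family):

```python
def gamma_encode(x: int) -> str:
    """Elias gamma code: (len-1) zeros, then the binary representation of x."""
    assert x >= 1
    b = bin(x)[2:]                       # binary representation, e.g. 5 -> '101'
    return '0' * (len(b) - 1) + b        # small values get very short codes

print([gamma_encode(tf) for tf in (1, 2, 5)])   # ['1', '010', '00101']
```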
Vector spaces and other operators
Vector space is OK for bag-of-words queries.
Clean metaphor for similar-document queries.
Not a good combination with operators: Boolean, wild-card, positional, proximity.
First generation of search engines: invented before "spamming" of web search.
Document ranking
Top-k retrieval
Reading 7
Speed-up top-k retrieval
The costly part is computing the cosine scores.
Find a set A of contenders, with K < |A| << N.
Set A does not necessarily contain the top K, but has many docs from among the top K.
Return the top K docs in A, according to the score.
The same approach is also used for other (non-cosine) scoring functions
Will look at several schemes following this approach
Sec. 7.1.1
How to select A’s docs
Consider docs containing at least one query term. Hence this means…
Take this further:
1. Only consider high-idf query terms
2. Champion lists: top scores
3. Only consider docs containing many query terms
4. Fancy hits: for complex ranking functions
5. Clustering
Sec. 7.1.2
Approach #1: High-idf query terms only
For a query such as "catcher in the rye", only accumulate scores from "catcher" and "rye".
Intuition: "in" and "the" contribute little to the scores and so don't alter the rank-ordering much.
Benefit: the postings of low-idf terms contain many docs, and these (many) docs get eliminated from the set A of contenders.
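A minimal sketch of this filter, assuming precomputed idf values and posting lists (all data here is hypothetical):

```python
# Keep only query terms whose idf exceeds a threshold, then union their postings.
idf = {"catcher": 4.2, "in": 0.1, "the": 0.05, "rye": 5.0}          # hypothetical
postings = {"catcher": [3, 8, 21], "in": list(range(1000)),
            "the": list(range(1000)), "rye": [8, 21, 90]}           # hypothetical

IDF_THRESHOLD = 1.0
high_idf_terms = [t for t in ["catcher", "in", "the", "rye"] if idf[t] > IDF_THRESHOLD]
contenders = set().union(*(postings[t] for t in high_idf_terms))
print(high_idf_terms, sorted(contenders))   # ['catcher', 'rye'] [3, 8, 21, 90]
```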
Sec. 7.1.2
Approach #2: Champion Lists
Preprocess: assign to each term its m best documents (its champion list).
Search: if Q consists of q terms, merge their champion lists (≤ mq candidates), compute the cosine between Q and these docs, and choose the top k.
Empirically, m must be picked larger than k for this to work well.
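A minimal sketch of champion lists, assuming a scores[t][d] table of tf-idf weights and a generic scoring function (both hypothetical here):

```python
import heapq

def build_champion_lists(scores: dict, m: int) -> dict:
    """For each term t, keep the m docs with the highest tf-idf weight."""
    return {t: [d for d, _ in heapq.nlargest(m, docs.items(), key=lambda x: x[1])]
            for t, docs in scores.items()}

def query_with_champions(query_terms, champions, score_fn, k):
    """Union the champion lists of the query terms, score only those docs."""
    candidates = set()
    for t in query_terms:
        candidates.update(champions.get(t, []))
    return heapq.nlargest(k, candidates, key=score_fn)

# scores[t][d]: tf-idf weight of term t in doc d (hypothetical numbers)
scores = {"brutus": {1: 3.0, 2: 8.3, 4: 1.0}, "caesar": {1: 2.3, 2: 2.3, 4: 0.5}}
champ = build_champion_lists(scores, m=2)
top = query_with_champions(["brutus", "caesar"], champ,
                           score_fn=lambda d: sum(scores[t].get(d, 0) for t in scores),
                           k=2)
print(champ, top)   # doc 2 ranks first, then doc 1
```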
Nowadays search engines use tf-idf PLUS PageRank (PLUS other weights).
Approach #3: Docs containing many query terms
For multi-term queries, compute scores only for docs containing several of the query terms, say at least 3 out of 4.
This imposes a "soft conjunction" on queries, as seen on (early Google) web search engines.
Easy to implement in postings traversal
3 of 4 query terms
Antony:    3  4  8  16  32  64  128
Brutus:    2  4  8  16  32  64  128
Caesar:    1  2  3  5  8  13  21  34
Calpurnia: 13 16 32
Scores only computed for docs 8, 16 and 32.
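A minimal Python sketch of the "at least 3 of 4" filter over the posting lists above:

```python
from collections import Counter

postings = {               # the posting lists from the example above
    "antony":    [3, 4, 8, 16, 32, 64, 128],
    "brutus":    [2, 4, 8, 16, 32, 64, 128],
    "caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
    "calpurnia": [13, 16, 32],
}

# Count, for each doc, how many query terms it contains; keep docs with >= 3.
hits = Counter(doc for plist in postings.values() for doc in plist)
contenders = sorted(doc for doc, c in hits.items() if c >= 3)
print(contenders)   # [8, 16, 32] -- only these get their full score computed
```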
Sec. 7.1.2
Complex scores
Consider a simple total score combining cosine relevance and authority
net-score(q,d) = PR(d) + cosine(q,d)
One can use some other linear combination than an equal weighting.
Now we seek the top K docs by net score
Sec. 7.1.4
Approach #4: Fancy-hits heuristic
Preprocess:
Assign docIDs by decreasing PR weight.
Define FH(t) = the m docs for t with the highest tf-idf weight.
Define IL(t) = the rest (i.e., increasing docID = decreasing PR weight).
Idea: a document that scores high should be in FH or near the front of IL.
Search for a t-term query:
First FH: take the docs common to the FH lists of the query terms, compute their scores, and keep the top-k docs.
Then IL: scan the ILs and check the common docs; compute their scores and possibly insert them into the top-k. Stop when M docs have been checked or the PR score becomes smaller than some threshold.
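A minimal sketch of the two-phase search, assuming docIDs already assigned in decreasing-PR order and precomputed FH/IL lists; the PR-threshold stop is omitted and all data is hypothetical:

```python
import heapq

def fancy_hits_search(query_terms, FH, IL, score_fn, k, M):
    """FH[t], IL[t]: docID lists for term t (increasing docID = decreasing PageRank).
    score_fn(d): the full score of doc d, e.g. PR(d) + cosine(q, d)."""
    # Phase 1: docs appearing in the fancy-hit list of every query term.
    candidates = set.intersection(*(set(FH[t]) for t in query_terms))
    # Phase 2: docs appearing in every IL, scanned in decreasing-PR order,
    # stopping after at most M extra docs have been checked.
    common_il = sorted(set.intersection(*(set(IL[t]) for t in query_terms)))
    candidates.update(common_il[:M])
    # Return the k candidates with the highest score.
    return heapq.nlargest(k, candidates, key=score_fn)

# Hypothetical usage:
FH = {"brutus": [1, 4, 7], "caesar": [1, 5, 7]}
IL = {"brutus": [2, 9, 12], "caesar": [2, 8, 9]}
print(fancy_hits_search(["brutus", "caesar"], FH, IL,
                        score_fn=lambda d: 1.0 / d,   # toy score: smaller docID = higher PR
                        k=2, M=10))                   # -> [1, 2]
```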
Approach #5: Clustering
[Figure: a query, the leaders, and their followers in the vector space]
Sec. 7.1.6
Cluster pruning: preprocessing
Pick √N docs at random: call these the leaders.
For every other doc, pre-compute its nearest leader. Docs attached to a leader are its followers.
Likely: each leader has ~√N followers.
Sec. 7.1.6
Cluster pruning: query processing
Process a query as follows:
Given query Q, find its nearest leader L.
Seek K nearest docs from among L’s followers.
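A minimal numpy sketch of cluster pruning, assuming length-normalized document vectors in a matrix (all data here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 5))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)       # length-normalize

# Preprocessing: pick ~sqrt(N) random leaders, attach every doc to its nearest leader.
N = len(docs)
leaders = rng.choice(N, size=int(np.sqrt(N)), replace=False)
nearest_leader = (docs @ docs[leaders].T).argmax(axis=1)   # cosine = dot product here

def query_top_k(q, k):
    q = q / np.linalg.norm(q)
    L = (docs[leaders] @ q).argmax()                 # nearest leader to the query
    followers = np.where(nearest_leader == L)[0]     # its followers
    scores = docs[followers] @ q
    return followers[np.argsort(-scores)[:k]]        # top-k among the followers only

print(query_top_k(rng.normal(size=5), k=3))
```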
Sec. 7.1.6
Why use random sampling
Fast.
Leaders reflect the data distribution.
Sec. 7.1.6
General variants
Have each follower attached to b1=3 (say) nearest leaders.
From query, find b2=4 (say) nearest leaders and their followers.
Can recur on leader/follower construction.
Sec. 7.1.6
Document ranking
Relevance feedback
Reading 9
Relevance Feedback
Relevance feedback: user feedback on relevance of docs in initial set of results
The user issues a (short, simple) query.
The user marks some results as relevant or non-relevant.
The system computes a better representation of the information need based on the feedback.
Relevance feedback can go through one or more iterations.
Sec. 9.1
Rocchio (SMART)
Used in practice:
Dr = set of known relevant doc vectors; Dnr = set of known non-relevant doc vectors.
qm = modified query vector; q0 = original query vector; α, β, γ: weights (hand-chosen or set empirically).
New query moves toward relevant documents and away from irrelevant documents
$$q_m = \alpha\, q_0 + \frac{\beta}{|D_r|} \sum_{d_j \in D_r} d_j - \frac{\gamma}{|D_{nr}|} \sum_{d_j \in D_{nr}} d_j$$
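A minimal numpy sketch of the Rocchio update above (the weights and vectors are illustrative):

```python
import numpy as np

def rocchio(q0, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q0 + beta*mean(relevant docs) - gamma*mean(non-relevant docs)."""
    qm = alpha * q0
    if len(relevant):
        qm = qm + beta * np.mean(relevant, axis=0)
    if len(non_relevant):
        qm = qm - gamma * np.mean(non_relevant, axis=0)
    return np.maximum(qm, 0)        # negative term weights are usually clipped to 0

# Hypothetical tf-idf vectors over a 4-term vocabulary
q0  = np.array([1.0, 0.0, 1.0, 0.0])
Dr  = np.array([[0.9, 0.1, 0.8, 0.0], [0.7, 0.0, 0.9, 0.1]])   # known relevant
Dnr = np.array([[0.0, 0.9, 0.1, 0.8]])                          # known non-relevant
print(rocchio(q0, Dr, Dnr))
```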
Sec. 9.1.1
Relevance Feedback: Problems
Users are often reluctant to provide explicit feedback
It’s often harder to understand why a particular document was retrieved after applying relevance feedback
There is no clear evidence that relevance feedback is the “best use” of the user’s time.
Pseudo relevance feedback
Pseudo-relevance feedback automates the “manual” part of true relevance feedback.
Retrieve a list of hits for the user's query.
Assume that the top k are relevant.
Do relevance feedback (e.g., Rocchio).
Works very well on average, but can go horribly wrong for some queries. Several iterations can cause query drift.
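A minimal sketch of one pseudo-relevance-feedback round, reusing a Rocchio-style update (the corpus matrix and parameters here are illustrative):

```python
import numpy as np

def prf(q0, doc_matrix, k=3, alpha=1.0, beta=0.75):
    """Rank docs, assume the top k are relevant, expand the query, rank again."""
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = docs @ (q0 / np.linalg.norm(q0))
    top_k = np.argsort(-scores)[:k]                            # assumed-relevant docs
    qm = alpha * q0 + beta * doc_matrix[top_k].mean(axis=0)    # Rocchio with gamma = 0
    return np.argsort(-(docs @ (qm / np.linalg.norm(qm))))     # re-ranked docIDs

rng = np.random.default_rng(1)
corpus = np.abs(rng.normal(size=(20, 6)))      # hypothetical tf-idf matrix
query = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
print(prf(query, corpus)[:5])                  # top 5 docIDs after one PRF round
```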
Sec. 9.1.6
Query Expansion
In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents
In query expansion, users give additional input (good/bad search term) on words or phrases
Sec. 9.2.2
How to augment the user query?
Manual thesaurus (costly to generate): e.g., MedLine: physician, syn: doc, doctor, MD.
Global analysis (static; all docs in collection): automatically derived thesaurus (co-occurrence statistics); refinements based on query-log mining (common on the web).
Local analysis (dynamic): analysis of the documents in the result set.
Sec. 9.2.2
Query assist
Would you expect such a feature to increase the query volume at a search engine?