
Page 1

Document ranking

Paolo Ferragina
Dipartimento di Informatica

Università di Pisa

Page 2

The big fight: find the best ranking...

Page 3

Ranking: Google vs Google.cn

Page 4

Document ranking

Text-based Ranking (1st generation)

Reading 6.2 and 6.3

Page 5

Similarity between binary vectors

Documents are binary vectors X, Y in $\{0,1\}^D$

Score: overlap measure

Term        Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony             1                1             0          0       0        1
Brutus             1                1             0          1       0        0
Caesar             1                1             0          1       1        1
Calpurnia          0                1             0          0       0        0
Cleopatra          1                0             0          0       0        0
mercy              1                0             1          1       1        1
worser             1                0             1          1       1        0

Overlap(X, Y) = $|X \cap Y|$

What's wrong?

Page 6

Normalization

Dice coefficient (wrt avg #terms): $2|X \cap Y| \,/\, (|X| + |Y|)$ → NO, not triangular

Jaccard coefficient (wrt possible terms): $|X \cap Y| \,/\, |X \cup Y|$ → OK, triangular (1 − Jaccard is a distance)
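As a concrete illustration, here is a minimal Python sketch (not part of the original slides); the two toy documents are the term sets of Antony and Cleopatra and Julius Caesar from the table above:

    # Overlap, Dice, and Jaccard on documents represented as term sets
    # (equivalent to binary vectors in {0,1}^D).

    def overlap(x: set, y: set) -> int:
        return len(x & y)                            # |X ∩ Y|

    def dice(x: set, y: set) -> float:
        return 2 * len(x & y) / (len(x) + len(y))    # 2|X ∩ Y| / (|X| + |Y|)

    def jaccard(x: set, y: set) -> float:
        return len(x & y) / len(x | y)               # |X ∩ Y| / |X ∪ Y|

    antony_cleo = {"antony", "brutus", "caesar", "cleopatra", "mercy", "worser"}
    julius      = {"antony", "brutus", "caesar", "calpurnia"}
    print(overlap(antony_cleo, julius))   # 3
    print(dice(antony_cleo, julius))      # 0.6
    print(jaccard(antony_cleo, julius))   # 0.428...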

Page 7

What’s wrong in doc-similarity ?

Overlap matching doesn't consider:

Term frequency in a document: if a doc talks more about t, then t should be weighted more

Term scarcity in the collection: "of" is much commoner than "baby bed"

Length of documents: scores should be normalized by document length

Page 8

A famous “weight”: tf-idf

$$w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{n}{n_t}$$

where $\mathrm{tf}_{t,d}$ = number of occurrences of term t in doc d, $n_t$ = #docs containing term t, and $n$ = #docs in the indexed collection; the second factor is $\mathrm{idf}_t = \log(n / n_t)$.

Term        Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony           13.1             11.4           0.0        0.0     0.0      0.0
Brutus            3.0              8.3           0.0        1.0     0.0      0.0
Caesar            2.3              2.3           0.0        0.5     0.3      0.3
Calpurnia         0.0             11.2           0.0        0.0     0.0      0.0
Cleopatra        17.7              0.0           0.0        0.0     0.0      0.0
mercy             0.5              0.0           0.7        0.9     0.9      0.3
worser            1.2              0.0           0.6        0.6     0.6      0.0
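A minimal sketch of computing such a weight (not from the slides; the base-10 logarithm is our arbitrary choice, the slide leaves the base unspecified, and the counts below are made up for illustration):

    import math

    def tf_idf(tf_td: int, n_t: int, n: int) -> float:
        """w_{t,d} = tf_{t,d} * log(n / n_t).
        tf_td: occurrences of t in d; n_t: #docs containing t; n: #docs."""
        return tf_td * math.log10(n / n_t)

    # A term frequent in the doc but rare in the collection scores high;
    # a term occurring in every doc gets idf = log(1) = 0, hence weight 0.
    print(tf_idf(tf_td=157, n_t=2, n=1000))     # high weight
    print(tf_idf(tf_td=157, n_t=1000, n=1000))  # 0.0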

Vector Space model

Page 9

Why distance is a bad idea

Sec. 6.3

Page 10

A graphical example

Postulate: Documents that are "close together" in the vector space talk about the same things. But Euclidean distance is sensitive to vector length!

[Figure: docs d1, d2, d3, d4, d5 as vectors in the space spanned by terms t1, t2, t3]

$\cos(\theta) = \vec{v} \cdot \vec{w} \,/\, (\|\vec{v}\| \, \|\vec{w}\|)$

The user query is a very short doc

Easy to Spam

Sophisticated algorithms to find the top-k docs for a query Q

Page 11

cosine(query,document)

$$\cos(q, d) = \frac{\vec{q} \cdot \vec{d}}{\|\vec{q}\| \, \|\vec{d}\|} = \frac{\sum_{i=1}^{D} q_i \, d_i}{\sqrt{\sum_{i=1}^{D} q_i^2} \; \sqrt{\sum_{i=1}^{D} d_i^2}}$$

Dot product

$q_i$ is the tf-idf weight of term i in the query; $d_i$ is the tf-idf weight of term i in the document

cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.

Sec. 6.3
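A direct transcription of the formula (a sketch, not part of the slides), on dense tf-idf vectors:

    import math

    def cosine(q: list, d: list) -> float:
        dot = sum(qi * di for qi, di in zip(q, d))
        nq = math.sqrt(sum(qi * qi for qi in q))   # ||q||
        nd = math.sqrt(sum(di * di for di in d))   # ||d||
        return dot / (nq * nd) if nq and nd else 0.0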

Page 12

Cos for length-normalized vectors

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

$$\cos(q, d) = \vec{q} \cdot \vec{d} = \sum_{i=1}^{D} q_i \, d_i$$

for q, d length-normalized.

Page 13

Cosine similarity among 3 docs

term       SaS  PaP  WH
affection  115   58  20
jealous     10    7  11
gossip       2    0   6
wuthering    0    0  38

How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts)

Note: To simplify this example, we don’t do idf weighting.

$$w_{t,d} = \begin{cases} 1 + \log_{10} \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}$$

Page 14

3 documents example contd.

Log frequency weighting

term       SaS   PaP   WH
affection  3.06  2.76  2.30
jealous    2.00  1.85  2.04
gossip     1.30  0     1.78
wuthering  0     0     2.58

After length normalization

term       SaS    PaP    WH
affection  0.789  0.832  0.524
jealous    0.515  0.555  0.465
gossip     0.335  0      0.405
wuthering  0      0      0.588

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94

cos(SaS,WH) ≈ 0.79

cos(PaP,WH) ≈ 0.69
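These numbers can be reproduced end to end; a minimal sketch (not in the slides) that log-weights the raw counts, length-normalizes, and takes dot products:

    import math

    counts = {  # term: (SaS, PaP, WH) raw frequencies from the table above
        "affection": (115, 58, 20),
        "jealous":   (10, 7, 11),
        "gossip":    (2, 0, 6),
        "wuthering": (0, 0, 38),
    }

    def log_tf(tf: int) -> float:
        return 1 + math.log10(tf) if tf > 0 else 0.0

    # Per-document vectors of log-weighted tfs, then length normalization.
    docs = list(zip(*[[log_tf(tf) for tf in row] for row in counts.values()]))
    docs = [[w / math.sqrt(sum(x * x for x in v)) for w in v] for v in docs]
    sas, pap, wh = docs

    dot = lambda a, b: sum(x * y for x, y in zip(a, b))  # = cosine, once normalized
    print(round(dot(sas, pap), 2), round(dot(sas, wh), 2), round(dot(pap, wh), 2))
    # -> 0.94 0.79 0.69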

Page 15

Storage

For every term t, we store its IDF in memory, in terms of nt, which is just the length of t's posting list (needed anyway).

For every docID d in the posting list of term t, we store its frequency tft,d, which is typically small and thus stored with unary/gamma codes.

Sec. 7.1.2

$$w_{t,d} = \mathrm{tf}_{t,d} \cdot \log(n / n_t)$$
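For instance, Elias gamma coding (one common reading of "unary/gamma"; a sketch, not the slides' code) spends few bits on the small tf values that dominate in practice:

    def gamma_encode(x: int) -> str:
        """Elias gamma: (len(bin(x)) - 1) zeros, then bin(x) itself."""
        assert x >= 1
        b = bin(x)[2:]                  # e.g. 5 -> '101'
        return "0" * (len(b) - 1) + b   # 5 -> '00101'

    print([gamma_encode(tf) for tf in (1, 2, 3, 5)])
    # ['1', '010', '011', '00101']   (tf = 1 costs a single bit)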

Page 16

Vector spaces and other operators

Vector space OK for bag-of-words queries

Clean metaphor for similar-document queries

Not a good combination with operators: Boolean, wild-card, positional, proximity

First generation of search engines: invented before the "spamming" of web search

Page 17

Document ranking

Top-k retrieval

Reading 7

Page 18

Speed-up top-k retrieval

The costly step is the computation of the cosine scores. Idea:

Find a set A of contenders, with K < |A| << N

Set A does not necessarily contain the top K, but has many docs from among the top K

Return the top K docs in A, according to the score

The same approach is also used for other (non-cosine) scoring functions

Will look at several schemes following this approach

Sec. 7.1.1

Page 19

How to select A’s docs

Consider docs containing at least one query term. Hence this means…

Take this further:
1. Only consider high-idf query terms
2. Champion lists: top scores
3. Only consider docs containing many query terms
4. Fancy hits: for complex ranking functions
5. Clustering

Sec. 7.1.2

Page 20

Approach #1: High-idf query terms only

For a query such as catcher in the rye, only accumulate scores from catcher and rye.

Intuition: in and the contribute little to the scores and so don’t alter rank-ordering much

Benefit: postings of low-idf terms have many docs, and these (many) docs get eliminated from the set A of contenders.

Sec. 7.1.2

Page 21

Approach #2: Champion Lists

Preprocess: assign to each term its m best documents.

Search: if Q consists of q terms, merge their champion lists (≤ mq candidates), compute the cosine between Q and each of these docs, and choose the top k. Empirically, m must be chosen > k for this to work well.

Nowadays search engines use tf-idf PLUS PageRank (PLUS other weights)

Term        Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony           13.1             11.4           0.0        0.0     0.0      0.0
Brutus            3.0              8.3           0.0        1.0     0.0      0.0
Caesar            2.3              2.3           0.0        0.5     0.3      0.3
Calpurnia         0.0             11.2           0.0        0.0     0.0      0.0
Cleopatra        17.7              0.0           0.0        0.0     0.0      0.0
mercy             0.5              0.0           0.7        0.9     0.9      0.3
worser            1.2              0.0           0.6        0.6     0.6      0.0
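A minimal sketch of the scheme (not from the slides; score stands for the cosine of a doc against Q, possibly blended with PageRank as noted above):

    import heapq

    def build_champions(weights: dict, m: int) -> dict:
        """weights: {term: {docID: tf-idf}} -> {term: set of its m best docIDs}."""
        return {t: set(heapq.nlargest(m, w, key=w.get)) for t, w in weights.items()}

    def champion_top_k(query_terms, champions, score, k):
        # Candidates = union of the champion lists of the q query terms (<= m*q docs).
        candidates = set().union(*(champions.get(t, set()) for t in query_terms))
        return heapq.nlargest(k, candidates, key=score)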

Page 22

Approach #3: Docs containing many query terms

For multi-term queries, compute scores for docs containing several of the query terms

Say, at least 3 out of 4. This imposes a "soft conjunction" on queries, as seen on web search engines (early Google).

Easy to implement in postings traversal

Page 23

3 of 4 query terms

Antony:    3  4  8  16  32  64  128
Brutus:    2  4  8  16  32  64  128
Caesar:    1  2  3  5  8  13  21  34
Calpurnia: 13  16  32

Scores only computed for docs 8, 16 and 32.

Sec. 7.1.2
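A sketch of this filter on the postings above (a real engine would do a multi-way merge of the sorted lists; counting is the simplest way to show the effect):

    from collections import Counter

    postings = {
        "Antony":    [3, 4, 8, 16, 32, 64, 128],
        "Brutus":    [2, 4, 8, 16, 32, 64, 128],
        "Caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
        "Calpurnia": [13, 16, 32],
    }
    hits = Counter(doc for plist in postings.values() for doc in plist)
    print(sorted(doc for doc, c in hits.items() if c >= 3))   # [8, 16, 32]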

Page 24

Complex scores

Consider a simple total score combining cosine relevance and authority

net-score(q,d) = PR(d) + cosine(q,d)

Can use some other linear combination than an equal weighting.

Now we seek the top K docs by net score

Sec. 7.1.4

Page 25

Approach #4: Fancy-hits heuristic

Preprocess:
Assign docIDs by decreasing PR weight
Define FH(t) = the m docs for t with highest tf-idf weight
Define IL(t) = the rest (i.e., increasing docID = decreasing PR weight)
Idea: a document that scores high should be in FH or in the front of IL

Search for a t-term query:
First FH: take the docs common to the FHs of all query terms, compute their scores, and keep the top-k docs.
Then IL: scan the ILs and check the common docs; compute their scores and possibly insert them into the top-k. Stop when M docs have been checked or the PR score becomes smaller than some threshold.
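A rough sketch of the two phases (our reading of the heuristic, not the original algorithm; common_docs_in_docid_order is a hypothetical helper that merges the ILs and yields their common docIDs in increasing order, i.e. by decreasing PR):

    def fancy_hits_search(terms, FH, IL, score, pr, k, M, pr_threshold):
        # Phase 1: docs common to the fancy-hit lists of all query terms.
        top = sorted(set.intersection(*(FH[t] for t in terms)),
                     key=score, reverse=True)[:k]
        # Phase 2: scan the ILs front-to-back and refine the top-k.
        # common_docs_in_docid_order: hypothetical merger of the sorted ILs.
        for checked, d in enumerate(common_docs_in_docid_order(IL[t] for t in terms)):
            if checked >= M or pr(d) < pr_threshold:
                break                                  # early termination
            top = sorted(set(top) | {d}, key=score, reverse=True)[:k]
        return top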

Page 26

Approach #5: Clustering

[Figure: a query routed to its nearest leader; each leader shown with its followers]

Sec. 7.1.6

Page 27

Cluster pruning: preprocessing

Pick √N docs at random: call these leaders

For every other doc, pre-compute its nearest leader. The docs attached to a leader are its followers.

Likely: each leader has ~√N followers.

Sec. 7.1.6

Page 28

Cluster pruning: query processing

Process a query as follows:

Given query Q, find its nearest leader L.

Seek K nearest docs from among L’s followers.

Sec. 7.1.6
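Both phases fit in a few lines; a minimal sketch (not from the slides), with sim any similarity function, e.g. the cosine:

    import math, random

    def preprocess(docs: list, sim):
        leaders = random.sample(docs, round(math.sqrt(len(docs))))  # ~sqrt(N) leaders
        followers = {i: [] for i in range(len(leaders))}
        for d in docs:  # attach every doc to its nearest leader
            i = max(range(len(leaders)), key=lambda j: sim(d, leaders[j]))
            followers[i].append(d)
        return leaders, followers

    def answer(q, leaders, followers, sim, k: int) -> list:
        # Compare q to the leaders only, then rank that leader's followers.
        i = max(range(len(leaders)), key=lambda j: sim(q, leaders[j]))
        return sorted(followers[i], key=lambda d: sim(q, d), reverse=True)[:k]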

Page 29

Why use random sampling

Fast. Leaders reflect the data distribution.

Sec. 7.1.6

Page 30

General variants

Have each follower attached to b1=3 (say) nearest leaders.

From query, find b2=4 (say) nearest leaders and their followers.

Can recur on leader/follower construction.

Sec. 7.1.6

Page 31

Document ranking

Relevance feedback

Reading 9

Page 32

Relevance Feedback

Relevance feedback: user feedback on relevance of docs in initial set of results

User issues a (short, simple) query

The user marks some results as relevant or non-relevant.

The system computes a better representation of the information need, based on the feedback.

Relevance feedback can go through one or more iterations.

Sec. 9.1

Page 33

Rocchio (SMART)

Used in practice:

Dr = set of known relevant doc vectors
Dnr = set of known irrelevant doc vectors
qm = modified query vector; q0 = original query vector
α, β, γ: weights (hand-chosen or set empirically)

New query moves toward relevant documents and away from irrelevant documents

$$\vec{q}_m = \alpha \, \vec{q}_0 + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j \;-\; \frac{\gamma}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$$

Sec. 9.1.1
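A minimal sketch of the update (not from the slides; α = 1, β = 0.75, γ = 0.15 are common textbook defaults, not values the slides prescribe):

    import numpy as np

    def rocchio(q0, Dr, Dnr, alpha=1.0, beta=0.75, gamma=0.15):
        """q_m = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr)."""
        qm = alpha * np.asarray(q0, dtype=float)
        if len(Dr):
            qm += beta * np.mean(Dr, axis=0)
        if len(Dnr):
            qm -= gamma * np.mean(Dnr, axis=0)
        return np.maximum(qm, 0.0)  # negative term weights are usually clipped to 0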

Page 34

Relevance Feedback: Problems

Users are often reluctant to provide explicit feedback

It’s often harder to understand why a particular document was retrieved after applying relevance feedback

There is no clear evidence that relevance feedback is the “best use” of the user’s time.

Page 35

Pseudo relevance feedback

Pseudo-relevance feedback automates the “manual” part of true relevance feedback.

Retrieve a list of hits for the user's query
Assume that the top k are relevant
Do relevance feedback (e.g., Rocchio)

Works very well on average, but can go horribly wrong for some queries. Several iterations can cause query drift.

Sec. 9.1.6
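Reusing the rocchio sketch above, pseudo relevance feedback is one line (again a sketch; k = 10 is an arbitrary choice):

    def pseudo_feedback(q0, ranked_doc_vectors, k=10):
        # Treat the top-k of the initial ranking as relevant, none as irrelevant.
        return rocchio(q0, Dr=ranked_doc_vectors[:k], Dnr=[])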

Page 36

Query Expansion

In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight terms in the documents

In query expansion, users give additional input (good/bad search term) on words or phrases

Sec. 9.2.2

Page 37

How to augment the user query?

Manual thesaurus (costly to generate)
E.g., MedLine: physician, syn: doc, doctor, MD

Global Analysis (static; all docs in the collection)
Automatically derived thesaurus (co-occurrence statistics)
Refinements based on query-log mining: common on the web

Local Analysis (dynamic)
Analysis of documents in the result set

Sec. 9.2.2

Page 38

Query assist

Would you expect such a feature to increase the query volume at a search engine?