Information Retrieval
CSE 8337 (Part III)
Spring 2011
Some material for these slides obtained from:
Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
http://www.sims.berkeley.edu/~hearst/irbook/
Data Mining Introductory and Advanced Topics by Margaret H. Dunham
http://www.engr.smu.edu/~mhd/book
Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze
http://informationretrieval.org
CSE 8337 Spring 2011 2
CSE 8337 Outline
• Introduction
• Text Processing
• Indexes
• Boolean Queries
• Web Searching/Crawling
• Vector Space Model
• Matching
• Evaluation
• Feedback/Expansion
CSE 8337 Spring 2011 3
Modeling TOC (Vector Space and Other Models)
• Introduction
• Classic IR Models
  • Boolean Model
  • Vector Model
  • Probabilistic Model
• Extended Boolean Model
• Vector Space Scoring
• Vector Model and Web Search
CSE 8337 Spring 2011 4
IR Models

[Taxonomy figure:]
• User task:
  • Retrieval: ad hoc, filtering
  • Browsing
• Classic models: Boolean, vector, probabilistic
• Set theoretic: fuzzy, extended Boolean
• Algebraic: generalized vector, latent semantic indexing, neural networks
• Probabilistic: inference network, belief network
• Structured models: non-overlapping lists, proximal nodes
• Browsing: flat, structure guided, hypertext
CSE 8337 Spring 2011 5
The Boolean Model
• Simple model based on set theory
• Queries specified as Boolean expressions: precise semantics and neat formalism
• Terms are either present or absent; thus wij ∈ {0, 1}
• Consider q = ka ∧ (kb ∨ ¬kc)
  qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
  qcc = (1,1,0) is a conjunctive component
CSE 8337 Spring 2011 6
The Boolean Model
q = ka ∧ (kb ∨ ¬kc)

sim(q,dj) = 1 if ∃ qcc | (qcc ∈ qdnf) ∧ (∀ki, gi(dj) = gi(qcc)); 0 otherwise

[Venn diagram: circles Ka, Kb, Kc, with the conjunctive components (1,1,1), (1,1,0), (1,0,0) shaded]
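The matching rule above can be sketched in code; a minimal illustration (the `sim` helper and three-term vocabulary are assumptions for this example, not from the slides):

```python
# q = ka AND (kb OR NOT kc); its DNF has the conjunctive components below.
QDNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def sim(doc_terms, vocab=("ka", "kb", "kc")):
    """1 if the doc's binary term vector equals a conjunctive component of q, else 0."""
    vec = tuple(1 if t in doc_terms else 0 for t in vocab)
    return 1 if vec in QDNF else 0

print(sim({"ka", "kb"}), sim({"kb", "kc"}))  # 1 0
```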
CSE 8337 Spring 2011 7
Drawbacks of the Boolean Model
• Retrieval based on binary decision criteria with no notion of partial matching
• No ranking of the documents is provided
• Information need has to be translated into a Boolean expression
• The Boolean queries formulated by users are most often too simplistic
• As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
CSE 8337 Spring 2011 8
The Vector Model
• Use of binary weights is too limiting
• Non-binary weights provide consideration for partial matches
• These term weights are used to compute a degree of similarity between a query and each document
• Ranked set of documents provides for better matching
CSE 8337 Spring 2011 9
The Vector Model
• wij > 0 whenever ki appears in dj
• wiq ≥ 0 associated with the pair (ki, q)
• dj = (w1j, w2j, ..., wtj); q = (w1q, w2q, ..., wtq)
• To each term ki is associated a unit vector îi
• The unit vectors îi and îj are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
• The t unit vectors îi form an orthonormal basis for a t-dimensional space in which queries and documents are represented as weighted vectors
CSE 8337 Spring 2011 10
The Vector Model
sim(q,dj) = cos(θ) = (dj · q) / (|dj| |q|) = (Σi wij · wiq) / (|dj| |q|)

• Since wij ≥ 0 and wiq ≥ 0, 0 ≤ sim(q,dj) ≤ 1
• A document is retrieved even if it matches the query terms only partially

[Figure: vectors dj and q separated by angle θ in the term space]
CSE 8337 Spring 2011 11
Weights wij and wiq?
• One approach is to examine the frequency of occurrence of a word in a document:
• Absolute frequency: the tf factor, the term frequency within a document; freqi,j is the raw frequency of ki within dj. Both high-frequency and low-frequency terms may not actually be significant.
• Relative frequency: tf divided by the number of words in the document
• Normalized frequency: fi,j = freqi,j / maxl freql,j
CSE 8337 Spring 2011 12
Inverse Document Frequency
• Importance of a term may depend more on how well it distinguishes between documents
• Quantification of inter-document separation
• Dissimilarity, not similarity
• The idf factor: the inverse document frequency
CSE 8337 Spring 2011 13
IDF
• Let N be the total number of docs in the collection and ni the number of docs that contain ki
• The idf factor is computed as idfi = log(N / ni)
• The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
• Example (base-10 logs): N = 1000, n1 = 100, n2 = 500, n3 = 800
  idf1 = 3 − 2 = 1; idf2 = 3 − 2.7 = 0.3; idf3 = 3 − 2.9 = 0.1
CSE 8337 Spring 2011 14
The Vector Model
• The best term-weighting schemes take both into account: wij = fi,j × log(N / ni)
• This strategy is called a tf-idf weighting scheme
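The tf-idf weight just defined can be sketched as a small helper (hypothetical function name; base-10 logs as in the IDF example above):

```python
import math

def tf_idf(freq, max_freq, N, n_i):
    """w_ij = f_ij * log10(N / n_i), with f_ij the max-normalized term frequency."""
    f = freq / max_freq
    return f * math.log10(N / n_i)

# With N = 1000 and n_i = 100 (idf = 1, as in the IDF example), a term occurring
# at half the maximum in-document frequency gets weight 0.5:
print(tf_idf(5, 10, 1000, 100))  # 0.5
```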
CSE 8337 Spring 2011 15
The Vector Model
• For the query term weights, a suggestion is wiq = (0.5 + 0.5 · freqi,q / maxl freql,q) × log(N / ni)
• The vector model with tf-idf weights is a good ranking strategy for general collections
• The vector model is usually as good as any known ranking alternative. It is also simple and fast to compute.
CSE 8337 Spring 2011 16
The Vector Model
Advantages:
• term weighting improves quality of the answer set
• partial matching allows retrieval of docs that approximate the query conditions
• the cosine ranking formula sorts documents according to degree of similarity to the query
Disadvantages:
• assumes independence of index terms; not clear that this is bad, though
CSE 8337 Spring 2011 17
The Vector Model: Example I
(binary weights, q = (1, 1, 1))

doc  k1 k2 k3 | q·dj
d1    1  0  1 |  2
d2    1  0  0 |  1
d3    0  1  1 |  2
d4    1  0  0 |  1
d5    1  1  1 |  3
d6    1  1  0 |  2
d7    0  1  0 |  1

[Figure: documents d1–d7 plotted in the k1, k2, k3 term space]
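The q·dj column of Example I can be recomputed as a quick check (illustrative code, not from the slides):

```python
# Recomputing the q . dj column of Example I (binary weights).
docs = {
    "d1": (1, 0, 1), "d2": (1, 0, 0), "d3": (0, 1, 1), "d4": (1, 0, 0),
    "d5": (1, 1, 1), "d6": (1, 1, 0), "d7": (0, 1, 0),
}
q = (1, 1, 1)
scores = {d: sum(wq * wd for wq, wd in zip(q, v)) for d, v in docs.items()}
print(scores["d5"])  # 3: d5 matches all three terms and ranks first
```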
CSE 8337 Spring 2011 18
The Vector Model: Example II
doc  k1 k2 k3 | q·dj
d1    1  0  1 |  4
d2    1  0  0 |  1
d3    0  1  1 |  5
d4    1  0  0 |  1
d5    1  1  1 |  6
d6    1  1  0 |  3
d7    0  1  0 |  2
q = (1, 2, 3)

[Figure: same documents in the k1, k2, k3 term space]
CSE 8337 Spring 2011 19
The Vector Model: Example III
doc  k1 k2 k3 | q·dj
d1    2  0  1 |  5
d2    1  0  0 |  1
d3    0  1  3 | 11
d4    2  0  0 |  2
d5    1  2  4 | 17
d6    1  2  0 |  5
d7    0  5  0 | 10
q = (1, 2, 3)

[Figure: same documents in the k1, k2, k3 term space]
CSE 8337 Spring 2011 20
Probabilistic Model
• Objective: to capture the IR problem in a probabilistic framework
• Given a user query, there is an ideal answer set
• Querying is a specification of the properties of this ideal answer set (clustering)
• But what are these properties?
• Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set)
• Improve by iteration
CSE 8337 Spring 2011 21
Probabilistic Model
• An initial set of documents is retrieved somehow
• The user inspects these docs looking for the relevant ones (in truth, only the top 10–20 need to be inspected)
• The IR system uses this information to refine the description of the ideal answer set
• By repeating this process, it is expected that the description of the ideal answer set will improve
• Keep in mind the need to guess the description of the ideal answer set at the very beginning
• The description of the ideal answer set is modeled in probabilistic terms
CSE 8337 Spring 2011 22
Probabilistic Ranking Principle
Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). Ideal answer set is referred to as R and should maximize the probability of relevance. Documents in the set R are predicted to be relevant.
But, how to compute probabilities? what is the sample space?
CSE 8337 Spring 2011 23
The Ranking
• Probabilistic ranking computed as: sim(q,dj) = P(dj relevant-to q) / P(dj non-relevant-to q)
• This is the odds of the document dj being relevant
• Taking the odds minimizes the probability of an erroneous judgement
• Definitions: wij ∈ {0, 1}
  P(R | dj): probability that the given doc is relevant
  P(R̄ | dj): probability that the doc is not relevant
CSE 8337 Spring 2011 24
The Ranking
sim(dj,q) = P(R | dj) / P(R̄ | dj)
          = [P(dj | R) · P(R)] / [P(dj | R̄) · P(R̄)]   (Bayes' rule)
          ~ P(dj | R) / P(dj | R̄)

• P(dj | R): probability of randomly selecting the document dj from the set R of relevant documents
CSE 8337 Spring 2011 25
The Ranking
sim(dj,q) ~ P(dj | R) / P(dj | R̄)
          ~ ( [∏ P(ki | R)] · [∏ P(k̄i | R)] ) / ( [∏ P(ki | R̄)] · [∏ P(k̄i | R̄)] )

where the first product in each bracket ranges over the terms present in dj and the second over the terms absent from dj.
• P(ki | R): probability that the index term ki is present in a document randomly selected from the set R of relevant documents
CSE 8337 Spring 2011 26
The Ranking
sim(dj,q) ~ log { ( [∏ P(ki | R)] · [∏ P(k̄i | R)] ) / ( [∏ P(ki | R̄)] · [∏ P(k̄i | R̄)] ) }
          ~ K × Σi [ log( P(ki | R) / P(k̄i | R) ) + log( P(k̄i | R̄) / P(ki | R̄) ) ]

where K is a constant and P(k̄i | R) = 1 − P(ki | R), P(k̄i | R̄) = 1 − P(ki | R̄)
CSE 8337 Spring 2011 27
The Initial Ranking
sim(dj,q) ~ Σi wiq · wij · ( log( P(ki | R) / (1 − P(ki | R)) ) + log( (1 − P(ki | R̄)) / P(ki | R̄) ) )

• Probabilities P(ki | R) and P(ki | R̄)? Estimates based on assumptions:
  P(ki | R) = 0.5
  P(ki | R̄) = ni / N
• Use this initial guess to retrieve an initial ranking
• Improve upon this initial ranking
CSE 8337 Spring 2011 28
Improving the Initial Ranking
• Let V: the set of docs initially retrieved; Vi: the subset of retrieved docs that contain ki
• Re-evaluate the estimates:
  P(ki | R) = Vi / V
  P(ki | R̄) = (ni − Vi) / (N − V)
• Repeat recursively
CSE 8337 Spring 2011 29
Improving the Initial Ranking
• To avoid problems with V = 1 and Vi = 0:
  P(ki | R) = (Vi + 0.5) / (V + 1)
  P(ki | R̄) = (ni − Vi + 0.5) / (N − V + 1)
• Also:
  P(ki | R) = (Vi + ni/N) / (V + 1)
  P(ki | R̄) = (ni − Vi + ni/N) / (N − V + 1)
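The initial estimates and the smoothed feedback re-estimates can be sketched together (hypothetical helper names; natural log is an assumption here, since any base only rescales the ranking):

```python
import math

def initial_weight(N, n_i):
    """Term weight under the initial assumptions P(ki|R) = 0.5, P(ki|not R) = ni/N."""
    p, q = 0.5, n_i / N
    return math.log(p / (1 - p)) + math.log((1 - q) / q)

def updated_probs(V, V_i, N, n_i):
    """Smoothed feedback re-estimates (avoids trouble when V = 1 and Vi = 0)."""
    p_rel = (V_i + 0.5) / (V + 1)
    p_nonrel = (n_i - V_i + 0.5) / (N - V + 1)
    return p_rel, p_nonrel
```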
CSE 8337 Spring 2011 30
Pluses and Minuses
Advantages:
• Docs ranked in decreasing order of probability of relevance
Disadvantages:
• need to guess initial estimates for P(ki | R)
• method does not take tf and idf factors into account
CSE 8337 Spring 2011 31
Brief Comparison of Classic Models
• The Boolean model does not provide for partial matches and is considered the weakest classic model
• Salton and Buckley did a series of experiments indicating that, in general, the vector model outperforms the probabilistic model on general collections
• This also seems to be the view of the research community
CSE 8337 Spring 2011 32
Extended Boolean Model
• The Boolean model is simple and elegant, but provides no ranking
• As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership
• Extend the Boolean model with the notions of partial matching and term weighting
• Combine characteristics of the vector model with properties of Boolean algebra
CSE 8337 Spring 2011 33
The Idea
• The Extended Boolean Model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra
• Let q = kx ∧ ky
• wxj = fx,j × (idfx / maxi idfi) is the weight associated with [kx, dj]
• Further, let wxj = x and wyj = y
CSE 8337 Spring 2011 34
The Idea:
qand = kx ∧ ky; wxj = x and wyj = y

sim(qand,dj) = 1 − sqrt( ((1 − x)² + (1 − y)²) / 2 )

[Figure, AND case: dj plotted at (x, y) in the kx–ky plane; (1,1) is the most desirable point]
CSE 8337 Spring 2011 35
The Idea:
qor = kx ∨ ky; wxj = x and wyj = y

sim(qor,dj) = sqrt( (x² + y²) / 2 )

[Figure, OR case: dj plotted at (x, y) in the kx–ky plane; (0,0) is the least desirable point]
CSE 8337 Spring 2011 36
Generalizing the Idea
• We can extend the previous model to consider Euclidean distances in a t-dimensional space
• This can be done using p-norms, which extend the notion of distance to include p-distances, where 1 ≤ p ≤ ∞ is a new parameter
CSE 8337 Spring 2011 37
Generalizing the Idea
• A generalized disjunctive query is given by qor = k1 ∨p k2 ∨p … ∨p kt
• A generalized conjunctive query is given by qand = k1 ∧p k2 ∧p … ∧p kt

sim(qor,dj) = ( (x1^p + x2^p + … + xm^p) / m )^(1/p)
sim(qand,dj) = 1 − ( ((1 − x1)^p + (1 − x2)^p + … + (1 − xm)^p) / m )^(1/p)
CSE 8337 Spring 2011 38
Properties
• If p = 1 (vector-like): sim(qor,dj) = sim(qand,dj) = (x1 + … + xm) / m
• If p = ∞ (fuzzy-like): sim(qor,dj) = max(xi); sim(qand,dj) = min(xi)
• By varying p, we can make the model behave as a vector model, as a fuzzy model, or as an intermediate model
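The p-norm similarities and their limiting behavior can be sketched as (illustrative code; the weight values are made up):

```python
def sim_or(x, p):
    """p-norm disjunctive similarity: ((x1^p + ... + xm^p) / m)^(1/p)."""
    m = len(x)
    return (sum(xi ** p for xi in x) / m) ** (1 / p)

def sim_and(x, p):
    """p-norm conjunctive similarity: 1 - (((1-x1)^p + ... + (1-xm)^p) / m)^(1/p)."""
    m = len(x)
    return 1 - (sum((1 - xi) ** p for xi in x) / m) ** (1 / p)

w = [0.2, 0.8]
print(sim_or(w, 1), sim_and(w, 1))   # p = 1: both equal the average, 0.5
print(sim_or(w, 100))                # large p: approaches max(w) = 0.8
```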
CSE 8337 Spring 2011 39
Properties
• This is quite powerful and is a good argument in favor of the extended Boolean model
• Example: q = (k1 ∨ k2) ∧ k3
• k1 and k2 are to be used as in vector retrieval, while the presence of k3 is required
• With p = 2:
  sim(q,dj) = sqrt( ( (1 − sqrt( ((1 − x1)² + (1 − x2)²) / 2 ))² + x3² ) / 2 )
CSE 8337 Spring 2011 40
Conclusions
• The model is quite powerful
• Its properties are interesting and might be useful
• Computation is somewhat complex
• However, distributivity does not hold for the ranking computation:
  q1 = (k1 ∨ k2) ∧ k3; q2 = (k1 ∧ k3) ∨ (k2 ∧ k3); sim(q1,dj) ≠ sim(q2,dj)
CSE 8337 Spring 2011 41
Vector Space Scoring
• First cut: distance between two points (= distance between the end points of the two vectors)
• Euclidean distance?
• Euclidean distance is a bad idea . . . because Euclidean distance is large for vectors of different lengths
CSE 8337 Spring 2011 42
Why distance is a bad idea
The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
CSE 8337 Spring 2011 43
Use angle instead of distance
• Thought experiment: take a document d and append it to itself. Call this document d′.
• "Semantically" d and d′ have the same content
• The Euclidean distance between the two documents can be quite large
• The angle between the two documents is 0, corresponding to maximal similarity
• Key idea: rank documents according to angle with the query
CSE 8337 Spring 2011 44
From angles to cosines
• The following two notions are equivalent:
  • Rank documents in increasing order of the angle between query and document
  • Rank documents in decreasing order of cosine(query, document)
• Cosine is a monotonically decreasing function on the interval [0°, 180°]
CSE 8337 Spring 2011 45
Length normalization
• A vector can be (length-)normalized by dividing each of its components by its length; for this we use the L2 norm: ‖x‖₂ = sqrt(Σi xi²)
• Dividing a vector by its L2 norm makes it a unit (length) vector
• Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization
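A quick sketch of L2 normalization, showing that d and d′ (d appended to itself) become identical unit vectors (the component values are illustrative):

```python
import math

def l2_normalize(v):
    """Divide each component by the vector's L2 norm, yielding a unit vector."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

d = [3.0, 4.0]
d_doubled = [2 * x for x in d]   # "d appended to itself" doubles every count
print(l2_normalize(d))           # [0.6, 0.8]
print(l2_normalize(d_doubled))   # identical: [0.6, 0.8]
```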
CSE 8337 Spring 2011 46
cosine(query, document)

cos(q,d) = (q · d) / (‖q‖ ‖d‖) = ( Σi=1..V qi · di ) / ( sqrt(Σi=1..V qi²) · sqrt(Σi=1..V di²) )

(the dot product of the corresponding unit vectors)

qi is the tf-idf weight of term i in the query; di is the tf-idf weight of term i in the document; cos(q,d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.
CSE 8337 Spring 2011 47
Cosine similarity amongst 3 documents

How similar are the novels SaS (Sense and Sensibility), PaP (Pride and Prejudice), and WH (Wuthering Heights)?

Term frequencies (counts):

term       SaS  PaP  WH
affection  115   58  20
jealous     10    7  11
gossip       2    0   6
wuthering    0    0  38
CSE 8337 Spring 2011 48
3 documents example contd.

Log frequency weighting:

term       SaS   PaP   WH
affection  3.06  2.76  2.30
jealous    2.00  1.85  2.04
gossip     1.30  0     1.78
wuthering  0     0     2.58

After length normalization:

term       SaS    PaP    WH
affection  0.789  0.832  0.524
jealous    0.515  0.555  0.465
gossip     0.335  0      0.405
wuthering  0      0      0.588

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
Why do we have cos(SaS,PaP) > cos(SaS,WH)?
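The example above can be recomputed end to end (log-frequency weighting, L2 normalization, then dot products); this sketch uses base-10 logs as in the slides:

```python
import math

def log_tf(c):
    """Log-frequency weight: 1 + log10(count) for count > 0, else 0."""
    return 1 + math.log10(c) if c > 0 else 0.0

counts = {                      # term counts from the table above
    "SaS": [115, 10, 2, 0],     # affection, jealous, gossip, wuthering
    "PaP": [58, 7, 0, 0],
    "WH":  [20, 11, 6, 38],
}

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

vecs = {name: unit([log_tf(c) for c in cs]) for name, cs in counts.items()}

def cos(a, b):
    return sum(x * y for x, y in zip(vecs[a], vecs[b]))

print(round(cos("SaS", "PaP"), 2))  # 0.94
```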
CSE 8337 Spring 2011 49
tf-idf weighting has many variants
[Table (not reproduced): SMART notation for tf, df, and normalization variants; columns headed ‘n’ are acronyms for weight schemes.]
Why is the base of the log in idf immaterial?
CSE 8337 Spring 2011 50
Weighting may differ in Queries vs Documents
• Many search engines allow different weightings for queries vs. documents
• To denote the combination in use in an engine, we use the notation qqq.ddd with the acronyms from the previous table
• Example: ltn.ltc means:
  Query: logarithmic tf (l in leftmost column), idf (t in second column), no normalization
  Document: logarithmic tf, no idf, and cosine normalization
• Is this a bad idea?
CSE 8337 Spring 2011 51
tf-idf example: ltn.lnc

Document: "car insurance auto insurance"; Query: "best car insurance"

           Query                         Document                   Prod
Term       tf-raw tf-wt df     idf wt   tf-raw tf-wt wt   n'lized
auto       0      0     5000   2.3 0    1      1     1    0.52     0
best       1      1     50000  1.3 1.3  0      0     0    0        0
car        1      1     10000  2.0 2.0  1      1     1    0.52     1.04
insurance  1      1     1000   3.0 3.0  2      1.3   1.3  0.68     2.04

Exercise: what is N, the number of docs?
Doc length = sqrt(1² + 0² + 1² + 1.3²) ≈ 1.92
Score = 0 + 0 + 1.04 + 2.04 = 3.08
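The ltn.lnc score can be recomputed without rounding the intermediate weights. N = 1,000,000 is an assumption, chosen to be consistent with the idf column (e.g. idf(auto) = log10(N/5000) = 2.3):

```python
import math

def log_tf(c):
    return 1 + math.log10(c) if c > 0 else 0.0

N = 1_000_000  # assumed; makes idf(auto) = log10(N / 5000) = 2.3
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
q_tf = {"best": 1, "car": 1, "insurance": 1}
d_tf = {"auto": 1, "car": 1, "insurance": 2}

# ltn query weights: log tf x idf, no normalization
q_wt = {t: log_tf(c) * math.log10(N / df[t]) for t, c in q_tf.items()}
# lnc document weights: log tf, no idf, cosine (L2) normalization
d_raw = {t: log_tf(c) for t, c in d_tf.items()}
norm = math.sqrt(sum(w * w for w in d_raw.values()))
d_wt = {t: w / norm for t, w in d_raw.items()}

score = sum(q_wt[t] * d_wt.get(t, 0.0) for t in q_wt)
print(round(score, 2))  # 3.07; the slide's 3.08 comes from rounded intermediates
```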
CSE 8337 Spring 2011 52
Summary – vector space ranking
• Represent the query as a weighted tf-idf vector
• Represent each document as a weighted tf-idf vector
• Compute the cosine similarity score for the query vector and each document vector
• Rank documents with respect to the query by score
• Return the top K (e.g., K = 10) to the user
CSE 8337 Spring 2011 53
Vector Model and Web Search
• Speeding up vector space ranking
• Putting together a complete search system
• Will require learning about a number of miscellaneous topics and heuristics
CSE 8337 Spring 2011 54
Efficient cosine ranking
• Find the K docs in the collection "nearest" to the query, i.e., the K largest query-doc cosines
• Efficient ranking means: computing a single cosine efficiently, and choosing the K largest cosine values efficiently
• Can we do this without computing all N cosines?
CSE 8337 Spring 2011 55
Efficient cosine ranking
• What we’re doing in effect: solving the K-nearest-neighbor problem for a query vector
• In general, we do not know how to do this efficiently for high-dimensional spaces
• But it is solvable for short queries, and standard indexes support this well
CSE 8337 Spring 2011 56
Special case – unweighted queries
• No weighting on query terms
• Assume each query term occurs only once
• Then, for ranking, we don’t need to normalize the query vector
CSE 8337 Spring 2011 57
Faster cosine: unweighted query
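The scheme this slide names can be sketched as a term-at-a-time scoring loop (the postings layout, names, and sample weights are assumptions; this mirrors the standard CosineScore-style algorithm):

```python
import heapq

def cosine_score(query_terms, postings, lengths, K=10):
    """Term-at-a-time scoring: accumulate per-doc weights (query tf assumed 1,
    so no query normalization), divide by document length, take the K largest."""
    scores = {}
    for t in query_terms:
        for doc, w in postings.get(t, []):
            scores[doc] = scores.get(doc, 0.0) + w
    for doc in scores:
        scores[doc] /= lengths[doc]
    return heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])

# Tiny illustrative index (weights and lengths are made up)
postings = {"car": [(1, 0.52), (2, 0.30)], "insurance": [(1, 0.68)]}
lengths = {1: 1.0, 2: 1.0}
top = cosine_score(["car", "insurance"], postings, lengths, K=2)
print(top)  # doc 1 first, with score 1.2
```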
CSE 8337 Spring 2011 58
Computing the K largest cosines: selection vs. sorting
• Typically we want to retrieve the top K docs (in the cosine ranking for the query), not to totally order all docs in the collection
• Can we pick off the docs with the K highest cosines?
• Let J = number of docs with nonzero cosines; we seek the K best of these J
CSE 8337 Spring 2011 59
Use heap for selecting top K
• Binary tree in which each node’s value > the values of its children
• Takes 2J operations to construct; then each of the K "winners" is read off in 2·log J steps
• For J = 1M and K = 100, this is about 10% of the cost of sorting

[Figure: max-heap with root 1 and values .9, .3, .8, .3, .1, .1 below]
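A heap-based top-K selection can be sketched with the standard library; `heapq.nlargest` uses a heap internally, and the score values here echo the heap figure:

```python
import heapq

# J = 7 nonzero cosines; we want the K = 3 largest without a full sort.
scores = [0.1, 0.9, 0.3, 0.8, 0.3, 0.1, 1.0]
K = 3
top_k = heapq.nlargest(K, scores)  # heap construction is O(J); each winner costs O(log J)
print(top_k)  # [1.0, 0.9, 0.8]
```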
CSE 8337 Spring 2011 60
Bottlenecks
• Primary computational bottleneck in scoring: cosine computation
• Can we avoid all this computation?
• Yes, but we may sometimes get it wrong: a doc not in the top K may creep into the list of K output docs
• Is this such a bad thing?
CSE 8337 Spring 2011 61
Cosine similarity is only a proxy
• The user has a task and a query formulation; cosine matches docs to the query
• Thus cosine is anyway a proxy for user happiness
• If we get a list of K docs "close" to the top K by the cosine measure, that should be OK
CSE 8337 Spring 2011 62
Generic approach
• Find a set A of contenders, with K < |A| << N
• A does not necessarily contain the top K, but has many docs from among the top K
• Return the top K docs in A
• Think of A as pruning non-contenders
• The same approach is also used for other (non-cosine) scoring functions
• We will look at several schemes following this approach
CSE 8337 Spring 2011 63
Index elimination
• The basic algorithm of Fig 7.1 only considers docs containing at least one query term
• Take this further:
  • Only consider high-idf query terms
  • Only consider docs containing many query terms
CSE 8337 Spring 2011 64
High-idf query terms only
• For a query such as "catcher in the rye", only accumulate scores from "catcher" and "rye"
• Intuition: "in" and "the" contribute little to the scores and don’t alter the rank ordering much
• Benefit: postings of low-idf terms have many docs; these (many) docs get eliminated from A
CSE 8337 Spring 2011 65
Docs containing many query terms
• Any doc with at least one query term is a candidate for the top-K output list
• For multi-term queries, only compute scores for docs containing several of the query terms (say, at least 3 out of 4)
• Imposes a "soft conjunction" on queries, as seen on web search engines (early Google)
• Easy to implement in postings traversal
CSE 8337 Spring 2011 66
3 of 4 query terms

Antony:    3  4  8  16  32  64  128
Brutus:    2  4  8  16  32  64  128
Caesar:    1  2  3  5  8  13  21  34
Calpurnia: 13 16 32

Scores are only computed for docs 8, 16, and 32 (the only docs appearing in at least 3 of the 4 postings lists).
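The 3-of-4 filter on the postings above can be recomputed (illustrative code):

```python
from collections import Counter

postings = {
    "Antony":    [3, 4, 8, 16, 32, 64, 128],
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
    "Calpurnia": [13, 16, 32],
}

# Count, for each doc, how many query-term postings lists contain it.
counts = Counter(d for plist in postings.values() for d in plist)
candidates = sorted(d for d, c in counts.items() if c >= 3)
print(candidates)  # [8, 16, 32]
```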
CSE 8337 Spring 2011 67
Champion lists
• Precompute, for each dictionary term t, the r docs of highest weight in t’s postings; call this the champion list for t (aka fancy list or top docs for t)
• Note that r has to be chosen at index time
• At query time, only compute scores for docs in the champion list of some query term; pick the K top-scoring docs from amongst these
CSE 8337 Spring 2011 68
Static quality scores
• We want top-ranking documents to be both relevant and authoritative
• Relevance is being modeled by cosine scores; authority is typically a query-independent property of a document
• Examples of authority signals: Wikipedia among websites, articles in certain newspapers, a paper with many citations, many diggs, Y!buzzes or del.icio.us marks, (PageRank)
CSE 8337 Spring 2011 69
Modeling authority
• Assign to each document d a query-independent quality score in [0,1]; denote this by g(d)
• Thus a quantity like the number of citations is scaled into [0,1]
• Exercise: suggest a formula for this
CSE 8337 Spring 2011 70
Net score
• Consider a simple total score combining cosine relevance and authority:
  net-score(q,d) = g(d) + cosine(q,d)
• Can use some other linear combination than an equal weighting; indeed, any function of the two "signals" of user happiness (more later)
• Now we seek the top K docs by net score
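The net score is just a linear combination; a one-line sketch (the weight parameter `w` is an assumption standing in for the "other linear combination" case):

```python
def net_score(g, cos_sim, w=1.0):
    """net-score(q,d) = w * g(d) + cosine(q,d); the slide's equal weighting
    corresponds to w = 1.0, and other w values give other linear combinations."""
    return w * g + cos_sim

print(net_score(0.3, 0.5))  # 0.8
```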
CSE 8337 Spring 2011 71
Top K by net score – fast methods
• First idea: order all postings by g(d)
• Key: this is a common ordering for all postings
• Thus we can concurrently traverse query terms’ postings for postings intersection and cosine score computation
• Exercise: write pseudocode for cosine score computation if postings are ordered by g(d)
CSE 8337 Spring 2011 72
Why order postings by g(d)?
• Under g(d)-ordering, top-scoring docs are likely to appear early in postings traversal
• In time-bound applications (say, we have to return whatever search results we can in 50 ms), this allows us to stop postings traversal early, short of computing scores for all docs in postings
CSE 8337 Spring 2011 73
Champion lists in g(d)-ordering
• Can combine champion lists with g(d)-ordering
• Maintain for each term a champion list of the r docs with highest g(d) + tf-idft,d
• Seek the top-K results from only the docs in these champion lists
CSE 8337 Spring 2011 74
High and low lists
• For each term, we maintain two postings lists called high and low; think of high as the champion list
• When traversing postings on a query, traverse the high lists first; if we get more than K docs, select the top K and stop, else proceed to get docs from the low lists
• Can be used even for simple cosine scores, without a global quality g(d)
• A means of segmenting the index into two tiers
Impact-ordered postings
• We only want to compute scores for docs for which wft,d is high enough
• We sort each postings list by wft,d
• Now: not all postings are in a common order!
• How do we compute scores in order to pick off the top K? Two ideas follow.
CSE 8337 Spring 2011 76
1. Early termination
• When traversing t’s postings, stop early after either a fixed number r of docs, or once wft,d drops below some threshold
• Take the union of the resulting sets of docs, one from the postings of each query term
• Compute scores only for docs in this union
CSE 8337 Spring 2011 77
2. idf-ordered terms
• When considering the postings of query terms, look at them in order of decreasing idf; high-idf terms are likely to contribute most to the score
• As we update the score contribution from each query term, stop if doc scores are relatively unchanged
• Can apply to cosine or other net scores
CSE 8337 Spring 2011 78
Cluster pruning: preprocessing
• Pick √N docs at random: call these leaders
• For every other doc, precompute its nearest leader
• Docs attached to a leader are its followers
• Likely: each leader has ~√N followers
CSE 8337 Spring 2011 79
Cluster pruning: query processing
• Process a query as follows:
  • Given query Q, find its nearest leader L
  • Seek the K nearest docs from among L’s followers
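Both cluster-pruning phases can be sketched together (illustrative code; the similarity function and the 1-D "documents" are stand-ins for real term vectors and cosine):

```python
import math
import random

def preprocess(docs, sim):
    """Pick ~sqrt(N) random leaders; attach every doc to its nearest leader."""
    n_leaders = max(1, int(math.sqrt(len(docs))))
    leaders = random.sample(list(docs), n_leaders)
    followers = {ldr: [] for ldr in leaders}
    for d in docs:
        nearest = max(leaders, key=lambda ldr: sim(d, ldr))
        followers[nearest].append(d)
    return leaders, followers

def answer(q, leaders, followers, sim, K):
    """Find the nearest leader L, then the K nearest docs among L's followers."""
    L = max(leaders, key=lambda ldr: sim(q, ldr))
    return sorted(followers[L], key=lambda d: sim(q, d), reverse=True)[:K]

random.seed(0)                   # deterministic sketch
docs = [0, 1, 2, 10, 11, 12]     # stand-ins for document vectors
sim = lambda a, b: -abs(a - b)   # higher means more similar
leaders, followers = preprocess(docs, sim)
result = answer(3, leaders, followers, sim, K=2)
```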
CSE 8337 Spring 2011 80
Visualization
Query
Leader Follower
CSE 8337 Spring 2011 81
Why use random sampling
• Fast
• Leaders reflect the data distribution
CSE 8337 Spring 2011 82
General variants
• Have each follower attached to b1 = 3 (say) nearest leaders
• From the query, find b2 = 4 (say) nearest leaders and their followers
• Can recur on the leader/follower construction
CSE 8337 Spring 2011 83
Putting it all together