26
1 Random Sampling from a Search Engine‘s Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

Embed Size (px)

Citation preview

Page 1: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

1

Random Sampling from a Search Engine‘s IndexZiv Bar-YossefMaxim Gurevich

Department of Electrical Engineering Technion

Page 2: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

2

Search Engine Samplers

IndexPublicInterface

PublicInterface

Search Engine

Sampler

Web

D

Queries

Top k results

Random document x D

Indexed Documents

Page 3: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

3

Motivation Useful tool for search engine evaluation:

Freshness Fraction of up-to-date pages in the index

Topical bias Identification of overrepresented/underrepresented topics

Spam Fraction of spam pages in the index

Security Fraction of pages in index infected by viruses/worms/trojans

Relative Size Number of documents indexed compared with other search

engines

Page 4: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

4

Size Wars

August 2005

: We index 20 billion documents.

So, who’s right?

September 2005

: We index 8 billion documents, but our index is 3 times larger than our competition’s.

Page 5: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

5

Related Work

Random Sampling from a Search Engine’s Index[BharatBroder98, CheneyPerry05, GulliSignorni05]

Anecdotal queries [SearchEngineWatch, Google, BradlowSchmittlein00]

Queries from user query logs [LawrenceGiles98, DobraFeinberg04]

Random sampling from the whole web [Henzinger et al 00, Bar-Yossef et al 00, Rusmevichientong et al 01]

Page 6: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

6

Our Contributions

A pool-based sampler Guaranteed to produce near-uniform samples

A random walk sampler After sufficiently many steps, guaranteed to produce

near-uniform samples Does not need an explicit lexicon/pool at all!

Focus of this talk

Page 7: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

7

Search Engines as Hypergraphs

results(q) = { documents returned on query q } queries(x) = { queries that return x as a result } P = query pool = a set of queries Query pool hypergraph:

Vertices: Indexed documents Hyperedges: { result(q) | q P }

www.cnn.com

www.foxnews.com

news.google.com

news.bbc.co.ukwww.google.com

maps.google.com

www.bbc.co.uk

www.mapquest.com

maps.yahoot.com

“news”

“bbc”

“google”

“maps”

en.wikipedia.org/wiki/BBC

Page 8: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

8

Query Cardinalities and Document Degrees

Query cardinality: card(q) = |results(q)| Document degree: deg(x) = |queries(x)| Examples:

card(“news”) = 4, card(“bbc”) = 3 deg(www.cnn.com) = 1, deg(news.bbc.co.uk) = 2

www.cnn.com

www.foxnews.com

news.google.com

news.bbc.co.ukwww.google.com

maps.google.com

www.bbc.co.uk

www.mapquest.com

maps.yahoot.com

“news”

“bbc”

“google”

“maps”

en.wikipedia.org/wiki/BBC

Page 9: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

9

The Pool-Based Sampler: Preprocessing Step

C

Large corpusPq1

……

Query Pool

Example: P = all 3-word phrases that occur in C If “to be or not to be” occurs in C, P contains:

“to be or”, “be or not”, “or not to”, “not to be”

Choose P that “covers” most documents in D

q2

Page 10: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

10

Monte Carlo Simulation

We don’t know how to generate uniform samples from D directly

How can we use biased samples to generate uniform samples?

Samples with weights that represent their bias can be used to simulate uniform samples

Monte Carlo Simulation Methods

Rejection Sampling

Rejection Sampling

Importance Sampling

Importance Sampling

Metropolis-Hastings

Metropolis-Hastings

Maximum-Degree

Maximum-Degree

Page 11: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

11

Document Degree Distribution

We are able to generate biased samples from the “document degree distribution”

Advantage: Can compute weights representing the bias of p:

Page 12: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

12

Rejection Sampling [von Neumann]

accept := falsewhile (not accept)

generate a sample x from p toss a coin whose heads probability is wp(x) if coin comes up heads,

accept := true

return x

Page 13: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

13

Pool-Based Sampler

Degree distribution sampler

Degree distribution sampler

Search EngineSearch Engine

Rejection Sampling

Rejection Sampling

q1,q2,…results(q1), results(q2),…

x

Pool-Based Sampler

(x(x11,1/deg(x,1/deg(x11)),)),

(x(x22,1/deg(x,1/deg(x22)),…)),…

Uniform sample

Documents sampled from degree distribution with corresponding weights

Degree distribution: p(x) = deg(x) / x’deg(x’)

Page 14: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

14

Sampling documents by degree

Select a random q P Select a random x results(q) Documents with high degree are more likely to be sampled If we sample q uniformly “oversample” documents that

belong to narrow queries We need to sample q proportionally to its cardinality

www.cnn.com

www.foxnews.com

news.google.com

news.bbc.co.ukwww.google.com

maps.google.com

www.bbc.co.uk

www.mapquest.com

maps.yahoot.com

“news”

“bbc”

“google”

“maps”

en.wikipedia.org/wiki/BBC

Page 15: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

15

Sampling queries by cardinality

Sampling queries from pool uniformly: Easy Sampling queries from pool by cardinality: Hard

Requires knowing cardinalities of all queries in the search engine

Use Monte Carlo methods to simulate biased sampling via uniform sampling: Sample queries uniformly from P Compute “cardinality weight” for each sample:

Obtain queries sampled by their cardinality

Page 16: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

16

Dealing with Overflowing Queries

Problem: Some queries may overflow (card(q) > k) Bias towards highly ranked documents

Solutions: Select a pool P in which overflowing queries are rare

(e.g., phrase queries) Skip overflowing queries Adapt rejection sampling to deal with approximate weights

Theorem:

Samples of PB sampler are at most -away from uniform. ( = overflow probability of P)

Page 17: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

17

Bias towards Long Documents

0%

10%

20%

30%

40%

50%

60%

1 2 3 4 5 6 7 8 9 10

Deciles of documents ordered by size

Perc

ent

of

docu

ments

fro

m s

am

ple

.

Pool Based

Random Walk

Bharat-Broder

Page 18: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

18

Relative Sizes of Google, MSN and Yahoo!

Google = 1

Yahoo! = 1.28

MSN Search = 0.73

Page 19: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

19

Conclusions

Two new search engine samplersPool-based samplerRandom walk sampler

Samplers are guaranteed to produce near-uniform samples, under plausible assumptions.

Samplers show no or little bias in experiments.

Page 20: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

20

Thank You

Page 21: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

21

Top-Level Domains in Google, MSN and Yahoo!

0%

10%

20%

30%

40%

50%

60%

Top level domain name

Pe

rce

nt

of

do

cu

me

nts

fro

m s

am

ple

.

GoogleMSNYahoo!

Page 22: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

22

Query Cardinality Distribution

Unrealistic assumptions: Can sample queries from the cardinality distribution

In practice, don’t know a priori card(q) for all q P q P, 1 ≤ card(q) ≤ k

In practice, some queries underflow (card(q) = 0) oroverflow (card(q) > k)

results(q) = { documents returned on query q } card(q) = |results(q)| Cardinality distribution:

Page 23: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

23

Degree Distribution Sampler

Search EngineSearch Engine

results(q)

xCardinality Distribution Sampler

Cardinality Distribution Sampler

Sample x uniformly from results(q)

Sample x uniformly from results(q)

q

Degree Distribution Sampler

Query sampled from cardinality

distribution

Document sampled from

degree distribution

Page 24: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

24

Cardinality Distribution Sampler

Search EngineSearch Engine

q

Cardinality Distribution Sampler

Uniform Query Sampler

Uniform Query Sampler

Rejection Sampling

Rejection Sampling

q1,q2,…card(q1), card(q2),…

(q1,card(q1)/k),(q2,card(q2)/k),…

Sample from cardinality distribution

Uniform samples from P

Page 25: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

25

Degree Distribution Sampler

Degree Distribution Sampler

Complete Pool-Based Sampler

Search EngineSearch Engine

Rejection Sampling

Rejection Sampling

x(x,1/deg(x)),…

Uniform document

sample

Documents sampled from degree distribution with corresponding weights

Uniform Query Sampler

Uniform Query Sampler

Rejection Sampling

Rejection Sampling

(q,card(q)),…

Uniform query

sampleQuery

sampled from cardinality distribution

(q,results(q)),…

Page 26: 1 Random Sampling from a Search Engines Index Ziv Bar-Yossef Maxim Gurevich Department of Electrical Engineering Technion

26

A random walk sampler Define a graph G over the indexed documents

(x,y) E iff results(x) ∩ results(y) ≠

Run a random walk on G Limit distribution = degree distribution Use MCMC methods to make limit distribution uniform.

Metropolis-Hastings Maximum-Degree

Does not need a preprocessing step Less efficient than the pool-based sampler