
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer

Distributed Cache Table: Efficient Query-Driven Processing of Multi-Term Queries in P2P Networks

(“Cache” replaces “Hash” in “Distributed Hash Table”)

P2PIR’2006, collocated with CIKM’06, Arlington VA, USA

Gleb Skobeltsyn, Karl Aberer

Nov 11, 2006

EPFL Ecole Polytechnique Fédérale de Lausanne, Switzerland

Problem definition

• Given a document corpus stored in a DHT P2P network

• Provide an efficient indexing mechanism to find matching documents given a multi-term query

• Traffic consumption to be minimized

• The storage space provided by peers is limited

• Solutions: broadcast, naïve indexing of terms, HDK…

How does the naïve approach work? (1)

• Naïve approach 1: store terms’ inverted lists in a DHT
• An inverted list contains document ids.

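[Figure: DHT peers, each holding a table of (key K, inverted list I) pairs. Stored entries: (h(T1), {I1,I2}), (h(T2), {I2,I3}), (h(T3), {I4,I5}). For the query “T1 AND T2”, the list {I1,I2} is transferred and the intersection {I2} is returned.]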

This slide was borrowed from the presentation by B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker and I. Stoica: Enhancing P2P File-Sharing with an Internet-Scale Query Processor
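A minimal sketch of naïve approach 1, under stated assumptions: a single in-process dictionary stands in for the DHT, `h` is an illustrative key function, and traffic is counted as the number of document ids fetched from the peers holding the lists. None of these names come from the slides; only the three stored entries and the query do.

```python
import hashlib

def h(term: str) -> str:
    """Hypothetical DHT key function: a hash of the term."""
    return hashlib.sha1(term.encode("utf-8")).hexdigest()

# The DHT is modelled as one dictionary: key -> inverted list of document ids.
dht = {
    h("T1"): {"I1", "I2"},
    h("T2"): {"I2", "I3"},
    h("T3"): {"I4", "I5"},
}

def naive_and_query(terms):
    """Fetch the inverted list of every term and intersect them.

    Returns (result, traffic). Traffic is the total number of document
    ids fetched, which is an assumed cost model for illustration.
    """
    lists = [dht.get(h(t), set()) for t in terms]
    traffic = sum(len(ids) for ids in lists)
    result = set.intersection(*lists) if lists else set()
    return result, traffic

result, traffic = naive_and_query(["T1", "T2"])
print(sorted(result), traffic)
```

The point of the sketch is the cost: the answer holds one document, yet every id in both lists has to move first.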

How does the naïve approach work? (2)

• Naïve approach 2: store terms’ inverted lists in a DHT
• An inverted list contains document summaries.

[Figure: the same DHT with entries (h(T1), {I1,I2}), (h(T2), {I2,I3}), (h(T3), {I4,I5}). For the query “T1 AND T2”, the answer {I2} is returned by one peer OR by the other.]

Can we do better?

• Inverted lists can be very large => consume traffic
• Indexing of all/selected terms in all documents => huge redundancy in the index, space limitations
• Indexing of term combinations => how to choose them?
• Many index items are never or very rarely used.

• Our idea:
– Indexing = caching
– Efficiently fill in the available (distributed) storage space with result sets for popular queries
– Use stored caches to answer queries

What is our idea?

• Conventionally, an index is generated purely from the data

• This leaves a very large number of unused index entries

Let us use the query popularity distribution by gathering statistics!

• We try to build an index specifically targeted for the current query log

• The size of the index is bounded by the available storage provided by peers

• Everything which is not indexed is searched via broadcast

What is our idea? Another explanation

• Given a set of documents, each doc contains a set of terms

• We have an inverted index over all extracted terms: {key=h(term)} – {inverted list}

Documents:
D1: T1, T2, T3
D2: T1, T4, T5
D3: T1, T2, T6
D4: T5, T6, T7
D5: T1, T8, T9

Search keys and their inverted lists:
T1: D1, D2, D3, D5
T2: D1, D3
T3: D1
T4: D2
T5: D2, D4
T6: D3, D4
T7: D4
T8: D5
T9: D5

Query popularity:
T1 & T2: very high
T3: high
T3 & T4: high
T7: low
T8 & T9: very low

• We can monitor query load statistics:
– Delete unused index entries
– Index term combinations (queries), e.g. T1&T2: D1, D3

Idea: example

Data:

ID   Data
1    Efficient search in P2P
2    Query processing in P2P networks
3    Search via network flooding

Index (single terms):

Term         Inv.list
Efficient    1
Search       1,3
P2P          1,2
Query        2
Processing   2
Network      2,3
Flooding     3

Index (single terms and term combinations):

Term               Inv.list
Efficient          1
Query              2
Processing         2
Flooding           3
P2P & Search       1
Network & Search   3
Network & P2P      2

Query statistics:

search flooding
search query P2P
flooding
query processing P2P
query P2P

Index (queries):

Query         Inv.list
Flooding      3
Query & P2P   2
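The example can be replayed in a few lines. This is a sketch under assumptions: terms are lower-cased and "networks" is stemmed to "network" by hand, and the cache is assumed to be able to filter its stored result set against each document's terms (the slides do not say how filtering is done). The names `docs`, `cache`, `query_log` and `answer` are illustrative.

```python
# Documents from the slide, reduced to lower-cased, hand-stemmed terms.
docs = {
    1: {"efficient", "search", "p2p"},
    2: {"query", "processing", "p2p", "network"},
    3: {"search", "network", "flooding"},
}

# Query-driven index from the slide: cached query -> result set.
cache = {
    frozenset({"flooding"}): {3},
    frozenset({"query", "p2p"}): {2},
}

# The five queries listed under "Query statistics".
query_log = [
    {"search", "flooding"},
    {"search", "query", "p2p"},
    {"flooding"},
    {"query", "processing", "p2p"},
    {"query", "p2p"},
]

def answer(query):
    """Answer a query from a cached query whose terms are all in it.

    Returns the filtered result set, or None when no cache applies
    (such a query would have to be broadcast).
    """
    q = frozenset(query)
    for cached, result_set in cache.items():
        if cached <= q:
            # Assumed filtering step: keep documents containing all of q.
            return {d for d in result_set if q <= docs[d]}
    return None

for query in query_log:
    print(sorted(query), "->", answer(query))
```

With only two cached entries, all five logged queries are answered without a broadcast; "search query P2P" correctly yields an empty result.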

What are we searching for?

Cache all queries

Index all data

Query-driven indexing structure

Query subsumption? Unused index items?

Contents

• Motivation & Idea
• Query subsumption
• Optimization problem
• DCT’s indexing and caching strategy:
– Meta-index
– Cache management
– Top-K caching
– Load Balancing
• Evaluations
• Conclusions

Query subsumption

• Given a query q, we are interested in locating at least one cache for a query q’ s.t.: RS(q’) contains RS(q)

• Query subsumption: q’ subsumes q if all terms of q’ are contained in q. Every document that matches q then also matches q’, so RS(q’) contains RS(q).

• We can demonstrate subsumption on a lattice of size $2^m - 1$, where m is the number of terms

[Figure: the subsumption lattice for m = 4 terms, shown twice. Levels: a, b, c, d; ab, ac, ad, bc, bd, cd; abc, abd, acd, bcd; abcd. The second copy illustrates query subsumption if a and cd are cached.]
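The lattice example can be checked with a short sketch. It assumes single-character terms so that a query can be written as a string such as "acd"; `subsumes` and `lattice` are illustrative names, not part of DCT.

```python
from itertools import combinations

def subsumes(q_cached, q):
    """q_cached subsumes q if every term of q_cached appears in q."""
    return set(q_cached) <= set(q)

def lattice(terms):
    """All non-empty term combinations: 2^m - 1 queries for m terms."""
    terms = sorted(terms)
    return [frozenset(c)
            for k in range(1, len(terms) + 1)
            for c in combinations(terms, k)]

# The case from the figure: caches exist for "a" and "cd".
cached = [frozenset("a"), frozenset("cd")]
answerable = [q for q in lattice("abcd")
              if any(subsumes(c, q) for c in cached)]

print(len(lattice("abcd")))
print(sorted("".join(sorted(q)) for q in answerable))
```

Two caches cover 10 of the 15 lattice nodes: every query containing a, plus cd and bcd.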

Optimization problem

• A vocabulary $T = \{t_1, t_2, \dots, t_m\}$: all terms in the query load.
– A query $q = \{t_1, t_2, \dots, t_n\}$: $q \in 2^T$
– A document $d = \{t_1, t_2, \dots, t_r\}$: $d \in 2^T$

• A query load $L = \{q_1, q_2, \dots, q_l\}$: $q_i \in 2^T$
– $p(q_i)$: probability, $|RS(q_i)|$: result set size for $q_i \in L$

• A cachehit function:
– $\mathrm{cachehit}(q) = 1$, if there exists a cached query q’ subsuming q;
– $\mathrm{cachehit}(q) = 0$, otherwise.

• Problem: to find a set of cached queries $\Omega$, s.t.:
– $\Omega = \arg\max \sum_{q_i \in L} \mathrm{cachehit}(q_i) \cdot p(q_i)$
– Having a storage constraint: $S_\Omega = \sum_{q_i \in \Omega} |RS(q_i)| < S_0$

A document d is a valid answer for a query q <=> d contains q
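To make the objective concrete, the sketch below evaluates it by exhaustive search on a tiny, made-up query load. The probabilities, result set sizes and the budget `S0 = 35` are hypothetical values chosen for illustration, and brute force is used only because the instance is tiny; it is not the strategy DCT uses.

```python
from itertools import combinations

# Hypothetical query load: query -> (probability p, result set size |RS|).
# Queries are sets of single-character terms.
load = {
    frozenset("a"):   (0.1, 50),
    frozenset("ab"):  (0.4, 10),
    frozenset("abc"): (0.2, 3),
    frozenset("cd"):  (0.3, 20),
}
S0 = 35  # hypothetical storage budget, in records

def cachehit(q, omega):
    """1 if some cached query in omega subsumes q, else 0."""
    return 1 if any(c <= q for c in omega) else 0

def objective(omega):
    """Sum of cachehit(q_i) * p(q_i) over the query load."""
    return sum(cachehit(q, omega) * p for q, (p, _) in load.items())

def storage(omega):
    """Total result set size of the cached queries."""
    return sum(load[q][1] for q in omega)

def best_cache_set():
    """Try every subset of the load; keep the first best feasible one."""
    best, best_value = frozenset(), 0.0
    queries = list(load)
    for k in range(1, len(queries) + 1):
        for omega in combinations(queries, k):
            if storage(omega) < S0 and objective(omega) > best_value:
                best, best_value = frozenset(omega), objective(omega)
    return best, best_value

best, value = best_cache_set()
print(sorted("".join(sorted(q)) for q in best), round(value, 3))
```

In this instance caching "ab" and "cd" uses 30 records and covers "ab", "abc" (via subsumption) and "cd", while the large result set of "a" never fits.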

DCT: Indexing and caching strategy

• DCT caches result sets of certain queries without constraining physical cache locations

• Each peer is running two services:
– Meta-index service: stores index items with cache locations
– Caching service: answers a query from a cache

• Meta-index: given a query q, finds a list of cache locations capable of answering q.

• Caching service: returns the result set for q from the q’ cache (q’ subsumes q).

DCT: Meta-index

• Meta-index is based on the standard DHT indexing functionality.

• Index update: If a peer π caches a query q, it advertises the cache’s availability in the meta-index: it inserts a tuple {q -> address(π)} at the peer responsible for a random term from q.

• Lookup: If a query q=t1&t2&…&tn is submitted, every peer responsible for t1,t2…tn is asked to provide a set of caches it indexes that subsume q. One of them (if any) is chosen randomly.

DCT: Meta-index example

q=“acd” is submitted at πorig

[Figure: peer πa holds a meta-index entry for “a”, peer πc holds a meta-index entry for “cd”, and peer π holds the cache RS(“cd”). Arrows (1) to (4) correspond to the steps below. Legend: meta-index, cache, P2P.]

1. πorig looks up the meta-index: contacts peers πa, πc and πd**

2. πa, πc and πd respond with known locations of caches subsuming q

3. πorig randomly selects a cache from the obtained list. Assume “cd” is picked.

4. RS(q) is sent to πorig

** interactions with πd are not shown
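The update and lookup steps can be sketched as follows. This is a toy model under assumptions: one dictionary per term stands in for the peers responsible for that term, peer addresses are plain strings, and the class and method names (`MetaIndex`, `advertise`, `lookup`, `pick`) are invented for the illustration.

```python
import random

class MetaIndex:
    """Toy meta-index: one entry list per term-responsible peer."""

    def __init__(self, seed=0):
        self.by_term = {}  # term -> list of (cached query, peer address)
        self.rng = random.Random(seed)

    def advertise(self, query, address, term=None):
        """Index update: store {q -> address} under one term of q.

        The slides pick the term at random; `term` lets a caller fix it
        so that an example is reproducible.
        """
        if term is None:
            term = self.rng.choice(sorted(query))
        self.by_term.setdefault(term, []).append((frozenset(query), address))

    def lookup(self, query):
        """Ask the peer of every term of q for caches that subsume q."""
        q = frozenset(query)
        found = []
        for term in sorted(q):
            for cached, address in self.by_term.get(term, []):
                if cached <= q:
                    found.append((cached, address))
        return found

    def pick(self, query):
        """Choose one of the found caches at random, or None."""
        found = self.lookup(query)
        return self.rng.choice(found) if found else None

index = MetaIndex()
index.advertise("a", "peer-with-RS(a)", term="a")     # entry at the peer for a
index.advertise("cd", "peer-with-RS(cd)", term="c")   # entry at the peer for c
print(index.lookup("acd"))
print(index.pick("acd"))
```

Indexing a cache under a single term is enough: any cached query that subsumes q consists only of terms of q, so the term it was stored under is always among the peers contacted during lookup.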

DCT: Cache Management

• Each peer provides some storage space s0 for caches

• Caches with low profits are evicted:

profit(q)=popularity(q) / (|RS(q)|+1)

• Every time a peer has to broadcast a query, it tries to cache it

• The query q with the result set size |RS(q)| is cached if:
– There is enough free space to store RS(q), or
– There is NOT enough free space, but the least profitable caches can be dropped to fit the cache for q.
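A sketch of this admission and eviction rule for a single peer. It uses the profit formula from the slide and adds one assumption the slide leaves open: only caches whose profit is lower than that of the new query may be dropped. `PeerCache` and `try_cache` are illustrative names.

```python
def profit(popularity, rs_size):
    """profit(q) = popularity(q) / (|RS(q)| + 1), as on the slide."""
    return popularity / (rs_size + 1)

class PeerCache:
    def __init__(self, capacity):
        self.capacity = capacity  # s0, in records
        self.entries = {}         # query -> (popularity, rs_size)

    def used(self):
        return sum(size for _, size in self.entries.values())

    def try_cache(self, query, popularity, rs_size):
        """Cache the query if it fits, evicting low-profit caches if needed."""
        free = self.capacity - self.used()
        if rs_size <= free:
            self.entries[query] = (popularity, rs_size)
            return True
        new_profit = profit(popularity, rs_size)
        # Assumption: only caches less profitable than the newcomer are
        # eviction candidates, taken in order of increasing profit.
        victims = sorted(
            (q for q, (p, s) in self.entries.items()
             if profit(p, s) < new_profit),
            key=lambda q: profit(*self.entries[q]),
        )
        chosen = []
        for q in victims:
            if rs_size <= free:
                break
            chosen.append(q)
            free += self.entries[q][1]
        if rs_size > free:
            return False  # even dropping all candidates would not fit q
        for q in chosen:
            del self.entries[q]
        self.entries[query] = (popularity, rs_size)
        return True

cache = PeerCache(capacity=100)
print(cache.try_cache("x", popularity=10, rs_size=60))   # fits
print(cache.try_cache("y", popularity=1, rs_size=30))    # fits
print(cache.try_cache("z", popularity=20, rs_size=35))   # evicts y
print(cache.try_cache("w", popularity=0.1, rs_size=50))  # rejected
print(sorted(cache.entries))
```

Nothing is evicted unless the new cache actually fits afterwards, so a rejected query leaves the peer's caches untouched.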

DCT: Top-K caching

• Problem:
– A popular query q with a large result set might NOT be cached as its profit is relatively low

• Solution:
– Introduce a top-k cache:
– Can serve only q, no subsumption;
– But consumes little space, avoids broadcasting the popular q

Evaluation

Evaluations: query load and data

• Source data:
– English Wikipedia XML dump (6 GB), May 2006
– Two Wikipedia query traces from August and September 2004

• Query load properties (August trace):
– 1.3M unique queries, asked 4.6M times during the month
– 500K repeated at least twice, 800K only once
– 225K unique terms in both traces (after stemming)
– Average number of terms in a query = 2.6

• Java simulation:
– Simulates a number of virtual peers
– Each peer provides 200K records of storage space

Evaluations: how much storage do we need?

[Figure: cache hit (%) versus the number of peers (1 to 1K, log scale), each peer providing 200K records of capacity. Series: CacheHit, SubsumHit, TopKHit.]

• 98% max cache hit with unlimited storage

• 81% max cache hit with unlimited storage but no subsumption

Evaluations: Traffic consumption

• 100 peers, 200K each

• Converges to 85% cache hit with 100x200K=20M records global cache capacity

• The naïve approach requires at least 240M records for the term index (if built for query load terms only)

[Figure: cache hit and space utilization (%) versus the number of generated queries (0 to 6.0M). Series: CacheHit (%), SubsumHit (%), TopKHit (%), SpaceUtilization (%).]

[Figure: average traffic per query (records, log scale from 1 to 100K) versus the number of generated queries (0 to 6.0M). Series: Naive-Random, Naive-Sort, Broadcast, DCT-All.]

Evaluations: stress test

• 300 peers, 200K each

• Converges to 97% cache hit with 300x200K=60M capacity

• The cache hit rate drops only slightly when the query load changes, thanks to subsumption

[Figure: cache hit (%) versus the number of generated queries (0 to 9.0M), with the point where the new query load starts marked. Series: CacheHit, SubsumHit, TopKHit, SpaceUtilization.]

[Figure: average traffic per query (records, log scale from 1 to 100K) versus the number of generated queries (0 to 9.0M), with the point where the new query load starts marked. Series: Naive-Random, Naive-Sort, Broadcast, DCT-All.]

Evaluations: load balancing

• Cache imbalance => only several peers are overloaded
• Meta-index imbalance => has less impact, can be partially avoided

[Figure: cache load (%) per peer, peers 0 to 90, load axis 0 to 5.]

[Figure: meta-index load (%) per peer, peers 0 to 90, load axis 0 to 5.]

Conclusions

• Distributed Cache Table: a (quite) large scale distributed cache for P2P IR applications based on both:
– Query load
– Data distribution

• Properties:
– Efficiently utilizes and adapts to the available storage space
– Trade-off between huge index size and extra traffic costs for broadcasting rare queries
– Subsumption is important: resilient to query load changes
– Sufficiently load balanced
– Requires 1-2 orders of magnitude less traffic than the naive approach
– Requires substantially less storage than a per-term index

Last slide

Thank you for your attention! Questions?