P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer


Distributed Cache Table: Efficient Query-Driven Processing of Multi-Term Queries in P2P Networks

P2PIR’2006, collocated with CIKM’06, Arlington VA, USA

Gleb Skobeltsyn, Karl Aberer

Nov 11, 2006

EPFL Ecole Polytechnique Fédérale de Lausanne, Switzerland


Problem definition

• Given a document corpus stored in a DHT P2P network

• Provide an efficient indexing mechanism to find matching documents given a multi-term query

• Traffic consumption to be minimized

• The storage space provided by peers is limited

• Solutions: broadcast, naïve indexing of terms, HDK…


How does the naïve approach work? (1)

• Naïve approach 1: store the terms’ inverted lists in a DHT
• An inverted list contains document IDs.

[Figure: DHT peers, each holding a table of keys (K) and inverted lists (I). Stored entries: (h(T1), {I1,I2}), (h(T2), {I2,I3}), (h(T3), {I4,I5}). Query: “T1 AND T2”; the list {I1,I2} and the result {I2} are shown being transferred between peers.]

This slide was borrowed from B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker, I. Stoica presentation: Enhancing P2P File-Sharing with an Internet-Scale Query Processor
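The naïve per-term scheme can be sketched in a few lines. This is only an illustration under stated assumptions: the class `NaiveTermIndex`, the SHA-1 based `h`, the peer count, and the way traffic is counted (document IDs forwarded from one responsible peer to the next, then back to the originator) are hypothetical choices, not taken from the slides.

```python
import hashlib


def h(term: str, num_peers: int) -> int:
    """Stand-in for the DHT hash: map a term to the index of the responsible peer."""
    digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_peers


class NaiveTermIndex:
    """Toy model of naive approach 1: one inverted list per term, stored in a DHT."""

    def __init__(self, num_peers: int):
        self.num_peers = num_peers
        # One {term: set of document ids} table per simulated peer.
        self.peers = [dict() for _ in range(num_peers)]

    def publish(self, term: str, doc_ids) -> None:
        table = self.peers[h(term, self.num_peers)]
        table.setdefault(term, set()).update(doc_ids)

    def query_and(self, terms):
        """Answer a conjunctive query by forwarding the running intersection
        from one responsible peer to the next.

        Returns (result, traffic); traffic counts the document ids shipped
        between peers and finally back to the query originator.
        """
        current = None
        traffic = 0
        for term in terms:
            local = self.peers[h(term, self.num_peers)].get(term, set())
            current = set(local) if current is None else current & local
            traffic += len(current)  # sent to the next peer, or to the originator
        return (current if current is not None else set()), traffic


index = NaiveTermIndex(num_peers=8)
index.publish("T1", {"I1", "I2"})
index.publish("T2", {"I2", "I3"})
index.publish("T3", {"I4", "I5"})

result, traffic = index.query_and(["T1", "T2"])
print(result, traffic)
```

In this model the traffic grows with the length of the inverted lists, which is the cost the rest of the talk tries to avoid.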


How does the naïve approach work? (2)

• Naïve approach 2: store the terms’ inverted lists in a DHT
• An inverted list contains document summaries.

[Figure: the same DHT layout with entries (h(T1), {I1,I2}), (h(T2), {I2,I3}), (h(T3), {I4,I5}). Query: “T1 AND T2”; two alternative transfers of the result {I2} are shown, joined by “OR”.]


Can we do better?

• Inverted lists can be very large => consume traffic
• Indexing of all/selected terms in all documents => huge redundancy in the index, space limitations
• Indexing of term combinations => how to choose them?
• Many index items are never or very rarely used.

• Our idea:
– Indexing = caching
– Efficiently fill in the available (distributed) storage space with result sets for popular queries
– Use stored caches to answer queries


What is our idea?

• Conventionally, index is generated purely from the data

• Very large number of unused index entries

Let us use the query popularity distribution by gathering statistics!

• We try to build an index specifically targeted for the current query log

• The size of the index is bounded by the available storage provided by peers

• Everything which is not indexed is searched via broadcast


What is our idea? Another explanation

• Given a set of documents, each doc contains a set of terms

• We have an inverted index over all extracted terms: {key=h(term)} – {inverted list}

Documents:
D1: T1, T2, T3
D2: T1, T4, T5
D3: T1, T2, T6
D4: T5, T6, T7
D5: T1, T8, T9

Search keys and inverted lists:
T1: D1, D2, D3, D5
T2: D1, D3
T3: D1
T4: D2
T5: D2, D4
T6: D3, D4
T7: D4
T8: D5
T9: D5

Query popularity:
T1 & T2: very high
T3: high
T3 & T4: high
T7: low
T8 & T9: very low

• We can monitor query load statistics:
– Delete unused index entries
– Index term combinations (queries), e.g. T1 & T2: D1, D3
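As a small illustration of this slide, the per-term inverted lists and the result set of a term combination can be derived directly from the five documents. The helper names `build_inverted_index` and `result_set` are hypothetical; only the document contents come from the slide.

```python
from collections import defaultdict

# Documents and their terms, as listed on the slide.
documents = {
    "D1": {"T1", "T2", "T3"},
    "D2": {"T1", "T4", "T5"},
    "D3": {"T1", "T2", "T6"},
    "D4": {"T5", "T6", "T7"},
    "D5": {"T1", "T8", "T9"},
}


def build_inverted_index(docs):
    """Return {term: set of ids of the documents containing that term}."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return dict(index)


def result_set(index, query_terms):
    """Documents containing all query terms (intersection of inverted lists)."""
    lists = [index.get(term, set()) for term in query_terms]
    return set.intersection(*lists) if lists else set()


index = build_inverted_index(documents)
for term in sorted(index):
    print(term, sorted(index[term]))

# Result set for the popular term combination T1 & T2.
print(sorted(result_set(index, ["T1", "T2"])))
```

Storing the result for "T1 & T2" once is much smaller than shipping the full list for T1 on every such query, which is the motivation for indexing popular term combinations.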


Idea: example

Data:

ID  Data
1   Efficient search in P2P
2   Query processing in P2P networks
3   Search via network flooding

Query statistics:

search flooding
search query P2P
flooding
query processing P2P
query P2P

Index:

Term        Inv.list
Efficient   1
Search      1,3
P2P         1,2
Query       2
Processing  2
Network     2,3
Flooding    3

Term              Inv.list
Efficient         1
Query             2
Processing        2
Flooding          3
P2P & Search      1
Network & Search  3
Network & P2P     2

Query        Inv.list
Flooding     3
Query & P2P  2
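A minimal sketch of how the two cached queries from this example could serve the whole query log. It assumes that a cache can check the remaining query terms against each cached document (here via the `documents` term sets); the function `answer` and the "cache"/"broadcast" labels are hypothetical names used only for illustration.

```python
# Documents from the example, reduced to the index terms shown on the slide.
documents = {
    1: {"efficient", "search", "p2p"},
    2: {"query", "processing", "p2p", "network"},
    3: {"search", "network", "flooding"},
}

# Query-driven index from the example: cached query -> result set.
cache = {
    frozenset({"flooding"}): {3},
    frozenset({"query", "p2p"}): {2},
}


def answer(query_terms):
    """Answer from a cache whose query is contained in the new query,
    otherwise fall back to scanning all documents (standing in for broadcast)."""
    q = frozenset(query_terms)
    for cached_query, cached_result in cache.items():
        if cached_query <= q:
            # Filter the cached result set by the full query.
            hits = {d for d in cached_result if q <= documents[d]}
            return hits, "cache"
    hits = {d for d, terms in documents.items() if q <= terms}
    return hits, "broadcast"


query_log = [
    ["search", "flooding"],
    ["search", "query", "p2p"],
    ["flooding"],
    ["query", "processing", "p2p"],
    ["query", "p2p"],
]
for q in query_log:
    print(q, answer(q))
```

Under these assumptions every query in the log is served from one of the two caches, even though only two index entries are stored.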


What are we searching for?

[Figure labels: “Cache all queries”, “Index all data”, “Query-driven indexing structure”.]

Query subsumption? Unused index items?


Contents

• Motivation & Idea
• Query subsumption
• Optimization problem

• DCT’s indexing and caching strategy:
– Meta-index
– Cache management
– Top-K caching
– Load Balancing

• Evaluations
• Conclusions


Query subsumption

• Given a query q, we are interested in locating at least one cache for a query q’ s.t.: RS(q’) contains RS(q)

• Query subsumption: q’ subsumes q if all terms of q’ are contained in q. That means RS(q’) contains RS(q).

• We can demonstrate subsumption on a lattice of size 2^m − 1, where m is the number of terms

[Figure: lattice over the terms a, b, c, d, shown twice (plain, and with subsumed queries marked):
a b c d
ab ac ad bc bd cd
abc abd acd bcd
abcd
Caption: Query subsumption if a and cd are cached.]
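The lattice example can be checked mechanically. The sketch below enumerates all non-empty term combinations over {a, b, c, d} and lists those subsumed by the cached queries a and cd; the function names `lattice` and `subsumes` are illustrative only.

```python
from itertools import combinations


def lattice(terms):
    """All non-empty subsets of the vocabulary: 2^m - 1 possible queries."""
    terms = sorted(terms)
    return [frozenset(c)
            for r in range(1, len(terms) + 1)
            for c in combinations(terms, r)]


def subsumes(cached_query, query):
    """q' subsumes q if all terms of q' are contained in q."""
    return cached_query <= query


cached = [frozenset("a"), frozenset("cd")]
queries = lattice("abcd")
covered = [q for q in queries if any(subsumes(c, q) for c in cached)]

print(len(queries))  # 15 = 2^4 - 1
print(["".join(sorted(q)) for q in covered])
```

With only two caches, 10 of the 15 possible queries can be answered: every query containing a, plus cd and bcd.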


Optimization problem

• A vocabulary $T = \{t_1, t_2, \dots, t_m\}$: all terms in the query load.
– A query $q = \{t_1, t_2, \dots, t_n\}$: $q \in 2^T$
– A document $d = \{t_1, t_2, \dots, t_r\}$: $d \in 2^T$

• A query load $L = \{q_1, q_2, \dots, q_l\}$: $q_i \in 2^T$
– $p(q_i)$: probability, $|RS(q_i)|$: result set size for $q_i \in L$

• A cachehit function:
– $\mathrm{cachehit}(q) = 1$ if there exists a cached query $q'$ subsuming $q$;
– $\mathrm{cachehit}(q) = 0$ otherwise.

• Problem: find a set of cached queries $\Omega$ such that:
– $\Omega = \arg\max \sum_{q_i \in L} \mathrm{cachehit}(q_i) \cdot p(q_i)$
– subject to the storage constraint $S_\Omega = \sum_{q_i \in \Omega} |RS(q_i)| < S_0$

A document $d$ is a valid answer for a query $q$ <=> $d$ contains $q$
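To make the objective concrete, here is a brute-force sketch on a tiny, made-up query load. The probabilities, result set sizes and the budget `s0=60` are hypothetical numbers chosen for illustration, and exhaustive search is used only because the instance is tiny; it is not presented as the method DCT uses.

```python
from itertools import combinations

# Hypothetical query load: query -> (probability p(q), result set size |RS(q)|).
LOAD = {
    frozenset("a"): (0.2, 100),
    frozenset("ab"): (0.3, 20),
    frozenset("cd"): (0.3, 30),
    frozenset("acd"): (0.2, 5),
}


def cachehit(q, omega):
    """1 if some cached query in omega subsumes q, else 0."""
    return 1 if any(cached <= q for cached in omega) else 0


def expected_hit(omega, load):
    """Objective: sum of cachehit(q) * p(q) over the query load."""
    return sum(cachehit(q, omega) * p for q, (p, _) in load.items())


def storage(omega, load):
    """Storage used: sum of |RS(q)| over the cached queries."""
    return sum(load[q][1] for q in omega)


def best_cache_set(load, s0):
    """Exhaustively search all subsets of the load that satisfy storage < s0."""
    queries = list(load)
    best, best_score = (), 0.0
    for r in range(1, len(queries) + 1):
        for omega in combinations(queries, r):
            if storage(omega, load) >= s0:
                continue
            score = expected_hit(omega, load)
            if score > best_score:  # strict: ties keep the smaller set found first
                best, best_score = omega, score
    return set(best), best_score


omega, score = best_cache_set(LOAD, s0=60)
print(sorted("".join(sorted(q)) for q in omega), round(score, 3))
```

In this toy instance caching ab and cd is enough: acd is answered through cd, and a is too large to fit, so it would be left to broadcast.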


DCT: Indexing and caching strategy

• DCT caches result sets of certain queries without constraining physical cache locations

• Each peer runs two services:
– Meta-index service: stores index items with cache locations
– Caching service: answers a query from a cache

• Meta-index: given a query q finds a list of cache locations capable of answering q.

• Cache service: returns the result set for q from the q’ cache (q’ subsumes q).


DCT: Meta-index

• Meta-index is based on the standard DHT indexing functionality.

• Index update: if a peer π caches a query q, it advertises the cache availability in the meta-index: it inserts a tuple {q -> address(π)} at the peer responsible for a random term from q.

• Lookup: If a query q=t1&t2&…&tn is submitted, every peer responsible for t1,t2…tn is asked to provide a set of caches it indexes that subsume q. One of them (if any) is chosen randomly.
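The update and lookup rules can be sketched as follows. This is a single-process illustration, not the DCT implementation: the class `MetaIndex`, the peer addresses, and the use of a dictionary keyed by term in place of the peers responsible for each term are all assumptions.

```python
import random
from collections import defaultdict


class MetaIndex:
    """Toy meta-index: each dictionary key stands for the DHT peer
    responsible for that term."""

    def __init__(self, seed=0):
        self.by_term = defaultdict(list)  # term -> [(cached query, cache address)]
        self.rng = random.Random(seed)

    def advertise(self, query_terms, address):
        """Index update: register the cache under one random term of the query."""
        q = frozenset(query_terms)
        term = self.rng.choice(sorted(q))
        self.by_term[term].append((q, address))

    def lookup(self, query_terms):
        """Ask the peer of every query term for caches subsuming the query,
        then pick one candidate at random (None if there is no candidate)."""
        q = frozenset(query_terms)
        candidates = []
        for term in sorted(q):
            for cached_query, address in self.by_term.get(term, []):
                if cached_query <= q:  # the cached query subsumes q
                    candidates.append((cached_query, address))
        return self.rng.choice(candidates) if candidates else None


meta = MetaIndex(seed=1)
meta.advertise("cd", "peer-7")  # hypothetical cache locations
meta.advertise("a", "peer-3")
meta.advertise("bd", "peer-9")

print(meta.lookup("acd"))  # one of the caches for "a" or "cd"
print(meta.lookup("b"))    # None: no cached query subsumes "b"
```

Registering a cache under a single random term is sufficient for lookups: any subsuming cached query consists only of terms of q, so it is always stored at one of the peers the lookup contacts.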


DCT: Meta-index example

q=“acd” is submitted at πorig

[Figure: P2P network with the originator πorig, meta-index entries “a” at πa and “cd” at πc, and a peer π caching RS(“cd”). Arrows numbered (1) to (4) correspond to the steps below. Legend: meta-index, cache.]

1. πorig looks up the meta-index: contacts peers πa, πc and πd**

2. πa, πc and πd respond with known locations of caches subsuming q

3. πorig randomly selects a cache from the obtained list. Assume “cd” is picked.

4. RS(q) is sent to πorig

** interactions with πd are not shown


DCT: Cache Management

• Each peer provides some storage space s0 for caches

• Caches with low profits are evicted:

profit(q)=popularity(q) / (|RS(q)|+1)

• Every time a peer has to broadcast a query, it tries to cache it

• The query q with the result set size |RS(q)| is cached if either:
– There is enough free space to store |RS(q)|, or
– There is NOT enough free space, but the least profitable caches can be dropped to fit the cache for q.
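A possible reading of this admission and eviction rule is sketched below. The class `PeerCache` is hypothetical, and one detail is an assumption not stated on the slide: a cached query is only evicted if its profit is strictly lower than the profit of the incoming query.

```python
class PeerCache:
    """Result-set cache of a single peer with profit-based eviction."""

    def __init__(self, capacity: int):
        self.capacity = capacity  # s0, measured in records
        self.entries = {}         # query -> (result set size, popularity)

    @staticmethod
    def profit(popularity: float, rs_size: int) -> float:
        # profit(q) = popularity(q) / (|RS(q)| + 1), as on the slide
        return popularity / (rs_size + 1)

    def used(self) -> int:
        return sum(size for size, _ in self.entries.values())

    def try_cache(self, query, rs_size: int, popularity: float) -> bool:
        """Try to cache a query that has just been broadcast."""
        if query in self.entries:
            return True
        if rs_size > self.capacity:
            return False
        free = self.capacity - self.used()
        new_profit = self.profit(popularity, rs_size)
        victims = []
        by_profit = sorted(self.entries.items(),
                           key=lambda item: self.profit(item[1][1], item[1][0]))
        for cached_query, (size, pop) in by_profit:  # least profitable first
            if free >= rs_size:
                break
            if self.profit(pop, size) >= new_profit:
                break  # assumption: never evict an equally or more profitable cache
            victims.append(cached_query)
            free += size
        if free < rs_size:
            return False
        for cached_query in victims:
            del self.entries[cached_query]
        self.entries[query] = (rs_size, popularity)
        return True


cache = PeerCache(capacity=100)
print(cache.try_cache("a", rs_size=60, popularity=6))   # fits into free space
print(cache.try_cache("b", rs_size=40, popularity=40))  # fits into free space
print(cache.try_cache("c", rs_size=50, popularity=50))  # evicts "a"
print(cache.try_cache("d", rs_size=30, popularity=1))   # rejected: profit too low
print(sorted(cache.entries))
```

Dividing popularity by result set size favors many small, frequently asked result sets over a few large ones, which is why a separate mechanism is needed for popular queries with large result sets.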


DCT: Top-K caching

• Problem:
– A popular query q with a large result set might NOT be cached as its profit is relatively low

• Solution:
– Introduce a top-k cache:
– Can serve only q, no subsumption;
– But consumes little space, avoids broadcasting the popular q
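A minimal sketch of the top-k idea, assuming that results carry a relevance score and that the k best-scoring ones are kept. The scoring, the value of k and the helper names `top_k_entry` and `can_serve` are assumptions; the slide does not specify them.

```python
import heapq


def top_k_entry(scored_results, k):
    """Keep only the k best-scoring (document id, score) pairs of a result set."""
    return heapq.nlargest(k, scored_results, key=lambda pair: pair[1])


def can_serve(cached_query, query):
    """A top-k cache holds an incomplete result set, so it can only answer
    exactly the query it was built for (no subsumption)."""
    return frozenset(cached_query) == frozenset(query)


# Hypothetical scored result set of a popular query.
full_result = [("d1", 0.2), ("d2", 0.9), ("d3", 0.5), ("d4", 0.1)]
entry = top_k_entry(full_result, k=2)

print(entry)
print(can_serve("ab", "ab"), can_serve("ab", "abc"))
```

Because the truncated list may omit documents that match a more specific query, filtering it cannot yield a complete answer, which is why such a cache serves only its own query.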


Evaluation


Evaluations: query load and data

• Source data:
– English Wikipedia XML dump (6 GB) 05.2006
– Two Wikipedia query traces from August and September 2004

• Query load properties (August trace):
– 1.3M unique queries, asked 4.6M times during the month
– 500K repeated at least twice, 800K only once
– 225K unique terms in both traces (after stemming)
– Average number of terms in a query = 2.6

• Java simulation:
– Simulates a number of virtual peers
– Each peer provides 200K records of storage space


Evaluations: how much storage do we need?

[Figure: cache hit (%) versus number of peers (log scale, 1 to 1K), each peer providing 200K records of capacity. Series: CacheHit, SubsumHit, TopKHit.]

• 98% max cache hit with unlimited storage
• 81% max cache hit with unlimited storage but no subsumption


Evaluations: Traffic consumption

• 100 peers, 200K each

• Converges to 85% cache hit with 100x200K=20M records global cache capacity

• The naïve approach requires at least 240M records for the term index (if built for query load terms only)

[Figure: cache hit and space utilization (%) versus number of generated queries (0 to 6.0M). Series: CacheHit (%), SubsumHit (%), TopKHit (%), SpaceUtilization (%).]

[Figure: average traffic per query (records, log scale from 1 to 100K) versus number of generated queries (0 to 6.0M). Series: Naive-Random, Naive-Sort, Broadcast, DCT-All.]

Evaluations: stress test

• 300 peers, 200K each

• Converges to 97% cache hit with 300x200K=60M capacity

• Thanks to subsumption, the cache hit rate drops only slightly when the query load changes

[Figure: cache hit (%) versus number of generated queries (0 to 9.0M), with the start of the new query load marked. Series: CacheHit, SubsumHit, TopKHit, SpaceUtilization.]

[Figure: average traffic per query (records, log scale from 1 to 100K) versus number of generated queries (0 to 9.0M), with the start of the new query load marked. Series: Naive-Random, Naive-Sort, Broadcast, DCT-All.]


Evaluations: load balancing

• Cache imbalance => only several peers are overloaded
• Meta-index imbalance => has less impact, can be partially avoided

[Figure: cache load (%) per peer; y-axis 0 to 5, x-axis peers 0 to 90.]

[Figure: meta-index load (%) per peer; y-axis 0 to 5, x-axis peers 0 to 90.]


Conclusions

• Distributed Cache Table: a (quite) large scale distributed cache for P2P IR applications based on both:
– Query load
– Data distribution

• Properties:
– Efficiently utilizes and adapts to the available storage space
– Trade-off between huge index size and extra traffic costs for broadcasting rare queries
– Subsumption is important: resilient to query load changes
– Sufficiently load balanced
– Requires 1-2 orders of magnitude less traffic than the naive approach
– Requires substantially less storage than a per-term index


Last slide

Thank you for your attention! Questions?
