Upload
lawrence-marvin
View
229
Download
3
Tags:
Embed Size (px)
Citation preview
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer
DDistributed istributed TTable: able:
Efficient Query-Driven Efficient Query-Driven Processing of Multi-Term Queries Processing of Multi-Term Queries in P2P Networksin P2P Networks
CachCacheeHashHash
P2PIR’2006, collocated with CIKM’06, Arlington VA, USAP2PIR’2006, collocated with CIKM’06, Arlington VA, USA
Gleb Skobeltsyn, Karl Aberer
Nov 11, 2006
EPFL Ecole Polytechnique Fédérale de Lausanne, Switzerland
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 22 // 2525
Problem definitionProblem definition
• Given a document corpus stored in a DHT P2P network
• Provide an efficient indexing mechanism to find matching documents given a multi-term query
• Traffic consumption to be minimized
• The storage space provided by peers is limited
• Solutions: broadcast, naïve indexing of terms, HDK…
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 33 // 2525
How the naïve approach works (1)?How the naïve approach works (1)?
• Naïve approach 1: store terms’ Inverted Lists in a DHT• An inverted lists contains document ids.
K I
K I
K I
K I
K I
K I
K I
K I
Query: “T1 AND T2”
{I1,I2}
{I2}
(h(T1), {I1,I2})
(h(T2), {I2,I3})(h(T3), {I4,I5})
K I
This slide was borrowed from B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker, I. Stoica presentation: Enhancing P2P File-Sharing with an Internet-Scale Query Processor
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 44 // 2525
How the naïve approach works (2)?How the naïve approach works (2)?
• Naïve approach 2: store terms’ Inverted Lists in a DHT• An inverted lists contains document summaries.
K I
K I
K I
K I
K I
K I
K I
K I
Query: “T1 AND T2”
{I2}
(h(T1), {I1,I2})
(h(T2), {I2,I3})(h(T3), {I4,I5})
K I
{I2}
OROR
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 55 // 2525
Can we do better?Can we do better?
• Inverted lists can be very large => consume traffic• Indexing of all/selected terms in all documents =>
huge redundancy in the index, space limitations• Indexing of term combinations => how to choose
them?• Many index items are never or very rarely used.
• Our idea:– Indexing=cachingIndexing=caching– Efficiently fill in the available (distributed) storage
space with result sets for popular queries– Use stored caches to answer queries
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 66 // 2525
What is our idea?What is our idea?
• Conventionally, index is generated purely from the data
• Very large number of unused index entries
Let us use the query popularity distribution by gathering Let us use the query popularity distribution by gathering statistics!statistics!
• We try to build an index specifically targeted for the current query log
• The size of the index is bounded by the available storage provided by peers
• Everything which is not indexed is searched via broadcast
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 77 // 2525
• Given a set of documents, each doc contains a set of terms
• We have an inverted index over all extracted terms: {key=h(term)} – {inverted list}
What is our idea? Another explanationWhat is our idea? Another explanation
T1
T2
T3
T4
T5
T6
T7
T8
T9
D1D1
T1, T2, T3T1, T2, T3D2D2
T1, T4, T5T1, T4, T5D3D3
T1,T2,T6T1,T2,T6
D4D4
T5,T6,T7T5,T6,T7D5D5
T1,T8,T9T1,T8,T9
D o c u m e n t s: Search Keys:
Inverted lists:
D1, D2, D3, D5
D1, D3
D1
D2
D2, D3, D4
D4
D4
D5
D5
Query popularity
T1 & T2 very high
T3 high
T3 & T4 high
T7 low
T8 & T9 very low
D1, D3T1&T2
• We can monitor We can monitor Query LoadQuery Load statistics: statistics:
Delete unused Delete unused index entriesindex entries
Index termIndex termcombinationscombinations(queries)(queries)
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 88 // 2525
Idea: exampleIdea: example
Query Inv.list
Flooding 3
Query & P2P 2
Term Inv.list
Efficient 1
Search 1,3
P2P 1,2
Query 2
Processing 2
Network 2,3
Flooding 3
ID Data
1 Efficient search in P2P
2 Query processing in P2P networks
3 Search via network flooding
Query statistics
search flooding
search query P2P
flooding
query processing P2P
query P2P
Data:
Index:
Term Inv.list
Efficient 1
Query 2
Processing 2
Flooding 3
P2P & Search 1
Network & Search
3
Network & P2P 2
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 99 // 2525
What are we searching for?What are we searching for?
Cache Cache all queriesall queries
Index Index all dataall data
Query-driven indexing structure
Query subsumption? Unused index items?
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 1010 // 2525
ContentsContents
• Motivation & IdeaMotivation & Idea• Query subsumptionQuery subsumption• Optimization problemOptimization problem
• DCTDCT’s indexing and caching strategy:’s indexing and caching strategy:– Meta-indexMeta-index– Cache managementCache management– Top-K cachingTop-K caching– Load BalancingLoad Balancing
• EvaluationsEvaluations• ConclusionsConclusions
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 1111 // 2525
Query subsumptionQuery subsumption
• Given a query q, we are interested in locating at least one cache for a query q’ s.t.: RS(q’) contains RS(q)
• Query subsumptionQuery subsumption: q’ subsumes q if all terms of q’ are contained in q. That means RS(q’) contains RS(q).
• We can demonstrate subsumption on a lattice of size 2m-1, where m is the number of terms
a b c d
ab bc adac bd cd
abc abdacd bcdabcd
a b c d
ab bc adac bd cd
abc abdacd bcdabcd
Query subsumption if a and cd are cached
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 1212 // 2525
Optimization problemOptimization problem
• A vocabulary T=t1,t2…tm: all terms in the query load.– A query q=t1,t2…tn: q in 2T
– A document d=t1,t2…tr: d in 2T
• A Query load L=q1,q2…ql: qi in 2T, – p(qi) – probability, |RS(qi)| – result set size for qi in L
• A cachehit function: – cachehit(q)=1, if there exists a cached query q’ subsuming q;– cachehit(q)=0, otherwise.
• Problem: to find a set of cached queries Ω, s.t:– Ω=argmax Σqi in L cachehit(qi)*p(qi)
– Having a storage constraint: SΩ = Σqi in Ω |RS(qi)|<S0
A document d is the valid answer for a query q <=> d contains q
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 1313 // 2525
DCT: Indexing and caching strategyDCT: Indexing and caching strategy
• DCT caches result sets of certain queries without constraining physical cache locations
• Each peer is running two services:– Meta-index service: stores index items with cache
locations – Caching service: answers a query form a cache
• Meta-index: given a query q finds a list of cache locations capable of answering q.
• Cache service: returns the result set for q from the q’ cache (q’ subsumes q).
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 1414 // 2525
DCT: Meta-indexDCT: Meta-index
• Meta-index is based on the standard DHT indexing functionality.
• Index update: If a peer π caches a query q, it advertise the cache availability in the meta-index:
It inserts a tuple {q-> address(π)} at the peer responsible for a random term from q.
• Lookup: If a query q=t1&t2&…&tn is submitted, every peer responsible for t1,t2…tn is asked to provide a set of caches it indexes that subsume q. One of them (if any) is chosen randomly.
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 1515 // 2525
DCT: Meta-index exampleDCT: Meta-index example
cd
πc
a
πa
π RS(“cd”)
q=”acd”?
metaindex
cache
Legend:
P2P
πorig(1)
(1)
(2)
(2)
(3)
(4)
1. πorig looks up the meta-index: contacts
peers πa, πc and πd**
2. πa, πc and πd response with known locations of caches subsuming q
3. πorig randomly selects a cache from the obtained list. Assume “cd” is picked.
4. RS(q) is sent to πorig
** interactions with πd are not shown
q=“acd”q=“acd” is submitted is submitted at πorig
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 1616 // 2525
DCT: Cache ManagementDCT: Cache Management
• Each peer provides some storage space s0 for caches
• Caches with low profits are evicted:
profit(q)=popularity(q) / (|RS(q)|+1)
• Every time a peer has to broadcast a query, it tries to cache it
• The query q with the result set size |RS(q)| is cached if:– There is enough free space to store |RS(q)|,– There is NOT enough free space but the least
profitable caches can be dropped to fit q cache.
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 1717 // 2525
DCT: Top-K cachingDCT: Top-K caching
• Problem: – A popular query q with a large result set
might NOT be cached as its profit is relatively low
• Solution:– Introduce a top-k cache:– Can serve only q, no subsumption;– But consumes little space, avoids broadcasting
the popular q
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 1818 // 2525
EvaluationEvaluation
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 1919 // 2525
Evaluations: query load and dataEvaluations: query load and data
• Source data:– English Wikipedia XML dump (6Gb) 05.2006– Two Wikipedia query traces from August and September
2004
• Query load properties (August trace):– 1.3M unique queries, asked 4.6M times during the
month– 500K repeated at least twice, 800K only once– 225K unique terms in both traces (after stemming)– Average number of terms in a query = 2.6
• Java simulation:– Simulates a number of virtual peers– Each peer provides 200K records of storage space
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 2020 // 2525
Evaluations: how much storage do we need?Evaluations: how much storage do we need?
35
50
1000*250 500
100
1
10
20
0
10
20
30
40
50
60
70
80
90
100
1 10 100 1K
number of peers, each peer provides 200K records capacity
cach
hit
(%)
CacheHit
SubsumHit
TopKHit
98% max98% max cache cache hit withhit with unlimited unlimited storagestorage
81% max81% max cache cache hit withhit with unlimited unlimited storage but storage but nono subsumptionsubsumption
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 2121 // 2525
Evaluations: Traffic consumptionEvaluations: Traffic consumption
• 100 peers, 200K each
• Converges to 85% cache hit with 100x200K=20M records global cache capacity
• The naïve approach requires at least 240M records for the term index (if built for query load terms only)
0
10
20
30
40
50
60
70
80
90
100
0 2.0M 4.0M 6.0M
number of generated queries
cach
ehit,
spa
ce u
tiliz
atio
n (%
) CacheHit (%)
SubsumHit (%)
TopKHit (%)
SpaceUtilization (%)
1
10
100
1K
10K
100K
0 2.0M 4.0M 6.0M
number of generated queries
aver
age
traffi
c pe
r que
ry (r
ecor
ds)
Naive-Random
Naive-Sort
Broadcast
DCT-All
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 2222 // 2525
Evaluations: stress testEvaluations: stress test
• 300 peers, 200K each
• Converges to 97% cache hit with 300x200K=60M capacity
• Very small cache hit drop when changing the load due to the subsumption
0
10
20
30
40
50
60
70
80
90
100
0 4.5M 9.0M
number of generated queries
cach
ehit
(%)
CacheHit
SubsumHit
TopKHit
SpaceUtilization
New
query load
1
10
100
1K
10K
100K
0 4.5M 9.0M
number of generated queries
aver
age
traffi
c pe
r que
ry (r
ecor
ds)
Naive-Random
Naive-Sort
Broadcast
DCT-All
New
query load
log
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 2323 // 2525
Evaluations: load balancingEvaluations: load balancing
• Cache imbalance => only several peers are overloaded• Meta-index imbalance => has less impact, can be
partially avoided
012345
0 10 20 30 40 50 60 70 80 90
peers
Cac
he lo
ad (%
)
012345
0 10 20 30 40 50 60 70 80 90
peers
Met
a-in
dex
load
(%)
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 2424 // 2525
ConclusionsConclusions
• Distributed Cache Table: a (quite) large scale distributed cache for P2P IR applications based on both:– Query load– Data distribution
• Properties:– Efficiently utilizes and adapts to the available storage space – Trade off between huge index size and extra traffic costs
for broadcasting rare queries– Subsumption is important: resilient to query load changes– Sufficiently load balanced– Requires 1-2 orders of magnitude less traffic than the naive
approach– Requires substantially less storage then per-term index
P2PIR'06: "Distributed Cache Table (DCT)" Gleb Skobeltsyn, Karl Aberer 2525 // 2525
Last slideLast slide
Thank you for your attention!Questions?