48
pNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands [email protected]

PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands [email protected]

Embed Size (px)

Citation preview

Page 1: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

pNear Combining Content Clustering and Distributed Hash-Tables

Ronny Siebes

Vrije Universiteit, AmsterdamThe [email protected]

Page 2: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Presentation outline Problem statement Related work Our approach

Model Example

Simulation Results

Page 3: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Problem statement

Searching content:Centralized solutions and semi-centralized soluctions are often working fine (Google, KaZaa, Napster), at least for popular content. However in some cases completely decentralized solutions are needed.

Especially in cases where privacy, non-sensorship and undisclosed content plays a role

Page 4: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Problem statement

Main aim:How to efficiently find content in a completely distributed P2P network?

For the user we want high satisfaction (precision and recall of documents) For the network we want high performance (low nr of messages )

Page 5: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Related work and their limitations…

Broadcasting/ Random forwarding (Gnutella) Many messages and/or low recall

Distributed hash-tables (Pastry, Chord, CAN) Expensive maintenance when peers move/leave/join or content changes Single point of failures for keys (traditional) or multiple points

(multiplying cost) No Load-balancing

Semantic overlay networks Shared ‘semantic’ datastructures (P-Search, Bibster (build on

jxta), [Crespo et al]) Rich vs small datastructure Expensive updating

Term clustering (Voulgaris et al., FreeNet) Difficult to find clusters: assumption that the peer’s questions are

related to its expertise (cluster) and the assumption that a peer has a description may both not be realistic in some cases

Page 6: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

pNear

pNear : Combines Distributed Hash-Tables

Term Clustering

Goal -> reducing the disadvantages of both individual approaches and keeping the good

properties.

Subgoal: getting an overview of the properties of the combination: which parameters are

important.

Page 7: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

pNearuse DHT to find a (small) set of relevant

peers for a query, and use a SON to find the remaining (larger) set of relevant peers.

Only a small fixed set of peers is stored on the node responsible for the key (instead of all in pure DHT). Load balancing is handled automatically by the SON.

Page 8: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

A pNear peer

Each peer describes/summarizes its content in a set of terms which we call an expertise description.

These descriptions are stored in expertise registers via DHT where the keys are a subset of individual terms from those expertise descriptions. In this way expertise registers responsible for a term are responsible for maintaining a list of peers that have registered themselves for that term (i.e. which functions as a kind of ‘yellow page’)

Page 9: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

A pNear peer

Besides maintaining expertise registers, each peer also has its own expertise cache, in which it stores peers with related expertise resulting in an Semantic Overlay Network. These related peers are found via the clustering process

Peers use the expertise registers for getting pointers to some peers relevant for the queries

Peers use the expertise caches for finding remaining peers within the clusters of relevant peers

Page 10: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

A pNear peer

Expertise

Cache

Expertise Registers

Identifier

Document

Storage

Expertise description: [t1, t2, … , tn]

tx

ty

tz

Page 11: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

How pNear works1. Let N be the set of neighbors of a peer p in

the SON2. Select a (small) subset of terms S from the

terms in the expertise description e_p from p

3. For each term s in S do:i. Hash the term s ii. Send a ‘register message’ (containing

e_p and the network id of p) via the DHT overlay (using the key of s as message id) to the register responsible for the term s. The register responds with a set of [peer_id, relevance] pairs, R

Page 12: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

How pNear works4. Select a subset r from R (where r is not

visited before in the clustering process) and do:

i. Send an advertisement message (containing e and the peers id) to r

ii. r responds (if online) with a set of [peer_id, E, relevance] triples R’

iii. The [peer_id, relevance] pairs are added to R and the [peer_id, E] pairs are added to N.

Continue step 4 untill a maximum number.

Page 13: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

How pNear worksCosts of Clustering in amount of direct

messages:

log(nrOfPeers)*#registered_terms + #advertisementMsgs +#advertisementResultMsgs

Page 14: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

How pNear works1. Determine if its needed to query the registers or

not.2. - If yes, than for each query term q in the query:

1. Send a query consult message (containing q, the peer_id) to the register r responsible for term q.

2. r responds (if online) with a set of [peer_id, relevance] triples R

3. When possible, add to R a set of neigbors ( [peer_id, relevance] pairs) from N that are relevant for the query (based on their expertise descriptions)

- If no, than add to R a set of neigbors ( [peer_id, relevance] pairs) from N that are relevant for the query (based on their expertise descriptions)

Page 15: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

How pNear works

3. Select a subset r from R (where r is not visited before in the query process) and do:

i. Send a query message (containing the query and the peers id) to r

ii. r responds (if online) with a set of [peer_id, relevance] pairs R’ and a set of documents (answers to query)

iii. The [peer_id, relevance] pairs are added to R’ and the query results are presented to the user

Continue step 3 untill a maximum number of query rounds have passed or until the user stops the process.

Page 16: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

How pNear works

Costs of querying in terms of messages:

Using registers log(nrOfPeers) x #queryterms + #queryMsgs +

#queryResultMsgs

Without using registers

#queryMsgs + #queryResultMsgs

Page 17: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Example

Page 18: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Example

Extracting expertise description…

Page 19: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Example

1223

Suppose one peer with identifier 1223 unique in the system

Page 20: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Example

1223

Peer 1223 has a set of documents (research articles) that it wants to share in the network

Page 21: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Example

An extraction tool is applied over the documents, resulting in the expertise description of the peer

TextToOnto

1223

[animal,dog,bulldog,mammal,cat]

documents

Expertise description

Page 22: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Registering expertise descriptions…

Example

Page 23: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Example

Peer 1223 joins the network

2665

3443

12233100

7665

[animal,dog,bulldog,mammal,cat]

Page 24: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Example

2665

3443

1223

7665

3100

Each peer has one or more expertise registers, where the peers can register their expertise descriptions

[animal,dog,bulldog,mammal,cat]

Page 25: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Example

Peer 1223 registers its expertise description (via DHT) in the topic register of the peers 3443 and 2665

2665

3443

1223

7665

3100[animal,dog,bulldog,mammal,cat]

[(1223) animal,dog,bulldog,mammal,cat]

[(1223) animal,dog,bulldog,mammal,cat]

Reg

Reg

Page 26: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Example

Imagine that 2665 has a previously registered description from 3100. It calculates the relevance of 3100 for 1223 and in this case returns the [pointer, relevance] pair

2665

3443

1223

7665

3100[animal,dog,bulldog,mammal,cat]

[(3100) vegetable:9,tree:4,mammal:1][(1223)

animal,dog,bulldog,mammal,cat]

Reg

Page 27: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Example

1223 sends an advertisemt query to 3100

2665

3443

1223

7665

3100[animal,dog,bulldog,mammal,cat]

[3100,0.9]

Page 28: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Example

3100 answers with its expertise description and it returns from his cache also 7665 (peer_id, relevance pair) as a relevant peer for 1223.

2665

3443

1223

7665

3100[animal,dog,bulldog,mammal,cat]

[3100,0.9][7665,0.8]

[(7665) animal,dog, train,cat][(1223) animal, dog, bulldog, mammal, cat]

Cache

Page 29: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Example

1233 adds 3100 to its expertise cache

2665

3443

1223

7665

3100[animal,dog,bulldog,mammal,cat]

[3100,0.9][7665,0.8]

[(3100) vegetable:9,tree:4,mammal:1]

Cache

Page 30: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Example

1223 sends 7665 an advertisement message, where 7665 stores the expertise description of 1223

2665

3443

1223

7665

3100[animal,dog,bulldog,mammal,cat]

[3100,0.9][7665,0.8]

[(1223) animal, dog, bulldog, mammal, cat]

Cache

Page 31: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Example

Solving a query…

Page 32: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Example

Imagine a new peer 9999 that does not want to be clustered in the network.

2665

3443

1223

7665

3100 9999

Page 33: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Example

Peer 9999 has a query [cat] and imagine that 2665 is the register responsible for that term than 9999 sends a register consult message to 2665 (via DHT)

2665

3443

1223

7665

3100 9999

Page 34: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Example

Peer 2665 answers that 1223 is a good candidate, and 9999 adds 1223 to the list of potential candidates

2665

3443

1223

7665

3100 9999

[(3100) vegetable,tree,mammal][(1223)

animal,dog,bulldog,mammal,cat]

[1223,0.9]

Page 35: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Example

Peer 9999 queries peer 1223

2665

3443

1223

7665

3100 9999

[1223,0.9][(3100) vegetable,cat,mammal]

Page 36: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Example

Peer 1223 returns matching documents and 3100 as a relevant peer

2665

3443

1223

7665

3100 9999

[1333,0.9][3100,0.3]

[(3100) vegetable,cat,mammal]

Page 37: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Example

Peer 9999 queries 3100

2665

3443

1223

7665

3100 9999

[3100,0.9][3100,0.3]

Page 38: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Example

In this example 3100 also has some documents that match the query but no further pointers

2665

3443

1223

7665

3100 9999

[3100,0.9][3100,0.3]

Page 39: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Simulation Query set crawled from Excite’s

“SearchSpy” consists of +/- 30.000 real user queries

For each query, we included the documents (web-pages) from the max 100 hits returned by Google

Page 40: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Simulation For each web-page we extracted

around 100 terms by TextToOnto NLP tool, which serves as expertise description for each peer (web-pages are represented as peers in our simulation) resulting in a dataset of more than 1M peers

Simulation platform runs up to 400.000 peers

Page 41: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Results

Total number of nodes in the system 100,000 Maximum number of expertise descriptions in a peer’s

register 5 Maximum number of expertise descriptions in a peer’s cache

50 Maximum number of recommendations given by a register 5 Maximum number of recommendations given by a cache 30 Maximum number of advertisement rounds per

advertisement initialization 5 Maximum number of neighbors selected per advertisement

round 4 Maximum number of neighbors selected per query round 7 Average number of terms to register selected from expertise

description 3

Page 42: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Results

Only 3 out of 100 terms are needed to be registered to get 35% recall after

sending < 200 query messages in the 100K network

Page 43: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Results

Cache size is more important than nr. of recs. Still, only 50 slots are needed to

get a recall of 35% with less than 200K query messages

Page 44: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Results

Selecting more neighbors to query helps.

Useful when user wants many results very fast

Page 45: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Results

More advertising helps.

Page 46: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Results

When on average 3 topics per ED are registered, the register does not need

much more slots.

Page 47: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Results

Even in a network of 400K peers, <200 query messages are needed to get a

recall of 25%

Page 48: PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands ronny@cs.vu.nl

Questions