PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands [email protected]

pNear Combining Content Clustering and Distributed Hash-Tables

Ronny Siebes

Vrije Universiteit, AmsterdamThe [email protected]

Presentation outline Problem statement Related work Our approach

Model Example

Simulation Results

Problem statement

Searching content:Centralized solutions and semi-centralized soluctions are often working fine (Google, KaZaa, Napster), at least for popular content. However in some cases completely decentralized solutions are needed.

Especially in cases where privacy, non-sensorship and undisclosed content plays a role

Problem statement

Main aim:How to efficiently find content in a completely distributed P2P network?

For the user we want high satisfaction (precision and recall of documents) For the network we want high performance (low nr of messages )

Related work and their limitations…

Broadcasting/ Random forwarding (Gnutella) Many messages and/or low recall

Distributed hash-tables (Pastry, Chord, CAN) Expensive maintenance when peers move/leave/join or content changes Single point of failures for keys (traditional) or multiple points

(multiplying cost) No Load-balancing

Semantic overlay networks Shared ‘semantic’ datastructures (P-Search, Bibster (build on

jxta), [Crespo et al]) Rich vs small datastructure Expensive updating

Term clustering (Voulgaris et al., FreeNet) Difficult to find clusters: assumption that the peer’s questions are

related to its expertise (cluster) and the assumption that a peer has a description may both not be realistic in some cases

pNear

pNear : Combines Distributed Hash-Tables

Term Clustering

Goal -> reducing the disadvantages of both individual approaches and keeping the good

properties.

Subgoal: getting an overview of the properties of the combination: which parameters are

important.

pNearuse DHT to find a (small) set of relevant

peers for a query, and use a SON to find the remaining (larger) set of relevant peers.

Only a small fixed set of peers is stored on the node responsible for the key (instead of all in pure DHT). Load balancing is handled automatically by the SON.

A pNear peer

Each peer describes/summarizes its content in a set of terms which we call an expertise description.

These descriptions are stored in expertise registers via DHT where the keys are a subset of individual terms from those expertise descriptions. In this way expertise registers responsible for a term are responsible for maintaining a list of peers that have registered themselves for that term (i.e. which functions as a kind of ‘yellow page’)

A pNear peer

Besides maintaining expertise registers, each peer also has its own expertise cache, in which it stores peers with related expertise resulting in an Semantic Overlay Network. These related peers are found via the clustering process

Peers use the expertise registers for getting pointers to some peers relevant for the queries

Peers use the expertise caches for finding remaining peers within the clusters of relevant peers

A pNear peer

Expertise

Cache

Expertise Registers

Identifier

Document

Storage

Expertise description: [t1, t2, … , tn]

tx

ty

tz

How pNear works1. Let N be the set of neighbors of a peer p in

the SON2. Select a (small) subset of terms S from the

terms in the expertise description e_p from p

3. For each term s in S do:i. Hash the term s ii. Send a ‘register message’ (containing

e_p and the network id of p) via the DHT overlay (using the key of s as message id) to the register responsible for the term s. The register responds with a set of [peer_id, relevance] pairs, R

How pNear works4. Select a subset r from R (where r is not

visited before in the clustering process) and do:

i. Send an advertisement message (containing e and the peers id) to r

ii. r responds (if online) with a set of [peer_id, E, relevance] triples R’

iii. The [peer_id, relevance] pairs are added to R and the [peer_id, E] pairs are added to N.

Continue step 4 untill a maximum number.

How pNear worksCosts of Clustering in amount of direct

messages:

log(nrOfPeers)*#registered_terms + #advertisementMsgs +#advertisementResultMsgs

How pNear works1. Determine if its needed to query the registers or

not.2. - If yes, than for each query term q in the query:

1. Send a query consult message (containing q, the peer_id) to the register r responsible for term q.

2. r responds (if online) with a set of [peer_id, relevance] triples R

3. When possible, add to R a set of neigbors ( [peer_id, relevance] pairs) from N that are relevant for the query (based on their expertise descriptions)

- If no, than add to R a set of neigbors ( [peer_id, relevance] pairs) from N that are relevant for the query (based on their expertise descriptions)

How pNear works

3. Select a subset r from R (where r is not visited before in the query process) and do:

i. Send a query message (containing the query and the peers id) to r

ii. r responds (if online) with a set of [peer_id, relevance] pairs R’ and a set of documents (answers to query)

iii. The [peer_id, relevance] pairs are added to R’ and the query results are presented to the user

Continue step 3 untill a maximum number of query rounds have passed or until the user stops the process.

How pNear works

Costs of querying in terms of messages:

Using registers log(nrOfPeers) x #queryterms + #queryMsgs +

#queryResultMsgs

Without using registers

#queryMsgs + #queryResultMsgs

Example

Example

Extracting expertise description…

Example

1223

Suppose one peer with identifier 1223 unique in the system

Example

1223

Peer 1223 has a set of documents (research articles) that it wants to share in the network

Example

An extraction tool is applied over the documents, resulting in the expertise description of the peer

TextToOnto

1223

[animal,dog,bulldog,mammal,cat]

documents

Expertise description

Registering expertise descriptions…

Example

Example

Peer 1223 joins the network

2665

3443

12233100

7665


Example

2665

3443

1223

7665

3100

Each peer has one or more expertise registers, where the peers can register their expertise descriptions


Example

Peer 1223 registers its expertise description (via DHT) in the topic register of the peers 3443 and 2665

2665

3443

1223

7665

3100[animal,dog,bulldog,mammal,cat]

[(1223) animal,dog,bulldog,mammal,cat]

[(1223) animal,dog,bulldog,mammal,cat]

Reg

Reg

Example

Imagine that 2665 has a previously registered description from 3100. It calculates the relevance of 3100 for 1223 and in this case returns the [pointer, relevance] pair

2665

3443

1223

7665


[(3100) vegetable:9,tree:4,mammal:1][(1223)

animal,dog,bulldog,mammal,cat]

Reg

Example

1223 sends an advertisemt query to 3100

2665

3443

1223

7665


[3100,0.9]

Example

3100 answers with its expertise description and it returns from his cache also 7665 (peer_id, relevance pair) as a relevant peer for 1223.

2665

3443

1223

7665


[3100,0.9][7665,0.8]

[(7665) animal,dog, train,cat][(1223) animal, dog, bulldog, mammal, cat]

Cache

Example

1233 adds 3100 to its expertise cache

2665

3443

1223

7665


[3100,0.9][7665,0.8]

[(3100) vegetable:9,tree:4,mammal:1]

Cache

Example

1223 sends 7665 an advertisement message, where 7665 stores the expertise description of 1223

2665

3443

1223

7665


[3100,0.9][7665,0.8]

[(1223) animal, dog, bulldog, mammal, cat]

Cache

Example

Solving a query…

Example

Imagine a new peer 9999 that does not want to be clustered in the network.

2665

3443

1223

7665

3100 9999

Example

Peer 9999 has a query [cat] and imagine that 2665 is the register responsible for that term than 9999 sends a register consult message to 2665 (via DHT)

2665

3443

1223

7665

3100 9999

Example

Peer 2665 answers that 1223 is a good candidate, and 9999 adds 1223 to the list of potential candidates

2665

3443

1223

7665

3100 9999

[(3100) vegetable,tree,mammal][(1223)

animal,dog,bulldog,mammal,cat]

[1223,0.9]

Example

Peer 9999 queries peer 1223

2665

3443

1223

7665

3100 9999

[1223,0.9][(3100) vegetable,cat,mammal]

Example

Peer 1223 returns matching documents and 3100 as a relevant peer

2665

3443

1223

7665

3100 9999

[1333,0.9][3100,0.3]

[(3100) vegetable,cat,mammal]

Example

Peer 9999 queries 3100

2665

3443

1223

7665

3100 9999

[3100,0.9][3100,0.3]

Example

In this example 3100 also has some documents that match the query but no further pointers

2665

3443

1223

7665

3100 9999

[3100,0.9][3100,0.3]

Simulation Query set crawled from Excite’s

“SearchSpy” consists of +/- 30.000 real user queries

For each query, we included the documents (web-pages) from the max 100 hits returned by Google

Simulation For each web-page we extracted

around 100 terms by TextToOnto NLP tool, which serves as expertise description for each peer (web-pages are represented as peers in our simulation) resulting in a dataset of more than 1M peers

Simulation platform runs up to 400.000 peers

Results

Total number of nodes in the system 100,000 Maximum number of expertise descriptions in a peer’s

register 5 Maximum number of expertise descriptions in a peer’s cache

50 Maximum number of recommendations given by a register 5 Maximum number of recommendations given by a cache 30 Maximum number of advertisement rounds per

advertisement initialization 5 Maximum number of neighbors selected per advertisement

round 4 Maximum number of neighbors selected per query round 7 Average number of terms to register selected from expertise

description 3

Results

Only 3 out of 100 terms are needed to be registered to get 35% recall after

sending < 200 query messages in the 100K network

Results

Cache size is more important than nr. of recs. Still, only 50 slots are needed to

get a recall of 35% with less than 200K query messages

Results

Selecting more neighbors to query helps.

Useful when user wants many results very fast

Results

More advertising helps.

Results

When on average 3 topics per ED are registered, the register does not need

much more slots.

Results

Even in a network of 400K peers, <200 query messages are needed to get a

recall of 25%

Questions

Documents

PNear Combining Content Clustering and Distributed Hash-Tables Ronny Siebes Vrije Universiteit, Amsterdam The netherlands [email protected]