Paraskevi Raftopoulou 1,2 and Euripides G.M. Petrakis 2

1 Max-Planck Institute for Informatics, Saarbruecken, Germany http://www.mpi-inf.mpg.de/

2 Technical University of Crete, Chania, Greece http://www.intelligence.tuc.gr/

A Measure for Cluster Cohesion in Semantic Overlay Networks

Workshop on Large-Scale Distributed Systems for Information Retrieval Napa Valley, California, 30 October, 2008

Paraskevi Raftopoulou Max-Planck Institute for Informatics & Technical University of Crete

Outline
- Motivation & Related work
- Distributed resource sharing
- iCluster architecture
- Measuring clustering quality
- Experimental evaluation
- Conclusion



Motivation & Related work




Motivation
- Resource sharing is at the core of today’s computing (Web, P2P, Grid)
- Information retrieval functionality is needed
- Overlay networks are a suitable technology to build on
- Measures are used for evaluating network organisation and retrieval efficiency




Related Work

Semantic Overlay Networks
- Initial approaches include: [KJ04], [SMZ03], [PMW07]
- Based on the idea of small-world networks: [Smi04], [LLS04], [VSI06], DESENT

Concepts & measures quantifying network organisation
- (Generalised) clustering coefficient: [WS98], [HAH07]
- Extensions/modifications: [FHJS02], [BGW08], [RMJ07], [FH06]



Distributed resource sharing




Semantic overlay networks
- Self-organising overlay networks
- The idea: peers that are semantically, thematically, or socially close (i.e., sharing similar interests or resources) are organised into groups. Queries are routed to the appropriate group.
- Peers hold routing indices with links to other peers
- Peers connected to each other are called neighbours
- Support rich data models and expressive query languages




Rewiring strategies
- Techniques for self-organising peers: abandon old connections and create new ones, as a periodic process
- Inspired by the ‘small world effect’: reach anybody in a small number of routing hops




Small-world networks
- Most peers are not neighbours of one another
- Every peer can be reached from every other peer by a small number of hops

Main characteristics:
1. small average shortest path length: most pairs of peers are connected by at least one short path
2. high clustering coefficient: there are cliques and subgraphs characterised by connections between almost any two peers within them
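
The two characteristics can be computed directly on a small graph. The sketch below is only illustrative: it assumes an undirected overlay stored as adjacency sets, and the function names and the toy ring graph are hypothetical, not part of iCluster.

```python
from collections import deque
from itertools import combinations


def hop_distances(graph, source):
    """Breadth-first search: hop count from source to every reachable peer."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_shortest_path_length(graph):
    """Mean hop distance over all ordered pairs of distinct, connected peers."""
    total, pairs = 0, 0
    for u in graph:
        for v, d in hop_distances(graph, u).items():
            if v != u:
                total += d
                pairs += 1
    return total / pairs


def local_clustering(graph, u):
    """Fraction of pairs of u's neighbours that are themselves linked."""
    nbrs = graph[u]
    if len(nbrs) < 2:
        return 0.0
    linked = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
    return linked / (len(nbrs) * (len(nbrs) - 1) / 2)


def average_clustering(graph):
    """Mean of the local clustering values over all peers."""
    return sum(local_clustering(graph, u) for u in graph) / len(graph)


def ring_lattice(n, k):
    """Toy graph: each peer links to its k nearest peers on each side of a ring."""
    return {i: {(i + d) % n for d in range(-k, k + 1) if d != 0} for i in range(n)}


ring = ring_lattice(12, 2)

# Same ring plus one long-range link between opposite peers.
shortcut = {u: set(vs) for u, vs in ring.items()}
shortcut[0].add(6)
shortcut[6].add(0)

print(average_shortest_path_length(ring), average_clustering(ring))
print(average_shortest_path_length(shortcut), average_clustering(shortcut))
```

In this toy setting a single long-range link shortens the average path while the ring's clustering stays high, which is the combination the two characteristics describe.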



iCluster architecture




iCluster basics

(i) intelligent + (Cluster) clustering = iClusterDL

Contributions:
- Architecture and protocols to support IR functionality: seamless and easy integration of peers; scalable, fast query processing
- Self-organising peers based on SONs: support for rich query models; benefits from loosely-connected peers




iCluster Protocols
- Peer join/leave
- Peer rewiring
- Query processing
- Document retrieval




Peer rewiring

A peer p
1. computes its intra-cluster similarity (average similarity with its neighbours)
2. initiates rewiring if similarity < threshold θ
3. sends a message (msg) with its interest to m neighbours

All peers receiving msg append their interest and forward msg to m neighbours

The message is sent back to p when TTL τR = 0
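
The rewiring steps can be sketched as follows. This is a minimal simulation under several assumptions that the slide does not state: interests are sparse term-weight vectors compared with cosine similarity, the m receivers are picked at random, each peer holds a single interest, and a peer that already appended its interest does not forward again. All function and variable names are hypothetical. How p uses the collected interests to pick new neighbours is not covered here.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def intra_cluster_similarity(p, interests, neighbours):
    """Step 1: average similarity between p and its neighbours."""
    nbrs = neighbours[p]
    if not nbrs:
        return 0.0
    return sum(cosine(interests[p], interests[n]) for n in nbrs) / len(nbrs)


def rewiring_walk(p, interests, neighbours, theta, m, ttl, rng=random):
    """Steps 2-3: if p is poorly clustered, forward msg m-wide for ttl hops.

    Returns {peer: interest} as collected in msg, or None when p's
    intra-cluster similarity already reaches theta (no rewiring needed).
    """
    if intra_cluster_similarity(p, interests, neighbours) >= theta:
        return None
    collected = {}
    frontier = [p]
    while ttl > 0 and frontier:
        next_frontier = []
        for sender in frontier:
            nbrs = neighbours[sender]  # must be a list for rng.sample
            for receiver in rng.sample(nbrs, min(m, len(nbrs))):
                if receiver != p and receiver not in collected:
                    collected[receiver] = interests[receiver]  # append interest
                    next_frontier.append(receiver)
        frontier = next_frontier
        ttl -= 1
    return collected  # msg goes back to p once the TTL reaches 0


# Toy network: "a" is interested in "heart" but linked only to a "lung" peer.
interests = {"a": {"heart": 1.0}, "b": {"heart": 1.0},
             "c": {"lung": 1.0}, "d": {"lung": 1.0}}
neighbours = {"a": ["c"], "b": ["a"], "c": ["d", "b"], "d": ["c"]}

print(rewiring_walk("a", interests, neighbours, theta=0.9, m=2, ttl=2))
print(rewiring_walk("d", interests, neighbours, theta=0.9, m=2, ttl=2))
```

Peer "a" falls below the threshold and learns about "b", "c" and "d"; peer "d" is already well clustered, so it does not start a rewiring round.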




Query processing

A peer p
1. compares q against its interests and selects the interest int most similar to q
2. if similarity ≥ threshold θ, forwards a message (msg) including q to all its neighbours with TTL τb
3. if similarity < threshold θ, forwards msg to the m of its neighbours most similar to q

All peers receiving msg repeat the same process

The message is forwarded until TTL τf = 0
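
One possible reading of these steps is sketched below. The slide does not say how the two TTLs interact, so the sketch assumes that a message starts in fixed forwarding mode with TTL τf, that the first matching peer switches it to broadcast mode with TTL τb, and that peers drop a message they have already seen. It also assumes that a peer knows its neighbours' interests, that interests are term-weight vectors compared with cosine similarity, and it uses hypothetical names throughout.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def best_match(query, peer_interests):
    """Step 1: similarity between q and the peer's most similar interest."""
    return max((cosine(query, i) for i in peer_interests), default=0.0)


def process_query(start, query, interests, neighbours, theta, m, ttl_f, ttl_b):
    """Return the set of peers whose best interest matches q (similarity >= theta)."""
    matched, seen = set(), set()
    pending = deque([(start, ttl_f, False)])  # (peer, remaining TTL, broadcasting?)
    while pending:
        peer, ttl, broadcasting = pending.popleft()
        if peer in seen:
            continue  # assumption: duplicates of a message are dropped
        seen.add(peer)
        is_match = best_match(query, interests[peer]) >= theta
        if is_match:
            matched.add(peer)
            if not broadcasting:
                ttl, broadcasting = ttl_b, True  # step 2: start broadcasting
        if ttl == 0:
            continue
        if broadcasting:
            targets = neighbours[peer]  # all neighbours
        else:
            # Step 3: the m neighbours most similar to q.
            targets = sorted(neighbours[peer],
                             key=lambda n: best_match(query, interests[n]),
                             reverse=True)[:m]
        pending.extend((n, ttl - 1, broadcasting) for n in targets)
    return matched


# Toy network: a chain of three "heart" peers behind a "sports" peer.
interests = {"s": [{"sports": 1.0}], "x": [{"sports": 1.0}],
             "h1": [{"heart": 1.0}], "h2": [{"heart": 1.0}], "h3": [{"heart": 1.0}]}
neighbours = {"s": ["x", "h1"], "x": ["s"], "h1": ["s", "h2"],
              "h2": ["h1", "h3"], "h3": ["h2"]}
q = {"heart": 1.0}

print(process_query("s", q, interests, neighbours, theta=0.9, m=1, ttl_f=3, ttl_b=1))
print(process_query("s", q, interests, neighbours, theta=0.9, m=1, ttl_f=3, ttl_b=2))
```

Under these assumptions, raising the broadcast TTL from 1 to 2 lets the query reach the third "heart" peer, which shows why retrieval depends on how many similar peers lie within τb hops of each other.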



Measuring clustering quality




Clustering coefficient

The ratio of the number of links between the peers within pi’s neighbourhood to the number of links that could possibly exist between them:

$$c_i = \frac{\left|\{(p_j, p_k)\}\right|}{s(s-1)}, \quad p_j, p_k \in RI_i,\ p_k \in RI_j$$

where RI_i is the routing index of pi and s is the number of short-range links.

[Figure: four example neighbourhoods of pi, with ci = 1/6, ci = 1/2, ci = 1, and ci = 0]

- Takes values in the interval [0, 1]
- If ci = 1, every peer connected to pi is also connected to every other peer within the neighbourhood
- If ci = 0, no peer that is connected to pi connects to any other peer connected to pi
- Takes into account only the immediate neighbours of the peer
- Takes high values when there are cliques
- Loses the general view of the network
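
The definition can be checked on small examples. The sketch below assumes directed links stored as routing indices (a dict from each peer to the set of peers it links to) and takes s to be the number of peers in pi's routing index; both the representation and the names are illustrative assumptions.

```python
from fractions import Fraction


def clustering_coefficient(routing_index, i):
    """c_i: directed links among p_i's neighbours over the s(s-1) possible ones."""
    nbrs = routing_index[i]
    s = len(nbrs)
    if s < 2:
        return Fraction(0)  # no pair of neighbours exists
    links = sum(
        1
        for j in nbrs
        for k in nbrs
        if j != k and k in routing_index.get(j, set())
    )
    return Fraction(links, s * (s - 1))


# p links to a, b, c; only a links to b.
one_link = {"p": {"a", "b", "c"}, "a": {"b"}, "b": set(), "c": set()}

# p links to a, b, c; every neighbour links to every other neighbour.
clique = {"p": {"a", "b", "c"}, "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}

# p links to a, b, c; the neighbours do not link to each other.
star = {"p": {"a", "b", "c"}, "a": set(), "b": set(), "c": set()}

print(clustering_coefficient(one_link, "p"))
print(clustering_coefficient(clique, "p"))
print(clustering_coefficient(star, "p"))
```

The star example shows the limitation listed above: the measure only looks at links among immediate neighbours, so it is 0 no matter how well the rest of the network is organised.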




Clustering efficiency
- A new measure that quantifies network organisation and reflects retrieval effectiveness
- Based on the network organisation and on the query processing protocols
- Considers that a peer pi’s neighbourhood consists of all peers within radius τb around pi




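Clustering efficiency

The number of peers similar to pi that can be reached from pi within τb hops divided by the total number of similar peers:

$$\kappa_i = \frac{\sum_{j=1}^{N} \left(p_j : d_G(p_i, p_j) \le \tau_b,\ sim(p_i, p_j) \ge \theta\right)}{\sum_{k=1}^{N} \left(p_k : sim(p_i, p_k) \ge \theta\right)}$$

Each sum counts the peers that satisfy the condition after the colon; d_G is the hop distance in the overlay graph G.

[Figure: example neighbourhood of pi with ci = 0 and κi = 1]

- Takes values in the interval [0, 1]
- If κi = 1, the neighbourhood of pi contains all peers similar to pi
- If κi = 0, the neighbourhood of pi contains no peer similar to pi
- Gives information about the underlying network organisation involving more than just the immediate neighbours
- Looks at how the network is organised at a larger scale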
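
A small sketch of the measure follows. It assumes adjacency lists for the overlay, a caller-supplied predicate `similar(i, j)` standing in for sim(pi, pj) ≥ θ, that pi itself is not counted, and that κi is taken as 0 when no similar peer exists. These conventions and all names are assumptions made for illustration.

```python
from collections import deque


def peers_within(neighbours, source, radius):
    """Peers reachable from source in at most `radius` hops (source excluded)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # do not expand beyond the radius
        for v in neighbours[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    del dist[source]
    return set(dist)


def clustering_efficiency(i, neighbours, similar, ttl_b):
    """kappa_i: similar peers within ttl_b hops of i over all similar peers."""
    all_similar = {j for j in neighbours if j != i and similar(i, j)}
    if not all_similar:
        return 0.0  # assumed convention for the 0/0 case
    reachable = peers_within(neighbours, i, ttl_b)
    return len(all_similar & reachable) / len(all_similar)


# Toy similarity: two peers are similar when they share a topic.
topic = {"p": "med", "a": "med", "b": "med", "c": "med", "x": "law"}


def similar(i, j):
    return topic[i] == topic[j]


# Star: p's neighbours are all similar to p but not linked to each other,
# so the clustering coefficient of p is 0 while its clustering efficiency is 1.
star = {"p": ["a", "b", "c"], "a": ["p"], "b": ["p"], "c": ["p"]}

# Chain p - a - x - b: the similar peer b is three hops away from p.
chain = {"p": ["a"], "a": ["p", "x"], "x": ["a", "b"], "b": ["x"]}

print(clustering_efficiency("p", star, similar, ttl_b=1))
print(clustering_efficiency("p", chain, similar, ttl_b=2))
print(clustering_efficiency("p", chain, similar, ttl_b=3))
```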



Experimental evaluation




Experimental Evaluation

Used different parameters:
- Data corpus
- Similarity threshold
- Query TTL
- Forwarding strategies

Parameter              Symbol   Value
peers                  N        2,000
short-range links      s        8
long-range links       l        4
similarity threshold   θ        0.9
rewiring TTL           τR       4
fixed forwarding TTL   τf       6
broadcast TTL          τb       2
message fanout         m        2

Data corpora:
- OHSUMED TREC: 30,000 medical articles, 10 categories
- TREC-6: 556,000 documents, 100 categories

Rewiring schedule:
- the start of the rewiring is randomly chosen from the time interval [0, 4K]
- the periodicity is randomly selected from a normal distribution of 2K


Looked into the:
- Network organisation
- Recall

The better the network organisation is, the better the performance of retrievals should be!

The experiments are intended to:
- associate the performance of retrievals with the quality of network organisation
- recommend the clustering measure that better represents this association



Experimental Evaluation

Clustering coefficient ci for different forwarding strategies





Clustering efficiency κi for different forwarding strategies





Retrieval



Outlook




Conclusion

The idea
- focus on IR on top of SONs
- look at how the network is organised at a large scale

Clustering efficiency
- quantifies the underlying (dynamic) P2P structure
- reflects retrieval effectiveness

The results indicate that the clustering efficiency measure models network clustering quality better than other existing measures
