
Topology Formation and Replication Strategies for

Gnutella Network

Santosh Kumar Shaw

Thesis submitted in partial fulfillment of the requirements for the degree of

MASTER OF TECHNOLOGY

by

Santosh Kumar Shaw

Under the supervision of

Prof. Niloy Ganguly

Department of Computer Science and Engineering

Indian Institute of Technology, Kharagpur

West Bengal, India 721302

May 2008

Department of Computer Science and Engineering

Indian Institute of Technology, Kharagpur

Kharagpur, India 721302.

Certificate

This is to certify that the thesis entitled Topology Formation and Replication Strategies for Gnutella Network, submitted by Santosh Kumar Shaw to the Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India, in partial fulfillment of the requirements for the award of the degree of Master of Technology, is a record of original research work carried out by him under my supervision and guidance. The thesis fulfills all requirements as per the regulations of this Institute and, in my opinion, has reached the standard needed for submission. Neither this thesis nor any part of it has been submitted for any degree or academic award elsewhere.

Prof. Niloy Ganguly

Department of Computer Science and Engineering

Indian Institute of Technology, Kharagpur

India 721302.

I.I.T. Kharagpur

5th May, 2008

Acknowledgment

I would like to express my deep gratitude to Prof. Niloy Ganguly for his invaluable guidance, help and encouragement throughout my work. I would like to take this opportunity to express my sincere and profound gratitude to him for the pains he has taken for the project. I have learned a great many things in life from him.

I am grateful to all the faculty members of the CSE Department for their teaching, support and encouragement. I am particularly indebted to Prof. Arijit Bishnu, Prof. Indranil Sengupta and Prof. Arobinda Gupta for their guidance and motivation. I wish to thank the Software Laboratory staff and all the secretarial staff of the CSE Department for their sympathetic co-operation. I am grateful to all my friends in the Department of CSE for their help and encouragement.

I give my heartfelt thanks to my parents and my in-laws for their constant support.

Without their help and blessings this venture would never have been possible.

Finally, I thank my wife, Pallavi, whose unending help, encouragement, inspiration

and love made me pursue my dreams. Without her, this would simply not have

happened. I dedicate this thesis to her.

Santosh Kumar Shaw

Computer Science and Engineering

Indian Institute of Technology, Kharagpur, India

Abstract

Gnutella-like networks generate an enormous number of redundant messages in flood-based search because of their topology, and therefore lose scalability. In this thesis, we propose a completely distributed handshake protocol (named HPC5) to form an efficient topology for Gnutella-like networks. The protocol directs each peer to select neighbors in such a way that any cyclic path present in the overlay network has a minimum length of 5.

We show that our approach can be deployed in the existing Gnutella network without disturbing any of its parameters. Simulation results indicate that HPC5 is very effective for Gnutella's dynamic query search over limited flooding. Structural analysis indicates that the proposed network is as robust as the existing Gnutella network.

Limited flood-based search with a lower TTL decreases the chance of finding matching results and consequently decreases the number of results. To increase the number of results (query hits), we evaluate several index table replica placement strategies and compare our replica placement technique against two existing approaches. Simulation shows that, in our topological structure, our placement technique is very useful in terms of query hits, latency and search cost.

Hit ratio is one of the main obstacles in implementing our approach. We therefore present a hit-ratio-based design approach to meet the requirements of a given network.

Contents

List of Figures

List of Tables

1 Introduction

2 Related Works

3 System Model
    3.1 Notations
    3.2 Gnutella 0.6 Protocol
    3.3 Simulated Gnutella Network

4 Handshake Protocol for Cycle-5 Networks
    4.1 Handshake Protocol: HPC5
    4.2 Hurdles in Implementing the Scheme
        4.2.1 Compatibility with the Current Population of Gnutella
        4.2.2 Hit Ratio
        4.2.3 Consistency Problem
    4.3 Performance Evaluation
    4.4 Network Analysis
    4.5 Comparison with Zhu's work

5 Index Table Replication
    5.1 Replication in Gnutella Network
    5.2 Replica Placement Techniques
    5.3 Search Technique (Flooding) in Cycle-5 Networks with Replication
    5.4 Replica Update
    5.5 Simulation
        5.5.1 RPT Selection
        5.5.2 Number of Replica Selection

6 Design Issues

7 Conclusion and Future work
    7.1 Conclusion
    7.2 Future Works

References

List of Figures

1.1 Effect of topology structure and replication on limited flood based search. The number inside the circle represents the TTL value required to reach that node from start node S in flooding.

4.1 Selection of neighbor by an ultrapeer-1 after making peer-2 as a neighbor.

4.2 A part of an ultra-peer layer, where a node represents all nodes that are present in that level. For example, Q represents all 1st neighbors of P.

4.3 Theoretical and empirical hit ratio of a peer against number of peers in the network

4.4 A part of cycle-5 network, representing parallel update inconsistency

4.5 Comparison on Network Coverage and Message Complexity

4.6 Distribution of Higher order Clustering Coefficient

4.7 Comparison of Robustness between cycle-3 and cycle-5 Networks

5.1 Query Hit probability

6.1 Effects of rx in the network

List of Tables

3.1 Notations

3.2 Properties of Gnutella and Simulated Gnutella Network

4.1 Simulation Results for Search Performance

C H A P T E R 1

Introduction

A peer-to-peer (P2P) system is an overlay network that relies on the computing resources available from all the participants in the network instead of concentrating them in a small number of centralized servers. These networks are useful for many purposes such as file sharing, streaming media, internet telephony and distributed computation. Depending upon the topology formation, P2P networks are broadly classified as unstructured and structured. An unstructured P2P network is formed when the overlay links are established arbitrarily. Structured P2P networks ensure that any peer can efficiently route a search to some peer that has the desired file, even if the file is extremely rare. In these networks, the placement of resources is tightly controlled and well defined, which necessitates a structured pattern of overlay links.

Decentralized (fully distributed control), unstructured P2P networks (Gnutella, FastTrack, etc.) are the most popular file-sharing overlay networks. The absence of a structure and central control makes such systems much more robust and highly self-healing compared to structured systems [9, 15]. The main problem of these networks, however, is scalability, owing to the large number of redundant messages generated during query search. Consequently, as these networks become more popular, the quality of service degrades rapidly [6, 10].

To make the network scalable, Gnutella [1, 2, 3] is continuously upgrading its features and introducing new concepts. These improvements fall into two broad areas: improvements of search techniques, and modifications of the topological structure of the overlay network to enhance search efficiency. Among the enhanced search techniques, improvements such as Time-To-Live (TTL), dynamic querying, query caching and the Query Routing Protocol (QRP) have been introduced. One of the most significant topological modifications in unstructured networks was the introduction of the super-peer (ultra-peer) concept, with a two-tier network topology.

The basic search mechanism followed by Gnutella is limited flooding [1, 2, 3]. In flooding, a peer that searches for a file issues a query and sends it to all of its neighbor peers. A peer that receives the query forwards it to all its neighbors except the one from which it was received. In this way, a query is propagated up to a predefined number of hops (TTL) from the source peer. The TTL used by Gnutella is generally 1 or 2 for popular searches, whereas TTL(3) is used for rare searches. (The numeric value inside the parentheses is the number of hops to search; we follow this notation henceforth.)
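The flooding procedure described above can be illustrated with a minimal Python sketch. It is an illustration only, not the simulator used in this thesis: the function name `flood`, the adjacency-dictionary representation, the two example graphs and the rule that a peer silently drops a query it has already seen are all assumptions.

```python
from collections import deque


def flood(adjacency, source, ttl):
    """Simulate TTL-limited flooding from `source`.

    Returns (peers reached, messages sent, redundant messages).
    A message is counted as redundant when it arrives at a peer that
    has already seen the query (assumed to be dropped on arrival).
    """
    seen = {source}
    messages = 0
    redundant = 0
    # Each entry: (peer holding the query, peer it came from, remaining TTL).
    queue = deque([(source, None, ttl)])
    while queue:
        peer, sender, remaining = queue.popleft()
        if remaining == 0:
            continue  # TTL exhausted: this peer does not forward
        for neighbor in adjacency[peer]:
            if neighbor == sender:
                continue  # never send the query back where it came from
            messages += 1
            if neighbor in seen:
                redundant += 1
            else:
                seen.add(neighbor)
                queue.append((neighbor, peer, remaining - 1))
    return seen - {source}, messages, redundant


# Hypothetical example graphs: a 3-cycle and a 5-cycle.
TRIANGLE = {"S": ["A", "B"], "A": ["S", "B"], "B": ["S", "A"]}
FIVE_CYCLE = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
```

With TTL(2), `flood(TRIANGLE, "S", 2)` sends 4 messages, 2 of which are redundant, whereas `flood(FIVE_CYCLE, 0, 2)` reaches all 4 other peers with 4 messages and no redundancy.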

Objective and Contributions

The main goal of this dissertation is to improve the scalability of the Gnutella network by reducing redundant messages and increasing the number of query hits (results). One way to achieve this is to modify the overlay network so that small loops are eliminated from the overlay topology. The rationale behind this proposition is explained through Fig. 1.1. In this figure, all three networks have the same number of connections. With TTL(2) flooding, the network in Fig. 1.1(a) discovers 4 peers at the expense of 7 messages, whereas the network in Fig. 1.1(b) discovers 6 peers without any redundant messages. This is because the network of Fig. 1.1(b) has no cycle of length 3. Generalizing, for TTL(r) flooding, networks devoid of cycles of length less than (2r + 1) do not generate any redundant messages. In Fig. 1.1(c), the placement of replicas of the index table (the hash table of all the files a peer is sharing) is shown using dotted arrows. For example, the replicas of all nodes labeled 2 are stored at node S. Node S can therefore access every node labeled 2 through its replica with TTL(0), and can access all other nodes either directly or through replicas with TTL(1). As a result, the whole network is explored with TTL(1) flooding, which reduces query latency.

Figure 1.1: Effect of topology structure and replication on limited flood based search. The number inside the circle represents the TTL value required to reach that node from start node S in flooding. [Three network diagrams, panels (a), (b) and (c).]

We make five contributions in this thesis: (1) We propose a completely distributed handshake protocol (HPC5) which generates a cycle-5 network (a network which does not have any cycle of length up to (r − 1) is referred to as a cycle-r network) and extend the algorithm to cycle-r (r > 3) networks. (2) We show that our approach can be deployed in the existing Gnutella network without disturbing any of its parameters. (3) Through simulation results, we show that cycle-r networks are very effective for Gnutella's dynamic query search over limited flooding. Structural analysis indicates that the cycle-5 network is as robust as the existing Gnutella network. (4) We propose and evaluate a new index table replica placement technique for cycle-5 networks to improve query hits and reduce both latency and search cost. (5) We analyze some design issues related to the hit ratio of the network.

Organization of the Thesis:

The rest of the thesis is organized as follows.

A brief overview of previous works done on topology formation and replication

strategies of unstructured P2P networks is presented in Chapter 2.

A model of our network environment is presented in Chapter 3. The topology, search technique, replication technique and basic handshake protocol of the Gnutella network are described in this chapter.

Chapter 4 proposes a completely distributed handshake protocol (HPC5) for cycle-5 networks. We also extend the protocol to cycle-r networks. The problems and compatibility issues of implementing our protocol in the Gnutella network are described in this chapter. Through simulations we evaluate the search performance and structural properties of the evolved networks to validate our approach.

Chapter 5 discusses some replication techniques and proposes a new replication technique for cycle-5 networks.

In Chapter 6 we discuss the hit-ratio-based design of cycle-5 networks and its effects on performance.

Finally, Chapter 7 concludes the thesis and proposes some directions for future

work.

C H A P T E R 2

Related Works

Many algorithms exist in the literature which modify the topology in unstructured

P2P networks to solve the excessive traffic problem and improve query hits by

using replication techniques.

The structural mismatching between the overlay and underlying network topol-

ogy is alleviated by using location aware topology matching algorithms [7, 8].

A class of overlay topology based on distance between a node and its neighbors

in the physical network structure is presented in [11].

Papadakis et al. presented an algorithm in which each node monitors the ratio of duplicate messages on each of its network connections and stops forwarding queries through any connection whose ratio exceeds a certain threshold [14].

Zhu et al. very recently presented a distributed algorithm [18] to improve the scalability of Gnutella-like networks by reducing redundant messages. They rely on the same concept of eliminating cycles of length 3 and 4. However, their approach is demand driven and involves considerable control overhead. It is also not clear how the algorithm will perform in the face of heavy traffic. Moreover, the algorithm does not preserve the Gnutella parameters (such as degree distribution, average peer distance and diameter), hence the robustness of the evolved network is not maintained. In our work we take all the above aspects into consideration and propose a holistic approach to topology formation. The algorithm starts as soon as a peer enters the network rather than being demand driven.

Qin et al. discussed various alternatives for data replication, search techniques and network topology for the Gnutella network in [10]. The authors described replication techniques (uniform, proportional and square-root) that depend upon the query distribution. They also showed that random replication is better than owner or path replication in Gnutella-like networks.

C H A P T E R 3

System Model and Notations

In this chapter we discuss the basic Gnutella protocols and our simulated Gnutella network. Section 3.1 lists the notations for ready reference. Section 3.2 describes the basic Gnutella 0.6 protocol, which we use as our system model. Section 3.3 then describes our simulated Gnutella network.

3.1 Notations

Before describing the system model, we list in Table 3.1 the main notations used in this thesis.


Table 3.1: Notations

Notation            Meaning
TTL(r)              Query search with TTL = r
cycle-r             A cycle of minimum length r
cycle-r network     A network which does not have any cycle up to length (r − 1)
cycle-3 network     Gnutella network
N                   Total number of peers in the network
U                   Total number of ultra-peers in the network
L                   Total number of leaf-peers in the network
$d_{uu}$            Avg. no. of ultra-neighbors of an ultra-peer
$d_{ul}$            Avg. no. of leaf-neighbors of an ultra-peer
$d_{lu}$            Avg. no. of ultra-neighbors of a leaf-peer
$H_k$               Hit ratio to select the kth ultra-neighbor
$\langle H \rangle$       Average hit ratio of a peer
$\langle H_{ev} \rangle$  Average evolved hit ratio of a peer
rth neighbor        A peer at a distance of r hops. All immediate neighbors are 1st neighbors, all neighbors of 1st neighbors are 2nd neighbors, and so on.


3.2 Gnutella 0.6 Protocol

The basic Gnutella network [1, 2, 3] consists of a large collection of nodes that are assigned unique identifiers and that communicate through message exchanges.

Topology: Gnutella 0.6 is a two-tier overlay network consisting of two types of nodes: ultra-peers and leaf-peers. An ultra-peer is connected to a limited number of other ultra-peers and leaf-peers. A leaf-peer is connected to a few ultra-peers. However, there is no direct connection between any two leaf-peers in the overlay network. A third type of peer, the legacy-peer, is present at the ultra-peer level and does not accept any leaves. We do not consider legacy-peers in our model.

Search Technique: The network follows limited flood-based query search. An ultra-peer forwards a query to its leaf-peers with TTL(0) and, only when TTL > 0, to all its ultra-neighbors with the TTL decremented by one. A leaf-peer does not forward any received query; instead, ultra-peers perform query searching on behalf of their leaf-peers. The query of a leaf-peer is initially sent to its connected ultra-peers. All the connected ultra-peers simultaneously forward the query to their neighbor ultra-peers up to a limited number of hops. Since multiple ultra-peers initiate flooding, a leaf-peer's query produces more redundant messages if the distance between any two of its ultra-neighbors is not large enough. Gnutella 0.6 uses dynamic querying over limited flooding as its query search technique. In dynamic querying, an ultra-peer incrementally forwards a query in 3 steps (TTL(1), TTL(2) and TTL(3), respectively) through each connection while measuring the responsiveness to that query. The ultra-peer can stop forwarding the query at any step if it gets a sufficient number of query hits. Consequently, dynamic querying uses TTL(3) only for rare searches.
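The stepwise behavior of dynamic querying can be sketched as follows. This is a simplified illustration and not the Limewire implementation: the function `dynamic_query`, the callable `probe` (assumed to return the cumulative number of query hits obtained at a given TTL) and the threshold `desired_hits` are hypothetical, and the per-connection responsiveness measurement is left out.

```python
def dynamic_query(probe, desired_hits, max_ttl=3):
    """Raise the TTL step by step and stop once enough hits arrive.

    `probe(ttl)` is assumed to return the cumulative number of query
    hits obtained when the query is flooded with the given TTL.
    Returns (TTL at which the search stopped, hits obtained).
    """
    hits = 0
    for ttl in range(1, max_ttl + 1):
        hits = probe(ttl)
        if hits >= desired_hits:
            return ttl, hits  # popular item: stop early
    return max_ttl, hits  # rare item: the full TTL(3) search was needed


# Hypothetical cumulative hit counts for one query at TTL 1, 2 and 3.
EXAMPLE_HITS = {1: 2, 2: 9, 3: 40}
```

For instance, with `EXAMPLE_HITS.get` as the probe, a search that needs 5 hits stops at TTL(2), while one that needs 100 hits runs through TTL(3) and returns the 40 hits found.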

Query Routing Protocol (QRP): With QRP, an ultra-peer forwards a query only to those leaf neighbors that are likely to have a result. A leaf-peer creates a hash table of all the files it is sharing and sends it to all its connected ultra-peers. As a result, when a query reaches an ultra-peer, it is forwarded only to those connected leaf-peers that are likely to produce query hits. Searching is thus effectively performed only at the ultra-peer layer, since ultra-peers hold the indices of their children.
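The idea behind QRP can be sketched in a few lines of Python. The sketch is deliberately simplified and rests on assumptions: the class `UltraPeer` and its methods are hypothetical, and plain keyword sets stand in for the hashed keyword tables that leaf-peers actually send.

```python
import re


class UltraPeer:
    """Keeps one keyword table per leaf and routes queries accordingly."""

    def __init__(self):
        self.leaf_tables = {}  # leaf id -> set of keywords it shares

    def register_leaf(self, leaf_id, shared_files):
        """Store the keyword table a leaf sends when it connects."""
        keywords = set()
        for name in shared_files:
            keywords.update(re.findall(r"[a-z0-9]+", name.lower()))
        self.leaf_tables[leaf_id] = keywords

    def leaves_for_query(self, query):
        """Return the leaves whose table contains every query keyword."""
        words = re.findall(r"[a-z0-9]+", query.lower())
        return sorted(
            leaf
            for leaf, table in self.leaf_tables.items()
            if all(word in table for word in words)
        )
```

A query is forwarded only to the leaves returned by `leaves_for_query`; leaves whose table lacks any of the query keywords never see the query.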

Basic Handshake Protocol: Many client programs are used to access the Gnutella network (e.g., Limewire, Bearshare, Gtk-gnutella). We use the handshake protocol of Limewire, the most popular client, as the base handshake protocol in our simulation. Through handshaking, a peer establishes a connection with an ultra-peer. To start the handshake protocol, a peer first collects the address of an online ultra-peer from a pool of online ultra-peers. A peer can collect the list of online peers from hard-coded addresses, from GwebCache systems [4], through pong caching, or from its own hard disk, where it stored the list of online ultra-peers obtained in the previous run [6]. The handshake protocol is used to make new connections. A handshake consists of 3 groups of headers [1, 2]. The steps of handshaking are elaborated next:

1. The program (peer) that initiates the connection sends the first group of

headers, which tells the remote program about its features and the status to

imply the type of neighbor (leaf or ultra) it wants to be.

2. The program that receives the connection responds with a second group

of headers which essentially conveys the message whether it agrees to the

initiator’s proposal or not.

3. Finally, the initiator sends a third group of headers to confirm and establish the connection.

This protocol is modified in this thesis to overcome the problem of message overhead.

3.3 Simulated Gnutella Network

To generate the existing Gnutella network, we have simulated a stripped-down version of the Gnutella 0.6 protocol as followed by Limewire [1]. Our implementation is written in C on the Linux platform. When a new node wants to join the network, a separate thread is allotted to the bootstrapping and neighbor selection process for that peer. For bootstrapping we follow the basic handshake protocol described in Section 3.2. In a similar way, we generate cycle-5 networks by following the HPC5 protocol. Our simulated Gnutella network exhibits the features of the Gnutella network (such as degree distribution, diameter, average path length between two peers, proportion of ultra-peers and clustering coefficient) that are obtained from the snapshots collected by crawlers [3, 16, 17]. The properties of the Gnutella and simulated Gnutella networks are given in Table 3.2 [1, 17].

Table 3.2: Properties of Gnutella and Simulated Gnutella Network

Property                       Gnutella                 Simulated Gnutella
No. of peers                   2000k                    100k
Ultra-peer ratio               15-16% of total peers    15% of total peers
Avg. diam. of ultra-layer      6-7                      4-5
Maximum connections            ultra-ultra: 32          ultra-ultra: 32
(applicable to Limewire)       ultra-leaf: 30           ultra-leaf: 30
                               leaf-ultra: 3            leaf-ultra: 3
Average connections            $d_{uu}$: 25-26          $d_{uu}$: 22-23
                               $d_{ul}$: 20-22          $d_{ul}$: 17-18
                               $d_{lu}$: 3-4            $d_{lu}$: 3
Clustering coeff.              0.02                     0.08-0.10

C H A P T E R 4

Handshake Protocol for Cycle-5

Networks (HPC5)

In this chapter, we first present our handshake protocol for cycle-5 networks in Section 4.1. In Section 4.2 we discuss the hurdles in implementing the protocol (HPC5) in real-life networks (Gnutella). We then evaluate the search performance and robustness of the networks generated by the HPC5 protocol (cycle-5 and cycle-6 networks) in Section 4.3. Finally, in Section 4.5, we compare our approach with the DCMP protocol [18], which is very similar to our work.

4.1 Handshake Protocol: HPC5

We modify the basic handshake protocol to generate cycle-5 networks. Fig. 4.1 illustrates the proposed HPC5 graphically. In Fig. 4.1, peer-1 requests other online ultra-peers to become its neighbor, given that peer-2 is already a neighbor of peer-1. In Fig. 4.1(a) and 4.1(b), a triangle or a quadrilateral would be formed if a 1st or 2nd neighbor of peer-2, respectively, were selected. This possibility is avoided in Fig. 4.1(c), where a cycle of length 5 is formed.

Figure 4.1: Selection of neighbor by an ultrapeer-1 after making peer-2 as a neighbor. [Panels (a), (b) and (c), each showing peers 1 to 5.]

Each peer maintains a list of its 1st and 2nd neighbors, which contains only ultra-peers (because a peer sends neighbor requests only to ultra-peers). The 2nd ultra-neighbors of a leaf-peer are the 1st ultra-neighbors of the ultra-peers it is connected to. To keep this knowledge up to date, each ultra-peer periodically exchanges its list of 1st neighbors with its neighbor ultra-peers and sends the list to its leaf-peers. To do this with minimal overhead, piggybacking can be used, in which an ultra-peer appends its neighbor list to the messages passing through it. The three steps of the modified handshake protocol (HPC5) are described below.

1. The initiator peer first sends a request to a remote ultra-peer which is not in its 1st or 2nd neighbor set. The request header contains the type of the initiator peer. The presence of the remote peer in the 2nd neighbor set implies the possibility of a cycle of length 3. In Fig. 4.1, peer-1 cannot send a request to peer-2 or peer-3, whereas peer-4 and peer-5 are eligible remote ultra-peers.

2. The recipient replies with its list of 1st neighbors and a neighborhood acceptance/rejection message. If the remote peer rejects the connection in this step, the initiator closes the connection and keeps the record of the remote peer's neighbors for future handshakes. If the remote peer accepts the invitation, the initiator peer performs the following task.

3. The initiator peer checks whether there is at least one common peer between its 2nd neighbor set (say, A) and the 1st neighbor set of the remote peer (say, B). A common ultra-peer between sets A and B indicates the possibility of a cycle of length 4. If no common peer is present between sets A and B, the initiator sends accept connection to the remote peer. Otherwise the initiator sends reject connection to the remote peer.

HPC5 prevents the possibility of forming a cycle of length 3 or 4 and generates a

cycle-5 network.
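The three steps can be condensed into a short Python sketch. It is an illustration under simplifying assumptions rather than the thesis implementation: the function names are hypothetical, every peer is treated as an ultra-peer, neighbor lists are assumed to be perfectly up to date, and the remote peer is assumed to always accept in step 2.

```python
def second_neighbors(adjacency, peer):
    """Peers exactly two hops away from `peer`."""
    first = adjacency[peer]
    two_hops = {n for f in first for n in adjacency[f]}
    return two_hops - first - {peer}


def hpc5_try_connect(adjacency, initiator, remote):
    """Apply the HPC5 checks; add the link and return True if allowed."""
    first = adjacency[initiator]
    second = second_neighbors(adjacency, initiator)
    # Step 1: never contact a 1st or 2nd neighbor (would close a 3-cycle).
    if remote == initiator or remote in first or remote in second:
        return False
    # Step 2: the remote peer replies with its list of 1st neighbors
    # (acceptance by the remote peer is assumed here).
    remote_first = adjacency[remote]
    # Step 3: a peer common to A (own 2nd neighbors) and B (remote's
    # 1st neighbors) would close a 4-cycle, so reject the connection.
    if second & remote_first:
        return False
    adjacency[initiator].add(remote)
    adjacency[remote].add(initiator)
    return True


def chain_of_five():
    """The chain 1-2-3-4-5, mirroring the situation of Fig. 4.1."""
    return {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
```

On the chain returned by `chain_of_five()`, peer 1 is refused peers 2 and 3 in step 1 and peer 4 in step 3, while the connection to peer 5 is accepted and closes a cycle of length 5.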

The generalized version of the protocol, HPCr (r > 3), is as follows. To generate cycle-r networks, each ultra-peer keeps neighbor lists up to its (r − 3)th neighbors in its data structure and exchanges neighbor lists up to its (r − 4)th neighbors with its immediate ultra-neighbors. The bandwidth requirement therefore grows as r increases. For example, to generate cycle-6 networks, each ultra-peer periodically exchanges its lists of both 1st and 2nd neighbors with its neighbor ultra-peers and sends both lists to its leaf-peers. As a result, each peer holds information up to its 3rd neighbor ultra-peers.

4.2 Hurdles in Implementing the Scheme

In this section we mainly discuss the hurdles of implementing cycle-5 networks.

Before embedding HPC5 in Gnutella network, several important questions have to

be taken into consideration to assess the viability of HPC5. The most important

of them are listed below.

1. Is this scheme compatible with the current population of the Gnutella network?

2. On average, how many trials are required to get an ultra-neighbor?


3. Is there any possibility of inconsistency and if so, how can such inconsistency

be removed?

Each of these questions is discussed in turn.

4.2.1 Compatibility with the Current Population of Gnutella

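From the ultra-peer point of view, the total number of ultra-leaf connections is $U \cdot d_{ul}$, and from the leaf-peer point of view it is $L \cdot d_{lu}$. Equating the two, we get

\begin{align}
U \cdot d_{ul} &= L \cdot d_{lu} \notag \\
U \cdot d_{ul} &= (N - U) \cdot d_{lu} \notag \\
N &= \frac{U \cdot (d_{ul} + d_{lu})}{d_{lu}} \tag{4.1}
\end{align}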

Fig. 4.2 represents a part of the ultra-peer layer, where P has its immediate neighbors at level Q. Suppose P is already connected to $(d_{uu} - 1)$ ultra-peers at level Q and wants to get its $d_{uu}$-th ultra-neighbor. According to HPC5, P should not connect to any ultra-peer from level R or S as its next neighbor. However, a peer at level T can be a neighbor of P.

Figure 4.2: A part of an ultra-peer layer, where a node represents all nodes that are present in that level. For example, Q represents all 1st neighbors of P. [Diagram with nodes P, Q, R, S and T.]

Thus, if P wants to make a new ultra-neighbor, then P has to exclude at most $(d_{uu}-1)$, $(d_{uu}-1)^2$ and $(d_{uu}-1)^3$ ultra-peers from levels Q, R and S, respectively. So, in total, $[(d_{uu}-1) + (d_{uu}-1)^2 + (d_{uu}-1)^3] \approx d_{uu}^3$ peers cannot be considered as the next neighbor of P. Therefore the number of ultra-peers in the network needs to be at least

\[ U \approx d_{uu}^3 \tag{4.2} \]

From equations 4.1 and 4.2 we get

\[ N \approx \frac{(d_{ul} + d_{lu}) \cdot d_{uu}^3}{d_{lu}} \tag{4.3} \]

At present the Gnutella network has a population of almost 2000k peers at any time [1]. From equation 4.3 it can be seen that, for the present values of $d_{uu}$, $d_{ul}$ and $d_{lu}$ (Table 3.2), 120-130k peers are sufficient to implement the HPC5 protocol. However, to form cycle-6 networks (HPC6), the number of peers required,

\[ N \approx \frac{(d_{ul} + d_{lu}) \cdot d_{uu}^4}{d_{lu}}, \]

is more than 2000k. Hence the current population cannot support such an attempt.
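The population estimates can be checked with a few lines of arithmetic. In the sketch below the function name `min_population` is hypothetical, and the degrees $d_{uu} = 25$, $d_{ul} = 21$ and $d_{lu} = 3$ are single values picked for illustration from the Gnutella ranges of Table 3.2; other values in those ranges shift the result somewhat.

```python
def min_population(d_uu, d_ul, d_lu, exponent):
    """Minimum total number of peers N from equations 4.1 and 4.2.

    `exponent` is 3 for cycle-5 networks (U ~ d_uu**3) and 4 for
    cycle-6 networks (U ~ d_uu**4), as stated in the text.
    """
    ultra_peers = d_uu ** exponent
    return ultra_peers * (d_ul + d_lu) / d_lu


# Illustrative degrees chosen from the Gnutella column of Table 3.2.
N_CYCLE5 = min_population(25, 21, 3, exponent=3)  # 125000.0
N_CYCLE6 = min_population(25, 21, 3, exponent=4)  # 3125000.0
```

Under these assumed degrees a cycle-5 network needs about 125k peers, inside the 120-130k range quoted above, while a cycle-6 network would need about 3125k peers, which exceeds the current population of 2000k.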

4.2.2 Hit Ratio

Hit ratio is defined as the inverse of the number of trials required to get a valid ultra-peer neighbor. As our protocol puts some constraints on neighbor selection, a contacted remote ultra-peer that agrees to connect may still not be selected as a neighbor. Mathematically, if a peer (say, P) is looking for its kth ultra-neighbor and, on average, the $m_k$-th contacted ultra-peer satisfies the constraints and becomes the kth ultra-neighbor of P, then the hit ratio for the kth neighbor is $H_k = 1/m_k$. We first make a static analysis of the hit ratio and then refine it by considering that the network is evolving.

At the time of kth ultra-neighbor selection in HPC5, a peer (say, P) does not consider its 1st ($(k-1)$ ultra-peers) and 2nd ($(k-1)(d_{uu}-1)$ ultra-peers) ultra-neighbors as potential neighbors; this exclusion is done locally by checking the 1st and 2nd neighbor lists of P. The number of ultra-peers excluded in this way is $U' = (k-1) + (k-1)(d_{uu}-1)$. According to step 3 of HPC5, P cannot take as its neighbor any ultra-peer of level S of Fig. 4.2 (the 3rd ultra-neighbors of P), which are $U'' = (k-1)(d_{uu}-1)^2$ in number. So, in total, $U' + U''$ ultra-peers are excluded. We assume that every ultra-peer is equally likely to be contacted. The hit ratio can then be given as

\[ H_k = \frac{U - (U' + U'')}{U - U'} \]

Assuming $U' \ll U$, $U' \ll U''$ and $U'' \approx d_{uu}^2 \cdot (k-1)$, $H_k$ becomes

\[ H_k \approx \frac{U - d_{uu}^2 \cdot (k-1)}{U} \tag{4.4} \]

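The upper bound of k, and consequently the average ultra-degree, differs between leaf-peers and ultra-peers. To generalize further calculations, let m be the average ultra-degree of a peer. The average hit ratio is then

\begin{align}
\langle H \rangle &= \frac{1}{m} \sum_{k=1}^{m} H_k \notag \\
&= \frac{1}{m} \sum_{k=1}^{m} \left[ 1 - \frac{d_{uu}^2 \cdot (k-1)}{U} \right] \notag \\
&= 1 - \frac{d_{uu}^2 \cdot (m-1)}{2 \cdot U} \tag{4.5}
\end{align}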


Equation 4.5 gives the average hit ratio of a peer joining the network when the population of ultra-peers in the network is U. It also shows that $(1 - \langle H \rangle)$ is inversely proportional to the number of ultra-peers (U) in the complete network.
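As a numerical check of equations 4.4 and 4.5, the sketch below evaluates the sum over k directly and compares it with the closed form. The function names and the sample values ($d_{uu} = 25$, $m = 25$, $U = 150000$, i.e. 15% of a 1000k-peer network) are assumptions chosen for illustration.

```python
def hit_ratio_k(k, d_uu, num_ultra):
    """Approximate hit ratio for the kth ultra-neighbor (equation 4.4)."""
    return 1 - d_uu ** 2 * (k - 1) / num_ultra


def average_hit_ratio_sum(m, d_uu, num_ultra):
    """Average of H_k over k = 1..m, evaluated term by term."""
    total = sum(hit_ratio_k(k, d_uu, num_ultra) for k in range(1, m + 1))
    return total / m


def average_hit_ratio_closed(m, d_uu, num_ultra):
    """Closed form of the same average (equation 4.5)."""
    return 1 - d_uu ** 2 * (m - 1) / (2 * num_ultra)
```

For the sample values both functions give an average hit ratio of 0.95, that is, roughly one rejected candidate for every twenty contacted ultra-peers.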

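As each node joins, the network grows, and the average hit ratio changes with this growth. The evolved hit ratio is therefore the average of the average hit ratios calculated at each stage of the growth of the network. Let $U_0$ and $U_n$ be the number of ultra-peers in the initial and final networks. The evolved hit ratio is then

\begin{align}
\langle H_{ev} \rangle &= \frac{1}{U_n - U_0} \cdot \sum_{U_i = U_0}^{U_n} \langle H \rangle \notag \\
&= 1 - \frac{d_{uu}^2 \cdot (m-1)}{2 \cdot (U_n - U_0)} \cdot \sum_{U_i = U_0}^{U_n} \frac{1}{U_i} \tag{4.6}
\end{align}

Now,

\begin{align*}
\sum_{k=1}^{n} \frac{1}{k} &= \left( \frac{1}{1} \right) + \left( \frac{1}{2} + \frac{1}{3} \right) + \left( \frac{1}{4} + \frac{1}{5} + \frac{1}{6} + \frac{1}{7} \right) + \cdots \\
&\le 1 + 1 + 1 + \cdots \quad (\log n \text{ terms}) \\
&\le \log n
\end{align*}

Therefore,

\begin{align*}
\sum_{k=m}^{n} \frac{1}{k} &\le \log n - \log m \\
&\le \log (n/m)
\end{align*}

and equation 4.6 becomes

\[ \langle H_{ev} \rangle \approx 1 - \frac{d_{uu}^2 \cdot (m-1)}{2 \cdot (U_n - U_0)} \cdot \log (U_n / U_0) \tag{4.7} \]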


From equations 4.5 and 4.7 we get

$$\langle H_{ev} \rangle \le \langle H \rangle$$

As $d_{uu}$ and the maximum value of $m$ are bounded, the value of $\langle H_{ev} \rangle$ increases with $U$. We have tested this through simulation and plotted the evolved hit ratio against network sizes of 200k-1000k in Fig. 4.3, and observed that the theoretical and empirical curves are similar. The similarity is less pronounced at the beginning, as the approximations made to derive equations 4.4 and 4.7 play a major role in smaller networks.
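The effect of replacing the sum in equation 4.6 by the logarithm in equation 4.7 can be examined with a short script. This is a sketch under stated assumptions: the function names and the sample values are hypothetical, and the natural logarithm is assumed for $\log$.

```python
import math


def evolved_hit_ratio_sum(U0, Un, d_uu, m):
    """Equation 4.6, evaluated by summing 1/U_i for U_i = U0..Un."""
    harmonic_part = sum(1.0 / Ui for Ui in range(U0, Un + 1))
    return 1 - d_uu ** 2 * (m - 1) / (2 * (Un - U0)) * harmonic_part


def evolved_hit_ratio_log(U0, Un, d_uu, m):
    """Equation 4.7, with the sum replaced by log(Un / U0) (natural log assumed)."""
    return 1 - d_uu ** 2 * (m - 1) / (2 * (Un - U0)) * math.log(Un / U0)


if __name__ == "__main__":
    # Hypothetical sample values, chosen only for illustration.
    U0, Un, d_uu, m = 10_000, 100_000, 30, 30
    print(evolved_hit_ratio_sum(U0, Un, d_uu, m))
    print(evolved_hit_ratio_log(U0, Un, d_uu, m))
```

For networks that start from a reasonably large $U_0$, the two values should be very close; the gap widens when $U_0$ is small, which is in line with the remark above about smaller networks.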

Figure 4.3: Theoretical and empirical hit ratio of a peer against number of peers in the network. (Plot: hit ratio versus number of peers; curves: theoretical evolved hit ratio and empirical evolved hit ratio.)

4.2.3 Consistency Problem

Periodically exchanging the list of neighbors lets peers obtain up-to-date information about their neighbors. Between two successive updates, a peer may have erroneous knowledge about its neighbors. This inconsistency leads to the presence of 3-length or 4-length cycles in the network. Parallel update is possible when many peers enter simultaneously, or when there is a large-scale failure/attack in the network in which many nodes lose their neighbors and then try to gain new ones.

1. Parallel update: In parallel update, due to inconsistency, smaller length cycles are formed when multiple peers from the same cycle handshake with another peer in parallel to become each other's neighbors. In Fig. 4.4(a), peer-1 and peer-5 execute the following actions according to the steps of HPC5 and form a smaller length cycle.

(a) Both peer-1 and peer-5 find that peer-P is a valid remote peer to contact, and both send a request to P.

(b) Peer-P gets their requests at more or less the same time and sends back its neighborhood status to them.

(c) As peer-1 and peer-5 do not know each other's activity or updated status, both make P their new neighbor, and a cycle-3 is formed due to this inconsistency.

Similarly, smaller cycles may be created when multiple peers contact each other in a directed cycle (as in Fig. 4.4(b)) within the period between two successive updates.

Figure 4.4: A part of cycle-5 network, representing parallel update inconsistency. (Panels: (a) and (b).)


2. Inconsistency arising in the face of failure/attack: Here we discuss the topological status of the network when a fraction $x$ (where $x \ll 1$) of the nodes leave or are removed from the network. We assume that the nodes leave uniformly from different parts of the network. So each peer loses a fraction of its neighbors, and in effect the average degree of a peer in the network decreases. To maintain the degree distribution of the network, each peer contacts other remote ultra-peers to make up its neighbor deficiency. During this process, 3-length and 4-length cycles are created temporarily due to inconsistency between two successive updates. We calculate the effects of inconsistency due to peer removal from an ultra-peer point of view, so in this section the term peer is used to represent an ultra-peer.

• 3-length cycle: A 3-length cycle is created if two neighboring peers (say, peer-1 and peer-5) and another remote peer (say, P) get involved in HPC5 as in Fig. 4.4. Initiating the handshake protocol in different combinations among peer-1, peer-5 and P may create a triangle. In this calculation we follow the combination shown in Fig. 4.4(a). After the removal process, $U_{rem} = (U - xU)$ ultra-peers remain in the network. According to HPC5, a peer cannot make any ultra-peer at level Q, R or S in Fig. 4.2 its neighbor. So the probability of peer-1 selecting P as a neighbor is

$$P_0 = \frac{U_{rem} - [d_{uu}(1-x)]^3}{U_{rem}}$$

The probability of any neighbor of peer-1 (here peer-5) choosing the same ultra-peer P as a neighbor is

$$P_1 \approx \frac{1}{U_{rem} - [d_{uu}(1-x)]^3}$$

So the probability of forming a 3-length cycle is

$$P_t = P_0 \cdot P_1 \approx \frac{1}{U_{rem}}$$

Therefore, the average number of 3-length cycles created around an ultra-peer is $(d_{uu} + d_{ul})(1-x)/U_{rem}$, and the total number of 3-length cycles formed in the network is

$$L_3 = O\left(\frac{(d_{uu} + d_{ul})(1-x)}{U_{rem}} \cdot U_{rem}\right) = O((d_{uu} + d_{ul})(1-x)) = O(d_{uu})$$

($O(f)$ represents big-oh of $f$).

• 4-length cycle: A 4-length cycle is created if P and one of its 2nd ultra-neighbors, or any two 1st neighbors (leaf or ultra) of P, contact T and become neighbors. Similar to the calculation for 3-length cycles, the average number of 4-length cycles created around an ultra-peer is

$$\frac{[d_{uu}(1-x)]^2 + d_{uu} d_{ul} (1-x)^2}{U_{rem}}$$

and the total number of 4-length cycles formed in the network is

$$L_4 = O([d_{uu}(1-x)]^2 + d_{uu} d_{ul} (1-x)^2) = O(d_{uu}^2)$$

So the total number of smaller length cycles created due to node removal is

$$L = L_3 + L_4 = O(d_{uu}^2)$$

which is very small compared to the network size.


In the same way we can show that the effect of inconsistency from the leaf-peer point of view is $O(d_{uu} \cdot d_{lu})$.

The inconsistency is removed in the following manner. After getting the next up-to-date information from its neighbors, a peer removes the neighbor(s) responsible for creating smaller cycles. The elimination of smaller length cycles is accelerated if the gap between two successive updates is reduced, at the cost of additional message overhead; this is a trade-off that design engineers have to decide. However, the presence of a small percentage of smaller length cycles in the network is tolerable, as they do not affect performance much.
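The clean-up step described above needs a way to tell which edges lie on a cycle shorter than 5. The following is a minimal sketch of one such check, assuming a global view of the overlay as an adjacency dict of sets with comparable node ids; it is not the distributed HPC5 procedure itself, and the names `alt_distance` and `short_cycle_edges` are hypothetical. An edge lies on a cycle of length at most 4 exactly when its endpoints are joined by another path of length at most 3.

```python
from collections import deque


def alt_distance(adj, u, v, limit):
    """Length of the shortest path from u to v that avoids the direct edge u-v,
    or None if no such path of length <= limit exists."""
    seen = {u}
    queue = deque([(u, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == limit:
            continue  # do not explore beyond the hop limit
        for nxt in adj[node]:
            if node == u and nxt == v:
                continue  # skip the direct edge itself
            if nxt == v:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def short_cycle_edges(adj, min_cycle=5):
    """Set of edges (u, v), u < v, that lie on a cycle shorter than min_cycle."""
    bad = set()
    for u in adj:
        for v in adj[u]:
            # A detour of length d closes a cycle of length d + 1.
            if u < v and alt_distance(adj, u, v, min_cycle - 2) is not None:
                bad.add((u, v))
    return bad


if __name__ == "__main__":
    ring = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
    print(short_cycle_edges(ring))   # a plain 5-cycle has no short-cycle edges
    ring[0].add(2)
    ring[2].add(0)                   # a chord creates a 3-cycle and a 4-cycle
    print(sorted(short_cycle_edges(ring)))
```

In a real deployment each peer only knows its neighborhood up to a few hops, so the same bounded search would run on the locally exchanged neighbor lists rather than on a global graph.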

4.3 Performance Evaluation

To validate our approach, we have performed numerous experiments. We have taken networks of different sizes (up to 800k nodes) and repeated the experiments on them several times to obtain the average behavior. Through these experiments, we show that our approach performs better than the existing protocols. In this section we analyze the search performance of cycle-5 and cycle-6 networks through simulation. These analyses also give an idea of the performance to expect from cycle-r (r > 6) networks.

Metrics: The efficiency of search algorithms can be measured using various metrics such as success rate, average number of hops required to get results (response time), message complexity, network coverage (number of nodes discovered), percentage of message duplication, etc. [10]. In our simulations, we have used message complexity and network coverage as performance metrics.

Definition 4.1 Message complexity is defined as the average number of mes-

sages required to discover a peer in the overlay network.


Definition 4.2 Network coverage implies the number of unique peers explored

during query propagation in limited flooding.
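The two metrics can be illustrated with a small flooding simulation. The following is a minimal sketch, not the simulator used for the experiments: the adjacency-dict representation, the function names `flood` and `message_complexity`, and the interpretation of TTL as the number of forwarding hops are assumptions made for illustration.

```python
def flood(adj, source, ttl):
    """TTL-limited flooding from `source`.

    Returns (coverage, messages): the number of unique peers discovered
    (excluding the source) and the total number of query messages sent.
    """
    visited = {source}
    frontier = [(source, None)]  # (peer, peer it received the query from)
    messages = 0
    for _ in range(ttl):
        next_frontier = []
        for peer, sender in frontier:
            for nbr in adj[peer]:
                if nbr == sender:
                    continue  # never send the query back to its sender
                messages += 1
                if nbr not in visited:
                    visited.add(nbr)
                    next_frontier.append((nbr, peer))
                # otherwise the message is a duplicate and is dropped
        frontier = next_frontier
    return len(visited) - 1, messages


def message_complexity(adj, source, ttl):
    """Average number of messages needed to discover one peer."""
    coverage, messages = flood(adj, source, ttl)
    return messages / coverage if coverage else 0.0


if __name__ == "__main__":
    ring5 = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
    triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
    print(message_complexity(ring5, 0, 2))     # no duplicates within 2 hops
    print(message_complexity(triangle, 0, 2))  # duplicates caused by the 3-cycle
```

On the 5-cycle every message discovers a new peer within two hops, whereas on the triangle half of the messages are duplicates, which is the effect the cycle-5 topology is meant to avoid.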

Performance Metrics: We have plotted the network coverage and message complexity (y-axis) for TTL(2) and TTL(3) flooding against the size of the network (x-axis). We show the performance separately for queries originating from leaf-peers and from ultra-peers. To get the overall performance of the network, we have chosen the number of ultra-peers and leaf-peers for query flooding in the same ratio. The performance of the network is greatly influenced by the value of TTL used in the search, and thus we discuss the performance metrics for TTL(2) and TTL(3) separately. The search performance (especially message complexity) also depends on the implementation of the QRP technique. With QRP, searching is performed only at the ultra-peer layer, since ultra-peers contain the indices of their children [1, 2]. So, measuring message complexity at the ultra-peer layer is more appropriate for comparing results with Gnutella networks.

It is clear from Fig. 4.5 that cycle-5 and cycle-6 networks are better than cycle-3 networks in terms of both message complexity and network coverage. In cycle-5 and cycle-6 networks, the network coverage is approximately 15-20% higher than that of cycle-3 networks in both TTL(2) and TTL(3) searches, as shown in Fig. 4.5(a) and 4.5(b). The network coverage of cycle-6 networks is slightly higher than that of cycle-5 networks. The ultra-peer layer message complexity is shown in Fig. 4.5(c). With TTL(2), the message complexity of both cycle-5 and cycle-6 networks is very close to 1, whereas in cycle-3 networks it is almost 10-15% higher. The message complexity of cycle-3 networks with TTL(3) is almost 6-7% higher than that of cycle-5 networks and almost 20% higher than that of cycle-6 networks. Cycle-6 networks show slightly better performance than cycle-5 networks with TTL(3).

Figure 4.5: Comparison on Network Coverage and Message Complexity. Panels: (a) Network Coverage with TTL(2); (b) Network Coverage with TTL(3); (c) Message Complexity. Each panel is plotted against the number of peers for cycle-3, cycle-5 and cycle-6 networks.

The relative search performance of cycle-5 and cycle-6 networks is shown in Table 4.1, which reflects that any cycle-r (r > 3) network is better than cycle-3 networks in terms of search performance, and that a larger r gives higher performance with larger TTL searches. The Gnutella network uses dynamic querying over limited flooding and uses TTL(3) only rarely, for rare searches. Also, maintaining cycle-6 networks (HPC6) requires more bandwidth. So cycle-5 networks perform well enough while saving bandwidth. For that reason, in the next sections we mainly focus on cycle-5 networks.

Table 4.1: Simulation Results for Search Performance

Search performance (vs. cycle-3)    Cycle-5        Cycle-6
Network Coverage (TTL-2)            15-20% more    15-20% more
Network Coverage (TTL-3)            15-20% more    15-20% more
Message Complexity (TTL-2)          10-15% less    10-15% less
Message Complexity (TTL-3)          6-7% less      almost 20% less

4.4 Network Analysis

In this section we analyze the evolved networks. We compare the structural properties of the evolved networks with those of the base network and try to understand various performance-related issues.

Clustering Coefficient: The density of cycles of different sizes is represented by a more general term, the Clustering Coefficient (CC) of the network. A high CC implies a large number of redundant messages in flood-based querying [17]. The standard definition of CC gives the probability that two nearest neighbors of the same node are also mutual neighbors. There is also a general definition for higher order CC. The CC of order $x$ for node $i$ is the probability that there is a path of length $x$ between two neighbors of node $i$ [5]. If the number of such $x$-distances is $E_i(x)$ and $k_i$ is the degree of node $i$, then the order-$x$ CC of node $i$ is

$$C_i(x) = \frac{2 E_i(x)}{k_i (k_i - 1)}$$

and the CC of the whole network, $C(x)$, is the average of all $C_i(x)$ [5, 12, 13].
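The definition above can be turned into a small routine. The sketch below is an illustration rather than the code used for Fig. 4.6: it assumes an adjacency dict of sets, and it assumes (following the usual reading of [5]) that the distance between two neighbors of node $i$ is measured along paths that do not pass through $i$ itself. The function names are hypothetical.

```python
from collections import deque


def distances_avoiding(adj, src, banned, limit):
    """BFS distances from src up to `limit` hops, never passing through `banned`."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if dist[node] == limit:
            continue
        for nxt in adj[node]:
            if nxt != banned and nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def higher_order_cc(adj, i, x):
    """C_i(x) = 2 E_i(x) / (k_i (k_i - 1)) for node i and order x."""
    nbrs = list(adj[i])
    k = len(nbrs)
    if k < 2:
        return 0.0
    pairs_at_distance_x = 0  # this is E_i(x)
    for idx, a in enumerate(nbrs):
        dist = distances_avoiding(adj, a, i, x)
        for b in nbrs[idx + 1:]:
            if dist.get(b) == x:
                pairs_at_distance_x += 1
    return 2 * pairs_at_distance_x / (k * (k - 1))


def network_cc(adj, x):
    """C(x): average of C_i(x) over all nodes."""
    return sum(higher_order_cc(adj, i, x) for i in adj) / len(adj)


if __name__ == "__main__":
    ring5 = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
    for order in (1, 2, 3):
        print(order, network_cc(ring5, order))
```

On a plain 5-cycle the 1st and 2nd order CCs are zero and the 3rd order CC is one, which matches the statement that the CC of order $r$ reflects cycles of length $(r + 2)$.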

We have plotted the distribution of CCs of different orders for different types of networks with 60k peers in Fig. 4.6. The distribution looks like a Gaussian (bell-shaped) curve [5]. In cycle-3 networks, the 2nd order CC is very high. In cycle-5 networks, on the other hand, both the 1st and 2nd order CCs become zero, but the 3rd and 4th order CCs are very high. From the definition, the value of the rth order CC is related to the density of cycles of length (r + 2). Also, during query propagation, the number of redundant messages generated with TTL(r) (r > 1) is influenced by the CCs of order 1 to (2r − 2), which are related to cycles of length 3 to 2r. As a result, in cycle-3 networks a query starts to generate many redundant messages from TTL(2) onwards, which reduces the network coverage. A cycle-5 network shows very good performance with TTL(2), but its message complexity becomes high from TTL(3) onwards. We could have obtained better performance with TTL(3) if our algorithm had the capability of tuning the distribution of CCs.

Robustness of the topology: In order to test the robustness of the network evolved through HPC5, we have considered networks with 50k nodes and plotted the percentage of peers belonging to the largest component against the percentage of peers removed. We have removed peers in two ways: (i) random removal and (ii) pathological removal, in which the highest-degree nodes are removed first [17]. The upper (right) two curves in Fig. 4.7 represent the effect of random removal, whereas the lower (left) two curves represent pathological removal. Both cycle-3 and cycle-5 networks are extremely resilient to random removal: the largest connected component still contains 80% of the existing peers even after removing almost 75-80% of the nodes.

Figure 4.6: Distribution of Higher order Clustering Coefficient. (Plot: CC × 10^6 on a log scale versus order of CC.)

The networks get fragmented only after 85% of the peers are removed. But under pathological removal, both cycle-3 and cycle-5 networks get fragmented after removing only 6-7% of the total nodes. Overall, our empirical study indicates that robustness is similar for cycle-3 and cycle-5 networks.
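The robustness measurement can be sketched as follows. This is an illustrative reconstruction under assumptions, not the original experiment code: the graph is an adjacency dict of sets, pathological removal uses the initial degrees without recomputing them after each removal, and all function names are hypothetical.

```python
import random


def largest_component_size(adj, removed):
    """Size of the largest connected component after deleting `removed` nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        best = max(best, size)
    return best


def removal_order(adj, pathological, rng=None):
    """Order in which peers are removed: highest degree first, or random."""
    nodes = list(adj)
    if pathological:
        nodes.sort(key=lambda n: len(adj[n]), reverse=True)
    else:
        (rng or random).shuffle(nodes)
    return nodes


def largest_component_fraction(adj, fraction, pathological, rng=None):
    """Fraction of the surviving peers that belong to the largest component."""
    order = removal_order(adj, pathological, rng)
    removed = set(order[:int(fraction * len(order))])
    remaining = len(adj) - len(removed)
    if remaining == 0:
        return 0.0
    return largest_component_size(adj, removed) / remaining


if __name__ == "__main__":
    # Toy star graph: one hub (node 0) with nine leaves.
    star = {0: set(range(1, 10)), **{i: {0} for i in range(1, 10)}}
    print(largest_component_fraction(star, 0.1, pathological=True))
    print(largest_component_fraction(star, 0.1, pathological=False,
                                     rng=random.Random(0)))
```

The star graph is an extreme case: removing the single highest-degree node disconnects everything, while a random removal most likely hits a leaf and leaves the rest connected.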

4.5 Comparison with Zhu’s work

Zhu et al. recently presented a distributed algorithm, DCMP [18], to eliminate a fraction of the smaller length cycles from the network. The main differences between DCMP and HPC5 are described below.

• DCMP is activated by a peer only after it receives a duplicate message, and it generates control messages to detect a suitable edge to delete from the cycle that produced the duplicate. So, the network always contains some smaller length cycles. HPC5, on the other hand, is initiated whenever a new neighbor is required and ensures that the topology is always free from cycle-3 and cycle-4.

Figure 4.7: Comparison of Robustness between cycle-3 and cycle-5 Networks. (Plot: size of the largest component (%) versus percentage of nodes removed; curves: random and pathological removal in cycle-3 and cycle-5 networks.)

• DCMP is a general approach to eliminate cycles of length up to (2 ∗ TTL). But its authors have shown that removing only cycle-3 and cycle-4 provides the best performance, which supports our choice of cycle-5 networks.

• Due to the elimination of edges in DCMP, the average distance between two peers increases, and so does the diameter of the network. In simulation with 10k peers they have shown that covering almost 100% of the network requires TTL(8) flooding. In our simulation we have seen that the diameter for 10k nodes is 4-5, and TTL(4-5) is sufficient to cover the whole network.

• In DCMP, the networks are not as robust as the Gnutella network.

• In DCMP, no rewiring follows the elimination of edges, which decreases the average degree of peers and alters the degree distribution of the network.

So, networks generated by HPC5 achieve the best-case performance of DCMP without disturbing the parameters and robustness of Gnutella networks.

CHAPTER 5

Index Table Replication

Data availability in unstructured P2P networks can be improved in many ways, such as better search techniques, higher TTL searches, replication, etc. If multiple copies of the data, or of the index table of the data files, exist on multiple peers, then the chance of at least one copy being accessible increases. Our aim in this chapter is to increase the query hit probability (QHP) with less latency and search cost. To achieve this we use an index table replication technique.

Section 5.1 discusses the replica placement technique (RPT) that the Gnutella network follows. We describe some replica placement techniques in Section 5.2 and propose a search mechanism for cycle-5 networks with replication in Section 5.3. Section 5.4 estimates the update period needed to keep replicas consistent. We select an appropriate RPT and number of replicas through simulation in Section 5.5.


5.1 Replication in Gnutella Network

Objects or files are not replicated in the Gnutella network; only peers that request an object make a copy of it. Gnutella supports only index table replication. An ultra-peer creates a master index table (MIT) using its own index table and the index tables collected from its connected leaf-peers through the QRP protocol. So a query is forwarded only to those leaf-peers whose index tables match the query. Replication is performed only at the ultra-peer layer: each ultra-peer in the Gnutella network exchanges its MIT with its immediate neighbor ultra-peers, which saves one TTL of search [1].

5.2 Replica Placement Techniques

In each RPT two questions have to be answered:

1. How many replicas of an MIT are required to fulfill the requirements?

2. Where should the new replicas be created?

In any scheme, a larger number of replicas increases the number of query hits, but maintaining more replicas requires more storage and bandwidth. The different RPTs are described below.

1. Random RPT: In this scheme each ultra-peer replicates its MIT to a constant number of randomly selected ultra-peers. It has been found in the literature [10] that this scheme is very effective in Gnutella-like networks.

2. 1st Neighbor set RPT: In this scheme each ultra-peer replicates its MIT to a subset of its 1st ultra-neighbor set. The current Gnutella protocol follows this scheme.


3. 2nd Neighbor set RPT: In this scheme each ultra-peer replicates its MIT to a subset of its 2nd ultra-neighbor set. This saves two TTLs of search.

We prefer 2nd-neighbor RPT for our cycle-5 networks; the reasons are described in Section 5.5. So the search technique and replica update procedure are described with respect to networks using 2nd-neighbor RPT.

5.3 Search Technique (Flooding) in Cycle-5 Networks with Replication

Replication is performed only at the ultra-peer layer, where each ultra-peer replicates its own MIT to a constant number of other ultra-peers. The search technique is similar to the normal search of cycle-5 networks (dynamic query search over limited flooding). The goals of the proposed search technique are 1) high QHP, 2) low latency and 3) low search cost. With replication, the search (with TTL(2)) is performed in 5 steps:

1. TTL(2): A leaf-peer forwards the query to its connected ultra-peers. An ultra-peer searches its MIT, forwards the query to the selected leaf-peers (whose index tables match the query) and collects results (say, $R_1$) from them. The ultra-peer also searches the replicas it holds for matches to the query and collects the addresses of the matching ultra-peers (say, $U_{m1}$). The ultra-peer sends the combined results ($R_1 + U_{m1}$) back to the initiator and forwards the query to its connected ultra-neighbors with the TTL reduced by one.

2. TTL(1): In this step each ultra-peer behaves as in step 1, collects results ($R_2 + U_{m2}$) and sends them back to the initiator through the reverse overlay path. The query is forwarded to the next level of ultra-peers with TTL(0).

3. TTL(0): An ultra-peer that receives a query with TTL(0) performs the same activity as in step 1, and collects and sends back the results ($R_3 + U_{m3}$) to the initiator through the reverse overlay path. The query is not forwarded to the next level of ultra-peers in this step.

4. Contact Peers: The initiator continuously collects and checks the results. Search results present in $R_i$ are confirmed, whereas results in $U_{mi}$ are only probable (because they are returned by matching the hash table only, which does not guarantee the exact desired result). That is why the initiator first counts the number of results present in ($R_1 + R_2 + R_3$). If the number of results is sufficient, the query is stopped. Otherwise it starts to send the query with TTL(−1) to the ultra-peers in ($U_{m1} + U_{m2} + U_{m3}$) using the underlying network (not the overlay network) in multiple iterations and collects results. The iterations go on until a sufficient number of results is collected or all the peers in $U_{mi}$ have been contacted.

5. TTL(−1): An ultra-peer that receives a query with TTL(−1) only searches its master index table, forwards the query to the selected leaf-peers, collects results ($R_4$) from them and sends them back directly to the initiator using the underlying network. No replica is sent in this step.
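Step 4 of the procedure above, in which the initiator contacts the probable ultra-peers in batches, can be sketched as a simple loop. This is a hedged illustration only: the function name `contact_in_iterations`, the callback `query_peer` (standing in for a direct TTL(−1) query over the underlying network) and the `batch_size` parameter are assumptions introduced here, not names from the protocol.

```python
def contact_in_iterations(confirmed, candidates, query_peer, needed, batch_size):
    """Collect results until `needed` are available or all candidates are contacted.

    confirmed  -- results already confirmed by the overlay search (R1 + R2 + R3)
    candidates -- addresses of probable ultra-peers (Um1 + Um2 + Um3)
    query_peer -- callable that contacts one ultra-peer directly and returns its
                  list of results (hypothetical stand-in for a TTL(-1) query)
    Returns (results, number_of_peers_contacted).
    """
    results = list(confirmed)
    contacted = 0
    while len(results) < needed and contacted < len(candidates):
        batch = candidates[contacted:contacted + batch_size]
        for peer in batch:
            results.extend(query_peer(peer))
        contacted += len(batch)
    return results, contacted


if __name__ == "__main__":
    # Toy data: which results each probable ultra-peer would actually return.
    answers = {"u1": [], "u2": ["a"], "u3": ["b", "c"], "u4": ["d"]}
    found, used = contact_in_iterations(
        ["r1"], list(answers), lambda peer: answers[peer], needed=3, batch_size=2
    )
    print(found, used)
```

A larger `batch_size` lowers latency but may contact more peers than necessary, while a smaller one avoids overshooting at the price of more iterations.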

5.4 Replica Update

Periodically each ultra-peer sends its MIT to its 2nd-neighbors to keep the replicas up to date. A piggyback technique can be used to do this with minimal overhead, in which an ultra-peer appends its MIT to the results when a query comes from its 2nd-neighbor. We now calculate the update period for 2nd-neighbor RPT that keeps the traffic unchanged. Gnutella's TTL(3) search generates approximately $d_{uu}^3$ messages for each query. Let $n$ be the average number of queries issued in the network per unit time. So on average a total of $(n \cdot d_{uu}^3)$ messages are generated in the network in each unit of time. On the other hand, each ultra-peer updates its MIT at $d_{uu}^2$ ultra-peers in the cycle-5 network with 2nd-neighbor RPT. So each update period (say, $T$ units of time) costs $(U \cdot d_{uu}^2)$ replica messages and $(T \cdot n \cdot d_{uu}^2)$ query messages (neglecting the messages sent with TTL(−1)). Therefore, to keep the number of messages unchanged,

$$(U \cdot d_{uu}^2) + (T \cdot n \cdot d_{uu}^2) \le (T \cdot n \cdot d_{uu}^3)$$

$$T \ge \frac{U}{n \cdot (d_{uu} - 1)} \tag{5.1}$$

Equation 5.1 indicates that the minimum $T$ is inversely proportional to the query rate ($n$).
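Equation 5.1 can be evaluated directly, and the traffic inequality it is derived from can be checked for a given period. The sketch below uses hypothetical function names and illustrative sample values ($U = 100000$, $n = 1000$ queries per unit time, $d_{uu} = 30$); none of these numbers come from the experiments.

```python
def min_update_period(U, n, d_uu):
    """Smallest update period T allowed by equation 5.1."""
    return U / (n * (d_uu - 1))


def traffic_ok(U, n, d_uu, T):
    """True if replica plus TTL(2) query traffic over T time units does not
    exceed the traffic of plain TTL(3) flooding over the same time."""
    with_replication = U * d_uu ** 2 + T * n * d_uu ** 2
    without_replication = T * n * d_uu ** 3
    return with_replication <= without_replication


if __name__ == "__main__":
    U, n, d_uu = 100_000, 1_000, 30  # illustrative values only
    T = min_update_period(U, n, d_uu)
    print(T, traffic_ok(U, n, d_uu, 1.01 * T), traffic_ok(U, n, d_uu, 0.99 * T))
```

Doubling the query rate $n$ halves the minimum period, so busier networks can afford to refresh replicas more often without exceeding the original traffic.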

5.5 Simulation

We simulate different RPTs to find an appropriate RPT. We also investigate the effect of the number of replicas on search performance. For the experiments we have taken several cycle-5 networks of size 100k-1000k.

5.5.1 RPT Selection

Fig. 5.1(a) shows the QHP with TTL(2) for the different RPTs with a constant number of replicas (20 in our simulation) and without replication. QHP with replication is 10-20 times higher than without replication. In the figure, random and 2nd-neighbor RPTs are almost the same. The QHP of 1st-neighbor RPT is almost 10% lower than that of 2nd-neighbor RPT. So in this respect random and 2nd-neighbor RPTs are slightly better than 1st-neighbor RPT. On the other hand, with 1st-neighbor RPT we cannot increase the number of replicas of each MIT above $d_{uu}$. So even if we have sufficient bandwidth for more replicas, or use bandwidth efficiently (exchanging replicas with the piggyback technique), we cannot reach the desired QHP through 1st-neighbor RPT with lower TTL (Fig. 5.1(b)).

Fig. 5.1(b) shows the QHP without replication and with the different RPTs. In the simulation of random RPT, each ultra-peer replicates its MIT to 500 other ultra-peers selected at random. As described in [10], random RPT performs outstandingly. The QHP of random RPT with TTL(2) is almost 1 for smaller networks, and even for larger networks (1000k peers) it is more than 80%. In 2nd-neighbor RPT each ultra-peer replicates its MIT to all of its 2nd ultra-neighbors, so the average number of replicas of each MIT is $d_{uu}^2$. With this technique, the QHP (with TTL(2)) for a smaller number of replicas is almost the same as for random RPT. However, for a large number of replicas the QHP of 2nd-neighbor RPT is slightly lower (by almost 10% in our simulation) than that of random RPT. The QHP with TTL(1) in 2nd-neighbor RPT is equivalent to the QHP of a TTL(3) search without replication.

Although random RPT is better than 2nd-neighbor RPT in simulation, we prefer the latter because uniform random selection of ultra-peers is limited in the Gnutella network, where no peer has global knowledge. Ultra-peer selection depends on how a peer discovers other online peers. The selection is not uniform even if a peer collects the list of ultra-peers from hardcoded addresses and/or from GwebCache systems. So in real networks the apparently random selection is biased, and the randomness occurs within a small set of ultra-peers due to the unavailability of global knowledge. On the other hand, 2nd-neighbor RPT is deterministic, and maintaining replicas is also easy because the updated list of 2nd-neighbors is available periodically.

Figure 5.1: Query Hit Probability. Panels: (a) Comparison of RPTs for 20 Replicas with TTL(2) search; (b) Comparison of RPTs; (c) QHP Based on No. of Replicas with TTL(2). Panels (a) and (b) plot query hit probability against the number of peers; panel (c) plots it against the number of replicas.

To summarize, the benefits of 2nd-neighbor RPT are as follows. First, the QHP increases by almost 4-6 times with respect to a normal TTL(3) search. Second, the search latency is minimized. The search effectively looks two TTLs ahead. As a result, popular searches complete in TTL(1) and rare searches in TTL(2). Third, the search cost is minimal. The search is performed with a maximum of TTL(2), and in cycle-5 networks the message complexity at TTL(2) is nearly 1. In addition, it uses the physical network to contact the peers in $U_{mi}$. As a result, the message complexity of the whole search is close to 1. Fourth, the search returns only the required number of results (it avoids the overshooting problem of returning many more results than necessary). In the TTL(−1) step, the search is performed incrementally and stops after getting the required number of results. This benefit conflicts with latency: because multiple iterations are used in this step, the search latency may be high. So design engineers must choose the number of contacts in each iteration of the TTL(−1) step to balance the latency.

5.5.2 Number of Replica Selection

In Fig. 5.1(c) we have plotted the QHP against the number of replicas of each MIT for both random and 2nd-neighbor RPT. For this experiment we take a cycle-5 network of 500k peers, and in each iteration the number of replicas increases by 10. In the figure the QHP increases very rapidly at the initial stages and reaches almost 0.8 in random RPT (0.7 in 2nd-neighbor RPT) with 200 replicas (which is almost $0.5 \cdot d_{uu}^2$). Thereafter both curves rise slowly as the number of replicas increases. So designers can select the number of replicas depending on their requirements.

CHAPTER 6

Hit Ratio Based Design

Hit ratio is one of the main hurdles in the implementation of our approach. In equation 4.7, the average evolved hit ratio inversely indicates the cost of neighbor selection. Some applications/protocols require a higher hit ratio, and in some protocols the cost of collecting information about online ultra-peers is large. The hit ratio problem can be solved by providing some flexibility in neighbor selection. A peer may be allowed to select some neighbors even though doing so creates smaller length cycles in the network. In other words, some neighbors of a peer are selected using HPC5 and the rest are selected randomly, which may form smaller length cycles.

Let $m$ be the average ultra-degree of a peer, and let at most a fraction $x$ of the ultra-neighbors follow HPC5 to get the desired hit ratio. So, $mx$ neighbor selections are done with hit ratio $\langle H_{ev} \rangle$ and the rest of the neighbors with hit ratio 1 (as they are selected at random). Therefore the average number of iterations required to get a valid neighbor is

$$T_x < \frac{1}{m} \cdot \left[ mx \cdot \frac{1}{\langle H_{ev} \rangle} + m(1-x) \cdot 1 \right]$$

Let $\langle H_d \rangle$ be the desired hit ratio. Therefore,

$$\frac{1}{\langle H_d \rangle} \le T_x$$

$$\frac{1}{\langle H_d \rangle} < \frac{1}{m} \cdot \left[ mx \cdot \frac{1}{\langle H_{ev} \rangle} + m(1-x) \cdot 1 \right]$$

$$x > \left[ \frac{1}{\langle H_d \rangle} - 1 \right] \Big/ \left[ \frac{1}{\langle H_{ev} \rangle} - 1 \right] \tag{6.1}$$

From equation 6.1 we can claim that if at most a fraction

$$r_x = \left[ \frac{1}{\langle H_d \rangle} - 1 \right] \Big/ \left[ \frac{1}{\langle H_{ev} \rangle} - 1 \right]$$

of the ultra-neighbors of each peer follows HPC5, then the resultant hit ratio will be at least $\langle H_d \rangle$.

We have plotted the desired ($\langle H_d \rangle$) and experimental hit ratio with respect to $r_x$ in Fig. 6.1(a). For this experiment we take a cycle-5 network of 500k peers and calculate the minimum hit ratio (which is our initial $\langle H_d \rangle$) using equation 4.7. Then in each iteration we increase $\langle H_d \rangle$ by 0.05, calculate $r_x$ and find the experimental hit ratio. In Fig. 6.1(a) the experimental results are always higher than the desired hit ratio ($\langle H_d \rangle$). The gap between the two curves raises the question of fine-tuning the value of $r_x$, which we leave for future work.
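The relation between $r_x$, $\langle H_d \rangle$ and $\langle H_{ev} \rangle$ can be checked with a few lines of code. This is a sketch under assumptions: the function names are hypothetical, the sample hit ratios are illustrative, and the clamp of $r_x$ to 1 (for the case where the desired hit ratio is already met with all neighbors following HPC5) is an addition made here for convenience.

```python
def hpc5_fraction(h_desired, h_evolved):
    """r_x: the largest fraction of ultra-neighbors that may follow HPC5
    while the resulting hit ratio stays at or above h_desired."""
    if not (0 < h_evolved < 1) or not (0 < h_desired <= 1):
        raise ValueError("expected 0 < h_evolved < 1 and 0 < h_desired <= 1")
    r_x = (1 / h_desired - 1) / (1 / h_evolved - 1)
    return min(1.0, r_x)  # clamp: an assumption, not part of the formula for r_x


def expected_iterations(x, h_evolved):
    """Average number of attempts per neighbor when a fraction x follows HPC5
    (hit ratio h_evolved) and the rest are chosen at random (hit ratio 1)."""
    return x / h_evolved + (1 - x)


if __name__ == "__main__":
    h_ev, h_d = 0.6, 0.8  # illustrative values only
    r_x = hpc5_fraction(h_d, h_ev)
    print(r_x, 1 / expected_iterations(r_x, h_ev))
```

With a fraction exactly equal to $r_x$ the resulting hit ratio equals the desired one, and any smaller fraction gives a higher hit ratio, which agrees with the claim that at most $r_x$ of the neighbors should follow HPC5.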

In Fig. 6.1(b) and 6.1(c) we have shown the performance of networks with different $r_x$ relative to the performance of cycle-5 networks (in which all neighbors of each peer follow HPC5). As expected, the performance of the network in both network coverage and message complexity degrades as $r_x$ decreases. With TTL(3) the degradation rate is lower than with TTL(2). So there is a trade-off between performance and the average hit ratio of the network. A system designer sets the hit ratio depending on the requirements of the network/application.

Figure 6.1: Effects of $r_x$ in the network. Panels: (a) Hit Ratio (desired hit ratio and hit ratio obtained, against the fraction of neighbors that follow HPC5); (b) Network Coverage (ratio of nodes discovered with TTL-2 and TTL-3); (c) Message Complexity (ratio of message complexity with TTL-2 and TTL-3).

CHAPTER 7

Conclusion and Future work

7.1 Conclusion

In this dissertation, we have presented a handshake protocol which is compatible with Gnutella-like unstructured two-tier overlay topologies. We have shown that the protocol is far more efficient than existing protocols. A relation among TTL, minimum cycle length in the topology and network performance has been observed and proposed. Our proposed replica placement technique is useful for both popular and rare searches, with less latency and search cost.

A major fraction of internet bandwidth is occupied by popular Gnutella-like unstructured networks. P2P implementation of 2nd generation web applications requires huge internet bandwidth, which calls for optimum utilization of bandwidth. In this regard our protocol can be instrumental in improving the scalability of P2P networks.


7.2 Future Works

In simulation we have shown the distribution of higher order CCs, but our algorithm does not have any control parameter to tune it to a desired distribution. In future work we want to develop such an algorithm. We also want to model our approach theoretically.


