



    ENG KEONG LUA, JON CROWCROFT, AND MARCELO PIAS, UNIVERSITY OF CAMBRIDGE

    RAVI SHARMA, NANYANG TECHNOLOGICAL UNIVERSITY

    STEVEN LIM, MICROSOFT ASIA

    ABSTRACT

Over the Internet today, computing and communications environments are significantly more complex and chaotic than classical distributed systems, lacking any centralized organization or hierarchical control. There has been much interest in emerging Peer-to-Peer (P2P) network overlays because they provide a good substrate for creating large-scale data sharing, content distribution, and application-level multicast applications. These P2P overlay networks attempt to provide a long list of features, such as: selection of nearby peers, redundant storage, efficient search/location of data items, data permanence or guarantees, hierarchical naming, trust and authentication, and anonymity. P2P networks potentially offer an efficient routing architecture that is self-organizing, massively scalable, and robust in the wide-area, combining fault tolerance, load balancing, and an explicit notion of locality. In this article we present a survey and comparison of various Structured and Unstructured P2P overlay networks. We categorize the various schemes into these two groups in the design spectrum, and discuss the application-level network performance of each group.

A SURVEY AND COMPARISON OF PEER-TO-PEER OVERLAY NETWORK SCHEMES

    SECOND QUARTER 2005, VOLUME 7, NO. 2

    www.comsoc.org/pubs/surveys

Peer-to-peer (P2P) overlay networks are distributed systems in nature, without any hierarchical organization or centralized control. Peers form self-organizing overlay networks that are overlayed on the Internet Protocol (IP) networks, offering a mix of various features such as robust wide-area routing architecture, efficient search of data items, selection of nearby peers, redundant storage, permanence, hierarchical naming, trust and authentication, anonymity, massive scalability, and fault tolerance. Peer-to-peer overlay systems go beyond the services offered by client-server systems by having symmetry in roles, where a client may also be a server: each peer allows other systems to access its resources and supports resource sharing, which requires fault-tolerance, self-organization, and massive scalability properties. Unlike Grid systems, P2P overlay networks do not arise from the collaboration between established and connected groups of systems, nor do they have a more reliable set of resources to share.

We can view P2P overlay network models as spanning a wide spectrum of the communication framework, which specifies a fully-distributed, cooperative network design with peers building a self-organizing system. Figure 1 shows an abstract P2P overlay architecture, illustrating the components in the overlay.

The Network Communications layer describes the network characteristics of desktop machines connected over the Internet, or of small wireless or sensor-based devices that are connected in an ad-hoc manner; the dynamic nature of peers poses challenges in the communication paradigm. The Overlay Nodes Management layer covers the management of peers, which includes discovery of peers and routing algorithms for optimization. The Features Management layer deals with the security, reliability, fault resiliency, and aggregated resource availability aspects of maintaining the robustness of P2P systems. The Services Specific layer supports the underlying P2P infrastructure and the application-specific components through the scheduling of parallel and computation-intensive tasks, and content and file management; meta-data describes the content stored across the P2P peers and the location information. The Application-level layer is concerned with tools, applications, and services that are implemented with specific functionalities on top of the underlying P2P overlay infrastructure. Thus, there are two classes of P2P overlay networks: Structured and Unstructured.

The technical meaning of structured is that the P2P overlay network topology is tightly controlled and content is placed not at random peers but at specified locations that will make


subsequent queries more efficient. Such structured P2P systems use the Distributed Hash Table (DHT) as a substrate, in which data object (or value) location information is placed deterministically at the peers with identifiers corresponding to the data object's unique key. DHT-based systems consistently assign uniform random NodeIDs to the set of peers, drawn from a large space of identifiers. Data objects are assigned unique identifiers called keys, chosen from the same identifier space. Keys are mapped by the overlay network protocol to a unique live peer in the overlay network. The P2P overlay networks support the scalable storage and retrieval of {key,value} pairs, as illustrated in Fig. 2. Given a key, a store operation (put(key,value)) or a lookup retrieval operation (value=get(key)) can be invoked to store and retrieve the data object corresponding to the key, which involves routing requests to the peer corresponding to the key.
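As a concrete illustration of this interface, the following minimal Python sketch (our own toy example, not from the paper: the class name SimpleDHT is hypothetical, and a real DHT routes each request across the overlay instead of holding a global view of all peers) maps keys onto peers with a hash function and stores {key,value} pairs at the responsible peer:

    import hashlib

    def hash_id(data: bytes, bits: int = 160) -> int:
        # Map arbitrary data into the identifier space, as DHTs do with SHA-1.
        return int.from_bytes(hashlib.sha1(data).digest(), "big") % (2 ** bits)

    class SimpleDHT:
        """Toy, centralized stand-in for a DHT: a key is owned by the peer whose
        identifier is its closest successor on the identifier ring."""
        def __init__(self, peer_ids):
            self.peers = sorted(peer_ids)               # NodeIDs of live peers
            self.store = {pid: {} for pid in self.peers}

        def responsible_peer(self, key_id: int) -> int:
            for pid in self.peers:                      # first NodeID >= key, wrapping around
                if pid >= key_id:
                    return pid
            return self.peers[0]

        def put(self, key: bytes, value) -> None:
            kid = hash_id(key)
            self.store[self.responsible_peer(kid)][kid] = value

        def get(self, key: bytes):
            kid = hash_id(key)
            return self.store[self.responsible_peer(kid)].get(kid)

    dht = SimpleDHT(peer_ids=[hash_id(f"peer{i}".encode()) for i in range(8)])
    dht.put(b"song.mp3", "held by 10.0.0.7")
    print(dht.get(b"song.mp3"))                         # -> "held by 10.0.0.7"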

Each peer maintains a small routing table consisting of its neighboring peers' NodeIDs and IP addresses. Lookup queries or messages are forwarded across overlay paths to peers in a progressive manner, toward NodeIDs that are closer to the key in the identifier space. Different DHT-based systems have different organization schemes for the data objects and their key space and routing strategies. In theory, DHT-based systems can guarantee that any data object can be located in a small O(logN) number of overlay hops on average, where N is the number of peers in the system. The underlying network path between two peers can be significantly different from the path on the DHT-based overlay network. Therefore, the lookup latency in DHT-based P2P overlay networks can be quite high and could adversely affect the performance of the applications running over them. Plaxton et al. [1] provide an elegant algorithm that achieves nearly optimal latency on graphs that exhibit power-law expansion [2], while at the same time preserving the scalable routing properties of the DHT-based system. However, this algorithm requires pair-wise probing between peers to determine latencies, and it is unlikely to scale to a large number of peers in the overlay. DHT-based systems [3-7] are an important class of P2P routing infrastructures. They support the rapid development of a wide variety of Internet-scale applications ranging from distributed file and naming systems to application-layer multicast. They also enable scalable, wide-area retrieval of shared information.

In 1999 Napster [8] pioneered the idea of a peer-to-peer file sharing system supporting a centralized file search facility. It was the first system to recognize that requests for popular content need not be sent to a central server but instead could be handled by the many peers that have the requested content. Such P2P file-sharing systems are self-scaling in that as more peers join the system, they add to the aggregate download capability. Napster achieved this self-scaling behavior by using a centralized search facility based on file lists provided by each peer; thus, it does not require much bandwidth for the centralized search. Such a system has the issue of a single point of failure due to the centralized search mechanism. However, a lawsuit filed by the Recording Industry Association of America (RIAA) attempted to force Napster to shut down the free-for-all file-sharing service for copyrighted digital music, literally its killer application. Nevertheless, this paradigm caught the imagination of platform providers and users alike.

Gnutella [9-11] is a decentralized system that distributes both the search and download capabilities, establishing an overlay network of peers. It is the first system that makes use of an unstructured P2P overlay network. An unstructured P2P system is composed of peers joining the network with some loose rules, without any prior knowledge of the topology. The network uses flooding as the mechanism to send queries across the overlay with a limited scope. When a peer receives the flood query, it sends a list of all content matching the query to the originating peer. While flooding-based techniques are effective for locating highly replicated items and are resilient to peers joining and leaving the system, they are poorly suited for locating rare items. Clearly this approach is not scalable, as the load on each peer grows linearly with the total number of queries and the system size. Thus, unstructured P2P networks face one basic problem: peers readily become overloaded, and the system does not scale when handling a high rate of aggregate queries and sudden increases in system size.
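To make the scoped-flooding mechanism concrete, here is a minimal sketch (our own illustration, assuming a hypothetical in-memory topology; real Gnutella peers exchange query messages carrying unique IDs and a TTL over the network, whereas this toy version walks an adjacency list):

    def flood_query(peers, origin, query, ttl=4):
        """peers maps each peer name to (list of neighbor names, set of local content).
        Returns the set of peers whose content matches the query within the TTL scope."""
        hits, visited = set(), {origin}
        frontier = [(origin, ttl)]
        while frontier:
            peer, remaining_ttl = frontier.pop()
            neighbors, content = peers[peer]
            if query in content:
                hits.add(peer)                     # this peer would reply to the originator
            if remaining_ttl == 0:
                continue                           # scope limit reached, stop forwarding
            for n in neighbors:
                if n not in visited:               # avoid re-flooding a peer
                    visited.add(n)
                    frontier.append((n, remaining_ttl - 1))
        return hits

    topology = {
        "a": (["b", "c"], {"rare.txt"}),
        "b": (["a", "c"], {"popular.mp3"}),
        "c": (["a", "b"], {"popular.mp3"}),
    }
    print(flood_query(topology, origin="a", query="popular.mp3"))   # -> {'b', 'c'}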

Although structured P2P networks can efficiently locate rare items, since key-based routing is scalable, they incur significantly higher overheads than unstructured P2P networks for popular content. Consequently, over the Internet today the decentralized unstructured P2P overlay networks are more commonly used.

Figure 1. An abstract P2P overlay network architecture. From bottom to top: the network communications layer (network); the overlay nodes management layer (resources discovery, routing and location lookup); the features management layer (security management, reliability and fault resiliency, resource management); the services-specific layer (services scheduling, meta-data, services messaging, services management); and the application-level layer (tools, applications, services).

Figure 2. Application interface for structured DHT-based P2P overlay systems: a distributed structured P2P overlay application invokes Put(Key, Value), Remove(Key), and Value=Get(Key) on the distributed hash table, which spans the participating peers.


However, there are recent efforts on key-based routing (KBR) API abstractions [12] that allow more application-specific functionalities to be built over these common basic KBR API abstractions, and on OpenDHT (an open, publicly accessible DHT service)1 [13], a unification platform that provides developers with a basic DHT service model running on a set of infrastructure hosts, so that DHT-based overlay applications can be deployed without the burden of maintaining a DHT and with an ease of use intended to spur the deployment of DHT-based applications. In contrast, unstructured P2P overlay systems are ad-hoc in nature and do not present the possibility of being unified under a common platform for application development.

In later sections of this article we will describe the key features of structured P2P and unstructured P2P overlay networks and their operation functionalities. After providing a basic understanding of the various network overlay schemes in these two classes, we proceed to evaluate the various overlay schemes in both classes and discuss their developments. Then we use the following taxonomies to make comparisons between the various discussed structured and unstructured P2P overlay schemes:

• Decentralization: examine whether the overlay system is distributed.
• Architecture: describe the overlay system architecture with respect to its operation.
• Lookup protocol: the lookup query protocol adopted by the overlay system.
• System parameters: the required system parameters for the overlay system operation.
• Routing performance: the lookup routing protocol performance in overlay routing.
• Routing state: the routing state and scalability of the overlay system.
• Peers join and leave: describe the behavior of the overlay system under churn and self-organization.
• Security: look into the security vulnerabilities of the overlay systems.
• Reliability and fault resiliency: examine how robust the overlay system is when subjected to faults.

Finally, we conclude with some thoughts on the relative applicability of each class to some of the research problems that arise in ad-hoc, location-based, and content delivery networks, as examples.

STRUCTURED P2P OVERLAY NETWORKS

In this category, the overlay network assigns keys to data items and organizes its peers into a graph that maps each data key to a peer. This structured graph enables efficient discovery of data items using the given keys. However, in its simple form, this class of systems does not support complex queries, and it is necessary to store a copy or a pointer to each data object (or value) at the peer responsible for the data object's key. In this section we survey and compare the following structured P2P overlay networks: Content Addressable Network (CAN) [5], Tapestry [7], Chord [6], Pastry [4], Kademlia [14], and Viceroy [15].

    CONTENT ADDRESSABLE NETWORK (CAN)

The Content Addressable Network (CAN) [5] is a distributed decentralized P2P infrastructure that provides hash-table functionality on an Internet-like scale. CAN is designed to be scalable, fault-tolerant, and self-organizing. The architectural design is a virtual multi-dimensional Cartesian coordinate space on a multi-torus. This d-dimensional coordinate space is completely logical. The entire coordinate space is dynamically partitioned among all the peers (N peers) in the system such that every peer possesses its individual, distinct zone within the overall space. A CAN peer maintains a routing table that holds the IP address and virtual coordinate zone of each of its neighbors in the coordinate space. A CAN message includes the destination coordinates. Using the neighbor coordinates, a peer routes a message toward its destination using simple greedy forwarding to the neighbor peer that is closest to the destination coordinates. CAN has a routing performance of O(d N^(1/d)) and its routing state is bounded by 2d neighbors. As shown in Fig. 3, which we adapted from the CAN paper [5], the virtual coordinate space is used to store {key,value} pairs as follows: to store a pair {K,V}, key K is deterministically mapped onto a point P in the coordinate space using a uniform hash function.

Figure 3. Example of a 2-d CAN coordinate space before and after peer Z joins. Before the join, peer X's coordinate neighbor set is {A, B, C, D}; after the join, peer X's coordinate neighbor set is {A, B, D, Z} and the new peer Z's coordinate neighbor set is {A, C, D, X}. A sample routing path from peer X to peer E is also shown.

1 The OpenDHT project was formerly named OpenHash. http://opendht.org/


The lookup protocol retrieves the entry corresponding to key K: any peer can apply the same deterministic hash function to map K onto point P and then retrieve the corresponding value V from point P. If the requesting peer or its immediate neighbors do not own the point P, the request must be routed through the CAN infrastructure until it reaches the peer in whose zone P lies. A peer maintains the IP addresses of those peers that hold coordinate zones adjoining its own zone. This set of immediate neighbors in the coordinate space serves as a coordinate routing table that enables efficient routing between points in this space.
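The greedy forwarding step can be sketched as follows (a simplified illustration, not the CAN implementation: it assumes a hypothetical neighbors table that maps each neighbor's NodeID to its zone centre on a d-dimensional unit torus, and it ignores zone boundaries):

    def torus_distance(p, q):
        """Euclidean distance between two points on the d-dimensional unit torus."""
        return sum(min(abs(a - b), 1 - abs(a - b)) ** 2 for a, b in zip(p, q)) ** 0.5

    def can_next_hop(own_id, own_point, neighbors, destination):
        """Greedy CAN-style forwarding: hand the message to the neighbor whose zone
        centre is closest to the destination coordinates, or deliver it locally."""
        best_id, best_point = own_id, own_point
        for nid, point in neighbors.items():
            if torus_distance(point, destination) < torus_distance(best_point, destination):
                best_id, best_point = nid, point
        return None if best_id == own_id else best_id    # None means: we are closest

For example, with d = 2, a peer whose zone centre is (0.1, 0.1) and whose neighbors sit at (0.3, 0.1) and (0.1, 0.3) would forward a message destined for (0.5, 0.1) to the neighbor at (0.3, 0.1).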

A new peer that joins the system must be allocated its own portion of the coordinate space. This is achieved by splitting an existing peer's zone in half, retaining half for that peer and allocating the other half to the new peer. CAN has an associated DNS domain name that is resolved into the IP address of one or more CAN bootstrap peers (which maintain a partial list of CAN peers). For a new peer to join a CAN network, the peer looks up the CAN domain name in DNS to retrieve a bootstrap peer's IP address, similar to the bootstrap mechanism in [16]. The bootstrap peer supplies the IP addresses of some randomly chosen peers in the system. The new peer randomly chooses a point P and sends a JOIN request destined for point P. Each CAN peer uses the CAN routing mechanism to forward the message until it reaches the peer in whose zone P lies. That peer then splits its zone in half and assigns the other half to the new peer. For example, in a 2-dimensional space a zone would first be split along the X dimension, then the Y, and so on as the splitting continues. The {K,V} pairs from the half zone being handed over are also transferred to the new peer. After obtaining its zone, the new peer learns the IP addresses of its coordinate neighbor set from the previous occupant of point P, and adds that peer itself to the set.

When a peer leaves the CAN network, an immediate takeover algorithm ensures that one of the failed peer's neighbors, after starting a takeover timer, takes over the zone. The peer updates its neighbor set to eliminate those peers that are no longer its neighbors, and every peer in the system then sends soft-state updates to ensure that all of their neighbors learn about the change and update their own neighbor sets. The number of neighbors a peer maintains depends only on the dimensionality of the coordinate space (i.e., 2d) and is independent of the total number of peers in the system.

The example in Fig. 3 illustrates a simple routing path from peer X to peer E and a new peer Z joining the CAN network. For a d-dimensional space partitioned into n equal zones, the average routing path length is (d/4)(n^(1/d)) hops and individual peers maintain a list of 2d neighbors. Thus, the number of peers (or zones) can grow without increasing per-peer state, while the average path length grows as O(n^(1/d)). Since there are many different paths between two points in the space, when one or more of a peer's neighbors fail, the peer can still route along the next best available path.
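As a quick numeric illustration of these bounds (our own example, not from the paper), for n = 65,536 equal zones:

    # Average CAN path length (d/4) * n**(1/d) and per-peer state 2*d
    # for an illustrative network of n = 65536 equal zones.
    n = 65536
    for d in (2, 4, 8):
        avg_hops = (d / 4) * n ** (1 / d)
        print(f"d={d}: ~{avg_hops:.0f} hops on average, {2 * d} neighbors per peer")
    # d=2: ~128 hops, 4 neighbors; d=4: ~16 hops, 8 neighbors; d=8: ~8 hops, 16 neighbors

Increasing the dimensionality therefore shortens paths substantially at the cost of only a few extra routing table entries per peer.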

The CAN algorithm can be improved by maintaining multiple, independent coordinate spaces, with each peer in the system being assigned a different zone in each coordinate space, called a reality. For a CAN with r realities, a single peer is assigned r coordinate zones, one in each reality, and holds r independent neighbor sets. The contents of the hash table are replicated in every reality, thus improving data availability. For further data availability improvement, CAN could use k different hash functions to map a given key onto k points in the coordinate space, which results in the replication of a single {key,value} pair at k distinct peers in the system. A {key,value} pair is then unavailable only when all k replicas are simultaneously unavailable. Thus, queries for a particular hash table entry could be forwarded to all k peers in parallel, thereby reducing the average query latency, and the reliability and fault resiliency properties are enhanced.

CAN could be used in large-scale storage management systems such as OceanStore [17], Farsite [18], and Publius [19]. These systems require efficient insertion and retrieval of content in a large distributed storage network with a scalable indexing mechanism. Another potential application for CAN is the construction of wide-area name resolution services that decouple the naming scheme from the name resolution process, enabling an arbitrary and location-independent naming scheme.

    CHORD

Chord [6] uses consistent hashing [20] to assign keys to its peers. Consistent hashing is designed to let peers enter and leave the network with minimal interruption. This decentralized scheme tends to balance the load on the system, since each peer receives roughly the same number of keys and there is little movement of keys when peers join and leave the system. In a steady state, for a total of N peers in the system, each peer maintains routing state information for about O(logN) other peers. This is efficient, and performance degrades gracefully when that information is out of date.

The consistent hash function assigns peers and data keys an m-bit identifier using SHA-1 [21] as the base hash function. A peer's identifier is chosen by hashing the peer's IP address, while a key identifier is produced by hashing the data key. The length of the identifier m must be large enough to make the probability of two keys hashing to the same identifier negligible. Identifiers are ordered on an identifier circle modulo 2^m. Key k is assigned to the first peer whose identifier is equal to or follows k in the identifier space. This peer is called the successor peer of key k, denoted by successor(k). If identifiers are represented as a circle of numbers from 0 to 2^m - 1, then successor(k) is the first peer clockwise from k. The identifier circle is called the Chord ring. To maintain the consistent hashing mapping when a peer n joins the network, certain keys previously assigned to n's successor need to be reassigned to n. When peer n leaves the Chord system, all of its assigned keys are reassigned to n's successor. Therefore, peers join and leave the system with (logN)^2 performance; no other changes in the assignment of keys to peers need to occur. In Fig. 4 (adapted from [6]), the Chord ring is depicted with m = 6. This particular ring has ten peers and stores five keys. The successor of the identifier 10 is peer 14, so key 10 will be located at NodeID 14. Similarly, if a peer were to join with identifier 26, it would take over the key with identifier 24 from the peer with identifier 32.
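A minimal sketch of this successor rule (our own illustration, using the ten-peer, m = 6 ring of Fig. 4; successor() here scans a sorted list, whereas Chord peers of course discover the successor by routing):

    M = 6                                           # identifier bits, as in Fig. 4
    PEERS = [4, 8, 14, 21, 32, 38, 42, 48, 51, 56]  # NodeIDs on the ring

    def successor(key_id: int) -> int:
        """First peer whose identifier is equal to or follows key_id clockwise
        on the identifier circle modulo 2**M."""
        key_id %= 2 ** M
        for pid in sorted(PEERS):
            if pid >= key_id:
                return pid
        return min(PEERS)                           # wrap around the circle

    print({k: successor(k) for k in (10, 24, 30, 38, 54)})
    # -> {10: 14, 24: 32, 30: 32, 38: 38, 54: 56}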

Each peer in the Chord ring needs to know how to contact its current successor peer on the identifier circle. Lookup queries involve matching a key against NodeIDs: a query for a given identifier can be passed around the circle via these successor pointers until it encounters a pair of peers that straddle the desired identifier; the second peer in the pair is the peer the query maps to. An example is presented in Fig. 4, whereby peer 8 performs a lookup for key 54. Peer 8 invokes the find successor operation for this key, which eventually returns the successor of that key, i.e. peer 56. The query visits every peer on the circle between peer 8 and peer 56, and the response is returned along the reverse of the path.

As m is the number of bits in the key/NodeID space, each peer n maintains a routing table with up to m entries, called the finger table. The ith entry in the table at peer n contains the identity of the first peer s that succeeds n by at least 2^(i-1)


on the identifier circle, i.e., s = successor(n + 2^(i-1)), where 1 <= i <= m. Peer s is the ith finger of peer n (n.finger[i]). A finger table entry includes both the Chord identifier and the IP address (and port number) of the relevant peer. Figure 4 shows the finger table of peer 8: the first finger entry for this peer points to peer 14, as peer 14 is the first peer that succeeds (8 + 2^0) mod 2^6 = 9. Similarly, the last finger of peer 8 points to peer 42, i.e., the first peer that succeeds (8 + 2^5) mod 2^6 = 40. In this way, peers store information about only a small number of other peers, and each peer knows more about peers closely following it on the identifier circle than about more distant peers. Also, a peer's finger table generally does not contain enough information to directly determine the successor of an arbitrary key k. For example, peer 8 cannot determine the successor of key 34 by itself, as the successor of this key (peer 38) is not present in peer 8's finger table.
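Building on the successor() sketch above, the finger table construction can be illustrated as follows (again our own toy code, using the ring of Fig. 4 with a global view rather than Chord's distributed lookups):

    M = 6
    PEERS = sorted([4, 8, 14, 21, 32, 38, 42, 48, 51, 56])

    def successor(ident: int) -> int:
        ident %= 2 ** M
        return next((p for p in PEERS if p >= ident), PEERS[0])

    def finger_table(n: int):
        """The ith entry points to successor(n + 2**(i-1)), for 1 <= i <= M."""
        return [successor(n + 2 ** (i - 1)) for i in range(1, M + 1)]

    print(finger_table(8))       # -> [14, 14, 14, 21, 32, 42], matching Fig. 4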

When a peer joins the system, the successor pointers of some peers need to be changed. It is important that the successor pointers are up to date at all times, because the correctness of lookups is not guaranteed otherwise. The Chord protocol uses a stabilization protocol [6] running periodically in the background to update the successor pointers and the entries in the finger table. The correctness of the Chord protocol relies on the fact that each peer is aware of its successors. When peers fail, it is possible that a peer does not know its new successor and has no chance to learn about it. To avoid this situation, peers maintain a successor list of size r, which contains the peer's first r successors. When the successor peer does not respond, the peer simply contacts the next peer on its successor list. Assuming that peer failures occur independently with probability p, the probability that every peer on the successor list fails is p^r. Increasing r makes the system more robust, and by tuning this parameter any degree of robustness with good reliability and fault resiliency may be achieved. The following applications are examples of how Chord could be used.

In cooperative mirroring, or the cooperative file system (CFS) [22], multiple providers of content cooperate to store and serve each other's data. Spreading the total load evenly over all participant hosts lowers the total cost of the system, since each participant needs to provide capacity only for the average load, not for the peak load. There are two layers in CFS. The DHash (Distributed Hash) layer performs block fetches for the peer, distributes the blocks among the servers, and maintains cached and replicated copies. The Chord-layer distributed lookup system is used to locate the servers responsible for a block.

Chord-based DNS [23] provides a lookup service with host names as keys and IP addresses (and other host information) as values. Chord could provide a DNS-like service by hashing each host name to a key [20]. Chord-based DNS would require no special servers, while ordinary DNS relies on a set of special root servers. DNS also requires manual management of the routing information (DNS records) that allows clients to navigate the name server hierarchy, whereas Chord automatically maintains the correctness of the analogous routing information. DNS only works well when host names are hierarchically structured to reflect administrative boundaries; Chord imposes no naming structure. DNS is specialized to the task of finding named hosts or services, while Chord can also be used to find data object values that are not tied to particular machines.

    TAPESTRY

Sharing similar properties with Pastry, Tapestry [7] employs decentralized randomness to achieve both load distribution and routing locality. The difference between Pastry and Tapestry lies in the handling of network locality and data object replication; this difference will become more apparent in the section on Pastry. Tapestry's architecture uses a variant of the Plaxton et al. [1] distributed search technique, with additional mechanisms to provide availability, scalability, and adaptation in the presence of failures and attacks. Plaxton et al. propose a distributed data structure, known as the Plaxton mesh, optimized to support a network overlay for locating named data objects that are connected to one root peer. Tapestry, on the other hand, uses multiple roots for each data object to avoid a single point of failure. In the Plaxton mesh, peers can take on the roles of servers (where data objects are stored), routers (which forward messages), and clients (originators of requests). It uses local routing maps at each peer to incrementally route overlay messages to the destination ID digit by

Figure 4. A Chord ring with an identifier circle consisting of ten peers (N4, N8, N14, N21, N32, N38, N42, N48, N51, N56) and five data keys (K10, K24, K30, K38, K54). It shows the path followed by a query originated at peer 8 for the lookup of key 54, and the finger table entries for peer 8: N8+1 -> N14, N8+2 -> N14, N8+4 -> N14, N8+8 -> N21, N8+16 -> N32, N8+32 -> N42.


digit, e.g., ***7 => **97 => *297 => 3297, where * is the wildcard, similar to longest-prefix routing in the CIDR IP address allocation architecture [24]. The resolution of digits from right to left or left to right is arbitrary. A peer's local routing map has multiple levels, where each level represents a match of the suffix up to a given digit position in the ID space. The nth peer that a message reaches shares a suffix of at least length n with the destination ID. To locate the next router, the (n + 1)th level map is examined to locate the entry matching the value of the next digit in the destination ID. This routing method guarantees that any existing unique peer in the system can be located within at most log_B N logical hops, in a system with N peers using NodeIDs of base B. Since the peer's local routing map assumes that the preceding digits all match the current peer's suffix, the peer needs to keep only a small, constant-size set of B entries at each routing level, yielding a routing map of fixed constant size: (entries/map) x (number of maps) = B x log_B N.
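A small sketch of this suffix-matching next-hop selection (our own simplification, assuming fixed-length hexadecimal NodeIDs and fully populated routing maps; Tapestry's real maps also track backup and proximity information):

    def shared_suffix_len(a: str, b: str) -> int:
        """Number of trailing digits two IDs have in common."""
        n = 0
        while n < len(a) and a[-1 - n] == b[-1 - n]:
            n += 1
        return n

    def tapestry_next_hop(current_id: str, dest_id: str, routing_maps):
        """routing_maps[level][digit] names a peer sharing a suffix of length `level`
        with current_id and having `digit` in the next position to the left."""
        level = shared_suffix_len(current_id, dest_id)
        if level == len(dest_id):
            return current_id                   # we are the destination (or its surrogate)
        wanted_digit = dest_id[-1 - level]      # next digit to resolve, right to left
        return routing_maps[level][wanted_digit]

In the ***7 => **97 step above, a peer ending in 7 (a level-1 match with 3297) would consult its level-1 map for the entry under digit 9.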

The lookup and routing mechanisms of Tapestry are based on matching the suffix in the NodeID as described above. Routing maps are organized into routing levels, where each level contains entries that point to a set of peers closest in distance that match the suffix for that level. In addition, each peer holds a list of pointers to peers referred to as neighbors. Tapestry stores the locations of all data object replicas to increase semantic flexibility and to allow the application level to choose from a set of data object replicas based on some selection criterion, such as date. Each data object may include an optional application-specific metric in addition to the distance metric; for example, the OceanStore [17] global storage architecture finds the closest cached document replica that satisfies the closest distance metric. Such queries deviate from the simple find-first semantics, and Tapestry will route the message to the closest k distinct data objects.

Tapestry handles the problem of a single point of failure due to a single data object root peer by assigning multiple roots to each object. Tapestry makes use of surrogate routing to select root peers incrementally during the publishing process, which inserts location information into Tapestry. Surrogate routing provides a technique by which any identifier can be uniquely mapped to an existing peer in the network. A data object's root or surrogate peer is chosen as the peer that exactly matches the data object's ID, X. Such a peer is unlikely to exist, given the sparse nature of the NodeID space; nevertheless, Tapestry assumes peer X exists by attempting to route a message to it. A route to a non-existent identifier will encounter empty neighbor entries at various positions along the way. The goal is then to select an existing link that can act as an alternative to the desired link, i.e. the one associated with a digit of X. Routing terminates when a map is reached where the only non-empty routing entry belongs to the current peer. That peer is then designated as the surrogate root for the data object. While surrogate routing may take additional hops to reach a root compared with the Plaxton algorithm, the additional number of hops is small. Thus, surrogate routing in Tapestry has minimal routing overhead relative to the static global Plaxton algorithm.

Tapestry addresses the issue of fault adaptation and maintains cached content for fault recovery by relying on TCP timeouts and periodic UDP heartbeat packets to detect link and server failures during normal operation, and by rerouting through its neighbors. During fault operation, each entry in the neighbor map maintains two backup neighbors in addition to the closest/primary neighbor. On a testbed of 100 machines with simulations of 1000 peers, the results in [7] show good routing rates and maintenance bandwidths during instantaneous failures and continuing churn.

A variety of different applications have been designed and implemented on Tapestry. Tapestry is self-organizing, fault-tolerant, and resilient under load, and it is a fundamental component of the OceanStore system [17, 25]. OceanStore is a global-scale, highly available storage utility deployed on the PlanetLab [26] testbed. OceanStore servers use Tapestry to disseminate encoded file blocks efficiently, and clients can quickly locate and retrieve nearby file blocks by their ID, despite server and network failures. Other Tapestry applications include Bayeux [27], an efficient self-organizing application-level multicast system, and SpamWatch [28], a decentralized spam-filtering system that uses a similarity search engine implemented on Tapestry.

    PASTRY

Pastry [4], like Tapestry, makes use of Plaxton-like prefix routing to build a self-organizing decentralized overlay network, where each peer routes client requests and interacts with local instances of one or more applications. Each peer in Pastry is assigned a 128-bit peer identifier (NodeID). The NodeID is used to give a peer's position in a circular NodeID space, which ranges from 0 to 2^128 - 1. The NodeID is assigned randomly when a peer joins the system, and it is assumed to be generated such that the resulting set of NodeIDs is uniformly distributed in the 128-bit space. For a network of N peers, Pastry routes to the numerically closest peer to a given key in less than log_B N steps under normal operation (where B = 2^b is a configuration parameter with a typical value of b = 4). The NodeIDs and keys are considered as sequences of digits with base B. Pastry routes messages to the peer whose NodeID is numerically closest to the given key. A peer normally forwards the message to a peer whose NodeID shares with the key a prefix that is at least one digit (or b bits) longer than the prefix that the key shares with the current peer's NodeID.

As shown in Fig. 5, each Pastry peer maintains a routing table, a neighborhood set, and a leaf set. A peer's routing table is designed with log_B N rows, where each row holds B - 1 entries. The B - 1 entries at row n of the routing table each refer to a peer whose NodeID shares the current peer's NodeID in the first n digits, but whose (n+1)th digit has one of the B - 1 possible values other than the (n+1)th digit in the current peer's NodeID. Each entry in the routing table contains the IP address of a peer whose NodeID has the appropriate prefix, chosen according to a close proximity metric. The value of b involves a tradeoff between the size of the populated portion of the routing table [approximately (log_B N) x (B - 1) entries] and the maximum number of hops required to route between any pair of peers (log_B N). The neighborhood set M contains the NodeIDs and IP addresses of the |M| peers that are closest in proximity to the local peer. The network proximity that Pastry uses is based on a scalar metric, such as the number of IP routing hops or the geographic distance. The leaf set L is the set of peers with the |L|/2 numerically closest larger NodeIDs and the |L|/2 numerically closest smaller NodeIDs, relative to the current peer's NodeID. Typical values for |L| and |M| are B or 2B. Even with concurrent peer failures, eventual delivery is guaranteed with good reliability and fault resiliency, unless |L|/2 peers with adjacent NodeIDs fail simultaneously (|L| is a configuration parameter with a typical value of 16 or 32).
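A simplified sketch of Pastry's forwarding decision (our own illustration, assuming hexadecimal digits, a fully populated routing table indexed as routing_table[row][digit], and omitting the fallback rules used when an entry is missing):

    def shared_prefix_len(a: str, b: str) -> int:
        """Number of leading digits two IDs have in common."""
        n = 0
        while n < len(a) and a[n] == b[n]:
            n += 1
        return n

    def pastry_next_hop(node_id: str, key: str, leaf_set: set, routing_table):
        """Deliver within the leaf set if the key falls in its range; otherwise forward
        to a peer whose NodeID shares a prefix with the key one digit longer than ours."""
        if leaf_set and min(leaf_set) <= key <= max(leaf_set):
            # numerically closest peer (string comparison works for fixed-length hex IDs)
            return min(leaf_set | {node_id}, key=lambda p: abs(int(p, 16) - int(key, 16)))
        row = shared_prefix_len(node_id, key)
        return routing_table[row][key[row]]

With the state of Fig. 5, peer 37A0F1 holding key B57B2D shares no prefix with it (row 0) and would forward to the routing table entry under digit B in its first row.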

When a new peer (with NodeID X) joins the network, it needs to initialize its routing table and inform other peers of its presence. This new peer needs to know the address of a contact or bootstrap peer in the network. A small list of contact peers, based on a proximity metric (e.g., the RTT to each peer) to provide better performance, could be provided as a


service in the network, and the new peer could select one of these peers at random as its contact. As a result, this new peer will initially know about a close Pastry peer A. Peer X then asks A to route a join message with the key equal to X. Pastry routes the join message to the existing peer Z whose NodeID is numerically closest to X. Upon receiving the join request, peers A and Z and all peers encountered on the path from A to Z send their routing tables to X. Finally, X informs any peers that need to be aware of its arrival. This ensures that X initializes its routing table with appropriate information and that the routing tables in all other affected peers are updated based on the information received. As peer A is topologically close to the new peer X, A's neighborhood set is used to initialize X's neighborhood set.

A Pastry peer is considered to have left the overlay network when its immediate neighbors in the NodeID space can no longer communicate with it. To replace this failed peer in the leaf sets of its neighbors, each neighbor in the NodeID space contacts the live peer with the largest index on the side of the failed peer and requests its leaf table. Let the received leaf set be L', which overlaps the current peer's leaf set L and contains peers with nearby NodeIDs not present in L. The appropriate peer is chosen for insertion into L, after verifying that it is actually still alive by contacting it. The neighborhood set is not used in the routing of messages, but it is still kept fresh because this set plays an important role in exchanging information about nearby peers. Therefore, a peer contacts each member of the neighborhood set periodically to test whether it is still alive. If a member is not responding, the contacting peer asks other members for their neighborhood sets, checks the proximity of each of the newly discovered peers, and updates its own neighborhood set accordingly.

Pastry is being used in the implementation of a scalable application-level multicast infrastructure called Scribe [29, 30]. Instead of relying on a multicast infrastructure in the network, which is not widely available, the participating peers route and distribute multicast messages using only unicast network services. Scribe supports a large number of groups with large numbers of members per group.

Figure 5. A Pastry peer's routing table, leaf set, and neighborhood set, together with an example routing path: routing from peer 37A0F1 with key B57B2D among the live peers B24EA3, B5324F, B573AB, B573D6, B57B2D, and B581F1. The example shows the routing state of a Pastry peer with NodeID 37A0F1, with b = 4, L = 16, M = 32; digits are in hexadecimal and x denotes an arbitrary suffix.


Scribe is built on top of Pastry, which is used to create and manage groups and to build efficient multicast trees for the dissemination of messages to each group. Scribe builds a multicast tree formed by joining the Pastry routes from each group member to a rendezvous point associated with a group. Membership maintenance and message dissemination in Scribe leverage the robustness, self-organization, locality, and reliability properties of Pastry.

SplitStream [31] allows a cooperative multicasting environment where peers contribute resources in exchange for using the service. The key idea is to split the content into k stripes and to multicast each stripe using a separate tree. Peers join as many trees as there are stripes they wish to receive, and they specify an upper bound on the number of stripes that they are willing to forward. The challenge is to construct this forest of multicast trees such that an interior peer in one tree is a leaf peer in all the remaining trees and the bandwidth constraints specified by the peers are satisfied. This ensures that the forwarding load can be spread across all participating peers. For example, if all peers wish to receive k stripes and they are willing to forward k stripes, SplitStream will construct a forest such that the forwarding load is evenly balanced across all peers while achieving low delay and link stress across the network.

Squirrel [32] uses Pastry as its data object location service to identify and route to peers that cache copies of a requested data object. It facilitates the mutual sharing of Web data objects among client peers, and enables the peers to export their local caches to other peers in the network, thus creating a large shared virtual Web cache. Each peer then performs both Web browsing and Web caching, without the need for expensive and dedicated hardware for centralized Web caching. Squirrel faces a new challenge in that peers in a decentralized cache incur the overhead of having to serve each other's requests, and this extra load must be kept low.

PAST [33, 34] is a large-scale P2P persistent storage utility based on Pastry. The PAST system is composed of peers connected to the Internet, where each peer is capable of initiating and routing client requests to insert or retrieve files. Peers may also contribute storage to the system. A storage system like PAST is attractive because it exploits the multitude and diversity of peers in the Internet to achieve strong persistence and high availability. This eliminates the need for the physical transport of storage media to protect backup and archival data, and the need for explicit mirroring to ensure high availability and throughput for shared data. A global storage utility also facilitates the sharing of storage and bandwidth, thus permitting a group of peers to jointly store or publish content that would exceed the capacity or bandwidth of any individual peer.

Pastiche [35] is a simple and inexpensive backup system that exploits excess disk capacity to perform P2P backup with no administrative costs. The cost and inconvenience of backup are unavoidable and often prohibitive. Small-scale solutions require significant administrative effort, while large-scale solutions require the aggregation of substantial demand to justify the capital costs of a large, centralized repository. Pastiche builds on three architectures: Pastry, which provides the scalable P2P network with self-administered routing and peer location; content-based indexing [36, 37], which provides flexible discovery of redundant data for similar files; and convergent encryption [18], which allows hosts to use the same encrypted representation for common data without sharing keys.

    KADEMLIA

The Kademlia [14] P2P decentralized overlay network takes the basic approach of assigning each peer a NodeID in the 160-bit key space, and {key,value} pairs are stored on peers with IDs

close to the key. A NodeID-based routing algorithm is used to locate peers near a destination key. One of the key architectural elements of Kademlia is the use of a novel XOR metric for the distance between points in the key space. XOR is symmetric, and it allows peers to receive lookup queries from precisely the same distribution of peers contained in their routing tables. Kademlia can send a query to any peer within an interval, allowing it to select routes based on latency or to send parallel, asynchronous queries. It uses a single routing algorithm throughout the process to locate peers near a particular ID.

Every message transmitted by a peer includes its peer ID, permitting the recipient to record the sender's existence. Data keys are also 160-bit identifiers. To locate {key,value} pairs, Kademlia relies on the notion of distance between two identifiers. Given two 160-bit identifiers a and b, it defines the distance between them as their bitwise exclusive OR (XOR) interpreted as an integer, d(a,b) = a XOR b, which is a non-Euclidean metric. Thus d(a,a) = 0, d(a,b) > 0 if a != b, and for all a,b: d(a,b) = d(b,a). XOR also satisfies the triangle inequality, d(a,b) + d(b,c) >= d(a,c), since d(a,c) = d(a,b) XOR d(b,c) and x XOR y <= x + y for all x,y >= 0. Similar to Chord's clockwise circle metric, XOR is unidirectional: for any given point x and distance d > 0, there is exactly one point y such that d(x,y) = d. The unidirectional approach ensures that all lookups for the same key converge along the same path, regardless of the originating peer. Hence, caching {key,value} pairs along the lookup path alleviates hot spots.
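The XOR metric and the resulting bucket index are easy to illustrate (a minimal sketch of the idea, not Kademlia's implementation; the 160-bit IDs are shortened to small integers here):

    def xor_distance(a: int, b: int) -> int:
        """Kademlia's distance between two identifiers: their bitwise XOR as an integer."""
        return a ^ b

    def bucket_index(own_id: int, other_id: int) -> int:
        """Index i of the k-bucket covering distances in [2**i, 2**(i+1))."""
        return xor_distance(own_id, other_id).bit_length() - 1

    def k_closest(target: int, known_peers, k: int = 3):
        """The k known peers closest to the target under the XOR metric."""
        return sorted(known_peers, key=lambda p: xor_distance(p, target))[:k]

    peers = [0b0001, 0b0100, 0b0111, 0b1010, 0b1101]
    print(k_closest(0b0101, peers))   # -> [4, 7, 1] (distances 1, 2, 4)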

Each peer in the network stores a list of {IP address, UDP port, NodeID} triples for peers of distance between 2^i and 2^(i+1) from itself. These lists are called k-buckets. Each k-bucket is kept sorted by the time last seen, i.e. the least recently accessed peer at the head and the most recently accessed peer at the tail. The Kademlia routing protocol consists of the following RPCs:

• PING probes a peer to check whether it is active.
• STORE instructs a peer to store a {key,value} pair for later retrieval.
• FIND_NODE takes a 160-bit ID and returns {IP address, UDP port, NodeID} triples for the k peers it knows of that are closest to the target ID.
• FIND_VALUE is similar to FIND_NODE and returns {IP address, UDP port, NodeID} triples, except when the peer has received a STORE for the key, in which case it just returns the stored value.

Importantly, a Kademlia peer must be able to locate the k closest peers to some given NodeID. The lookup initiator starts by picking X peers from its closest non-empty k-bucket, and then sends parallel, asynchronous FIND_NODE requests to the X peers it has chosen. If FIND_NODE fails to return a peer that is any closer than the closest peers already seen, the initiator resends the FIND_NODE to all of the k closest peers it has not already queried (see the sketch below). It can route for lower latency because it has the flexibility to choose any one of k peers to forward a request to. To find a {key,value} pair, a peer starts by performing a FIND_VALUE lookup to find the k peers with IDs closest to the key. To join the network, a peer n must have a contact to an already participating peer m. Peer n inserts peer m into the appropriate k-bucket and then performs a peer lookup for its own peer ID. Peer n then refreshes all k-buckets farther away than its closest neighbor; during this refresh, peer n populates its own k-buckets and inserts itself into other peers' k-buckets as needed.
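The iterative lookup just described can be sketched as follows (a simplified, sequential version with our own names; real Kademlia issues the FIND_NODE RPCs concurrently, with a concurrency parameter alpha, over UDP):

    def iterative_find_node(target: int, own_contacts, find_node_rpc, k: int = 20):
        """Repeatedly query the closest known peers until no closer peer is learned.
        find_node_rpc(peer, target) stands in for the network call and returns the k
        contacts that `peer` knows closest to `target` (identifiers as integers)."""
        shortlist = sorted(own_contacts, key=lambda p: p ^ target)[:k]
        queried = set()
        while True:
            closest_before = shortlist[0]
            for peer in [p for p in shortlist if p not in queried]:
                queried.add(peer)
                for contact in find_node_rpc(peer, target):
                    if contact not in shortlist:
                        shortlist.append(contact)
            shortlist = sorted(set(shortlist), key=lambda p: p ^ target)[:k]
            if shortlist[0] == closest_before:     # no progress toward the target: done
                return shortlist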

    VICEROY

The Viceroy [15] P2P decentralized overlay network is designed to handle the discovery and location of data and resources in a dynamic butterfly fashion. Viceroy employs


consistent hashing [20] to distribute data so that it is balanced across the set of servers and resilient to servers joining and leaving the network. It utilizes the DHT to manage the distribution of data among a changing set of servers, allowing peers to contact any server in the network to locate any stored resource by name. In addition, Viceroy maintains an architecture that approximates a butterfly network [38], as shown in Fig. 6 (adapted from the diagram in [15]), and uses links between successors and predecessors, ideas based on Kleinberg [39] and Barriere et al. [40], on the ring (a key is mapped to its successor on the ring) for short distances. The diameter of its overlay is better than that of CAN, and its degree is better than that of Chord, Tapestry, and Pastry.

When N peers are operational, each peer selects one of the logN levels with near equal probability. A level-l peer's two downward edges are connected to peers at level l + 1: a down-right edge is added to a long-range contact at level l + 1 at a distance of about 1/2^l away, and a down-left edge is added at a close distance on the ring to a level l + 1 peer. An up edge to a nearby peer at level l - 1 is included if l > 1. Level-ring links are then added to the next and previous peers of the same level l. Routing is done by climbing, using the up connections, to a level-1 peer, and then proceeding down the levels of the tree using the down links, moving from level l to level l + 1; the route follows either the edge to the nearby down link or the edge to the farther down link, depending on whether the remaining distance is greater than 1/2^l. This continues recursively until a peer with no down links is reached, which is in the vicinity of the target peer, so a vicinity lookup is then performed using the ring and level-ring links. For reliability and fault resiliency, when a peer leaves the overlay network, it hands over its key pairs to a successor from the ring pointers and notifies other peers to find a replacement. It is formalized and proved in [15] that the routing process requires only O(logN) hops, where N is the number of peers in the network.

DISCUSSION OF STRUCTURED P2P OVERLAY NETWORKS

The algorithm of Plaxton was originally devised to route Web queries to nearby caches, and it influenced the design of Pastry, Tapestry, and Chord. The method of Plaxton has logarithmic expected join/leave complexity, and it ensures that queries never travel further in network distance than the peer where the key is stored. However, Plaxton has several disadvantages: it requires global knowledge to construct the overlay; an object's root peer is a single point of failure; there is no insertion or deletion of peers; and there is no avoidance of hotspot congestion. The Pastry and Tapestry schemes rely on a DHT to provide the substrate for semantic-free, data-centric references, through the assignment of a semantic-free NodeID such as a 160-bit key, and they perform efficient request routing between lookup peers using an efficient and dynamic routing infrastructure in which peers leave and join. Overlays that perform query routing in DHT-based systems have strong theoretical foundations, guaranteeing that a key can be found if it exists, but they do not capture the relationships between the object name and its content. Moreover, DHT-based systems have a few problems in terms of data object lookup latency:

• For each overlay hop, peers route a message to the next intermediate peer, which can be located very far away with regard to the physical topology of the underlying IP network. This can result in high network delay and unnecessary long-distance network traffic, even on a deterministically short overlay path of O(logN) hops, where N is the number of peers.
• DHT-based systems assume that all peers participate equally in hosting published data objects or their location information. This would lead to a bottleneck at low-capacity peers.

The Pastry and Tapestry routing algorithms build a randomized approximation of a hypercube, and routing toward an object is done by matching longer address suffixes until either the object's root peer or another peer with a nearby copy is found. Rhea et al. [41] used the FreePastry implementation to discover that most lookups fail to complete when there is excessive churn. They claim that short-lived peers leave the overlay with lookups that have not yet timed out. They outlined design issues pertaining to DHT-based performance under churn: lookup timeouts, reactive versus periodic recovery of peers, and the choice of nearby neighbors. Since reactive recovery increases traffic on congested links, they make use of periodic recovery, and for lookups they suggest an exponentially weighted moving average of each neighbor's response time instead of a fixed timeout. They discovered that the selection of nearby neighbors requires global sampling, which is more effective than simply sampling the neighbors' neighbors. On the other hand, Castro et al. [42] use the MSPastry implementation to show that Pastry can cope with high churn rates by achieving shorter routing paths and lower maintenance overhead. Pastry exploits network locality to reduce routing delays by measuring the round-trip time (RTT) to a small number of peers when building the routing tables. For each routing table entry, it chooses one of the closest peers in the network topology whose NodeID satisfies the constraints for that entry. The average IP delay of each Pastry hop increases exponentially until it reaches the average delay between two peers in the network. Chord's routing protocol is similar to Pastry's location algorithm in PAST; however, Pastry is a prefix-based routing protocol and differs in other details from Chord.

Chord maps keys and peers to an identifier ring and guarantees that queries make a logarithmic number of hops and that keys are well balanced. It uses consistent hashing to minimize the disruption of keys when peers leave and join the overlay network. Consistent hashing ensures that the total number of caches responsible for a particular object is limited, and when these caches change, the minimum number of object references will move to maintain load balancing.

Figure 6. A simplified Viceroy network with three levels. For simplicity, the up link, ring, and level-ring links are not shown.


Since the Chord lookup service presents a solution where each peer maintains a logarithmic number of long-range links, it gives a logarithmic join/leave update cost. In Chord, the network is maintained appropriately by a background maintenance process, i.e. a periodic stabilization procedure that updates predecessor and successor pointers to cater to newly joined peers. Liben-Nowell et al. [43] ask how often the stabilization procedure needs to run to ensure the success of Chord's lookups, and whether determining the optimum involves measuring peers' behavior. Stoica et al. [6] demonstrate the advantage of recursive lookups over iterative lookups, but propose as future work improving resiliency to network partitions using a small set of known peers, and reducing the number of messages per lookup by increasing the size of each step around the ring with a larger finger in each peer. Alima et al. [44] propose a correction-on-use mechanism in their Distributed K-ary Search (DKS), which is similar to Chord, to reduce the communication costs incurred by Chord's stabilization procedure. The mechanism makes corrections to expired routing entries by piggybacking on lookups and insertions.

The work on CAN uses a constant-degree network for routing lookup requests. It organizes the overlay peers into a d-dimensional Cartesian coordinate space, with each peer taking ownership of a specific hyper-rectangular region of the space. The key motivation of the CAN design is the argument that Plaxton-based schemes would not perform well under churn, given that peer departures and arrivals affect a logarithmic number of peers. Each CAN peer maintains a routing table with its immediately adjacent neighbors. A peer joining the CAN causes the peer owning its region of space to split, giving half to the new peer and retaining half. A peer leaving the CAN passes its NodeID, its neighbors' NodeIDs and IP addresses, and its {key,value} pairs to a takeover peer. CAN has a number of tunable parameters to improve routing performance: the dimensionality of the hypercube; network-aware routing, by choosing the neighbor closest to the destination in CAN space; multiple peers in a zone, allowing CAN to deliver messages to any one of the peers in the zone in an anycast manner; uniform partitioning, made possible by comparing the volume of a region with the volumes of neighboring regions when a peer joins; and landmark-based placement, which causes peers, at join time, to probe a set of well-known landmark hosts and estimate each of their network distances. There are open research questions on CAN's resiliency, load balancing, locality, and latency/hop-count costs.

Kademlia's XOR topology-based routing closely resembles the first phase of the routing algorithms of Pastry, Tapestry, and Plaxton. These three algorithms need an additional algorithmic structure for discovering the target peer among the peers that share the same prefix but differ in the next b-bit digit. It was argued in [14] that the Pastry and Tapestry algorithms require secondary routing tables of size O(2^b) in addition to the main tables of size O(2^b log_{2^b} N), which increases the cost of bootstrapping and maintenance. Kademlia resolves this in its own distinctive way through the use of the XOR metric for the distance between 160-bit NodeIDs, and each peer maintains a list of contact peers, in which longer-lived peers are given preference. Kademlia can easily be optimized with a base other than 2 by configuring the bucket table so that it approaches the target of b bits per hop. This requires one bucket for each range of peers at distance [j·2^(160-(i+1)b), (j+1)·2^(160-(i+1)b)] from the local peer, for each 0 < j < 2^b and 0 ≤ i < 160/b, which yields no more than (2^b - 1)·log_{2^b} N buckets.
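The XOR metric itself is easy to state in code. The sketch below computes the Kademlia distance between two 160-bit NodeIDs and the index of the bucket a contact would fall into, for the simple base-2 configuration with one bucket per power-of-two distance range (not the generalized b-bit-per-hop variant described above); deriving NodeIDs from names with SHA-1 is an assumption made for the example:

```python
import hashlib

def node_id(name: str) -> int:
    """160-bit NodeID, here derived from a name with SHA-1 for illustration."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

def xor_distance(a: int, b: int) -> int:
    """Kademlia's distance metric: bitwise XOR of the two identifiers."""
    return a ^ b

def bucket_index(local: int, remote: int) -> int:
    """Index i of the bucket covering distances in [2**i, 2**(i+1)); ranges over 0..159."""
    d = xor_distance(local, remote)
    if d == 0:
        raise ValueError("a peer does not bucket itself")
    return d.bit_length() - 1

me, other = node_id("peer-A"), node_id("peer-B")
print(bucket_index(me, other))   # contacts sharing a longer prefix land in lower buckets
```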

The Viceroy overlay network (a butterfly network) presents an efficient network construction, proved formally in [15], and maintains a constant-degree network in a dynamic environment, similar to CAN. Viceroy has a logarithmic diameter, similar to Chord, Pastry, and Tapestry. Its diameter is proven to be better than CAN's, and its degree is better than that of Chord, Pastry, and Tapestry. Routing is achieved in O(logN) hops (where N is the number of peers) with nearly optimal congestion. Peers joining and leaving the system induce O(logN) hops and require only O(1) peers to change their states. Li et al. [45] suggest that the limited degree may increase the risk of network partition or limit the use of local neighbors; its advantage, however, is its constant-degree overlay properties. Kaashoek et al. [46] highlight its fault-tolerant blind spots and its complexity.

Further work was done by Viceroy's authors with the proposal of a two-tier, locality-aware DHT [47], which gives lower-degree properties at each lower-tier peer, and a bounded-degree P2P overlay using de Bruijn graphs [48]. Since de Bruijn graphs give very short average routing distances and high resilience to peer failure, they are well suited to Structured P2P overlay networks. The P2P overlays discussed above are greedy and, for a given degree, suboptimal because their routing distances are longer. There are continuing improvements to de Bruijn P2P overlay proposals [46, 49-52]. A de Bruijn graph of degree k (where k can be varied) can achieve an asymptotically optimal diameter (the maximum hop count between any two peers in the graph) of log_k N, where N is the total number of peers in the system. Given O(logN) neighbors per peer, the de Bruijn graph's hop count is O(logN/loglogN). A good comparison study has been done by Loguinov et al. [50], who use Chord, CAN, and de Bruijn as examples to study the routing performance and resilience of P2P overlay networks, including graph expansion and clustering properties. They confirm that, for a given degree k, de Bruijn graphs offer the best diameter and average distance between all pairs of peers (which determines the expected response time in number of hops), optimal resilience (k-peer connectivity), large bisection width (the bisection width of a graph provides tight upper bounds on its achievable capacity), and good node (peer) expansion, which guarantees little overlap between parallel paths to any destination peer (if a peer fails, very few alternative paths to a destination peer are affected).
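To make the de Bruijn idea concrete, the sketch below labels peers with length-D strings over a k-symbol alphabet (so N = k^D) and routes by shifting in the destination label one symbol per hop, which is why the diameter is log_k N = D. Real de Bruijn-based DHT proposals must additionally handle identifier spaces that are not perfectly full, which is omitted here:

```python
def de_bruijn_neighbors(label: str, k: int):
    """Out-neighbors of a node in a degree-k de Bruijn graph: shift left, append a symbol."""
    return [label[1:] + str(s) for s in range(k)]

def de_bruijn_route(src: str, dst: str):
    """Route by shifting the destination's symbols in one at a time (at most len(dst) hops)."""
    path, current = [src], src
    for symbol in dst:
        current = current[1:] + symbol      # one of the current node's out-neighbors
        path.append(current)
    return path

# Degree k = 2 and labels of length D = 4, so N = 16 peers and diameter 4.
print(de_bruijn_neighbors("0110", 2))       # ['1100', '1101']
print(de_bruijn_route("0110", "1011"))      # reaches '1011' in 4 hops
```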

P2P DHT-based overlay systems are susceptible to security breaches from attacks by malicious peers. One simple attack on a DHT-based overlay system is a malicious peer returning wrong data objects in response to lookup queries. The authenticity of data objects can be handled with cryptographic techniques, using cost-effective public keys and/or content hashes to securely link together the different pieces of a data object. Such techniques can neither prevent undesirable data objects from polluting the search results nor prevent denial-of-service attacks. Malicious peers may still be able to corrupt, deny access to, or falsely respond to lookup queries for replicas of a data object, and may impersonate other peers so that replicas are stored on illegitimate peers. Sit et al. [53] provide a very clear description of the security considerations involving adversaries that are peers in the DHT overlay lookup system but do not follow the protocol correctly: malicious peers can eavesdrop on the communication between other nodes; a malicious peer can only receive data objects addressed to its own IP address, so the IP address can serve as a weak form of peer identity; and malicious peers can collude, giving believable false information. They presented a taxonomy of possible attacks, involving routing deficiencies due to corrupted lookup routing and updates; vulnerability to partitioning and virtualization into incorrect networks when new peers join and contact malicious peers; lookup and storage attacks; inconsistent behavior of peers; denial-of-service attacks that prevent access by overloading the victim's network connection; and unsolicited responses to lookup queries. Design principles for defenses can be classified as defining verifiable system invariants for lookup queries, NodeID assignment, and peer selection in routing; cross-checking using random queries; and avoiding single points of responsibility.

Castro et al. [54] relate the problems of secure routing for Structured P2P overlay networks to the possibility that a small number of malicious, conspiring peers could compromise the overlay system (also termed an Eclipse attack [55]). They presented a design and analysis of techniques for secure peer joining, routing table maintenance, and robust message forwarding in the presence of malicious peers in Structured P2P overlays. The technique can tolerate up to 25 percent malicious peers while providing good performance when the number of compromised peers is small. However, this defense restricts the flexibility necessary to implement optimizations such as proximity neighbor selection, and it only works in Structured P2P overlay networks. Singh et al. [55] therefore propose a defense that prevents Eclipse attacks in both Structured and Unstructured P2P overlay networks by bounding the degree of overlay peers: since the in-degree of attacking peers is likely to be higher than the average in-degree of legitimate peers, legitimate peers choose their neighbors from the subset of overlay peers whose in-degree is below a threshold. Even with in-degree bounding, it is still possible for an attacker to consume the in-degree of legitimate peers and prevent other legitimate peers from referring to them. Bounding the out-degree is therefore also necessary, so that legitimate peers choose neighbors from the subset of overlay peers whose in-degree and out-degree are both below some threshold. An auditing scheme is also introduced to prevent peers from misstating their in-degree and out-degree.
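A minimal sketch of the degree-bounding idea, assuming each candidate peer advertises (auditable) in-degree and out-degree counts; the thresholds are illustrative and the auditing protocol itself is not shown:

```python
from dataclasses import dataclass
import random

@dataclass
class Candidate:
    addr: str
    in_degree: int    # how many peers point to this candidate
    out_degree: int   # how many peers this candidate points to

def choose_neighbors(candidates, want, max_in=16, max_out=16):
    """Pick neighbors only from peers whose advertised degrees are below the bounds."""
    eligible = [c for c in candidates
                if c.in_degree <= max_in and c.out_degree <= max_out]
    random.shuffle(eligible)              # avoid preferring any particular eligible peer
    return eligible[:want]

pool = [Candidate("honest-1", 9, 12),
        Candidate("honest-2", 14, 10),
        Candidate("suspect-1", 120, 15)]  # abnormally high in-degree, as in an Eclipse attempt
print([c.addr for c in choose_neighbors(pool, want=2)])
```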

Another good survey on security issues in P2P overlays comes from Wallach [56], which describes secure routing primitives: assigning NodeIDs, maintaining routing tables, and forwarding messages securely. He also suggested looking at distributed auditing, treating the sharing of disk space resources in P2P overlay networks as a barter economy, and mechanisms to implement such an economy. The work on BarterRoam [57] sheds light on a formal computational approach, applicable to P2P overlay systems, for exchanging resources so that higher-level functionality, such as incentive-compatible economic mechanisms, can be layered on top. Formal game-theoretic approaches and models [58-60] could be constructed to analyze the equilibria of user strategies and to implement incentives for cooperation. The ability to overcome free-rider problems in P2P overlay networks would clearly improve such systems' reliability and value.

In a Sybil attack, described by Douceur [61], there is a large number of potentially malicious peers in the system and no central authority to certify peers' identities, so it becomes very difficult to trust a claimed identity. Dingledine et al. [62] propose puzzle schemes, including the use of micro-cash, which allow peers to build up reputations. Although this proposal provides a degree of accountability, it still allows a resourceful attacker to launch attacks. Many P2P computational models of trust and reputation have emerged to assess trustworthy behavior through feedback and interaction mechanisms. The basic assumption of these computational trust and reputation models is that peers engage in bilateral interactions and that evaluations are performed on a globally agreed scale. However, most such trust and reputation systems suffer from two problems, as highlighted by Despotovic et al. [63]: extensive implementation overhead and vague trust-related model semantics. The causes lie in the aggregation of feedback about all peers in the overlay network in order to assess the trustworthiness of a single peer, and in counter-intuitive feedback aggregation strategies that produce outputs that are difficult to interpret. They proposed a simple probabilistic estimation technique, using maximum likelihood estimation, that is sufficient to reduce these problems when this feedback aggregation strategy is employed.
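One simple form of such a maximum-likelihood estimator (a sketch of the general idea, not the exact model in [63]): each witness reports whether the target peer behaved well, each witness is assumed to report truthfully with some known probability, and the target's probability of behaving well is chosen to maximize the likelihood of the observed reports:

```python
def likelihood(theta, reports):
    """Probability of the observed reports if the target behaves well with probability theta.

    reports: list of (observation, witness_truthfulness), where observation is 1 for a
    'behaved well' report and witness_truthfulness is the probability that the witness
    reports what it actually saw.
    """
    total = 1.0
    for obs, truthful in reports:
        p_good_report = theta * truthful + (1 - theta) * (1 - truthful)
        total *= p_good_report if obs == 1 else (1 - p_good_report)
    return total

def ml_trust(reports, grid=1000):
    """Grid-search maximum likelihood estimate of the target's trustworthiness."""
    candidates = [i / grid for i in range(grid + 1)]
    return max(candidates, key=lambda t: likelihood(t, reports))

# Three positive reports and one negative, from witnesses of varying reliability.
feedback = [(1, 0.9), (1, 0.8), (1, 0.95), (0, 0.6)]
print(round(ml_trust(feedback), 2))
```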

Finally, since each of these basic Structured P2P DHT-based systems, and other Structured P2P application-specific systems such as cooperative storage, content distribution, and messaging, defines its own methods in the higher-level DHT abstractions to map keys to peers, there have been recent efforts [12] to define basic common API abstractions for the common services they provide, called the key-based routing API (KBR), at the lower tier of the abstractions. At the higher tiers, more abstractions can be built upon this basic KBR. In addition to the DHT abstraction, which provides the same functionality as the hash table in Structured P2P DHT-based systems by mapping between keys and objects, group anycast and multicast (CAST), which provides scalable group communication and coordination, and decentralized object location and routing (DOLR), which provides a decentralized directory service, are also defined. However, Karp et al. [13] point out that this bundled library model, in which applications read the local DHT state and receive upcalls from the DHT, requires the code for the same set of applications to be available at all DHT hosts. This prevents the sharing of a single DHT deployment by multiple applications and generates maintenance traffic from running the DHT on its underlying infrastructure. They therefore proposed OpenDHT with ReDiR, a distributed rendezvous service model that requires only a put()/get() interface and shares a common DHT routing platform. The authors argued that such an open DHT service will spur more development of DHT-based applications without the burden of deploying and maintaining a DHT.

Table 1 summarizes the characteristics of the Structured P2P overlay networks discussed above.
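Returning to the tiered abstractions above, the sketch below layers a put()/get() hash-table abstraction over a toy key-based routing primitive. The class names and signatures are illustrative assumptions, not the actual API defined in [12] or exported by OpenDHT:

```python
import hashlib
from typing import Dict

class KBRNode:
    """Toy key-based routing tier: deterministically maps a key to a responsible node."""

    def __init__(self, node_ids):
        self.node_ids = sorted(node_ids)

    def route(self, key: str) -> int:
        """Return the node responsible for the key: closest id at or above the key hash."""
        kid = int.from_bytes(hashlib.sha1(key.encode()).digest(), "big") % (2 ** 16)
        for nid in self.node_ids:
            if nid >= kid:
                return nid
        return self.node_ids[0]                 # wrap around the identifier space

class DHT:
    """Hash-table abstraction built on top of the KBR tier, exposing only put()/get()."""

    def __init__(self, kbr: KBRNode):
        self.kbr = kbr
        self.store: Dict[int, Dict[str, str]] = {nid: {} for nid in kbr.node_ids}

    def put(self, key: str, value: str) -> None:
        self.store[self.kbr.route(key)][key] = value

    def get(self, key: str):
        return self.store[self.kbr.route(key)].get(key)

dht = DHT(KBRNode([1000, 22000, 40000, 61000]))
dht.put("song.mp3", "held by peer 10.0.0.7")
print(dht.get("song.mp3"))
```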

UNSTRUCTURED P2P OVERLAY NETWORKS

In this category, the overlay networks organize peers into a random graph in a flat or hierarchical manner (e.g., with a Super-Peer layer) and use flooding, random walks, or expanding-ring Time-To-Live (TTL) search on the graph to query content stored by overlay peers. Each peer visited evaluates the query locally against its own content, and can therefore support complex queries. This is inefficient, because queries for content that is not widely replicated must be sent to a large fraction of the peers, and there is no coupling between topology and the location of data items. In this section we survey and compare some of the more seminal Unstructured P2P overlay networks: Freenet [64], Gnutella [9], FastTrack [65]/KaZaA [66], BitTorrent [67], and Overnet/eDonkey2000 [68, 69].
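Of the search strategies just mentioned, expanding-ring TTL search is the easiest to sketch: the querier floods with a small TTL first and retries with a larger TTL only if nothing is found, trading extra round trips for a smaller average number of peers contacted. A toy version over an adjacency-list graph follows; the topology and TTL schedule are illustrative assumptions:

```python
def ttl_flood(graph, start, predicate, ttl):
    """Breadth-first flood limited to `ttl` hops; returns peers whose content matches."""
    hits, frontier, visited = [], {start}, {start}
    for _ in range(ttl + 1):                      # hop 0 is the querier itself
        for peer in frontier:
            if predicate(peer):
                hits.append(peer)
        frontier = {n for p in frontier for n in graph[p]} - visited
        visited |= frontier
        if not frontier:
            break
    return hits

def expanding_ring_search(graph, start, predicate, ttl_schedule=(1, 2, 4, 8)):
    """Retry the flood with increasing TTLs until something is found."""
    for ttl in ttl_schedule:
        hits = ttl_flood(graph, start, predicate, ttl)
        if hits:
            return ttl, hits
    return None, []

graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
has_file = lambda peer: peer == "D"                 # only peer D stores the requested item
print(expanding_ring_search(graph, "A", has_file))  # found once the TTL reaches 4
```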

    FREENET

Freenet is an adaptive P2P network of peers that make queries to store and retrieve data items, which are identified by location-independent keys. It is an example of a loosely Structured decentralized P2P network, with the placement of files based on anonymity. Each peer maintains a dynamic routing table that contains the addresses of other peers and the data keys they are holding. The key features of Freenet are the ability to maintain locally a set of files in accordance with the maximum disk space allocated by the network operator, and to provide security mechanisms against malicious peers. The basic model is that requests for keys are passed along from peer to peer through a chain of proxy requests in which each peer makes a local decision about where to send the request next, similar to Internet Protocol (IP) routing. Freenet also enables users to share unused disk space, thus allowing a logical extension of their own local storage devices.

The basic architecture consists of data items identified by binary file keys obtained by applying the 160-bit SHA-1 hash function [70]. The simplest type of file key is the Keyword-Signed Key (KSK), which is derived from a short descriptive text string chosen by the user, e.g., /music/Britney.Spears. The descriptive text string is used as the input to deterministically generate a public/private key pair, and the public half is then hashed to yield the data file key. The private half of the asymmetric key pair is used to sign the data file, thus providing a minimal integrity check that a retrieved data file matches its data file key. The data file is also encrypted using the descriptive string itself as a key, so that peers must perform an explicit lookup protocol to access the contents of their data-stores.

However, nothing prevents two users from independently choosing the same descriptive string for different files. These problems are addressed by the Signed-Subspace Key (SSK), which enables personal namespaces.

Table 1. A comparison of various Structured P2P overlay network schemes (CAN, Chord, Tapestry, Pastry, Kademlia, Viceroy).

• Decentralization: all six schemes provide DHT functionality on an Internet-like scale.
• Architecture: CAN, multi-dimensional ID coordinate space; Chord, uni-directional and circular NodeID space; Tapestry, Plaxton-style global mesh network; Pastry, Plaxton-style global mesh network; Kademlia, XOR metric for distance between points in the key space; Viceroy, butterfly network with a connected ring of predecessor and successor links, with data managed by the servers.
• Lookup protocol: CAN, {key, value} pairs mapped to a point P in the coordinate space using a uniform hash function; Chord, matching key and NodeID; Tapestry, matching suffix in NodeID; Pastry, matching key and prefix in NodeID; Kademlia, matching key and NodeID-based routing; Viceroy, routing through levels of a tree until a peer with no downlinks is reached, with vicinity search performed using ring and level-ring links.
• System parameters: CAN, N (number of peers in the network) and d (number of dimensions); Chord, N; Tapestry, N and B (base of the chosen peer identifier); Pastry, N and b (number of bits, B = 2^b, used for the base of the chosen identifier); Kademlia, N and b (number of bits, B = 2^b, of the NodeID); Viceroy, N.
• Routing performance: CAN, O(d·N^(1/d)); Chord, O(logN); Tapestry, O(log_B N); Pastry, O(log_B N); Kademlia, O(log_B N) + c, where c is a small constant; Viceroy, O(logN).
• Routing state: CAN, 2d; Chord, logN; Tapestry, log_B N; Pastry, B·log_B N + B·log_B N; Kademlia, B·log_B N + B; Viceroy, logN.
• Peers join/leave: CAN, 2d; Chord, (logN)^2; Tapestry, log_B N; Pastry, log_B N; Kademlia, log_B N + c, where c is a small constant; Viceroy, logN.
• Security: all schemes provide only a low level of security and suffer from man-in-the-middle and Trojan attacks.
• Reliability/fault resiliency: in all schemes the failure of peers will not cause network-wide failure. CAN, multiple peers are responsible for each data item and the application retries on failure; Chord, data is replicated on multiple consecutive peers and the application retries on failure; Tapestry and Pastry, data is replicated across multiple peers and multiple paths to each peer are tracked; Kademlia, data is replicated across multiple peers; Viceroy, the load incurred by lookup routing is evenly distributed among the participating lookup servers.


The public namespace key and the descriptive string are hashed independently, XORed together, and then hashed again to yield the data file key. For retrieval, the user publishes the descriptive string together with the subspace's public key. Storing data requires the private key, so that only the owner of a subspace can add files to it, and owners have the ability to manage their own namespaces. The third type of key in Freenet is the Content-Hash Key (CHK), which is used for updating and splitting contents. This key is derived by hashing the contents of the corresponding file, which gives every file a pseudo-unique data file key. Data files are also encrypted with a randomly generated encryption key. For retrieval, the user publishes the content-hash key itself together with the decryption key. The decryption key is never stored with the data file but is only published with the data file key, so as to provide a measure of cover for operators. The CHK can also be used to split data files into multiple parts in order to optimize storage and bandwidth resources. This is done by inserting each part separately under a CHK and creating an indirect file, or multiple levels of indirect files, that point to the individual parts. The routing algorithm for storing and retrieving data is designed to adaptively adjust routes over time and to provide efficient performance using only local knowledge, since peers know only their immediate neighbors. Thus, routing performance is good for popular content. Each request is given a Hops-To-Live (HTL) limit, similar to the IP Time-To-Live (TTL), which is decremented at each peer to prevent infinite chains. Each request is also assigned a pseudo-unique random identifier, so that peers can avoid loops by rejecting requests they have seen before. If this happens, the preceding peer chooses a different peer to forward to. This process continues until the request either is satisfied or exceeds its HTL limit. The success or failure signal (message) is then returned back up the chain to the sending peer. Joining the network relies on discovering the address of one or more existing peers through out-of-band means; no peer is privileged over any other peer, so no hierarchy or centralized point of failure can exist. This inherent resilience and decentralization enhances performance and scalability, giving a constant routing state while peers join and leave the overlay.
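A schematic sketch of the SSK and CHK derivations described above, using SHA-1 from the standard library; the byte-level framing is an assumption for illustration and does not match Freenet's actual wire formats:

```python
import hashlib

def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def ssk_file_key(public_namespace_key: bytes, descriptive_string: str) -> bytes:
    """Signed-Subspace Key: hash both parts independently, XOR them, and hash the result."""
    return sha1(xor_bytes(sha1(public_namespace_key), sha1(descriptive_string.encode())))

def chk_file_key(file_contents: bytes) -> bytes:
    """Content-Hash Key: derived directly from the file contents, so it is pseudo-unique."""
    return sha1(file_contents)

namespace_pub = b"--- public half of a subspace key pair ---"
print(ssk_file_key(namespace_pub, "/music/Britney.Spears").hex())
print(chk_file_key(b"some file bytes").hex())
```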

In addition, as described in [64], Freenet uses its datastore to increase system performance. When an object is returned (or forwarded) after a successful retrieval (or insertion), the peer caches the object in its datastore and passes it to the upstream (downstream) requester, which then creates a new entry in its routing table associating the object source with the requested key. When a newly arriving object, from either a new insert or a successful request, causes the datastore to exceed its designated size, Least Recently Used (LRU) objects are ejected in order until there is space. The LRU policy is also applied to routing table entries when the table is full.
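A minimal sketch of such an LRU-bounded datastore (the capacity and types are illustrative):

```python
from collections import OrderedDict

class Datastore:
    """Fixed-size cache that ejects least-recently-used objects when a new one arrives."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.objects = OrderedDict()        # key -> data, ordered oldest -> newest

    def get(self, key):
        if key in self.objects:
            self.objects.move_to_end(key)   # a hit makes the object most recently used
            return self.objects[key]
        return None

    def put(self, key, data):
        self.objects[key] = data
        self.objects.move_to_end(key)
        while len(self.objects) > self.capacity:
            self.objects.popitem(last=False)  # eject the least recently used object

store = Datastore(capacity=2)
store.put("k1", b"a"); store.put("k2", b"b")
store.get("k1")                              # touch k1, so k2 becomes the LRU entry
store.put("k3", b"c")
print(list(store.objects))                   # ['k1', 'k3']
```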

Figure 7 depicts a typical sequence of request messages. The user initiates a data request at peer A, which forwards the request to peer B, which in turn forwards it to peer C. Peer C is unable to contact any other peer and returns a backtracking failed-request message to peer B. Peer B tries its second choice, peer E, which forwards the request to peer F, which then delivers it back to peer B. Peer B detects the loop and returns a backtracking failure message. Peer F is unable to contact any other peer and backtracks one step further, to peer E. Peer E forwards the request to its second choice, peer D, which has the data. The data is returned from peer D via peers E and B to peer A, and is cached at peers E, B, and A, creating a routing short-cut for the next similar queries. This example shows that the overlay suffers from security problems such as man-in-the-middle and Trojan attacks, but that the failure of peers will not cause network-wide failure, because of its lack of centralized structure; this gives good reliability and fault resiliency.
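The request sequence in Fig. 7 is essentially a depth-first search with a Hops-To-Live budget, loop detection by request identifier, backtracking on failure, and caching along the successful return path. A compact sketch of that behavior over a toy topology follows; the topology, HTL value, and routing-choice order are illustrative assumptions:

```python
def freenet_request(peers, start, key, htl=10):
    """Depth-first, HTL-limited search; caches the object at every peer on the return path."""
    seen = set()                                  # peers that already saw this request id

    def visit(peer, remaining):
        if key in peers[peer]["store"]:
            return peer                           # found the data here
        if remaining == 0 or peer in seen:
            return None                           # HTL exhausted or loop detected
        seen.add(peer)
        for neighbor in peers[peer]["neighbors"]: # try choices in order, backtrack on failure
            source = visit(neighbor, remaining - 1)
            if source is not None:
                peers[peer]["store"].add(key)     # cache on the way back (routing short-cut)
                return source
        return None

    return visit(start, htl)

# Toy version of the Fig. 7 topology: A -> B -> {C, E}, E -> {F, D}, and D holds the data.
peers = {
    "A": {"neighbors": ["B"],      "store": set()},
    "B": {"neighbors": ["C", "E"], "store": set()},
    "C": {"neighbors": [],         "store": set()},
    "E": {"neighbors": ["F", "D"], "store": set()},
    "F": {"neighbors": ["B"],      "store": set()},   # forwarding here creates the loop
    "D": {"neighbors": [],         "store": {"key-42"}},
}
print(freenet_request(peers, "A", "key-42"))          # 'D'
print(sorted(p for p in peers if "key-42" in peers[p]["store"]))  # cached at A, B, D, E
```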

    GNUTELLA

Gnutella (pronounced "newtella") is a decentralized protocol for distributed search over a flat topology of peers (servents). Gnutella is widely used, and there has been a large amount of work on improving it [71-73]. Although the Gnutella protocol supports a traditional client/centralized-server search paradigm, Gnutella's distinction is its peer-to-peer, decentralized model for document location and retrieval, as shown in Fig. 8. In this model, every peer acts as both a server and a client. The system has neither a centralized directory nor any precise control over the network topology or file placement. The network is formed by peers joining it according to some loose rules. The resulting topology has certain properties, but the placement of data items is not based on any knowledge of the topology, unlike in the Structured P2P designs. To locate a data item, a peer queries its neighbors, and the most typical query method is flooding: the lookup query is flooded to all neighbors within a certain radius. Such a design is extremely resilient to peers entering and leaving the system, but the current search mechanisms are not scalable and generate unexpected loads on the network.

The so-called Gnutella servents (peers) perform tasks normally associated with both clients and servers. They provide client-side interfaces through which users can issue queries and view search results, while at the same time accepting queries from other servents, checking for matches against their local data sets, and responding with applicable results. These peers are also responsible for managing the background traffic that spreads the information used to maintain network integrity. Due to its distributed nature, a network of servents that implements the Gnutella protocol is highly fault-tolerant, as operation of the network is not interrupted if a subset of servents goes offline.

Figure 7. A typical request sequence in Freenet.

To join the system, a new servent (peer) initially connects to one of several known hosts that are almost always available, e.g., a list of peers available from http://gnutellahosts.com.

Once connected to the network, peers send messages to interact with each other. These messages are either broadcast (i.e., sent to all peers with which the sender has open TCP connections) or simply back-propagated (i.e., sent on a specific connection on the reverse of the path taken by an initial broadcast message). First, each message has a randomly generated identifier. Second, each peer keeps a short memory of recently routed messages, used to prevent re-broadcasting and to implement back-propagation. Third, messages are flagged with TTL and hops-passed fields. The messages allowed in the network are:
• Group Membership (PING and PONG) Messages. A peer joining the network initiates a broadcast PING message to announce its presence. The PING message is forwarded to its neighbors and initiates a back-propagated PONG message, which contains information about the peer, such as its IP address and the number and size of the data items it shares.
• Search (QUERY and QUERY RESPONSE) Messages. A QUERY contains a user-specified search string that each receiving peer matches against locally stored file names, and it is broadcast. QUERY RESPONSE messages are back-propagated replies to QUERY messages and include the information necessary to download a file.
• File Transfer (GET and PUSH) Messages. File downloads are performed directly between two peers using these types of messages.
Therefore, to become a member of the network, a servent

(peer) has to open one or more connections with other peers that are already in the network. In such a dynamic network environment, to cope with unreliability after joining the network, a peer periodically PINGs its neighbors to discover other participating peers. Peers decide where to connect in the network based only on local information. Thus, the entire application-level network has servents as its peers and open TCP connections as its links, forming a dynamic, self-organizing network of independent entities.
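A rough sketch of the forwarding rules described above: randomly generated message identifiers, a short memory of seen identifiers to suppress re-broadcasting, and TTL/hops fields updated at each hop. The message layout and TTL default are illustrative assumptions, not the Gnutella wire format:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Message:
    kind: str                       # e.g., "PING" or "QUERY"
    payload: str
    ttl: int = 7
    hops: int = 0
    msg_id: str = field(default_factory=lambda: uuid.uuid4().hex)

class Servent:
    def __init__(self, name):
        self.name = name
        self.neighbors = []         # open TCP connections, abstracted as peer objects
        self.seen = set()           # recently routed message ids (prevents re-broadcast)

    def receive(self, msg: Message, came_from):
        if msg.msg_id in self.seen or msg.ttl <= 0:
            return                  # duplicate or expired: drop silently
        self.seen.add(msg.msg_id)
        fwd = Message(msg.kind, msg.payload, ttl=msg.ttl - 1,
                      hops=msg.hops + 1, msg_id=msg.msg_id)
        for n in self.neighbors:
            if n is not came_from:  # broadcast to everyone except the sender
                n.receive(fwd, came_from=self)

a, b, c = Servent("A"), Servent("B"), Servent("C")
a.neighbors, b.neighbors, c.neighbors = [b], [a, c], [b]
a.receive(Message("QUERY", "some-file"), came_from=None)
print(sorted(s.name for s in (a, b, c) if s.seen))   # all three saw the query exactly once
```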

The latest versions of Gnutella use the notion of super-peers or ultra-peers [11] (peers with better bandwidth connectivity) to help improve the routing performance of the network. However, it is still limited by the flooding mechanism used for communication across ultra-peers. Moreover, the ultra-peer approach makes a binary decision about a peer's capacity (ultra-peer or not) and, to our knowledge, has no mechanism to dynamically adapt the ultra-peer/client topology as the system evolves. Ultra-peers perform query processing on behalf of their leaf peers. When a peer joins the network as a leaf, it selects a number of ultra-peers and publishes its file list to them. A query from a leaf peer is sent to an ultra-peer, which floods the query to its ultra-peer neighbors up to a limited number of hops. Dynamic querying [74] is a search technique whereby queries that return fewer results are re-flooded deeper into the network.

Saroiu et al. [75] examined the bandwidth, latency, availability, and file-sharing patterns of the peers in Gnutella and Napster, and highlighted the existence of significant heterogeneity in both systems. Krishnamurthy et al. [76] propose a cluster-based architecture for P2P systems (CAP), which uses a network-aware clustering technique (based on a central clustering server) to group peers into clusters. Each cluster has one or more delegate peers that act as directory servers for the objects stored at peers within the same cluster. Chawathe et al. [73] propose a model called Gia that modifies Gnutella's algorithm to include flow control, dynamic topology adaptation, one-hop replication, and careful attention to peer heterogeneity. Their simulation results suggest that these modifications provide three to five orders of magnitude improvement in the total capacity of the system while retaining significant robustness to failures. Thus, a few simple changes to Gnutella's search operations would result in dramatic improvements in its scalability.

    FASTTRACK/KAZAA

FastTrack [65] is a decentralized P2P file-sharing system that supports meta-data searching. Peers form a structured overlay of super-peers to make search more efficient, as shown in Fig. 9. Super-peers are peers with high bandwidth, disk space, and processing power that have volunteered to be elected to facilitate search by caching meta-data. Ordinary peers transmit the meta-data of the data files they are sharing to the super-peers, and all queries are likewise forwarded to a super-peer. A Gnutella-type broadcast-based search is then performed over a highly pruned overlay network of super-peers. The P2P system can exist without any super-peers, but this results in worse query latency. However, this approach still consumes bandwidth to maintain the index at the super-peers on behalf of the peers that are connected to them. The super-peers still use a broadcast protocol for search, and lookup queries are routed to peers and super-peers that have no information relevant to the query. KaZaA [66] and Grokster [77] are both FastTrack applications.

As mentioned, KaZaA is based on the proprietary FastTrack protocol, which uses specially designated super-peers that have higher-bandwidth connectivity. Pointers to each peer's data are stored on an associated super-peer, and all queries are routed to the super-peers. Although this approach seems to offer better scaling properties than Gnutella, its design has not been analyzed. There have been proposals to incorporate this approach into the Gnutella network [11]. The KaZaA peer-to-peer file-sharing network client supports a similar behavior, allowing powerful peers to opt out of network support roles that consume CPU and bandwidth.

KaZaA file transfer traffic consists of unencrypted HTTP transfers; all transfers include KaZaA-specific HTTP headers (e.g., X-KaZaA-IP). These headers make it simple to distinguish KaZaA activity from other HTTP activity. The KaZaA application has an auto-update feature, meaning a running instance of KaZaA periodically checks for updated versions of itself; if one is found, it downloads the new executable over the KaZaA network.

Figure 8. Gnutella utilizes a decentralized architecture for document location and retrieval.

A power-law topology, commonly found in many practical networks such as the WWW [78, 79], has the property that a small

proportion of peers have a high out-degree (i.e., many connections to other peers), while the vast majority of peers have a low out-degree (i.e., connections to few peers). Formally, the frequency f_d of peers with out-degree d exhibits a power-law relationship of the form f_d ∝ d^a, with a < 0. This is the Zipf property, with Zipf distributions looking linear when plotted on a log-log scale. Faloutsos et al. [80] found that Internet routing topologies follow this power-law relationship with a ≈ -2. However, Gummadi et al. [75, 81] observe that the measured popularity of the KaZaA file-sharing workload does not follow a Zipf distribution. The popularity of the most requested objects (mostly large, immutable video and audio objects) is limited, since clients typically fetch objects at most once, unlike on the Web. Thus, the popularity of KaZaA's objects tends to be short-lived, and popular objects tend to be recently born. There is also significant locality in the KaZaA workload, which means that substantial opportunity exists for caching to reduce wide-area bandwidth consumption.
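The "linear on a log-log scale" property is easy to check numerically. The sketch below generates frequencies from an ideal power law with exponent a = -2 (the value Faloutsos et al. report for Internet routing topologies) and recovers the exponent as the slope of a least-squares fit to the log-log points; the constants and degree range are illustrative:

```python
import math

def fit_loglog_slope(degrees, freqs):
    """Least-squares slope of log(freq) vs. log(degree); for a power law this is the exponent a."""
    xs = [math.log(d) for d in degrees]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

degrees = list(range(1, 101))
a = -2.0
freqs = [1000.0 * d ** a for d in degrees]          # ideal power-law frequencies f_d = C * d^a
print(round(fit_loglog_slope(degrees, freqs), 3))   # recovers approximately -2.0
```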

    BITTORRENT

BitTorrent [67] is a centralized P2P system that uses a central location to manage users' downloa