Searching Techniques in Peer-to-Peer Networks (2012-04-12)

1. Author: Xiuqi Li and Jie Wu. Presenter: Zia Ush Shamszaman, ANLAB, ICE, HUFS.

2. This talk surveys the major searching techniques in peer-to-peer (P2P) networks: the concept of P2P networks, the methods for classifying different P2P networks, and the searching techniques used in unstructured, strictly structured, and loosely structured P2P systems. Searching in unstructured P2Ps covers both blind and informed search schemes. Searching in strictly structured P2Ps focuses on hierarchical DHT P2Ps and non-DHT P2Ps; non-hierarchical DHT P2Ps are briefly overviewed.

3. P2P networks are overlay networks on top of the Internet: nodes are end systems in the Internet, and each maintains information about a set of other nodes (called neighbors) in the P2P network. P2P networks offer the following benefits:
   - They do not require any special administrative or financial arrangements.
   - They are self-organized and adaptive: peers may come and go freely, and P2P systems handle these events automatically.
   - They can gather and harness the tremendous computation and storage resources of computers across the Internet.
   - They are distributed and decentralized, and therefore potentially fault-tolerant and load-balanced.

4. P2P networks can be classified by the control over data location and network topology into three categories:
   - Unstructured: in an unstructured P2P network such as Gnutella, no rule defines where data is stored, and the network topology is arbitrary.
   - Loosely structured: in a loosely structured network such as Freenet or Symphony, the overlay structure and the data location are not precisely determined.
   - Highly structured: in a highly structured P2P network such as Chord, both the network architecture and the data placement are precisely specified.

5.
P2P networks can also be classified as centralized or decentralized:
   - In a centralized P2P network such as Napster, a central directory of object locations, ID assignment, etc. is maintained at a single location.
   - Decentralized P2Ps adopt a distributed directory structure. These systems can be further divided into:
     - Purely decentralized systems, such as Gnutella and Chord, in which peers are totally equal.
     - Hybrid systems, in which some peers, called dominating nodes or super-peers, serve the search requests of the other regular peers.
Another classification of P2P systems is hierarchical vs. non-hierarchical, based on whether the overlay structure is a hierarchy. All hybrid systems, and a few purely decentralized systems such as Kelips, are hierarchical; hierarchical systems provide good scalability, the opportunity to take advantage of node heterogeneity, and high routing efficiency. Most purely decentralized systems have flat overlays and are non-hierarchical; non-hierarchical systems offer load balance and high resilience.

6. Searching means locating desired data. Most existing P2P systems support simple object lookup by key or identifier; some can handle more complex keyword queries, which find documents containing the keywords in the query. More than one copy of an object may exist in a P2P system, and more than one document may contain the desired keywords. Some P2P systems are interested in a single data item; others want all data items, or as many as possible, that satisfy a given condition. Most searching techniques are forwarding-based: starting from the requesting node, a query is forwarded (or routed) toward the desired node(s).

7. Desired characteristics of a searching technique:
   - High-quality query results
   - Minimal routing state maintained per node
   - High routing efficiency
   - Load balance
   - Resilience to node failures
   - Support for complex queries

8. The quality of query results is application-dependent; generally, it is measured by the number of results and their relevance.
The routing state refers to the number of neighbors each node maintains. Routing efficiency is generally measured by the number of overlay hops per query; in some systems, it is also evaluated using the number of messages per query. Different searching techniques make different trade-offs between these desired characteristics.

9. Iterative deepening. Yang and Garcia-Molina borrowed the idea of iterative deepening from artificial intelligence. The querying node periodically issues a sequence of BFS searches with increasing depth limits D1 < D2 < ... < Di. The query is terminated when it is satisfied or when the maximum depth limit D has been reached. All nodes use the same sequence of depth limits, called the policy; the example uses the set of depths {0, 2, 4, 5} with parameter 3, the time to wait between iterations.

10-13. Iterative-deepening example with policy {0, 2, 4, 5} and waiting period 3. At each stage, a node is either holding a frozen query, has already processed the query, or is unaware of the query.

14. Random walks. Standard random walker: forward the query to a randomly chosen neighbor at each step; each message is a walker. This cuts message overhead but increases search delay (number of hops). k-walkers: the requesting node sends k query messages, and each message takes its own random walk. Periodically, when a node receives a query, it checks with the source node to see whether the query has been satisfied. k walkers after T steps should reach roughly the same number of nodes as 1 walker after kT steps, so delay is cut by a factor of k: to decrease delay, increase the number of walkers.

15. "Why shouldn't I find a song?" Example: A sends a walker to find song.mp3, which is stored on B.

16.
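The k-walker scheme above can be sketched as a small simulation. This is a minimal illustration, not the authors' implementation; the graph representation, the `data_at` file table, and the parameter defaults are all hypothetical.

```python
import random

def k_walker_search(graph, data_at, start, key, k=4, max_steps=200, check_every=4):
    """Launch k independent random walkers from `start`. Each walker
    forwards the query one hop to a random neighbor; every `check_every`
    steps it checks with the source (modeled here as a shared flag) and
    stops if the query has already been satisfied elsewhere."""
    satisfied = False            # the source node's "query satisfied" state
    walkers = [start] * k
    messages = 0
    for step in range(1, max_steps + 1):
        next_walkers = []
        for node in walkers:
            if satisfied and step % check_every == 0:
                continue                     # walker learns the query is done
            node = random.choice(graph[node])
            messages += 1                    # one forwarded query message
            if key in data_at.get(node, ()):
                satisfied = True             # found; the source is notified
            next_walkers.append(node)
        walkers = next_walkers
        if not walkers:
            break
    return satisfied, messages
```

With k walkers the delay to reach a given number of nodes drops by roughly a factor of k, at the cost of a k-fold higher message rate, which matches the trade-off stated above.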
Termination is TTL-based or uses hop-count checking: the walker periodically checks with the original requestor before walking to the next node (a large TTL is still used, just to prevent loops). Experiments show that checking once every 4th step strikes a good balance between the overhead of the checking messages and the benefits of checking.

17. Directed breadth-first search. The source sends queries only to "good" neighbors. Good neighbors might be ones that have produced results in the past, have low latency, have returned results with the lowest hop count, have good neighbors themselves, are the highest-traffic neighbors, are stable, or have the shortest message queue. After the first hop, the query is routed as a normal BFS.

18-22. Directed breadth-first search (examples).

23. Local indices. The idea is that a node maintains information about what files its neighbors, and possibly its neighbors' neighbors, store: its "radius of knowledge". All nodes know the radius and the policy; the policy lists which levels will respond to a query and which will ignore and forward it, and servents look at the TTL/hop count to determine whether to process or ignore a message. The memory cost of maintaining the lists is small (far below a megabyte for radius < 5), but building the lists costs extra network packets.

24. Example: each node has a radius of 2, so it knows about the files on its neighbors and its neighbors' neighbors. The policy is that levels 1, 4, and 7 respond. Nodes at level 1 respond with information about levels 1, 2, and 3, and forward the query to the next level. The search moves to levels 2 and 3, which ignore it and forward. When the search reaches level 4, that level responds with information about levels 4, 5, and 6; levels 5 and 6 simply ignore and forward. Finally, level 7 responds with its own data and terminates the query.

25. Index maintenance. Joining: a new node sends a join message with TTL = r, and all the nodes within r hops update their indices. The join message contains the metadata about the joining node.
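The radius-of-knowledge index above can be sketched as a bounded BFS that gathers file metadata within r hops. The graph and file-table representations here are hypothetical, not from the paper.

```python
from collections import deque

def build_local_index(graph, files_at, node, radius=2):
    """Build `node`'s local index: for every file stored within `radius`
    hops, record which nearby node holds it (the 'radius of knowledge')."""
    index = {}                   # filename -> set of holding nodes
    seen = {node}
    queue = deque([(node, 0)])
    while queue:
        current, dist = queue.popleft()
        for name in files_at.get(current, ()):
            index.setdefault(name, set()).add(current)
        if dist < radius:        # do not look past the radius
            for neighbor in graph[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, dist + 1))
    return index
```

A join with TTL = r then amounts to every node within r hops merging the joiner's metadata into an index like this one.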
When a node receives this join message, it in turn sends a join message containing its own metadata directly to the new node, and the new node updates its indices. Node death: other nodes update their indices based on timeouts. Updating: when a node updates its collection, it sends out a small update message with TTL = r containing the metadata of the affected item, and all nodes receiving this message update their indices accordingly.

26. Routing indices. The objective of a routing index (RI) is to allow a node to select the best neighbors to send a query to. An RI is a data structure that, given a query, returns a list of neighbors ranked according to their goodness for the query. Each node has a local index for quickly finding local documents when a query is received. Nodes also have a compound routing index (CRI) containing, for each neighbor, the number of documents along that path and the number of documents on each topic.

27. Example: for node A, there are 100 documents available from B (and its descendants); 20 belong to the Database category, 10 to Theory, and 30 to Languages. The goodness of a neighbor for query topics s1, ..., sn is

   Goodness = NumberOfDocuments * PRODUCT_i ( CRI(s_i) / NumberOfDocuments )

where CRI(s_i) is the value in the cell at the column for topic s_i and the row for that neighbor.

28. For documents about both databases and languages:

   Goodness(B) = 100  * (20/100)   * (30/100)   = 6
   Goodness(C) = 1000 * (0/1000)   * (50/1000)  = 0
   Goodness(D) = 200  * (100/200)  * (150/200)  = 75

29. (Figure) RI propagation over a new connection: aggregated indices such as D+A+J and D+A+I are exchanged along the new link.

30. Attenuated Bloom filters are extensions of Bloom filters, which are often used to summarize the elements of a set approximately and efficiently. The approach assumes that each stored document has many replicas spread over the P2P network and that documents are queried by name. It aims to quickly find, with high probability, replicas close to the query source, by approximately summarizing the documents that likely exist in nearby nodes. However, this approach alone fails to find replicas far away from the query source.
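The Bloom filter that underlies the attenuated variant above can be sketched as follows. The sizes and the SHA-1-based hashing are illustrative choices, not the specifics of any particular system; an attenuated Bloom filter is then an array of such filters, one per hop distance.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: each element sets k bit positions derived
    from hashes. Membership tests can give false positives, but never
    false negatives."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0            # m-bit vector packed into an int
    def _positions(self, item):
        # derive k positions by salting the item with an index
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.m
    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos
    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))
```

A node can ship such a filter to its neighbors as a compact, lossy summary of the document names it (or its vicinity) stores.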
31. Strictly structured systems. In a strictly structured system, the neighbor relationships between peers and the data locations are strictly defined, so searching in such systems is determined by the particular network architecture. Among the strictly structured systems, some implement a distributed hash table (DHT) using different data structures; others do not provide a DHT interface. Some DHT P2P systems have flat overlay structures; others have hierarchical overlay structures. A DHT is a hash table whose entries are distributed among different peers located in arbitrary locations: each data item is hashed to a unique numeric key, and each node is also hashed to a unique ID in the same key space.

32. Different non-hierarchical DHT P2Ps use different flat data structures to implement the DHT, including rings, meshes, hypercubes, and other special graphs such as the de Bruijn graph. Chord uses a ring: node IDs form a ring, and each node keeps a finger table that contains the IP addresses of other nodes. Pastry uses a tree-based data structure that can be considered a generalization of a hypercube; to shorten routing latency, each Pastry node also keeps a routing table of pointers to other nodes in the ID space. Other examples include Koorde, Viceroy, and Cycloid.

33. All hierarchical DHT P2Ps organize peers into different groups or clusters. Each group forms its own overlay, and all groups together form the entire hierarchical overlay. Typically the overlay hierarchies are two-tier or three-tier; systems differ mainly in the number of groups in each tier and the overlay structure formed by each group. Superpeers (dominating nodes) generally contribute more computing resources, are more stable, and take more responsibility in routing than regular peers. Examples of this category are Kelips and Coral.

34. Kelips is composed of k virtual affinity groups with group IDs. The IP address and port number of a node n are hashed to the group ID of the group to which n belongs.
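The node-to-group mapping just described can be sketched with a hypothetical helper; only the use of SHA-1 for the hashing is taken from the slides.

```python
import hashlib

def affinity_group(name: str, k: int) -> int:
    """Hash a node's 'ip:port' string, or a file name, to one of k
    affinity-group IDs using SHA-1."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % k
```

Because nodes and files are hashed with the same function, a lookup for a file can be directed straight at the contacts of the file's group.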
The consistent hashing function SHA-1 provides a good balance of group members with high probability. Each file name is mapped to a group using the same SHA-1 function. Inside a group, a file is stored on a randomly chosen group member, called the file's homenode. Kelips thus offers load balance both within a group and among different groups.

35. Each node n in an affinity group g keeps the following routing state in memory:
   - View of its own affinity group g: information about the set of nodes in the same group, including a round-trip-time estimate, a heartbeat count, etc.
   - Contacts for all other affinity groups: information about a small constant number of nodes in each of the other groups.
   - Filetuples: an intra-group index of the set of files whose homenodes are in the same affinity group. A file tuple consists of a file name and the IP address of the file's homenode; a heartbeat count is also associated with each file tuple.

36. The total number of routing-table entries per node is

   N/k + c * (k - 1) + F/k

where N is the total number of nodes, c the number of contacts per group, F the total number of files in the system, and k the number of affinity groups. F is proportional to N, and c is fixed.

37. To look up a file f, the querying node A in group G hashes the file name to the file's group G'. If G' is the same as G, the query is resolved by checking A's local data store and local intra-group index. Otherwise, A forwards the query to its topologically closest contact in group G'. On receiving the query, that contact searches its local data store and local intra-group index, and the IP address of f's homenode is returned directly to the querying node. In case of a lookup failure, the querying node retries using different contacts in group G', using a random walk in group G', or using a random walk in its own group G.

38. Coral is an indexing scheme.
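As a numeric illustration of the Kelips routing-state formula above (the values plugged in are made up):

```python
def kelips_routing_entries(N, c, F, k):
    """Per-node routing state: the group view (N/k entries), the contacts
    to the other k-1 groups (c each), and the intra-group file index (F/k)."""
    return N / k + c * (k - 1) + F / k

# With N = F = 10_000 nodes/files, c = 2 contacts per foreign group, and
# k = 100 affinity groups, each node keeps 100 + 198 + 100 = 398 entries.
# Choosing k near sqrt(N) keeps the total state roughly O(sqrt(N)).
```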
It does not dictate how to store or replicate data items. The objectives of Coral are to avoid hot spots and to find nearby data without querying distant nodes. A distributed sloppy hash table (DSHT) is proposed to eliminate hot spots. In a DHT, a key is associated with a single value, which is a data item or a pointer to a data item; in a DSHT, a key is associated with a number of values, which are pointers to replicas of data items. A DSHT provides the interface put(key, value) and get(key): put(key, value) stores a value under a key, and get(key) returns a subset of the values stored under a key.

39. When a file replica is stored locally on a node A, node A hashes the file name to a key k and inserts a pointer nodeaddr (A's address) to that file into the DSHT by calling put(k, nodeaddr). To query for a list of values for a key k, get(k) is forwarded in the identifier space. Coral organizes nodes into a hierarchy of clusters, putting nearby nodes in the same cluster. Coral consists of three levels of clusters:
   - Level 2, the lowest level: clusters cover peers located in the same region, with a cluster diameter (round-trip time) of 30 ms.
   - Level 1: clusters cover peers located on the same continent, with a cluster diameter of 100 ms.
   - Level 0: a single cluster for the entire planet; the cluster diameter is infinite.

40. DHTs balance load among different nodes, but hashing destroys data locality. Non-DHT P2Ps try to solve this problem of DHT P2Ps by avoiding hashing: hashing does not keep data locality and is not amenable to range que...