20120412 searching techniques in peer to peer networks


  • 1. Authors: Xiuqi Li and Jie Wu. Presenter: Zia Ush Shamszaman, ANLAB, ICE, HUFS

2. Survey of major searching techniques in peer-to-peer (P2P) networks. Covers the concept of P2P networks and the methods for classifying different P2P networks. Various searching techniques in unstructured P2P systems, strictly structured P2P systems, and loosely structured P2P systems are discussed. Searching in unstructured P2Ps covers both blind search schemes and informed search schemes. Searching in strictly structured P2Ps focuses on hierarchical Distributed Hash Table (DHT) P2Ps and non-DHT P2Ps; non-hierarchical DHT P2Ps are briefly overviewed.

3. P2P networks are overlay networks on top of the Internet, where nodes are end systems in the Internet and maintain information about a set of other nodes (called neighbors) in the P2P. P2P networks offer the following benefits. They do not require any special administration or financial arrangements. They are self-organized and adaptive: peers may come and go freely, and P2P systems handle these events automatically. They can gather and harness the tremendous computation and storage resources on computers across the Internet. They are distributed and decentralized; therefore, they are potentially fault-tolerant and load-balanced.

4. P2P networks can be classified based on the control over data location and network topology. There are three categories. Unstructured: in an unstructured P2P network such as Gnutella, no rule exists which defines where data is stored, and the network topology is arbitrary. Loosely structured: in a loosely structured network such as Freenet and Symphony, the overlay structure and the data location are not precisely determined. Highly structured: in a highly structured P2P network such as Chord, both the network architecture and the data placement are precisely specified.

5. P2P networks can also be classified as centralized or decentralized. In a centralized P2P such as Napster, a central directory of object locations, ID assignments, etc. is maintained in a single location.
Decentralized P2Ps adopt a distributed directory structure. These systems can be further divided into purely decentralized systems, such as Gnutella and Chord, in which peers are totally equal, and hybrid systems, in which some peers called dominating nodes or super-peers serve the search requests of other regular peers. Another classification of P2P systems is hierarchical versus non-hierarchical, based on whether the overlay structure is a hierarchy or not. All hybrid systems and a few purely decentralized systems, such as Kelips, are hierarchical systems. Hierarchical systems provide good scalability, the opportunity to take advantage of node heterogeneity, and high routing efficiency. Most purely decentralized systems have flat overlays and are non-hierarchical systems. Non-hierarchical systems offer load balance and high resilience.

6. Searching means locating desired data. Most existing P2P systems support simple object lookup by key or identifier. Some existing P2P systems can handle more complex keyword queries, which find documents containing the keywords in the query. More than one copy of an object may exist in a P2P system, and more than one document may contain the desired keywords. Some P2P systems are interested in a single data item; others are interested in all data items, or as many data items as possible, that satisfy a given condition. Most searching techniques are forwarding-based: starting with the requesting node, a query is forwarded (or routed) to the desired node(s).

7. Desired characteristics of a searching technique: high-quality query results; minimal routing state maintained per node; high routing efficiency; load balance; resilience to node failures; support of complex queries.

8. The quality of query results is application dependent. Generally, it is measured by the number of results and their relevance. The routing state refers to the number of neighbors each node maintains. The routing efficiency is generally measured by the number of overlay hops per query. In some systems, it is also evaluated using the number of messages per query.
Different searching techniques make different trade-offs between these desired characteristics.

9. Yang and Garcia-Molina borrowed the idea of iterative deepening from artificial intelligence. The querying node periodically issues a sequence of BFS searches with increasing depth limits D1 < D2 < ... < Di. The query is terminated when the query result is satisfied or when the maximum depth limit D has been reached. All nodes use the same sequence of depth limits, called the policy, for example the set of depths {0, 2, 4, 5}.

10-13. Iterative Deepening {0, 2, 4, 5}, 3 (figure slides; legend: Holding Frozen Query, Already Processed Query, Unaware of Query).

14. Standard random walker: the query is forwarded to a randomly chosen neighbor at each step, and each message is a walker. This cuts message overhead but increases the query searching delay (number of hops). k-walkers: the requesting node sends k query messages, and each query message takes its own random walk. Periodically, when a node receives a query, it checks with the source node to see if the query has been satisfied. k walkers after T steps should reach roughly the same number of nodes as 1 walker after kT steps, so the delay is cut by a factor of k. To decrease delay, increase the number of walkers.

15. Why shouldn't I find a song? A sends a walker to find song.mp3, which is stored on B.

16. TTL-based or hop count. Checking: the walker periodically checks with the original requestor before walking to the next node (again use a large TTL, just to prevent loops). Experiments show that checking once at every 4th step strikes a good balance between the overhead of the checking messages and the benefits of checking.

17. 
Directed Breadth First Search: the source only sends queries to good neighbors. Good neighbors might have: produced results in the past; low latency; the lowest hop count for results (they have good neighbors); the highest traffic (they're stable); the shortest message queue. The query is routed as normal BFS after the first hop.

18-22. Efficient Search - Methods: Directed Breadth First Search (figure slides).

23. The idea is that a node maintains information about what files its neighbors, and possibly its neighbors' neighbors, store: a radius of knowledge. All nodes know the radius and the policy. The policy lists which levels will respond to messages and which will ignore them. Servents look at TTL/Hops to determine whether they process or ignore a query. Memory issue of maintaining the lists: the size is far below a megabyte for radius < 5. Network issue of building the lists: a hit from extra packets.

24. Each node has a radius of 2: it knows about the files on its neighbors and its neighbors' neighbors. The policy is that levels 1, 4 and 7 respond. Nodes at level 1 respond with information about levels 1, 2 and 3, and forward to the next level. Searches move to levels 2 and 3, which ignore and forward. When the search reaches level 4, it responds with information about levels 4, 5, and 6, then forwards the messages. Layers 5 and 6 simply ignore and forward. Finally, level 7 responds with its own data and terminates the query.

25. Joining a new node: the new node sends a join message with TTL = r, and all the nodes within r hops update their indices. The join message contains the metadata about the joining node. When a node receives this join message, it in turn sends a join message containing its own metadata directly to the new node, and the new node updates its indices. Node dies: other nodes update their indices based on timeouts. Updating a node: when a node updates its collection, this node sends out a small update message with TTL = r, containing the metadata of the affected item.
All nodes receiving this message subsequently update their index.

26. The objective of a Routing Index (RI) is to allow a node to select the best neighbors to send a query to. An RI is a data structure that, given a query, returns a list of neighbors, ranked according to their goodness for the query. Each node has a local index for quickly finding local documents when a query is received. Nodes also have a compound RI (CRI) containing the number of documents along each path and the number of documents on each topic.

27. For A, there are 100 documents available from B (and its descendants): 20 belong to the Database category, 10 belong to the Theory category, and 30 belong to the Languages category. Goodness of a neighbor: $Goodness = NumberOfDocuments \times \prod_i \frac{CRI(s_i)}{NumberOfDocuments}$, where $CRI(s_i)$ is the value of the cell at the column for topic $s_i$ and at the row for the neighbor.

28. For documents on databases and languages: $Goodness(B) = 100 \times \frac{20}{100} \times \frac{30}{100} = 6$; $Goodness(C) = 1000 \times \frac{0}{1000} \times \frac{50}{1000} = 0$; $Goodness(D) = 200 \times \frac{100}{200} \times \frac{150}{200} = 75$.

29. New connection: RI propagation (figure; labels D+A+J and D+A+I).

30. Attenuated Bloom filters are extensions of Bloom filters. Bloom filters are often used to approximately and efficiently summarize the elements in a set. The approach assumes that each stored document has many replicas spread over the P2P network. Documents are queried by name. It intends to quickly find replicas close to the query source with high probability. This is achieved by approximately summarizing the documents that likely exist in nearby nodes. However, the approach alone fails to find replicas far away from the query source.

31. In a strictly structured system, the neighbor relationship between peers and the data locations are strictly defined. Searching in such systems is therefore determined by the particular network architecture. Among the strictly structured systems, some implement a distributed hash table (DHT) using different data structures. Others do not provide a DHT interface. Some DHT P2P systems have flat overlay structures; others have hierarchical overlay structures.
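Returning to the routing-index slides (26 to 28), the goodness estimate and the resulting neighbor ranking can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `CRI`, `goodness` and `rank_neighbors` and the dictionary layout are hypothetical, and the table holds only the counts quoted on the slides (topic counts that the slides do not give are left out rather than guessed).

```python
from typing import Dict, List

# Hypothetical in-memory form of node A's compound routing index (CRI).
# One row per neighbor: total documents reachable through that neighbor,
# plus per-topic counts. Only figures quoted on the slides are included.
CRI: Dict[str, Dict[str, int]] = {
    "B": {"total": 100, "databases": 20, "theory": 10, "languages": 30},
    "C": {"total": 1000, "databases": 0, "languages": 50},
    "D": {"total": 200, "databases": 100, "languages": 150},
}


def goodness(row: Dict[str, int], topics: List[str]) -> float:
    """Estimate matching documents: total * product of CRI(s_i) / total."""
    total = row["total"]
    if total == 0:
        return 0.0  # nothing reachable through this neighbor
    estimate = float(total)
    for topic in topics:
        # A topic missing from the row is treated as a count of zero
        # (an assumption of this sketch).
        estimate *= row.get(topic, 0) / total
    return estimate


def rank_neighbors(cri: Dict[str, Dict[str, int]], topics: List[str]) -> List[str]:
    """Return neighbor names ordered from highest to lowest goodness."""
    return sorted(cri, key=lambda name: goodness(cri[name], topics), reverse=True)


if __name__ == "__main__":
    query_topics = ["databases", "languages"]
    for name in rank_neighbors(CRI, query_topics):
        print(name, round(goodness(CRI[name], query_topics), 6))
```

With the query on databases and languages, the sketch ranks D ahead of B and C, which agrees with the goodness values of 75, 6 and 0 worked out on slide 28.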
A DHT is a