8/14/2019 Distance Join Processing in a P2P World
Distance Join Processing
in a P2P World
Xiaoqi Zhang
Student ID: 261273
8/7/2008
Supervisor: Dr. Egemen Tanin
Distance Join Processing in a P2P World
Abstract
P2P networks have expanded their use to the area of distributed database
systems. The P2P paradigm is famous for its various advantages over the conventional
client-server paradigm in that it provides excellent scalability both in computation and
bandwidth as well as no single point of failure due to decentralization. Spatial data is
widely used today in P2P applications. By exploiting the features of the P2P paradigm,
efficient spatial data retrieval becomes possible. A large body of work has been done
in spatial data retrieval over P2P networks, which focuses on the classic query
operations of range query and nearest neighbor query. However, to the best of my
knowledge, no work has been done on distance join operations over spatial data in the
context of the P2P paradigm. This report gives a detailed review of the first distance join
algorithm for P2P networks along with its implementation. A comprehensive
experiment is carried out at the end to examine different aspects of the algorithm.
Keywords: P2P, client-server, spatial data, GIS
1. Introduction
Spatial data has become a critical ingredient in various applications and
databases including location-based services [1], public transportation services,
scientific data management [2,3,4] and digital government [5]. Spatial data is not only
widely used in scientific and government organizations but is also used by the general
public, for example in in-car GPS systems and by real-estate agencies.
2D worlds and their representations are the most frequently used spatial data in
the spatial data processing domain. A 2D representation of a virtual or a real world in an
application contains many spatial objects which have positional values. One solution
to eliminate the bottleneck that the conventional client-server architecture
may introduce is to distribute such spatial objects among the machines in
a P2P network, so that operations on the spatial data are carried out in a P2P
paradigm rather than a client-server paradigm. New P2P applications, e.g.,
job-employee seeker networks, buyer-seller networks and event/location finders for a city,
follow this solution. For example, in a buyer-seller P2P network, information about
sellers and products is distributed over the network. A potential buyer may supply
his/her location, an area on the map where sellers may be located and some
information about the product to a search system, and the system returns a list of
the sellers who are selling the related products. This type of operation can be done by
simply clicking on a 2D map to choose the location and area. Another similar type of
query yields the distance join result, which contains ordered pairs of spatial objects.
The order depends on the distance between the two spatial objects in each pair. Finding the
closest bar-restaurant pair is one example of such an application. One
straightforward approach to this type of operation is to simply forward
messages among the available nodes in the network to locate the desired data. Such an
approach is not feasible, because it makes a large number of peers that
do not have the desired data participate in the operation. In the unpublished paper [6],
Tanin et al. have proposed an elegant way that exploits the features of P2P networks.
They used a data structure called quadtree [7] to partition the underlying spatial data in
2D worlds on which distance join queries are carried out. The content of this report is
based on [6, 8]. It gives a detailed explanation of the proposed distance join algorithm
and presents the results of a comprehensive experiment at the end.
The rest of this report is organized as follows. Section 2 gives a brief review of
related work, focusing on sequential distance join algorithms and the distributed quadtree
index; section 3 discusses 2 other types of query on the distributed quadtree index;
section 4 explains the distance join algorithm and my implementation of it;
section 5 gives the details of the experiments and the results; section 6 gives the
conclusion and future work.
2. Related Work
2.1. Base Sequential Algorithm
Several works have addressed distance join algorithms. Hjaltason
and Samet examined various similarity search algorithms in metric spaces in [9], with
the main contribution being the use of a priority queue based ranking algorithm for
spatial data. This algorithm can find the results of a ranking query in an incremental
fashion. In [10], they proposed a distance join algorithm that works on
hierarchical spatial data structures. In the paper, the authors use a data structure called R-tree as
the storage of the spatial data/R-tree blocks. A priority queue based approach is adopted
to drive the ranking algorithm. Pairs of spatial objects and R-tree
blocks are inserted into the priority queue. The distance between the items of each pair is used as
the criterion for ordering the queue. At each step of the algorithm, the pair at the head
of the priority queue, i.e., the pair with the smallest distance, is retrieved and processed.
If the pair is formed by two data objects, then the pair is reported as the next closest
pair. If one of the items in the dequeued pair is a node from the R-tree, then the R-tree
node in the pair is substituted by its descendants, i.e., objects or sub-nodes, to form
new pairs. This method works in an incremental fashion. Their algorithm has a
drawback: pairs in the priority queue are processed sequentially. Thus in a P2P
network, the algorithm will work inefficiently due to the accumulated communication
delay. The algorithm examined in this report employs a similar priority queue based
approach, but it is carefully designed so that it works efficiently in P2P networks by
utilizing the parallelism in the network.
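The sequential priority queue approach described above can be illustrated with a small
sketch. It is not the algorithm of [10] itself: the Node class, the use of plain (x, y)
points as data objects and the rule of always expanding the first item of a pair are
simplifying assumptions made here for illustration only.

```python
import heapq
import itertools
import math

class Node:
    """Hypothetical index node: a bounding box over its children (nodes or points)."""
    def __init__(self, children):
        self.children = children
        boxes = [bbox(c) for c in children]
        self.box = (min(b[0] for b in boxes), min(b[1] for b in boxes),
                    max(b[2] for b in boxes), max(b[3] for b in boxes))

def bbox(item):
    """Bounding box of a node, or a degenerate box for an (x, y) point."""
    if isinstance(item, Node):
        return item.box
    x, y = item
    return (x, y, x, y)

def min_dist(a, b):
    """Lower bound on the distance between anything inside a and anything inside b."""
    ax0, ay0, ax1, ay1 = bbox(a)
    bx0, by0, bx1, by1 = bbox(b)
    dx = max(bx0 - ax1, ax0 - bx1, 0.0)
    dy = max(by0 - ay1, ay0 - by1, 0.0)
    return math.hypot(dx, dy)

def incremental_distance_join(root_a, root_b):
    """Yield (distance, point_a, point_b) in non-decreasing order of distance."""
    tie = itertools.count()  # tie-breaker so the heap never compares Node objects
    heap = [(min_dist(root_a, root_b), next(tie), root_a, root_b)]
    while heap:
        d, _, a, b = heapq.heappop(heap)
        if not isinstance(a, Node) and not isinstance(b, Node):
            yield d, a, b  # object-object pair at the head: next closest pair
        elif isinstance(a, Node):
            for child in a.children:  # replace the node by its descendants
                heapq.heappush(heap, (min_dist(child, b), next(tie), child, b))
        else:
            for child in b.children:
                heapq.heappush(heap, (min_dist(a, child), next(tie), a, child))
```

Because the function is a generator, a caller can stop after the first few pairs, which
mirrors the incremental behaviour. Every pop depends on the previous one, which is the
sequential bottleneck discussed above.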
2.2. Distributed Quadtree Index
2.2.1. Partition Spatial Data Using Quad-CIF Tree
The distance join algorithm examined in this report is based on the distributed
quadtree index proposed in [11]. In [11], a data structure called quad-CIF
tree [12] is used for partitioning spatial data. A quad-CIF tree is a variation of the quadtree
[13] and was originally used for speeding up algorithms used in computer-aided
design of integrated circuits [12]. A quadtree is a tree data structure in which each node
can have at most 4 sub nodes. The quadtree can represent a 2D space in the
following way. At the beginning, a root node in the quadtree represents the entire 2D
space. The space is then divided into 4 identical sub regions, which corresponds to the root
node splitting itself into 4 sub nodes, each of them corresponding to
a sub region. For each of the sub regions, the same process then proceeds
recursively until a certain criterion is met. Figure 1 shows this process. The quad-CIF
tree extends the quadtree definition in that it specifies the criteria for when to start
the subdivision and when to stop it, given the distribution of
spatial data in the 2D space. The start and stop rules are defined as follows:
For any spatial object within a 2D region, the region that completely contains the
spatial object splits itself into 4 identical sub regions; any of the 4 sub regions that
completely contains the spatial object splits itself again, until no sub region can
contain the spatial object in its entirety. The spatial object is then inserted into the node
which corresponds to the smallest region that contains the spatial object in its entirety.
Figure 1. Quad tree demo
Figure 2. Quad-CIF tree partitions spatial data
The process is depicted in figure 2.
In [11], the authors introduce the concept of a control point for each
region and sub region, which is simply the centroid of the region. As shown in figure
2, each node in the quadtree maintains the information about its corresponding control
point u, denoted as D(u), which can be represented by the following formula:
$D(u) = (\{d_1, d_2, d_3, d_4\}, (x, y), \mathit{list})$
Basically, these are 3 pieces of information: first, the information about the 4 children
of the node, denoted as d_1, d_2, d_3, d_4, which are integers indicating how
many spatial objects the corresponding child has; (x, y) is the 2D Cartesian
point in the 2D region; and list contains all the spatial objects which are inserted
into this quadtree node. This information is crucial for the search algorithms (range query,
nearest neighbor query, distance join query). It makes it possible to decide
whether to forward a query further down the quadtree. Details are given in
section 3.
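A minimal sketch of quad-CIF insertion together with the per-node record D(u) may make the
rules above concrete. The ControlPoint class, the quadrant numbering and the choice to let
d_i count every object stored anywhere below child i are assumptions of this sketch, not
details taken from [11]. Rectangles are (x0, y0, x1, y1) tuples and are assumed to have a
positive extent; a zero-size object would never stop subdividing here.

```python
from dataclasses import dataclass, field

@dataclass
class ControlPoint:
    """D(u): child counts d_1..d_4, centroid (x, y), and the stored object list."""
    x: float
    y: float
    half: float  # half the side length of the controlled square region
    d: list = field(default_factory=lambda: [0, 0, 0, 0])
    objects: list = field(default_factory=list)
    children: dict = field(default_factory=dict)

def child_index(cp, rect):
    """Index 0..3 of the quadrant that fully contains rect, or None if it straddles."""
    x0, y0, x1, y1 = rect
    if x1 <= cp.x:
        col = 0
    elif x0 >= cp.x:
        col = 1
    else:
        return None
    if y1 <= cp.y:
        row = 0
    elif y0 >= cp.y:
        row = 1
    else:
        return None
    return 2 * row + col

def insert(cp, rect):
    """Store rect at the smallest region that contains it in its entirety."""
    i = child_index(cp, rect)
    if i is None:  # no sub region contains rect entirely: stop subdividing
        cp.objects.append(rect)
        return cp
    cp.d[i] += 1  # assumption: d_i counts objects stored anywhere below child i
    if i not in cp.children:
        q = cp.half / 2
        cx = cp.x + (q if i % 2 else -q)
        cy = cp.y + (q if i // 2 else -q)
        cp.children[i] = ControlPoint(cx, cy, q)
    return insert(cp.children[i], rect)
```

With a root created as ControlPoint(0.5, 0.5, 0.5) for the unit square, a rectangle that
straddles the centre stays at the root, while a small rectangle in one corner descends
until a centroid cuts through it.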
2.2.2. Routing Desired Data Using Chord
The proposed P2P distance join algorithm employs the distributed quadtree
index as well as the well-known DHT (distributed hash table) protocol Chord [14] as
the application-level routing protocol.
There are 2 major reasons for choosing Chord as the application-level
routing protocol.
Firstly, the hash function which Chord employs provides uniformly random
key-location mappings, so that keys are distributed nearly uniformly
among the peers in the P2P network. In other words, no peer is allocated
significantly more keys than others. This is good for load balancing, because no peer in the
network is overloaded by having disproportionately many queries forwarded to it. Secondly,
Chord uses consistent hashing, with SHA-1 [15] as its base hash function, which is well suited
to an unstable network such as a P2P network where peers leave and join frequently.
Without consistent hashing, as peers join or leave the network, nearly all the existing
keys must be remapped, with the result that most of the network bandwidth
is taken up by the messages used for remapping.
As mentioned previously, every node in the quadtree stores a control point
which controls the underlying region. To distribute the quadtree among the
available machines in the P2P network, the string representation of the x and y coordinates
of the control point stored at each quadtree node is used as the input of the SHA-1 hash
function. It is in the format (x, y). In practice, no two control points are hashed to
the same location, because no two control points are exactly the
same. With the Chord protocol, a control point and the information about it (D(u),
described previously) are hashed into the Chord virtual circle space. With the string
representation of a control point one can easily find the desired data just by following
the Chord specification. Figure 3 shows one possible result of hashing the control point of
each quadtree node to the Chord virtual circle space. As depicted in the figure, peer1
has the control points C, O and CB along with the spatial data stored; peer345 has the
control points D and CD; etc.
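The mapping from a control point to a peer can be sketched as follows. This is only an
illustration: the 16-bit identifier space, the peer names and the coordinates given to O,
C and CD are hypothetical, and the responsible peer is looked up directly in a sorted list
rather than through Chord's finger-table routing.

```python
import bisect
import hashlib

M = 16  # hypothetical identifier-space size in bits (SHA-1 itself yields 160 bits)

def chord_id(text, m=M):
    """Hash a string with SHA-1 and reduce it to an m-bit ring identifier."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

def control_point_key(x, y):
    """String form "(x, y)" of a control point, used as the hash input."""
    return f"({x}, {y})"

def successor(peer_ids, key_id):
    """First peer clockwise from key_id on the ring (peer_ids must be sorted)."""
    i = bisect.bisect_left(peer_ids, key_id)
    return peer_ids[i % len(peer_ids)]  # wrap around past the largest identifier

# Hypothetical peers and control point coordinates, for illustration only.
peers = sorted(chord_id(f"peer-{n}") for n in range(8))
for name, (x, y) in {"O": (0.5, 0.5), "C": (0.25, 0.75), "CD": (0.375, 0.875)}.items():
    kid = chord_id(control_point_key(x, y))
    print(name, kid, "->", successor(peers, kid))
```

Any peer that knows the coordinates of a control point can recompute the same key, which
is why no extra lookup table is needed to find the peer that stores D(u).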
When partitioning the spatial data, smaller objects tend to be inserted at
deeper levels in the quadtree, which may cause a query to be passed down
many levels in the quadtree before a spatial object is found. A major impact of
this is that more messages are needed to find the smaller spatial objects, which
causes longer communication delay. A variable Fmax is proposed to specify the
maximum level in the quadtree into which a spatial object can be inserted. Variable
Fmax prevents the quadtree generated by the partitioning process from becoming too deep,
which would result in long traversals of the quadtree when doing queries.
Note that all queries start processing from the peer which has the
information about the root quadtree node, which may create a single point of failure. A new
variable similar to Fmax, namely Fmin, is defined. Fmin specifies the
minimum level in the quadtree into which a spatial object can be inserted. When
spatial objects are inserted into the quadtree, they are inserted at least as deep as the
Fmin level nodes in the quadtree. When no Fmin level node can contain the spatial object in
its entirety, the spatial object is inserted into those Fmin level nodes whose
controlled regions intersect the spatial object. By doing this, every query now
starts processing from the nodes at the Fmin level in the quadtree rather than from a single root
node.
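The effect of Fmin and Fmax on where an object is stored can be sketched on the unit
square. The function names, the (level, column, row) cell addressing and the convention
that merely touching a cell boundary does not count as intersecting are assumptions of
this sketch; it also assumes fmin <= fmax.

```python
import math

def cell_span(lo, hi, level):
    """First and last grid cell index along one axis that [lo, hi] overlaps."""
    n = 2 ** level
    first = min(int(lo * n), n - 1)
    last = max(first, min(math.ceil(hi * n) - 1, n - 1))
    return first, last

def storage_cells(rect, fmin, fmax):
    """Return the (level, col, row) cells of the unit square at which rect is stored."""
    x0, y0, x1, y1 = rect
    level = 0
    # Descend while a single cell of the next level still contains rect entirely.
    while level < fmax:
        cx = cell_span(x0, x1, level + 1)
        cy = cell_span(y0, y1, level + 1)
        if cx[0] != cx[1] or cy[0] != cy[1]:
            break
        level += 1
    if level >= fmin:
        cx, cy = cell_span(x0, x1, level), cell_span(y0, y1, level)
        return [(level, cx[0], cy[0])]
    # Too large for any single Fmin cell: store it in every Fmin cell it intersects.
    cx, cy = cell_span(x0, x1, fmin), cell_span(y0, y1, fmin)
    return [(fmin, c, r)
            for r in range(cy[0], cy[1] + 1)
            for c in range(cx[0], cx[1] + 1)]
```

For example, a rectangle covering the centre of the square with fmin = 1 is replicated
into all four level 1 cells, so a query can start at any of them instead of at the root.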
3. Algorithms for Basic Spatial Query
3.1. Range Query
3.1.1. High Level Description
Range query, nearest neighbor query and distance join query are all based on
the distributed quadtree index [11]. Figure 4 shows the pseudo code of range query. In
figure 4, procedure D(u) returns a reference to control point u; C(u, i) returns the ith
child control point of control point u; R( ) returns the range that the specified
control point controls. A range query is initiated from one peer in a P2P network by
calling the InitiateRangeQuery procedure with a parameter Q, the 2D rectangle
within which one wants to check whether any spatial objects are located.
Figure 3. Hashing result of quadtree in Chord circle space
Firstly, procedure Subdivide is called to get the Fmin level control points
whose controlled ranges intersect the query Q. Then, for each such control
point, the range query is forwarded to the peer that holds the control point by
following the Chord protocol (denoted as Delegate(u)-> DoRangeQuery(Q, u) in figure
4). Upon arrival, a peer that receives the forwarded range query returns any spatial objects
that intersect the query range Q and then, for each child of the queried control
point, forwards the query Q to those children that have spatial objects and whose controlled
range intersects the query.

InitiateRangeQuery(query Q)
{
    control point list G = {}
    Subdivide(Q, root, G)
    for each u in G do
        Delegate(u)-> DoRangeQuery(Q, u)
}
DoRangeQuery(query Q, control point u)
{
    intersect objects in D(u).list with Q
    send results
    for i = 1 to 4 do
        if (Ints(R(C(u, i)), Q) is not empty) and (D(u).di > 0) then
            Delegate(C(u, i))-> DoRangeQuery(Q, C(u, i))
}
Figure 4. Algorithm for range query

The range query process is shown in figure 5 with Fmin=0. Peer1567 initiates the range
query. The translucent rectangle (denoted as query Q in the figure) is the query rectangle.
In a distributed quadtree index P2P network, every query starts processing from the Fmin
level in the quadtree. In this case Fmin=0, so the query starts from the root node and is
passed down the quadtree. Initially, the result of Subdivide contains only control point O,
which controls the entire region. Peer1567 then passes the query to the peer in the network
which has the data about control point O. This process is depicted as the curve marked 1 in
figure 5. With the help of Chord, the query is passed to peer1, which has the information
about control point O. When the query arrives at peer1, peer1 first examines whether it has
any spatial data (in this simplified example, rectangles) that intersects the query
rectangle; then it checks whether there are any children of node O whose controlled
range intersects the query rectangle Q and which have spatial data. After examining,
peer1 finds that C, a child of O, meets these requirements. Peer1 then forwards
the query to the peer which has the information about control point C. With Chord, we
know that this peer is still peer1. This process is depicted by the curve marked 2. Peer1
repeats the procedure of step 1 and finds that sub region CD intersects the query Q and has
spatial data in it. Peer1 then forwards the query to the peer which has the information about
control point CD, namely peer345. The routing process is depicted by the curve marked 3.
When the query arrives at peer345, it finds that it has the spatial object rectangle1 and that
no sub regions have spatial objects. After sending the result back, the range query stops.
As described, the query starts at the root node and is passed down the quadtree in
the order O->C->CD.
Figure 5. Chord, quad tree and spatial data
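The walk-through above can be replayed with a small single-process sketch in which a
direct function call stands in for each Delegate(u) message. The INDEX dictionary, the
coordinates of the regions and of rectangle1, and the trace list are all hypothetical;
only the order of visits is meant to mirror the example.

```python
def intersects(a, b):
    """True when axis-aligned rectangles a and b, given as (x0, y0, x1, y1), overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# Hypothetical index contents. Each entry plays the role of D(u): the controlled
# region, the stored objects, and the children that hold objects (d_i > 0).
INDEX = {
    "O": {"region": (0, 0, 8, 8), "objects": [], "children": ["C"]},
    "C": {"region": (0, 0, 4, 4), "objects": [], "children": ["CD"]},
    "CD": {"region": (2, 0, 4, 2),
           "objects": [("rectangle1", (2.5, 0.5, 3.5, 1.5))],
           "children": []},
}

def do_range_query(query, key, results, trace):
    """DoRangeQuery: report local hits, then forward to intersecting children."""
    trace.append(key)  # stands in for one message routed through Chord
    node = INDEX[key]
    results.extend(name for name, rect in node["objects"] if intersects(rect, query))
    for child in node["children"]:
        if intersects(INDEX[child]["region"], query):
            do_range_query(query, child, results, trace)

def initiate_range_query(query, start_keys=("O",)):
    """InitiateRangeQuery with Fmin = 0, so the only starting control point is O."""
    results, trace = [], []
    for key in start_keys:
        if intersects(INDEX[key]["region"], query):
            do_range_query(query, key, results, trace)
    return results, trace
```

Calling initiate_range_query((3, 1, 5, 3)) visits O, then C, then CD and reports
rectangle1, while a query that misses region C stops after contacting O.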
3.1.2. Implementation
For the implementation part, I use tables to show the features which I implemented;
in the Extra column, I added special notes and key points that require attention.
Table 1 shows the implementation details.
Item | Implemented | Extra
Routing (Chord) | Basic data structures, find_predecessor, find_successor | This project does not deal with the issues that arise when nodes join or leave the Chord network; only routing is dealt with. Caching mechanism in Chord is NOT implemented.
Indexing | Basic data structures | Quadtree, control point, quadtree node, rectangle, Fmin, Fmax, etc.
Algorithm | Basic data structures, InitiateRangeQuery(), Subdivide(Q, root, G), Delegate(u), DoRangeQuery(Q, u) | Implementation strictly follows the protocol defined in the original paper [8]
Table 1. Implementation details for range query
3.2. Nearest Neighbor Query
3.2.1. High Level Description
Hjaltason and Samet [9] gave a comprehensive analysis of various similarity search
algorithms in metric spaces. Their main contribution was a priority
queue based ranking algorithm that can find the results of a ranking query in an
incremental fashion. Ranking is a more general form of NN query where all the
spatial objects are eventually retrieved in increasing order of their distance
from a query point. In the first iteration of the algorithm, the root node of
the data structure is inserted into the priority queue. The priority is measured
by the distance between the element and the query point. In the next iteration
of the algorithm, the root is removed and all its children are in turn added to the priority
queue. In this fashion, at each iteration of the algorithm, the element
with the smallest distance is removed and visited, and its children are inserted into
the queue. Eventually, there will be an object at the head of the queue, which is
the object with the shortest distance to the query point. Note that their algorithm
works in an incremental fashion. Elements in the priority queue are
contacted sequentially, which is clearly not suitable for the P2P paradigm, where the
power of parallelism must be fully exploited.
Tanin et al. proposed an elegant way of doing nearest neighbor query in
[8]. Their algorithm is based on the priority queue based approach. Figure 6
shows the pseudo code of nearest neighbor query. The peer that initiates the nearest
neighbor query maintains the priority queue. At the beginning, instead of inserting just
the root node into the priority queue, all the control points at level Fmin are inserted.
There is a new variable called WCDist, which is the worst case distance from the query point
to the controlled range of a control point. WCDist is used as the criterion to
decide which peers are contacted in parallel during one iteration of the algorithm.
This is the most remarkable difference between this algorithm and the algorithm
proposed by Hjaltason and Samet in [10]. During each iteration, WCDist is
updated as follows. Let d be the distance between the first spatial object (if any) in the
priority queue and the query point, and let D be the maximum distance between the
query point and the top element of the queue (which cannot be a spatial object, because
spatial objects at the top are removed as soon as they are found). Then WCDist = Min(d, D).
All control points in the priority queue whose controlled ranges lie at a distance
less than or equal to WCDist from the query point are then contacted in parallel.

InitiateNNQuery(query q)
{
    priority queue pqueue = GetSortedControlPoints(q, fmin)
    control point c = FindControlPoint(q, fmin)
    WCDist = MaxDist(q, c)
    SendMessagesWithin(WCDist)
}
DoNNQuery(control point u)
{
    msg = CreateReplyMessage()
    msg.Put(D(u).list)
    for i = 1 to 4 do
        if (D(u).di > 0) then
            msg.Put(C(u, i))
    SendMessageBack(msg)
}
Synchronized ReceiveNNMessage(message msg)
{
    for each object X in msg.list do
        pqueue.Add(X)
    for each control point u in msg do
        pqueue.Add(u)
    pqueue.Remove(SenderOf(msg))
    WCDist = UpdateWCDist()
    SendMessagesWithin(WCDist)
}
Figure 6. Algorithm for nearest neighbor query

Figure 7. Process of nearest neighbor query

The entire process is depicted in figure 7 with Fmin=1. Peer345 initiates the
nearest neighbor query by calling InitiateNNQuery. GetSortedControlPoints
returns a priority queue which contains the level 1 control points, namely A, B, C and D.
The status of the priority queue is denoted as priority queue status 1 in the figure.
The first WCDist and the range it covers are denoted by the quadrant marked as
Wcdist1 in the figure. SendMessagesWithin therefore forwards the query in
parallel to the peers which possess control points C, A and D respectively. As shown in
the figure, peer345, peer1 and peer m get this message. The DoNNQuery procedure
is then called at each of them. Each creates a reply message, puts into it any spatial
objects it has along with any child control points which have spatial objects,
and sends it back to the query initiating peer, in this case peer345. Assume the reply
message corresponding to control point C arrives at peer345 first (the arrival order
may vary due to message delay; however, this doesn't affect the correctness of the
algorithm). ReceiveNNMessage is called at peer345. After all the control
points and spatial objects are inserted into the priority queue, the status of the priority
queue is denoted as priority queue status 2 in figure 7. Control point C is deleted from the
priority queue after the reply message corresponding to it has been handled. Then
UpdateWCDist is called to update WCDist. The updated WCDist is shown as the
smaller quadrant in figure 7 denoted as Wcdist2, for which the SendMessagesWithin
procedure sends the query only to the peer that has control point CD (because
control points A and D have been contacted previously). This time peer345 is contacted.
Before peer345 returns a reply message, assume the reply message about control
point A arrives at the query initiating peer, peer345; according to the
algorithm, the spatial objects and control points in it are inserted into the priority queue.
Priority queue status 3 in the figure shows the status of the priority queue after
insertion. Note that control points D, B and CD are closer to the query point
than rectangle 0; thus, rectangle 0 is at the end of the priority queue. Now
peer345 sends the reply message back, along with the spatial object rectangle 1, to the
query initiating peer. After this iteration, the status of the priority queue is shown as
priority queue status 4. Now a spatial object has become the head of the
queue, so it is the nearest spatial object with respect to query point q. The
algorithm can now stop or proceed as needed. Because neither control points
B and D nor their children possess any spatial objects, B and D are simply deleted
when the reply messages corresponding to them are returned. The nearest neighbor
query stops automatically when the priority queue is empty.
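The WCDist rule used in this walk-through, WCDist = Min(d, D), can be sketched as below.
The dictionary layout of a queue entry, the numeric distances and the handling of an
object at the head of the queue (treated as D = infinity) are assumptions made for this
illustration; [8] may organise the queue differently.

```python
import math

def update_wcdist(queue):
    """WCDist = min(d, D) for a queue sorted by minimum distance to the query point.

    Each entry is a dict with keys 'name', 'kind' ('object' or 'control'),
    'min_dist' and 'max_dist' (hypothetical layout, for illustration only).
    """
    if not queue:
        return None
    head = queue[0]
    # D: maximum distance from the query point to the head control point's range.
    big_d = head["max_dist"] if head["kind"] == "control" else math.inf
    # d: distance of the first spatial object found anywhere in the queue, if any.
    small_d = next((e["min_dist"] for e in queue if e["kind"] == "object"), math.inf)
    return min(small_d, big_d)

def to_contact(queue, wcdist, already_contacted):
    """Control points within WCDist that have not been contacted yet."""
    return [e["name"] for e in queue
            if e["kind"] == "control"
            and e["min_dist"] <= wcdist
            and e["name"] not in already_contacted]

# Hypothetical distances chosen so that C, A and D fall within the first WCDist.
queue = [
    {"name": "C", "kind": "control", "min_dist": 0.0, "max_dist": 5.0},
    {"name": "A", "kind": "control", "min_dist": 2.0, "max_dist": 8.0},
    {"name": "D", "kind": "control", "min_dist": 3.0, "max_dist": 9.0},
    {"name": "B", "kind": "control", "min_dist": 6.0, "max_dist": 12.0},
]
wcdist = update_wcdist(queue)
batch = to_contact(queue, wcdist, already_contacted=set())
```

All control points in one batch can be contacted in parallel, and keeping a set of
already contacted control points avoids sending the same request twice.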
3.2.2. Implementation
Table 2 shows the implementation details of the algorithm for nearest neighbor
query.
Item | Implemented | Extra
Routing (Chord) | Basic data structures, find_predecessor, find_successor | This project does not deal with the issues that arise when nodes join or leave the Chord network; only routing is dealt with. Caching mechanism in Chord is NOT implemented.
Indexing | Basic data structures | Quadtree, control point, quadtree node, rectangle, Fmin, Fmax, etc.
Algorithm | Basic data structures, InitiateNNQuery(query q), GetSortedControlPoints(q, fmin), FindControlPoint(q, fmin), SendMessagesWithin(WCDist), DoNNQuery(control point u), CreateReplyMessage, SendMessageBack(msg), Synchronized ReceiveNNMessage(message msg), UpdateWCDist() | Implementation strictly follows the protocol defined in the original paper [8]. Data structures include priority queue, two types of queue elements, etc.
Table 2. Implementation details for nearest neighbor query
4. Distance Join Algorithm for P2P Networks
4.1. High Level Description
The distance join algorithm works on two sets of spatial data. The goal of the
algorithm is to find the closest pairs of spatial objects from the two spatial data sets.
This type of search has great potential in real life. Imagine that on a weekend one
wants to go out for dinner and then watch a great movie. The first thing that comes
to mind is to find a restaurant with a cinema nearby. The
shorter the distance between the two the better (no one wants to drive a long
way to watch a movie after having dinner). Finding the closest
cinema-restaurant pair is one possible application of the distance join algorithm.
One straightforward approach is to retrieve all the spatial objects in data
set 1 and data set 2, compute the Cartesian product of the two sets, and
order the result by increasing distance. The first pair in the
ordered result is the closest pair. This is clearly not suitable for a large P2P
network with a huge number of spatial objects distributed among
the machines in the network. Several works have been done on
distance join algorithms [9, 10]. However, the algorithms proposed
only work in a centralized environment and proceed
sequentially. To fully exploit the advantages of P2P networks, extra
work has to be done.
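For comparison, the straightforward Cartesian product approach can be written in a few
lines. Representing each spatial object by a single (x, y) point is a simplification
assumed here; the function name and the sample coordinates are illustrative.

```python
import itertools
import math

def naive_distance_join(set_x, set_y):
    """Return every (distance, x, y) pair, sorted by increasing distance.

    All |X| * |Y| pairs are materialised, so every object of both data sets must
    first be fetched to one machine.
    """
    pairs = [(math.dist(x, y), x, y) for x, y in itertools.product(set_x, set_y)]
    pairs.sort(key=lambda p: p[0])
    return pairs

restaurants = [(0, 0), (5, 5)]  # hypothetical sample data
cinemas = [(1, 0), (9, 9)]
closest = naive_distance_join(restaurants, cinemas)[0]
```

The quadratic number of pairs and the need to gather all objects centrally are the two
costs that the P2P algorithm in this section is designed to avoid.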
Chord, the distributed quadtree index and the priority queue based
approach together form the essence of the newly proposed distance join
algorithm for P2P networks. Similar to the proposal in [10], the query
initiating peer maintains the priority queue and acts as a query processing
front. Two pieces of information are crucial for forwarding a distance join
query at the query initiating peer. One is the information about how the quadtree
partitions the underlying spatial data. The other is the information about the 4
children of a control point. The former
is implicitly known by every peer in the P2P network, thus no communication is
required. The latter is automatically obtained after distributing the quad-CIF tree
among the machines in the P2P network (as mentioned in section 2.2.1, each control
point contains information in the form $D(u) = (\{d_1, d_2, d_3, d_4\}, (x, y), \mathit{list})$).
Therefore, it is very easy for a query initiating peer to forward the distance join query
from the root node down the quadtree. Figure 8 is the pseudo code for the P2P distance
join algorithm.

JoinInit(QuadTreeNode root1, QuadTreeNode root2)
{
    PQueue = new PriorityQueue()
    MessageCacheList = new List()
    controlpoint1 = GetRootControlPoint(root1)
    controlpoint2 = GetRootControlPoint(root2)
    SendMessageTo(controlpoint1, id)
    SendMessageTo(controlpoint2, id)
}
ProcessReply(ControlPoint u, id)
{
    msg = CreateReplyMessage(id)
    msg.Put(D(u).list)
    for i = 1 to 4 do
        if (D(u).di > 0) then msg.Put(C(u, i))
    SendMessageBack(msg)
}
Synchronized RecvMessage(Message msg)
{
    if MessageCacheList.contains(msg.id) then
        doCombine(msg, MessageCacheList.get(msg.id))
        PQueue.deque(msg, MessageCacheList.get(msg.id))
    else
        MessageCacheList.add(msg)
        return
    for each new pair P generated from doCombine do
        PQueue.add(P)
    WCDist = UpdateWCDist()
    for each element pair E in PQueue do in parallel
    {
        if E.Dist ...
Figure 8. Pseudo code for P2P distance join algorithm
Initially, there is only one pair in the priority queue, namely the pair of root control
points of the two quadtrees. As the algorithm proceeds, pairs of control points and spatial
objects are inserted into the priority queue. Thus four types of queue element exist:
(spatial object, spatial object), (spatial object, control point), (control point, spatial
object) and (control point, control point). The processing of a pair at the query initiating
peer must be strictly synchronized, in the sense that the replies to messages that were sent
as a pair must be processed together. However, due to the uncertainty in communication delay,
reply messages may arrive at the query initiating peer at arbitrary times. Therefore, to
handle reply messages pairwise, extra work has to be done. My solution is to give the
messages that are sent as a pair a unique ID and to cache a reply message whose paired
reply has not yet been received. Whenever a reply message with the same ID as a
cached one is received, the two reply messages belong to one pair and
can be handled together. This strict synchronization of pairwise
message processing guarantees that the new pairs generated by doCombine will not
contain redundant pairs. As shown in the algorithm, pairs in the priority queue are
contacted in parallel rather than sequentially. The variable WCDist is
used as the criterion to determine which pairs are contacted. The procedure
UpdateWCDist updates WCDist in the following way: let D be the maximum
distance between the items of the pair that is at the head of the priority queue and is
not an object-object pair, and let d be the maximum distance between the spatial
objects of the first object-object pair (if any) found in the priority queue (it cannot be
at the head, because as soon as an object-object pair reaches the head, it is retrieved as
the next closest pair). Then WCDist=Min(D,d). All pairs in the priority queue whose distance
between the two items is less than or equal to WCDist are then contacted in
parallel, which makes this algorithm distinct from the traditional sequential algorithm.
Figure 9. Process of distributed distance join algorithm
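The ID-based pairing of reply messages can be sketched with a small cache. The class and
function names are hypothetical, and a reply is reduced to a plain list of item names; in
a fuller version each reply would also carry which data set it belongs to, so that the
order of the two halves does not depend on arrival order.

```python
import itertools

class PairedReplyCache:
    """Buffers the first reply of each pair until its partner with the same ID arrives."""

    def __init__(self):
        self._pending = {}

    def receive(self, msg_id, reply):
        """Return (first_reply, second_reply) once both halves are in, else None."""
        if msg_id in self._pending:
            return self._pending.pop(msg_id), reply
        self._pending[msg_id] = reply  # early arrival: wait for the partner
        return None

def do_combine(reply_x, reply_y):
    """Form every cross pair between the items carried by two paired replies."""
    return list(itertools.product(reply_x, reply_y))

cache = PairedReplyCache()
first = cache.receive(7, ["A", "C"])   # reply for data set X arrives first
other = cache.receive(8, ["CD"])       # unrelated reply with a different ID
paired = cache.receive(7, ["B"])       # partner reply for ID 7 arrives
new_pairs = do_combine(*paired)
```

Because a cached reply is removed as soon as its partner arrives, each pair of replies is
combined exactly once, which is what keeps redundant pairs out of the priority queue.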
Figure 9 shows a simple case that demonstrates the distance join algorithm. There
are 2 sets of data, depicted using two different colors. Rectangles X0 and X1 belong
to data set X. Rectangles Y0 and Y1 belong to data set Y. At the beginning, procedure
JoinInit is called at the query initiating peer. As shown in the pseudo code, the
peers that own the root control point of each data set are contacted first; in this
case, these are the two control points O of the two data sets. Two distance join
initialization messages are sent with the same unique ID (so that the replies can be
processed as a pair). Whenever a peer receives a distance join related message,
procedure ProcessReply is called: the peer puts any spatial objects, along with any
children control points that contain spatial objects, in a reply message and sends it
back to the query initiating peer. Procedure RecvMessage is called at the query
initiating peer upon receiving a reply message. Because the reply messages
corresponding to a pair of sent messages can be delayed randomly, a message cache
temporarily stores the reply that arrives early so that the messages can be processed
in pairs (the unique ID is used to pair messages). Assume the reply message from the
peer that owns control point O of data set X arrives first, and that of data set Y
arrives second. The algorithm then finds the paired reply messages and calls procedure
doCombine to generate new pairs from the reply messages. After processing the
messages, it deletes the processed element from the queue. One of the possible
statuses of the priority queue at this point is denoted as Status 1 in Figure 9 (it
can also be (A,B),(C,B), because the distance between control blocks A and B is equal
to that between C and B). Then the worst case distance WCDist is calculated; the
result is denoted in the figure as WCDist1, which is the maximum distance between
control blocks C and B. Then the pairs in the priority queue whose distance between
the two items is less than or equal to WCDist1 are contacted. Thus the peer that has
control point C in data set X and the peer that has control point B in data set Y are
contacted. The same holds for pair (A,B). This completes the first iteration of the
algorithm. Note that the same control point of one data set may appear in more than
one pair in the priority queue and would thus potentially be contacted multiple times,
which causes communication overheads. To overcome this problem, the results of
previously contacted control points are stored locally at the query initiating peer to
eliminate unnecessary communication. In the next iteration, assume the paired reply
messages for (C,B) arrive first (the algorithm also works correctly if the paired
reply messages for (A,B) arrive first). Status 2 in Figure 9 shows the content of the
priority queue after receiving the reply messages for (C,B). Status 3 shows the
content after receiving the reply messages for (A,B). Note that a new iteration may
begin when the queue is in Status 2, in which case the previously contacted pair (A,B)
will not be contacted again. Assume the new iteration begins after Status 3. The
corresponding updated WCDist is denoted as WCDist2 in the figure, which is the maximum
distance between rectangle X0 and control block BA. Again, the pairs in the priority
queue that satisfy the worst case criterion are contacted. In this case, all 4 pairs
are contacted. For clarity and simplicity, we only look at pair (rectX0, BA). When the
reply message for control point BA is received and procedure doCombine has been
called, the content of the queue is as denoted in the figure as Status 4. As shown in
the figure, an object-object pair appears at the top of the queue; this is the closest
pair across the two data sets. Once such a pair is found, it is retrieved immediately,
and the algorithm should allow the users to determine whether to proceed with or stop
the distance join algorithm.
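The pairing of reply messages through the message cache can be sketched as follows.
This is a minimal sketch under stated assumptions rather than the code of this
project: the ReplyCache class, the "X"/"Y" data set labels and the representation of a
reply as a plain list of item names are hypothetical, and do_combine simply forms all
cross pairs without computing distances.

```python
import threading
from itertools import product


def do_combine(x_items, y_items):
    """Form every cross pair between the items of the two paired replies."""
    return list(product(x_items, y_items))


class ReplyCache:
    """Pairs reply messages that carry the same unique ID (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()   # replies may arrive concurrently
        self._pending = {}              # msg_id -> (dataset, items) of the early reply

    def recv_message(self, msg_id, dataset, items):
        """Return new pairs once both replies for msg_id have arrived, else None."""
        with self._lock:
            early = self._pending.pop(msg_id, None)
            if early is None:
                # First reply of the pair: cache it and wait for its partner.
                self._pending[msg_id] = (dataset, items)
                return None
        # Second reply: order the two replies as (X, Y) regardless of arrival order.
        # Assumes the two replies come from different data sets.
        replies = dict([early, (dataset, items)])
        return do_combine(replies["X"], replies["Y"])


cache = ReplyCache()
print(cache.recv_message(1, "X", ["A", "C"]))  # None, the reply is cached
print(cache.recv_message(1, "Y", ["B"]))       # [('A', 'B'), ('C', 'B')]
```

Because a cached reply is removed as soon as its partner arrives, each pair of replies
is combined exactly once, which is what keeps redundant pairs out of the queue.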
The simple example described previously started the query from the root control
point of each data set. The distributed quadtree index allows spatial data to be
inserted from the Fmin level of the quadtree rather than from the root level (the root
level corresponds to Fmin=0). Therefore a slight modification of the algorithm is
needed to allow a query to start from the Fmin level rather than the root level, which
avoids the communication overhead of forwarding the query from level 0 to the Fmin
level.
4.2. Implementation
The table below shows the implementation details of the P2P distance join algorithm.
Item: Routing (Chord)
Implemented: Basic data structures; find_predecessor; find_successor
Extra: This project does not deal with the issues that arise when nodes join or leave
the Chord network; only routing is dealt with. The caching mechanism in Chord is NOT
implemented.

Item: Indexing
Implemented: Basic data structures
Extra: Quadtree, control point, quadtree node, rectangle, Fmin, Fmax, etc.

Item: Algorithm
Implemented: Basic data structures; JoinInit(QuadTreeNode root1, QuadTreeNode root2);
MessageCacheList; SendMessageTo(controlpoint,id); ProcessReply(ControlPoint u,id);
CreateReplyMessage(id); SendMessageBack(msg); Synchronized RecvMessage(Message msg);
doCombine(msg1,msg2); PQueue.deque(msg.id); UpdateWCDist()
Extra: The implementation strictly follows the protocol defined in the original paper
[6]. Data structures include the priority queue, four types of queue element, queue
operations, etc. The priority queue only allows sequential access, but the
implementation allows contacting multiple peers in parallel. The implementation only
allows a distance join query to start from the root level rather than from the Fmin
level. The implementation allows caching the results of previously contacted control
points.

Table 3. Implementation details of P2P distance join algorithm
5. Experiments
5.1. Experimental Environment
Figure 10. Example of transit-stub model
In the experiments, J-Sim (www.j-sim.org) is used as the simulation environment.
Because there are no random factors that could lead to different results for the same
test case, each test case is run only once.
My experiments are based on several assumptions: 1. No packets are lost during
communication; 2. Query response time is caused mainly by message propagation delay;
3. The P2P network is stable, so that during the entire course of the experiments no
node leaves or joins the network and no node crashes. These assumptions create an
ideal world in which the performance of the algorithm is measured in its ideal state.
Before conducting experiments, the network topology and the test data sets must be
prepared. For the network topology, I create a static topology for each test case,
which is similar to the Transit-Stub model [16] shown in Figure 10, where intermediate
nodes can be regarded as transit nodes and nodes shown on the edge can be regarded as
stub nodes. In real life, transit domains can be thought of as metropolitan area
networks, and transit nodes play the role of Internet service providers. Stub domains
resemble networks within different organizations, companies, campuses, etc. Table 4
gives the physical characteristics of the underlying network used in J-Sim. All of the
test parameters are chosen to closely reflect a real world scenario. Some of them are
statistics obtained from Rogers Communications Inc [17].
Parameter                                          Value   Unit
Network delay in local area network                10      ms
Network delay between stub nodes                   40      ms
Network delay between stub node and transit node   200     ms
Network delay between transit nodes                200     ms
Bandwidth in local area network                    54      Mbps
Bandwidth between stub nodes                       100     Mbps
Bandwidth between stub node and transit node       100     Mbps
Bandwidth between transit nodes                    1000    Mbps

Table 4. Physical parameters for underlying network

For test data sets, obtaining real life data can be tricky. Thus a solution must be
found to generate near real life test data sets that resemble, for example, the
distribution of all the restaurants in the urban region of Melbourne or of all the
seven-eleven convenience stores in the urban region of Melbourne. Merely adopting the
random functions provided by the API can only yield uniformly distributed data, which
cannot reflect the genuine performance of this algorithm in the real world. According
to Zipf's law [18], many types of data studied in the physical and social sciences can
be approximated with a Zipfian distribution [18]. My test data sets are generated
roughly following the Zipfian distribution. A 2D region is divided into 8 square
rings, all of them sharing a centroid (the innermost one becomes a square). A fixed
number of spatial objects are distributed in the following manner: the number of
spatial objects in an inner square ring is roughly twice that in its immediate outer
square ring; and within a given square ring, the random function API is used to
generate spatial data. By doing this, spatial objects are densely distributed in the
central area of the 2D region and sparsely distributed in the outer region, which
simulates the real life data distribution. Figure 11 shows one example of a
distribution of 400 spatial objects that follows the Zipfian distribution.

Figure 11. Sample test data with 400 spatial objects

Generally speaking, the experiments are conducted by changing the following
parameters: Fmin; the number of peers in the P2P network; the number of queries
simultaneously initiated; the number of spatial objects in each data set. A query is
said to be finished when the top 10 closest pairs are found. Besides, peers are almost
equally allocated to stub nodes, and the number of queries from each stub domain is
roughly the same.
5.2. Results
5.2.1. Different Fmin:
The first experiment examines how Fmin affects the algorithm. There are 200 peers
in the network, which are uniformly distributed in the stub domains. Each data set
contains 200 spatial objects. The number of simultaneous client requests is set to 10
and Fmax is set to 9. The philosophy behind the variable Fmin is to avoid a single
point of failure. With Fmin, the spatial objects are forced to be inserted at the Fmin
level or deeper in the quadtree. Therefore, queries are no longer processed from the
root node. Multiple peers in the network are contacted as soon as the queries start.
One of the effects of increasing Fmin is that, as Fmin increases, the bigger spatial
objects are split into smaller pieces and the pieces fall deeper down the quadtree,
which increases the height of the quadtree and in turn the complexity of the
algorithm. Another effect is that more messages have to be sent before actual spatial
data is retrieved, which causes communication overheads.

Figure 12. Average query response time as Fmin increases
(x-axis: Fmin; y-axis: average response time in seconds)
Fmin:                       0       1       2       3       4       5       6       7       8
Average response time (s):  24.033  24.677  24.365  24.200  23.392  24.959  26.554  29.006  31.060

As can be observed in Figure 12, different Fmins do not affect the average
processing time much; as Fmin increases, the average response time curve remains
roughly steady. However, as Fmin reaches its maximum, a slight increase is observed.
This is due to the longer query message propagation delay introduced when queries are
forwarded from the root level to the Fmin level in the distributed quadtree, where the
spatial objects are actually located. For the first few Fmins, there is no significant
difference in average response time, for three reasons: 1. Finding the first 10
closest pairs is quite different from finding all the pairs; 2. Fmin does not affect
the test data set significantly until it reaches a certain value, because the test
data set contains many small spatial objects; 3. Even though splitting spatial objects
into smaller pieces causes communication overheads (shown in Figure 13), the parallel
communication property of the algorithm compensates for such overheads with regard to
average response time.
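The ring-based test data scheme described in Section 5.1 (eight concentric square
rings, each inner ring holding roughly twice as many objects as its immediate outer
ring) can be sketched as follows. This is a minimal sketch, not the generator used for
the experiments: the equal ring widths, the rejection sampling, the use of points
instead of extended spatial objects, and the names ring_counts, sample_in_ring and
generate_points are all assumptions made for illustration.

```python
import random


def ring_counts(total, rings=8):
    """Split `total` objects over the rings, innermost first, halving outwards."""
    weights = [2 ** (rings - 1 - i) for i in range(rings)]
    weight_sum = sum(weights)
    counts = [total * w // weight_sum for w in weights]
    counts[0] += total - sum(counts)   # give the rounding remainder to the centre
    return counts


def sample_in_ring(rng, half_inner, half_outer):
    """Uniform point in the square ring between two centred squares."""
    while True:
        x = rng.uniform(-half_outer, half_outer)
        y = rng.uniform(-half_outer, half_outer)
        # Reject points that fall inside the inner square.
        if max(abs(x), abs(y)) >= half_inner:
            return x, y


def generate_points(total, side=1.0, rings=8, seed=0):
    """Generate `total` points in a side x side region centred on the origin."""
    rng = random.Random(seed)
    step = side / 2 / rings            # assumed: all rings have the same width
    points = []
    for i, count in enumerate(ring_counts(total, rings)):
        for _ in range(count):
            points.append(sample_in_ring(rng, i * step, (i + 1) * step))
    return points


print(ring_counts(400))           # [203, 100, 50, 25, 12, 6, 3, 1]
print(len(generate_points(400)))  # 400
```

About half of the objects end up in the innermost square, which reproduces the dense
centre and sparse outer region that the test data is meant to have.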
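Figure 13. Average number of messages for finishing one query as Fmin increases
(x-axis: Fmin; y-axis: messages per request)
Fmin:                  0      1      2      3      4       5       6       7       8
Messages per request:  5,975  7,020  7,782  9,403  13,550  26,361  46,691  71,769  120,302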
Figure 13 shows the variation in the number of messages per query (each query
finds the first 10 closest pairs) as Fmin increases. As expected, the number of
messages increases as Fmin increases. For the first few cases, Fmin does not affect
the number of messages much. However, once it reaches 5, there is a relatively steep
increase, because the underlying 2D space is divided into many tiny squares and the
height of the distributed quadtree increases accordingly.
For different Fmins, Figure 14 shows the load distribution in terms of the
standard deviation. As can be observed, as Fmin increases, the standard deviation
drops gradually, which means the load among peers in the network tends to be more
balanced.
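As a small worked example of this load metric, the snippet below computes the standard
deviation of per-peer message counts. It is a sketch only: the report does not state
whether the population or the sample standard deviation was used, so the population
form (statistics.pstdev) is an assumption, and the load values are made up.

```python
from statistics import pstdev


def load_deviation(messages_per_peer):
    """Population standard deviation of the number of messages handled per peer."""
    return pstdev(messages_per_peer)


# Two hypothetical networks that handle the same total of 40 messages.
balanced = [10, 10, 10, 10]   # every peer handles the same share
skewed = [0, 0, 0, 40]        # one peer handles everything

print(load_deviation(balanced))  # 0.0
print(load_deviation(skewed))    # larger value, i.e. less balanced load
```

A lower value means the messages are spread more evenly over the peers, which is the
sense in which the standard deviation is read here.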
Figure 14. Standard deviation of number of messages for finishing one query as Fmin
increases
(x-axis: Fmin; y-axis: standard deviation in load)

Figure 15 shows the actual load for peers in the network. There are 15 slots on
the x-axis, each representing a range for the number of messages that a certain number
of peers have received for finishing 10 queries. Each slot potentially has 9 bars
indicating the load for the different Fmin values. For example, to see the load
distribution for Fmin=0, one looks at the first bar in every slot. As shown in the
figure, around 80 peers in the network receive 10 messages or fewer, around 7 peers
receive more than 10 but at most 20 messages, and so on. A general trend can be seen:
as Fmin increases, more and more peers in the network handle more messages. When
Fmin=0, 81 out of 200 peers handle fewer than 10 messages, and no peer handles more
than 5120 messages. When Fmin reaches 8, only 14 peers in the network handle fewer
than 10 messages, and 47 out of 200 peers handle more than 5120 messages in total. The
load increases with Fmin; however, it is roughly uniformly distributed across the
network.
Figure 15. Load distribution for finishing 10 queries with different Fmins
5.2.2. Distributed VS Sequential:
The most prominent advantage of the P2P distance join algorithm over the
traditional distance join algorithm is that it contacts the relevant peers in parallel
rather than sequentially, which enables it to exploit the parallelism of the P2P
network. Figure 16 compares the experimental results of the parallel algorithm and the
sequential algorithm. As shown, the parallel algorithm gives a steady curve: its
average response time is not affected significantly by increasing Fmin. The sequential
one fluctuates severely, because the elements in the priority queue are handled one by
one. Besides, different Fmins cause uncertainty in how spatial objects are distributed
when they are partitioned by the distributed quadtree, which leads to uncertainty in
the average response time. Unsurprisingly, the parallel algorithm performs much better
than the sequential one in terms of response time.
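The intuition behind this gap can be illustrated with a deliberately simplified timing
model. It is not the J-Sim simulation used for the experiments: it assumes that one
iteration costs only the round-trip delays to the contacted peers, ignoring bandwidth,
processing and queueing, and the function names and the example round trips are
hypothetical (the one-way delays are borrowed from Table 4).

```python
def sequential_time(round_trips_ms):
    """Contacting peers one by one costs the sum of all round trips."""
    return sum(round_trips_ms)


def parallel_time(round_trips_ms):
    """Contacting peers at once costs only the slowest round trip."""
    return max(round_trips_ms, default=0)


# Hypothetical iteration: one stub-to-stub contact (40 ms one way) and two
# contacts that cross a transit link (200 ms one way).
round_trips = [2 * 40, 2 * 200, 2 * 200]

print(sequential_time(round_trips))  # 880
print(parallel_time(round_trips))    # 400
```

Under this model the sequential cost grows with the number of contacted pairs, while
the parallel cost stays bounded by the slowest single contact.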
The next several experiments examine how well the P2P distance join algorithm
scales with respect to the number of peers, the number of simultaneous queries and the
number of spatial objects.
Figure 16. Average response time per query for P2P distance join algorithm in
comparison to centralized sequential algorithm
(x-axis: Fmin; y-axis: average response time in seconds, rounded to whole seconds)
Fmin:                      0    1    2    3    4    5    6    7    8
P2P (parallel) algorithm:  24   25   24   24   23   25   27   29   31
Sequential algorithm:      404  447  425  493  449  197  171  277  269

5.2.3. Different Number of Peers:
The first experiment examines how the algorithm scales with an increasing number
of peers in the network. Fmin is set to 2; Fmax is set to 9; there are 200 spatial
objects in the region; the number of simultaneous queries is set to 10; and a query
counts as finished once the first 10 closest pairs are found. The result is shown in
Figure 17.
As shown in the figure, as the number of peers increases the average response
time remains roughly steady, although there is a tiny increase in average response
time. This is because, as there are more peers in the network, the 200 spatial objects
are located at more peers; therefore, more hops in the Chord network are needed for a
query to finish.

Figure 17. Average response time per query as number of peers increases
(x-axis: number of peers in the network; y-axis: average response time in seconds)
Number of peers:            200     400     600     800     1000
Average response time (s):  21.007  22.796  25.014  25.538  27.498
5.2.4. Different Number of Simultaneous Queries:
The second scalability experiment examines how well the algorithm scales as the
number of simultaneous queries increases. Again, Fmin is set to 2; Fmax is set to 9;
there are 200 spatial objects in the 2D space; the number of peers in the network is
set to 200; and a query counts as finished once the first 10 closest pairs are found.
The result is shown in Figure 18. In the figure, there is a drop at the beginning. One
possible reason for this drop in average response time is that most of the queries are
forwarded to the same peers that previously forwarded the same messages. However, the
rest of the curve remains steady.
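Figure 18. Average response time per query as number of queries increases
(x-axis: number of simultaneous queries; y-axis: average response time)
Number of simultaneous queries:  5       10      20      40      80
Average response time:           26.153  24.365  24.703  24.966  24.928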
5.2.5. Different Number of Spatial Objects
The last experiment examines how well the algorithm performs with an increasing
number of spatial objects. With a fixed number of peers in the network, as more and
more spatial objects are inserted into the network, the number of spatial objects
allocated to each single peer must increase, which reduces the number of hops over
which a query needs to be forwarded in the Chord network to fetch the needed spatial
objects before the first 10 closest pairs are returned. In this experiment, Fmin is
set to 2; Fmax is set to 9; the number of peers in the network is set to 200; the
number of simultaneous queries is set to 10; and a query counts as finished once the
first 10 closest pairs are found.
Figure 19. Average response time per query as number of objects increases
(x-axis: number of spatial objects; y-axis: average response time in seconds)
Number of spatial objects:  200     400     600     800     1000
Average response time (s):  25.497  22.717  23.555  21.813  21.404
Figure 19 shows the result. As expected, as the number of spatial objects
increases, the average response time shows a generally decreasing trend, apart from a
sudden increase when the number of objects is set to 600, which may be due to the
randomness in the distribution of spatial objects among the machines in the P2P
network.
Although the average response time decreases as more and more spatial objects
are inserted into the network, the number of messages generated for finishing one
query increases (shown in Figure 20). The reason is intuitive. As more spatial objects
are inserted, more quadtree blocks (control points) need to be inserted into the
network, including both the quadtree blocks (control points) that contain spatial
objects and those whose children contain spatial objects. Therefore, either the
distributed quadtree becomes fuller or the height of the quadtree increases. In either
case, more messages are needed to finish one query.
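Figure 20. Average messages per query as number of objects increases
(x-axis: number of spatial objects; y-axis: average number of messages per request)
Number of spatial objects:  200    400     600     800     1000
Messages per request:       7,480  16,471  27,648  32,512  45,652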
6. Conclusion and Future Work
The P2P paradigm is a clear trend in today's network development. More and more
people use applications that employ P2P technology. However, complex queries on
spatial data over P2P networks can be difficult to achieve. The P2P distance join
algorithm examined in this report exploits the advantages of P2P networks. In this
project, I studied the unpublished P2P distance join algorithm in depth and
implemented it, together with 2 other algorithms, range query and nearest neighbour
query. Finally, several experiments were conducted to examine different aspects of the
P2P distance join algorithm. The results of the experiments show that the distance
join algorithm works well in a 2D environment with respect to average response time.
The variable Fmin proposed in the original paper [8] is very important to this
algorithm. Finding an appropriate Fmin, such that a single point of failure is
unlikely while the number of messages generated for finishing a single query is not
overwhelming, is not a trivial task. However, Fmin and Fmax do give a lot of
flexibility to the applications built on top of them.
The P2P distance join algorithm implemented for the experiments always starts the
query from the root control points of the 2 data sets, which causes communication
overheads from passing the query down from level 0 to level Fmin in the distributed
quadtree. This problem can be solved by allowing the query to start from the Fmin
level rather than level 0. In real life applications, other query criteria can be
applied, such as giving a query range within which to find the closest pair, or
allowing the users to specify two particular types of data sets that are of interest
to them.
References
[1]. Front Page of Business Link. Business Link Web Site. [Online]
http://www.businesslink.gov.uk.
[2]. Wilson, Jim. Front Page of National Aeronautics and Space Administration. NASA
Official Web Site. [Online] http://www.nasa.gov.
[3]. Front Page of National Institutes of Health. Official Web Site of National
Institutes of Health. [Online] http://www.nih.gov.
[4]. Front Page of National Geospatial Intelligence Agency. Official Web Site of
National Geospatial Intelligence Agency. [Online] http://www.nga.mil.
[5]. Front Page of National Institute of Justice. Official Web Site of National
Institute of Justice. [Online] http://www.ojp.usdoj.gov/nij.
[6]. Egemen Tanin and Deepa Nayar. An Efficient Distributed Distance Join
Algorithm for Peer-to-Peer Networks.
[7]. Raphael Finkel and J.L. Bentley. Quad Trees: A Data Structure for Retrieval on
Composite Keys. Acta Informatica 4(1): 1-9.
[8]. E. Tanin, A. Harwood, H. Samet, D. Nayar, and S. Nutanong. Building and
querying a P2P virtual world, Geoinformatica, 2006, 10(1):91-116.
[9]. G.R. Hjaltason and H. Samet. Index-Driven Similarity Search in Metric Spaces,
ACM Trans. on Database Systems, Dec 2003, Vol. 28, No. 4, pp. 517-580.
[10]. G.R. Hjaltason and H. Samet. Incremental Distance Join Algorithms for Spatial
Databases, Proc. of the ACM SIGMOD Conference, Seattle, WA, 1998, pp.
237-248.
[11]. E. Tanin, A. Harwood and H. Samet. A distributed quadtree index for
peer-to-peer settings, in Proceedings of the IEEE International Conference on
Data Engineering, Tokyo, Japan, April 2005, pp. 254-255.
[12]. Gershon Kedem. The Quad-CIF Tree: A Data Structure for Hierarchical On-Line
Algorithms, University of Rochester, Rochester, New York 14627.
[13]. Raphael Finkel and J.L. Bentley. Quad Trees: A Data Structure for Retrieval on
Composite Keys, Acta Informatica 4(1): 1-9.
[14]. Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek and Hari
Balakrishnan. A scalable peer-to-peer lookup service for Internet applications,
in Proceedings of the ACM SIGCOMM '01, San Diego, CA, August 2001, pp.
149-160.
[15]. Secure Hash Standard, FIPS PUB 180, by US government standards agency
NIST (National Institute of Standards and Technology).
[16]. Zegura EW, Calvert KL and Donahoo MJ. A quantitative comparison of
graph-based models for Internet topology. IEEE/ACM Trans. on Networking,
1997, 5(6):770-783.
[17]. Looking Glass and Network Information. Rogers Communications Inc. [Online]
https://supernoc.rogerstelecom.net/ops/.
[18]. G.K. Zipf. Human Behavior and the Principle of Least-Effort,
Addison-Wesley, MA, 1965.