



Scalable Sweeping-Based Spatial Join

Lars Arge Octavian Procopiuc Sridhar Ramaswamy

Torsten Suel Jeffrey Scott Vitter

Abstract

In this paper, we consider the filter step of the spatial join problem, for the case where neither of the inputs is indexed. We present a new algorithm, Scalable Sweeping-Based Spatial Join (SSSJ), that achieves both efficiency on real-life data and robustness against highly skewed and worst-case data sets. The algorithm combines a method with theoretically optimal bounds on I/O transfers, based on the recently proposed distribution-sweeping technique, with a highly optimized implementation of internal-memory plane-sweeping. We present experimental results based on an efficient implementation of the SSSJ algorithm, and compare it to the state-of-the-art Partition-Based Spatial-Merge (PBSM) algorithm of Patel and DeWitt.

Center for Geometric Computing, Department of Computer Science, Duke University, Durham, NC 27708–0129. Supported in part by U.S. Army Research Office grant DAAH04–96–1–0013. Email: [email protected].

Center for Geometric Computing, Department of Computer Science, Duke University, Durham, NC 27708–0129. Supported in part by the U.S. Army Research Office under grant DAAH04–96–1–0013 and by the National Science Foundation under grant CCR–9522047. Email: [email protected].

Information Sciences Research Center, Bell Laboratories, 600 Mountain Avenue, Box 636, Murray Hill, NJ 07974–0636. Email: [email protected].

Information Sciences Research Center, Bell Laboratories, 600 Mountain Avenue, Box 636, Murray Hill, NJ 07974–0636. Email: [email protected].

Center for Geometric Computing, Department of Computer Science, Duke University, Durham, NC 27708–0129. Supported in part by the U.S. Army Research Office under grant DAAH04–96–1–0013 and by the National Science Foundation under grant CCR–9522047. Part of this work was done while visiting Bell Laboratories, Murray Hill, NJ. Email: [email protected].

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 24th VLDB Conference, New York, USA, 1998

1 Introduction and Motivation

Geographic Information Systems (GIS) have generated enormous interest in the commercial and research database communities over the last decade. Several commercial products that manage spatial data are available. These include ESRI's ARC/INFO [ARC93], InterGraph's MGE [Int97], and Informix [Ube94]. GISs typically store and manage spatial data such as points, lines, poly-lines, polygons, and surfaces. Since the amount of data they manage is quite large, GISs are often disk-based systems.

An extremely important problem on spatial data is the spatial join, where two spatial relations are combined together based on some spatial criteria. A typical use for the spatial join is the map overlay operation that combines two maps of different types of objects. For example, the query "find all forests in the United States that receive more than 20 inches of average rainfall per year" can be answered by combining the information in a land-cover map with the information in a rainfall map.

Spatial objects can be quite large to represent. For example, representing a lake with an island in its middle as a non-convex polygon may require many hundreds, if not thousands, of vertices. Since manipulating such large objects can be cumbersome, it is customary in spatial database systems to approximate spatial objects and manipulate the approximations as much as possible. One technique is to bound each spatial object by the smallest axis-parallel rectangle that completely contains it. This rectangle is referred to as the spatial object's minimum bounding rectangle (MBR). Spatial operations can then be performed in two steps [Ore90]:

Filter Step: The spatial operation is performed on the approximate representation, such as the MBR. For example, when joining two spatial relations, the first step is to identify all intersecting pairs of MBRs. When answering a window query, all MBRs that intersect the query window are retrieved.

Refinement Step: The MBRs retrieved in the filter step are validated with the actual spatial objects. In a spatial join, the objects corresponding to each intersecting MBR pair produced by the filter step are checked to see whether they actually intersect.
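
The MBR-based filter test is easy to state concretely. The following minimal sketch (all names are ours, not from the paper) shows the axis-wise intersection test for axis-parallel rectangles and a naive filter step over two sets of MBRs; the join algorithms discussed in this paper exist precisely to avoid this quadratic nested loop on large disk-resident inputs.

```python
# Illustrative sketch of the filter step on MBRs; names are ours.
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xlo, ylo, xhi, yhi)

def mbr_intersect(a: Rect, b: Rect) -> bool:
    """Two axis-parallel rectangles intersect iff they overlap on both axes."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def filter_step(R: List[Rect], S: List[Rect]) -> List[Tuple[int, int]]:
    """Naive filter step: report index pairs of all intersecting MBRs."""
    return [(i, j) for i, r in enumerate(R) for j, s in enumerate(S)
            if mbr_intersect(r, s)]
```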

The filter step of the spatial join has been studied extensively by a number of researchers. In the case where


spatial indices have been built on both relations, these indices are commonly used in the implementation of the spatial join. In this paper, we focus on the case in which neither of the inputs to the join is indexed. As discussed in [PD96], such cases arise when the relations to be joined are intermediate results, and in a parallel database environment where inputs are coming in from multiple processors.

1.1 Summary of this Paper

We present a new algorithm for the filter step called Scalable Sweeping-Based Spatial Join (SSSJ). The algorithm uses several techniques for I/O-efficient computing recently proposed in computational geometry [APR+98, GTVV93, Arg95, AVV98, Arg97], plus the well-known internal-memory plane-sweeping technique (see, e.g., [PS85]). It achieves theoretically optimal worst-case bounds on both internal computation time and I/O transfers, while also being efficient on the more well-behaved data sets common in practice. We present experimental results based on an efficient implementation of the SSSJ algorithm, and compare it to the original as well as an optimized version of the state-of-the-art Partition-Based Spatial-Merge (PBSM) algorithm of Patel and DeWitt [PD96].

The basic idea in our new algorithm is the following: An initial sorting step is performed along the vertical axis, after which the distribution-sweeping technique is used to partition the input into a number of vertical strips, such that the data in each strip can be efficiently processed by an internal-memory plane-sweeping algorithm. This basic idea of partitioning space such that the data in each partition fits in memory, and then solving the problem in each partition in internal memory, has been used in many algorithms, including, e.g., the PBSM algorithm [PD96].

However, unlike most previous algorithms, our algorithm only partitions the data along one axis. Another important property of our algorithm is its theoretical optimality, which is based on the fact that, unlike in most other algorithms, no data replication occurs.

A key observation that allowed us to greatly improve the practical performance of our algorithm is that in plane-sweeping algorithms not all input data is needed in main memory at the same time. Typically only the so-called sweepline structure needs to fit in memory, that is, the data structure that contains the objects that intersect the current sweepline of the plane-sweeping algorithm. During our initial experiments, we observed that on all our realistic data sets of size N, this sweepline structure never grew beyond a small multiple of √N.

This observation, which is known as the square-root rule in the VLSI literature (see, e.g., [GS87]), seems to have been largely overlooked in the spatial join literature (with the exception of [BG92]). It implies that for most real-life data sets, we can bypass the vertical partitioning step in our algorithm and directly perform the

with the exception of the work in [GS87]

plane-sweeping algorithm after the initial sorting step. The result is a conceptually very simple algorithm, which from an I/O perspective just consists of an external sort followed by a single scan through the data. We implemented our algorithm such that partitioning is only done when the sweepline structure does not fit in memory. This assures that it is not only extremely efficient on real-life data, but also offers guaranteed worst-case bounds and predictable behavior on highly skewed and worst-case input data.

The overall performance of SSSJ depends on the efficiency of the internal plane-sweeping algorithm that is employed. The same is of course also true for PBSM and other spatial join algorithms that use plane-sweeping at the lower level. This motivated us to perform experiments with a number of different techniques for performing the plane-sweep. By using an efficient partitioning heuristic, we were able to decrease the time spent on performing the internal plane-sweep in PBSM by a factor of as compared to the original implementation of Patel and DeWitt [PD96].

The data we used is standard benchmark data for the spatial join, namely the Tiger/Line data from the US Bureau of Census [Tig92]. Our experiments showed that SSSJ performs at least better than the original PBSM. On the other hand, the improved version of PBSM actually performed about better than SSSJ on the real-life data we used; we believe that this is due to an inefficiency in our implementation of SSSJ, as explained in Section 7. To illustrate the effect of data skew on PBSM and SSSJ, we also ran experiments on synthetic data sets. Here, we observe that SSSJ scales smoothly with data size, while both versions of PBSM are very adversely affected by skew.

Thus, our conclusion is that if we can assume that the data is well-behaved, then a simple sort followed by an optimized plane-sweep provides a solution that is very easy to implement, particularly if we can leverage an existing optimized sort procedure, and that achieves a performance that is at least competitive with that of previous, usually more complicated, spatial join algorithms. If, on the other hand, the data cannot be assumed to be well-behaved, then it appears that PBSM, as well as many other proposed spatial-join algorithms, may not fare much better on such data than the simple plane-sweeping solution, as they are also susceptible to skew in the data. In this case, an algorithm with guaranteed bounds on the worst-case running time, such as SSSJ, appears to be a better choice.

The remainder of this paper is organized as follows. In Section 2 we discuss related research in more detail. In Section 3, we describe a worst-case optimal solution to the spatial-join problem. Section 4 discusses the square-root rule and its implications for plane-sweeping algorithms. Section 5 presents the SSSJ algorithm. In Section 6 we describe and compare different implementations of the internal-memory plane-sweeping algorithm. Section 7 presents our main experimental results comparing SSSJ with PBSM. Finally, Section 8 offers some concluding remarks.

We believe that SSSJ should also scale well with the dimension, though this remains to be demonstrated experimentally. We would expect the sweepline structure to grow with the dimension, which can exceed the size of main memory even for reasonable input sizes.

Most previous papers seem to limit their experiments to well-behaved data.

2 Related Research

Recall that the spatial join is usually solved in two steps: a filter step followed by a refinement step. These two steps are largely independent in terms of their implementation, and most papers in the literature concentrate on only the filter step. In this section, we discuss the various approaches that have been proposed to solve the filter step. (For the rest of the paper, we will use the term "spatial join" to refer to the filter step of the spatial join unless explicitly stated otherwise.)

An early algorithm proposed by Orenstein [Ore86, OM88, Ore89] uses a space-filling curve, called the Peano curve or z-ordering, to associate each rectangle with a set of small blocks, called pixels, on that curve, and then performs a sort-merge join along the curve. The performance of the resulting algorithm is sensitive to the size of the pixels chosen, in that smaller pixels lead to better filtering, but also increase the number of pixels associated with each object.

In another transformational approach [BHF93], the MBRs of spatial objects (which are rectangles in two dimensions) are transformed into points in four dimensions. The resulting points are stored in a multi-attribute data structure such as the grid file [NHS84], which is then used for the filter step.

Rotem [Rot91] proposes a spatial join algorithm based on the join index of Valduriez [Val87]. The join index used in [Rot91] partially computes the result of the spatial join using a grid file.

There has recently been much interest in using spatial index structures like the R-tree [Gut85], R+-tree [SRF87], R*-tree [BKSS90], and PMR quadtree [Sam89] to speed up the filter step of the spatial join. Brinkhoff, Kriegel, and Seeger [BKS93] propose a spatial join algorithm based on R*-trees. Their algorithm is a carefully synchronized depth-first traversal of the two trees to be joined. An improvement of this algorithm was recently reported in [HJR97]. (Another interesting technique for efficiently traversing a multi-dimensional index structure was proposed in [KHT89] in a slightly different context.) Günther [Gun93] studies the tradeoffs between using join indices and spatial indices for the spatial join. Hoel and Samet [HS92] propose the use of PMR quadtrees for the spatial join and compare it against members of the R-tree family.

Lo and Ravishankar [LR94] discuss the case where exactly one of the relations does not have an index. They construct an index for that relation on the fly, by using the index on the other relation as a starting point (the seed). Once the index is constructed, the tree join algorithm of [BKS93] is used to perform the actual join.

The other major direction for research on the spatial join has focused on the case where neither of the input relations has an index. Lo and Ravishankar [LR95] propose to first build indices for the relations on the fly using spatial sampling techniques and then use the tree join algorithm of [BKS93] for computing the join. Another recent paper [KS97] proposes an algorithm based on a filter tree structure.

Patel and DeWitt [PD96] and Lo and Ravishankar [LR96] both propose hash-based spatial join algorithms that use a spatial partitioning function to subdivide the input, such that each partition fits entirely in memory. Patel and DeWitt then use a plane-sweeping algorithm proposed in [BKS93] to perform the join within each partition, while Lo and Ravishankar use an indexed nested loop join.

Güting and Schilling [GS87] give an interesting discussion of plane-sweeping for computing rectangle intersections, and point out the importance of the "square-root rule" for this problem. They are also the first to consider the effect of I/O from an analytical standpoint; the algorithmic bounds they obtain are slightly suboptimal in the number of I/O operations. Their algorithm was subsequently implemented in the Gral system [BG92], an extensible database system for geometric applications, but we are not aware of any experimental comparison with other approaches.

The work in [GS87, BG92] is probably the previous contribution most closely related to our approach. In particular, the algorithm that is proposed is similar to ours in that it partitions the input along a single axis. However, at each level the input is partitioned into only two strips, as opposed to the larger number of strips used in our algorithm, resulting in an additional logarithmic factor in the running time.

3 An Optimal Spatial Join Algorithm

In this section, we describe a spatial join algorithm that is worst-case optimal in terms of the number of I/O transfers. This algorithm will be used as a building block in the SSSJ algorithm described in the next section. We point out that this section is based on the results and theoretical framework developed in [APR+98]. The algorithm uses the distribution-sweeping technique developed in [GTVV93] and further developed in [Arg95, AVV98].

Following Aggarwal and Vitter [AV88] we use the following I/O-model: We make the assumption that each access to disk transmits one disk block with B units of data, and we count this as one I/O operation. We denote the total amount of main memory by M. We assume that we are given two sets R and S of rectangles, with K = |R| and L = |S|. For convenience, we define N = K + L. We use T to denote the number of pairs of intersecting rectangles reported by the algorithm. We use lower-case notation to denote the size of the corresponding upper-case quantities when measured in terms of the number of disk blocks: n = N/B, k = K/B, l = L/B, t = T/B. We define m = M/B. The efficiency of our algorithms is measured in terms of the number of I/O operations that are performed. All bounds reported in this section are provable worst-case bounds.

In practice there is, of course, a large difference between the performance of random and sequential I/O. The correct way to interpret the theoretical results of this section in a practical context is to assume that the disk block size is large enough to mask the difference between random and sequential I/O.

We solve the two-dimensional join problem in two steps. We first solve the simpler one-dimensional join problem of reporting all intersections between two sets of intervals. We then use this as a building block in the two-dimensional join, which is based on the distribution-sweeping technique.

3.1 One-dimensional Case: Interval Join

In the one-dimensional join problem, each interval in R or S is defined by a lower boundary and an upper boundary. The problem is to report all intersections between an interval in R and an interval in S. We assume that at the beginning of the algorithm, R and S have already been sorted into one list of N intervals by their lower boundaries, which can be done in O(n log_m n) I/O operations using, say, the optimal sorting algorithm from [AV88]. We will show that the following algorithm then completes the interval join in O(n + t) I/O operations.

Algorithm Interval Join:

(1) Scan the list in order of increasing lower boundaries, maintaining two initially empty lists A_R and A_S of "active" intervals from R and S. More precisely, for every interval x in the list, do the following:

(a) If x is from R, then add x to A_R, and scan through the entire list A_S. If an interval s in A_S intersects x, output the intersection and keep s in A_S; otherwise delete s from A_S.

(b) If x is from S, then add x to A_S, and scan through the entire list A_R. If an interval r in A_R intersects x, output the intersection and keep r in A_R; otherwise delete r from A_R.

In order to see that this algorithm correctly outputs all intersections exactly once, we observe that pairs of intervals r from R and s from S that intersect can be classified into two cases: (i) r begins before s, and (ii) s begins before r. (Coincident intervals are easily handled using a tie-breaking strategy.) Step (1)(a) reports all intersections of an interval from R with currently "active" intervals from S, thus handling case (ii), while step (1)(b) similarly handles case (i).

In order to establish the bound of O(n + t) I/O operations for the algorithm, we need to show that the lists A_R and A_S can be efficiently maintained in external memory. As it turns out, it suffices if we keep only a single block of each list in main memory. To add an interval to a list, we add it to this block, and write the block out to disk whenever it becomes full. To scan the list for intersections, we just read the entire list and write out again all intervals that are not deleted. We will refer to this simple implementation of a list as an I/O-list.
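
The I/O-list just described can be sketched as follows, with the "disk" simulated by an in-memory list of blocks; the class and method names are ours, and a real implementation would write blocks to a file instead.

```python
# Sketch of an I/O-list: one in-memory block of B items, flushed to a
# simulated disk when full; a scan reads everything and rewrites the
# survivors in full blocks. Names are illustrative, not from the paper.
class IOList:
    def __init__(self, B: int):
        self.B = B
        self.block = []      # the single in-memory block
        self.disk = []       # flushed blocks (stand-in for disk)
        self.ios = 0         # count of block transfers, for illustration

    def add(self, item):
        self.block.append(item)
        if len(self.block) == self.B:
            self.disk.append(self.block)   # one block write
            self.ios += 1
            self.block = []

    def scan(self, keep):
        """Read every item; keep those for which keep(item) is true,
        writing the survivors back out in full blocks."""
        items = [x for blk in self.disk for x in blk] + self.block
        self.ios += len(self.disk)          # block reads
        self.disk, self.block = [], []
        for x in items:
            if keep(x):
                self.add(x)
```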

To see that this scheme satisfies the claimed bound, note that each interval is added to an I/O-list only once, and that in each subsequent scan of an I/O-list, the interval is either permanently removed, or it produces an intersection with the interval that initiated the scan. Since all output is done in complete blocks, this results in at most O(n + t) reads and writes to maintain the lists A_R and A_S, plus another O(t) writes to output the result, for a total of O(n + t) I/O operations in the algorithm. Thus, we have the following lemma.

Lemma 3.1 Given a list of N intervals from R and S sorted by their lower boundaries, Algorithm Interval Join computes all intersections between intervals from R and S using O(n + t) I/O transfers.
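
The scan of Algorithm Interval Join can be sketched in a few lines. This in-memory version (our own simplification) represents the pre-sorted input as one list of (lo, hi, which) triples, where which tags the originating set, and keeps the active lists as plain Python lists rather than I/O-lists:

```python
# In-memory sketch of Algorithm Interval Join. Input: intervals from both
# sets merged into one list, pre-sorted by lower boundary.
def interval_join(L):
    active = {'R': [], 'S': []}
    out = []
    for lo, hi, which in L:                   # increasing lower boundaries
        other = 'S' if which == 'R' else 'R'
        active[which].append((lo, hi))
        survivors = []
        for (olo, ohi) in active[other]:
            if olo <= hi and lo <= ohi:       # the intervals intersect
                out.append(((lo, hi), (olo, ohi)) if which == 'R'
                           else ((olo, ohi), (lo, hi)))
                survivors.append((olo, ohi))  # keep: still active
            # else: deleted -- since the scan is in order of lo, a
            # non-intersecting active interval ends before lo and can
            # never intersect a later interval.
        active[other] = survivors
    return out
```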

3.2 The Two-dimensional Case: Rectangle Join

Recall that in the two-dimensional join problem, each rectangle in R or S is defined by a lower boundary and an upper boundary in the x-axis, and by a lower boundary and an upper boundary in the y-axis. The problem is to report all intersections between a rectangle in R and a rectangle in S. We will show that the following algorithm performs O(n log_m n + t) I/O operations, and thus asymptotically matches the lower bound implied by the sorting lower bound of [AV88] (see also [AM]). It can be shown that the algorithm is also optimal in terms of CPU time. We again assume that at the beginning of the algorithm, R and S have already been sorted into one list of N rectangles by their lower boundaries in the y-axis, which can be done in O(n log_m n) I/O operations.

Algorithm Rectangle Join:

(1) Partition the two-dimensional space into k vertical strips (not necessarily of equal width) such that at most O(N/k) rectangles start or end in any strip, for some k to be chosen later (see Figure 1).

(2) A rectangle is called small if it is contained in a single strip, and large otherwise. Now partition each large rectangle into exactly three pieces: two end pieces in the first and last strip that the rectangle intersects, and one center piece in between the end pieces. We then solve the problem in the following two steps:

(a) First compute all intersections between a center piece from R and a center piece from S, and all intersections between a center piece from R and a small rectangle from S, or a center piece from S and a small rectangle from R.

Page 5: Scalable Sweeping-BasedSpatial Joinhuangyan/6330/Papers/arge98scalable.pdf · nal PBSM. On the other hand, the improved version of PBSM actually performed about better than SSSJ on

(b) In each strip, recursively compute all intersections between an end piece or small rectangle from R and an end piece or small rectangle from S.
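
The small/large classification and the three-way split of Step (2) can be sketched as follows. This is our own illustration (names and the exact boundary conventions are ours, not the paper's); given sorted strip boundaries, a rectangle lying within one strip is returned unchanged, while a large one is cut into two end pieces and a (possibly empty) center piece aligned on strip boundaries.

```python
# Sketch of Step (2): classify a rectangle against sorted strip
# boundaries b[0..k-2] and split large rectangles into three pieces.
import bisect

def classify(rect, bounds):
    xlo, ylo, xhi, yhi = rect
    i = bisect.bisect_right(bounds, xlo)   # strip holding the left edge
    j = bisect.bisect_left(bounds, xhi)    # strip holding the right edge
    if i == j:
        return ('small', i, rect)          # contained in strip i
    left   = (xlo, ylo, bounds[i], yhi)        # end piece in first strip
    right  = (bounds[j - 1], ylo, xhi, yhi)    # end piece in last strip
    center = (bounds[i], ylo, bounds[j - 1], yhi)  # empty if strips adjacent
    return ('large', (i, j), (left, center, right))
```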

Figure 1: An example of the partitioning used by the two-dimensional join algorithm. Here there are four strips, and each strip has no more than three rectangles that will be handled further down in the recursion. (Such rectangles are shown as shaded boxes.) The I/O-lists that rectangles will get added to are shown for some of the rectangles.

We can compute the boundaries of the strips in Step (1) by sorting the x-coordinates of the end points, and then scanning the sorted list. In this case, we have to be careful to split the sorted list into several smaller sorted lists as we recurse, since we cannot afford to sort again in each level of the recursion; the same is also the case for the input list itself. (In practice, the most efficient way to find the strip boundaries will be based on sampling.) The recursion in Step (2)(b) terminates when the entire subproblem fits into memory, at which point we can use an internal plane-sweeping algorithm to solve the problem. Note that the total number of input rectangles at each level of the recursion is at most 2N, since every interval that is partitioned can result in at most two end pieces.
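
The boundary computation of Step (1) can be sketched as below (our own illustration): sort the 2N x-endpoints and cut the sorted list into k runs of roughly equal length, so that on the order of 2N/k endpoints, and hence at most that many rectangles starting or ending, fall in each strip.

```python
# Sketch of Step (1): choose k-1 strip boundaries by sorting all 2N
# x-coordinates of rectangle endpoints and cutting the sorted list into
# k nearly equal runs. Rectangles are (xlo, ylo, xhi, yhi) tuples.
def strip_boundaries(rects, k):
    xs = sorted(x for (xlo, _, xhi, _) in rects for x in (xlo, xhi))
    n = len(xs)
    # boundary i is placed after roughly i*n/k endpoints
    return [xs[(i * n) // k] for i in range(1, k)]
```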

What remains is to describe the implementation of Step (2)(a). The problem of computing the intersections involving center pieces in (2)(a) is quite similar to the interval join problem in the previous section. In particular, any center piece can only begin and end at the strip boundaries. This means that a small rectangle contained in strip i intersects a center piece going through strip i if and only if their y-intervals intersect. Thus, we could compute the desired intersections by running k interval joins along the y-axis, one for each strip.

However, this direct solution does not guarantee the claimed bound, since a center piece spanning a large number of strips would have to participate in each of the corresponding interval joins. We solve this problem by performing all these interval joins in a single scan of the rectangle list. The key idea is that instead of using two I/O-lists per strip, we maintain a total of O(k²) I/O-lists L^R_{i,j} and L^S_{i,j} with 1 ≤ i ≤ j ≤ k (refer again to Figure 1). The algorithm for Step (2)(a) then proceeds as follows:

Algorithm Rectangle Join (continued):

(2a) Scan the rectangle list in order of increasing lower y-boundary. For every rectangle x in the list do the following:

(i) If x is from R and x is small and contained in strip i, then insert x into L^R_{i,i}. Perform a scan of every list L^S_{j,l} with j ≤ i ≤ l, computing intersections and deleting every element of the list that does not intersect with x. Also, write out x lazily to disk for use in the recursive subproblem in strip i.

(ii) If x is from R and x is large and its center piece consists of strips i, i+1, . . . , j, then insert x into L^R_{i,j}. Perform a scan of every list L^S_{a,b} with a ≤ j and b ≥ i, computing intersections and deleting every element of a list that does not intersect with x. Also, write out the end pieces of x lazily to disk for use in the appropriate recursive subproblems.

(iii) If x is from S and x is small, do (i) with the roles of R and S reversed.

(iv) If x is from S and x is large, do (ii) with the roles of R and S reversed.

Lemma 3.2 Algorithm Rectangle Join outputs all intersections between rectangles in R and rectangles in S correctly and only once.

Proof: (sketch) We first claim that if a rectangle r from R is large, then Step (2)(a) reports all the intersections between r's center piece and all rectangles in the other set. To see this, we classify the intersecting pairs in which r is large into two cases: (i) r starts before the rectangle from S, and (ii) the rectangle from S starts before r. (Equalities are easily handled using a tie-breaking strategy.) Steps (2)(a)(i) and (ii) clearly handle all intersections from case (ii), because the currently "active" intervals are stored in the various I/O-lists, and Steps (2)(a)(i) and (ii) intersect r with all of the lists that intersect with its center piece. Steps (2)(a)(iii) and (iv) similarly handle case (i). The case where the rectangle from S is large follows from symmetry.

In order to avoid reporting an intersection multiple times at different levels of the recursion, we keep track of intervals whose endpoints extend beyond the "current boundaries" of the recursion and store them in separate distinguished I/O-lists. (There are at most O(k) such lists.) By never comparing elements from distinguished lists of R and S, we avoid reporting duplicates.

To show a bound on the number of I/Os used by the algorithm, we observe that, as in the interval join algorithm from the previous section, each small rectangle and each center piece is inserted in a list exactly once. A rectangle in a list produces an intersection every time it is scanned, except for the last time, when it is deleted. This analysis requires that each of the roughly k² I/O-lists has exclusive use of at least one block of main memory, so the partitioning factor k of the distribution sweep should be chosen to be at most O(√m). Thus, the cost of Step (2)(a) is linear in the input size and the number of intersections produced in this step, and the total cost over the O(log_m n) levels of recursion is O(n log_m n + t).

Note that the sublists created by Step (2)(a) for use in the recursive computations inside the strips are in sorted order. Putting everything together, we have the following theorem.

Theorem 3.3 Given a list of N rectangles from R and S sorted by their lower boundaries in one axis, Algorithm Rectangle Join reports all intersections between rectangles from R and S using O(n log_m n + t) I/O transfers.

4 Plane Sweeping and the Square-Root Rule

As discussed in the introduction, the overall efficiency of many spatial join algorithms is greatly influenced by the plane-sweeping algorithm that is employed as a subroutine. Several efficient plane-sweeping algorithms for rectangle intersection have been proposed in the computational geometry literature. Of course, these algorithms were designed under the assumption that the data fits completely in main memory. In this section we will show that under certain realistic assumptions about the input data, many of the internal-memory algorithms can in fact be applied to input sets that are much larger than the available memory.

4.1 Plane Sweeping

The rectangle intersection problem can be solved in main memory by applying a technique called plane sweeping. Plane sweeping is one of the most basic algorithmic paradigms in computational geometry (see, e.g., [PS85]). Simply speaking, a plane-sweeping (or sweepline) algorithm attempts to solve a geometric problem by moving a vertical or horizontal sweepline across the scene, processing objects as they are reached by the sweepline. Clearly, for any pair of intersecting rectangles there is a horizontal line that passes through both rectangles. Hence, a plane-sweeping algorithm for rectangle intersection only has to find all intersections between rectangles located on the same sweepline, thus reducing the problem to a (dynamic) one-dimensional interval intersection problem.

A typical plane-sweeping algorithm for rectangle intersection uses a dynamic data structure that allows insertion, deletion, and intersection queries on intervals. Rectangles are inserted into the structure as they are reached by the sweepline, at which time a query for intersections with the new rectangle is performed, and are removed after the sweepline has passed over them. Many optimal and suboptimal dynamic data structures for intervals have been proposed; important examples are the interval tree [Ede83], the priority search tree [McC85], and the segment tree [Ben77].
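
This event-driven scheme can be sketched as follows. In this simplified illustration of ours, the sweepline moves upward over start/end events, and the "sweepline structure" is a plain set scanned linearly; an optimal implementation would replace it with one of the interval structures cited above.

```python
# Sketch of a plane-sweeping rectangle join: a rectangle is "active"
# while the horizontal sweepline is inside its y-range; on insertion it
# is queried against the active rectangles of the other set for
# x-overlap. Rectangles are (xlo, ylo, xhi, yhi) tuples.
def sweep_join(R, S):
    events = []
    for side, rects in (('R', R), ('S', S)):
        for r in rects:
            _, ylo, _, yhi = r
            events.append((ylo, 0, 'start', side, r))  # starts before ends
            events.append((yhi, 1, 'end', side, r))    # at equal y
    events.sort()
    active = {'R': set(), 'S': set()}
    out = []
    for _, _, kind, side, r in events:
        if kind == 'end':
            active[side].discard(r)
            continue
        other = 'S' if side == 'R' else 'R'
        for o in active[other]:               # x-interval intersection query
            if r[0] <= o[2] and o[0] <= r[2]:
                out.append((r, o) if side == 'R' else (o, r))
        active[side].add(r)
    return out
```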

4.2 The Square-Root Rule

In most implementations of plane-sweeping algorithms, the maximum amount of memory ever needed is determined by the maximum number of rectangles that are intersected by a single horizontal line. For most "real life" input data this number, which we will refer to as the maximum overlap of a data set, is actually significantly smaller than the total number of rectangles. This observation has previously been made by other researchers (see [GS87] and the references therein), and is known as the "square-root rule" in the VLSI literature. That is, for a data set of size N, the number of rectangles intersected by any horizontal or vertical line is typically O(√N), with a moderate multiplicative constant determined by the data set.

We consider the standard benchmark data for spatial join, namely the US Bureau of the Census TIGER/Line [Tig92] data. These data files consist of polygonal line entities representing physical features such as rivers, roads, railroad lines, etc. We used the data for the states of New Jersey, Rhode Island, Connecticut and New York. Table 1 shows the maximum number of rectangles that are intersected by any horizontal line, for the road and hydrographic features of the four data sets. The number of spatial objects in each data set is given along with the maximum number of the corresponding MBR rectangles that overlap with any horizontal line. Note that the maximum overlap for the TIGER data appears to be consistent with the square-root rule.

Thus, if a data set satisfies the square-root rule, then we can use plane sweeping to solve the spatial join problem on inputs that are much larger than the available amount of memory, as long as the data structure used by the plane sweep never grows beyond the size of the memory.

State          Road    Max. Overlap   Hydro    Max. Overlap
Rhode Island    68277           878     7012             54
Connecticut    188642           932    28775            145
New Jersey     414442          1600    50853            156
New York       870412          2649   156567            362

Table 1: Characteristics of road and hydrographic data from TIGER/Line data.

5 Scalable Sweeping-Based Spatial Join

We now describe the SSSJ algorithm, which is obtained by combining the theoretically optimal Rectangle Join algorithm presented in Section 3 with an efficient plane-sweeping technique. The resulting SSSJ performs an initial sort, and then directly attempts to use plane-sweeping to solve the join problem. The vertical partitioning step in Rectangle Join is only performed if the sweeping structure used by the plane-sweep grows beyond the size of the main memory.

Scalable Sweeping-Based Spatial Join:

    Sort the sets R and S on their lower y-coordinates
    Initiate an internal-memory plane-sweep
    if the plane-sweep runs out of main memory
        perform one level of partitioning using
        Rectangle Join and recursively call SSSJ
        on each subproblem

Figure 2: SSSJ algorithm for computing the join between two sets R and S of rectangles.

Alternatively, we could use random sampling to estimate the overlap of a data set with guaranteed confidence bounds, and then use this information to decide whether the input needs to be partitioned; the details are omitted due to space constraints. In our implementation we followed the slightly less efficient approach described above. After the initial sort we simply start the internal plane-sweeping algorithm, assuming that the sweepline structure will fit in memory. During the sweep we monitor the size of the structure, and if it reaches a predefined threshold we abort the sweep and call Rectangle Join.
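The control flow just described can be sketched as follows. This is our own simplification: `plane_sweep_join` and `rectangle_join_partition` are hypothetical stand-ins for the internal sweep and for one level of Rectangle Join's vertical partitioning, and the real system operates on external TPIE streams rather than Python lists.

```python
class MemoryLimitExceeded(Exception):
    """Raised by the sweep when the sweepline structure exceeds its threshold."""

def sssj(R, S, plane_sweep_join, rectangle_join_partition):
    """Top-level SSSJ control flow: sort by lower y-coordinate, attempt an
    internal-memory plane sweep, and fall back to one level of vertical
    partitioning plus recursion only if the sweep runs out of memory.
    Rectangles are (x_lo, x_hi, y_lo, y_hi) tuples."""
    R = sorted(R, key=lambda r: r[2])  # an external merge sort in the real system
    S = sorted(S, key=lambda s: s[2])
    try:
        # Optimistically assume the sweepline structure fits in memory; the
        # sweep raises MemoryLimitExceeded if it passes the threshold.
        return plane_sweep_join(R, S)
    except MemoryLimitExceeded:
        out = []
        for R_strip, S_strip in rectangle_join_partition(R, S):
            out.extend(sssj(R_strip, S_strip,
                            plane_sweep_join, rectangle_join_partition))
        return out
```

The important property is that the partitioning cost is paid only on inputs whose maximum overlap actually exceeds the memory threshold.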

Given the large main memory sizes of current workstations, we expect that in most cases, we can process data sets on the order of several hundred billion rectangles with the internal plane-sweeping algorithm. However, if the data is extremely large, or is highly skewed, then SSSJ will invoke the vertical partitioning at a moderate increase in running time.

In most cases, SSSJ will skip the vertical partitioning, and our spatial join algorithm is reduced to an initial external sorting step, followed by a scan over the sorted data (during which the internal plane-sweeping algorithm is run). We believe that this observation is conceptually important for two reasons. First, it provides a very simple and insightful view of the structure and I/O behavior of our spatial join algorithm. Second, it allows for a simple and fast implementation, by leveraging the performance of the highly tuned sorting routines offered by many database systems. There has been considerable work on optimized database sorts in recent years (see, e.g., [Aga96, DDC+97, NBC+94]), and it appears wise to try to draw on these results.

6 Fast Plane-Sweeping Methods

As mentioned already, the overall efficiency of many spatial join algorithms is greatly influenced by the internal-memory join algorithm used as a subroutine. In this section we describe and compare several internal-memory plane-sweeping algorithms. We present the algorithms in Subsection 6.1 and report the results of experiments with the TIGER/Line data set in Subsection 6.2.

We estimate that the TIGER/Line data for the entire United States will be no more than million rectangles.

6.1 Algorithms

Recall from Section 4.1 that the most common plane-sweeping algorithm for the rectangle intersection problem is based on an abstract data structure for storing and querying intervals. We implemented several versions of this data structure. In the following we first give a generic description of the plane-sweeping algorithm, and then describe the different data structure implementations. We also describe the plane-sweeping algorithm used in PBSM, which we also implemented for comparison. This algorithm was proposed in [BKS93], and does not use an interval data structure.

As before, assume that we have two sets R and S of rectangles, sorted in ascending order by their lower boundary in the y-axis. We want to find intersections between R and S by sweeping the plane with a horizontal sweepline. For a rectangle r = (x_lo, x_hi, y_lo, y_hi), we call [x_lo, x_hi] the interval of r, y_lo the starting time of r, and y_hi the expiration time of r. Let D be an instance of the generic data structure that supports the following operations on rectangles and their associated intervals:

(1) Insert(D, r) inserts a rectangle r into D.

(2) Delete(D, y) removes from D all rectangles with expiration time smaller than y.

(3) Query(D, r) reports all rectangles in D whose interval overlaps with that of r.

Figure 3 shows the pseudocode for the resulting algorithm Sweep Join Generic.

Algorithm Sweep Join Generic:

Head(R) and Head(S) denote the current first elements in the sorted lists R and S of rectangles, and D_R and D_S are two initially empty data structures.

    repeat until R and S are empty
        Let r = Head(R) and s = Head(S)
        if r.y_lo < s.y_lo
            Insert(D_R, r)
            Delete(D_S, r.y_lo)
            Query(D_S, r)
            Remove r from R
        else
            Insert(D_S, s)
            Delete(D_R, s.y_lo)
            Query(D_R, s)
            Remove s from S

Figure 3: Generic plane-sweeping algorithm for computing the join between two sets R and S of rectangles.
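The generic algorithm of Figure 3 can be made concrete with a plain Python list playing the role of each structure D. This is our own runnable sketch; the tuple layout (x_lo, x_hi, y_lo, y_hi) and the index-pair output format are our choices, not the paper's.

```python
# Runnable sketch of Sweep_Join_Generic with a list-based interval structure.
# Both inputs must be sorted by their starting time y_lo.

def sweep_join_generic(R, S):
    """Report all index pairs (i, j) such that R[i] intersects S[j]."""
    D_R, D_S = [], []          # active rectangles from R and from S
    out = []
    i = j = 0
    while i < len(R) or j < len(S):
        # Take the input whose head has the smaller starting time.
        take_r = j >= len(S) or (i < len(R) and R[i][2] < S[j][2])
        if take_r:
            r = R[i]
            D_R.append((i, r))
            # Delete: drop rectangles whose expiration time y_hi has passed.
            D_S[:] = [(k, s) for k, s in D_S if s[3] >= r[2]]
            # Query: report rectangles in D_S whose x-interval overlaps r's.
            out += [(i, k) for k, s in D_S if r[0] <= s[1] and s[0] <= r[1]]
            i += 1
        else:
            s = S[j]
            D_S.append((j, s))
            D_R[:] = [(k, r) for k, r in D_R if r[3] >= s[2]]
            out += [(k, j) for k, r in D_R if s[0] <= r[1] and r[0] <= s[1]]
            j += 1
    return out
```

Each intersecting pair is reported exactly once, namely when the later-starting rectangle of the pair is inserted.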

We implemented three different versions of D and obtained three different versions of the generic algorithm:


Tree Sweep, where D is implemented as an interval tree data structure; List Sweep, where D is implemented using a single linked list; and Striped Sweep, where D is implemented by partitioning the plane into vertical strips and using a separate linked list for each strip. Finally, we refer to the algorithm used in PBSM as Forward Sweep. In the following we discuss each of these algorithms.

Algorithm Tree Sweep uses a data structure that is essentially a combination of an interval tree [Ede83] and a skip list [Pug90]. More precisely, we used a simplified dynamic version of the interval tree similar to that described in Section 15.3 and Exercise 15.3-4 of [CLR90], but implemented the structure using a randomized skip list instead of a balanced tree structure. (Another, though somewhat different, structure combining interval trees and skip lists has been described in [Han91].) Our reason for using a skip list is that it allows for a fairly simple but efficient implementation while matching (in a probabilistic sense) the good worst-case behavior of a balanced tree. With this data structure, the expected time for an insertion can be shown to be O(log N), while query and deletion operations take time O((1 + k) log N), where k is the number of rectangles reported or deleted during the operation. The worst-case running time of this algorithm is thus O((N + K) log N), which is at most a logarithmic factor away from the optimal bound. However, on real-life data, the intersection query time is usually O(log N + k), as most intersecting rectangles are typically close to each other in the tree. Thus, we would not expect significant improvements from more complicated, but asymptotically optimal data structures [McC85, Ede83].

Algorithm List Sweep uses a simple linked-list data structure. To decrease allocation and other overheads and improve locality, each element of the linked list can hold a small fixed number of rectangles. Insertion is done in constant time, while query and deletion both take time linear in the number of elements in the list in the worst case. Thus, the worst-case running time of the algorithm is O(N^2) (O(N√N) if we assume that the square-root rule applies).

Algorithm Striped Sweep uses a data structure based on a simple partitioning heuristic. The basic idea is to divide the domain into a number of vertical strips of equal width and use one instance of the linked-list structure from List Sweep in each strip. Intervals are stored in each strip that they intersect. The key parameter in this data structure is the number of strips used. By using s strips, we hope to achieve an improvement of up to a factor of s in query and deletion time. However, if there are too many strips, many intervals will intersect more than one strip, thus increasing the size of the data structure and slowing down the operations. We experimented with a number of values for the number of strips to determine the optimum. Insertions can be done in constant time, while searching and deletion both can be linear in the worst case. However, we expect this algorithm to perform very well for real-life data sets that have many small rectangles and that are not extremely clustered in one area.
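The striped structure can be sketched as follows. This is our own simplified rendering: it assumes a normalized x-domain and uses Python lists in place of the linked lists of List Sweep.

```python
# Sketch of the striped interval structure behind Striped_Sweep (our own
# simplification): the x-domain [0, domain) is cut into num_strips equal
# strips, each with its own list, and an interval is replicated into every
# strip it intersects, so queries and deletions only touch relevant strips.

class StripedStructure:
    def __init__(self, num_strips, domain=1.0):
        self.num_strips = num_strips
        self.width = domain / num_strips
        self.strips = [[] for _ in range(num_strips)]

    def _strip_range(self, x_lo, x_hi):
        lo = max(0, min(self.num_strips - 1, int(x_lo / self.width)))
        hi = max(0, min(self.num_strips - 1, int(x_hi / self.width)))
        return range(lo, hi + 1)

    def insert(self, rect):
        # rect = (x_lo, x_hi, y_lo, y_hi); replicated into each strip it cuts.
        for i in self._strip_range(rect[0], rect[1]):
            self.strips[i].append(rect)

    def delete_expired(self, y):
        # Remove rectangles whose expiration time y_hi lies below the sweepline.
        for i in range(self.num_strips):
            self.strips[i] = [r for r in self.strips[i] if r[3] >= y]

    def query(self, rect):
        # Report each stored rectangle overlapping rect's x-interval once,
        # de-duplicating the entries replicated across strips.
        seen = set()
        for i in self._strip_range(rect[0], rect[1]):
            for r in self.strips[i]:
                if r not in seen and r[0] <= rect[1] and rect[0] <= r[1]:
                    seen.add(r)
        return list(seen)
```

The replication in `insert` is exactly the effect discussed above: more strips mean shorter lists per strip, but also more copies of wide intervals.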

Algorithm Forward Sweep is the plane-sweeping algorithm employed by Patel and DeWitt in PBSM, and was first proposed in [BKS93]. This algorithm is somewhat similar in structure to List Sweep, except that it does not use a linked-list data structure to store intervals encountered in the recent past, but scans forward in the sorted lists for intervals that will intersect the current interval in the future; see Figure 4 for the structure of the algorithm. The worst-case running time of this algorithm is O(N^2) (O(N√N) if we assume that the square-root rule applies).

Algorithm Forward Sweep:

Head(R) and Head(S) denote the current first elements in the sorted lists R and S of rectangles.

    repeat until R and S are empty
        Let r = Head(R) and s = Head(S)
        if r.y_lo < s.y_lo
            Remove r from R
            Scan S from the current position and
            report all rectangles s' that intersect r.
            Stop when the current rectangle s'
            satisfies s'.y_lo > r.y_hi.
        else
            Remove s from S
            Scan R from the current position and
            report all rectangles r' that intersect s.
            Stop when the current rectangle r'
            satisfies r'.y_lo > s.y_hi.

Figure 4: Algorithm Forward Sweep used by PBSM for computing the join between R and S.
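A runnable rendering of Figure 4, assuming rectangles are (x_lo, x_hi, y_lo, y_hi) tuples sorted by y_lo; the index-pair output format is our own choice.

```python
# Sketch of Forward_Sweep (the BKS93 / PBSM internal join): instead of
# maintaining an interval structure, each rectangle scans forward in the
# other (y-sorted) list until starting times exceed its expiration time.

def forward_sweep(R, S):
    """Report all index pairs (i, j) such that R[i] intersects S[j]."""
    out = []
    i = j = 0
    while i < len(R) and j < len(S):
        if R[i][2] < S[j][2]:
            r = R[i]
            # Scan forward in S; stop once s starts above r's expiration time.
            for k in range(j, len(S)):
                s = S[k]
                if s[2] > r[3]:
                    break
                if r[0] <= s[1] and s[0] <= r[1]:  # x-intervals overlap
                    out.append((i, k))
            i += 1
        else:
            s = S[j]
            for k in range(i, len(R)):
                r = R[k]
                if r[2] > s[3]:
                    break
                if s[0] <= r[1] and r[0] <= s[1]:
                    out.append((k, j))
            j += 1
    return out
```

Note how a tall rectangle forces a long forward scan: every rectangle starting before its expiration time is examined, which is the source of the worst-case behavior discussed in Section 7.4.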

6.2 Experimental Results

In order to compare the four algorithms we conducted experiments joining the road and hydrographic line features from the states of Rhode Island, Connecticut and New Jersey. The input data was already located in main memory at the start of each run. The experiments were conducted on a Sun SparcStation 20 with 32 megabytes of main memory (we were thus unable to fit the New York data into internal memory). To decrease the cost of deletion operations, we introduced a lazy factor l and only actually performed the deletion operation every l-th time it was called, with different values of l for Tree Sweep and List Sweep than for Striped Sweep. We varied the number of strips in Striped Sweep from 4 to 256.

Table 2 compares the running times of the four plane-sweeping algorithms (excluding the sorting times). We can clearly see that Striped Sweep outperforms the other algorithms by a factor of about 3 to 6. The algorithm achieves the best performance for 64 to 128 strips; beyond this point, the performance slowly degrades due to increased replication. (For the largest numbers of strips, the replication rate becomes substantial, most notably on the smallest data set.) Algorithms Forward Sweep and List Sweep are similar in performance, which is to be expected given their similar structure. Maybe a bit surprising is that List Sweep actually outperforms Forward Sweep slightly on the larger data sets, even though it has additional overheads associated with maintaining a data structure. Finally, Tree Sweep is slower than Forward Sweep and List Sweep on small (RI) and thin (NJ) sets, and faster on wide (CT) and large (NY) sets.

Algorithm              RI     CT     NJ
Tree Sweep           2.28   8.57  16.65
Forward Sweep        1.18  12.00  15.80
List Sweep           1.30  10.19  14.83
Striped Sweep (4)    0.72   4.02   6.91
Striped Sweep (8)    0.56   2.72   4.91
Striped Sweep (16)   0.48   1.97   3.73
Striped Sweep (32)   0.43   1.57   3.18
Striped Sweep (64)   0.42   1.39   3.06
Striped Sweep (128)  0.45   1.36   2.91
Striped Sweep (256)  0.54   1.61   3.25

Table 2: Performance comparison of the four plane-sweeping algorithms in main memory (times in seconds).

We point out that Tree Sweep is the only algorithm that has a good worst-case behavior. While Striped Sweep is the fastest algorithm on the tested data sets, Tree Sweep is useful because it offers reasonably good performance even on very skewed data. As discussed in the next section we therefore decided to use both of them in our practically efficient, yet skew-resistant, SSSJ algorithm.

7 Experimental Results

In this section, we compare the performance of the SSSJ and PBSM algorithms. We implemented the SSSJ algorithm along with two versions of the PBSM algorithm, one that follows exactly the description of Patel and DeWitt [PD96] and one that replaces their internal-memory plane-sweeping procedure with Striped Sweep. We refer to the original and improved PBSM as QPBSM and MPBSM, respectively.

We begin by giving a sketch of the PBSM algorithm. We then describe the details of our implementations, and compare the performance of SSSJ against that of QPBSM and MPBSM on TIGER/Line data sets. Finally, we compare the performance of the three algorithms on some artificial worst-case data sets that illustrate the robustness of SSSJ.

The claim for the New York data set was verified with additional runs on a different machine with larger main memory.

7.1 Sketch of the PBSM Algorithm

The filter step of PBSM consists of a decomposition step followed by a plane-sweeping step. In the first step the input is divided into p partitions such that each partition fits in memory. In the second step, each partition is loaded into memory and intersections are reported using an internal-memory plane-sweeping algorithm.

To form the partitions in the first step, a spatial partitioning function is used. More precisely, the input is divided into tiles of some fixed size. The number of tiles is somewhat larger than the number of partitions p. To form a partition, a tile-to-partition mapping scheme is used that combines several tiles into one partition. An input rectangle is placed in each partition it intersects with.

The tile-to-partition mapping scheme is obtained by ordering the tiles in row-major order, and using either round robin or hashing on the tile number. Figure 5 illustrates the round-robin scheme, which was used in our implementations of PBSM. Note that an input rectangle can appear in more than one partition, which makes it very difficult to compute a priori the number of partitions p such that each partition fits in internal memory. Instead, an estimation is used that does not take duplication into account.

Tile 0/Part 0   Tile 1/Part 1   Tile 2/Part 2    Tile 3/Part 0
Tile 4/Part 1   Tile 5/Part 2   Tile 6/Part 0    Tile 7/Part 1
Tile 8/Part 2   Tile 9/Part 0   Tile 10/Part 1   Tile 11/Part 2

Figure 5: Partitioning with 3 partitions and 12 tiles using the round-robin scheme. The rectangle drawn with a solid line will appear in all three partitions.
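The round-robin tile-to-partition mapping of Figure 5 can be sketched as follows. The grid geometry and the function name are our own illustrative choices; PBSM's actual bookkeeping operates on streams of rectangles rather than single queries.

```python
# Sketch of PBSM's round-robin tile-to-partition assignment: tiles are
# numbered in row-major order and tile t goes to partition t mod p. A
# rectangle is assigned to the partition of every tile its MBR intersects,
# which is how duplicates across partitions arise.

def tile_partitions(rect, cols, rows, p, extent=1.0):
    """Return the set of partitions that rectangle (x_lo, x_hi, y_lo, y_hi)
    is placed in, for a cols x rows tile grid over a square [0, extent)^2."""
    tw, th = extent / cols, extent / rows
    x_lo, x_hi, y_lo, y_hi = rect

    def clamp(v, hi):
        return max(0, min(hi - 1, int(v)))

    parts = set()
    for row in range(clamp(y_lo / th, rows), clamp(y_hi / th, rows) + 1):
        for col in range(clamp(x_lo / tw, cols), clamp(x_hi / tw, cols) + 1):
            tile = row * cols + col   # row-major tile number
            parts.add(tile % p)       # round-robin tile -> partition
    return parts
```

With the 4x3 grid and 3 partitions of Figure 5, a rectangle spanning three horizontally adjacent tiles lands in all three partitions, reproducing the duplication shown in the figure.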

7.2 Implementation Details

We implemented the three algorithms using the Transparent Parallel I/O Programming Environment (TPIE) system [Ven94, Ven95, VV96] (see also http://www.cs.duke.edu/TPIE/). TPIE is a collection of templated functions and classes to support high-level yet efficient implementations of external-memory algorithms. The basic data structure in TPIE is a stream, representing a list of objects of an arbitrary type. The system contains I/O-efficient implementations of algorithms for scanning, merging, distributing, and sorting streams, which are building blocks for our algorithms. This made the implementation relatively easy and facilitated modular design. The input data consists of two streams of rectangles, each rectangle being a structure containing the coordinates of the lower left corner and of the upper right corner, and an ID, for a total of 40 bytes. The output consists of a stream containing the IDs of each pair of intersecting rectangles.

[Bar chart of running times (in seconds) for SSSJ, MPBSM, and QPBSM on the RI, CT, NJ, NY, and ALL data sets.]

Figure 6: Running times for TIGER data. For MPBSM and QPBSM, Phase 1 consists of the partitioning step, while for SSSJ, it consists of the initial sorting step. The remaining steps are contained in Phase 2.

As described in Section 6.1, SSSJ first performs a sort.

We used TPIE's external-memory merge sorting routine in our implementation. Then SSSJ tries to perform the internal-memory plane-sweeping directly, and reverts to Rectangle Join to partition the data only if the sweep runs out of main memory. We used Striped Sweep as the default internal-memory algorithm for the first sweep attempt, and switched to Tree Sweep once partitioning has occurred. In Rectangle Join, we used random sampling to determine the strip boundaries, thus avoiding an additional sort of the data along the x dimension. The I/O lists used by Rectangle Join were implemented as TPIE streams.

The PBSM implementations, QPBSM and MPBSM, consist of three basic steps: a partitioning step, which uses 1024 tiles, and then, for each generated partition, an internal-memory sorting step and finally a plane-sweeping step. The two PBSM programs differ in the way they perform the plane-sweep: QPBSM uses the Forward Sweep of Patel and DeWitt, while MPBSM uses our faster Striped Sweep.

7.3 Experiments with TIGER Data

We performed our experiments on a Sun SparcStation 20 running Solaris 2.5, with 32 megabytes of internal memory. In order to avoid network activity, we used a local disk for the input files as well as for scratch files. We put no restrictions on the amount of internal memory that QPBSM and MPBSM could use, and thus the virtual-memory system was invoked when needed. However, the amount of internal memory used by SSSJ was limited to 12 megabytes, which was the amount of free memory on the machine when the program was running. For all experiments, the logical block transfer size used by the TPIE streams was set to a multiple of the physical disk block size, in order to achieve a high transfer rate.

This is the value used by Patel and DeWitt in their experiments. They also found that increasing the number of tiles had little effect on the overall execution time.

[Plot of running times (in seconds) versus the number of rectangles (up to 1,000,000) for QPBSM, SSSJ, and MPBSM.]

Figure 7: Running times for the tall rect data set.

Figure 6 shows the running times of the three programs on the TIGER data sets. It can be seen that both SSSJ and MPBSM clearly outperform QPBSM: SSSJ performs substantially better than QPBSM, while MPBSM in turn performs slightly better than SSSJ. The gain in performance over QPBSM is due to the use of Striped Sweep in SSSJ and MPBSM. As the maximum overlap of the five data sets is relatively small, SSSJ only runs the TPIE external-memory sort and the internal-memory plane-sweep.

From an I/O perspective, the behavior of MPBSM and SSSJ is as follows. SSSJ first performs one scan over the data to produce sorted runs and then another scan to merge the sorted runs (in the TPIE merge sort), before scanning the data again to perform the plane-sweep. MPBSM, on the other hand, first distributes the data to the partitions using one scan, and then performs a second scan over the data that sorts each partition in internal memory and performs the plane-sweep on it. Thus, MPBSM has a slight advantage in our implementation because it makes one less scan of the data on disk.

A more efficient implementation of SSSJ would feed the output of the merge step of the TPIE sort directly into the scan used for the plane-sweep, thus eliminating one write and one read of the entire data. We believe that such an implementation would slightly outperform MPBSM. While conceptually this is a very simple change, it is somewhat more difficult in our setup as it would require us to open up and modify the TPIE merge sort. In general, such a change might make it more difficult to utilize existing, highly optimized external sort procedures.

7.4 Experiments with Synthetic Data

In order to illustrate the effect of skewed data distributions on performance, we compared the behavior of the three algorithms on synthetically generated data sets. We generated two data sets of skewed rectangles, following a procedure used in [Chi95]. Each data set contains two sets of rectangles, placed in a square domain. In order to guarantee that the reporting cost does not dominate the searching cost, the rectangles are chosen such that the total number of intersections between rectangles from the two sets is linear in the input size. The first data set, called tall rect, consists of long and skinny vertical rectangles which result in a large maximum overlap. To construct the data, we used a fixed width for each rectangle, and chose the height uniformly at random; we also chose the x and y coordinates of the lower left corner uniformly at random. The second data set, called wide rect, can be obtained by rotating the tall rect data set by 90 degrees. It has the same number of intersections, but a small expected maximum overlap.

[Plot of running times (in seconds) versus the number of rectangles (up to 1,000,000) for MPBSM, QPBSM, and SSSJ.]

Figure 8: Running times for the wide rect data set.
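A hedged sketch of such generators follows; the exact rectangle widths and height distribution of the [Chi95] procedure are not reproduced here, so the constants below (unit square, width 0.001, uniform heights) are our own illustrative choices.

```python
import random

def tall_rect(n, width=0.001, seed=0):
    """n long, skinny vertical rectangles in the unit square: large maximum
    overlap, since many rectangles cross the same horizontal line."""
    rng = random.Random(seed)
    rects = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        h = rng.random() * (1.0 - y)   # tall y-extent, clipped to the square
        rects.append((x, x + width, y, y + h))
    return rects

def wide_rect(n, **kw):
    """The same data rotated by 90 degrees: swap the x and y extents, giving
    the same intersections but a small expected maximum overlap."""
    return [(y_lo, y_hi, x_lo, x_hi)
            for x_lo, x_hi, y_lo, y_hi in tall_rect(n, **kw)]
```

Feeding `tall_rect` output to a y-sweep stresses the sweepline structure (large maximum overlap), while `wide_rect` keeps the structure small but still exercises PBSM's replication.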

Figure 7 shows the running times of the three algorithms on the tall rect data set. The performance of MPBSM degrades quickly, due to the replication of each input rectangle in several partitions. QPBSM performs even worse because of its quadratic worst-case plane-sweep: the rectangles are tall, and as we are scanning in the y direction, the time for each query becomes linear in the number of rectangles. This problem can be somewhat alleviated if we increase the number of partitions. However, this would further increase replication.

Figure 8 shows the running times on the wide rect data set. In this data set the maximum overlap is small (constant), which is advantageous for all three programs. However, the PBSM implementations still suffer from excessive replication.

8 Conclusions and Open Problems

In this paper, we have proposed a new algorithm for the spatial join problem called Scalable Sweeping-Based Spatial Join (SSSJ), which combines efficiency on realistic data with robustness against highly skewed and worst-case data. We have also studied the performance of several internal-memory plane-sweeping algorithms and their implications for the overall performance of spatial joins.

In our future work, we plan to compare the performance of SSSJ against tree-based methods [BKS93]. We also plan to study the problem of higher-dimensional joins. In particular, three-dimensional joins would be interesting since they arise quite naturally in GIS. Finally, a question left open by our experiments with internal-memory plane-sweeping algorithms is whether there exists a simple algorithm that matches the performance of Striped Sweep but that is less vulnerable to skew.

Acknowledgements

We would like to thank Jignesh Patel for his many clarifications on the implementation details of PBSM.

References

[Aga96] Ramesh C. Agarwal. A super scalar sort algorithm for RISC processors. In Proc. SIGMOD Intl. Conf. on Management of Data, pages 240–246, 1996.

[AM] L. Arge and P. B. Miltersen. On showing lowerbounds for external-memory computational geom-etry. Manuscript, 1998.

[APR+98] L. Arge, O. Procopiuc, S. Ramaswamy, T. Suel, and J. S. Vitter. Theory and practice of I/O-efficient algorithms for multidimensional batched searching problems. In Proc. ACM-SIAM Symp. on Discrete Algorithms, pages 685–694, 1998.

[ARC93] ARC/INFO. Understanding GIS—the ARC/INFOmethod. ARC/INFO, 1993. Rev. 6 for worksta-tions.

[Arg95] L. Arge. The buffer tree: A new technique for optimal I/O-algorithms. In Proc. Workshop on Algorithms and Data Structures, LNCS 955, pages 334–345, 1995. A complete version appears as BRICS Technical Report RS-96-28, University of Aarhus.

[Arg97] L. Arge. External-memory algorithms with applications in geographic information systems. In M. van Kreveld, J. Nievergelt, T. Roos, and P. Widmayer, editors, Algorithmic Foundations of GIS. Springer-Verlag, Lecture Notes in Computer Science 1340, 1997.

[AV88] A. Aggarwal and J. S. Vitter. The Input/Outputcomplexity of sorting and related problems. Com-munications of the ACM, 31(9):1116–1127, 1988.

[AVV98] L. Arge, D. E. Vengroff, and J. S. Vitter. External-memory algorithms for processing line segments in geographic information systems. Algorithmica (to appear in special issue on Geographical Information Systems), 1998. Extended abstract appears in Proc. of Third European Symposium on Algorithms, ESA '95.

[Ben77] J. L. Bentley. Algorithms for Klee’s rectangleproblems. Dept. of Computer Science, CarnegieMellon Univ., unpublished notes, 1977.

[BG92] L. Becker and R. H. Güting. Rule-based optimization and query processing in an extensible geometric database system. ACM Transactions on Database Systems, 17(2):247–303, 1992.

[BHF93] L. Becker, K. Hinrichs, and U. Finke. A new algorithm for computing joins with grid files. In International Conference on Data Engineering, pages 190–198, 1993. IEEE Computer Society Press.

[BKS93] T. Brinkhoff, H.-P. Kriegel, and B. Seeger. Efficient processing of spatial joins using R-trees. In Proc. SIGMOD Intl. Conf. on Management of Data, 1993.


[BKSS90] N. Beckmann, H.-P. Kriegel, R. Schneider, andB. Seeger. The R*-tree: An efficient and robustaccess method for points and rectangles. In Proc.SIGMOD Intl. Conf. on Management of Data,1990.

[Chi95] Y.-J. Chiang. Experiments on the practical I/O efficiency of geometric algorithms: Distribution sweep vs. plane sweep. In Proc. Workshop on Algorithms and Data Structures, LNCS 955, pages 346–357, 1995.

[CLR90] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. The MIT Press, Cambridge, Mass., 1990.

[DDC+97] A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, D. E. Culler, J. M. Hellerstein, and D. A. Patterson. High-performance sorting on networks of workstations. In Proc. SIGMOD Intl. Conf. on Management of Data, pages 243–254, 1997.

[Ede83] H. Edelsbrunner. A new approach to rectangle intersections, part I. Int. J. Computer Mathematics, 13:209–219, 1983.

[GS87] R. H. Güting and W. Schilling. A practical divide-and-conquer algorithm for the rectangle intersection problem. Information Sciences, 42:95–112, 1987.

[GTVV93] M. T. Goodrich, J.-J. Tsay, D. E. Vengroff, and J. S. Vitter. External-memory computational geometry. In Proc. IEEE Symp. on Foundations of Comp. Sci., pages 714–723, 1993.

[Gun93] O. Günther. Efficient computation of spatial joins. In International Conference on Data Engineering, pages 50–60, 1993. IEEE Computer Society Press.

[Gut85] A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proc. ACM-SIGMOD Conf. on Management of Data, pages 47–57, 1984.

[Han91] E. N. Hanson. The interval skip list: A data structure for finding all intervals that overlap a point. In Proceedings of Algorithms and Data Structures (WADS '91), volume 519 of LNCS, pages 153–164. Springer, 1991.

[HJR97] Y.-W. Huang, N. Jing, and E. A. Rundensteiner. Spatial joins using R-trees: Breadth-first traversal with global optimizations. In Proc. IEEE International Conf. on Very Large Databases, pages 396–405, 1997.

[HS92] E. G. Hoel and H. Samet. A qualitative comparison study of data structures for large linear segment databases. In Proc. ACM SIGMOD Conf., page 205, 1992.

[Int97] Intergraph Corp. MGE 7.0, "http://www.intergraph.com/iss/products/mge/mge-7.0.htm", 1997.

[KHT89] M. Kitsuregawa, L. Harada, and M. Takagi. Join strategies on kd-tree indexed relations. In International Conference on Data Engineering, pages 85–93, 1989. IEEE Computer Society Press.

[KS97] N. Koudas and K. C. Sevcik. Size separation spatial join. In Proc. SIGMOD Intl. Conf. on Management of Data, pages 324–335, 1997.

[LR94] M.-L. Lo and C. V. Ravishankar. Spatial joins using seeded trees. In Proc. SIGMOD Intl. Conf. on Management of Data, pages 209–220, 1994.

[LR95] M.-L. Lo and C. V. Ravishankar. Generatingseeded trees from data sets. In Proc. InternationalSymp. on Large Spatial Databases, 1995.

[LR96] M.-L. Lo and C. V. Ravishankar. Spatial hash-joins. In Proc. SIGMOD Intl. Conf. on Manage-ment of Data, pages 247–258, 1996.

[McC85] E. M. McCreight. Priority search trees. SIAM Journal of Computing, 14(2):257–276, 1985.

[NBC+94] C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, and D. Lomet. AlphaSort: A RISC machine sort. In Proc. SIGMOD Intl. Conf. on Management of Data, pages 233–242, 1994.

[NHS84] J. Nievergelt, H. Hinterberger, and K. C. Sevcik. The grid file: An adaptable, symmetric multikey file structure. ACM Transactions on Database Systems, 9(1):257–276, 1984.

[OM88] J. A. Orenstein and F. A. Manola. PROBE spatial data modeling and query processing in an image database application. IEEE Transactions on Software Engineering, 14(5):611–629, 1988.

[Ore86] J. A. Orenstein. Spatial query processing in anobject-oriented database system. In Carlo Zaniolo,editor, Proceedings of the 1986 ACM SIGMOD In-ternational Conference on Management of Data,pages 326–336, 1986.

[Ore89] J. A. Orenstein. Redundancy in spatial databases.SIGMOD Record (ACM Special Interest Group onManagement of Data), 18(2):294–305, June 1989.

[Ore90] J. A. Orenstein. A comparison of spatial queryprocessing techniques for native and parameterspaces. SIGMOD Record (ACM Special InterestGroup on Management of Data), 19(2):343–352,June 1990.

[PD96] J. M. Patel and D. J. DeWitt. Partition basedspatial-merge join. In Proc. SIGMOD Intl. Conf.on Management of Data, pages 259–270, 1996.

[PS85] F. P. Preparata and M. I. Shamos. ComputationalGeometry: An Introduction. Springer-Verlag,1985.

[Pug90] W. Pugh. Skip lists: A probabilistic alternativeto balanced trees. Communications of the ACM,33(6):668–676, June 1990.

[Rot91] D. Rotem. Spatial join indices. In InternationalConference on Data Engineering, pages 500–509,1991. IEEE Computer Society Press.

[Sam89] H. Samet. The Design and Analysis of Spatial Data Structures. Addison Wesley, MA, 1989.

[SRF87] T. Sellis, N. Roussopoulos, and C. Faloutsos. The R+-tree: A dynamic index for multi-dimensional objects. In Proc. IEEE International Conf. on Very Large Databases, 1987.

[Tig92] Tiger/line files (tm), 1992 technical documenta-tion. Technical report, U. S. Bureau of the Census.

[Ube94] M. Ubell. The montage extensible datablade ar-chitecture. In Proc. SIGMOD Intl. Conf. on Man-agement of Data, 1994.

[Val87] Patrick Valduriez. Join indices. ACM Transactionson Database Systems, 12(2):218–246, June 1987.

[Ven94] D. E. Vengroff. A transparent parallel I/O environment. In Proc. 1994 DAGS Symposium on Parallel Computation, 1994.

[Ven95] D. E. Vengroff. TPIE User Manual and Reference. Duke University, 1995, with subsequent revisions. Available via WWW at http://www.cs.duke.edu/TPIE.

[VV96] D. E. Vengroff and J. S. Vitter. I/O-efficient scientific computation using TPIE. In Proceedings of the Goddard Conference on Mass Storage Systems and Technologies, NASA Conference Publication 3340, Volume II, pages 553–570, 1996.