Worst Case Efficient Multidimensional Indexing



    Worst Case Efficient Multidimensional Indexing

    Massimiliano Tomassoli

    June 3, 2008


    Abstract

Let us consider the following problem. Let U_1, U_2, ..., U_d be totally ordered sets and let U = U_1 × U_2 × ⋯ × U_d, where d ≥ 2. Given a set S = {s_1, s_2, ..., s_n} ⊆ U and two points a = (a_1, a_2, ..., a_d) ∈ U and b = (b_1, b_2, ..., b_d) ∈ U, find {(x_1, x_2, ..., x_d) ∈ S | ∀i ∈ {1, 2, ..., d}: a_i ≤ x_i ≤ b_i}.

This paper presents an algorithm that can solve this problem in space O(n log^{d−1} n) and time O(log^{d−1} n) with a precomputation step (independent of a and b) taking time O(n log^{d−1} n). Moreover, whenever the restrictions are imposed only on k ≥ 2 coordinates, the search takes time O(log^{k−1} n + d − k).

If the space is two-dimensional, the problem consists in finding all the points that are both in S and in the rectangle represented by a and b. In this case the algorithm solves the problem in space O(n log n) and time O(log n).

If S is dynamic, the problem is more complex. Some ideas aimed at solving the dynamic version of the problem are presented as well.


    About this document

This document was written in a hurry, therefore the bibliography is missing and, above all, the algorithms and methods described in this document were never implemented nor tested in any way. The proofs of the theorems presented are only sketched and no one has ever read them besides me.

I do not take any responsibility for any harm which may result from using the information contained in this document.

Copyright 2008 by Massimiliano Tomassoli. This document may be freely distributed and duplicated as long as this copyright notice remains intact. For any comments: [email protected].


    Contents

1 Introduction
  1.1 The one-dimensional case
  1.2 The multidimensional case
    1.2.1 Some common approaches
    1.2.2 UB Trees and space-filling curves

2 Static Indexing
  2.1 Logarithmic Decomposition
  2.2 First method: O(log² n)
  2.3 Second method: O(log n)
  2.4 The Multidimensional Case

3 Dynamic Indexing
  3.1 Splitting and Fusion
  3.2 Point Insertion and Deletion


    Chapter 1

    Introduction

    1.1 The one-dimensional case

Almost every program needs to store some data. The problem is how to quickly retrieve that information when it is needed. We all know the principal data structures: stack, queue, list, hash table and tree. This paper is mainly concerned with trees. A tree is a simple but powerful data structure where n objects are organized in such a way that we can access a single object (it does not matter which one) in time O(log n) in the worst case. It is a remarkable result, and the idea behind it is so simple!

    Let us take a sorted list of numbers:

1 5 8 9 20 23 28 30    (1.1)

We notice that each single number partitions the list into two parts in a natural way. For instance, the number 20 partitions list (1.1) into the two lists

    1 5 8 9 (1.2)

    and

    20 23 28 30 (1.3)

The way we associate a partition to a number is, of course, completely arbitrary, but some ways appear to be more natural than others. If we were to search for the number 23, which (sub)list should we use? It is evident that we should prefer list (1.3). The idea is clear: if we split a list into two parts, then by inspecting a single number (20, in our example) we can immediately determine which sublist cannot possibly contain the number we are searching for. Now no one prohibits us from reapplying this method over and over until we have single-object sublists. If we split each sublist into parts of roughly the same size, we can find every object in time O(log n), as you can see by looking at the scheme in figure 1.1 or, if you prefer, at the tree in figure 1.2 or, why not, at the skip list in figure 1.3. Now let us think of a simple geometric problem we could successfully tackle using our binary trees (or some variation of them).

Problem 1.1.1. Given a set S = {x_1, x_2, ..., x_n} ⊆ Z and a, b ∈ Z, find S ∩ [a, b].


    Figure 1.1: Simple scheme resembling a binary search tree.


    Figure 1.2: Simple balanced binary search tree.


    Figure 1.3: Simple skip list.


Problem 1.1.1 asks us to find all the integers x in S such that a ≤ x ≤ b.

If we insert the elements of S into a tree or, even better, sort them in time O(n log n) and build a balanced tree in time O(n), or put them in an array and perform a sequence of binary searches on them, we can find the smallest integer in S greater than or equal to a in time O(log n). Each of the other integers can then be found in time O(1).
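As an illustration of this one-dimensional strategy, here is a minimal sketch in Python (mine, not part of the original paper), using a sorted array and two binary searches:

    from bisect import bisect_left, bisect_right

    def range_query_1d(sorted_xs, a, b):
        # Return all x in sorted_xs with a <= x <= b.
        # The first binary search costs O(log n); every further
        # element of the answer is then reported in O(1).
        lo = bisect_left(sorted_xs, a)    # smallest index with sorted_xs[lo] >= a
        hi = bisect_right(sorted_xs, b)   # first index with sorted_xs[hi] > b
        return sorted_xs[lo:hi]

    # With list (1.1): range_query_1d([1, 5, 8, 9, 20, 23, 28, 30], 9, 28)
    # returns [9, 20, 23, 28].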

    1.2 The multidimensional case

The multidimensional case is just a natural generalization of the one-dimensional case. Let us consider the two-dimensional case:

Problem 1.2.1. Given a set S = {x_1, x_2, ..., x_n} ⊆ Z² and a, b ∈ Z², find

{(x, y) ∈ S | a_x ≤ x ≤ b_x ∧ a_y ≤ y ≤ b_y}.

Problem 1.2.1 seems much harder than problem 1.1.1. How can trees be generalized to solve it efficiently?

    1.2.1 Some common approaches

All the methods I have seen share similarities and many classifications are possible, but there are some important differences. Almost every algorithm tries to group points into overlapping or non-overlapping regions, but these regions are organized in different kinds of hierarchies:

Grids The space is divided into non-overlapping hypercubes whose position is independent of the points. They may be seen as a kind of multidimensional hash table.

Multilevel Grids The space is divided into non-overlapping hypercubes, which can be further divided into other hypercubes. They are usually adaptive in the sense that single hypercubes are further subdivided only if they contain enough points. Note that while the selection of the hypercubes to be divided is adaptive, the way a single hypercube is divided is not.

Splitting Planes The space is split by planes, which can be axis-aligned or arbitrary. Axis-aligned planes simplify the splitting process but they might not adapt very well to the distribution of the points. The classic Binary Space Partition (BSP) trees use arbitrary planes, while KD-trees, for instance, use axis-aligned planes.

Bounding Objects The points are enclosed in (usually simple) objects. The idea is to quickly discard empty spaces and maximize the density of the enclosed portion of space. The Axis-Aligned Bounding Boxes (AABB) are probably the simplest and most used Bounding Objects (BO). They are very easy to handle but they might not adapt very well to the distribution of the points. The BOs are usually recursively organized, i.e. each BO but the first is (often entirely) contained in some other BO.

Regions Induced by Ordering An ordering is imposed on the space so that for each pair of points (contained in that space) the operator < is defined (see section 1.2.2).


Figure 1.4: This figure shows a space on which a lexicographic ordering is imposed. Note that the induced partition is quite naive; indeed, point locality is barely exploited. For instance, the shaded rectangle intersects every region but does not contain any points.

    1.2.2 UB Trees and space-filling curves

When a total ordering is imposed on the space, a region becomes just an interval. The performance of a particular algorithm of this family depends crucially on the distribution of the points and on the kind of ordering imposed on the space. For instance, we could just impose the well-known lexicographic ordering, but that would not be a great choice; indeed, the regions induced by it would probably be very thin and we would not take advantage of spatial locality (see figure 1.4). As you should have already guessed, we do translate our two-dimensional problem into a one-dimensional problem, but the time required to solve it in the worst case is O(n). For instance, if we are asked to find all the points in the shaded rectangle in figure 1.4, we will have to visit every single region!

The UB Trees employ a more complex ordering based on Grids, but the O(n) worst case still stands. First a hypercube H_0, big enough to contain all the points, is chosen and then divided into 2^d identical hypercubes, where d is the dimension of the space. Let H = {h_1, h_2, ..., h_{2^d}} be the set of these hypercubes. Each hypercube h_i can be further divided into other 2^d hypercubes. Let H_i = {h_{i,1}, h_{i,2}, ..., h_{i,2^d}} be the set of these hypercubes. In general, we define H_{a_1 a_2 ... a_n} as the set {h_{a_1 a_2 ... a_n, 1}, h_{a_1 a_2 ... a_n, 2}, ..., h_{a_1 a_2 ... a_n, 2^d}} of the 2^d hypercubes into which h_{a_1 a_2 ... a_n} is divided. One point p is smaller than another point q if and only if there exist two words a = a_1 a_2 ... a_i and b = b_1 b_2 ... b_j such that p ∈ h_a, q ∈ h_b and a lexicographically precedes b.


Figure 1.5: This figure shows how, in UB-trees, we can tell which one of two points is the smaller. In this case, we clearly see that the point in h_{1,4,1} is smaller than the point in h_{1,4,4}.

    Figure 1.6: This figure shows some regions induced by the total ordering usedin the UB-Trees (based on the Z-curve).

The regions induced by this kind of subdivision scheme (see figure 1.6) cannot be as thin as the ones induced by the naive subdivision scheme described above (see figure 1.4).

An ordering can also be induced by a space-filling curve. In mathematics, a space-filling curve is a surjective continuous curve f : [0, 1] → S, where S is a topological space. Usually, S is R^d. These curves are called space-filling because they actually fill the entire space. Given p, q ∈ S, we could say that p < q ⟺ min(f^{−1}(p)) < min(f^{−1}(q)), i.e. p < q if and only if the curve f reaches p before it reaches q. Some of these curves are recursive in nature. If f is a recursive space-filling curve, there is an infinite sequence of functions f_0, f_1, f_2, ... where f_0 is given and, for each i ≥ 0, f_{i+1} is obtained by applying a construction step to f_i, and f = lim_{n→∞} f_n. Usually we can tell whether p < q just by looking at a single function of the infinite sequence above. Let us look at an example. The order imposed on the space by the UB Trees is induced by the so-called Z-curve. Figure 1.7 shows the first approximations of the Z-curve.


Figure 1.7: This figure shows the first four approximations of the Z-curve. The recursive nature of the curve is quite evident.

It should be evident that, for instance, points in h_1 intersect the curve before points in h_2 do.
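As a concrete illustration (my sketch, not the paper's), the Z-curve order of two points with non-negative integer coordinates can be compared by interleaving the bits of their coordinates, the classic Morton code:

    def z_key(x, y, bits=16):
        # Interleave the bits of x and y (Morton code). Comparing
        # z_key(p) with z_key(q) tells which point the Z-curve
        # visits first. Which coordinate takes the odd positions
        # is just a convention.
        key = 0
        for i in range(bits):
            key |= ((x >> i) & 1) << (2 * i + 1)   # bits of x in odd positions
            key |= ((y >> i) & 1) << (2 * i)       # bits of y in even positions
        return key

    points = [(1, 2), (3, 0), (0, 1), (2, 2)]
    points.sort(key=lambda p: z_key(*p))           # points in Z-curve order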


    Chapter 2

    Static Indexing

This chapter deals with the static version of the problem, i.e. where all the points in our space are known in advance and no points can be inserted or deleted.

    2.1 Logarithmic Decomposition

Definition 2.1.1. Let S = (a_1, a_2, ..., a_n) be a finite sequence. A sequence (a_i, a_{i+1}, ..., a_j), where 1 ≤ i ≤ j ≤ n or j < i (in which case the subsequence is empty), is called a (finite) connected subsequence of S.

Definition 2.1.2. Let S = (a_1, a_2, ..., a_n) be a finite sequence and let σ_i^j denote the connected subsequence (a_i, a_{i+1}, ..., a_j). The partial decomposition of level i of S, where i > 0, is the sequence

∂_i S = (σ_1^i, σ_{i+1}^{2i}, ..., σ_{n−i+1}^n)   if i divides n,

∂_i S = (σ_1^i, σ_{i+1}^{2i}, ..., σ_{(⌊n/i⌋−1)i+1}^{⌊n/i⌋i}, σ_{⌊n/i⌋i+1}^n)   otherwise.

Example 2.1.3. If S = (5, 1, 7, 2, 8, 2, 9, 6), then

∂_1 S = ((5), (1), (7), (2), (8), (2), (9), (6))

∂_2 S = ((5, 1), (7, 2), (8, 2), (9, 6))

∂_3 S = ((5, 1, 7), (2, 8, 2), (9, 6))

∂_4 S = ((5, 1, 7, 2), (8, 2, 9, 6))

∂_5 S = ((5, 1, 7, 2, 8), (2, 9, 6))

∂_{10} S = ((5, 1, 7, 2, 8, 2, 9, 6))

Definition 2.1.4. If S is a finite sequence of length n, its logarithmic decomposition is the set ∂S = ⋃_{i=0}^{⌈log₂ n⌉} {∂_{2^i} S}.

Example 2.1.5. If S = (5, 1, 7, 2, 8, 2, 9, 6, 3, 4), then ∂S = {∂_1 S, ∂_2 S, ∂_4 S, ∂_8 S, ∂_{16} S}.
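Definitions 2.1.2 and 2.1.4 translate directly into code; the following Python sketch is mine, not the author's:

    from math import ceil, log2

    def partial_decomposition(s, i):
        # Partial decomposition of level i: split s into consecutive
        # groups of i elements (the last group may be shorter when
        # i does not divide len(s)).
        return [s[j:j + i] for j in range(0, len(s), i)]

    def logarithmic_decomposition(s):
        # The partial decompositions of levels 1, 2, 4, ..., 2^ceil(log2 n).
        levels = ceil(log2(len(s))) if len(s) > 1 else 0
        return {2 ** i: partial_decomposition(s, 2 ** i)
                for i in range(levels + 1)}

    # partial_decomposition([5, 1, 7, 2, 8, 2, 9, 6], 3)
    #   -> [[5, 1, 7], [2, 8, 2], [9, 6]]          (example 2.1.3)
    # sorted(logarithmic_decomposition([5, 1, 7, 2, 8, 2, 9, 6, 3, 4]))
    #   -> [1, 2, 4, 8, 16]                        (example 2.1.5)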

Definition 2.1.6. Let S = (s_1, s_2, ..., s_m) and P = (p_1, p_2, ..., p_n) be two sequences. The concatenation S · P, or simply SP if no confusion arises, is the sequence (s_1, s_2, ..., s_m, p_1, p_2, ..., p_n).


Figure 2.1: In this figure we see a subsequence P of a sequence S and a 2-selection of the logarithmic decomposition of S such that the concatenation of that selection contains the same points as P.

Definition 2.1.7. If S = (s_1, s_2, ..., s_n) is a sequence of sequences, then its concatenation is the sequence s_1 s_2 ⋯ s_n.

Definition 2.1.8. Let S = (s_1, s_2, ..., s_n) be a sequence of sequences. For each i ∈ {1, 2, ..., n}, let t_i be a subsequence of s_i of length at most l. Let (k_1, k_2, ..., k_h) be a subsequence of (1, 2, ..., n) such that the sequence T = (t_{k_1}, t_{k_2}, ..., t_{k_h}) contains every non-empty t_i. The concatenation of T is called an l-selection of S.

Example 2.1.9. Let S = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) and P = (3, 4, 5, 6, 7, 8, 9, 10, 11). To make a 2-selection of ∂S we must choose not more than two sequences from each ∂_i S, for i = 1, 2, 4, 8, 16. Let the chosen 2-selection be T = ((11), (3, 4), (9, 10), (5, 6, 7, 8)). U = ((3, 4), (5, 6, 7, 8), (9, 10), (11)) is a permutation of T and its concatenation is equal to P.

    To simplify our discussion, let us give a few more definitions.

Definition 2.1.10. If s is a connected subsequence of a sequence t, we write s ⊑ t. If s is a proper connected subsequence of t, we also write s ⊏ t.

Definition 2.1.11. Let k be a positive integer such that ∂_k S and ∂_{k/2} S are two sequences in ∂S. If s is a sequence in ∂_k S, there are two sequences t_1 and t_2 in ∂_{k/2} S such that t_1 t_2 = s. Let L(s) be t_1 and R(s) be t_2.

Theorem 2.1.12. Let S be a sequence and P a connected subsequence of S. There exists a 2-selection T of ∂S of length O(log n), and a permutation π, such that the concatenation of π(T) is equal to P.

Proof. (sketch) See figure 2.1. Let c = 2^{⌈log₂ |S|⌉}. Algorithm 2.1.12.1 can find such a π(T) in time O(log n). The algorithm is very simple. If P is empty then T = (). Let us assume P is not empty. Let d be, initially, the concatenation of ∂_c S, i.e. the single sequence in ∂_c S. It is clear that P ⊑ d; then we can have four cases:

1. P = d

2. P ⊑ L(d)

3. P ⊑ R(d)

4. none of the above


if P = ()
    return ()
π(T) := ()                       // initially empty
d := concatenation of ∂_c S
while true
    if P = d
        return (d)
    if P ⊑ L(d)
        d := L(d)
    else if P ⊑ R(d)
        d := R(d)
    else                         // P straddles the middle of d
        d1 := L(d); P1 := P \ R(d)
        while P1 ≠ ()
            if P1 = d1           // P1 is a whole group: select it and stop
                π(T) := (d1) · π(T); P1 := ()
            else if P1 ⊑ R(d1)
                d1 := R(d1)
            else
                π(T) := (R(d1)) · π(T)
                P1 := P1 \ R(d1); d1 := L(d1)
        d2 := R(d); P2 := P \ L(d)
        while P2 ≠ ()
            if P2 = d2           // P2 is a whole group: select it and stop
                π(T) := π(T) · (d2); P2 := ()
            else if P2 ⊑ L(d2)
                d2 := L(d2)
            else
                π(T) := π(T) · (L(d2))
                P2 := P2 \ L(d2); d2 := R(d2)
        return π(T)

    Algorithm 2.1.12.1: This algorithm is used in the proof of theorem 2.1.12.


Figure 2.2: This figure shows the same situation as figure 2.1, but here a binary tree connecting the sequences of the logarithmic decomposition is shown. Note that the arrows of the tree let us reach all the shaded sequences in a total time of O(log n).

In the first case, d is selected. If we are in the second or the third case we simply redefine d as L(d) or R(d), respectively. The fourth case requires that we split P into two subsequences P_1 and P_2 such that P_1 P_2 = P, P_1 ⊑ L(d) and P_2 ⊑ R(d). We can now handle P_1 and P_2 separately. The important thing to note is that the last element of P_1 is equal to the last element of L(d), and the first element of P_2 is equal to the first element of R(d). This means that, every time we select a sequence to add to our 2-selection, we handle the left (right) part of the current P_1 (P_2). Moreover, the remaining part is strictly smaller than the handled part, therefore the algorithm must terminate in at most O(log n) steps. Since we never choose more than one sequence per step, we select O(log n) sequences at most.
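The same 2-selection can be computed with plain index arithmetic. The sketch below is mine, not the paper's; it assumes 0-based positions and a sequence length that is a power of two, and returns the selected groups as (start, length) pairs, in left-to-right order:

    def two_selection(lo, hi, block_lo, block_hi, out):
        # Cover the interval [lo, hi) with maximal groups taken from
        # the logarithmic decomposition of [block_lo, block_hi).
        # At most two groups per level are selected, so the output
        # has O(log n) groups.
        if lo >= hi:
            return
        if (lo, hi) == (block_lo, block_hi):
            out.append((block_lo, block_hi - block_lo))
            return
        mid = (block_lo + block_hi) // 2
        if hi <= mid:                      # P lies inside L(d)
            two_selection(lo, hi, block_lo, mid, out)
        elif lo >= mid:                    # P lies inside R(d)
            two_selection(lo, hi, mid, block_hi, out)
        else:                              # P straddles the middle: split
            two_selection(lo, mid, block_lo, mid, out)
            two_selection(mid, hi, mid, block_hi, out)

    # For P = (s6, ..., s15) in a 16-element sequence (figure 2.1):
    # out = []; two_selection(5, 15, 0, 16, out) gives
    # [(5, 1), (6, 2), (8, 4), (12, 2), (14, 1)].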

Theorem 2.1.13. Let S be a sequence and P a connected subsequence of S. The logarithmic decomposition ∂S can be built in time O(n log n) and a 2-selection satisfying theorem 2.1.12 can be found in time O(log n).

Proof. (sketch) Let S = (s_1, s_2, ..., s_n) and c = 2^{⌈log₂ |S|⌉}. It is clear that every sequence ∂_i S can be built from S in time O(n). Because |∂S| = O(log n), ∂S can be built in time O(n log n). We also build a binary search tree B such that, for every level 2^i ≤ c, each sequence d ∈ ∂_{2^i} S (or its first element) points to the sequences (or their first elements) L(d) and R(d). Have a look at figure 2.2. We can now solve the problem by using the algorithm described in the proof of theorem 2.1.12. B makes the implementation of the algorithm completely straightforward.


2.2 First method: O(log² n)

    Let us generalize problem 1.2.1 a little.

Definition 2.2.1. If p ∈ A_1 × A_2 × ⋯ × A_n and p = (p_1, p_2, ..., p_n), then p_{a_i} = _{a_i}p = p_i.

Example 2.2.2. If p ∈ U × V and p = (x, y), then p_u = _u p = x and p_v = _v p = y.

Definition 2.2.3. If S = {s_1, s_2, ...} ⊆ A_1 × A_2 × ⋯ × A_n, then _{a_i}S = {_{a_i}s_1, _{a_i}s_2, ...}. For sequences, replace { and } with ( and ), respectively.

Problem 2.2.4. Let U and V be two totally ordered sets. Given a set S = {s_1, s_2, ..., s_n} ⊆ U × V and a, b ∈ U × V, find {(u, v) ∈ S | a_u ≤ u ≤ b_u ∧ a_v ≤ v ≤ b_v}.

This generalization makes sure that our method does not take advantage of anything but the fact that the sets are totally ordered.

Definition 2.2.5. Given two sequences a = (a_1, a_2, ...) and b = (b_1, b_2, ...), a is lexicographically smaller than b, written a < b, if there is an index i such that a_i < b_i and a_j = b_j for every j < i.

[...]


Figure 2.3: This figure shows a set S of points in the space U × V and the ordered projection Q of S onto V. Q is ordered because it is a sequence (q_1, q_2, ..., q_n) such that q_i < q_j ⟺ i < j for each valid i, j.


Figure 2.4: This figure shows the same points as figure 2.3, but here a representation of the logarithmic decomposition ∂Q and a representation of the group logarithmic decomposition G are also shown.


Figure 2.5: This figure shows the same points as figure 2.4, but represents the group logarithmic decomposition in a more practical way.


Definition 2.2.11. If a ∈ A_1 × A_2 × ⋯ × A_n and a = (a_1, a_2, ..., a_n), let ā = (a_n, a_{n−1}, ..., a_1) ∈ A_n × A_{n−1} × ⋯ × A_1.

Example 2.2.12. Let a, b ∈ U × V. We know that a < b ⟺ a_u < b_u ∨ (a_u = b_u ∧ a_v < b_v). Analogously, ā < b̄ ⟺ a_v < b_v ∨ (a_v = b_v ∧ a_u < b_u).

Theorem 2.2.13. Let U and V be two totally ordered sets. Given a set S = {s_1, s_2, ..., s_n} ⊆ U × V, the ordered group decomposition H_S can be built in time O(n log n).

Proof. (sketch) First we determine T = (t_1, t_2, ..., t_n) such that it contains the same elements as S and t̄_i < t̄_j ⟺ i < j. Let H_i be the i-th element of H_S. Because all the elements in T with the same coordinate v are adjacent in T, T is the concatenation of H_1. Therefore, by sorting S and doing a bit of bookkeeping, we can obtain H_1 in time O(n log n). Let us say h_{i,j} is the j-th element of H_i. We can obtain H_{i+1} by merging ⌊|H_i|/2⌋ pairs of adjacent sequences in H_i in time O(n). Let us see why. If a, b are two sorted sequences, then we can obtain the sorted sequence c that contains the same elements as ab by performing at most |ab| − 1 comparisons (see merge sort if you are not familiar with it). It follows that the merging of all the ⌊|H_i|/2⌋ pairs of sequences in H_i requires

∑_{j=0}^{⌊|H_i|/2⌋−1} (|h_{i,2j+1}| + |h_{i,2j+2}| − 1) < n

comparisons. Since there are O(log n) levels, H_S can be built in time O(n log n).
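The following Python sketch is a hedged reconstruction of the construction used in this proof, under the assumption that level 1 groups the points sharing a v-coordinate and that every further level merges adjacent pairs of groups, mergesort-style; the code and all names are mine:

    from itertools import groupby

    def ordered_group_decomposition(points):
        # points: list of (u, v) tuples. Returns a list of levels,
        # each level being a list of lexicographically sorted groups.
        pts = sorted(points, key=lambda p: (p[1], p[0]))    # sort by v, then u
        # level 1: one group per distinct v-coordinate; inside a group
        # v is constant, so each group is already sorted by (u, v)
        level = [list(g) for _, g in groupby(pts, key=lambda p: p[1])]
        levels = [level]
        while len(level) > 1:
            nxt = [merge(level[i], level[i + 1])
                   for i in range(0, len(level) - 1, 2)]
            if len(level) % 2 == 1:
                nxt.append(level[-1])      # odd group left unpaired
            level = nxt
            levels.append(level)
        return levels

    def merge(a, b):
        # standard O(|a| + |b|) merge of two sorted lists
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] <= b[j]:
                out.append(a[i]); i += 1
            else:
                out.append(b[j]); j += 1
        return out + a[i:] + b[j:]

Under these assumptions, the concatenation of the last level is the lexicographically ordered sequence Z used below.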

2.3 Second method: O(log n)

[...] > b because we are under the lexicographic order. That is why c and d point to a rather than to b.

Theorem 2.3.5. Let U and V be two totally ordered sets, and let S be a subset of U × V. The proximity graph A_S of S takes space O(n log n) and can be built in time O(n log n).

Proof. (sketch) Look again at figure 2.9. A_S contains |S| points. Each point in A_S can point to at most one brother per level; therefore, because there are O(log n) levels, it can point to O(log n) brothers at most. This means that, for each point p in A_S, the set of the arrows that start from p has cardinality O(log n), therefore we need at most O(log n) pointers per point.

H_S takes space O(n log n) and, by theorem 2.2.13, can be built in time O(n log n). We first build H_S and then transform it into A_S in the following way. Let h_i be the sequence of level i in H_S, and h_{i,j} the j-th sequence in h_i. For each valid level l, all the pointers representing the arrows of level l in A_S can be set by considering every pair (h_{l,i}, h_{l,i+1}) such that there exists h_{2l,k} that has the same elements as h_{l,i} h_{l,i+1}. Let h_{l,i} = (x_1, x_2, ..., x_s) and h_{l,i+1} = (y_1, y_2, ..., y_t). We use algorithm 2.3.5.1. Algorithm 2.3.5.1 never spends more than O(1) time on each pair (x_i, y_j) and, because the indices i and j are only incremented, never considers more than s + t pairs.

Since A_S has O(n log n) arrows, A_S can be built in time O(n log n). Note that, in a real implementation, we can, and should, build A_S from scratch.


Problem 2.3.6. Let U and V be two totally ordered sets, and S a subset of U × V. Let a, b ∈ U × V, and P = {(u, v) ∈ S | a_v ≤ v ≤ b_v}. Given a and b, find a 2-selection T = (t_1, t_2, ..., t_m) of H_S of length O(log n), such that the concatenation of T contains the same points as P. For each t_i, find also, if there is such a point, the biggest point p in t_i such that p_u ≤ b_u.

Theorem 2.3.7. There exists an algorithm that solves problem 2.3.6 in time O(log n) and space O(n log n) with a precomputation step, independent of a and b, taking time O(n log n).

Proof. (sketch) According to theorem 2.3.5 we can build the proximity graph A_S in time O(n log n). To simplify this discussion, let us assume that A_S and H_S are joined to form a single data structure. We do not really need H_S, but it simplifies the explanation somewhat. We do, however, need the sequence in the last sequence in H_S (i.e. the lexicographically ordered sequence which contains all the points in S). Let us call it Z.

Let h_{i,j} be the j-th sequence in the i-th sequence in H_S. For each valid i and j, let M_{i,j} be the biggest point in h_{i,j} such that (M_{i,j})_u ≤ b_u. If i > 1 then there is a k such that M_{i,j} = M_{i−1,k}. Let q be the integer such that h_{i,j} contains the same points as h_{i−1,k} h_{i−1,q}. Since M_{i−1,k} and M_{i−1,q} are brothers of level i − 1 and M_{i−1,k} = M_{i,j} > M_{i−1,q}, it is clear that, by definition of A_S, M_{i−1,k} points to M_{i−1,q}.

If s is a sequence, let M_s be the biggest point in s such that (M_s)_u ≤ b_u. The proof of theorem 2.2.15 (which refers to the proof of theorem 2.1.13) shows how T can be found in time O(log n) by starting from Z and, for each sequence s, considering the subsequences L(s) and R(s). Since we can find M_Z with a single O(log n)-time search in Z, and since, for each sequence s, given M_s we can find M_{L(s)} and M_{R(s)} in time O(1), we can indeed solve problem 2.3.6 in time O(log n).

Theorem 2.3.8. Problem 2.2.4 can be solved in space O(n log n) and time O(log n) with a precomputation step (independent of a and b) which takes time O(n log n).

Proof. (sketch) Let a and b be the two points in problem 2.2.4. According to theorem 2.3.7, we can solve problem 2.3.6 for the same two points a and b in time O(log n) and space O(n log n) with a precomputation step, independent of a and b, taking time O(n log n). Let P = {p_1, p_2, ..., p_k} be the set of the O(log n) points we are asked to find in problem 2.3.6. All the points in the rectangle identified by the points a and b can be found by searching for the biggest points on the left of each one of the points in P. We can do this by simply following the arrows in A_S as we did to find the points p_i in the first instance.

There is only a little obstacle: points with the same v-coordinate do not point to each other in A_S, but we can promptly solve the problem by adding the required O(n) arrows to A_S during its construction.

Note that since S is static we can actually let the client find the other points by itself and say that the problem can indeed be solved in time O(log n).

Remark 2.3.9. In a real implementation, we should not solve problem 2.2.4 exactly as described in the proof of theorem 2.3.8. Let a and b be the two points in problem 2.2.4 and let R be the set {(u, v) ∈ U × V | a_u ≤ u ≤ b_u}. We should check whether the points M_{i,j}, defined as in the proof of theorem 2.3.7, are included in R as we find them. If M_{i,j} is not in R, there is no need to follow the arrows that start from it: the pointed points will clearly be out of R as well.


Figure 2.10: This figure shows a partial proximity graph. Let us assume that we want to find all the points in the search rectangle above. The biggest point in the search rectangle is p, therefore our search starts exactly from it. The important thing to note is that as soon as we reach p_2, p_6 and p_8, we know that p and p_5 are the only two points in the search rectangle. There is no need to proceed any further.

Let us consider the case depicted in figure 2.10. After an O(log n)-time search we find the point p and then start taking advantage of A_S to determine the other points. In a real implementation, as soon as we see that p_2 is not in R, we stop following that branch, go back to p immediately and follow some lower-level arrow. Similarly, as soon as we reach p_6 and see that it is not in R, we go back to p_5.

    2.4 The Multidimensional Case

The methods described in the previous sections can be easily generalized to handle the multidimensional case. First let us look at the three-dimensional case.

Problem 2.4.1. Let U, V and W be three totally ordered sets. Given a set S = {s_1, s_2, ..., s_n} ⊆ U × V × W and a, b ∈ U × V × W, find {(u, v, w) ∈ S | a_u ≤ u ≤ b_u ∧ a_v ≤ v ≤ b_v}.


Figure 2.11: This figure shows how a logarithmic decomposition along the third axis can be used to partition the space in such a way that problem 2.4.2 may be solved by solving O(log n) instances of problem 2.4.1.

Problem 2.4.2. Let U, V and W be three totally ordered sets. Given a set S = {s_1, s_2, ..., s_n} ⊆ U × V × W and a, b ∈ U × V × W, find {(u, v, w) ∈ S | a_u ≤ u ≤ b_u ∧ a_v ≤ v ≤ b_v ∧ a_w ≤ w ≤ b_w}.

Theorem 2.4.3. Problem 2.4.2 can be solved in space O(n log² n) and time O(log² n) with a precomputation step (independent of a and b) taking time O(n log² n).

Proof. (sketch) First of all, note that problem 2.4.1 can be solved by slight variations of the two methods described in the previous sections. The important thing is that the lexicographic ordering be extended to include the third coordinate: that way, given two points, one is always smaller than the other.

This suggests that if we partition the three-dimensional space along the third axis by using the logarithmic decomposition, we can solve problem 2.4.2 by solving at most O(log n) instances of problem 2.4.1.

Now note also that the addition of a third dimension does not substantially alter the structures and the algorithms used to handle the group decompositions and the proximity graphs. Once again, the additional dimension is taken care of by the lexicographic ordering extended to include it. A group decomposition partitions the space in O(log n) different ways (in groups of length 1, 2, 4, etc.). Each one of these O(log n) partitioned spaces has to be further partitioned along the second axis as we do in the two-dimensional case, therefore we need space and time O(n log² n).

An example should help. Look at figure 2.11. First we partition the box (containing all the points in S) into 8 boxes of height 1: a_1, a_2, ..., a_8. Now we decompose each one of them along the second axis as we did in the two-dimensional case. This operation takes time O(|a_1| log n) + O(|a_2| log n) + ... + O(|a_8| log n) = O(n log n), where |a_i| is the number of points within a_i. We then partition the


box into 4 boxes of height 2: b_1, b_2, b_3, b_4. We decompose each one of them in time O(|b_1| log n) + O(|b_2| log n) + O(|b_3| log n) + O(|b_4| log n) = O(n log n). We repeat the same procedure for c_1, c_2 and d_1. The total time required to build a three-dimensional version of H_S or A_S is therefore O(n log² n). Note that if we merge all these proximity graphs into a single graph, we end up with a graph whose nodes each have O(log² n) arrows.

Let us say we want to find all the points p such that a_u ≤ p_u ≤ b_u ∧ a_v ≤ p_v ≤ b_v ∧ a_w ≤ p_w ≤ b_w. Let Z be the sequence contained in the last sequence in H_S. We know that Z contains the same points as S. If s is a sequence, let M_s be the biggest point in s such that (M_s)_u ≤ b_u. Let H_w be the group decomposition of S along the third axis. We know that there is a 2-selection T_w = (t_1, t_2, ..., t_k) of H_w such that the concatenation of T_w contains the same points as the set {(u, v, w) ∈ S | a_w ≤ w ≤ b_w}. We first find the point M_Z, then use the third-axis proximity graph to find the points M_{t_1}, M_{t_2}, ..., M_{t_k}.

For each M_{t_i} we can now use the second-axis proximity graphs to solve the k (i.e. O(log n)) 2.5D sub-problems. We call them 2.5D because they are 2D problems where the points are immersed in a three-dimensional space.

Note that since S is static we can actually find the biggest points, let the client find the other points by itself, and say that the problem can indeed be solved in time O(log² n).
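To make the slab construction of figure 2.11 concrete, here is a small sketch (mine, not the paper's) that builds the dyadic slabs along the third axis; each slab would then receive its own two-dimensional structure:

    def third_axis_slabs(points):
        # points: list of (u, v, w) tuples. For every level 1, 2, 4, ...
        # group the w-sorted points into runs of that length, as in the
        # logarithmic decomposition; a1..a8, b1..b4, c1, c2 and d1 in
        # figure 2.11 are exactly these runs.
        pts = sorted(points, key=lambda p: p[2])   # sort along w
        slabs = {}
        size = 1
        while size < 2 * len(pts):
            slabs[size] = [pts[i:i + size] for i in range(0, len(pts), size)]
            size *= 2
        return slabs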

Remark 2.4.4. What we said in remark 2.3.9 applies to the multidimensional case as well. For instance, if we are walking through the third-axis proximity graph and see that the current biggest point p has a u-coordinate smaller than that of a, we can abandon p immediately without even examining the second-axis proximity graphs it leads to or is related to.

Problem 2.4.5. Let U_1, U_2, ..., U_d be totally ordered sets and let U = U_1 × U_2 × ⋯ × U_d, where d ≥ 2. Given a set S = {s_1, s_2, ..., s_n} ⊆ U and a, b ∈ U, find {(x_1, x_2, ..., x_d) ∈ S | ∀i ∈ {1, 2, ..., d}: a_{u_i} ≤ x_i ≤ b_{u_i}}.

Theorem 2.4.6. Problem 2.4.5 can be solved in space O(n log^{d−1} n) and time O(log^{d−1} n) with a precomputation step (independent of a and b) taking time O(n log^{d−1} n). Moreover, if the restrictions are imposed only on k ≥ 2 coordinates, the search takes time O(log^{k−1} n + d − k).

Proof. (sketch) Problem 2.4.5 is an almost straightforward generalization of problem 2.4.2. We can prove it by induction on d by generalizing the reasoning of the proof of theorem 2.4.3. To be precise, we should also account for the fact that a lexicographic comparison takes time O(d) in dimension d. We can rule out the d-factor by noting that the asymptotic estimates are dominated by the cost of the O(log^{d−2} n) instances of the two-dimensional problem: instead of keeping the points distinct by extending the lexicographic ordering as proposed in the proof of theorem 2.4.3, we consider only the first two coordinates (exactly as if we were to solve problem 2.2.4) and, whenever two points coincide, we put them in the same multinode. That is no different from what one has to do to handle repeated keys with structures that do not support them natively.

If the restrictions are imposed only on k ≥ 2 coordinates, we will need to find only k 2-selections, because in the other d − k cases we will immediately choose the group that contains all the points (in the subspace we are in at that moment).


    Chapter 3

    Dynamic Indexing

Making a proximity graph dynamic is not a trivial task. We want to be able to insert new points into it and delete old points from it in an efficient way and, above all, without adversely affecting the time needed to perform a search.

For the sake of clarity, we will analyze proximity graphs built upon group decompositions. That will make the analysis and the pictures much clearer. (Remember that a proximity graph can be built upon a group decomposition and the two structures can coexist.)

    The first thing we are going to do is relax the proximity graph.

Definition 3.0.7. Let G = (g_{2^0}, g_{2^1}, g_{2^2}, ..., g_{2^n}) be a group decomposition. A group in g_i is called a group of level i. If p is a group of level i and q is both a group of level i/2 and a subsequence of p, then q is a subgroup of p.

Definition 3.0.8. If two groups are subgroups of level l of the same group, they are cogroups of level l.

Definition 3.0.9. A relaxed group decomposition is a group decomposition where groups of level greater than 1 may have any positive number of subgroups.

Definition 3.0.10. A relaxed proximity graph P is a proximity graph such that from each point in P may start two or more arrows of the same level. Specifically, for each valid level l, for each point p in P, and for each cogroup c of level l of p, p points to its biggest brother of level l in c.

Definition 3.0.11. Let P be a proximity graph and a an arrow connecting two points p and q in P. The length of a is equal to the level of the two cogroups which contain p and q.

    3.1 Splitting and Fusion

Whenever a group g of level i in a proximity graph contains four subgroups, we split g into two groups g_1 and g_2 of level i, each of which takes two of the four subgroups of g. On the contrary, when two adjacent cogroups together contain less than four subgroups, we should join them by performing a fusion.

Can we perform splittings and fusions efficiently? I have not had much time to think about it, but here are some ideas. The obvious solution is a non-solution


(because we change the problem): if our space is the Cartesian product of metric spaces (not necessarily bounded), then we can build the proximity graph by connecting the points with arrows whose length is directly proportional to the distance (along a single axis) of the points to be connected. Because the spaces may be unbounded, we should determine a unitary length l based on the first two points we receive and then let the other arrows be of length 2^i l, for some i ∈ Z.

Another possible solution is that of relaxing the proximity graph as much as possible and performing some rebalancing step each time we add or delete a point. There is a little problem here: a group of level 1 can have many points, so many steps may be necessary to handle it. We can overcome this obstacle by (conceptually) perturbing the points so that there are no points on the same axis-aligned line. Note that this does not alter our asymptotic estimates. That way the work needed to rebalance the proximity graph should be proportional to the number of points added and deleted.

You might have noticed that something is missing here: how do we insert and delete points? If we cannot perform these operations efficiently, then there is no point in talking about splitting and joining groups.

Definition 3.1.1. A relaxed proximity graph is balanced if no splittings or fusions can be performed on it.

    3.2 Point Insertion and Deletion

Definition 3.2.1. A proximity graph is compact if, for each point p and for each level l, the biggest brother of level l less than p can be found in time O(1).

Theorem 3.2.2. If P is a balanced relaxed proximity graph, a point p can be inserted into P in time O(log² n).

Proof. (sketch) For each valid level l, the number of arrows of length l that point to p is O(n). We can handle the arrows one level at a time. Let us see how we can handle the arrows of length l. After p is (at least conceptually) inserted in a group g of level l, in the most general case p is preceded by a point q_1 and followed by a point q_2, such that q_1 < p < q_2. All the arrows of level l that arrive at p start from points contained in some cogroup of g. Because q_1 < p < q_2, there could be some arrows that still point to q_1 but should now point to p. Look at figure 3.1. Let R = (r_1, r_2, ..., r_n) be the sequence of the points that point to q_1 or to q_2 with arrows of level l, where r_i < r_j ⟺ i < j. Note that only the arrows that point to q_1 may need to be redirected to p. Let R′ be the longest connected subsequence of R such that for each r ∈ R′ we have that p < r < q_2. It is clear that all (and only) the points in R′ must stop pointing to q_1 and start pointing to p. For each point t, let to_l(t) be the sorted sequence of all the points that point to t with arrows of length l. All we have to do is split to_l(q_1) into two appropriate subsequences s_1 and s_2 and let to_l(q_1) = s_1 and to_l(p) = s_2. Let us note that, consequently, the deletion of a point can be carried out by joining two sorted subsequences.

Let us see how all this can be implemented. For each point p and each valid level l, we associate to p a tree T_{p,l} which contains a reference to all the points that point to p with arrows of length l. Let us go back to our q_1 and p.


Figure 3.1: This figure shows a small portion of a proximity graph into which a point p has just been inserted. Let us assume that q_1 < p < q_2 and that all the arrows in the figure are of the same level. It is clear that, in this case, r_4, r_5 and r_6 must stop pointing to q_1 and start pointing to p.

As you can see by looking at figure 3.2, the points r_1, r_2, ..., r_n point both to q_1 and to the nodes in T_{q_1,l} that represent them (or directly to other elements r_i contained in the tree: it depends on the type of tree used). Figure 3.3 shows what happens when T_{q_1,l} is split. Note that, after the splitting, the points in T_{p,l} still point to q_1. Because no direct pointers to q_1 need to be modified, the splitting and the rebalancing can be performed in time O(log n).

    We have two ways of determining where a point points to:

(i) we can follow its direct link in time O(1), or

(ii) we can reach the root of the tree it belongs to in time O(log n) and then see to which point the tree is associated in time O(1).

The last method always gives the correct answer but is slower, so the former method is always tried first. Let us say we want to determine which point r_3 points to. We first follow the direct link of r_3 which, in our example, leads us to q_1. We then check whether r_3 is in T_{q_1,l} in time O(1) (note that we just have to look at the last element of the tree). Because r_3 is not in T_{q_1,l}, we start moving toward the root of T_{p,l} and, for each node s we visit along the path, we check whether s points directly to the right node p. If this is the case, we are done; otherwise we move on to the next parent. After we have determined the node p one way or the other, we make all the nodes we have visited point directly to p. This is called path compression and greatly speeds up successive searches.
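The direct links plus trees behave much like a union-find forest. The sketch below (mine, under the assumption that the correctness of a link can be tested in O(1), e.g. by looking at the last element of the tree the link leads to) isolates the path-compression step:

    class Node:
        # A node of some tree T(t, l). `direct` caches the point this
        # node's element currently points to; it may be stale after a
        # tree has been split. `parent` is None at the root, whose
        # direct link is assumed to be always correct.
        def __init__(self, direct, parent=None):
            self.direct = direct
            self.parent = parent

    def points_to(node, link_is_correct):
        # Determine which point `node` really points to, rewriting
        # the links of all visited nodes (path compression).
        path = []
        cur = node
        while not link_is_correct(cur):
            path.append(cur)
            cur = cur.parent
        for n in path:
            n.direct = cur.direct
        return cur.direct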

Because we must redirect arrows of O(log n) different lengths, we need time O(log² n).

We note that P is compact if and only if all the direct links are correct. Since P is balanced, the arrows that start from p are never more than O(log n) and we can decide to which elements they must point by first searching for the biggest point less than p in P and then determining the other points by consulting P as usual. If the proximity graph is compact we can find all those elements in time O(log n), otherwise we need time O(log² n). Let b_1, b_2, ..., b_k be the O(log n) elements to which p must point. For each i ∈ {1, 2, ..., k}, we must insert p in one of the trees associated to b_i, therefore the total time needed is O(log² n).

Remark 3.2.3. Of course, in a real implementation we will associate a tree T_{p,l} of level l to a point p only if p is pointed to by a big enough number of brothers of level l.


Figure 3.2: This figure shows the tree T_{q_1,l} of level l associated to the point q_1. This tree is used to handle all the arrows of level l that point to q_1. In the figure, the points r_1, r_2, ..., r_7 are the points from which the arrows of level l start.


Figure 3.3: This figure shows the trees of level l associated to the points q_1 and p. Before the point p was inserted in the proximity graph, the situation was that shown in figure 3.2. Here the original tree T_{q_1,l} has been split into the new tree T_{q_1,l} and the tree T_{p,l}, so that now the points r_3, r_4 and r_6 point to p (see the proof of theorem 3.2.2).
