A Survey of Recent Multidimensional Access Methods


Jayendra Venkateswaran, University of Missouri-Rolla

Abstract

Indexing spatial data has been a major area of research for the past two decades. A direct mapping of objects from multi-dimensional space to one-dimensional space that preserves spatial proximity does not exist. This has led researchers to develop many Multidimensional Access Methods for efficient processing of spatial information in large databases. This paper examines various spatial indexes proposed in the literature and presents a taxonomy of these structures. Each structure is reviewed through a brief summary, a comparison with similar structures, its characteristics, and algorithms for various operations. Finally, a comparative analysis of all of these structures is presented.

1. Introduction

Spatial data is data associated with coordinates in a single- or multi-dimensional space. Spatial database systems have been gaining importance in various industries and research areas over the past decade. Spatial data is used in many applications such as Cartography, Computer Vision and Robotics, and Scientific and Temporal databases. Spatial databases are collections of spatial objects like points, lines and high-dimensional objects. Spatial database systems need to integrate data obtained from various sources and by different means and, whenever required, have to support the analysis and processing of the stored data. Due to the large volume of spatial data and their complex structures and relationships, spatial operations are more expensive than conventional operators like join and select. The efficiency of the operations on a structure



structures which are efficient for data spaces of fewer dimensions, the performance degrades when extended to higher dimensions. This is known as the Curse of Dimensionality. In high-dimensional data spaces, [Bohn01] classifies the effects as follows:

Geometric Effects: As the dimension increases, the volume of (hyper)cubes and (hyper)spheres increases. The volume of a cube in a d-dimensional space with edge length e is given by $V = e^d$, which grows exponentially with d for e > 1.

Effects of Partition: As the dimension increases, the space partitioning becomes coarser.

Database Environment: The query distribution is affected as the dimensionality of the data space increases.



2. Classification

Multi-dimensional Access Methods can be classified into Point Access Methods (PAM) and Spatial Access Methods (SAM) [Gaed97]. PAMs were designed to perform operations on databases of spatial objects which do not have spatial extension, whereas SAMs perform operations on spatial objects like lines, polygons and higher-dimensional objects.

2.1 Point Access Method

Several classifications of PAM under different categories can be found in [Same90] and [Gaed97]. In [Gaed97] PAMs are classified into two categories: Multi-dimensional Hashing Methods and Hierarchical Access Methods. Multi-dimensional hashing methods such as the grid file ([Hinr83], [Sevc94]) use one-dimensional hashing to represent multi-dimensional objects, using different heuristics to preserve the spatial proximity of the objects. Hierarchical methods like Quadtree ([Bent74], [Garg82]), K-D-Tree [Bent75], and K-D-B-Tree [Robi81] use hierarchical data structures to store the point data. Space-filling curves ([Bial69], [Saga94], [Falo89], [Same90]) are often used to preserve spatial proximity during the linear ordering of the spatial objects. UB-Tree [Baye97] uses z-ordering to map objects into a one-dimensional sequence.
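As an illustration of how z-ordering linearizes a grid, the sketch below interleaves the bits of two non-negative integer coordinates into a single key. It is a minimal 2-D example only: the function name interleave_bits and the 16-bit coordinate width are assumptions made here, not details taken from [Baye97].

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Z-order (Morton) key of the grid cell (x, y)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # bits of x go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # bits of y go to odd positions
    return key


# Sorting points by their key yields a one-dimensional sequence in which
# cells that are close on the curve are usually close in space as well.
points = [(3, 3), (0, 1), (2, 0), (1, 1), (0, 0), (1, 0)]
ordered = sorted(points, key=lambda p: interleave_bits(*p))
```

The resulting keys can be stored in any one-dimensional index such as a B+-Tree, which is the general idea behind mapping objects into a one-dimensional sequence.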

2.2 Spatial Access Methods

These can be considered an extension of PAM for processing objects with spatial extent. Based on the classification techniques proposed in [Lome92] and [Seeg88], spatial access methods can be classified under the following categories:

Transformation Methods: One approach is to map spatial objects to points in a higher-dimensional space. The points are then stored using an existing PAM. But this approach does not preserve spatial proximity as the dimensionality increases. Another approach



The index structures can also be classified into two categories: Data Organizing Structures such as R-tree [Gutt84] and Space Organizing Structures such as Quadtree [Bent74] and Grid-Files [Sevc94]. Several surveys such as [Ahn01], [Bohn01], [Gaed97] and [Proc97] provide background and analysis of these methods. Several other surveys analyze specific classifications such as Tree-Based Index Structures ([Gunt91], [Brow98]) and structures for Spatial Information Processing ([Same95], [Kuba01], [Gunt90], [Guti94], [Widm97], [Kolo90]). An analysis, performance evaluation and comparison of some of these structures can be found in [Jain95], [Gree89], [Roge98] and [Webe98].

3. Taxonomy

The basic issues to be addressed when designing index structures for spatial information processing are storage utilization and fast information retrieval. Other quantities to be minimized when designing spatial index structures are: the area of the regions of a node, the overlap between regions, the number of objects duplicated to avoid overlaps, the directory size and the height of the tree. These factors determine the efficiency for many applications. No straightforward solution is available which addresses all of these issues. Other factors such as buffer size and design strategies, space allocation and concurrency control methods can also affect the performance of spatial information processing. Based on their basic structure, the index structures can be classified into the following categories: Tree-based Index Structures, Hashing Methods, Methods based on Space-filling Curves, Methods based on Distance-based Indexing and Signature Methods. The evolution of the index structures discussed in this paper is given in Figure B.

Tree-based structures partition the space into a manageable number of smaller subspaces, which are partitioned further and so on. Early structures such as Binary-Tree [Knut73], K-D-Tree [Bent75], B-Tree [Baye72] and B+-Tree [Come79] were designed for data based on primary keys. Database applications involve searching on one or more of the



Figure B: Index Structures Taxonomy



applications where the dimensions can be ordered by significance and there exist feature vectors that allow a shift in dimensions. It is applicable to real data that are subject to the Karhunen-Loeve Transform. SS-Tree [Jain96] is an index structure designed for similarity indexing of multi-dimensional data. It is an improvement of R*-Tree, but uses bounding spheres instead of bounding rectangles and a modified forced reinsertion. Unlike R*-Tree, SS-Tree reinserts entries unless a reinsertion has already been made at the same node. SR-Tree [Kata-97] can be regarded as a combination of SS-Tree and R*-Tree. It uses the intersection of the bounding sphere and the bounding rectangle, and hence outperforms both SS-Tree and R*-Tree. The size of a directory entry is increased significantly by this approach.

G-Tree [Kuma94] combines the properties of B-Tree [Baye72] and Grid file [Sevc94]. It is a balanced index structure and divides the data space into non-overlapping regions. Here the position of each node identifies its corresponding region directly. So even though the splitting procedure is more restrictive than that of the K-D-B-Tree [Robi81], G-Tree has the advantage of higher storage utilization. MB+-Tree [Yang95] is a multidimensional B+-Tree [Come79]. It partitions the data space into disjoint rectangular regions, like the G-Tree. The regions are ordered and the tree is balanced. The number of levels in the tree is reduced, thus reducing the search time. PK-Tree [Wang98] combines the properties of the PR-quadtree [Same90] and K-D-Tree [Bent75] and removes unnecessary nodes. It has better performance results compared to methods such as R-tree, SR-Tree and X-Tree.

Grid file ([Hinr83], [Sevc94], [Hinr85]) is a hashing-based access method which is a variation of the grid method. Its goal is to retrieve objects with at most two disk accesses. This is done by using a grid directory consisting of grid blocks. All records in one grid block are stored in the same bucket. [Beck92] and [Regnier85] provide a theoretical analysis of the grid file and its variants. Buddy Tree [Seeg90] is a tree-structured dynamic hashing method. The leaves point to the data pages. It uses k-d-tries [Oren82] to partition the



distance function used depends on the application. Its goal is to reduce the search time along with the number of distance computations.

Vector-Approximation File or VA-File [Blot98] is based on the concept of Signature Methods ([Falo85], [Falo87]). Here the vector space is partitioned into cells, and these cells are used to generate a bit-encoded approximation of each vector. The VA-File is the flat array of these approximations. It addresses the curse of dimensionality for spatial objects in high-dimensional spaces. A-Tree [Saku00] is based on the concepts of VA-File and SR-Tree. It uses Virtual Bounding Rectangles (VBR), which contain MBRs by approximating the data objects. It achieves better performance than VA-File and SR-Tree.
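The idea of a bit-encoded approximation can be sketched as follows. This is an illustrative simplification rather than the scheme of [Blot98]: it assumes coordinates normalized to [0, 1], a uniform grid with the same number of bits per dimension, and the hypothetical names approximate and build_va_file.

```python
from typing import List, Sequence


def approximate(vector: Sequence[float], bits_per_dim: int = 2) -> int:
    """Pack the cell number of every coordinate into one integer code."""
    cells = 1 << bits_per_dim                        # cells per dimension
    code = 0
    for value in vector:
        cell = min(int(value * cells), cells - 1)    # 1.0 falls into the last cell
        code = (code << bits_per_dim) | cell
    return code


def build_va_file(vectors: Sequence[Sequence[float]],
                  bits_per_dim: int = 2) -> List[int]:
    """The flat array of approximations, one code per vector."""
    return [approximate(v, bits_per_dim) for v in vectors]


va_file = build_va_file([[0.1, 0.8], [0.6, 0.3]])
```

A query would first scan this compact array to discard vectors whose cells cannot qualify and only then read the remaining full vectors.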

4. Multi-dimensional Access Methods

4.1 Tree-based Index Structures

Binary Search Tree is a basic data structure for representing data objects whose index values are in linear ordering. The concept of partitioning the data space recursively has been adopted and generalized in many sophisticated index structures. Early structures such as Binary-Tree [Knut73], K-D-Tree [Bent75], B-Tree [Baye72] and B+-Tree [Come79] were designed based on the primary key of the data objects. K-D-Tree, B-Tree and B+-Tree were the basic structures on which all tree-based index structures were developed. K-D-B-Tree [Robi81] is based on K-D-Tree and B-Tree, combines the advantages of these two structures and uses only a single attribute value as a boundary. In this section, we will examine indexes based on these structures.

4.1.1 R-Tree

Indexing methods such as ISAM, B-Tree and its variants can index only one-dimensional



Fig 1a: R-Tree



4.1.1.2 Salient Features

R-Tree uses the Minimum Bounding Rectangle (MBR) as its minimum bounding box. The MBR of an object is the smallest rectangle containing it. Entries in the leaf nodes are of the form [MBR, Record_Pointer] and entries in non-leaf nodes are of the form [MBR, Child_Pointer]. Let M be the maximum number of entries possible in a node, let m = M/2 represent the minimum number of entries in a node and let d be the number of dimensions. The lower bound m prevents the degeneration of trees and ensures efficient storage utilization. If the number of entries in a node falls below m, the node is deleted and the rest of its entries are distributed among the sibling nodes. The upper bound M can be determined from the size of the disk page. By storing the bounding boxes of geometric objects such as points, polygons or more complex objects, R-Trees can be used to determine which objects intersect a given query region.
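A minimal sketch of an MBR in d dimensions is given below, assuming objects are described by a list of corner points. The class name MBR and its methods are illustrative choices made here, not part of [Gutt84].

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class MBR:
    low: Tuple[float, ...]    # smallest coordinate in every dimension
    high: Tuple[float, ...]   # largest coordinate in every dimension

    @staticmethod
    def of_points(points: Sequence[Sequence[float]]) -> "MBR":
        """Smallest axis-parallel rectangle containing all given points."""
        dims = range(len(points[0]))
        return MBR(tuple(min(p[d] for p in points) for d in dims),
                   tuple(max(p[d] for p in points) for d in dims))

    def intersects(self, other: "MBR") -> bool:
        """Two rectangles intersect if their intervals overlap in every dimension."""
        return all(l1 <= h2 and l2 <= h1
                   for l1, h1, l2, h2
                   in zip(self.low, self.high, other.low, other.high))


triangle = [(1, 1), (4, 2), (3, 5)]
box = MBR.of_points(triangle)
```

Because the MBR only approximates the object, an intersection with the query region identifies a candidate; the exact geometry still has to be checked afterwards.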

R-Tree has the following properties:

Except for the root node, all intermediate and leaf nodes have between m and M index entries.

For each entry in a leaf node, the MBR is the smallest rectangle that spatially contains the d-dimensional object pointed to by the Record_Pointer.

For each entry in a non-leaf node, the MBR is the smallest rectangle that spatially contains the objects in the subtree pointed to by the Child_Pointer.

Unless it is a leaf node, the root node has at least two children.

All leaves appear at the same level.

As each node has at least m entries, the height of an R-Tree of N objects can be at most $\lceil \log_m N \rceil - 1$. The maximum number of nodes is $\lceil N/m \rceil + \lceil N/m^2 \rceil + \cdots + 1$ and the worst-case space utilization for all nodes, except the root, is $m/M$. Nodes will tend to have more than m entries, which will decrease the height of the tree and improve space utilization
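The two bounds can be evaluated with integer arithmetic, as in the sketch below. It assumes N > 1 and m >= 2; the function names max_height and max_nodes are chosen here for illustration.

```python
def max_height(n_objects: int, m: int) -> int:
    """Upper bound ceil(log_m N) - 1 on the height, without floating point."""
    levels, capacity = 0, 1
    while capacity < n_objects:      # smallest 'levels' with m**levels >= N
        capacity *= m
        levels += 1
    return levels - 1


def max_nodes(n_objects: int, m: int) -> int:
    """Upper bound ceil(N/m) + ceil(N/m^2) + ... + 1 on the number of nodes."""
    total, power = 0, m
    while True:
        count = -(-n_objects // power)   # integer ceiling of N / m^k
        if count <= 1:                   # the final term 1 is the root
            return total + 1
        total += count
        power *= m


height_bound = max_height(1_000_000, 50)
node_bound = max_nodes(1_000_000, 50)
```

For one million objects and m = 50 the height is bounded by 3 and the number of nodes by 20000 + 400 + 8 + 1 = 20409.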



Step 1: If N is a leaf node, for each entry E whose E.MBR intersects the query region Q, the object pointed to by E.Object_Pointer is retrieved.

Step 2: If N is a non-leaf node, for each entry E whose E.MBR intersects Q, Search(E.Child_Pointer, Q) is invoked.
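The two steps above can be sketched as a short recursive function. This is an in-memory illustration for 2-D rectangles; the Node and Entry classes and the tuple layout (xmin, ymin, xmax, ymax) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

Rect = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Entry:
    mbr: Rect
    child: Any = None    # child Node, used by non-leaf entries
    obj: Any = None      # object reference, used by leaf entries


@dataclass
class Node:
    is_leaf: bool
    entries: List[Entry] = field(default_factory=list)


def search(node: Node, query: Rect) -> List[Any]:
    results = []
    for e in node.entries:
        if intersects(e.mbr, query):
            if node.is_leaf:
                results.append(e.obj)                      # Step 1
            else:
                results.extend(search(e.child, query))     # Step 2
    return results


leaf1 = Node(True, [Entry((0, 0, 1, 1), obj="a"), Entry((2, 2, 3, 3), obj="b")])
leaf2 = Node(True, [Entry((5, 5, 6, 6), obj="c")])
root = Node(False, [Entry((0, 0, 3, 3), child=leaf1),
                    Entry((5, 5, 6, 6), child=leaf2)])
```

Since sibling MBRs may overlap, the loop may descend into several children of the same node, which is why more than one path can be traversed.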

Insertion: Inserting an object into an R-Tree involves inserting its MBR into the tree along with a reference to the object. Only one path of the tree is traversed and the new entry is inserted at a leaf node. At each intermediate node the insertion algorithm descends by selecting the entry which requires the least enlargement to include the new object. The object is then inserted into the leaf node. Two heuristics have to be defined to handle the insert operation: how to choose a suitable region for the insertion and how to manage overflows. An overflow occurs when the number of entries in a node becomes greater than M after inserting the new entry. Overflows are generally handled by splitting the node.

Area-Enlargement(E) = Area(E including the object to be inserted) - Area(E without the new object).

If there is no overlap, the insertion algorithm finds a leaf node in $O(\log_m N)$ time.

Insert(N,O)

Input: Let N be the root of the R-Tree and O be the object to be inserted.

Output: New R-Tree with O inserted.

Step 1: If N is a non-leaf node, the entry E whose MBR needs the least enlargement to include O is selected. Ties are resolved by selecting the entry with the smallest area, and then Insert(E.Child_Pointer, O) is invoked.
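The selection rule of Step 1 can be sketched as follows for 2-D rectangles. The helper names area, union, area_enlargement and choose_entry are illustrative and not taken from the source.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle enclosing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def area_enlargement(mbr: Rect, obj: Rect) -> float:
    """Area of the MBR including the object minus its area without it."""
    return area(union(mbr, obj)) - area(mbr)


def choose_entry(mbrs: List[Rect], obj: Rect) -> int:
    """Index of the entry needing least enlargement; ties go to the smallest area."""
    return min(range(len(mbrs)),
               key=lambda i: (area_enlargement(mbrs[i], obj), area(mbrs[i])))


picked = choose_entry([(0, 0, 4, 4), (5, 5, 6, 6)], (5, 4, 6, 5))
tie = choose_entry([(0, 0, 4, 4), (0, 0, 2, 2)], (1, 1, 1.5, 1.5))
```

In the first call the second entry grows by only 1 unit of area compared with 14 for the first, so it is selected; in the second call neither entry grows and the smaller one wins the tie.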



Deletion: Deletion in an R-Tree requires an exact match query for the object. Due to possible overlaps, deletion is not a local operation. The deletion algorithm searches the R-Tree to find the leaf node which holds the object to be deleted, and the entry is removed from that leaf node. An underflow occurs when the number of entries in a node becomes less than m after deleting an entry. If the deletion causes an underflow, the entries of the node are stored in an array Q. This process is repeated recursively on the parents of the node and continued until the root is reached. Then the entries stored in the array are reinserted into the tree. The orphaned entries are merged with the sibling entries and intervals are adjusted. The principle of reinsertion is that an entry must be reinserted at the same level from which it was deleted.

Delete(N,O)

Input: R-Tree rooted at N and object to be deleted O.

Output: New R-Tree with O removed.

Step 1: Search(N, O.MBR) is invoked to get the leaf node L which contains O. If L is not found, the process is terminated.

Step 2: O is removed from L.

Step 3: If L is the root, then the entries in L are added to Q and step 8 is executed.

Step 4: If L is a non-root node, let P be the parent of L and E be its entry in P.

Step 5: If L has fewer than m entries, E is removed from P and the entries of L are added to Q.

Step 6: If L has at least m entries, E.MBR is adjusted in P.



Quadratic-Split Algorithm: This scheme picks two of the (M+1) entries to be the first elements of the two new groups, by choosing the pair that would waste the most area if both were added to the same group. Each of the remaining entries is added to the group which requires the least area enlargement. Ties are resolved by selecting the group with the smaller area and then the one with fewer entries.

Algorithm:

Input: Node N to be split

Output: Nodes N and N'

Step 1: For each pair of entries E1 and E2, where E is the enclosing rectangle of the two entries, d = Area(E) - Area(E1) - Area(E2) is calculated.

Step 2: The pair with the largest value of d is selected and assigned as the first elements of the two nodes N and N'.

Step 3: If all the entries have been assigned, the process stops.

Step 4: If a node has so few entries that it requires all of the remaining entries, the remaining entries are assigned to that node and the process is terminated.

Step 5: For each unassigned entry E, d1 and d2, the area increases required by N and N' respectively to include E, are calculated.

Step 6: The entry E which has the largest difference between d1 and d2 is selected to be the next entry to be assigned.

Step 7: E is added to the node which requires the least enlargement to include it. Ties are resolved by using the criteria smaller area and then fewer entries.
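The seven steps can be sketched as follows for 2-D rectangles. This is a simplified in-memory version that splits a plain list of rectangles; the function name quadratic_split and the parameter m (the minimum number of entries per node) follow the notation of the text, while the helper names are assumptions.

```python
from itertools import combinations
from typing import List, Tuple

Rect = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def quadratic_split(entries: List[Rect], m: int) -> Tuple[List[Rect], List[Rect]]:
    # Steps 1-2: the pair wasting the most area seeds the two groups.
    i, j = max(combinations(range(len(entries)), 2),
               key=lambda p: area(union(entries[p[0]], entries[p[1]]))
               - area(entries[p[0]]) - area(entries[p[1]]))
    g1, g2 = [entries[i]], [entries[j]]
    box1, box2 = entries[i], entries[j]
    rest = [e for k, e in enumerate(entries) if k not in (i, j)]

    def growth(e: Rect) -> Tuple[float, float]:
        return (area(union(box1, e)) - area(box1),
                area(union(box2, e)) - area(box2))

    while rest:                                   # Step 3
        # Step 4: a group that needs all remaining entries to reach m gets them.
        if len(g1) + len(rest) <= m:
            g1 += rest
            break
        if len(g2) + len(rest) <= m:
            g2 += rest
            break
        # Steps 5-6: pick the entry with the largest difference d1 - d2.
        e = max(rest, key=lambda r: abs(growth(r)[0] - growth(r)[1]))
        rest.remove(e)
        d1, d2 = growth(e)
        # Step 7: least enlargement, then smaller area, then fewer entries.
        if (d1, area(box1), len(g1)) <= (d2, area(box2), len(g2)):
            g1.append(e)
            box1 = union(box1, e)
        else:
            g2.append(e)
            box2 = union(box2, e)
    return g1, g2


group1, group2 = quadratic_split(
    [(0, 0, 1, 1), (1, 1, 2, 2), (8, 8, 9, 9), (9, 9, 10, 10), (0, 1, 1, 2)], m=2)
```

In the example the two far-apart corner rectangles become the seeds, and the remaining rectangles join whichever group they enlarge least.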



Step 2: The separations are normalized by dividing by the width of the entire set along the corresponding dimension.

Step 3: The pair with the greatest normalized separation along any dimension is selected.

R-Tree is completely dynamic: insertions and deletions can be intermixed with queries and no periodic global reorganization is required. It is based on the heuristic optimization of the area of the enclosing rectangles in the inner nodes. Since the structure allows the bounding rectangles of different entries to overlap one another, the search algorithm may have to traverse more than one path to find the desired data. Minimizing the overlap between sibling nodes is therefore an important issue for search performance in an R-Tree. It was proposed to index data objects of non-zero size in high-dimensional spaces, but its index structure can easily be adapted to indexing multi-dimensional points with some small modifications to its insertion and search algorithms.

4.1.2 R+-Tree

Efficient R-Tree [Gutt84] search requires minimal coverage and overlap. Since it is very hard to control overlap during the dynamic splits of R-Trees, search performance may degrade from logarithmic to linear. R+-Tree [Sell87] is an overlap-free variant of R-Tree. It is a compromise between R-Tree and the K-D-B-Tree [Robi81]. R+-Tree was proposed to overcome the problem of overlapping covering rectangles in the internal nodes of the R-Tree. It follows the concept that if partitions are allowed to split rectangles, then zero overlap among intermediate nodes can be achieved. Avoiding overlap increases the height of the tree, but has the benefit of multiple shorter paths.

4.1.2.1 R+-Tree Vs R-Tree and K-D-B-Tree



Figure 2a: R+-Tree



It has the following properties:

For each entry of the form (Child_Pointer, MBR) in an intermediate node, the subtree rooted at the node pointed to by Child_Pointer contains a rectangle R if and only if R is covered by MBR.

The overlap between any two entries in an intermediate node is zero.

The root has at least two children unless it is a leaf.

All leaves are at the same level.

4.1.2.3 Operations

Search: The search algorithm is similar to that of R-Tree. The main difference from R-Tree is the non-overlapping regions. The search space is decomposed into disjoint sub-regions, and for each of those the tree is traversed until the data objects that lie in the query region are found in the leaves.

Search(N,Q)

Input: R+-Tree rooted at N and query region Q.

Output: Objects in the query region Q.

Step 1: If N is a non-leaf node, for each entry E whose MBR overlaps Q, Search(E.Child_Pointer, Q) is invoked.

Step 2: If N is a leaf node, for each entry E which intersects Q, the object pointed to by E.Object_Pointer is retrieved.

Insertion: During insertion three cases are to be considered [Gunt88, Ooib88]. The first case is when the covering rectangles of all entries do not intersect with the object to be inserted, and the second case is when the object intersects with the rectangles of all entries



Insert(N,O)

Input: Let N be the root of the R+-Tree and O be the object to be inserted.

Output: New R+-Tree with O inserted.

Step 1: If N is a non-leaf node, for each entry E whose E.MBR overlaps O.MBR, Insert(E.Child_Pointer, O) is invoked.

Step 2: If N is a non-leaf node and O does not overlap any entry, the entry E whose MBR needs the least enlargement to include O is selected. Ties are resolved by selecting the entry with the smallest area, and Insert(E.Child_Pointer, O) is invoked.

Step 3: If N is a leaf node, O is inserted into N. If N has more than M entries, Split(N) is called, which reorganizes the tree.

Step 4: If there is no overflow, the MBRs of the entries along the path are adjusted.

Deletion: First the objects that must be deleted are located, and then they are removed from the leaf nodes. During deletion more than one entry may have to be removed from the leaf nodes, as the insertion routine may have added entries to more than one leaf node. When nodes become underutilized due to many deletions, periodic reorganization is required. The entries in an underutilized node are reinserted at the top of the tree. During deletion the tree is traversed until the leaf node is reached. At each intermediate node, the entries whose MBRs overlap that of the object are selected to be traversed. At the leaf node, the entry is removed and the parent rectangle that encloses the remaining child rectangles is adjusted.

Delete(N,O)



Criteria used during partition are:

Nearest neighbor

Minimal total x- and y-displacement

Minimal total space coverage accrued by the two sub-regions

Minimal number of rectangle splits

The first three reduce the search effort by reducing the area of dead space, and the fourth controls the growth of the height. All four criteria cannot possibly be satisfied at the same time. In the algorithm the rectangles are sorted in all dimensions, so the complexity is of order $N \log N$. A sweep routine is used to scan the rectangles and identify points where space partitioning is possible. The fill-factor determines how densely populated the tree is. The more densely the tree is packed, the faster the search. So if the database is static, it is desirable to pack the tree to capacity.

Split(N):

Input: Node N of the R+-Tree to be split and fill-factor f.

Output: New reorganized R+-Tree after splitting N.

Step 1: Let Lx and Ly be the lowest x- and y-coordinates of the entries in N.

Step 2: Along each axis, the cost and location of the cut are determined from Steps 3 and 4.

Step 3: Starting from the lowest value along the dimension, the first f rectangles from the rectangles sorted along the axis are selected.

Step 4: The cost of organizing the rectangles along the axis is computed based on minimal splits, minimal coverage and other properties.



R+-Tree is a variant of R-Tree without any overlap among the intermediate nodes. It requires more space than R-Tree, as it may add entries for one object to more than one leaf node. The pack algorithm attempts to set up an R+-Tree with good search performance. The performance of R+-Tree is immune to changes in the distribution of segment sizes. However, when the number of large segments approaches the total number of segments, R+-Tree suffers from the many splits into sub-regions. The main advantage is the improved search performance in the case of point queries. It behaves like a K-D-B-Tree when the data consists of points.

4.1.3 R*-Tree

Minimizing both coverage and overlap is a difficult part of optimizing the performance of R-Tree. R*-Tree [Krei90] is one of the most successful variants of R-Tree. It is based on a careful study of the R-Tree algorithms under various data distributions and has the same structure as the R-Tree. R*-Tree introduces the margin of the covering rectangles as an additional optimization criterion. This criterion is based on the fact that clustering rectangles with little variance in the lengths of their edges tends to reduce the area of the cluster's covering rectangle. It is a dynamic structure, as insertions and deletions can be performed with no periodic reorganization. As in R-Tree, multiple paths may have to be traversed for search operations. R*-Tree is an efficient structure for both point and spatial objects. It introduces the concept of Forced Reinsertion, which forces part of the entries to be reinserted during insertion. This helps in achieving dynamic reorganization of the R*-Tree structure.

4.1.3.1 R*-Tree Vs R-Tree

R*-Tree, in addition to the area criterion of R-Tree, uses the margin and overlap of each enclosing rectangle. Minimizing area reduces the dead space and improves search performance. Storage utilization is improved by minimizing the height of the tree. It uses the concept of Forced Reinsertion when there is an overflow.



The structures of R-Tree and R+-Tree depend on the order of insertion of the data objects. R*-Tree uses the concept of Forced Reinsertion, which helps an object find a node where it can be inserted with improved performance in storage and search. The amount of overlap is reduced compared to R-Tree. It also avoids the identifier duplication found in R+-Tree [Sell87].

Figure 3a: R*-Tree



maximum number of entries possible in a node, let m = M/2 and d be the number of dimensions. The design of R*-Tree introduces a policy called forced reinsert. If a node overflows, it is not split immediately. Instead, the first p entries from the node are reinserted into the tree, where the parameter p can be varied.

R*-Tree has the following properties:

The root has at least two children, unless it is a leaf.

Every non-leaf node has between m and M children unless it is the root.

Every leaf node contains between m and M entries unless it is the root.

For each entry in a non-leaf node, the MBR is the smallest rectangle that spatially contains the rectangles in the subtree pointed to by the child node.

All leaves appear at the same level.

The criteria used for optimization are:

Minimizing the area covered: Minimizing the dead space improves performance, since decisions on which paths have to be traversed can be taken at higher levels.

Minimizing MBR overlap: This decreases the number of paths traversed.

Minimizing the margin of the MBR: The margin is the perimeter of the rectangle. By minimizing the margin, the rectangles will be shaped more quadratically (closer to squares). Queries with large quadratic query rectangles will profit from this optimization.

Optimizing storage utilization: Higher storage utilization reduces the query cost, as the height of the tree will be kept low.

4.1.3.3 Operations

Search: Given a query region, searching involves retrieving all data objects that overlap with the query region. At each level of the search the nodes whose MBRs overlap with the


entry requiring least are enlargement is selected. For other intermediate nodes, the entry requiring the least area enlargement to include the new object is selected. The main idea is that in non-leaf nodes the entry which needs the least area enlargement to include the new data is selected, whereas at the leaf level the entry which needs the least overlap enlargement is selected. In case of overflows, some entries are first reinserted, which can often avoid splitting.

If $E_1, \ldots, E_p$ are the entries in the node N, the overlap value of an entry $E_k$ is given by

$\mathrm{Overlap}(E_k) = \sum_{i=1,\, i \neq k}^{p} \mathrm{Area}(E_k.\mathrm{MBR} \cap E_i.\mathrm{MBR}), \quad 1 \le k \le p$
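The overlap value translates directly into code. The sketch below assumes 2-D rectangles stored as (xmin, ymin, xmax, ymax) tuples; the function names are illustrative.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)


def intersection_area(a: Rect, b: Rect) -> float:
    """Area of the intersection of a and b, or 0 if they are disjoint."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0


def overlap(k: int, mbrs: List[Rect]) -> float:
    """Sum of the intersection areas of entry k with all other entries."""
    return sum(intersection_area(mbrs[k], r)
               for i, r in enumerate(mbrs) if i != k)


node_mbrs = [(0, 0, 4, 4), (2, 2, 6, 6), (3, 0, 5, 1)]
```

The overlap enlargement used when choosing a subtree is the difference between this value computed with the entry enlarged to include the new object and the value computed before the enlargement.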

Insert(N,O)

Input: R*-Tree rooted at N and the object O to be inserted.

Output: New R*-Tree with O inserted.

Step 1: If N is a parent of leaf nodes, the entry E whose rectangle needs the least overlap enlargement to include O is selected. Ties are resolved by choosing the entry which needs the least area enlargement and then the one having the least area. Insert(E.Child_Pointer, O) is invoked.

Step 2: If N is any other non-leaf node, the entry E which needs the least area enlargement to include O is selected. Ties are resolved by choosing the entry with the least area. Insert(E.Child_Pointer, O) is invoked.

Step 3: If N is a leaf node, O is accommodated in N.

Step 4: If N overflows, then ReInsert(N) is invoked if N is not the root and Step 4 is being executed for the first time for the given object; otherwise Split(N) is invoked.



value is the sum of the perimeters of the MBRs of both groups and Overlap-value is the area of the intersection of the MBRs of the two groups.

For any two groups G1 and G2,

Margin-Value = Margin[MBR(G1)] + Margin[MBR(G2)]

Area-Value = Area[MBR(G1)] + Area[MBR(G2)]

Overlap-Value = Area[MBR(G1) ∩ MBR(G2)]
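The three values can be computed as in the following sketch for 2-D rectangles. The function split_values and its helpers are names introduced here for illustration.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)


def mbr(group: List[Rect]) -> Rect:
    """Minimum bounding rectangle of a group of rectangles."""
    return (min(r[0] for r in group), min(r[1] for r in group),
            max(r[2] for r in group), max(r[3] for r in group))


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def margin(r: Rect) -> float:
    """Perimeter of the rectangle."""
    return 2 * ((r[2] - r[0]) + (r[3] - r[1]))


def intersection_area(a: Rect, b: Rect) -> float:
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0


def split_values(g1: List[Rect], g2: List[Rect]) -> Tuple[float, float, float]:
    """Return (Margin-Value, Area-Value, Overlap-Value) of a distribution."""
    b1, b2 = mbr(g1), mbr(g2)
    return (margin(b1) + margin(b2),
            area(b1) + area(b2),
            intersection_area(b1, b2))


values = split_values([(0, 0, 2, 2), (1, 1, 3, 3)], [(2, 2, 5, 4)])
```

Here the first group is bounded by (0, 0, 3, 3) and the second by (2, 2, 5, 4), giving a margin-value of 12 + 10, an area-value of 9 + 6 and an overlap-value of 1.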

The entries are sorted along each axis. For each sort, (M-2m+2) distributions of the M+1 entries are possible, where in the k-th distribution the first group contains the first (m-1+k) entries and the second group contains the remaining entries. The axis which has the least sum of margin-values is chosen as the split axis, and the best distribution along that axis is selected. For each axis the entries have to be sorted, which requires O(M log M) time. For each axis the margin of 4*(M-2m+2) rectangles and the overlap of 2*(M-2m+2) distributions have to be calculated.

Split(N)

Input: Node N of the R*-Tree to be split.

Output: New reorganized R*-Tree.

Step 1: The entries are sorted first by the lower values and then by the upper values of their rectangles, along each axis.

Step 2: For each sort, all of the (M-2m+2) distributions are determined.

Step 3: Along each axis, the sum of the margin-values of all distributions is determined.
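Steps 1 to 3 can be sketched as below for 2-D rectangles. The sketch reads Step 1 as two separate sorts per axis (one by lower value, one by upper value), which is an interpretation made here, and it stops at the choice of the split axis. All function names are illustrative.

```python
from typing import Iterator, List, Tuple

Rect = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)


def mbr(group: List[Rect]) -> Rect:
    return (min(r[0] for r in group), min(r[1] for r in group),
            max(r[2] for r in group), max(r[3] for r in group))


def margin(r: Rect) -> float:
    return 2 * ((r[2] - r[0]) + (r[3] - r[1]))


def distributions(ordered: List[Rect],
                  m: int) -> Iterator[Tuple[List[Rect], List[Rect]]]:
    """Yield the M - 2m + 2 distributions of the M + 1 sorted entries."""
    big_m = len(ordered) - 1
    for k in range(1, big_m - 2 * m + 3):          # k = 1 .. M - 2m + 2
        yield ordered[:m - 1 + k], ordered[m - 1 + k:]


def margin_sums(entries: List[Rect], m: int) -> List[float]:
    """Sum of margin-values over all distributions, for the x and y axis."""
    sums = []
    for axis in (0, 1):
        total = 0
        for bound in (axis, axis + 2):             # lower values, then upper values
            ordered = sorted(entries, key=lambda r: r[bound])
            for g1, g2 in distributions(ordered, m):
                total += margin(mbr(g1)) + margin(mbr(g2))
        sums.append(total)
    return sums


def choose_split_axis(entries: List[Rect], m: int) -> int:
    """0 for the x axis, 1 for the y axis: the axis with the least margin sum."""
    sums = margin_sums(entries, m)
    return sums.index(min(sums))


overflowing = [(0, 0, 1, 1), (2, 3, 3, 4), (4, 0, 5, 1), (6, 3, 7, 4), (8, 0, 9, 1)]
axis = choose_split_axis(overflowing, m=2)
```

With M = 4 and m = 2 there are two distributions per sort; the margin sums are 128 for the x axis and 132 for the y axis, so the x axis is chosen.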



As a side effect, storage utilization is improved.

Due to more restructuring, fewer splits occur.

Since the outer rectangles of a node are reinserted, the shape of the MBRs will be more quadratic, which is a desirable property.

The CPU cost will increase as the insertion routine is called more often, but there will be fewer splits. The average number of disk accesses increases slightly if Forced Reinsert is applied to R*-Tree, but it improves the structure.

Reinsert(N)

Input: Node N which overflows during insertion and the number of entries p to be reinserted.

Output: Reorganized R*-Tree without a split of N.

Step 1: For all the M+1 entries in N, the distance between the center of their MBR and the center of the MBR of all the M+1 entries is computed.

Step 2: The entries are sorted in decreasing order of their computed distances.

Step 3: The first p entries are removed from N and its MBR is adjusted.

Step 4: For each entry E to be reinserted, Insert(Root, E) is invoked, where Root is the root of the R*-Tree.
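Steps 1 to 3 amount to selecting the p entries farthest from the center of the node, as in the sketch below. It covers only the selection, not the reinsertion of Step 4, and assumes 2-D rectangles; the function name pick_reinsert_entries is an assumption.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]   # (xmin, ymin, xmax, ymax)


def mbr(group: List[Rect]) -> Rect:
    return (min(r[0] for r in group), min(r[1] for r in group),
            max(r[2] for r in group), max(r[3] for r in group))


def center(r: Rect) -> Tuple[float, float]:
    return ((r[0] + r[2]) / 2, (r[1] + r[3]) / 2)


def pick_reinsert_entries(entries: List[Rect],
                          p: int) -> Tuple[List[Rect], List[Rect]]:
    """Return (entries to reinsert, entries kept in the node)."""
    cx, cy = center(mbr(entries))

    def squared_distance(r: Rect) -> float:        # Step 1
        x, y = center(r)
        return (x - cx) ** 2 + (y - cy) ** 2

    ordered = sorted(entries, key=squared_distance, reverse=True)   # Step 2
    return ordered[:p], ordered[p:]                                 # Step 3


removed, kept = pick_reinsert_entries(
    [(0, 0, 1, 1), (4, 4, 5, 5), (5, 5, 6, 6), (10, 10, 11, 11)], p=2)
new_node_mbr = mbr(kept)
```

Removing the two outlying rectangles shrinks the node's MBR from (0, 0, 11, 11) to (4, 4, 6, 6), which illustrates why reinsertion tends to produce more compact rectangles.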

Deletion: The deletion routine of R*-Tree is the same as that of the R-Tree. During deletion, the leaf node holding the entry to be deleted is determined using the search algorithm. Then the entry is removed from the leaf node. The MBRs of the entries along the path are updated. If the deletion causes an underflow in the leaf node, its entries are removed, the node is deleted, the tree is updated and all the removed entries are inserted again into the tree using the insertion routine.



Step 6: Let N = P and the process is repeated from Step 3.

Step 7: For each entry E in Q, Insert(N, E) is invoked.

R*-Trees are based on the reduction of the area, margin and overlap of the directory rectangles and effectively support point and spatial data at the same time. The implementation cost is only slightly higher than that of other R-Trees. It is robust against ugly data distributions. The average insertion cost is lower than that of the R-Tree. It differs from the R-Tree mainly in the insertion and node split algorithms. The algorithms for deletion and search remain the same. In the future, the fan-out can be reduced and R*-Trees can be generalized to handle polygons efficiently.

4.1.4 SS-Tree

Similarity indexing is required in many applications to facilitate efficient similarity queries on a dataset of typically high-dimensional feature vectors. For example, a query on a content-based image database could find pictures with similar color or texture. Similarity indexing has three main components: objects represented by high-dimensional feature vectors, querying of feature vectors based on one or more (dis)similarity measures, and different types of fundamental queries. Usually the knowledge of a domain expert is required for representing data objects as feature vectors. Query performance is more important than update performance in image and multimedia databases, but dynamic updating of the database should still be supported.

SS-Tree [Jain96] is an improvement of R*-Tree [Krei90] and has a configuration similar to that of R-Tree [Gutt84]. To avoid exhaustive searches and save space, all the elements in the feature vector should be used for indexing.



In high dimensional data space spheres is expected to generate better data groupings,which contribute to the data retrieval performance.

SS-Tree reinserts entries unless reinsert has been made at the same node or leaf,whereas in R*-Tree reinsertion takes place unless reinsertion has been made in hesame level. This promotes the dynamic reorganization of the structure.


Leaf nodes of SS-Tree contain entries of the form (Feature_Vector, Data), where Data holds the data for the leaf.

Non-leaf nodes contain entries of the form (Centroid, Radius, Child_Pointer). Centroid is the mean value of the feature vectors in the child node and Radius is the distance from the centroid to the outermost feature vector.

A d-dimensional feature space requires d+1 memory units to store an entry.

It divides points into isotropic neighborhoods, using bounding spheres to form regions of shorter diameter.

For each entry in a non-leaf node, the bounding sphere contains the entries in the sub-tree pointed to by the child pointer.

Each node, except the root node, has a minimum of m and a maximum of M entries.

Figure 4a shows the 2-dimensional representation of the high-dimensional feature space. The objects, which consist of both point and spatial data, are bounded by Minimum Bounding Spheres (MBSs), which are circles in the 2-dimensional case. MBSs S1, S2 and S3 cover the entire data space. The MBSs S4 to S10 in the intermediate nodes cover the data objects in the data space. It can be observed from the figure that multiple paths have to be traversed during search operations. For example, the exact match query for the object f has S2 → S7 → f and S1 as the search paths.
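The centroid and radius stored in a non-leaf entry can be illustrated with a short sketch. It assumes that the feature vectors of a child are available as equal-length tuples; the function name `bounding_sphere` is hypothetical and not taken from [Jain96].

```python
import math

def bounding_sphere(vectors):
    """Return (centroid, radius) for a non-empty list of feature vectors.

    The centroid is the per-dimension mean and the radius is the distance
    from the centroid to the outermost vector, as described in the text.
    """
    n = len(vectors)
    d = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / n for i in range(d)]
    radius = max(math.dist(centroid, v) for v in vectors)
    return centroid, radius

# Three 2-dimensional feature vectors chosen only for illustration.
centroid, radius = bounding_sphere([(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)])
```

Because the centroid is a mean rather than the center of the smallest enclosing sphere, the resulting sphere is generally not minimal, but it is cheap to maintain during insertions.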



Figure 4b: SS-Tree Structure

4.1.4.3 Operations

The algorithms for data search, insertion and deletion from R-Tree and R*-Tree can be used in SS-Tree with some modifications in the routines used for insertion and node-splitting.

Search: The search algorithms search regions in order of minimum distance from the query point until the query results are guaranteed correct to the required accuracy. Two priority queues are used: a search queue and a result queue. For similarity sampling, the same algorithm proposed for R-Tree can be used for SS-Tree, and it provides fast sampling using only the internal nodes. The search algorithm involves traversing from the root to the leaf nodes. At each level, the entries whose bounding sphere overlaps with the query region are traversed.



level of the tree, the entry whose centroid is closest to the feature vector of the object to be inserted is selected. Every entry traversed for insertion is updated by adjusting the values of its centroid and radius. Once the leaf node is reached, the object is added to it. If there is an overflow and the node's children have not already been reinserted, they are reinserted. If reinsertion cannot be applied during overflow of a node, the node is split.

For reinsertion, the entries are sorted in descending order of their distance from the centroid of the bounding sphere of the node. Then the first p entries of the list are selected for reinsertion. These p entries are removed from the node and the bounding sphere of the node is adjusted. Then the entries are inserted one by one into the tree.
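The selection of reinsertion candidates can be sketched as follows. This is only an illustration under the assumption that entries are plain feature vectors (tuples); the helper name `pick_for_reinsertion` is hypothetical.

```python
import math

def pick_for_reinsertion(entries, p):
    """Split entries into (to_reinsert, to_keep).

    Entries are sorted in descending order of distance from the node's
    centroid, and the first p of them are chosen for reinsertion.
    """
    n = len(entries)
    d = len(entries[0])
    centroid = [sum(e[i] for e in entries) / n for i in range(d)]
    ordered = sorted(entries, key=lambda e: math.dist(centroid, e), reverse=True)
    return ordered[:p], ordered[p:]

# One clear outlier among four 2-dimensional entries.
to_reinsert, to_keep = pick_for_reinsertion(
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)], p=1
)
```

After the p entries are removed, the centroid and radius of the node would be recomputed from the kept entries before the removed ones are inserted again from the root.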

Insert(N,O)

Input: SS-Tree rooted at N and the object to be inserted O.

Output: New SS-Tree with O inserted

Step 1: If the tree has only the root node, create a new node array and insert the new entry.

Step 2: If N is a Non-Leaf node, the entry E whose E.Centroid is closest to O.Feature_Vector is selected. E.Centroid and E.Radius are updated and Insert(E,O) is invoked.

Step 3: If N is a Leaf node, O is added to N.

Step 4: If N overflows and its entries have not already been reinserted, Reinsert(E) is invoked for each entry E in N.

Step 5: If N overflows and Reinsert cannot be applied, Split(N) is invoked.



Node-Split: During insertion, when there is an overflow and reinsertion cannot be applied, i.e. all the reinserted entries end up in the same node, the node has to be split. The split algorithm first determines the variance of the entries in each dimension and selects the dimension with the highest variance as the split dimension. Along this dimension, a split location is selected that minimizes the sum of the variances on both sides of the split. The entries on both sides of the split location are assigned to two new nodes. If the root gets split, a new root array is allocated and the two new nodes are entered in it; otherwise, of the two new nodes, the one which is closest to the parent is retained in the parent node and the other node is reinserted.
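A minimal sketch of the variance-based split is given below. It assumes that entries are represented by their centroids (tuples), that the variance on each side is measured along the split dimension only, and that each side must receive at least `min_fill` entries; these details and the name `choose_split` are assumptions made for illustration.

```python
def variance(values):
    """Population variance of a non-empty list of numbers."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def choose_split(centroids, min_fill):
    """Return (split_dimension, left_group, right_group)."""
    if len(centroids) < 2 * min_fill:
        raise ValueError("not enough entries to satisfy min_fill on both sides")
    d = len(centroids[0])
    # The dimension with the highest variance becomes the split dimension.
    dim = max(range(d), key=lambda i: variance([c[i] for c in centroids]))
    ordered = sorted(centroids, key=lambda c: c[dim])
    best_cost, best_cut = None, None
    # Try every cut position that leaves min_fill entries on each side.
    for cut in range(min_fill, len(ordered) - min_fill + 1):
        left = [c[dim] for c in ordered[:cut]]
        right = [c[dim] for c in ordered[cut:]]
        cost = variance(left) + variance(right)
        if best_cost is None or cost < best_cost:
            best_cost, best_cut = cost, cut
    return dim, ordered[:best_cut], ordered[best_cut:]

points = [(1, 5), (2, 5), (3, 6), (10, 5), (11, 6), (12, 5)]
dim, left, right = choose_split(points, min_fill=2)
```

In this example the first coordinate has by far the larger spread, so it is chosen as the split dimension and the two natural clusters end up in separate groups.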

Split(N)

Input: Node N whose entries are to be split.

Output: New Re-organized SS-Tree.

Step 1: For the entries in N, the dimension with the highest variance is selected as the split dimension.

Step 2: The split location is selected so as to minimize the sum of the variances of each side of the split.

Step 3: Two new nodes, E1 and E2, are created and the split elements are assigned to them.

Step 4: If N is a root node, a new root array is allocated and E1 and E2 are written to it.

Step 5: If N is not a root node and P is the parent node of N, then of the two new nodes E1 and E2, the one which is closest to P is retained in P and Reinsert() is invoked for the other.



Step 3: If there is any underflow in N, all of the remaining entries are removed, the tree is updated and Reinsert(N) is invoked.

SS-Tree is an indexing structure created for the main purpose of similarity indexing. In SS-Tree less information is needed to store the bounding spheres, which results in a larger fan-out. The diameters of the bounding spheres are insensitive to dimensionality, which improves the query performance. Bounding spheres occupy more volume than bounding rectangles. As the dimensionality increases, more overlap between the bounding spheres affects the query performance. It is better suited for approximate queries than R*-Tree. For higher dimensional data, SS-tree provides faster query performance than R*-tree. It requires significantly less CPU time to insert elements because it uses a linear algorithm, in contrast to the others. Its storage utilization is greater than that of R*-tree.

4.1.5 SR-Tree

A feature vector is extracted from image characteristics like hue, saturation, intensity and texture and stored in the database along with the images. A set of images similar to a particular image can be retrieved by searching for feature vectors close to that of the given image. SS-Tree was proposed for such similarity queries based on feature vectors. It performs better than the structures R*-Tree and K-D-B-Tree. Sphere/Rectangle-Tree or SR-Tree [Kata97] can be regarded as a combination of the R*-Tree and SS-Tree and outperforms them. It uses the intersection solid between a rectangle and a sphere as the bounding region.

Both bounding rectangles and spheres have their own merits and demerits.

Bounding Rectangles divide points into small volume regions, but have larger diameters.




Employs both bounding spheres and bounding rectangles.

Specifies a region by the intersection of the bounding sphere and the bounding rectangle.

Leaf nodes contain entries of the form (Feature_Vector, Data), where Data holds the data for the leaf.

Nodes contain between m and M entries, where m <= M/2.

Non-leaf nodes contain entries of the form (Centroid, Radius, MBR, n, Child_Pointer), where MBR is the minimum bounding rectangle and n is the total number of data entries stored in the sub-tree pointed to by Child_Pointer.

Figure 5a: SR-Tree



SR-Tree structure for the running example is shown in Figure 5a. Its structure is based on that of R-Tree [Gutt84]. In the figure, the boxes indicate the bounding rectangles of the objects they enclose and the circles represent the bounding spheres. The intersection of these two regions forms the bounding region for SR-Tree. Multiple paths need to be traversed during search operations. For example, the exact match query for object f involves searching the paths S2 → S7 → f and S1.

4.1.5.3 Operations

The algorithms for data search, insertion, deletion and node-split are derived from the corresponding algorithms used by the SS-Tree, R-Tree and R*-Tree. The modifications are mainly for the updates of the bounding spheres and the bounding rectangles during data insertions and deletions.

Search: The search algorithms search regions in order of minimum distance from the query point until the query results are guaranteed correct to the required accuracy. Two priority queues are used: a search queue and a result queue. The search algorithm traverses from the root to the leaf nodes. At each intermediate node the entries which overlap with the query region are traversed. At the leaf node, the entries whose Feature_Vector overlaps with the query region are retrieved.

Search(N,Q)

Input: SR-Tree rooted at N and Q is the query region given by the user.

Output: All Objects in the query region Q.

Step 1: If N is a Non-Leaf node, for each entry E whose region overlaps with Q, Search(E,Q) is invoked.



The center of the bounding sphere,

\[ x_i = \frac{\sum_{k=1}^{n} E_k.x_i \cdot E_k.w}{\sum_{k=1}^{n} E_k.w} \]

Let d_s be the maximum distance from the center of a parent node to the bounding spheres of its children and d_r the maximum distance from the center of a parent node to the bounding rectangles of its children.

\[ d_s = \max_{1 \le k \le n} \left( \lVert x - E_k.x \rVert + E_k.r \right) \]

\[ d_r = \max_{1 \le k \le n} \mathrm{MAXDIST}(x, E_k.\mathrm{MBR}) \]

where MAXDIST() computes the maximum distance from a point to a minimum bounding rectangle. The radius of the bounding sphere,

\[ r = \min(d_s, d_r) \]
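The computation of the parent's bounding sphere can be sketched in a few lines. The sketch assumes that the weight w of a child is the number of data entries in its sub-tree and that an MBR is stored as a pair of corner points (`lo`, `hi`); the `Child` record and the function names are hypothetical.

```python
import math
from typing import NamedTuple, Tuple

class Child(NamedTuple):
    x: Tuple[float, ...]   # center of the child's bounding sphere
    r: float               # radius of the child's bounding sphere
    w: int                 # weight (assumed: number of data entries below)
    lo: Tuple[float, ...]  # lower corner of the child's MBR
    hi: Tuple[float, ...]  # upper corner of the child's MBR

def maxdist(point, lo, hi):
    """Distance from point to the farthest corner of the rectangle [lo, hi]."""
    return math.sqrt(sum(max(abs(p - l), abs(p - h)) ** 2
                         for p, l, h in zip(point, lo, hi)))

def parent_sphere(children):
    """Return (center, radius) of the parent's bounding sphere."""
    total_w = sum(c.w for c in children)
    d = len(children[0].x)
    # Weighted mean of the child centers.
    center = [sum(c.x[i] * c.w for c in children) / total_w for i in range(d)]
    ds = max(math.dist(center, c.x) + c.r for c in children)
    dr = max(maxdist(center, c.lo, c.hi) for c in children)
    return center, min(ds, dr)

kids = [
    Child(x=(0.0, 0.0), r=2.0, w=1, lo=(-1.0, -1.0), hi=(1.0, 1.0)),
    Child(x=(4.0, 0.0), r=2.0, w=3, lo=(3.0, -1.0), hi=(5.0, 1.0)),
]
center, radius = parent_sphere(kids)
```

Here d_s is 5 while d_r is about 4.12, so the rectangles give the tighter bound. This is the case in which taking the minimum of the two distances yields a smaller sphere than an SS-Tree style computation would.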

Insert(N,O)

Input: SR-Tree rooted at N and the object to be inserted O.

Output: New SR-Tree with O inserted

Step 1: If N is a Non-Leaf node, the entry E whose E.Centroid is closest to O.Feature_Vector is selected. Insert(E,O) is invoked.



Step 2: The first p entries are selected.

Step 3: The p entries are removed from N and the tree is updated.

Step 4: For each entry E Insert(N,E) is invoked.

Node-Split: The split algorithm is similar to that of R*-Tree. It calculates the variances of the entries in the node along each dimension, selects the dimension with the highest variance and then the location of the split. Then two new nodes are created and the entries are assigned to them. If the node is not a root node, the new node which is closer to the parent is retained and the other node is reinserted into the tree. The bounding boxes are adjusted along the path in the tree.

Split(N)

Input: Node N whose entries are to be split.

Output: New Re-organized SR-Tree

Step 1: The variances of coordinates along each dimension are calculated from the centroids of its children.

Step 2: The dimension with the highest variance is selected as the dimension for splitting.

Step 3: The split location is selected so as to minimize the sum of the variances of each side of the split.



Delete(N,O)

Input: Root node N and O the object to be deleted.

Output: New SR-Tree with O deleted.

Step 1: If N is a Non-Leaf node, the entry E whose E.Centroid is closest to O.Feature_Vector is selected. Delete(E,O) is invoked.

Step 2: If N is a Leaf node, O is removed from N and the tree is updated.

Step 3: If there is any underflow in N, all of the remaining entries are removed, the tree is updated and Reinsert(N) is invoked.

SR-Tree is one of the latest structures that feature the combination of both minimum bounding rectangles and minimum bounding spheres. By combining both, it inherits the advantages of SS-Tree and R*-Tree. By introducing bounding rectangles, neighborhoods can be partitioned into smaller regions, which improves disjointness among regions. It reduces the volume and diameter of regions as compared to SS-Tree and R*-Tree. The creation cost of SR-Tree is higher than that of SS-Tree. The fanout is small: the size of a node entry is three times that of SS-Tree and one and a half times that of R*-Tree. As SR-tree saves more leaf-level reads than it adds node-level reads, its total number of disk reads is less than that of SS-tree. SR-tree is effective from lower to higher dimensionality and improves the performance as compared to SS-tree. It is also more effective for less uniform data sets, which can be practical in actual image/video similarity. Although its creation cost is higher, SR-tree outperforms SS-tree for applications requiring index structures that are efficient for high-dimensional nearest-neighbor queries.



4.1.6.1 TV-Tree Vs R-Tree and its variants

For R-Tree and its variants, although conceptually they can be extended to higher dimensions, they usually require time and space that grow exponentially with dimension, and they reduce to a sequential scan.

Insertion cost is cheaper in TV-Tree due to the fact that TV-Tree is shallower than the corresponding R*-Tree.

The number of disk accesses for a search in TV-Tree is lower than that of R*-Tree.

The savings in total disk accesses during search increase with the size of the database, which indicates that it scales well.

As object size increases the leaf fanout decreases, making TV-Tree grow faster, but this does not affect the search performance much.

TV-Tree requires fewer nodes and hence less storage space.

The space savings in TV-Tree come from the internal nodes, which means that the non-leaf levels will require only a small buffer; this can be significant when buffer size is limited.

TV-Tree can be used for high dimensional feature spaces without this dimensionality problem.

The feature vectors contract and expand dynamically in the TV-Tree.

Compared to trees that use a fixed number of features, TV-Tree provides higher fanout at top levels, using only a few features.


Each MBB is represented by its centre and radius.

The MBBs can overlap and all the entries of a node's sub-tree are contained in the node.

More than one level can have the same number of active dimensions.



Figure 6a: TV-Tree

Figure 6b: TV-Tree Structure



Step 2: If N is not a leaf node, for each entry in S, whose MBRs overlap with the query region, do the search steps.

Insertion: When a new object has to be inserted, the tree is traversed, at each step selecting an entry which is suitable to contain the new object. After insertion, if there is an overflow in the leaf node, either some entries are re-inserted or the node is split. The MBRs are updated for the nodes along the path. During insertion, contraction can occur, resulting in an MBR with lower dimensionality. The split algorithm divides the entries into two groups with at least ff percent space utilization, where ff is a performance parameter. Splitting can be done either by clustering or by ordering the entries in a node.

To select an entry in an intermediate node, the following criteria are used in the given order of priority:

1. Minimum increase in overlapping regions within the node. This involves selecting an entry E such that after updating E.MBR, the number of overlaps among the entries within the node is minimized.

2. Minimum decrease in dimensionality. In this the entry E is selected so that it can accept the new object by contracting its center as little as possible.

3. Minimum increase in radius.

4. Minimum distance from the center of the MBR to the object.

Insert(N,O)

Input: TV-Tree rooted at N and object to be inserted O.

Output: Reorganized TV-Tree.

Step 1: If N is a Non leaf node an entry E which needs least overlap enlargement to



Step 5: Update the MBRs that have been changed and split an intermediate node if there is an overflow.

Split: Splitting is done to redistribute the set of MBRs into two groups so as to facilitate operations and provide high space utilization. Splitting can be done in two ways: ordering and clustering.

Splitting by ordering: In this, the vectors are ordered and the best partition along the ordering is found. The following criteria are used to minimize area and overlap.

Minimum sum of the radii of the two MBBs formed.

Minimum of (sum of the radii of the MBBs - distance between centers).

Ordering can be done in different ways: by sorting the vectors lexicographically, or by using space-filling curves such as Hilbert curves, among others.

Given node N to be split and the performance parameter ff, which is the minimum percentage of space utilization of a node, two new nodes N1 and N2 have to be created.

Order_Split(N,ff)

Input: Node to be split, N and storage utilization ff.

Output: Two nodes N1 and N2.

Step 1: The MBRs of the entries in N are sorted by ascending row-major order of their centers.

Step 2: Two groups of entries in nodes N1 and N2, each having at least ff storage



Step 1: The two most dissimilar MBRs are selected from the entries in N. This can be done by selecting the two having the smallest common prefix in their centers. In case of ties, the pair with the largest distance between centers is selected.

Step 2: The selected entries head the two new groups created in nodes N1 and N2.

Step 3: Each of the remaining entries is added to a group based on the following criteria, in the given order of priority: minimum increase in overlap, minimum decrease in dimensionality, minimum increase in radius and minimum distance from the center.
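The seed selection of Step 1 can be illustrated with a small sketch. It assumes, for simplicity, that all centers are tuples of the same length; the names `common_prefix_len` and `pick_seeds` are hypothetical and not taken from [Jaga94].

```python
import math
from itertools import combinations

def common_prefix_len(a, b):
    """Number of leading coordinates on which the two centers agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_seeds(centers):
    """Return the pair with the smallest common prefix.

    Ties are broken in favour of the pair with the largest distance
    between centers, hence the negated distance in the sort key.
    """
    return min(
        combinations(centers, 2),
        key=lambda pair: (common_prefix_len(*pair), -math.dist(*pair)),
    )

centers = [(1, 2, 3), (1, 2, 9), (1, 5, 0), (7, 0, 0)]
seeds = pick_seeds(centers)
```

Three pairs in this example share no prefix at all, so the tie-break on distance decides between them.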

Deletion: Deletion involves searching for the entry using an exact match query and removing the entry from the node. Then the bounding boxes are updated. When underflow occurs, the entries in the node are removed and re-inserted. When entries inside a node are redistributed, either by reinsertion or by split, new MBRs have to be calculated, and extending may be required by introducing new active dimensions, namely those on which all objects agree.

Delete(N,O)

Input: TV-Tree rooted at N and object to be deleted O.

Output: New TV-Tree with O removed.

Step 1: Using the exact match search query, Search(N,O), the leaf node L which contains O is obtained.

Step 2: O is removed from L and L.MBR is updated.

Step 3: If L underflows, all the entries are removed from the node and its MBR is updated. For each entry E removed from L, Insert(N,E) is invoked.



X-Tree [Krei96] is an extension of R*-Tree [Krei90] motivated by the problems arising in high dimensional data spaces. It can be seen as a hybrid of a linear array-like structure and a hierarchical R-Tree-like structure. For higher dimensions, in case of high overlap, most of the entries in the directory will be searched. Linear organization is more efficient in these cases as it needs less space and may be read faster.

X-Tree extends R*-Tree by two concepts:

Overlap-free split according to a split history.

Supernodes with enlarged page capacity.

The objective is to avoid overlap of bounding boxes by using an organization of the tree that is optimized for high-dimensional space. It avoids splits that would result in a high degree of overlap in the directory. Instead of allowing splits that introduce overlap, nodes are extended beyond the usual block size; such nodes are called supernodes. The basic idea is to keep the tree as hierarchical as possible while avoiding splits that would result in high overlap. Recording the history of data page splits in an R-Tree results in a binary tree having split dimensions as nodes and current data pages as leaf nodes. Whenever a split would result in an unbalanced tree with underutilized nodes, which would hurt storage utilization, X-Tree does not split but creates an enlarged directory node instead.

4.1.7.1 X-Tree Vs R*-Tree

X-Tree outperforms both R*-Tree and TV-Tree [Jaga94].

It uses the split history to select the split dimension.

If the split dimension selected using the split history results in high overlap, it extends the current node to a larger size, called a supernode.

Whenever there is high overlap due to a split, X-Tree uses nodes of larger size called supernodes.

It takes less time and has fewer page accesses for search queries.



Figure 7a: X-Tree



Due to the increase in the number of supernodes, the height of X-Tree decreases with increase in dimension.

When none of the nodes is a supernode, X-Tree is completely hierarchical and is similar to R-Tree.

When the root is the only supernode, the performance corresponds to a linear directory scan and the size of the directory depends linearly on the dimension.

4.1.7.3 Operations

The algorithms used in X-Tree are designed to automatically organize the nodes in a hierarchy, such that the portions of data which would produce high overlap are organized linearly and those which can be organized hierarchically without much overlap are organized in hierarchical form.

Search: The search algorithm is similar to that of R*-Tree, since only minor changes are required in accessing the supernodes. The search algorithm searches the tree to retrieve all the entries which overlap with the query region.

Search(N,Q)

Input: X-Tree rooted at N and the query region Q.

Output: All objects in the query region.

Step 1: If N is a non-leaf node, for each entry E whose E.MBR overlaps with Q, Search(E,Q) is invoked.

Step 2: If N is a leaf node, each entry E whose E.MBR overlaps with Q is returned.
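The two search steps can be sketched as a recursive range search. The `Entry` and `Node` classes below are hypothetical in-memory stand-ins for disk pages; a supernode would simply be a `Node` holding more entries than a normal block, so the search logic itself does not change.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple

Point = Tuple[float, ...]

@dataclass
class Entry:
    low: Point                      # lower corner of the entry's MBR
    high: Point                     # upper corner of the entry's MBR
    child: Optional["Node"] = None  # set for directory (non-leaf) entries
    obj: Any = None                 # set for leaf entries

@dataclass
class Node:
    is_leaf: bool
    entries: List[Entry]

def overlaps(low_a, high_a, low_b, high_b):
    """True if the two axis-aligned rectangles intersect in every dimension."""
    return all(la <= hb and lb <= ha
               for la, ha, lb, hb in zip(low_a, high_a, low_b, high_b))

def search(node, q_low, q_high):
    """Return all objects whose MBR overlaps the query rectangle."""
    hits = []
    for e in node.entries:
        if overlaps(e.low, e.high, q_low, q_high):
            if node.is_leaf:
                hits.append(e.obj)                             # Step 2
            else:
                hits.extend(search(e.child, q_low, q_high))    # Step 1
    return hits

leaf1 = Node(True, [Entry((0, 0), (1, 1), obj="a"), Entry((2, 2), (3, 3), obj="b")])
leaf2 = Node(True, [Entry((8, 8), (9, 9), obj="c")])
root = Node(False, [Entry((0, 0), (3, 3), child=leaf1),
                    Entry((8, 8), (9, 9), child=leaf2)])
hits = search(root, (0.5, 0.5), (2.5, 2.5))
```

In this example the second directory entry does not overlap the query, so `leaf2` is never visited.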

Insertion: The most important algorithm is the insertion algorithm as it determines the



Step 5: First a topological or overlap-minimal split is tried and, if it is successful, a new node is added to the tree and the tree gets updated.

Step 6: If there is no good split, a supernode is created. The entry of the supernode is updated in its parent node.

Split: For topological splits X-Tree uses the R*-Tree or other split algorithms. When a node overflows, the splitting can be done in the following ways:

Using the topological properties of MBRs, such as MBR extension and dead space, a split of the node is attempted.

If the above results in high overlap, the split history stored in the nodes is used.

X-Tree selects the dimension along which the root of the split-history tree has been split. This is done to select a dimension over which all the data in the tree are split, which guarantees overlap-free regions.

For lower dimensions, there may be more than one overlap-free split dimension, but the probability that a second split dimension exists which is a part of the split history of the MBRs of all entries decreases with increase in dimension.

The overlap-free or overlap-minimal split requires that information about the split history be stored in the intermediate nodes, and it can result in an unbalanced tree. In this case it would be advantageous not to split the node, since splitting would create one underfilled and one almost overfilled node and would lower storage utilization. In these cases X-Tree creates an enlarged directory node called a supernode. The higher the dimensionality, the more supernodes are created and the larger the supernodes become. If a supernode is created or extended and there is not enough contiguous space on disk to store the supernode sequentially, a local reorganization has to be performed by the disk manager; this does not occur frequently.



Step 5: If the overlap-minimal split results in an underfilled node, the node is not split. In this case the current node is extended to become a supernode of twice the standard block size.

Deletion: The delete operation is also a simple modification of the corresponding R*-Tree algorithm. The only difference occurs in the case of underflow of a supernode. If the supernode consists of two blocks, it is converted to a normal node. Otherwise, if the supernode consists of more than two blocks, its size is reduced. Delete performs the search for the leaf node having the entry and removes the entry. The update operation can be seen as a combination of a delete and an insert operation.

Delete(N,O):

Input: X-Tree rooted at N and Object to be deleted O.

Output: New Reorganized X-Tree.

Step 1: An exact match query, Search(N,O), is performed to get the leaf node L which contains the object to be deleted, O.

Step 2: O is removed from L and L.MBR is updated.

Step 3: If there is no underflow of L, the tree is updated.

Step 4: If L underflows and it is not a supernode, its entries are removed, the tree is updated and for each entry E, Insert(N,E) is invoked.

Step 5: If L is a supernode which underflows, then if L has more than two blocks it is reduced by one block, else it is converted into an ordinary leaf node.



minimum bounding boxes, which may overlap to divide the space recursively. The pyramid structure is a widely used technique in image processing. The root of the pyramid corresponds to the entire image. Then the space corresponding to the root is divided into quadrants down to pixel level.

It has the following advantages:

The physical location of a node can be calculated, since the number of nodes in a level is known. It provides fast direct access.

A node contains the summarized information of the area to which it corresponds, which can speed up the search.

Pyramid K-instantiable Tree or PK-Tree combines aspects of the PR-quad-tree and the KD-Tree [Bent75], but with the unnecessary nodes eliminated. It achieves a better bound on the height of the tree for skewed data distributions. It instantiates only those non-leaf nodes that have at least a certain number of non-empty children. This restriction eliminates the problem of the height of the tree growing large due to skew in the spatial distribution of points. Updates are inexpensive and the structure is independent of the order of data insertions and deletions: the PK-Tree is unique for any data set.

PK-Tree is created as follows:

Initially the data points are in a rectangular area or cell.

The rectangle is divided recursively into smaller sub-cells.

At each level, the division could be different for each dimension.

The higher the level is, the smaller the cell size is.

It uses a simple rule to eliminate nodes with fewer children.



Depending on the application, this tree can lead to inefficient storage or search performance due to the unbalanced tree structure.

KD-Tree removes some unnecessary nodes from PR-quad-tree, but its height can be very large in case of large datasets.

4.1.8.1.2 PK-Tree Vs SR-Tree and X-Tree

SR-Tree has a larger generation time as it requires computation of both the bounding sphere and the bounding rectangle.

The generation time for X-Tree increases as the number of dimensions increases, as the number of supernodes becomes larger.

PK-Tree has a shorter generation time, and the time increases slowly with increase in dimension.

For search queries PK-Tree outperforms both SR-Tree and X-Tree as there are no overlapping siblings in PK-Tree.

The response time for X-Tree increases at a slower rate than that of the SR-Tree, while PK-Tree has the slowest rate of increase with the dimensionality.

As it eliminates the non-K-instantiated nodes, PK-Tree removes the performance impact of skewed data distributions and performs better than X-Tree and SR-Tree.



Figure 8b: PK-Tree Structure

PK-Tree structure for the running example is shown in Figure 8a. Here the data space is two-dimensional. The dividing ratios for all levels are r1 = 2 and r2 = 2, and the value of K is set to 3. The data objects are represented by points and labeled using letters. Among the cells at level 2, cells 8 and 13 are 3-instantiable, as the objects contained in them cannot be covered by one or two subcells, while others such as 1, 2 and 4 are not 3-instantiable. At level one all the cells A, B, C and D are 3-instantiable. For example, subcells 1, 2 and 6 cover the objects in cell A.


Given dataset D of N nodes, the set of dividing ratios R with r as the maximum fan-out factor from the set, and a value for K:

Any two cells should have one of the following relationships: disjoint, subset or superset.

The cells at the same level are disjoint and at different levels are either disjoint or one



retrieved. If h is the height of the tree, the computational complexity of small range queries is O(h). For larger range queries, the number of data objects returned is linearly proportional to N and the complexity becomes O(N), where N is the cardinality of the data set upon which the PK-Tree is built. An exact match query is a special case of a range query with the range set to zero.

Search(N,d,Q):

Input: Root node N, the location d and query range Q.

Output: All data objects within the range Q from d.

Step 1: Let Result = NULL.

Step 2: If N is a Non-Leaf node, if Q encloses an entry E in N, Step 4 is performed.

Step 3: If N is a Non-Leaf node, if Q overlaps with an entry E in N, then let N = E and the procedure is repeated from Step 2.

Step 4: If N is a Non-Leaf node, for each entry E in N, let N = E and the process is repeated from Step 4.

Step 5: If N is a Leaf node, Result = Result ∪ N.Data.

Insertion: Insertion of an object into the PK-Tree with rank K is achieved by inserting the corresponding point cell into the tree.

PK-Tree is generated from an empty tree and the data points are inserted one by one. A data point is inserted into the corresponding leaf node in the following two phases:



Step 2: If N is a Non-Leaf node and O is not contained in any entry, then O is added to N and Update(N) is invoked.

The update algorithm checks whether a node is instantiable and also updates all the nodes along the path to the node.

Update(N):

Input: Node N which has to be updated.

Output: New Reorganized PK-Tree.

Step 1: Let C be the set of all children of the node N.

Step 2: If N is not the root node and C has fewer than K entries, node N is de-instantiated, all of its children are added to its parent node P and Update(P) is invoked.

Step 3: If there exists a K-instantiable entry E in the node N, then Steps 4 to 7 are followed.

Step 4: The sub-cell E is instantiated.

Step 5: E is made child of N.

Step 6: All entries in C which are contained in E are made its children.

Step 7: Update(N) is invoked.

Deletion: Deletion of a data point is removal of its point cell from the tree. Similar to



Output: Reorganized PK-Tree with O removed.

Step 1: If there exists an entry E such that O is contained in E then Step 2 is followed.

Step 2: If E is a Leaf node, E is de-instantiated and Update(N) is invoked.

Step 3: If E is a Non-Leaf node Delete(E,O) is invoked.

PK-Tree is a variation of PR-quad-trees. It differs from existing index structures by employing a unique set of constraints to eliminate unnecessary nodes that can result from skewed data distributions. The total number of nodes in PK-Tree is O(N). The average height of the tree is O(log N). It improves the creation time and query time compared to existing spatial index structures. It performs well for uniformly distributed data and for most skewed data distributions. It has the following properties: sibling nodes do not overlap, the PK-Tree for a given data set is unique and independent of the order of insertions and deletions, and the number of children is bounded. PK-Tree outperforms existing spatial index methods like SR-Tree and X-Tree, which are based on R-Tree.

4.1.9 G-Tree

Grid-Tree or G-Tree [Kuma94] is a multi-dimensional index structure for Point Access Method, which combines the features of Grid files [Sevc94] and B-Trees. It divides the multidimensional space into a grid of variable size partitions and the partitions are organized into a B-Tree. It orders and numbers objects in such a way that partitions that are spatially close together are also close in terms of their partition numbers. It adapts well to structures with a high frequency of insertions and deletions and to non-uniform distributions of data. The structure proposed is similar to BD-Tree [Osha83]. The partitions correspond to disk pages and points are assigned to a partition until it is full. A full partition is split into two equal sub-partitions and the points in the original partition



4.1.9.1.2 G-Tree Vs KDB-Tree

The fan-out of G-Tree is about 2.5 times as large as the fan-out of a KDB-Tree for the two-dimensional case, and the advantage increases for higher dimensions.

In G-Tree, when a partition becomes full, it is split into two equal sub-partitions, whereas in KDB-Tree the partitions may be unequal. KDB-Tree results in forced splitting of the children nodes when the parent nodes are split.

The splitting method in KDB-Tree also affects the algorithm for handling deletion, which requires reorganizations and often deteriorates the storage utilization.

In G-Tree, regardless of the order of insertions and deletions, the partitioning becomes identical.


Partitions are numbered as binary strings of 0s and 1s. Initially the entire data space is divided along the first dimension into two equal sub-partitions, numbered 0 and 1. When partition 0 becomes full, it is subdivided to create two partitions of equal size, 00 and 01. For d dimensions the splitting dimension recycles with a periodicity of d, such that each dimension appears once in a cycle. A leaf node points to a page that contains all points in a partition, while higher level nodes point to nodes at the next lower level. The most significant bits of two partition numbers are compared to determine the relationship between the two partitions.
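The numbering scheme can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function names `partition_number` and `is_ancestor`, the representation of partition numbers as strings of '0' and '1', and the convention that bit 0 selects the lower half of the splitting dimension are all assumptions made for the example.

```python
def partition_number(point, bounds, bits):
    """Return the `bits`-long binary string of the cell containing `point`.

    bounds: list of (low, high) pairs, one per dimension.
    Assumption: bit '0' selects the lower half, '1' the upper half.
    """
    lo = [b[0] for b in bounds]
    hi = [b[1] for b in bounds]
    d = len(bounds)
    out = []
    for k in range(bits):
        j = k % d                      # splitting dimension cycles with period d
        mid = (lo[j] + hi[j]) / 2
        if point[j] < mid:
            out.append("0")
            hi[j] = mid
        else:
            out.append("1")
            lo[j] = mid
    return "".join(out)


def is_ancestor(p, q):
    """True if partition p contains partition q, i.e. p is a prefix of q."""
    return q.startswith(p)
```

Under these assumptions, comparing two partitions on their most significant bits reduces to a prefix test: a partition contains another exactly when its number is a prefix of the other's number.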

The data space is divided into non-overlapping regions of variable size. Each dimension has values within a specified range. Each region or partition has a maximum of M entries. The partition numbers are of variable length and each one is as long as is necessary. Each partition is assigned a unique partition number.



Figure 9a: G-Tree



further. The search algorithm first transforms the leftmost and rightmost points of the query region into b-bit partition numbers Pl and Pr respectively. Next the G-Tree is searched and all partitions in the range Pl to Pr are checked. If a partition is fully contained in the query region, all the points in the partition are returned. If a partition overlaps the query region, the points in the partition are examined one by one and those that lie within the query region are returned. If a partition neither overlaps nor is contained in the query region, the next partition is checked, and this continues until all the points in the query region have been retrieved.

Search(N,Q):

Input: N, the root node of the G-Tree, and the query region Q with leftmost point ql and rightmost point qr.

Output: All objects in the query region.

Step 1: Let Pl = Transform(ql) and Pr = Transform(qr).

Step 2: Let P be the smallest partition in the G-Tree.

Step 3: Until P > Pl, P = Pnext is performed to search for Pl in the G-Tree.

Step 4: For all partitions in the range Pl to Pr the Steps 5 to 8 are performed.

Step 5: Let O = overlap(P) to check whether the partition is contained in or intersects the

query region.

Step 6: If P is contained in the query region, all entries in P are retrieved.

Step 7: If P overlaps the query region, each entry in P which intersects Q is retrieved.
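The range scan in the steps above can be sketched as follows. This is a simplified illustration under stated assumptions: a plain dictionary from partition number to point list stands in for the B-Tree, the names `z_number` and `range_search` are hypothetical, bit 0 is assumed to select the lower half of a dimension, and every candidate partition is filtered point by point instead of short-cutting fully contained partitions.

```python
def z_number(point, bounds, bits):
    """Binary string of the `bits`-deep cell containing `point` (bit 0 = lower half)."""
    lo = [b[0] for b in bounds]
    hi = [b[1] for b in bounds]
    out = []
    for k in range(bits):
        j = k % len(bounds)
        mid = (lo[j] + hi[j]) / 2
        if point[j] < mid:
            out.append("0")
            hi[j] = mid
        else:
            out.append("1")
            lo[j] = mid
    return "".join(out)


def range_search(partitions, bounds, q_lo, q_hi):
    """Return all stored points inside the query box [q_lo, q_hi].

    partitions: dict mapping partition number (bit string) -> list of points.
    """
    b = max(len(p) for p in partitions)    # length of the longest partition number
    pl = z_number(q_lo, bounds, b)
    pr = z_number(q_hi, bounds, b)
    result = []
    for p in sorted(partitions):           # stands in for walking the B-Tree leaves
        n = len(p)
        # Compare on the most significant bits only, since numbers vary in length.
        if not (pl[:n] <= p <= pr[:n]):
            continue
        for pt in partitions[p]:
            if all(l <= x <= h for x, l, h in zip(pt, q_lo, q_hi)):
                result.append(pt)
    return result
```

For example, with partitions `{"00": [(0.1, 0.1), (0.2, 0.4)], "01": [(0.3, 0.8)], "1": [(0.7, 0.2), (0.9, 0.9)]}` over the unit square, the query box from (0.0, 0.3) to (0.4, 0.9) never examines the points of partition 1.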



Step 3: For j ranging from 1 to n Steps 4 to 5 are performed.

Step 4: Let p = n * (i-1) + j.

Step 5: If p ≤ b and bit in position p is 0, then ql.xj = (qr.xj + ql.xj)/2, else qr.xj = (qr.xj + ql.xj)/2.

Step 6: Let j = 1. While (j ≤ n and ql.xj = hj and qr.xj = lj), j = j+1 is performed.

Step 7: If j == n+1, then Val = 1 as partition P is contained in the region and Val is returned, else Step 8 is performed.

Step 8: Let j = 1. While (j ≤ n and ql.xj ≤ hj and qr.xj ≥ lj), j = j+1 is performed.

Step 9: If j == n+1, then Val = 2 as partition P overlaps the region and Val is returned.
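The classification performed by these steps can be sketched as follows. This is a hedged illustration, not the published procedure: the names `partition_bounds` and `overlap` are hypothetical, bit 0 is assumed to keep the lower half of the current dimension, and the return value 0 for a disjoint partition is an assumption added so that the function always returns something. The values 1 (contained) and 2 (overlapping) follow the text.

```python
def partition_bounds(number, bounds):
    """Decode a partition number (bit string) into its low and high corners."""
    lo = [b[0] for b in bounds]
    hi = [b[1] for b in bounds]
    for k, bit in enumerate(number):
        j = k % len(bounds)            # dimension halved by this bit
        mid = (lo[j] + hi[j]) / 2
        if bit == "0":                 # assumption: 0 keeps the lower half
            hi[j] = mid
        else:
            lo[j] = mid
    return lo, hi


def overlap(number, bounds, q_lo, q_hi):
    """Return 1 if the partition lies inside the query box, 2 if it only
    overlaps it, and 0 (an assumed convention) if the two are disjoint."""
    lo, hi = partition_bounds(number, bounds)
    dims = range(len(bounds))
    if all(q_lo[j] <= lo[j] and hi[j] <= q_hi[j] for j in dims):
        return 1
    if all(lo[j] <= q_hi[j] and hi[j] >= q_lo[j] for j in dims):
        return 2
    return 0
```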

Insertion: Insertion first assigns a partition number to the point, and then searches the tree to locate the partition where the point has to be inserted. If such a partition exists, the point is inserted, and on overflow the partition is split into two equal partitions. This splitting is repeated until no single partition holds all the points. If no such partition exists, a new partition is created and added to the tree.

The algorithm initially computes the partition number p for the given object, with the number of bits equal to that of the smallest partition created so far. Then it searches for a partition which contains the partition number p. If such a partition exists, the object is inserted into it. If there is no such partition, a new partition is created with itself or its largest ancestor which does not overlap with the existing



Step 4: If Pa overflows, two new partitions P0 and P1 are created by appending 0 and 1 respectively to Pa.

Step 5: Pa is deleted from the G-Tree.

Step 6: Objects in Pa are reallocated to P0 and P1.

Step 7: Insert(N,P0) and Insert(N,P1) are invoked.
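The overflow handling in Steps 4 to 7 can be sketched as follows. This is an illustration under several assumptions: a dictionary stands in for the tree, the names `cell_bounds` and `split_partition` are hypothetical, only non-empty halves are stored, bit 0 denotes the lower half, and a `max_bits` guard (not part of the text) stops the recursion when many points coincide.

```python
def cell_bounds(number, bounds):
    """Low and high corners of the partition with the given bit-string number."""
    lo = [b[0] for b in bounds]
    hi = [b[1] for b in bounds]
    for k, bit in enumerate(number):
        j = k % len(bounds)
        mid = (lo[j] + hi[j]) / 2
        if bit == "0":
            hi[j] = mid
        else:
            lo[j] = mid
    return lo, hi


def split_partition(partitions, pa, bounds, capacity, max_bits=32):
    """Split partition `pa` in place while it holds more than `capacity` points."""
    points = partitions[pa]
    if len(points) <= capacity or len(pa) >= max_bits:
        return
    lo, hi = cell_bounds(pa, bounds)
    j = len(pa) % len(bounds)          # dimension used by the next bit
    mid = (lo[j] + hi[j]) / 2
    del partitions[pa]                 # Step 5: Pa leaves the tree
    halves = {
        pa + "0": [p for p in points if p[j] < mid],   # Step 6: reallocate
        pa + "1": [p for p in points if p[j] >= mid],
    }
    for number, pts in halves.items():
        if pts:                        # assumption: empty halves are not stored
            partitions[number] = pts   # Step 7: insert the new partition
            split_partition(partitions, number, bounds, capacity, max_bits)
```

The recursive call covers the case described in the text where all points fall into one half and the split has to be repeated.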

Deletion: Like the insertion algorithm, the deletion algorithm first computes a partition number p with the number of bits equal to that of the smallest partition in the G-Tree. Then the G-Tree is searched to find a partition which contains the partition number. If no partition is found, the given point does not exist. If a partition is found, the given point is located in it and deleted. If the partition can be merged with its complement partition, the two partitions are removed, their entries are placed in a single node, and the new parent node is inserted.

Delete(N,O):

Input: Root node N and the object to be deleted O.

Output: Reorganized G-Tree with O deleted.

Step 1: Let Pd = Transform(O).

Step 2: Search(N,O) is performed to get the Partition P which contains Pd.

Step 3: O is removed from P and let Pa = P.

Step 4: While the total number of points in Pa and Complement partition of Pa is less



4.1.10 MB+-Tree

Multidimensional B+-Tree or MB+-Tree [Yang95] can be considered an extension of the B+-Tree to multiple dimensions. A B+-Tree stores the data values in its leaves and copies them into internal nodes when necessary. MB+-Tree takes into account the characteristics of image and video database management systems employing content-based retrieval more fully than earlier methods. To retrieve an image based on its content, it is necessary to extract the features which are characteristic of the image and to index the image on these features. A feature vector can be represented as a multidimensional vector with each component denoting a different feature measure. MB+-Tree has similarities to and differences from R-Tree [Gutt84] and its variants.

4.1.10.1.1 MB+-Tree Vs B+-Tree:

Insertion and Deletion in MB+-Tree are extensions of the corresponding algorithms of B+-Tree in higher dimensions.

The search methods in MB+-Tree are different as it tackles similarity queries.

4.1.10.1.2 MB+-Tree Vs R-Tree and its variants

Unlike other multidimensional index structures, MB+-Tree uses a linear ordering for indexing the multidimensional space.

For R-Tree and its variants, searching an intermediate node requires examining all entries in the node, and examining an entry requires comparing the boundary values in all dimensions.

For MB+-Tree, some entries in the node may not need to be examined and the search can return at an intermediate step. Also, examining an entry may not require comparing all dimensions.



Figure 10a: MB+-Tree



is divided independently along the second dimension. A region can be split repeatedly along one dimension as long as its width stays above a threshold; after that, splits are made along the next dimension as new objects are inserted into the MB+-Tree. A linear order can be defined on the set of all regions by comparing the boundary values in the same order as the dimensions, and a B+-Tree can be built using this order.

In MB+-Tree, the partition can occur anywhere along a dimension. If there are a large number of data points clustered in a region, MB+-Tree can split the objects evenly. This decreases the height and hence the search time of the tree. The values used for splitting should be stored at each level.

For example, in the two-dimensional case the structure of the MB+-Tree is as follows:

(x,y) represents an attribute in the 2-dimensional space.

The space is partitioned into M vertical strips by M-1 vertical lines.

Each vertical strip S is then partitioned into Ns regions by (Ns-1) horizontal lines. The value of Ns can be different for different strips.

Thus the 2-dimensional space is partitioned into a set of disjoint regions, each of which is a rectangle.

The horizontal dimension is called the first dimension and the vertical dimension is called the second dimension.

The M vertical strips are ordered from left to right, and the Ns regions within a vertical strip S are ordered from bottom to top.

This yields a linear order on the set of all regions. Using this order a B+-Tree is built on the set of all regions.

The properties of the MB+-Tree are:

Each internal node has M entries and M+1 pointers
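The left-to-right, bottom-to-top ordering of regions in the two-dimensional case can be sketched as follows. This is a minimal illustration, assuming the partition is stored as a sorted list of vertical cut values plus one sorted list of horizontal cut values per strip; the names `locate`, `linear_rank`, `x_cuts` and `y_cuts_per_strip` are hypothetical and do not come from the MB+-Tree paper.

```python
from bisect import bisect_right


def locate(point, x_cuts, y_cuts_per_strip):
    """Return (strip, region) indices of the rectangle containing a 2-D point.

    x_cuts: sorted list of the M-1 vertical cut values.
    y_cuts_per_strip[s]: sorted list of the Ns-1 horizontal cuts in strip s.
    """
    x, y = point
    strip = bisect_right(x_cuts, x)                  # first dimension only
    region = bisect_right(y_cuts_per_strip[strip], y)  # then second dimension
    return strip, region


def linear_rank(strip, region, y_cuts_per_strip):
    """Position of a region in the left-to-right, bottom-to-top order."""
    before = sum(len(cuts) + 1 for cuts in y_cuts_per_strip[:strip])
    return before + region
```

Because the rank depends only on the strip and the position within the strip, regions can be keyed in an ordinary one-dimensional B+-Tree.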



The linear ordering has the following advantages:

Space required for each entry at the leaf level is reduced by nearly half. This will reduce the number of leaf nodes and hence the size of the MB+-Tree, resulting in better search performance.

Insertion and deletion algorithms are similar to those for a B+-Tree and are simpler than those for other trees like R-Tree and its variants. The entries in the intermediate levels correspond to an element in the set, as do entries in the leaf level.

MB+-Tree maintains locality to some extent, which should give better performance in searching.

4.1.10.3 Operations

Initially the MB+-Tree has only one leaf node, which is also the root node, with only one entry. Each object inserted is simply added to the list until it is full. A splitting operation is then required for the next insertion, and the entire space is divided into two along the first dimension. The process continues until the space is divided into smaller regions. Splitting is then done along the second dimension, and so on.

Search: Given a query region, all the objects which belong to the query region have to be retrieved. Each leaf node is checked for overlap with the query region, and for each overlapping leaf node the entries in the list referenced by its pointer are scanned to locate the required objects. The search algorithm first finds the leaf nodes and then goes through all the entries of each leaf node. The entries in the internal nodes are scanned to find all the sub-trees that contain at least one leaf entry overlapping the query region; when more than one sub-tree overlaps the query region, all such sub-trees are searched recursively. As the rectangle in an entry is not an enclosing rectangle, the algorithm has two loops, because the condition for finding the first sub-tree differs from the condition for identifying the following sub-trees.



Search(N,Q):

Input: Root node of MB+-Tree N and the query region Q.

Output: Objects in the query region Q.

Step 1: If N is a leaf node, N is returned.

Step 2: Among the M entries, the first entry E that overlaps with Q is determined and is added to a list NODE.

Step 3: From the following entries, the nodes that have at least one entry overlapping with the query region are determined and added to NODE.

Step 4: Let S = NULL. For each entry E in NODE, S = S ∪ Search(E,Q) is invoked.

Step 5: Let RESULT = NULL. For each entry N in S, Steps 6 and 7 are followed.

Step 6: Starting from the right-most entry, for each entry the upper-right corner is

determined.

Step 7: If the entry overlaps with the query region, it is added to the resultant set.

Step 8: The Resultant set is returned.

For R-Tree and its variants, searching a node requires searching all of its entries. For MB+-Tree, searching can stop at an intermediate step, some entries in the node need not be examined, and examining an entry may not require comparing on all dimensions.
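The saving described here can be illustrated on the leaf-level regions alone. The sketch below is not the two-loop tree search of the steps above; it only shows, for the two-dimensional case, how strips are selected by comparing the first dimension and how the second dimension is consulted only inside the selected strips. The function name `overlapping_regions` and the cut-list representation are assumptions made for the example.

```python
from bisect import bisect_right


def overlapping_regions(q_lo, q_hi, x_cuts, y_cuts_per_strip):
    """Return (strip, region) pairs whose rectangles intersect the query box.

    x_cuts: sorted vertical cut values; y_cuts_per_strip[s]: sorted
    horizontal cut values inside strip s. Regions are treated as half-open.
    """
    first = bisect_right(x_cuts, q_lo[0])   # strip holding the left edge
    last = bisect_right(x_cuts, q_hi[0])    # strip holding the right edge
    hits = []
    for s in range(first, last + 1):        # strips outside are never examined
        cuts = y_cuts_per_strip[s]
        bottom = bisect_right(cuts, q_lo[1])
        top = bisect_right(cuts, q_hi[1])
        for r in range(bottom, top + 1):
            hits.append((s, r))
    return hits
```

With vertical cuts at 0.3 and 0.7, the query box from (0.2, 0.6) to (0.5, 0.9) touches only the first two strips, so no horizontal cut of the third strip is compared at all.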

Insertion: Using the linear order, the insertion is a standard B+-Tree operation. If there is



Step 4: During split, the list will be divided into two lists of about the same size.

Step 5: If the region to be split is a vertical strip that has not been divided by a horizontal line, it is divided by another vertical line, with a value chosen for the minimum length of the horizontal side of the vertical strip.

Step 6: If the vertical dividing results in a thin vertical strip, with horizontal side smaller

than minimal value, horizontal dividing is done.

Step 7: After a region has been split into two smaller regions, a new entry will be inserted

into the tree.

Deletion: Deletion is similar to the standard B+-Tree deletion. A merge may be required when a list becomes too small.

Delete(N,O):

Input: MB+-Tree rooted at N and object to be deleted O.

Output: Reorganized MB+-Tree.

Step 1: Using the exact match query Search(N,O), the leaf node N which has the entry to be deleted is determined.

Step 2: O is removed from N.

Step 3: If the list in N becomes too small it is merged with another list.

Step 4: Two neighboring regions split along the same dimension can be merged; for example, a vertical strip not divided by a horizontal line can be merged with another such



Easy incremental reorganization as the file grows.

Simple algorithms.

Ability to handle different kinds of queries.

Many index structures exhibit these properties only some of the time. K-D-B-Tree [Robi81] was proposed as a Multidimensional Access Method. It has the following desirable properties:

A balanced tree structure. All leaf nodes are at the same level.

The data is stored in the leaf nodes and the intermediate nodes contain indexes which direct the search.

It adapts to the distribution of attribute values.

Holey-Brick Tree or hB-Tree [Lome90] is derived from K-D-B-Tree but has additional desirable properties. It uses K-d-Trees [Betn75] to organize the space represented by the interior nodes for very efficient searching.

The advantages of using K-d-trees are:

Compared to boundary representation, as the regions in the k-d-tree share boundaries, they have high intra-node search space.

K-d-tree requires fewer comparisons during searching than the boundary representation.

It uses less space than the boundary list representation.

The nodes in hB-Tree represent bricks from which smaller bricks have been removed. In order to minimize redundancy, the k-d-tree corresponding to an interior node can have several leaves pointing to the same node. It grows from the leaves and has all the leaves at the same



1. The organization and splitting of data and index nodes.
2. Posting the index terms to the next higher level.
3. The guaranteed storage utilization.

hB-Tree requires more space to store index terms than B+-Tree and so has less fan-out than B+-Tree.

hB-Tree differs from the B+-Tree only in the organization of index terms into k-d-trees and in splitting of data between nodes.

Figure 11a: hB-Tree



In hB-Tree, the partition is done in more than one dimension. Figure 11a illustrates this property of hB-Tree over a two-dimensional running example. The root node has entries for two child regions R and S, partitioned by X1 along the X-axis and Y1 along the Y-axis. Region R has two parts, one containing the subregions 1 and 2 and the other containing the subregions 5 and 6. The entry EXT in region R represents the portion removed from R. Region S contains the subregions 3 and 4.


hB-Tree is a derivation of K-D-B-Tree, where k-d-trees are used to represent the intermediate nodes.

Node splitting is done based on more than one attribute.

Nodes represent regions from which smaller regions may have been removed.

The holey region is called the enclosing region and the region removed is called the extracted region.

As several leaves of the k-d-tree can refer to the same node at the lower level, it is not truly a tree, but a directed acyclic graph.

During node splitting, if f is the fraction of data going into the new node and (1-f) the fraction going into the original node, then the storage utilization U is given by:

U = [f * Log(1/f)] + [(1-f) * Log(1/(1-f))]

If s is the size of a node and i is the size of an index term, Fan-out = (U * s) / i.
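The two formulas can be evaluated with a short Python sketch. The text does not state the base of the logarithm; base 2 is assumed here because it keeps U between 0 and 1, and the function names `storage_utilization` and `fan_out` are hypothetical.

```python
import math


def storage_utilization(f, base=2.0):
    """U = f*log(1/f) + (1-f)*log(1/(1-f)) for a split fraction 0 < f < 1.

    The logarithm base is an assumption; the source formula leaves it open.
    """
    if not 0.0 < f < 1.0:
        raise ValueError("f must lie strictly between 0 and 1")
    return f * math.log(1 / f, base) + (1 - f) * math.log(1 / (1 - f), base)


def fan_out(f, node_size, index_term_size, base=2.0):
    """Fan-out = (U * s) / i, with s the node size and i the index term size."""
    return storage_utilization(f, base) * node_size / index_term_size
```

Under the base-2 assumption an even split (f = 1/2) gives the largest value of U, and the value is the same for f and 1-f.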

4.1.11.3 Operations

The k-d-tree leaves in an hB-Tree index node refer to lower level hB-Tree nodes. Each internal node of the k-d-tree holds an indicator of which attribute is compared, what the comparison value is, and whether equality falls on the left branch or the right branch.



Step 5: If the comparison value is in the middle of the search range, then both

Search(N.L,Q) and Search(N.R,Q) are invoked.
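The descent through the k-d-tree of a single index node can be sketched as follows. This is an illustration under assumptions, not the published algorithm: the class `KDNode` and function `search` are hypothetical names, values equal to the comparison value are assumed to go to the right branch, and any object that is not a `KDNode` is treated as a reference to a lower level hB-Tree node.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class KDNode:
    attr: int        # index of the attribute compared at this node
    value: float     # comparison value
    left: Any        # subtree or child reference for attribute < value
    right: Any       # subtree or child reference for attribute >= value


def search(node, q_lo, q_hi, found=None):
    """Collect the child references reachable for the query box [q_lo, q_hi]."""
    if found is None:
        found = []
    if not isinstance(node, KDNode):
        if node not in found:          # several leaves may refer to one child
            found.append(node)
        return found
    if q_lo[node.attr] < node.value:   # range reaches below the comparison value
        search(node.left, q_lo, q_hi, found)
    if q_hi[node.attr] >= node.value:  # range reaches the comparison value or above
        search(node.right, q_lo, q_hi, found)
    return found
```

When the comparison value falls inside the search range both conditions hold and both branches are followed. The duplicate check reflects the fact that several k-d-tree leaves can refer to the same lower level node.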

Insertion: The exact match search finds the node where the data has to be inserted, using the k-d-trees in the hB-Tree index nodes. Within this node the location where the new object has to be inserted is determined. If the node overflows after insertion, it is split.

Insert(N,O):

Input: Root node of hB-Tree N and object to be inserted O.

Output: Reorganized hB-Tree with O inserted.

Step 1: Exact match query Search(N,O) is performed to get node N where O has to be

inserted.

Step 2: O is inserted into N, if N has sufficient space.

Step 3: If N overflows, Steps 4 to 6 are followed for splitting.

Step 4: A new node is created and the data of the original node are split between the original and the new nodes.

Step 5: An index term of the tree which identifies the new node
