



Solving Approximate Similarity Queries

DANG Tran Khanh School of Computing Science

Middlesex University The Burroughs, Hendon NW4 4BT

London, United Kingdom [email protected]

Abstract: Supporting similarity search capabilities in data repositories helps satisfy users' information needs rather than only their data needs, as conventional DBMSs do. This is desirable for many modern database applications. However, when the data repository contains high-dimensional data, solutions to the similarity search problem become cost-inefficient due to the so-called dimensionality curse: it has been observed that in high-dimensional data spaces the probability of overlap between a query and the data regions of a multidimensional access method (MAM) is very high. Hence, the execution of a similarity query may require accessing a vast number of data regions, and the performance of MAMs decreases significantly. Approximate similarity search has been introduced to reduce the complexity of the problem. However, most research work done so far focuses mainly on approximate nearest neighbor (NN) and range queries in a single-feature data space. In practice, multiple-condition queries appear more frequently and are more complicated to deal with. In this article, we present efficient approaches to three types of approximate similarity queries: approximate multi-feature NN, approximate single-feature NN, and approximate range queries. In particular, we use the Vague Query System, one of the flexible query answering systems for conventional DBMSs, as a case study to illustrate and establish the practical value of our proposed solutions. Experimental results with both synthetic and real data sets confirm the efficiency of these solutions.

Keywords: approximate nearest neighbor query, approximate range query, multidimensional access method, SH-tree, Vague Query System (VQS), intelligent information retrieval

1 A Preliminary Declaration of Problems

As we know, nearest neighbor and range searching are among the most important and fundamental problems in computational geometry because of their numerous important application areas [1, 2]. In particular, in many modern database applications, high-dimensional searching problems arise when complex objects are represented by vectors of d numeric features. As the dimension d grows large, it becomes difficult to design an algorithm for the searching problem that achieves both nearly linear space and poly-logarithmic query time in the worst case. The reason is that solving these problems exactly may incur very high costs for data accesses (IO-cost) and/or computations (CPU-cost), especially with high-dimensional data sets. In other words, as the dimension d increases, the space and/or time complexities also increase dramatically. For this reason, in many cases, one is willing to solve the high-dimensional searching problem approximately in order to reduce the costs while still keeping acceptably high accuracy of the returned results. Formal definitions of the approximate versions of the nearest neighbor searching problem and the bounding-sphere1 range searching problem in d-dimensional spaces are given below:

Definition 1 (approximate nearest neighbor searching problem): Given a set S of n data objects in a d-dimensional space, Rd. A data object P∈S is called a (1+ε)-approximate nearest neighbor of a given query object Q with any real ε>0 if for all other data objects P’∈S:

dist(P, Q) ≤ (1+ε).dist(P’, Q)

where dist(X, Y) represents the distance between objects X and Y.

Definition 2 (approximate bounding-sphere range searching problem): Given a set S of n data objects in a d-dimensional space, Rd. For any real ε>0, a given object Q and a real value r≥0, a data object P∈S is considered to belong to the result set of the approximate range query (Q, r, ε) (center Q, radius r, and tolerant error ε) if the following inequality holds:

dist(P, Q) ≤ (1+ε).r

where dist(X, Y) represents the distance between objects X and Y.

1 Note that there are many other possible bounding ranges such as hyper-rectangles, half-spaces, etc.
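To make the two definitions concrete, the following C++ sketch (our illustration, not code from the paper) implements the corresponding acceptance tests, assuming dist is the Euclidean metric:

#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;

// Euclidean distance between two d-dimensional points.
double dist(const Point& x, const Point& y) {
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        double diff = x[i] - y[i];
        s += diff * diff;
    }
    return std::sqrt(s);
}

// Definition 1: P is a (1+eps)-approximate NN of Q iff
// dist(P, Q) <= (1+eps) * dist(P', Q) for all other P' in S.
bool isApproxNN(const Point& P, const Point& Q,
                const std::vector<Point>& S, double eps) {
    for (const Point& Pprime : S)
        if (dist(P, Q) > (1.0 + eps) * dist(Pprime, Q))
            return false;
    return true;
}

// Definition 2: P belongs to the result set of the approximate range
// query (Q, r, eps) iff dist(P, Q) <= (1+eps) * r.
bool inApproxRange(const Point& P, const Point& Q, double r, double eps) {
    return dist(P, Q) <= (1.0 + eps) * r;
}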

Although the problem of answering approximate nearest neighbor and range queries has been studied extensively in a rich literature such as [1, 2, 3, 4], most solutions are satisfactory only for low- or medium-dimensional data sets. In high-dimensional spaces, the key obstacle is the so-called curse of dimensionality (cf. section 3.1). In this paper, we shall introduce adaptations to the SH-tree [5] for answering such approximate queries in high-dimensional feature spaces.

On the other hand, the searching problem mentioned above is confined to a single data set. A query type that appears more frequently in practice is the multi-feature nearest neighbor query, sometimes referred to as a complex vague query [6]. Answering such queries requires the system to search several feature spaces individually and then combine the search results to find the final answers returned to users. The feature spaces are commonly multidimensional and may consist of a vast amount of data. Therefore the searching costs, including IO-cost and CPU-cost, are prohibitively expensive for complex vague queries. For a single feature space, the problem of answering nearest neighbor and approximate nearest neighbor queries to alleviate these costs has been quite well addressed in the literature. In this paper, we shall introduce an approach, called ε-ISA, for finding (1+ε)-approximate nearest neighbor(s) of complex vague queries, which must deal with the problem over multiple feature spaces. This approach is based on the Incremental hyper-Sphere Approach (ISA) [6], a novel, efficient and general algorithm that was introduced for solving the nearest neighbor problem in the Vague Query System (VQS) [7]. To the best of our knowledge, the ε-ISA is one of the first solutions for answering approximate multi-feature nearest neighbor queries. Our experimental results with both uniformly distributed and real data sets will confirm the efficiency of the proposed approach.

Furthermore, besides occurring more frequently, complex vague queries are also more challenging and more practical in modern information/data searching systems such as semantic-based similarity search systems and flexible query answering systems (cf. chapter 3 of [5]). Hence, although in this paper we introduce solutions to all three typical kinds of approximate queries, namely approximate single-feature nearest neighbor queries, approximate bounding-sphere range queries, and approximate complex vague queries, we shall mainly focus on approximate complex vague queries (i.e., approximate multi-feature nearest neighbor queries). Solutions to the two other types of approximate similarity queries follow naturally from the SH-tree (see section 2).

The rest of this paper is organized as follows: Section 2 briefly introduces the SH-tree, which will be used to manage the multidimensional data sets in our later experiments. Section 3 is dedicated to the ε-ISA, our new approach to approximate complex vague queries. Sections 4 and 5 address approximate nearest neighbor queries and approximate bounding-sphere range queries, respectively, for high-dimensional data sets. Finally, section 6 gives conclusions and presents future work.

2 The SH-Tree: A Flexible Super Hybrid Index Structure for Multidimensional Data

The SH-tree has been proven to be one of the most flexible and efficient multidimensional access methods (MAMs). In this section, we concisely introduce the SH-tree structure. More detailed discussions of the tree can be found in [5].

Because the SH-tree is designed not only for point data objects but also for extended data objects, its space partitioning strategy for directory nodes is not overlap-free. The idea of this approach is to easily control objects that cross a selected split position and to solve the storage utilization problem. There are three kinds of nodes in the SH-tree: internal, balanced and leaf nodes. Each internal node i has the structure <d, lo, up, other_info>, where d is the split dimension, lo represents the left (lower) boundary of the right (higher) partition, up represents the right (higher) boundary of the left (lower) partition, and other_info consists of additional information such as the node type, pointers to its left and right children and its parent node, as well as meta-data such as the number of data objects in its left and right children. While up≤lo means there is no overlap between the partitions of its two child nodes,


up>lo indicates that the partitions overlap. This structure is similar to those introduced in the SKD-tree [8] and the Hybrid tree [9]. The supplemental information, such as the meta-data mentioned above, also provides hints for developing a cost model for nearest neighbor queries in high-dimensional space, for estimating the selectivity of range queries, etc. Moreover, let BRi denote the bounding rectangle of internal node i. The bounding rectangle of its left child is defined as BRi∩(d≤ up), where ∩ denotes geometric intersection. Similarly, the bounding rectangle of its right child is defined as BRi∩(d≥ lo). This allows us to apply algorithms used in DP (data partitioning)/R-tree based index techniques to the SH-tree.
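The internal-node layout and the derivation of the children's bounding rectangles can be summarized in the following C++ sketch; this is our illustrative reading of the description above (the type and field names are assumptions, and the split dimension is zero-indexed here), not the SH-tree implementation of [5]:

#include <utility>
#include <vector>

// Axis-aligned bounding rectangle: per-dimension [low, high] intervals.
struct BoundingRect {
    std::vector<double> low, high;
};

// Internal SH-tree node: <d, lo, up, other_info> (other_info elided).
struct InternalNode {
    int d;       // split dimension
    double lo;   // left (lower) boundary of the right (higher) partition
    double up;   // right (higher) boundary of the left (lower) partition
    // other_info: node type, pointers to children/parent, object counts, ...
};

// BR(left child)  = BR ∩ (dimension d <= up)
// BR(right child) = BR ∩ (dimension d >= lo)
// up <= lo: disjoint partitions; up > lo: they overlap in [lo, up].
std::pair<BoundingRect, BoundingRect>
childBRs(const BoundingRect& br, const InternalNode& n) {
    BoundingRect left = br, right = br;
    if (n.up < left.high[n.d]) left.high[n.d] = n.up;
    if (n.lo > right.low[n.d]) right.low[n.d] = n.lo;
    return {left, right};
}

On the running example used later in this section (D = (0, 0, 10, 10) with split (d, lo, up) = (1, 6, 6), where d = 0 in the zero-indexed code stands for the paper's dimension 1), childBRs yields (0, 0, 6, 10) and (6, 0, 10, 10).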

Figure 1. A possible partition of an example data space and the corresponding mapping to the SH-tree

Balanced nodes are just above leaf nodes and are not hierarchical (see Figure 1). Each of them has a structure similar to that of an internal node of the SR-tree [10]. This is a specific characteristic of the SH-trees: it partially preserves data clustering, makes the height of the SH-tree smaller, and exploits the SR-tree's strengths during the querying phase. Moreover, it also shows that SH-trees are not simply binary, as in typical KD-tree based index techniques; they are also multi-way trees, as in R-tree based index techniques:

BN: <B1, B2, … Bn> (minBN_E ≤ n ≤ maxBN_E)
Bi: <BS, MBR, num, child_pointer>

In general, a balanced node consists of n entries B1, B2, …, Bn with minBN_E ≤ n ≤ maxBN_E, where minBN_E and maxBN_E are the minimum and maximum number of entries in the node, respectively. Each entry Bi keeps information about a leaf node in four components: a bounding sphere BS, a minimum bounding rectangle MBR, the number num of objects in the leaf node, and a pointer child_pointer to it.

Each leaf node of the SH-tree has the same structure as that of the SS-tree [11] (the SR-tree is designed only for points, whereas the SH-tree is intended for both points and extended objects):

LN: <L1, L2, … Lm> (minO_E ≤ m ≤ maxO_E)
Li: <obj, info>

As shown above, each leaf node of the SH-tree consists of m entries L1, L2, …, Lm, with minO_E ≤ m ≤ maxO_E, where minO_E and maxO_E are the minimum and maximum number of entries in a leaf, respectively. Each entry Li consists of a data object obj and information info such as the coordinate values of the data object's feature vector, a radius that bounds the data object's extent in the feature space, the data object's MBR, etc. If the data objects in the database are complex, obj is only an identifier instead of a real data object. In addition, in case the SH-tree is applied only to point data objects, each Li is similar to that of the SR-tree: Li: <obj, feature_info>. In this case, the other information about the objects is no longer needed; for example, the parameter radius is always equal to zero and the MBR is the point itself.

[Figure 1 appears here: (a) a possible partition of the example data space into leaf regions 1-16; (b) the corresponding SH-tree, whose internal nodes carry split information (d, lo, up) such as (1, 6, 6), (2, 3, 4), (2, 5, 6), (1, 3, 3) and (2, 8, 8); legend: internal node, balanced node, leaf node, MBR, BS, overlapping space.]
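The balanced-node and leaf-node layouts just described can be captured by the following illustrative C++ declarations (a sketch with assumed names, not code from [5]):

#include <vector>

struct BoundingSphere { std::vector<double> center; double radius; };
struct MBR { std::vector<double> low, high; };

struct LeafNode;  // defined below

// Entry of a balanced node: <BS, MBR, num, child_pointer>.
struct BalancedEntry {
    BoundingSphere bs;   // bounding sphere of the leaf node
    MBR mbr;             // minimum bounding rectangle of the leaf node
    int num;             // number of objects in the leaf node
    LeafNode* child;     // pointer to the leaf node
};

// Balanced node: minBN_E <= entries.size() <= maxBN_E.
struct BalancedNode { std::vector<BalancedEntry> entries; };

// Entry of a leaf node: <obj, info>. For point data, info reduces to the
// feature vector itself (radius 0, the MBR degenerates to the point).
struct LeafEntry {
    long obj;                     // data object, or just its identifier
    std::vector<double> feature;  // feature vector (part of info)
    double radius;                // bounds the object's extent; 0 for points
};

// Leaf node: minO_E <= entries.size() <= maxO_E.
struct LeafNode { std::vector<LeafEntry> entries; };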

Figures 1a and 1b show a possible partition of an example feature space and its corresponding mapping to the SH-tree, respectively. Assume we have a two-dimensional feature space D with the size of (0, 0, 10, 10). With the split information (d, lo, up) = (1, 6, 6), the BRs of the left and right children of internal node 1 are BR2 = D∩(d≤ 6) = (0, 0, 6, 10) and BR3 = D∩(d≥ 6) = (6, 0, 10, 10), respectively. For internal node 2, assuming (d, lo, up) = (2, 3, 4), we have BR4 = BR2∩(d≤ 4) = (0, 0, 6, 4), BR5 = BR2∩(d≥ 3) = (0, 3, 6, 10). Similarly, for internal node 3, with (d, lo, up) = (2, 5, 6) we obtain BR6 = BR3∩(d≤ 6) = (6, 0, 10, 6), BR7 = BR3∩(d≥ 5) = (6, 5, 10, 10), and so on. Note that the coordinate information of the BRs is not stored explicitly in the SH-tree2; it is computed dynamically when necessary.

Furthermore, the storage utilization constraints of the SH-tree must ensure that each balanced node is always filled with at least minBN_E entries and each data page contains at least minO_E data objects. Therefore, the subspace corresponding to a balanced node holds DON data objects (DON denotes the number of data objects of the balanced node), where DON satisfies the following condition: minO_E × minBN_E ≤ DON ≤ maxO_E × maxBN_E. We will see that the SH-tree is also very efficient for approximate nearest neighbor (NN) and range queries in high-dimensional spaces.
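This occupancy constraint translates directly into a check such as the following (an illustrative helper, not from [5]):

// minO_E × minBN_E <= DON <= maxO_E × maxBN_E
bool validOccupancy(long DON, long minO_E, long minBN_E,
                    long maxO_E, long maxBN_E) {
    return minO_E * minBN_E <= DON && DON <= maxO_E * maxBN_E;
}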

3 Solving Approximate Complex Multi-Feature Nearest Neighbor Queries

As discussed in [6], complex multi-feature nearest neighbor (M-FNN) queries can be addressed by the Incremental hyper-Sphere Approach (ISA). Although the ISA is more efficient than a previously proposed approach, the Incremental hyper-Cube Approach (ICA) [12], in terms of both IO- and CPU-cost, in this section we shall present important modifications that further improve the ISA so that it also works well in high-dimensional feature spaces and when the number of query conditions increases.

3.1 Introduction and Related Research

Similarity search has emerged as a fundamental paradigm for a variety of application areas such as image/multimedia processing [13, 14], data mining [15], CAD database systems [16], time-series databases [17, 18], information retrieval (IR) systems [19], geographical information systems (GISs) and tourist information systems (TISs) [20], digital libraries [21], etc. In these contexts, an important problem is how to find the data object(s) most similar to a given query object. The standard approach is to perform the search in feature vector repositories of the data objects and use specific computational functions, e.g., the Euclidean metric, to measure the distance between feature vectors. Similarity search therefore becomes a nearest neighbor query over the feature vector spaces.

Notably, there are some proposals to extend conventional relational database management systems (RDBMSs) with similarity search capabilities, such as [7, 22, 23, 24, 25] (see [5] for detailed discussions). In particular, in [7], Kueng and Palkoska proposed an approach to the similarity search problem in the conventional RDBMS context that has aspects similar to approaches to the NN problem in the feature vector repositories of the modern database applications just mentioned. More concretely, the VQS, introduced in [7], is an extension of conventional RDBMSs that can operate on top of them to retrieve tuples semantically close to users' queries. The key feature of the VQS is the concept of Numeric-Coordinate-Representation-Tables (NCR-Tables), which store semantic meta-information about attributes. In fact, attributes of arbitrary types in a query relation/view are mapped to Euclidean spaces and kept in NCR-Tables. When the RDBMS fails to find an exact match for a given query Q, the VQS searches some NCR-Tables corresponding to the query conditions of Q and returns the best match for Q. Intuitively, NCR-Tables in the VQS are equivalent to the feature vector spaces mentioned above.

2 Except for the root node, where the coordinates of the whole data space (in the running example, D = (0, 0, 10, 10)) are kept track of and incrementally updated during the tree creation process. Using this information together with the split information in the internal nodes, we can calculate the BRs of the subspaces while traversing the SH-tree.


Solving NN queries has been researched for a long time and has attracted the attention of many researchers, e.g., [26, 27, 28, 29, 30]. The de facto standard solution is to employ MAMs [31, 32] to index the feature values, which are often multidimensional vectors. Searching is then done with the support of MAMs to speed up the search and to decrease the costs, including IO-cost and CPU-cost. Nevertheless, when the number of dimensions of the feature spaces increases, MAMs typically suffer from the so-called dimensionality curse: it has been observed that in high-dimensional feature spaces, e.g., with more than 16 dimensions, the probability of overlap between a query and data regions is very high. The execution of a similarity query may therefore require accessing many of the data regions, and thus the performance of MAMs significantly decreases. Indeed, in such cases a linear scan over the whole data set would perform better than MAMs [33, 34].

In practice, the mapping of attributes of objects to coordinates of vectors is heuristic in many application domains [35]. Therefore, an approximate similarity search is in many cases as good as an exact search or a “true” similarity search. As presented in Definition 1, given a data set S and a query object Q, a data object P∈S is called a (1+ε)-approximate nearest neighbor of Q with ε>0 if for all other data objects P’∈S:

dist(Q, P) ≤ (1+ε).dist(Q, P’) (1)

A great number of previous studies have concentrated on approximate NN queries over a single feature space, such as [1, 3, 35, 36, 37, 38]. Here we call such queries approximate single-feature NN (S-FNN) queries. In section 4 we shall elaborate more on this query type.

Although there are also numerous studies of M-FNN query answering, such as [25, 39, 40, 41, 42, 43], none of their solutions can be applied to the approximate M-FNN queries in the VQS (see [12] for more details about M-FNN queries in the VQS). Recently, we introduced an approach named the Incremental hyper-Sphere Approach (ISA) [6], which has been shown to be efficient and general for supporting similarity search capabilities in semantic based similarity search systems3. The ISA is an improvement and a generalization of an approach proposed for the VQS in [12]. In particular, the ISA can be applied to searching multiple feature spaces that do not satisfy the three conditions under which Fagin's algorithm (the A0 algorithm) [39] and all its improvements such as [41, 42] become applicable. The three conditions on the feature spaces required to employ Fagin's algorithm are as follows [44]:

(1) There is at least one key for each feature space to be searched.
(2) There is a mapping between the keys.
(3) The mapping must be one-to-one.

Intuitively, condition (1) is always satisfied in the VQS, but condition (3) does not hold: in the VQS, each Fuzzy Field is also the key for the corresponding NCR-Table, but there is no one-to-one mapping between the Fuzzy Fields of the NCR-Tables. Figure 2 depicts an illustrative example to make this clearer.

Figure 2. Random access is impossible in the VQS

In Figure 2 we have a query relation with two attributes Attr_1 and Attr_2, whose attribute values are mapped to the NCR-Tables NCR_1 and NCR_2, respectively. Moreover, let's assume that attribute Attr_1 (resp. Attr_2) is mapped to attribute Domain_1 (resp. Domain_2) in the table NCR_1 (resp. NCR_2) and becomes the primary key of NCR_1 (resp. NCR_2). To make the NCR-Tables NCR_1 and NCR_2 satisfy the three conditions mentioned above, we must ensure that for each value of the key attribute Domain_1 in the table NCR_1 there is only one value of the key attribute Domain_2 in the table NCR_2 (and vice versa), and that both of them belong to a certain tuple in the query relation. If this is satisfied then random accesses to the NCR-Tables are possible. Nevertheless, this does not hold in the context of our problem: Figure 2 shows an example situation in which value x1 in NCR_1 can be combined with at least two other values y1, y2 in NCR_2 to form two different tuples (x1, y1) and (x1, y2) in the running query relation (another example shown in Figure 2 is value y1 in table NCR_2 and values x1, x2 in table NCR_1 with respect to the two tuples (x1, y1) and (x2, y1) in the query relation).

3 See [5] for our definition and discussions of semantic based similarity search systems.

Recently, two algorithms have been proposed in [45, 46] for dealing with approximate NN queries over multiple feature spaces. We call these queries approximate M-FNN queries. Nevertheless, the algorithm for approximate M-FNN queries introduced in [46] solves the problem in a different context from the one the VQS is facing. In detail, the authors presented an algorithm to find multimedia objects that are approximately similar to a given object and then proved that this approximate algorithm is instance optimal4. This algorithm, however, assumes that random access [39] is possible. This assumption holds only when the three conditions presented above are satisfied.

Moreover, in [45] the authors introduced the J* algorithm, a generalization of the A0 algorithm [39] for dealing with M-FNN queries, which closely resembles the ISA5. They then presented a modified version of the J* algorithm to address approximate M-FNN queries. However, we must note that although the ISA looks very similar to the J* algorithm, they are intrinsically different:

(1) The J* algorithm assumes the availability of ranked values on each subsystem involved in the query conditions. The ISA does not assume this; instead, it shows how to provide this availability using the incremental algorithm adapted for range queries, which is modified from one of the state-of-the-art algorithms for the k-nearest neighbor problem [47] (see [5, 6] for details of the adapted algorithm).

(2) The J* algorithm firstly reduces the processed states by inheriting all properties of an A* algorithm [48]; the database access cost is alleviated by an iterative deepening technique. With the ISA, the database access cost and the processed states are reduced by taking into account the incremental hyper-sphere range queries over each index and by computing the maximum searching radii dynamically (see the optimal ISA version in chapter 6 of [5]).

Also, the ISA was inspired by the Incremental hyper-Cube Approach of Kueng et al. [12], which was introduced earlier and had the same overall goals as the J* algorithm. In this paper we present an approach based on the ISA for solving approximate M-FNN queries, which are also called approximate complex vague queries (CVQs), efficiently and generally.

The rest of this section is organized as follows: Section 3.2 concisely summarizes and discusses the ISA, which is the base of our new approach presented in the next section. Section 3.3 is dedicated to introducing ε-ISA: an incremental lower bound approach for efficiently finding approximate nearest neighbors of CVQs. Section 3.4 presents experimental results that prove the efficiency of our proposed approach, the ε-ISA.

3.2 The ISA: An Efficient Approach for Solving Complex Vague Queries

As we already know, query processing in conventional RDBMSs is not flexible enough to support similarity search capabilities directly. That means that when the available data in a relational database does not match a user's query precisely, the system will only return an empty result set to the user. This limits their applicability to domains where only crisp answers are meaningful. In many other application domains, however, users expect not only the crisp results but also some other results close to the query in a certain sense. Systems that can solve this problem are called flexible query answering systems (FQASs) [5]. Among the proposed FQASs, the VQS [7] is very notable because it can be built easily, is extensible, and can work on top of existing RDBMSs. In particular, the VQS employs NCR-Tables to store semantic background information about the attributes of a query relation/view, and the similarity search is then carried out with their support. This aspect is similar to approaches in modern information retrieval applications [19]. In [12], to enhance the VQS, the authors presented an Incremental hyper-Cube Approach (ICA) for finding the best match for M-FNN queries. Unfortunately, the ICA is not general and has weaknesses that degrade the search performance of the VQS. Recently, in [6] we introduced an Incremental hyper-Sphere Approach (ISA) to improve and generalize the ICA for solving M-FNN queries. One of the ISA's key features is the use of hyper-sphere range queries instead of the hyper-cube ones of the ICA, reducing I/O- and CPU-cost during the search process. Figure 3 illustrates the scenario for this idea.

4 Let A be a class of algorithms, let D be a class of databases, and let cost(Ax, Dx) be the middleware cost incurred by running algorithm Ax over database Dx. We say that an algorithm Bx is instance optimal over A and D if Bx∈A and if for every Ax∈A and every Dx∈D we have: cost(Bx, Dx) = O(cost(Ax, Dx)). This means that there are constants c and c' such that cost(Bx, Dx) ≤ c.cost(Ax, Dx) + c'. Here c is referred to as the optimality ratio. More details are given in [46].
5 The M-FNN queries in the context of both the ISA and the J* algorithm are similar to each other and different from those of the A0 algorithm. With the ISA and the J* algorithm, no assumption about random accesses to the system is made.

Figure 3. Advantages of the ISA over the ICA

Assume that the feature space illustrated in Figure 3 is involved in a query condition qi of a running CVQ6, and that the first appropriate tuple of the query relation is found after finishing hyper-cube range query hi0 (see [12] for details of the ICA). In the next step, to verify whether or not this tuple is the best match for the query, the ICA must extend the searching radius using formula 3 (see Figure 4) and create a new hyper-cube query hi1. Intuitively, the disk pages intersecting and the objects located in the gray area s = hi1 - ci1 need not be accessed and verified, but the ICA always does so while executing query hi1. With the ISA, only objects within hyper-sphere ci1 and disk pages intersecting this hyper-sphere are accessed and further examined.

Moreover, for each NCR-Table (i.e., feature space) involved in the query, the ISA employs an incremental algorithm for range queries, which is modified from one of the state-of-the-art algorithms for k-NN queries originally presented in [27]. Briefly, the main purpose of this algorithm is to overcome the unnecessary disk access problem for hyper-sphere range queries: the results of a previous range query during the search on each feature space can be reused for the extended range query that follows. This modified algorithm is also proven to be as optimal in terms of the number of disk accesses as the original one (see [5] for details). Figure 4 summarizes the high-level steps of the final ISA version.

6 Note that both the ICA and the ISA have been designed to solve M-FNN queries, but in Figure 3, for the sake of illustration, we depict only one possible example feature space.
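The adapted incremental range-query algorithm itself is given in [5, 6]; the C++ sketch below only conveys its central idea: keeping the distance-ordered priority queue alive between successive, growing radii, so that the work done for query qi0 is reused by qi1 and qi2. The node model and all names here are our simplifying assumptions:

#include <queue>
#include <vector>

// A queued item: an index node or a data object, with its (lower-bound)
// distance to the query point.
struct Item {
    double dist;
    bool isObject;
    int id;
};
struct Farther {
    bool operator()(const Item& a, const Item& b) const {
        return a.dist > b.dist;  // min-heap on distance
    }
};
using Queue = std::priority_queue<Item, std::vector<Item>, Farther>;

// Search state for one feature space; it persists across calls, so
// extending the radius resumes exactly where the last query stopped.
struct IncrementalRangeSearch {
    Queue pq;                  // initialized with the index root
    std::vector<int> results;  // objects reported so far (dist <= radius)

    // Report all objects with distance <= radius. expand(id, pq) is assumed
    // to push the children (or objects) of index node id with their
    // distances to the query point.
    template <typename ExpandFn>
    void extendTo(double radius, ExpandFn expand) {
        while (!pq.empty() && pq.top().dist <= radius) {
            Item it = pq.top();
            pq.pop();
            if (it.isObject) results.push_back(it.id);
            else             expand(it.id, pq);
        }
        // Items with dist > radius stay queued for the next extension.
    }
};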

Input:
- A query relation/view S.
- A complex vague query Q with n query conditions qi (i=1, 2… n).
- Assume each feature space (or NCR-Table) related to Q is managed by a multidimensional index structure Fi.

Output:
- Best match record/tuple Tmin for Q, Tmin∈S. Ties are arbitrarily broken.

Step 1: Search on each Fi for the corresponding qi using the adapted incremental algorithm for hyper-sphere range queries.
Step 2: Combine the searching results from all qi to find at least one appropriate record in S, which contains the NCR-Values returned with respect to each query condition. If no appropriate record is found, go back to step 1.
Step 3: Compute total distances/scores for the found records using formula 2 below and find a record Tmin with the minimum total distance TDcur. Ties are broken arbitrarily.

TD = (Σi (wi * di / Di)) / wsum    (2)

where:
TD      total distance/score of a record
Di      diameter of the ith feature space
wi      weight of query condition qi
wsum    sum of all weights wi
di      distance from qi to the returned NCR-Value in the corresponding ith feature space

Step 4: Compute the maximum searching radius for each qi with respect to TDcur using formula 3 below and continue the search as in steps 1, 2 and 3 until one of the two following conditions holds: (a) the current searching radius of each qi is greater than or equal to its maximum searching radius; (b) a new appropriate record Tnew is found with total distance TDnew < TDcur.

rinew = Di * TDcur * wsum / wi    (3)

where:
rinew           new searching radius of qi
Di, wi, wsum    defined as in formula 2

Step 5: If condition (a) holds then return Tmin as the best match for Q. Otherwise, i.e., condition (b) holds, replace Tmin with Tnew (TDcur is thus replaced with the smaller value TDnew) and go back to step 4.

Figure 4. The Incremental hyper-Sphere Approach (ISA) for answering complex M-FNN queries

With reference to Figure 4, the input information of the ISA can be viewed in a different form, as shown in Figure 5: finding the best match for a user's query requires searching several NCR-Tables, which keep semantic background meta-information about the attributes of the query relation/view and are managed by MAMs.

Query relation:
ID     Attribute 1   …   Attribute n
Id_1   Value_11      …   Value_n1
…      …             …   …
Id_k   Value_1k      …   Value_nk

[Figure 5 also depicts the user's query processing module operating over NCR-Table 1, …, NCR-Table n, each managed by Index 1, …, Index n.]

Figure 5. Overview of the problem solved by the ISA

After carrying out step 1, several NCR-Values are returned for each query condition qi. As noted above, the incremental algorithm adapted for range queries is used here to avoid repetitive disk accesses when processing the range queries. This is illustrated in Figure 6: during the execution of range queries qi1 and qi2, the algorithm does not have to access again the disk pages accessed by range queries qi0 and qi1, respectively. This is one of the advantages of the ISA over the ICA: in Figure 3, for instance, to process range query hi1, the ICA must access again all disk pages accessed during the execution of range query hi0.

Figure 6. Incremental algorithm adapted for range queries

Step 2 of the algorithm tries to find an appropriate record/tuple that belongs to the query relation and contains the NCR-Values returned in step 1 for each query condition (cf. Figure 5). Step 3 is intuitive: among the appropriate records found in step 2, the ISA chooses one, Tmin, having the minimum total distance TDcur, which is calculated by formula 2. Depending on TDcur, the ISA computes the maximum searching radius for each query condition qi using formula 3. The search is continued and, to reduce the searched space for each qi, these maximum searching radii are dynamically updated whenever a new record Tnew is found whose total distance TDnew<TDcur, i.e., condition (b) in step 4 holds. In this case, Tnew replaces Tmin and the search then continues until condition (a) in step 4 is satisfied. At this time the current Tmin is returned in step 5. Note that when condition (a) in step 4 does not hold for a certain qi, the search is still continued for that qi. This also means the ISA is not a uniform algorithm [41]. As proven in [5], after finishing step 5, there is no other record in the entire query relation/view S having a smaller total distance to Q than that of the returned record Tmin. Therefore, the ISA correctly returns the best match for the query Q and completely overcomes the shortcomings of the ICA. Furthermore, we also showed in [5] that the ISA's improvement over the ICA increases as the dimension d of the data sets (NCR-Tables) involved in the query increases.
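Formulas 2 and 3 translate directly into code. The following is a small sketch (illustrative names only, not the paper's implementation) of the two computations driving steps 3 and 4:

#include <cstddef>
#include <vector>

// Formula 2: TD = (Σi (wi * di / Di)) / wsum for one candidate record,
// where di is the distance from qi to the returned NCR-Value, Di the
// diameter of the ith feature space and wi the weight of condition qi.
double totalDistance(const std::vector<double>& d,
                     const std::vector<double>& D,
                     const std::vector<double>& w) {
    double td = 0.0, wsum = 0.0;
    for (std::size_t i = 0; i < d.size(); ++i) {
        td += w[i] * d[i] / D[i];
        wsum += w[i];
    }
    return td / wsum;
}

// Formula 3: maximum searching radius of condition qi given the current
// best total distance TDcur.
double maxSearchRadius(double Di, double TDcur, double wsum, double wi) {
    return Di * TDcur * wsum / wi;
}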

Although the experimental results in [6] have shown that the ISA significantly outperforms the ICA originally introduced in [12], by orders of magnitude in terms of both I/O- and CPU-cost, its performance still decreases when the number of dimensions of the feature spaces or the number of query conditions increases. For high-dimensional feature spaces, the problem is caused by the dimensionality curse, by which the probability of overlaps between a query and data regions is very high; hence query processing incurs more I/O and CPU costs. For the second cause, the increase of the query condition number, Figure 7 shows our experimental results for 100 randomly selected three-condition queries in only two-dimensional spaces for both the ICA and the ISA (see section 3.4 for more information on the common test conditions). We can observe that the affected object number of the ISA, which is directly proportional to CPU-cost, is not much less than that of the ICA, and thus the ISA's performance decreases. To face these problems we propose a variant of the ISA called the ε-ISA to answer CVQs (or M-FNN queries) approximately. Section 3.3 details the ε-ISA.

Figure 7. The ISA's performance still degrades when the query condition number increases: (a) affected object number (×1000) and (b) disk access number of the ICA and the ISA on three 2-dimensional data sets (2-d1, 2-d2, 2-d3).


3.3 The ε-ISA: Towards Efficiently Solving Approximate M-FNN Problem

The inequality shown in equation 1 has been applied to approximate S-FNN queries, as in [1]. Here we define a similar inequality that will be used for dealing with approximate M-FNN queries:

Definition 3 (the approximate nearest neighbor of a complex vague query): Given an M-FNN query Q of n query conditions, a query relation/view S and the feature spaces (e.g., the NCR-Tables) corresponding to the mapped attributes of S. Assume the total distance (TD) from each record of S to Q is calculated by formula 2. A record Tapp∈S is defined as a (1+ε)-approximate nearest neighbor of Q with any real ε>0 if for all records T∈S the total distances satisfy:

TD(Q, Tapp) ≤ (1+ε).TD(Q, T)    (4)

The inequality in equation 4 resembles the one in equation 1. However, there are differences inside them, since a CVQ may consist of several S-FNN queries. Each query condition of a CVQ is addressed separately, and the distance metrics can differ among the subsystems dealing with these query conditions. Nevertheless, the returned distances of the objects corresponding to each query condition are normalized into the interval [0, 1], i.e., the factor (di/Di) in formula 2. The total distance TD of each record to the query Q, calculated using formula 2, is the weighted sum of the normalized element distances just mentioned7. In accordance with the description of the ISA in Figure 4, an appropriate record Tmin having the minimum TD among the found records must be verified to be the nearest neighbor of Q. For the approximate problem, however, if its total distance TDcur satisfies the inequality in equation 4, it is returned immediately as a (1+ε)-approximate nearest neighbor of Q without any further search. Section 3.3.1 presents a variant of the ISA that uses a lower bound total distance (LBTD) of the "true" nearest neighbor of Q to solve this problem.

3.3.1 Finding Approximate Nearest Neighbor

The processing steps of the ε-ISA, described in Figure 8 below, are quite similar to those of the ISA described in Figure 4, except for the following differences. First, the input information includes a real ε>0 that serves as a tolerant error for judging whether or not a found appropriate record is a (1+ε)-approximate NN of the query. In practice, this value is usually supplied by the user. Second, the output of the ε-ISA is an approximate NN instead of the "true" NN of the query. As with the ISA, from steps 1 to 3 the ε-ISA tries to find an appropriate record in the query relation/view S. The record Tapp chosen in step 3 has the minimum TD to the query among the found appropriate records. At this point we still do not know whether Tapp is a (1+ε)-approximate NN of Q, even though it is a candidate for the best match of the query. However, in step 4 the ε-ISA then computes an LBTD for the "true" NN of Q as follows:

LBTD = min {TDcur, di}, i = 1, 2… n    (5)

where n is the number of query conditions of the complex vague query being processed, i.e., the query Q.

7 Note that other scoring aggregation functions that are monotone, e.g., min and max, can also be employed. More discussions are given in [49].

Input:
- A query relation/view S.
- A complex vague query Q with n query conditions qi (i=1, 2… n).
- Assume each feature space (or NCR-Table) related to Q is managed by a multidimensional index structure Fi.
- A real ε>0 used as a tolerant error.

Output:
- A (1+ε)-approximate NN record/tuple Tapp for Q, Tapp∈S. Ties are arbitrarily broken.

Step 1: Search on each Fi for the corresponding qi using the adapted incremental algorithm for hyper-sphere range queries.
Step 2: Combine the searching results from all qi to find at least one appropriate record in S, which contains the NCR-Values returned with respect to each query condition. If no appropriate record is found, go back to step 1.
Step 3: Compute total distances for the found records using formula 2 and find a record Tapp with the minimum total distance TDcur. Ties are broken arbitrarily.
Step 4: Let di be the weighted normalized distance from query condition qi to the last NCR-Value returned in the corresponding feature space, which is managed by Fi. Compute LBTD = min{TDcur, di}, i=1, 2… n.
Step 5: If TDcur ≤ (1+ε).LBTD, return Tapp as a (1+ε)-approximate NN record for Q. Otherwise, go to step 6.
Step 6: Compute the maximum searching radius for each qi with respect to TDcur using formula 3 and continue the search as in steps 1 to 5 until the algorithm stops at step 5. If the current searching radius of a certain qi is greater than or equal to its maximum searching radius, searching on Fi is stopped.

Figure 8. The ε-ISA: Finding a (1+ε)-approximate nearest neighbor of a CVQ

In formula 5, each di is the weighted normalized distance from the query condition qi to the farthest object (i.e., NCR-Value in the VQS) returned so far in the corresponding feature space, which is managed by a multidimensional index structure Fi. Because the incremental algorithm adapted for range queries employs a priority queue [6], di is also the weighted normalized distance from qi to the last object returned so far in the corresponding feature space. In particular, each di is calculated as follows:

di = wi * ti / (Di * wsum), i = 1, 2… n    (6)

where:
wi      weight of the corresponding query condition qi
wsum    sum of the weights over all query conditions
Di      diameter of the corresponding ith feature space
ti      raw (not normalized) distance from qi to the last object returned so far in the ith feature space
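Formulas 5 and 6 combine into the following sketch of the lower-bound computation in step 4 of the ε-ISA, together with the acceptance test of step 5 (a minimal sketch with illustrative names):

#include <algorithm>
#include <cstddef>
#include <vector>

// Formula 6: di = wi * ti / (Di * wsum); formula 5: LBTD = min{TDcur, di}.
double lowerBoundTotalDistance(double TDcur,
                               const std::vector<double>& t,  // raw radii ti
                               const std::vector<double>& D,  // diameters Di
                               const std::vector<double>& w)  // weights wi
{
    double wsum = 0.0;
    for (double wi : w) wsum += wi;
    double lbtd = TDcur;
    for (std::size_t i = 0; i < t.size(); ++i)
        lbtd = std::min(lbtd, w[i] * t[i] / (D[i] * wsum));
    return lbtd;
}

// Step 5: accept Tapp as a (1+eps)-approximate NN once this holds.
bool acceptApproxNN(double TDcur, double lbtd, double eps) {
    return TDcur <= (1.0 + eps) * lbtd;
}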

Basically, the main idea here is to find a value of LBTD that is guaranteed to always be less than or equal to the distance from the true NN to the query Q. To make this clearer, Figure 9 depicts an example illustrating this idea with a two-condition M-FNN query.

(a) [Sketch: sequential hyper-sphere range queries q10, q11, q12 around query condition q1 and q20, q21, q22 around q2, with returned NCR-Values A, B, C, D; the solid line marks the found record Tapp(A, B, …), the dashed lines the records T1 and T2, and LBTD = min (TDTapp, TDT1, TDT2).]

(b) Query relation:
Attr_1   Attr_2   …
A        B        …
C        q2       …
q1       D        …
…        …        …

Figure 9. Lower Bound Total Distance of an M-FNN query


In Figure 9a, assume that the query Q with two conditions q1 and q2 is executing and that the sequential range queries for q1 and q2 are q10, q11, q12 and q20, q21, q22, respectively. These range queries are carried out in step 1 of the algorithm. Figure 9b shows a part of a possible example query relation/view with respect to Figure 9a. Assume that an appropriate record Tapp(A, B, …) with total distance TDcur is found after step 3 of the ε-ISA, as sketched by the solid line segment connecting the two objects A of q11 and B of q20. Moreover, we also suppose that the ε-ISA is now executing range queries q12 and q22.

Intuitively, if Tapp is not the "true" NN of Q, then the total distance TD of this "true" NN cannot be less than the weighted normalized searching radii d1 and d2 of range queries q12 and q22, which are computed as follows: d1=rq12*w1/(D1*(w1+w2)) and d2=rq22*w2/(D2*(w1+w2)), where rq12, rq22 are the radii of q12 and q22, respectively, and w1, w2, D1, D2 are defined as in formula 6. This total distance TD is equal to the weighted normalized radius of q12, i.e., equals d1, for example, when the appropriate record found is an exact composition of q2 and a returned NCR-Value C of the range query q12 (see the first dashed line T1 in Figure 9a, which represents an appropriate record T1(C, q2, …) in Figure 9b). A similar explanation holds for the dashed line T2. Therefore, in accordance with Figure 9, we have: LBTD = min {TDcur, weighted normalized radius of q12, weighted normalized radius of q22} = min {TDTapp, d1, d2}. Note that we do not exclude the case in which Tapp is itself the "true" NN of Q and, practically, we can also observe that LBTD = min {TDTapp, TDT1, TDT2}, where TDT1, TDT2 are the total distances of records T1, T2 to the query, respectively. Therefore, we call the di terms in formula 5 partial total distances.

Depending on the LBTD computed in step 4, step 5 of the algorithm decides whether or not record Tapp is a (1+ε)-approximate NN of Q. Step 6 is quite similar to step 4 of the ISA and ensures that the ε-ISA does not access unnecessary objects and unneeded disk pages during the search process. If the current searching radii of all qi are greater than or equal to their corresponding maximum searching radii, then searching on every Fi is stopped and the returned record is actually the "true" NN of the query Q (because at this point the ε-ISA simply becomes the ISA).

The correctness of the ε-ISA follows readily from the discussions above and from the correctness of the ISA, which is mathematically proven in [5].

3.3.2 A Generalized Algorithm for Finding Approximate k-Nearest Neighbors

The ε-ISA version described in section 3.3.1 returns just one (1+ε)-approximate NN of a CVQ. In practice, most users need the approximate top-k NNs of the CVQ. We define a record/tuple to be a (1+ε)-approximate kth NN of a complex M-FNN query as follows:

Definition 4 (the approximate kth nearest neighbor of a complex M-FNN query): Given an M-FNN query Q of n query conditions, a query relation/view S and the feature spaces (e.g., the NCR-Tables) corresponding to the mapped attributes of S. Assume the total distance (TD) from each record of S to Q is calculated by formula 2. A record Tapp∈S is defined as a (1+ε)-approximate kth NN of Q with ε>0 if the ratio between the TD from Q to Tapp and the TD from Q to its true kth NN is at most (1+ε).

In fact, in Definition 4, a (1+ε)-approximate kth NN of Q can be closer to Q than the true kth NN (cf. [1]). On the other hand, note that the incremental algorithm adapted for range queries [5], which is used by the ε-ISA as described in Figure 8, originates from an algorithm for finding the k nearest neighbors of an S-FNN query in which k may be unknown in advance [27, 47]. Hence, the problem of finding approximate k-nearest neighbors for M-FNN queries, where k need not be known beforehand, is easily solved by making some modifications to the ε-ISA, as presented below:

(1) A priority queue PQ (or a sorted buffer) is needed to keep all appropriate records for Q (as well as other necessary information such as their total distances to the query) that have been found in step 3 of the ε-ISA as introduced in Figure 8.

(2) The next approximate NN of the query Q is verified similarly to the ε-ISA version above, but here the record at the top of PQ plays the role of Tapp in step 3 of the algorithm (i.e., the ε-ISA).

(3) When a record at the top of PQ is ensured to be the next approximate NN to the query Q, it will be added to the result set and be removed from PQ.


(4) If PQ is empty but the user still needs more results, the search continues normally on all the involved feature spaces. Owing to the properties of the incremental algorithm adapted for range queries, the search continues from where it left off when computing the previous approximate NN.

The correctness of this ε-ISA version is easily inferred from that of the ε-ISA presented in the previous section. Besides, to address the approximate k-NN problem for complex M-FNN queries when the value of k is known in advance, we do not need to maintain all appropriate records for Q in a PQ as discussed above, but only the k closest appropriate records to Q. The other modifications in this case remain as described above; a sketch of this bookkeeping is given below. In section 3.4 we show experimental results that demonstrate the efficiency of the ε-ISA.
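The bookkeeping behind modifications (1)-(3) can be sketched as follows: appropriate records are buffered in a min-priority queue ordered by total distance, and the top record is emitted as soon as it passes the (1+ε) test against the current LBTD. All names here are illustrative assumptions, not code from the paper:

#include <queue>
#include <vector>

struct Candidate {
    long recordId;  // appropriate record found in step 3
    double td;      // its total distance to the query (formula 2)
};
struct ByTD {
    bool operator()(const Candidate& a, const Candidate& b) const {
        return a.td > b.td;  // min-heap on total distance
    }
};
using CandidateQueue =
    std::priority_queue<Candidate, std::vector<Candidate>, ByTD>;

// Emit the next (1+eps)-approximate NN if the best buffered candidate
// already satisfies td <= (1+eps) * LBTD; otherwise the caller continues
// the search (which can only raise LBTD) and tries again.
bool tryEmitNext(CandidateQueue& pq, double lbtd, double eps,
                 std::vector<Candidate>& resultSet) {
    if (!pq.empty() && pq.top().td <= (1.0 + eps) * lbtd) {
        resultSet.push_back(pq.top());
        pq.pop();
        return true;
    }
    return false;
}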

3.4 Experimental Results for Approximate M-FNN Queries

In order to establish the practical value of our algorithm, the ε-ISA, we implemented it based on the ISA and carried out a number of experiments to compare the ε-ISA with the ISA. Our experiments used the following data sets:

(1) Uniformly distributed (also called synthetic) data sets of 2, 4 and 8 dimensions. Each synthetic data set has 100 000 objects of the form <id, values list> and serves as a feature space (i.e., an NCR-Table in the VQS).

(2) Real data sets of 9 and 16 dimensions: Each of them has more than 64 000 feature vectors of images of the form <id, values list> and they are based on color moments and color histogram of the images, respectively (cf. http://kdd.ics.uci.edu/).

The experiments were carried out for CVQs with 2 and 3 query conditions. For each feature space in all experiments, 100 query points were randomly selected from the corresponding data set. Furthermore, we used the SH-tree to manage the storage and retrieval of the multidimensional data objects in each feature space. We selected the SH-tree because it is one of the most flexible and efficient multidimensional index structures [5]. All tests were run on an Ultra-SPARC-II 360 MHz with 512 MB of main memory and several gigabytes of hard disk capacity. All programs were implemented in C++. The page size was 8 KB for the data sets, so as to match the disk block size of the operating system.

For all experiments, we simulated the database access to the query relation by comparing the id fields of the affected objects. For example, if a 4-dimensional object <id4, values list> and an 8-dimensional object <id8, values list> were retrieved during the search and id4=id8, then they were considered to belong to one record of the query relation. Such a record is called an appropriate record in this paper. Intuitively, for a particular query optimization processor in the database system, the affected object number in each feature space is directly proportional to the cost of database access. Therefore, we report the number of disk accesses of the searches on the SH-trees and the number of affected objects for all experiments. The affected objects here are the objects located in the last hyper-sphere range query of each feature space.
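Under the <id, values list> object model described above, this simulation amounts to an id intersection over the affected objects of the feature spaces; for two spaces it can be sketched as follows (illustrative code, not the experimental harness itself):

#include <unordered_set>
#include <vector>

// Objects affected in two feature spaces are joined on their id field;
// each matching id corresponds to one appropriate record of the
// simulated query relation.
std::vector<long> appropriateRecords(const std::vector<long>& affected1,
                                     const std::vector<long>& affected2) {
    std::unordered_set<long> ids(affected1.begin(), affected1.end());
    std::vector<long> records;
    for (long id : affected2)
        if (ids.count(id)) records.push_back(id);
    return records;
}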

Our first experiment was carried out with the data sets of 4 and 8 dimensions to find (1+ε)-approximate NN for 2-condition M-FNN queries. The ε values selected for the tests were 0.2, 0.4, 0.6, 0.8 and 1. Figures 10a, 10b and 10c show the experimental results: Figure 10a shows the percentage of disk access savings for each feature space against ε values; Figure 10b shows the percentage of affected object savings against ε values; and Figure 10c depicts the accuracy of the approximate results, compared to the true NNs, with respect to the ε values.

We observe in the first experiment that the accuracy is very high while all costs are significantly reduced. For example, for ε=0.6 (i.e., a tolerant error of up to 60%), the ε-ISA saved over 55% and nearly 7% of the disk accesses, and nearly 80% and over 35% of the affected objects, for the 4- and 8-dimensional feature spaces respectively, while it still preserved the accuracy of the returned results at 90%. The experimental results were also promising for small ε values: e.g., for ε=0.2, as depicted in Figure 10c, the accuracy was up to 98% and the cost savings were still quite high, with over 27% of the disk accesses and over 44% of the affected objects saved for the 4-dimensional feature space.


Figure 10. Two-condition (4-d and 8-d) NN queries, different ε values: (a) disk access savings; (b) affected object savings; (c) accuracy against the ε values.

The second experiment was done with two data sets of 4 dimensions. The ε value was set to 0.2. Here we carried out the experiment to show the efficiency of the ε-ISA in dealing with the approximate k-nearest neighbor problem. The values of k were set to 5, 10, 15, 20, 25 and 30. Figures 11a, 11b and 11c show the experimental results, with notations similar to those of the previous experiment. The results obtained were quite good for small values of k and better for larger values of k. In particular, for k=25 or 30, the disk access savings were over 10% and the affected object savings were over 23% for each 4-dimensional feature space, while the accuracy was maintained at nearly 90% against the true k-nearest neighbors of the selected queries.

Figure 11. Two-condition (4-d) k-NN queries, ε = 0.2: (a) disk access savings; (b) affected object savings; (c) accuracy against the true k-NNs


The next experiment was performed on the data sets of 2 dimensions. For this experiment, the ε-ISA was run to find the (1+ε)-approximate NN for 3-condition queries; the main purpose of this test was to show the effect of the number of query conditions on the efficiency of the ε-ISA. The ε values chosen were 0.25, 0.5, 0.75 and 1. The results are shown in Figures 12a and 12b: Figure 12a depicts all cost savings and Figure 12b shows the accuracy, compared to the true NNs of the selected queries, with respect to the ε values. The experimental results are particularly noteworthy. For ε=0.5 (i.e., a tolerated error of up to 50%) the ε-ISA saved nearly 18% of disk accesses and over 17%, 22% and 24% of the affected objects for the three 2-dimensional feature spaces, respectively, while still preserving the accuracy of the approximate results at 91%.

Figure 12. Three-condition (2-d) NN queries, different ε values: (a) cost savings; (b) accuracy against the ε values

The last experiment was carried out with the two real data sets of image feature vectors. Here the ε value was set to 1, i.e. the tolerated error was permitted to be up to 100%. 100 two-condition M-FNN queries were performed and the results were quite good: the ε-ISA saved about 4.5% of the affected objects and 1% of the disk accesses for the 16-d data set, while it maintained the accuracy at 71% with respect to the "true" NNs of the queries. One notable fact in this last experiment is that the average relative error, or the effective epsilon, calculated similarly to that introduced in [1] (see footnote 8), is quite low: only 0.23. This is a very promising result because it (together with the results of the previous experiments) shows that the effective epsilon actually committed is typically much smaller than the tolerated error ε allowed. Note that we did not report the effective epsilon values for the previous experiments because the accuracy of those experimental results was already very high (see [1] for the definition and explanations of the effective epsilon).

8 Concisely, the effective epsilon in our algorithm is computed as follows: (1) compute the ratio between the distance to the appropriate record reported by the ε-ISA and the distance to the true nearest record, minus 1; (2) average the results of step (1) over all queries to obtain the resulting quantity, called the effective epsilon. Note that here the computations are for records, not multidimensional points as in [1]: in our problem, a record satisfying the query may have multiple features, each containing multidimensional data.
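To make footnote 8 concrete, the following C++ sketch computes the effective epsilon over a batch of queries; the two distance vectors are assumed to have been collected during the experiments, and the function name is ours.

#include <cassert>
#include <vector>

// Effective epsilon, as in footnote 8: for each query, take the ratio between
// the distance of the record reported by the eps-ISA and the distance of the
// true nearest record, minus 1; then average over all queries.
double EffectiveEpsilon(const std::vector<double>& reportedDist,
                        const std::vector<double>& trueNNDist) {
    assert(reportedDist.size() == trueNNDist.size());
    assert(!reportedDist.empty());
    double sum = 0.0;
    for (std::size_t i = 0; i < reportedDist.size(); ++i)
        sum += reportedDist[i] / trueNNDist[i] - 1.0;  // relative error of query i
    return sum / static_cast<double>(reportedDist.size());
}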

4 Solving Approximate Single-Feature Nearest Neighbor Queries

Although dimension-reduction techniques, e.g. the Karhunen-Loeve transform, the techniques in [18], etc., have been introduced to lighten the complexities of the similarity search problem in high-dimensional feature spaces, and a variety of other approaches to this problem exist, the research results obtained so far are still unsatisfactory, and thus much more research is necessary. Of late, there has been increasing interest in avoiding or alleviating the curse of dimensionality by resorting to approximate nearest neighbor searching. As stated in [36], and with reference also to Definition 1, a small value of ε suffices for most practical purposes. Below we present some typical results from previous research on the problem of (1+ε)-approximate S-FNN query answering, together with related work.


There are some notable surveys of data structures for nearest neighbor search, such as [31, 32, 50]. In this literature, one can find valuable information about variants of k-d-trees, R-trees, hashing tables, and structures based on space-filling curves. The problem is that the proposed structures perform well almost exclusively in low- and medium-dimensional spaces; in high-dimensional spaces, they all exhibit poor behavior in the worst case as well as in typical cases [1, 51].

For approximate nearest neighbor search, the situation is somewhat better. The problem was first studied by Bern [52]. He proposed a data structure based on quadtrees [53], which uses linear space and provides logarithmic query time; however, the approximation error factor of his algorithm is a fixed function of the dimension. In [4], Arya et al. introduced an algorithm with query time O((1/ε)^d log n) using a nearly linear data structure. The dependence on ε was afterwards reduced by Clarkson [54]. In [1], Arya et al. strengthened the results of their previous work and obtained an optimal O(n) space pre-processing requirement (see footnote 9), but with query time growing as O(d^d). In [3], Kleinberg gave an algorithm with O(n log d)^(2d) pre-processing cost and query time polynomial in d, ε, and log n. In the same paper, he also gave another algorithm with pre-processing polynomial in d, ε, and n but with query time O(n + d log^3 n); this second algorithm improves the O(dn) time bound of the brute-force algorithm. Later, in [35, 36], the authors independently proposed new algorithms that further improve the space and time complexities. Concretely, their algorithms eliminate all exponential dependencies on the dimension, yielding a query time of O(d·log^O(1)(dn)) and space (dn)^O(1). Here, the O-notation hides constant factors that depend exponentially on the tolerated error parameter ε, but not on the dimension d.

9 The authors introduced a new tree, the BBD-tree (balanced box-decomposition tree). This tree has O(log n) height, and subdivides space into regions defined by axis-aligned hyper-rectangles that are fat, meaning that the ratio between the longest and shortest sides is bounded. The tree has O(n) nodes and can be built in O(dn log n) time.

In the context of semantics-based similarity search systems, or FQASs like the VQS, users sometimes want answers to queries such as "Find employees living near the company" or "Find managers whose salary is approximately 100,000$ annually", etc. In such cases (and many others) the query specification is itself a "guess", and so approximate answers to such queries are entirely suitable. Furthermore, in many cases the difference between the NN and a "good" approximation is indistinguishable from a practical point of view [37].

Intrinsically, answering approximate queries requires satisfying two conflicting requirements: (1) low processing costs (IO-cost, CPU-cost, memory cost), and (2) high accuracy of the results (i.e., low errors, or more precisely, low effective tolerated errors, cf. section 3.4). In section 3, we introduced an approximate k-NN algorithm for solving complex multi-feature NN queries; as shown by both the theoretical analyses and the experiments, the proposed algorithm satisfies requirements (1) and (2) with very promising efficiency. In this section, we present an adaptation of the SH-tree search algorithms for answering approximate S-FNN queries. The adapted algorithm is described in Figure 13 below:

Approximate k-NN algorithm for S-FNN queries:
1:  Approximate_kNN_Query(Obj, PQ, ε, Rset, k)
2:  DATAOBJECT Obj        //query object
3:  PRIORITYQUEUE PQ      //priority queue
4:  FLOAT ε               //epsilon – tolerated error
5:  RESULTBUFFER Rset     //sorted result buffer
6:  INTEGER k             //number of desired approximate NNs
7:  While not PQ.IsEmpty() do
8:      PRIORITYQUEUE top = PQ.Top()
9:      If top.dist ≥ Rset[k].dist/(1+ε) then return
10:     Case top.type of  //else
11:     INTERNAL_NODE:
            //compute bounding regions (BR) for left and right children
12:         BOUNDINGREC BRleft = top.BR ∩ (d ≤ top.up)
13:         BOUNDINGREC BRright = top.BR ∩ (d ≥ top.lo)
            //MINDIST from Obj to BRleft and BRright
14:         FLOAT leftdist = MINDIST(Obj, BRleft)
15:         FLOAT rightdist = MINDIST(Obj, BRright)
            //then push them, together with the involved information, into the priority queue PQ if needed
16:         If leftdist < Rset[k].dist/(1+ε) then
17:             PQ.Push(top.left, top.left.type, leftdist, BRleft)
18:         If rightdist < Rset[k].dist/(1+ε) then
19:             PQ.Push(top.right, top.right.type, rightdist, BRright)
20:     BALANCED_NODE:
21:         For each entry Leaf in the balanced node do
22:             FLOAT dist = Distance(Obj, Leaf)
23:             If dist < Rset[k].dist/(1+ε) then
24:                 PQ.Push(Leaf, LEAF_NODE, dist, NULL)
25:     LEAF_NODE:
26:         For each object Object in the leaf node do
27:             FLOAT dist = ObjectDistance(Obj, Object)
28:             If dist < Rset[k].dist then
29:                 Update Rset with respect to this newly seen object Object
30:     EndCase
31: EndWhile

Figure 13. Pseudo-code of the adapted approximate k-NN algorithm

This approximate k-NN algorithm is a variant of the k-NN algorithm for finding k-nearest neighbors with the SH-tree, which is itself an adaptation of the incremental NN algorithm [47]. The algorithm uses a priority queue PQ to realize the best-first search method: the next node or object to be accessed is always the one at the top of the queue, and the comparison between elements in the queue is done with respect to their distances to the query object Obj.
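As a minimal C++ sketch of this best-first machinery (the type and member names are illustrative, not taken from our implementation), such a priority queue ordered by ascending distance can be realized as follows:

#include <queue>
#include <vector>

// Illustrative queue element: a tree node (or data object) together with
// its lower-bound distance to the query object.
struct PQEntry {
    const void* node;   // internal, balanced, or leaf node / data object
    int type;           // e.g. INTERNAL_NODE, BALANCED_NODE, or LEAF_NODE
    double dist;        // MINDIST (or object distance) to the query object
};

// Order by ascending distance: the closest element is always on top,
// which realizes the best-first traversal described above.
struct ByDistance {
    bool operator()(const PQEntry& a, const PQEntry& b) const {
        return a.dist > b.dist;  // std::priority_queue is a max-heap by default
    }
};

using BestFirstQueue =
    std::priority_queue<PQEntry, std::vector<PQEntry>, ByDistance>;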

The algorithm is easy to follow as described in Figure 13. Note that the SH-tree has three kinds of nodes: internal, balanced, and leaf nodes (cf. section 2); depending on which kind of node is at the top of PQ, the algorithm processes it accordingly. In line 9, the algorithm checks whether any data object not yet seen can still have a distance to Obj less than Rset[k].dist/(1+ε), where Rset[k].dist is the distance of the current k-th nearest neighbor to Obj. If no such object can exist, the current k nearest neighbors (k-NNs) in the sorted buffer Rset are returned as the approximate k-NNs of Obj. Here we should note that the distance from Obj to the bounding region of any node in the tree is always a lower bound on its distance to any data object (or any child node) contained in that node. If the condition in line 9 does not hold, the algorithm checks the type of the element at the top of PQ. If it is an internal node, the algorithm determines the bounding regions of its left and right child nodes and the MINDIST from Obj to each of them, and then puts them into PQ if necessary. The conditions in lines 16 and 18 ensure that the left and right child nodes, respectively, may still contain a data object with a smaller distance to Obj than the current k-th NN in Rset; these conditions prune nodes that need not be put into the queue. The well-known lower-bound distance metric MINDIST was introduced in [26]. If the element at the top of PQ is a balanced node, the algorithm computes the distance from Obj to each child of the balanced node (i.e., to a leaf node), Distance(Obj, Leaf), and puts the leaf into PQ if the condition in line 23 holds. Note that the distance Distance(Obj, Leaf) is defined as the larger of the minimum distances to the leaf's MBR (minimum bounding rectangle) and BS (bounding sphere) [5]. If a leaf node appears at the top of PQ, the algorithm proceeds similarly: it accesses the objects in the leaf and computes the distance from Obj to each of them, ObjectDistance(Obj, Object); if this distance satisfies the condition in line 28, the corresponding data object is put into the sorted result buffer Rset, which is then updated.
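For completeness, a compact C++ sketch of the MINDIST lower bound from [26] between a query point and an axis-aligned bounding rectangle might look as follows; this is a plain illustration under the usual Euclidean setting, not code from the SH-tree implementation.

#include <cmath>
#include <vector>

// MINDIST [26]: the minimum Euclidean distance from point p to the
// axis-aligned rectangle given by its lower/upper corners lo and hi.
// It lower-bounds the distance from p to anything inside the rectangle.
double MinDist(const std::vector<double>& p,
               const std::vector<double>& lo,
               const std::vector<double>& hi) {
    double sum = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        double d = 0.0;
        if (p[i] < lo[i])      d = lo[i] - p[i];  // p below the rectangle in dim i
        else if (p[i] > hi[i]) d = p[i] - hi[i];  // p above the rectangle in dim i
        // otherwise p's i-th coordinate falls inside: contributes 0
        sum += d * d;
    }
    return std::sqrt(sum);
}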

The correctness of this algorithm is obvious (see also footnote 10). Note that, as with any approximate algorithm proposed in the literature, the performance of this algorithm depends on the choice of the ε value. Intuitively, the larger the ε value, the faster the algorithm terminates; however, this may negatively affect the quality of the result (more precisely, the effective epsilon may be higher).

10 Also note that in [5] we introduced two different adaptations of the incremental k-NN algorithm originally proposed in [27] to the SH-tree. The algorithm described in Figure 13 is a variant of the first one, the one of the two that incurs the higher CPU-cost during query processing. In Table 1, this adapted algorithm is named "Alg2".

We would also remark that our algorithm as shown in Figure 13 is ε-independent, meaning that once the tree has been built, any ε value can be used; obviously, setting ε=0 causes the algorithm to compute the true NN. We have implemented the algorithm and run tests with a real data set of 9-dimensional image feature vectors based on their color moments (cf. section 3.4). We performed 100 approximate 10-NN queries with different ε values. All tests were carried out on a SUN ULTRA 30 with a 300 MHz CPU, 384 Mbytes of main memory and several gigabytes of hard disk capacity. The experimental results are presented in Table 1 below.

Table 1. Experimental results of approximate k-NN queries

                 Alg1 (C-kNN)  Alg2 (C-kNN)  ε = 0.0  ε = 0.25  ε = 0.5  ε = 0.75  ε = 1.0  ε = 2.0
I/O-cost         76.22         56.12         56.12    42.7      35.54    30.68     27.4     20.82
CPU-cost (sec)   0.099         0.248         0.078    0.063     0.054    0.048     0.044    0.035
Accuracy         100%          100%          100%     100%      99.7%    98.8%     98.5%    94.1%

In Table 1, the first two columns show the experimental results of the two adapted k-NN algorithms for the SH-tree introduced in [5]: algorithm 1 adapted from [26] and algorithm 2 from [27]. The other columns show the experimental results of the approximate k-NN algorithm described in Figure 13 with different values of the tolerated error ε. The first row shows the IO-cost (i.e. the number of disk accesses, averaged over the 100 test queries), the second row shows the average CPU time needed to complete each query, and the third row presents the accuracy of the returned k-NN results. Evidently, for algorithms 1 and 2, i.e. the correct k-NN (C-kNN) algorithms, the accuracy equals 100%; this also holds for the approximate k-NN algorithm with ε = 0. From the results shown in Table 1, we observe that the approximate k-NN algorithm performs very well: it keeps the accuracy of the returned results remarkably high while significantly reducing the query processing costs. For example, with ε = 0.25, the approximate algorithm saves nearly 44% (resp. ~24%) of the IO-cost and nearly 36.4% (resp. ~74.6%) of the CPU-cost compared to adapted algorithms 1 and 2, respectively, while still keeping the accuracy at 100%. Another interesting result is that the approximate algorithm with ε = 0 performs better than both adapted algorithms. This can be explained by the more efficient pruning strategy available when the value of k is known in advance (see lines 16, 18, and 23 of the algorithm), which leads to a smaller priority queue. This also confirms the conclusion presented in [47] that the size of the priority queue affects the performance of queue operations during the algorithm's execution (and thus the performance of the algorithm). More importantly, this result leads us to conclude that for k-NN queries with k determined beforehand, the algorithm shown in Figure 13 (with ε set to zero) is a suitable substitute for all previously adapted algorithms for the SH-tree.

To conclude this section, we present an example to show more intuitively the effect of the ε value on the quality of the results returned for an approximate query. Figure 14 below sketches the scenario:

Figure 14. Approximate NN results and correct NN results


In Figure 14, assume that q is the query object and p is the current NN with ObjectDistance(q, p) = r. Assume the bounding sphere (q, r) overlaps a leaf node LN1 and an internal node IN1, and that they will be visited in this order (according to their distances to q). Note that leaf nodes in the SH-tree have both a minimum bounding rectangle (MBR) and a bounding sphere (BS), while internal nodes have only a bounding rectangle (BR, not necessarily minimal). An exact NN algorithm would first access the leaf node LN1 and find that ObjectDistance(q, x) < MINDIST(q, IN1) and ObjectDistance(q, x) < r (in fact, in this case ObjectDistance(q, x) < MINDIST(q, IN1) < r); therefore, object x would be returned as the "true" NN without accessing the internal node IN1. With the approximate algorithm of Figure 13, however, once p has been determined to be the current approximate NN and the leaf node LN1 reaches the top of the priority queue, the algorithm finds that Distance(q, LN1) ≥ r/(1+ε), and thus p is immediately returned as the approximate NN without the further accesses an exact NN algorithm would perform. But if the ε value is chosen so small that the inequality Distance(q, LN1) ≥ r/(1+ε) does not hold, then additional accesses to disk pages and objects are required.
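To put numbers on this scenario (the values below are purely illustrative, not taken from our experiments), the following C++ fragment evaluates the pruning test of line 9 for two ε values:

#include <iostream>

// Figure 14 scenario: the leaf LN1 is pruned exactly when
// Distance(q, LN1) >= r / (1 + eps).
int main() {
    const double r = 1.0;          // distance from q to the current NN p
    const double distLN1 = 0.85;   // Distance(q, LN1)

    for (double eps : {0.25, 0.05}) {
        double threshold = r / (1.0 + eps);  // 0.8 for eps=0.25, ~0.952 for eps=0.05
        std::cout << "eps = " << eps << ": threshold = " << threshold
                  << (distLN1 >= threshold
                          ? " -> LN1 pruned, p returned as approximate NN\n"
                          : " -> LN1 must still be visited\n");
    }
    return 0;
}

With ε = 0.25 the leaf is pruned and p is returned, whereas with ε = 0.05 the threshold rises above Distance(q, LN1) and the leaf must still be visited, matching the discussion above.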

5 Solving Approximate Range Queries

Range searching is one of the most important problems in computational geometry because it arises in many application areas, and a wide variety of geometric problems can be formulated as range searching problems [55]. There is a sheer volume of literature on this problem accumulated over the last decades, e.g. [2, 56], and some recent surveys summarize the development history of the problem [55, 57]. With reference to [2, 55], typical kinds of ranges are hyper-rectangles, spheres, halfspaces, and simplices. The goal is to preprocess the given data set S into a data structure so that, for a range query R, the data objects in R∩S can be reported or counted efficiently. As with the nearest neighbor problem, most research on range searching has focused on solving the problem exactly. However, lower bounds show that if linear space is assumed, the problem cannot be solved in polylogarithmic time, except for the case of orthogonal ranges [2] (see footnote 11). This suggests that it is worth considering variants of the problem. One of the vanguard approaches was introduced in [2]: rather than approximating the count of points, the authors considered the range to be a fuzzy range and assumed that data points close to the boundary of the range (relative to the range's diameter) may or may not be included in the count.

11 For orthogonal range queries, it is well known that range trees [58], in all dimensions d ≥ 2, use O(n log^(d-1) n) space and can answer a query in O(log^(d-1) n + k) time, where k is the number of data objects reported [59].

In the context of semantic- or content-based similarity search systems and systems with similar aspects, range queries are also among the most popular types of queries. Example queries include "Count all families living within a radius of 3 km from Sai Gon center", or "How many people have a hair color similar to red, with an acceptable error of 0.1", etc. In such cases, users of the systems may be quite satisfied with an answer that is only relatively accurate. Towards supporting and satisfying the user's information need, and towards integrating the SH-tree into such systems, we developed an approximate algorithm for answering bounding-sphere range queries (see footnote 12) over SH-trees, described in Figure 15 below.

12 In fact, our algorithm can also be applied to other range types with some minor modifications.

Approximate algorithm for bounding sphere range queries:
1:  AppBSRangeQuery(bs, ε, Rset, node, br)
2:  BOUNDINGSPHERE bs     //bounding sphere range query
3:  FLOAT ε               //epsilon – tolerated error
4:  RESULTBUFFER Rset     //result set buffer
5:  NODEPOINTER node      //pointer to a node of the tree
6:  BOUNDINGREC br        //bounding rectangle of node
7:  Compute outer range bs+ and inner range bs-, which are the dilation and erosion, respectively, of bs by distance ε*(bs.R)
8:  Case (node.Type) of
9:  INTERNAL_NODE:
10:     If br ⊆ bs+ then
11:         Descend the tree from this internal node, add all data objects in the leaf nodes to the result set Rset, and then return
12:     If br ∩ bs- = ∅ then
13:         return
        /* Otherwise, compute bounding rectangles for the left and right children (note that d is the split dimension), and then recursively call AppBSRangeQuery for each of them */
14:     BOUNDINGREC left_rec = br ∩ (d ≤ node.up)
15:     BOUNDINGREC right_rec = br ∩ (d ≥ node.lo)
16:     AppBSRangeQuery(bs, ε, Rset, node.Left, left_rec)
17:     AppBSRangeQuery(bs, ε, Rset, node.Right, right_rec)
18: BALANCED_NODE:
19:     If br ⊆ bs+ then
20:         Descend the tree from this balanced node, add all data objects in the leaf nodes to the result set Rset, and then return
21:     If br ∩ bs- = ∅ then
22:         return
23:     For each entry Leaf in the balanced node do
24:         AppBSRangeQuery(bs, ε, Rset, Leaf, NULL)
25: LEAF_NODE:
26:     For each object Object in the leaf node do
            //compute the distance between the center of bs and Object
27:         FLOAT dist = ObjectDistance(bs.C, Object)
            //if Object falls into/overlaps the range query
28:         If dist ≤ bs.R then
29:             Add Object to the result set Rset
30: EndCase

Figure 15. Pseudo-code of the adapted approximate range query algorithm

The algorithm is not difficult to follow. Among its parameters, br (the bounding rectangle of node) may raise questions; however, procedure AppBSRangeQuery is initially called with node set to the root node of the SH-tree, for which br is easily determined because the root of the SH-tree always keeps this value and updates it whenever a new object is inserted into the tree.

First, the algorithm calculates the outer range bs+ and inner range bs- (line 7) for parameter bs (cf. Figure 16) and then examines the type of node. If it is an internal node, the algorithm checks whether br is contained in bs+ (line 10); if so, all data objects contained in node are added to the result set buffer Rset. Otherwise, the algorithm checks whether br does not intersect bs- (line 12); if so, nothing more is done with this node (line 13). Otherwise, the algorithm computes the bounding rectangles of the left and right child nodes of the internal node (lines 14, 15) and makes recursive calls to procedure AppBSRangeQuery for these child nodes (lines 16, 17). If node is not an internal node but a balanced node (line 18), the processing steps are similar to those for the internal node, except that the parameter br in the recursive calls for the leaf nodes (the child nodes of the balanced node) is NULL (line 24); the reason is that this information is no longer needed when processing the leaf nodes (lines 25-29). Finally, if node is a leaf node, the algorithm accesses all data objects located in this leaf and verifies whether they satisfy the "true" range query (line 28) in order to be added to Rset.
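One way to realize the two geometric tests of lines 10 and 12 is via the minimum and maximum distances from the sphere center to the rectangle; the C++ sketch below assumes the Euclidean metric and uses our own helper names, not identifiers from the SH-tree code. The outer range bs+ has radius R(1+ε) and the inner range bs- has radius R(1-ε), per line 7 (for ε ≥ 1, bs- is empty and the disjointness test trivially holds).

#include <algorithm>
#include <cmath>
#include <vector>

// Minimum Euclidean distance from point c to rectangle [lo, hi]
// (same construction as the MINDIST sketch in section 4).
double MinDist(const std::vector<double>& c, const std::vector<double>& lo,
               const std::vector<double>& hi) {
    double sum = 0.0;
    for (std::size_t i = 0; i < c.size(); ++i) {
        double d = 0.0;
        if (c[i] < lo[i])      d = lo[i] - c[i];
        else if (c[i] > hi[i]) d = c[i] - hi[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Maximum Euclidean distance from point c to rectangle [lo, hi]: per
// dimension, take the farther of the two faces.
double MaxDist(const std::vector<double>& c, const std::vector<double>& lo,
               const std::vector<double>& hi) {
    double sum = 0.0;
    for (std::size_t i = 0; i < c.size(); ++i) {
        double d = std::max(std::abs(c[i] - lo[i]), std::abs(c[i] - hi[i]));
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Line 10: br ⊆ bs+ iff even the farthest point of br lies within R*(1+eps).
bool ContainedInOuter(const std::vector<double>& c, double R, double eps,
                      const std::vector<double>& lo, const std::vector<double>& hi) {
    return MaxDist(c, lo, hi) <= R * (1.0 + eps);
}

// Line 12: br ∩ bs- = ∅ iff the closest point of br is farther than R*(1-eps).
bool DisjointFromInner(const std::vector<double>& c, double R, double eps,
                       const std::vector<double>& lo, const std::vector<double>& hi) {
    return MinDist(c, lo, hi) > R * (1.0 - eps);
}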

Although the results returned by the algorithm of Figure 15 comply with Definition 2, the processing in this algorithm is in fact slightly different: the algorithm treats the range as a fuzzy object, meaning that objects lying within distance ε*(bs.R) of the boundary of the bounding sphere range query bs may or may not be returned. This is the same as in the algorithm proposed in [2]. Note, however, that the algorithm presented in [2] computes the accumulated weight of the points satisfying the approximate query, whereas our algorithm retrieves all objects satisfying the approximate query. The proof of correctness of our algorithm remains the same as that presented in [2]. Figure 16 below depicts a graphical view of our problem:

Figure 16. Approximate bounding sphere range queries

From Figure 16 we can easily observe that objects in the area (bs+ - bs-) may or may not be returned, whereas objects in bs- are always included in the result set. We also observe that objects that are returned but belong to the area (bs+ - bs) are false hits. In [2], the authors introduced the actual average error committed by the approximate algorithm, defined as follows: consider a range of radius r and a point at distance r'. If r' < r but the point was not included in the result set (in our case, this may happen for objects in the area (bs - bs-)), the associated misclassification error is the relative error (r - r')/r; conversely, if r' > r but the point was included in the result set, the associated misclassification error is (r' - r)/r. This relative error is measured for every misclassified point and then averaged over all points lying in the area (bs+ - bs-) to establish the average error of the query. In a number of experiments, the authors observed that the average error was much lower than the allowed error factor ε.
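A compact C++ sketch of this error bookkeeping (types and names are illustrative) might look as follows:

#include <vector>

// Misclassification error from [2]: for a range of radius r, a point at
// distance d contributes (r - d)/r if it lies inside but was not reported,
// and (d - r)/r if it lies outside but was reported. Correctly classified
// points contribute no error. Averaging over all points in the fuzzy zone
// (bs+ - bs-) yields the average error of the query.
struct Classified {
    double dist;     // distance of the point to the range center
    bool reported;   // whether the approximate algorithm returned it
};

double AverageError(const std::vector<Classified>& fuzzyZonePoints, double r) {
    if (fuzzyZonePoints.empty()) return 0.0;
    double sum = 0.0;
    for (const auto& p : fuzzyZonePoints) {
        if (p.dist < r && !p.reported)     sum += (r - p.dist) / r;  // missed hit
        else if (p.dist > r && p.reported) sum += (p.dist - r) / r;  // false hit
    }
    return sum / static_cast<double>(fuzzyZonePoints.size());
}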

6 Conclusions and Future Work

In this paper we introduced three types of approximate queries of great interest, together with efficient methods to address them: (1) approximate single-feature nearest neighbor (S-FNN) queries, (2) approximate multi-feature nearest neighbor (M-FNN) queries, and (3) approximate range queries. Our discussion focused mainly on the second query type, approximate M-FNN queries; the other two are well known and comparatively well addressed by a large number of research publications.

In particular, we introduced an approach called the ε-ISA (ISA stands for Incremental hyper-Sphere Approach) to efficiently solve the approximate nearest neighbor problem for M-FNN queries, which are also called complex vague queries (CVQs). To the best of our knowledge, the ε-ISA is among the vanguard solutions for approximate M-FNN queries. Although the ε-ISA is presented in the context of the Vague Query System (VQS) [7], it can also be applied to a variety of application domains such as multimedia databases (e.g., "find images whose shape, texture, and color are approximately similar to cat, cloud, and red, respectively"), GISs and TISs (e.g., "find a hotel located near the city centre and priced at about 100 EUR/night"), digital libraries, and so on. In general, the ε-ISA is very useful for application domains in which the returned results need not be exact but only similar, or approximately similar (within a certain tolerated error), to a given query. Our experimental results with both synthetic and real data sets have confirmed this: with a suitable ε value, the ε-ISA saves a very high percentage of the costs, including both IO-cost and CPU-cost, while still preserving the accuracy of the returned results at a very high level. Notably, the ε-ISA is also applicable to general feature spaces, including ones that do not satisfy the three conditions pointed out in [44] that Fagin's algorithm [39] and all its variants, such as [41, 42], require in order to be applicable. A fruitful discussion of these approaches and considerable related information can be found in [46, 60, 61].


In addition, we discussed the first and third query types in two separate sections with notable considerations. For each of them, we proposed efficient and effective adaptations of the SH-tree to deal with the problem concerned. This once more confirms the power and flexibility of the SH-tree, a super hybrid index structure for multidimensional data. Moreover, from our experimental results in section 4, we have also seen that for k-NN queries with k known in advance, our algorithm shown in Figure 13 is a suitable substitute for all adapted algorithms previously introduced for the SH-tree.

Our future work will concentrate on deploying the ε-ISA as well as the SH-tree in real-world applications such as GISs and TISs (see, e.g., [20]), modern IR systems, etc. In particular, other types of approximate similarity queries, such as approximate complex vague joins and approximate reverse queries (given a query object Q, find all data objects of the database to which Q is the approximate nearest neighbor), are also of interest and still need more research effort.

References

1. Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R., Wu, A.Y. (1998) An optimal algorithm for approximate nearest neighbor searching in fixed dimensions, Journal of the ACM 45(6), 891-923
2. Arya, S., Mount, D.M. (2000) Approximate range searching, Computational Geometry 17(3-4), 135-152
3. Kleinberg, J.M. (1997) Two algorithms for nearest-neighbor search in high dimensions, Proc. 29th Annual ACM Symp. on the Theory of Computing, Texas, USA, 599-608
4. Arya, S., Mount, D.M. (1993) Approximate nearest neighbor queries in fixed dimensions, Proc. 4th Annual ACM-SIAM Symp. on Discrete Algorithms, Austin, Texas, USA, 271-280
5. Dang, T.K. (2003) Semantic based similarity searches in database systems (multidimensional access methods, similarity search algorithms), PhD Thesis, FAW-Institute, University of Linz, Austria
6. Dang, T.K., Kueng, J., Wagner, R. (2002) ISA - an incremental hyper-sphere approach for efficiently solving complex vague queries, Proc. 13th Int. Conf. on Database and Expert Systems Applications, Aix-en-Provence, France, 810-819
7. Kueng, J., Palkoska, J. (1997) VQS - a vague query system prototype, Proc. 8th Int. Conf. on Database and Expert Systems Applications, Toulouse, France, 614-618
8. Ooi, B.C., McDonell, K.J., Sacks-Davis, R. (1987) Spatial KD-tree: an indexing mechanism for spatial databases, COMPSAC 1987, Tokyo, Japan, 433-438
9. Chakrabarti, K., Mehrotra, S. (1999) The hybrid tree: an index structure for high dimensional feature spaces, Proc. 15th Int. Conf. on Data Engineering, Sydney, Australia, 440-447
10. Katayama, N., Satoh, S. (1997) The SR-tree: an index structure for high dimensional nearest neighbor queries, Proc. ACM SIGMOD Int. Conf. on Management of Data, Tucson, Arizona, USA, 369-380
11. White, D.A., Jain, R. (1996) Similarity indexing with the SS-tree, Proc. 12th Int. Conf. on Data Engineering, New Orleans, Louisiana, USA, 516-523
12. Kueng, J., Palkoska, J. (1999) An incremental hypercube approach for finding best matches for vague queries, Proc. 10th Int. Conf. on Database and Expert Systems Applications, Florence, Italy, 238-249
13. Seidl, T., Kriegel, H-P. (1997) Efficient user-adaptable similarity search in large multimedia databases, Proc. 23rd VLDB Conf., Athens, Greece, 506-515
14. Ortega, M., Rui, Y., Chakrabarti, K., Porkaew, K., Mehrotra, S., Huang, T.S. (1998) Supporting ranked boolean similarity queries in MARS, IEEE Trans. on Knowledge and Data Engineering 10(6), 905-925
15. Braunmueller, B., Ester, M., Kriegel, H-P., Sander, J. (2001) Multiple similarity queries: a basic DBMS operation for mining in metric databases, IEEE Trans. on Knowledge and Data Engineering 13(1), 79-95
16. Berchtold, S., Kriegel, H.P. (1997) S3: Similarity search in CAD database systems, Proc. ACM SIGMOD Int. Conf. on Management of Data, Arizona, USA, 564-567
17. Faloutsos, C., Ranganathan, M., Manolopoulos, Y. (1994) Fast subsequence matching in time-series databases, Proc. ACM SIGMOD Int. Conf. on Management of Data, Minnesota, USA, 419-429
18. Chakrabarti, K., Keogh, E.J., Mehrotra, S., Pazzani, M.J. (2002) Locally adaptive dimensionality reduction for indexing large time series databases, ACM Trans. on Database Systems 27(2), 188-228
19. Baeza-Yates, R., Ribeiro-Neto, B. (1999) Modern information retrieval, Addison-Wesley Longman, ISBN 020139829x, ACM Press Books
20. Palkoska, J., Puehretmair, F., Tjoa, A.M., Wagner, R., Woeß, W. (2002) Advanced query mechanisms in tourism information systems, ENTER 2002, Innsbruck, Austria, 438-447
21. Boehm, C., Braunmueller, B., Kriegel, H-P., Schubert, M. (2000) Efficient similarity search in digital libraries, Proc. IEEE Advances in Digital Libraries 2000, Washington, DC, USA, 193-206
22. Ichikawa, T., Hirakawa, M. (1986) ARES: a relational database with the capability of performing flexible interpretation of queries, IEEE Trans. on Software Engineering 12(5), 624-634
23. Motro, A. (1988) VAGUE: a user interface to relational database that permits vague queries, ACM Trans. on Office Information Systems 6(3), 187-214
24. Pfeifer, U., Fuhr, N. (1995) Efficient processing of vague queries using a data stream approach, Proc. 18th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, Seattle, Washington, USA, 189-197
25. Chaudhuri, S., Gravano, L. (1999) Evaluating top-k selection queries, Proc. 25th VLDB Conf., Edinburgh, Scotland, UK, 397-410
26. Roussopoulos, N., Kelley, S., Vincent, F. (1995) Nearest neighbor queries, Proc. ACM SIGMOD Int. Conf. on Management of Data, San Jose, California, USA, 71-79
27. Hjaltason, G.R., Samet, H. (1995) Ranking in spatial databases, Proc. 4th Int. Symp. on Large Spatial Databases, Portland, Maine, USA, 83-95
28. Bozkaya, T., Oezsoyoglu, Z.M. (1999) Indexing large metric spaces for similarity search queries, ACM Trans. on Database Systems 24(3), 361-404
29. Berchtold, S., Keim, D.A., Kriegel, H-P., Seidl, T. (2000) Indexing the solution space: a new technique for nearest neighbor search in high-dimensional space, IEEE Trans. on Knowledge and Data Engineering 12(1), 45-57
30. Hinneburg, A., Aggarwal, C.C., Keim, D.A. (2000) What is the nearest neighbor in high dimensional spaces?, Proc. 26th VLDB Conf., Cairo, Egypt, 506-515
31. Gaede, V., Guenther, O. (1998) Multidimensional access methods, ACM Comput. Surv. 30(2), 170-231
32. Boehm, C., Berchtold, S., Keim, D.A. (2001) Searching in high-dimensional spaces: index structures for improving the performance of multimedia databases, ACM Comput. Surv. 33(3), 322-373
33. Weber, R., Schek, H-J., Blott, S. (1998) A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces, Proc. 24th VLDB Conf., New York, USA, 194-205
34. Beyer, K.S., Goldstein, J., Ramakrishnan, R., Shaft, U. (1999) When is "nearest neighbor" meaningful?, Proc. 7th Int. Conf. on Database Theory, Jerusalem, Israel, 217-235
35. Kushilevitz, E., Ostrovsky, R., Rabani, Y. (2000) Efficient search for approximate nearest neighbor in high dimensional spaces, SIAM Journal on Computing 30(2), 457-474
36. Indyk, P., Motwani, R. (1998) Approximate nearest neighbors: towards removing the curse of dimensionality, Proc. 30th Annual ACM Symp. on the Theory of Computing, Dallas, Texas, USA, 604-613
37. Ciaccia, P., Patella, M. (2000) PAC nearest neighbor queries: approximate and controlled search in high-dimensional and metric spaces, Proc. 16th Int. Conf. on Data Engineering, California, USA, 244-255
38. Chakrabarti, K., Garofalakis, M.N., Rastogi, R., Shim, K. (2001) Approximate query processing using wavelets, VLDB Journal 10(2-3), 199-223
39. Fagin, R. (1996) Combining fuzzy information from multiple systems, Proc. 15th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, Montreal, Canada, 216-226
40. Ortega, M., Rui, Y., Chakrabarti, K., Mehrotra, S., Huang, T.S. (1997) Supporting similarity queries in MARS, Proc. 5th ACM Int. Conf. on Multimedia, Seattle, USA, 403-413
41. Nepal, S., Ramakrishna, M.V. (1999) Query processing issues in image (multimedia) databases, Proc. 15th Int. Conf. on Data Engineering, Sydney, Australia, 22-29
42. Guentzer, U., Balke, W-T., Kiessling, W. (2000) Optimizing multi-feature queries for image databases, Proc. 26th VLDB Conf., Cairo, Egypt, 419-428
43. Boehm, K., Mlivoncic, M., Schek, H-J., Weber, R. (2001) Fast evaluation techniques for complex similarity queries, Proc. 27th VLDB Conf., Roma, Italy, 211-220
44. Wimmers, E.L., Haas, L.M., Roth, M.T., Braendli, C. (1999) Using Fagin's algorithm for merging ranked results in multimedia middleware, Proc. 4th IFCIS Int. Conf. on Cooperative Information Systems, Edinburgh, Scotland, 267-278
45. Natsev, A., Chang, Y-C., Smith, J.R., Li, C-S., Vitter, J.S. (2001) Supporting incremental join queries on ranked inputs, Proc. 27th VLDB Conf., Roma, Italy, 281-290
46. Fagin, R., Lotem, A., Naor, M. (2001) Optimal aggregation algorithms for middleware, Proc. 20th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, California, USA, 102-113
47. Hjaltason, G.R., Samet, H. (1999) Distance browsing in spatial databases, ACM Trans. on Database Systems 24(2), 265-318
48. Russell, S., Norvig, P. (1995) Artificial intelligence: a modern approach, Prentice Hall, Englewood Cliffs, NJ, ISBN 0131038052
49. Fagin, R., Wimmers, E.L. (2000) A formula for incorporating weights into scoring rules, Theoretical Computer Science 239(2), 309-338
50. Dang, T.K., Kueng, J., Wagner, R. (2001) Multidimensional access methods: important factor for current and next decade's applications in spatial databases, Proc. 9th Interdisciplinary Information Management Talks, Zadov, Czech Republic, 361-374
51. Arya, S., Mount, D.M., Narayan, O. (1996) Accounting for boundary effects in nearest-neighbor searching, Discrete and Computational Geometry 16(2), 155-176
52. Bern, M.W. (1993) Approximate closest-point queries in high dimensions, Information Processing Letters 45(2), 95-99
53. Finkel, R.A., Bentley, J.L. (1974) Quad trees: a data structure for retrieval on composite keys, Acta Informatica 4, 1-9
54. Clarkson, K.L. (1994) An algorithm for approximate closest-point queries, Proc. 10th Annual Symp. on Computational Geometry, Stony Brook, New York, USA, 160-164
55. Agarwal, P.K., Erickson, J. (1999) Geometric range searching and its relatives, Advances in Discrete and Comp. Geometry, Contemporary Math. 223, American Mathematical Society Press, 1-56
56. Hibino, S., Rundensteiner, E.A. (1998) Processing incremental multidimensional range queries in a direct manipulation visual query, Proc. 14th Int. Conf. on Data Engineering, Orlando, Florida, USA, 458-465
57. Matousek, J. (1994) Geometric range searching, ACM Comput. Surv. 26(4), 421-461
58. Bentley, J.L. (1980) Multidimensional divide-and-conquer, Communications of the ACM 23(4), 214-229
59. Preparata, F.P., Shamos, M.I. (1985) Computational geometry: an introduction, Springer-Verlag, New York, NY, ISBN 0387961313
60. Fagin, R. (2002) Combining fuzzy information: an overview, ACM SIGMOD Record 31(2), 109-118
61. Bruno, N., Gravano, L., Marian, A. (2002) Evaluating top-k queries over web-accessible databases, Proc. 18th Int. Conf. on Data Engineering, San Jose, California, USA, 369-382