ε-ISA: AN INCREMENTAL LOWER BOUND APPROACH FOR EFFICIENTLY FINDING APPROXIMATE NEAREST NEIGHBOR OF COMPLEX VAGUE QUERIES

DANG Tran Khanh, KÜNG Josef, WAGNER Roland

Institute for Applied Knowledge Processing (FAW), Johannes Kepler University of Linz, Austria

Hagenberg - Linz - Prague - Vienna

iiWAS 2002, 10-12 September, Bandung, Indonesia




OUTLINE

Complex Vague Queries in the Vague Query System (VQS)

Similarity search problem of the VQS in the conventional DBMSs

Incremental hyper-Sphere Approach (ISA)

Overcome shortcomings of Incremental hyper-Cube Approach (ICA)

ε-ISA: Finding Approximate Nearest Neighbors of Complex Vague Queries

The issue of the dimensionality curse

The issue of increasing the query condition number

Experimental Results

Conclusions


COMPLEX VAGUE QUERIES IN THE VAGUE QUERY SYSTEM

The VQS:

• Introduced by Kueng and Palkoska 1997

• Supports similarity search capabilities in conventional DBMSs: returns to users records semantically close to a given query

One of the VQS’s basic ideas:

• NCR-Tables (Numeric-Coordinate-Representation-Tables): keep numeric semantic information of non-numeric attributes


NCR-Tables – an example

Query relation Car (Col is the fuzzy field):

Nr      Typ   Col
L-1234  VW    blue
W-5679  Opel  black
...     ...   ...

NCR-table Colors (Name is the NCR-key; red, green and blue are the NCR-columns):

Name        red  green  blue
black         0      0     0
blue          0      0   255
light blue  173    216   230
dark blue     0      0   139
...         ...    ...   ...

Example vague query:

SELECT FROM Car WHERE Col IS 'dark blue' INTO myResultTable;
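The NCR-table example can be sketched in a few lines of Python. This is only an illustration, not the VQS implementation: the Euclidean distance on the (red, green, blue) coordinates and the helper name `vague_match` are assumptions made here; the slides do not state which metric the VQS uses.

```python
import math

# NCR-table from the example: colour name -> (red, green, blue).
COLORS_NCR = {
    "black": (0, 0, 0),
    "blue": (0, 0, 255),
    "light blue": (173, 216, 230),
    "dark blue": (0, 0, 139),
}

# Query relation Car from the example: (Nr, Typ, Col).
CARS = [
    ("L-1234", "VW", "blue"),
    ("W-5679", "Opel", "black"),
]


def vague_match(cars, ncr, wanted):
    """Rank cars by the distance between their colour and `wanted`.

    Assumption: Euclidean distance in the NCR coordinate space.
    """
    target = ncr[wanted]
    return sorted(cars, key=lambda car: math.dist(ncr[car[2]], target))


# Col IS 'dark blue': no car is dark blue, so the closest colours come first.
ranked = vague_match(CARS, COLORS_NCR, "dark blue")
print(ranked[0])
```

No car in the relation is dark blue, so an exact-match query would return nothing, whereas the vague query still ranks the available cars by how close their colour is to the requested one.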


Complex Vague Queries in VQS: A simplified view of the problem

[Figure: a query relation with attributes Attribute 1 … Attribute n and rows (Value_11, …, Value_n1) through (Value_1k, …, Value_nk); each Attribute i is connected through Index i to NCR-Table i, together with the vague query processing module.]


The issue of the dimensionality curse [Weber et al 1998; Beyer et al 1999]

NCR-Tables with high-dimensional data:

• The probability of overlaps between a query and data regions is very high, and thus the performance of multidimensional access methods (MAMs) decreases significantly

• A linear scan over the whole data set would perform better than MAMs

Approximate nearest neighbor problem: a point P is a (1+ε)-approximate nearest neighbor of a query Q if, for the true nearest neighbor P' of Q,

$\mathrm{dist}(Q, P) \le (1+\varepsilon)\,\mathrm{dist}(Q, P')$ (1)

• Existing work addresses almost exclusively single data sets: single-feature nearest neighbor (S-FNN) queries [Arya et al 1998, Kleinberg 1997, Amato et al 2000, Ciaccia and Patella 2000, etc.]
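Condition (1) can be checked directly on a small data set. The sketch below is illustrative only: it finds the true nearest neighbor by a linear scan, and the Euclidean metric and the function name `is_approx_nn` are assumptions, not part of the presentation.

```python
import math


def is_approx_nn(q, p, data, eps):
    """Check dist(Q, P) <= (1 + eps) * dist(Q, P').

    P' is the true nearest neighbor of q in `data`, found by linear scan.
    Assumption: Euclidean distance.
    """
    true_nn_dist = min(math.dist(q, x) for x in data)
    return math.dist(q, p) <= (1.0 + eps) * true_nn_dist


data = [(0, 0), (3, 0), (0, 4)]
q = (1, 0)
# The true NN of q is (0, 0) at distance 1; (3, 0) lies at distance 2.
print(is_approx_nn(q, (3, 0), data, eps=1.0))
print(is_approx_nn(q, (3, 0), data, eps=0.5))
```

With ε = 1 the point (3, 0) is acceptable because its distance is at most twice the true nearest-neighbor distance; with ε = 0.5 it is not.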


Solving Complex Vague Queries in VQS: "Random access" [Fagin 1996] is impossible

[Figure: a query relation with attributes Attr1 and Attr2 containing the records (x1, y1), (x1, y2), (x2, y1), …, next to two domain tables holding the values x1, x2, … and y1, y2, …]


Incremental hyper-Cube Approach (ICA) [Kueng and Palkoska 1999]

Issues with the ICA (see [Dang et al 2002a, Dang et al 2002b] for the details):

• How to determine the initial hyper-cubes?
• How to extend the hyper-cubes when necessary?
• Accessing unnecessary disk pages and objects
• Repeated disk accesses
• Only the best match record is returned (not the top-k records)


INCREMENTAL HYPER-SPHERE APPROACH (ISA)

Input:

• A query relation/view S
• A complex vague query Q with n query conditions qi (i = 1, 2, …, n)
• Assume each feature space (or NCR-Table) related to Q is managed by a multidimensional index structure Fi

Output: Best match record/tuple Tmin for Q, Tmin ∈ S. Ties are arbitrarily broken.

Step 1: Search on each Fi for the corresponding qi using the adapted incremental algorithm for hyper-sphere range queries.

Step 2: Combine the search results from all qi to find at least one appropriate record in S, i.e. a record that contains the returned NCR-Values with respect to each query condition. If no appropriate record is found, go back to step 1.

Step 3: Compute total distances/scores for the found records using formula 2 and find a record Tmin with the minimum total distance TDcur. Ties are arbitrarily broken.


Step 4: Compute the maximum searching radius for each qi with respect to TDcur using formula 3 and continue the search as in steps 1, 2 and 3 until one of the following two conditions holds: (a) the current searching radius of each qi is greater than or equal to its maximum searching radius; (b) a new appropriate record Tnew is found with total distance TDnew < TDcur.

Step 5: If condition (a) holds, return Tmin as the best match for Q. Otherwise, i.e. condition (b) holds, replace Tmin with Tnew (so TDcur is replaced with the smaller value TDnew) and go back to step 4.
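The five steps can be sketched as a small in-memory loop. This is a simplified illustration under stated assumptions, not the authors' implementation: formulas 2 and 3 are not reproduced in this text, so the sketch takes the total distance to be the plain sum of the per-condition distances and the maximum searching radius of every qi to be TDcur itself. Each index structure Fi is replaced by a list of (distance, NCR-key) pairs already sorted by distance, and the name `isa_best_match` is hypothetical.

```python
def isa_best_match(records, ranked):
    """Simplified ISA loop.

    records: tuples of NCR-keys, one key per query condition.
    ranked:  for each condition i, a list of (distance, key) pairs sorted
             by distance (stand-in for incremental search on Fi).
    Assumptions: total distance = sum of per-condition distances;
    maximum searching radius of each qi = TDcur.
    """
    n = len(ranked)
    seen = [dict() for _ in range(n)]  # key -> distance, per condition
    pos = [0] * n                      # next unread entry per stream
    scored = set()                     # records already scored
    best, td_cur = None, float("inf")

    def radius(i):
        # Current searching radius: distance of the last value returned.
        return ranked[i][pos[i] - 1][0] if pos[i] > 0 else 0.0

    while True:
        # Condition (a): a stream stays active only while it has values
        # left and its radius is below the maximum searching radius.
        active = [i for i in range(n)
                  if pos[i] < len(ranked[i]) and radius(i) < td_cur]
        if not active:
            return best, td_cur
        # Step 1: advance every active stream by one value.
        for i in active:
            dist, key = ranked[i][pos[i]]
            pos[i] += 1
            seen[i][key] = dist
        # Steps 2-3: score records whose values have all been returned.
        for rec in records:
            if rec not in scored and all(rec[i] in seen[i] for i in range(n)):
                scored.add(rec)
                td = sum(seen[i][rec[i]] for i in range(n))
                if td < td_cur:        # condition (b): better record found
                    best, td_cur = rec, td


records = [("x1", "y1"), ("x1", "y2"), ("x2", "y1")]
ranked = [[(1, "x2"), (5, "x1")], [(2, "y2"), (3, "y1")]]
print(isa_best_match(records, ranked))
```

Under the sum assumption the stopping rule is safe: any record not yet scored has at least one value that has not been returned, and that value alone is at least as far away as the current radius, which is already no smaller than TDcur.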


Modifying ISA to retrieve top-k records: see [Dang et al 2002b]

When the feature spaces are high-dimensional and/or the number of query conditions increases, ISA performance decreases.


ε-ISA: FINDING APPROXIMATE NEAREST NEIGHBORS OF COMPLEX VAGUE QUERIES

CVQ = M-FNN (Multi-Feature Nearest Neighbor) query

Using lower bound total distance (LBTD)


Input:

• A query relation/view S
• A complex vague query Q with n query conditions qi (i = 1, 2, …, n)
• Assume each feature space (or NCR-Table) related to Q is managed by a multidimensional index structure Fi
• A real ε > 0 used as the error tolerance

Output: (1+ε)-approximate NN record/tuple Tapp for Q, Tapp ∈ S. Ties are arbitrarily broken.

Step 1: Search on each Fi for the corresponding qi using the adapted incremental algorithm for hyper-sphere range queries.

Step 2: Combine the search results from all qi to find at least one appropriate record in S, i.e. a record that contains the returned NCR-Values with respect to each query condition. If no appropriate record is found, go back to step 1.

Step 3: Compute total distances/scores for the found records using formula 2 and find a record Tapp with the minimum total distance TDcur. Ties are arbitrarily broken.


Step 4: Let di be the distance from query condition qi to the last NCR-Value returned in the corresponding feature space, which is managed by Fi. Compute LBTD as follows:

$\mathrm{LBTD} = \min\{TD_{cur}, d_i\},\quad i = 1, 2, \ldots, n$ (5)

Step 5: If TDcur ≤ (1+ε)·LBTD, return Tapp as a (1+ε)-approximate NN record for Q. Otherwise, go to step 6.

Step 6: Compute the maximum searching radius for each qi with respect to TDcur using formula 3 and continue the search as in steps 1 to 5 until the algorithm stops at step 5. If the current searching radius of a certain qi is greater than or equal to its maximum searching radius, searching on Fi is stopped.

See the next slide for an example.


Lower Bound Total Distance - An example

[Figure: a query relation QR with attributes Attr1 and Attr2 containing the records (A, B) and (C, D), shown together with the points A, B, C, D and the query conditions q1 and q2 in the feature spaces.]
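The lower-bound test of steps 4 and 5 is small enough to write out. The sketch below reads formula 5 as the minimum over TDcur and all di, which is one interpretation of the slide; the function names `lbtd` and `can_stop` are hypothetical, and the distances are assumed to be on the same scale as the total distance.

```python
def lbtd(td_cur, last_dists):
    """Formula 5, read as LBTD = min{TDcur, d_1, ..., d_n}."""
    return min([td_cur, *last_dists])


def can_stop(td_cur, last_dists, eps):
    """Step 5: stop as soon as TDcur <= (1 + eps) * LBTD."""
    return td_cur <= (1.0 + eps) * lbtd(td_cur, last_dists)


# Best record so far has total distance 4; the incremental searches have
# reached distances 3 and 5 in the two feature spaces, so LBTD = 3.
print(can_stop(4, [3, 5], eps=0.5))
print(can_stop(4, [3, 5], eps=0.25))
```

A larger ε lets the search stop earlier: with ε = 0.5 the bound (1+ε)·LBTD is 4.5 and the current record is accepted, while with ε = 0.25 the bound is 3.75 and the search continues at step 6.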


Approximate k-nearest neighbors

See our paper for more details


EXPERIMENTAL RESULTS

Data sets:

• Uniformly distributed: 2, 4, and 8 dimensions (100K objects for each of them)
• Real: 9 and 16 dimensions (more than 64K feature vectors of images, URL: http://kdd.ics.uci.edu/)

Using the SH-tree [Dang et al 2001a] to manage multidimensional data

Page size: 8KB

100 query points were randomly selected from each corresponding data set

...


2-condition (4-d and 8-d) NN queries, different ε values


2-condition (4-d) k-NN queries, ε = 0.2


3-condition (2-d) NN queries, different ε values

2-condition NN queries (9-d and 16-d real data sets), ε = 1

• ε = 1 means that a tolerant error of up to 100% is permitted
• ε-ISA saved about 4.5% and 1% of the affected object and disk access number, respectively, for the 16-d data set, while the accuracy remained at 71%
• One notable fact here is that the effective epsilon, calculated as introduced in (Arya et al. 1998), is quite low, only 0.23. This is a very promising result.


CONCLUSIONS

ε-ISA: An Incremental Lower Bound Approach for Efficiently Finding Approximate Nearest Neighbor of Multi-Feature Queries in VQS

ε-ISA is one of the vanguard solutions for dealing with this problem

ε-ISA is very useful for application domains in which the returned results need not be exact but only similar or approximately similar (within a certain tolerant error) to a given query. The experimental results support this: with a suitable ε value, ε-ISA can save a very high percentage of the costs, including both I/O cost and CPU cost, while still keeping the accuracy of the returned results very high

ε-ISA is applicable not only to numeric domains such as NCR-tables, but also to any ranked input

Application areas: TIS (tourist information systems), GIS, digital libraries, multimedia systems, etc.


More information

• URL: http://www.faw.uni-linz.ac.at/

• E-mail: {khanh, jkueng, rwagner}@faw.uni-linz.ac.at


Research related to dealing with complex vague queries

The A0 algorithm [Fagin 1996] (there are some improvements of Fagin's algorithm, see the paper for more details):

• Finds top-k matches for a user query involving several multimedia attributes
• Problem: this algorithm assumes that random access is possible in the system. This assumption holds only if the following three conditions hold:

1. there is at least a key for each subsystem,
2. there is a mapping between the keys,
3. and we must ensure that the mapping is one-to-one

In VQS: condition (1) is always satisfied (each fuzzy field is the key for the corresponding NCR-table), but there is no one-to-one mapping between the fuzzy fields

Therefore, the A0 algorithm cannot be applied to our problem
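The failing condition (3) can be made concrete with the records from the random-access slide, (x1, y1), (x1, y2), (x2, y1). The helper below is a hypothetical illustration, not part of the VQS: it simply checks whether the fuzzy-field values of a two-attribute query relation map to each other one-to-one.

```python
def is_one_to_one(pairs):
    """Return True if every left value pairs with exactly one right value
    and every right value pairs with exactly one left value."""
    forward, backward = {}, {}
    for a, b in pairs:
        # setdefault stores the first partner seen; any later, different
        # partner shows that the mapping is not one-to-one.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


query_relation = [("x1", "y1"), ("x1", "y2"), ("x2", "y1")]
print(is_one_to_one(query_relation))
```

Here x1 occurs together with both y1 and y2, so knowing a value in one feature space does not identify a single partner value in the other, which is what random access would require.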


Other approaches for multimedia databases: [Ortega et al 1997, Chaudhuri et al 1996, Boehm K. et al 2001] (see our paper)

Chaudhuri et al. 1999 introduced a solution that translates a top-k multi-feature query into a range query that a conventional DBMS can process. This approach employs information in the histograms kept by a relational system


ISA and J* algorithm

The ISA:

• The input is ranked with support of the incremental algorithm adapted for range queries
• Reduces the database access cost first; this cost and the processed states are reduced by taking into account the hyper-sphere range queries and computing the maximum searching radii
• Derived from the ICA, which had been introduced earlier and had the same overall goals as the J* algorithm

The J* algorithm:

• Assumes that the ranked input is available; does not show how to deal with it
• Reduces the processed states first; the database access cost is alleviated by the iterative deepening technique (S. Russell and P. Norvig: Artificial Intelligence: A Modern Approach. Prentice Hall, Inc., 1995)
• Claimed to be the first algorithm that can process "joins" of ranked input and multi-level joins