EGNAT: A Fully Dynamic Metric Access Method for Secondary Memory

Roberto Uribe-Paredes 1,2 and Gonzalo Navarro 3

1 Depto. de Ingeniería en Computación, Universidad de Magallanes, Chile
2 Grupo de Bases de Datos - UART, Universidad Nacional de la Patagonia Austral, Río Turbio, Argentina
3 Dept. of Computer Science, University of Chile, Chile

E-mail: [email protected], [email protected]

September 4, 2009

Outline: Abstract, Introduction, Evolutionary GNAT, Deleting in the EGNAT, Experimental Evaluation, Conclusions


Abstract

We introduce a novel metric space search data structure called EGNAT, which is fully dynamic and designed for secondary memory. The EGNAT implements deletions using a novel technique dubbed Ghost Hyperplanes.


Similarity Search

Range search

The k nearest neighbors

[Figure: a range query centered at q with radius r over the objects u1 to u15.]
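The two query types named on this slide can be stated as brute-force scans, which is the baseline any metric index tries to beat. The following is a minimal sketch, not taken from the slides: the function names, the one-dimensional sample data and the `abs_dist` metric are illustrative assumptions.

```python
import heapq
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def range_search(db: Sequence[T], q: T, r: float,
                 dist: Callable[[T, T], float]) -> List[T]:
    """Return every object u in db with dist(q, u) <= r."""
    return [u for u in db if dist(q, u) <= r]


def knn_search(db: Sequence[T], q: T, k: int,
               dist: Callable[[T, T], float]) -> List[T]:
    """Return the k objects of db closest to q, nearest first."""
    return heapq.nsmallest(k, db, key=lambda u: dist(q, u))


def abs_dist(a: float, b: float) -> float:
    """Toy metric on the real line (assumption for the example)."""
    return abs(a - b)


points = [1.0, 4.0, 6.0, 9.0, 15.0]
in_range = range_search(points, 5.5, 2.0, abs_dist)   # [4.0, 6.0]
nearest = knn_search(points, 5.5, 3, abs_dist)        # [6.0, 4.0, 9.0]
```

Both scans compute one distance per database object; the structures discussed next aim to answer the same queries with fewer distance evaluations.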


Metric Structures

Construction Methods

Based on pivots: $|d(q, p_i) - d(x, p_i)| > r \Rightarrow d(q, x) > r$

Based on clustering or compact partitioning:

  Voronoi partitioning criterion (or hyperplanes): $d(q, c_j) > d(q, c_i) + 2r$

  Covering radius criterion: $d(q, c_i) - r > rc(c_i)$
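The three exclusion criteria above can be written as small predicates over precomputed distances. This is a sketch under assumptions: the function names are hypothetical, and the numeric checks use points on the real line with the absolute difference as the metric.

```python
def pivot_excludes(d_q_p: float, d_x_p: float, r: float) -> bool:
    """Pivot criterion: |d(q,p) - d(x,p)| > r implies d(q,x) > r."""
    return abs(d_q_p - d_x_p) > r


def hyperplane_excludes(d_q_cj: float, d_q_ci: float, r: float) -> bool:
    """Hyperplane criterion: the zone of center c_j cannot contain an
    answer when d(q,c_j) > d(q,c_i) + 2r for some other center c_i."""
    return d_q_cj > d_q_ci + 2 * r


def covering_radius_excludes(d_q_ci: float, rc_ci: float, r: float) -> bool:
    """Covering radius criterion: the zone of c_i cannot contain an
    answer when d(q,c_i) - r > rc(c_i)."""
    return d_q_ci - r > rc_ci


# Toy check on the real line: q = 0, r = 1.
q, r = 0.0, 1.0
# Pivot p = 10, object x = 5: excluded, and indeed d(q, x) = 5 > r.
pivot_ok = pivot_excludes(abs(q - 10.0), abs(5.0 - 10.0), r)
# Centers c_i = 1 and c_j = 10: the zone of c_j is excluded.
hyper_ok = hyperplane_excludes(abs(q - 10.0), abs(q - 1.0), r)
# Center c_i = 10 with covering radius 3: the zone is excluded.
cover_ok = covering_radius_excludes(abs(q - 10.0), 3.0, r)
# A zone that reaches the query ball must not be excluded.
cover_keep = covering_radius_excludes(abs(q - 3.0), 3.0, r)
```

Each predicate only uses distances from the query to pivots or centers, so an excluded object or zone costs no further distance evaluations.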


General Information

EGNAT is a new method based on the GNAT [Brin95], which in turn is based on the GHT (Generalized Hyperplane Tree) [Uhlmann91].

It is a clustering-based method that uses both the hyperplane and the covering radius criteria.

It is dynamic and designed for secondary memory.


Construction

Each child $p_i$ maintains a table of distance ranges toward the rest of the sets $D_{p_j}$:

$range_{i,j} = (\min\{d(p_i, x) : x \in D_{p_j}\}, \max\{d(p_i, x) : x \in D_{p_j}\})$

[Figure: an example set of points p1 to p15 and the resulting tree, with p1, p2, p3 and p4 at the top level and the remaining points below them.]
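The range table defined on this slide can be computed directly from its definition. The sketch below is illustrative only: `build_range_table` is a hypothetical name, the sets $D_{p_j}$ are passed as plain lists, and the toy data lives on the real line.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")
Range = Optional[Tuple[float, float]]


def build_range_table(centers: Sequence[T], sets: Sequence[Sequence[T]],
                      dist: Callable[[T, T], float]) -> List[List[Range]]:
    """table[i][j] = (min, max) of dist(centers[i], x) over x in sets[j].

    An empty set yields None, since it has no distances to summarize.
    """
    table: List[List[Range]] = []
    for p_i in centers:
        row: List[Range] = []
        for d_pj in sets:
            ds = [dist(p_i, x) for x in d_pj]
            row.append((min(ds), max(ds)) if ds else None)
        table.append(row)
    return table


# Toy example: two centers on the real line and the sets assigned to them.
centers = [0.0, 10.0]
sets = [[1.0, 2.0], [8.0, 11.0]]
table = build_range_table(centers, sets, lambda a, b: abs(a - b))
# table[0][1] == (8.0, 11.0): distances from center 0.0 to the set of 10.0
# table[1][0] == (8.0, 9.0):  distances from center 10.0 to the set of 0.0
```

With m centers per node the table has m x m entries, so it grows quadratically with the number of centers while staying independent of the number of stored objects.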


Construction

Two types of nodes:

buckets (leaves)

gnats (internal nodes)

[Figure (a): a GNAT node with six BUCKET nodes as children; a BUCKET node is drawn with its capacity labeled MAX_SIZE_BUCKET.]
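The two node types can be modeled as follows. This is a rough sketch under assumptions that the slides do not state: a bucket stores each object with its distance to the owning center, a bucket that exceeds MAX_SIZE_BUCKET is turned into a gnat node by promoting its first objects to centers, and the values of `MAX_SIZE_BUCKET` and `NUM_CENTERS` are arbitrary. Range tables are left out to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple, Union

MAX_SIZE_BUCKET = 4  # hypothetical capacity, chosen for the example
NUM_CENTERS = 2      # hypothetical number of centers per gnat node


@dataclass
class Bucket:
    # Each entry is (object, distance from the object to the owning center).
    entries: List[Tuple[float, float]] = field(default_factory=list)


@dataclass
class Gnat:
    centers: List[float]
    children: List[Union["Bucket", "Gnat"]]  # one child per center


Node = Union[Bucket, Gnat]


def insert(node: Node, x: float, dist: Callable[[float, float], float],
           d_parent: float = 0.0) -> Node:
    """Insert x below node and return the node that now occupies its place."""
    if isinstance(node, Bucket):
        node.entries.append((x, d_parent))
        if len(node.entries) <= MAX_SIZE_BUCKET:
            return node
        # Overflow: the bucket evolves into a gnat node (assumed policy).
        objs = [o for o, _ in node.entries]
        new = Gnat(centers=objs[:NUM_CENTERS],
                   children=[Bucket() for _ in range(NUM_CENTERS)])
        for o in objs[NUM_CENTERS:]:
            insert(new, o, dist)
        return new
    # Gnat node: route x to the zone of its closest center.
    ds = [dist(x, c) for c in node.centers]
    i = min(range(len(ds)), key=ds.__getitem__)
    node.children[i] = insert(node.children[i], x, dist, ds[i])
    return node


root: Node = Bucket()
for value in [0.0, 10.0, 1.0, 9.0, 2.0]:
    root = insert(root, value, lambda a, b: abs(a - b))
```

After the fifth insertion the root bucket overflows and becomes a gnat node with centers 0.0 and 10.0, each owning a bucket with the objects closest to it.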


Search

buckets (leaves): if $|d(x, p) - d(q, p)| > r$, or

gnats (internal nodes): if $range(p_0, q) \cap range(p_0, D_{p_x}) = \emptyset$,

then we know that $d(x, q) > r$ without computing that distance.

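[Figure: query balls of radius r around q, shown against an object x with center p and against the set D_x bounded by min_d(p, D_x) and max_d(p, D_x).]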
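Both pruning tests on this slide can be sketched in a few lines. The code assumes that each bucket entry carries the stored distance $d(x, p)$ and that $range(p_0, q)$ denotes the interval $[d(q, p_0) - r, d(q, p_0) + r]$, as in the GNAT; the slide does not define it explicitly. Function names and data are illustrative.

```python
from typing import Callable, List, Sequence, Tuple


def search_bucket(entries: Sequence[Tuple[float, float]], q: float, r: float,
                  d_q_p: float, dist: Callable[[float, float], float]
                  ) -> Tuple[List[float], int]:
    """Range search in a bucket whose entries are (x, d(x, p)).

    Returns the answers and the number of distances actually computed.
    """
    answers: List[float] = []
    computed = 0
    for x, d_x_p in entries:
        if abs(d_x_p - d_q_p) > r:
            continue  # d(x, q) > r is guaranteed, skip the computation
        computed += 1
        if dist(q, x) <= r:
            answers.append(x)
    return answers, computed


def ranges_disjoint(d_q_p0: float, r: float, lo: float, hi: float) -> bool:
    """True when [d(q,p0) - r, d(q,p0) + r] does not meet [lo, hi]."""
    return d_q_p0 + r < lo or d_q_p0 - r > hi


def abs_dist(a: float, b: float) -> float:
    return abs(a - b)


# Bucket owned by center p = 0.0 on the real line.
p = 0.0
entries = [(x, abs_dist(x, p)) for x in [1.0, 5.0, 6.0, 20.0]]
q, r = 5.5, 1.0
answers, computed = search_bucket(entries, q, r, abs_dist(q, p), abs_dist)
skip_far_set = ranges_disjoint(abs_dist(q, p), r, 8.0, 11.0)
skip_near_set = ranges_disjoint(abs_dist(q, p), r, 6.0, 7.0)
```

In this toy run only two of the four bucket objects need a distance evaluation, and the set with range (8.0, 11.0) is discarded without visiting it.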


Ghost Hyperplanes

[Figure: partition of the space into the regions S1 to S6, before and after a deletion.]
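The slides do not spell out how searches stay correct once a center has been replaced. One way to reason about it, offered here only as a sketch derived from the triangle inequality and not necessarily the exact rule used by the EGNAT, is to store for each replaced center a tolerance equal to the distance between the old and the new center, and to relax the hyperplane criterion by that amount. The names `tol_j` and `tol_i` are assumptions.

```python
def hyperplane_excludes(d_q_cj: float, d_q_ci: float, r: float) -> bool:
    """Plain hyperplane criterion, valid when no center was replaced."""
    return d_q_cj > d_q_ci + 2 * r


def ghost_hyperplane_excludes(d_q_cj: float, d_q_ci: float, r: float,
                              tol_j: float = 0.0, tol_i: float = 0.0) -> bool:
    """Relaxed criterion after replacements (assumed formulation).

    tol_j and tol_i are the distances between the original and the current
    centers c_j and c_i; zero means the center was never replaced.
    """
    return d_q_cj - tol_j > d_q_ci + tol_i + 2 * r


def abs_dist(a: float, b: float) -> float:
    return abs(a - b)


# Real line: centers 0.0 and 10.0; object x = 6.0 was assigned to 10.0.
x = 6.0
# Center 10.0 is deleted and replaced by 14.0, so its tolerance is 4.0.
c_i, c_j_new, tol_j = 0.0, 14.0, abs_dist(10.0, 14.0)
q, r = 4.5, 2.0  # x is an answer, since d(q, x) = 1.5 <= r

plain = hyperplane_excludes(abs_dist(q, c_j_new), abs_dist(q, c_i), r)
relaxed = ghost_hyperplane_excludes(abs_dist(q, c_j_new), abs_dist(q, c_i),
                                    r, tol_j=tol_j)
```

In this example the plain criterion would wrongly discard the zone that holds x, because x was placed according to the old hyperplane, while the relaxed test keeps the zone.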


Choosing the Replacement

The nearest neighbor.

The nearest descendant.

The nearest descendant located in a leaf.

A promising descendant leaf.

An arbitrary descendant leaf.


Construction Costs and Deletion Costs

[Figure: six panels, y axis: Distance Evaluations.
Spanish dictionary (n=86061; Centers: 04, 08, 16, 20): Total Construction Cost (x axis: Construction Percentage); Individual Deletion Cost (x axis: Percentage deleted from DB, 1 to 10); Individual Deletion Cost (x axis: Percentage deleted from DB, 10 to 40).
Gauss vectors, dim 10 (n=100000; Centers: 04, 08, 16, 19): the same three plots.]

Aggregated construction costs (left) and individual deletion costs when deleting the first 10% (middle) and 40% (right) of the DB.


Comparative Methods

[Figure: six panels titled "Method 5/3", y axis: Distance Evaluations (%).
Spanish dictionary (Centers: 04, 08, 16, 20): deleting up to 10% and deleting up to 40% (x axis: Percentage deleted from DB); search after modifying 40% (x axis: Range Search, 1 to 4).
Gauss vectors, dim 10 (Centers: 04, 08, 16, 19): deleting up to 10% and deleting up to 40% (x axis: Percentage deleted from DB); search after modifying 40% (x axis: Percentage retrieved from the database, 0.01 to 1).]

Ratio between methods, in deletion (left, middle) and search (right) costs.


Search Costs

[Figure: six panels, y axis: Distance Evaluations.
Spanish dictionary (n=86061; Centers: 04, 08, 16, 20; x axis: Range Search, 1 to 4): Search Cost after construction; after modifying 10%; after modifying 40%.
Gauss vectors, dim 10 (n=100000; Centers: 04, 08, 16, 19; x axis: Percentage retrieved from the database, 0.01 to 1): Search Cost after construction; after modifying 10%; after modifying 40%.]

Search costs after construction and after deleting and reinserting part of the database.


Construction Costs on Secondary Memory

[Figure: six panels titled "Total Construction Cost", x axis: Construction Percentage.
Spanish dictionary (n=86061; series: mtree 0.4, mtree 0.1, egnat (20), egnat B (20-4)): y axes Distance Evaluations, Reads, Writes.
Gauss vectors, dim 10 (n=100000; series: mtree 0.4, mtree 0.1, egnat (19), egnat B (19-4)): y axes Distance Evaluations, Reads, Writes.]

Construction costs of the secondary memory variants: distance computations, disk reads and writes.


Search Costs of the Secondary Memory Variants

[Figure: four panels titled "Search Cost".
Spanish dictionary (n=86061; series: mtree 0.4, mtree 0.1, egnat (20), egnat B (20-4); x axis: Range Search, 1 to 4): y axes Distance Evaluations and Reads/Seeks.
Gauss vectors, dim 10 (n=100000; series: mtree 0.4, mtree 0.1, egnat (19), egnat B (19-4); x axis: Percentage retrieved from the database, 0.01 to 1): y axes Distance Evaluations and Reads/Seeks.]

Search costs of the secondary memory variants: distance computations and disk reads.


Deletion and Search Costs on Secondary Memory

[Figure: six panels.
Spanish dictionary: Individual Deletion Costs (y axis: Disk Accesses; x axis: Percentage deleted from DB, 1 to 10 and 10 to 40; series: Reads, Writes); Search Cost (y axis: Reads/Seeks; x axis: Range Search, 1 to 4; series: 0% deleted, 10% deleted, 40% deleted).
Gauss vectors, dim 10: Individual Deletion Costs (y axis: Disk Accesses; x axis: Percentage deleted from DB, 1 to 10 and 10 to 40; series: Reads, Writes); Search Cost (y axis: Reads/Seeks; x axis: Percentage retrieved from the database, 0.01 to 1; series: 0% deleted, 10% deleted, 40% deleted).]

Deletion costs in secondary memory, deleting 10% (left) and 40% (middle). On the right, disk reads for searching after deletions.


Conclusions

We have presented EGNAT, a dynamic data structure for secondary memory based on hyperplane partitioning.

Experimental results show that, as expected, the M-tree achieves better disk page usage and consequently fewer I/O operations at search time, whereas our EGNAT data structure carries out fewer distance computations.

We have presented a novel mechanism to handle deletions, based on ghost hyperplanes.

The method of ghost hyperplanes is applicable to other similar data structures.
