Abstract
Introduction
Evolutionary GNAT
Deleting in the EGNAT
Experimental Evaluation
Conclusions
EGNAT: A Fully Dynamic Metric Access Method
for Secondary Memory
Roberto Uribe-Paredes 1,2 Gonzalo Navarro 3
1 Depto. de Ingeniería en Computación,
Universidad de Magallanes, Chile
2 Grupo de Bases de Datos - UART,
Universidad Nacional de la Patagonia Austral, Río Turbio, Argentina
3Dept. of Computer Science,
University of Chile, Chile
E-mail: [email protected], [email protected]
September 4, 2009
Abstract
We introduce a novel metric space search data structure called EGNAT, which is fully dynamic and designed for secondary memory. The EGNAT implements deletions using a novel technique dubbed Ghost Hyperplanes.
Similarity Search
Range search
The k nearest neighbors
[Figure: objects u1, ..., u15 and a range query centered at q with radius r.]
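The two query types named on this slide can be sketched as a linear scan, which is the baseline any metric index tries to beat in distance evaluations. This is only an illustration: the function names, the toy database and the absolute-difference metric are assumptions, not part of the slides.

```python
import heapq

def range_search(db, q, r, dist):
    """Return every object u with dist(q, u) <= r (one evaluation per object)."""
    return [u for u in db if dist(q, u) <= r]

def knn_search(db, q, k, dist):
    """Return the k objects closest to q, nearest first."""
    return heapq.nsmallest(k, db, key=lambda u: dist(q, u))

def abs_dist(a, b):
    # Toy metric on the real line, used only for this illustration.
    return abs(a - b)

db = [1.0, 4.0, 6.5, 9.0, 15.0]
in_range = range_search(db, 5.0, 1.5, abs_dist)  # objects within radius 1.5 of q = 5.0
nearest = knn_search(db, 5.0, 2, abs_dist)       # the 2 nearest neighbors of q = 5.0
```

Both calls evaluate the distance once per database object; the structures discussed next aim to skip most of those evaluations.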
Metric Structures
Construction Methods
Based on pivots: $|d(q, p_i) - d(x, p_i)| > r \Rightarrow d(q, x) > r$
Based on clustering or compact partitioning
Voronoi partitioning criterion (or hyperplanes): $d(q, c_j) > d(q, c_i) + 2r$
Covering radius criterion: $d(q, c_i) - r > \mathit{rc}(c_i)$
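The three exclusion conditions above can be written as small predicates over distances that are already known. A minimal sketch, assuming the usual reading that a true result lets the search discard the object x (pivots) or the whole zone of a center (hyperplanes, covering radius); all function and argument names are hypothetical.

```python
def pivot_excludes(d_q_p, d_x_p, r):
    """Pivot rule: |d(q, p) - d(x, p)| > r implies d(q, x) > r."""
    return abs(d_q_p - d_x_p) > r

def hyperplane_excludes(d_q_cj, d_q_ci, r):
    """Hyperplane rule: the zone of c_j is discarded when d(q, c_j) > d(q, c_i) + 2r."""
    return d_q_cj > d_q_ci + 2 * r

def covering_radius_excludes(d_q_ci, rc_ci, r):
    """Covering radius rule: the zone of c_i is discarded when d(q, c_i) - r > rc(c_i)."""
    return d_q_ci - r > rc_ci

# Toy values on the real line with q = 0 and r = 1 (illustration only).
r = 1.0
x_skipped = pivot_excludes(d_q_p=10.0, d_x_p=5.0, r=r)                # p = 10, x = 5
zone_j_skipped = hyperplane_excludes(d_q_cj=8.0, d_q_ci=2.0, r=r)     # c_i = 2, c_j = 8
zone_i_skipped = covering_radius_excludes(d_q_ci=6.0, rc_ci=3.0, r=r)
```

Each predicate uses only distances to pivots or centers, so a true result saves every distance evaluation against the discarded objects.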
General Information
This is a new method based on the GNAT [Brin95], which is in turn based on the GHT (Generalized Hyperplane Tree) [Uhlmann91].
It is based on clustering and uses both the hyperplane and covering radius criteria.
It is dynamic and optimized for secondary memory.
Construction
Each child $p_i$ maintains a table of distance ranges toward the rest of the sets $D_{p_j}$:
$\mathrm{range}_{i,j} = (\min\{d(p_i, x) : x \in D_{p_j}\}, \max\{d(p_i, x) : x \in D_{p_j}\})$.
[Figure: example set of points p1, ..., p15 and the resulting tree, with p1, p2, p3, p4 at the root and the remaining points below them.]
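The range table can be computed directly from its definition. A minimal sketch, assuming the sets D_{p_j} are given as plain lists (called zones here, a hypothetical name) and using a toy metric on the real line:

```python
def build_range_table(centers, zones, dist):
    """table[i][j] = (min, max) of dist(centers[i], x) over x in zones[j].

    An empty zone yields None, since min and max are undefined for it.
    """
    table = []
    for p_i in centers:
        row = []
        for zone in zones:
            ds = [dist(p_i, x) for x in zone]
            row.append((min(ds), max(ds)) if ds else None)
        table.append(row)
    return table

def abs_dist(a, b):
    # Toy metric, for illustration only.
    return abs(a - b)

centers = [0.0, 10.0]                # p_1, p_2
zones = [[1.0, 3.0], [8.0, 12.0]]    # D_{p_1}, D_{p_2}
table = build_range_table(centers, zones, abs_dist)
```

With m centers the table holds m x m pairs, so its size depends on the arity of the node and not on how many objects lie below it.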
Construction
Two types of nodes:
buckets (leaves)
gnats (internal nodes)
[Figure: (a) a GNAT node with six BUCKET nodes as children; a BUCKET node of size MAX_SIZE_BUCKET.]
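The two node types can be sketched as follows. The slides only name the node types and MAX_SIZE_BUCKET; the overflow policy shown here (a full bucket becomes a gnat node whose centers are its first m objects, and the rest are reinserted) is an assumption made for illustration, as are the capacity value and all identifiers.

```python
from dataclasses import dataclass, field

MAX_SIZE_BUCKET = 4  # hypothetical capacity; the slides give no value

@dataclass
class Bucket:
    """Leaf node: a flat list of objects."""
    objects: list = field(default_factory=list)

@dataclass
class Gnat:
    """Internal node: centers[i] owns the subtree children[i]."""
    centers: list
    children: list

def insert(node, x, dist, m=2):
    """Insert x below node and return the (possibly replaced) node."""
    if isinstance(node, Bucket):
        if len(node.objects) < MAX_SIZE_BUCKET:
            node.objects.append(x)
            return node
        # Assumed evolution step: the full bucket turns into a gnat node.
        pending = node.objects + [x]
        gnat = Gnat(centers=pending[:m], children=[Bucket() for _ in range(m)])
        for y in pending[m:]:
            insert(gnat, y, dist, m)
        return gnat
    # Gnat node: descend into the zone of the closest center.
    i = min(range(len(node.centers)), key=lambda k: dist(x, node.centers[k]))
    node.children[i] = insert(node.children[i], x, dist, m)
    return node

root = Bucket()
for value in [1.0, 2.0, 9.0, 10.0, 3.0]:
    root = insert(root, value, lambda a, b: abs(a - b))
```

After the fifth insertion the root is a gnat node with centers 1.0 and 2.0, and the other three objects sit in the bucket of the center closest to each of them.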
Search
buckets (leaves): if $|d(x, p) - d(q, p)| > r$, or
gnats (internal nodes): if $\mathrm{range}(p_0, q) \cap \mathrm{range}(p_0, D_{p_x}) = \emptyset$,
then we know that $d(x, q) > r$ without computing that distance.
[Figure: a query q with radius r, a center p, and the distances min_d(p, D) and max_d(p, D) to a set D.]
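Both tests on this slide can be sketched in a few lines. Two assumptions are made here and are not stated on the slide: each bucket entry stores d(x, p) computed at insertion time, and range(p_0, q) is taken to be the interval [d(p_0, q) - r, d(p_0, q) + r]. All names are hypothetical.

```python
def bucket_range_search(bucket, p, q, r, dist):
    """Scan a bucket of (x, d(x, p)) pairs; return (hits, distances evaluated)."""
    d_q_p = dist(q, p)
    hits, evaluated = [], 0
    for x, d_x_p in bucket:
        if abs(d_x_p - d_q_p) > r:
            continue  # d(x, q) > r is guaranteed, so d(x, q) is never computed
        evaluated += 1
        if dist(q, x) <= r:
            hits.append(x)
    return hits, evaluated

def ranges_disjoint(d_q_p0, r, range_p0_D):
    """True when [d(q, p0) - r, d(q, p0) + r] misses the stored (min, max) range."""
    lo, hi = range_p0_D
    return d_q_p0 + r < lo or d_q_p0 - r > hi

def abs_dist(a, b):
    # Toy metric, for illustration only.
    return abs(a - b)

p = 0.0
bucket = [(2.0, 2.0), (5.0, 5.0), (9.0, 9.0)]   # (object, stored distance to p)
hits, evaluated = bucket_range_search(bucket, p, q=4.5, r=1.0, dist=abs_dist)
skip_subtree = ranges_disjoint(d_q_p0=4.5, r=1.0, range_p0_D=(8.0, 12.0))
```

In this toy run only one of the three bucket objects needs a distance evaluation against q, and the subtree whose range is (8.0, 12.0) is not entered at all.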
Ghost Hyperplanes
Partition of the space before and after a deletion.
[Figure omitted; labels: S1, ..., S6.]
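The slide does not spell out how searches stay correct once a center has been replaced, so the following is a sketch derived only from the triangle inequality and should not be read as the exact rule of the EGNAT. It assumes each center records a tolerance, the distance between the deleted center and its replacement, and relaxes the hyperplane criterion by the tolerances of both centers. The names tol_i and tol_j are hypothetical.

```python
def hyperplane_excludes(d_q_cj, d_q_ci, r):
    """Plain hyperplane rule, valid while no center has been replaced."""
    return d_q_cj > d_q_ci + 2 * r

def hyperplane_excludes_with_tolerance(d_q_cj, d_q_ci, r, tol_j=0.0, tol_i=0.0):
    """Relaxed rule: each distance to a replaced center may be off by its tolerance."""
    return d_q_cj - tol_j > d_q_ci + tol_i + 2 * r

# Toy example on the real line (illustration only).
# Original centers: c_i = 0 and c_j = 10; object x = 6 was stored in the zone of c_j.
# c_j is deleted and replaced by 14, so tol_j = |14 - 10| = 4.
q, r, x = 5.5, 1.0, 6.0
d_q_ci = abs(q - 0.0)
d_q_cj_new = abs(q - 14.0)
x_is_answer = abs(q - x) <= r
plain = hyperplane_excludes(d_q_cj_new, d_q_ci, r)
relaxed = hyperplane_excludes_with_tolerance(d_q_cj_new, d_q_ci, r, tol_j=4.0)
```

Here the plain rule would discard the zone that still holds the answer x, while the relaxed rule keeps it. The price is that fewer zones are discarded as tolerances grow.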
Choosing the Replacement
The nearest neighbor.
The nearest descendant.
The nearest descendant located in a leaf.
A promising descendant leaf.
An arbitrary descendant leaf.
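The first three options above share one step: among some candidate set, pick the object closest to the deleted center. A minimal sketch of that step, assuming the candidate set (whole database, descendants, or descendants in leaves) has already been collected by the caller; the function name and the toy values are hypothetical.

```python
def nearest_replacement(deleted, candidates, dist):
    """Return (replacement, displacement) for the candidate closest to the deleted center."""
    best = min(candidates, key=lambda c: dist(deleted, c))
    return best, dist(deleted, best)

def abs_dist(a, b):
    # Toy metric, for illustration only.
    return abs(a - b)

replacement, displacement = nearest_replacement(10.0, [6.0, 11.5, 13.0], abs_dist)
```

A smaller displacement moves the partition less, at the cost of one distance evaluation per candidate examined. The last two options on the list avoid that scan by settling for a leaf object that need not be the closest.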
Construction Costs and Deletion Costs
[Figure: six panels, y-axis "Distance Evaluations", one curve per number of centers: 4, 8, 16, 20 for the Spanish dictionary (n=86061) and 4, 8, 16, 19 for Gauss vectors (n=100000, dim 10). Per dataset: "Total Construction Cost" vs. "Construction Percentage"; "Individual Deletion Cost" vs. "Percentage deleted from DB" for 1% to 10% and for 10% to 40%.]
Aggregated construction costs (left) and individual deletion costs when deleting the first 10% (middle) and 40% (right) of the DB.
Comparative Methods
[Figure: six panels, y-axis "Distance Evaluations (%)", one curve per number of centers: 4, 8, 16, 20 for the Spanish dictionary and 4, 8, 16, 19 for Gauss vectors (dim. 10). Per dataset: "Method 5/3, deleting up to 10%" and "Method 5/3, deleting up to 40%" vs. "Percentage deleted from DB"; "Method 5/3, search after modifying 40%" vs. "Range Search" (Spanish dict.) or "Percentage retrieved from the database" (Gauss).]
Ratio between methods, in deletion (left, middle) and search (right) costs.
Search Costs
[Figure: six panels, y-axis "Distance Evaluations", one curve per number of centers: 4, 8, 16, 20 for the Spanish dictionary (n=86061) and 4, 8, 16, 19 for Gauss vectors (n=100000, dim 10). Per dataset: "Search Cost after construction", "Search Cost after modifying 10%" and "Search Cost after modifying 40%", vs. "Range Search" (Spanish dict.) or "Percentage retrieved from the database" (Gauss).]
Search costs after construction and after deleting and reinserting part of the database.
Construction Costs on Secondary Memory
[Figure: six panels titled "Total Construction Cost", x-axis "Construction Percentage", y-axes "Distance Evaluations", "Reads" and "Writes". Curves: mtree 0.4, mtree 0.1, egnat (20) and egnat B (20-4) for the Spanish dictionary (n=86061); mtree 0.4, mtree 0.1, egnat (19) and egnat B (19-4) for Gauss vectors (n=100000, dim 10).]
Construction costs of the secondary memory variants: distance computations, disk reads and writes.
Search Costs of the Secondary Memory Variants
[Figure: four panels titled "Search Cost", y-axes "Distance Evaluations" and "Reads/Seeks", x-axis "Range Search" (Spanish dictionary, n=86061) or "Percentage retrieved from the database" (Gauss vectors, n=100000, dim 10). Curves: mtree 0.4, mtree 0.1, egnat (20) and egnat B (20-4) for the Spanish dictionary; mtree 0.4, mtree 0.1, egnat (19) and egnat B (19-4) for Gauss vectors.]
Search costs of the secondary memory variants: distance computations and disk reads.
Deletion and Search Costs on Secondary Memory
[Figure: six panels for the Spanish dictionary and Gauss vectors (dim. 10). Per dataset: "Individual Deletion Costs" (y-axis "Disk Accesses", curves "Reads" and "Writes") vs. "Percentage deleted from DB" for 1% to 10% and for 10% to 40%; "Search Cost" (y-axis "Reads/Seeks", curves "0% deleted", "10% deleted", "40% deleted") vs. "Range Search" (Spanish dict.) or "Percentage retrieved from the database" (Gauss).]
Deletion costs in secondary memory, deleting 10% (left) and 40% (middle). On the right, disk reads for searching after deletions.
Conclusions
We have presented a dynamic and secondary-memory-bound data structure based on hyperplane partitioning, EGNAT.
Experimental results show that, as expected, the M-tree achieves better disk page usage and consequently fewer I/O operations at search time, whereas our EGNAT data structure carries out fewer distance computations.
We have presented a novel mechanism to handle deletions based on ghost hyperplanes.
The method of ghost hyperplanes is applicable to other similar data structures.