A Rule-Based Optimizer for Spatial Join Operations
Miguel Fornari
João Luiz Comba
Cirano Iochpe
Instituto de Informática
Universidade Federal do Rio Grande do Sul
Porto Alegre - Brazil
Outline
1. Introduction and motivation
2. Spatial Hash Algorithms
3. The Validation System Architecture
4. The Rules
5. Conclusions
Introduction and Motivation
The spatial join is a fundamental and expensive operation in GIS
It combines two sets of spatial features based on a spatial predicate
A DBMS traditionally improves performance using heuristic rules and cost expressions
A spatial DBMS can include a dedicated module for spatial operations
Goal
Reduce the response time of spatial join algorithms in the filter step
A set of rules to optimize the performance of some well-known algorithms
Which parameters are relevant?
What is the best value for each important parameter?
SJ Algorithms
According to the file organization
[Diagram: classification of spatial join algorithms by file organization and technique]
File organizations: sequential files, sorted files, indexed files, one indexed file, hashed files
Techniques: nested loop, query loop, file sorting, file indexing, space subdivision, hashing (?)
Algorithms:
Nested Loop [5]
Index Nested [5]
Scalable Sweeping-based Spatial Join [7]
Iterative Spatial Join [8]
Sort Sweep-Based Spatial Join [10]
Synchronized Tree Traversal [3]
Priority Queue-Driven Process [9]
Priority Queue-Driven Process 2 [9]
Build & Match [4]
Seeded-Tree [5]
Slot Index Spatial Join [11]
Partition Based Spatial Merge Join [2]
Spatial Hash Join [5]
Size Separation Spatial Join [6]
Histogram-based Hash Stripped Join [2]
SJ algorithms
For each algorithm, two cost expressions are important
I/O cost
CPU cost
Some cost expressions are already known, but not all
All expressions are written in a similar notation
The System Architecture
The performance analysis, although correct, simplifies many cases
Real cases are more complex
Real data sets obtained from the Internet
[Architecture diagram: Interface; Join Module (STT, PBSM, ISSJ, HHSJ); LRU Buffer; RTO Buffer; Data import; Data Generator]
Plane-sweep algorithm
All SJ algorithms load objects into memory and run a plane-sweep algorithm to check whether pairs of objects satisfy the spatial predicate.
The traditional complexity bound is O(k + n log n), where k is the number of object intersections.
However, O(c + n log n), where c is the number of comparisons performed, is more precise.
[Figure: plane-sweep example with objects o1, o2, o3, o4]
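The difference between k and c can be seen in a small sketch. The code below is an illustration only, not the implementation used in the system: it assumes objects are represented by MBRs stored as (xmin, ymin, xmax, ymax) tuples, sweeps along the x axis, and uses the hypothetical name `plane_sweep_join`. It returns both the intersecting pairs (k of them) and the number of comparisons performed (c).

```python
from typing import List, Tuple

# Assumed layout for this sketch: (xmin, ymin, xmax, ymax)
MBR = Tuple[float, float, float, float]


def plane_sweep_join(r: List[MBR], s: List[MBR]):
    """Sweep along x; return (intersecting index pairs, comparisons c)."""
    # Sorting both inputs by xmin is the n log n term.
    r_order = sorted(range(len(r)), key=lambda i: r[i][0])
    s_order = sorted(range(len(s)), key=lambda j: s[j][0])
    pairs = []
    comparisons = 0
    i = j = 0
    while i < len(r_order) and j < len(s_order):
        ri, sj = r_order[i], s_order[j]
        if r[ri][0] <= s[sj][0]:
            # r[ri] is next on the sweep line: scan s while x-ranges overlap.
            k = j
            while k < len(s_order) and s[s_order[k]][0] <= r[ri][2]:
                sk = s_order[k]
                comparisons += 1  # one y-range test per candidate pair
                if r[ri][1] <= s[sk][3] and s[sk][1] <= r[ri][3]:
                    pairs.append((ri, sk))
                k += 1
            i += 1
        else:
            # s[sj] is next on the sweep line: scan r while x-ranges overlap.
            k = i
            while k < len(r_order) and r[r_order[k]][0] <= s[sj][2]:
                rk = r_order[k]
                comparisons += 1
                if r[rk][1] <= s[sj][3] and s[sj][1] <= r[rk][3]:
                    pairs.append((rk, sj))
                k += 1
            j += 1
    return pairs, comparisons


r = [(0, 0, 2, 2), (5, 5, 6, 6)]
s = [(1, 1, 3, 3), (1, 10, 3, 11), (7, 7, 8, 8)]
pairs, c = plane_sweep_join(r, s)
print(pairs, c)
```

In this example one pair intersects (k = 1), but two comparisons are performed (c = 2), because the second object of `s` overlaps the first object of `r` on x but not on y.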
Plane-sweep algorithm
Divide the space into strips
Count the number of objects in each strip
The strip width is the average object size
Estimate of c = sum of the per-strip counts
Strips along one axis:
Slice 1: 3 objects
Slice 2: 5 objects
Slice 3: 4 objects
Slice 4: 5 objects
Slice 5: 5 objects
Slice 6: 2 objects
Strips along the other axis:
Slice 1: 2 objects
Slice 2: 2 objects
Slice 3: 3 objects
Slice 4: 3 objects
Slice 5: 1 object
Slice 6: 2 objects
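The estimate can be sketched as follows. This is one possible reading of the slide, not the authors' code: MBRs are assumed to be (xmin, ymin, xmax, ymax) tuples, the strip width is taken as the average object extent along the chosen axis, and the names `strip_counts`, `estimate_c`, and `choose_axis` are hypothetical. The labels "first" and "second" for the two sets of slice counts are placeholders, since the slide does not say which axis each set belongs to.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Assumed layout for this sketch: (xmin, ymin, xmax, ymax)
MBR = Tuple[float, float, float, float]


def strip_counts(mbrs: List[MBR], axis: int) -> List[int]:
    """Count the objects overlapping each strip along `axis` (0 = x, 1 = y)."""
    lo = min(m[axis] for m in mbrs)
    hi = max(m[axis + 2] for m in mbrs)
    # Strip width = average object extent along the axis (assumption).
    width = sum(m[axis + 2] - m[axis] for m in mbrs) / len(mbrs)
    if width <= 0:
        return [len(mbrs)]
    n_strips = max(1, math.ceil((hi - lo) / width))
    counts = [0] * n_strips
    for m in mbrs:
        first = min(n_strips - 1, int((m[axis] - lo) / width))
        last = min(n_strips - 1, int((m[axis + 2] - lo) / width))
        for strip in range(first, last + 1):
            counts[strip] += 1
    return counts


def estimate_c(counts: Sequence[int]) -> int:
    """Estimate of c: the sum of the per-strip counts."""
    return sum(counts)


def choose_axis(counts_by_axis: Dict[str, Sequence[int]]) -> str:
    """Pick the axis whose estimate of c is smallest."""
    return min(counts_by_axis, key=lambda a: estimate_c(counts_by_axis[a]))


# Per-strip counts copied from the slide; the axis labels are placeholders.
slide_counts = {
    "first": [3, 5, 4, 5, 5, 2],
    "second": [2, 2, 3, 3, 1, 2],
}
print({a: estimate_c(v) for a, v in slide_counts.items()})
print(choose_axis(slide_counts))
```

With the counts from the slide, the estimates are 24 and 13, so the second subdivision would be preferred.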
Rule 1 - Plane-sweep
The DBMS can estimate c for each axis and choose the one with the smaller value of c, optimizing the plane-sweep.
The shape of the objects affects the response time
[Figure: Response time varying the shape of objects. X axis: shape, relation between side sizes (P5:1 to P1:5); Y axis: response time (s); series: PBSM, ISSJ, STT, HHSJ]
Synchronized Tree Traversal (STT)
Well-known algorithm for R-trees
The performance depends on the height of the R-trees and the average node size.
The space division reduces the number of object comparisons (c).
Available memory is not important.
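A minimal sketch of the synchronized traversal is given below, to show where the tree height and node size enter. It is an in-memory illustration under stated assumptions, not the system's implementation: the `Node` class, the `oid` field, and the function `stt_join` are hypothetical, MBRs are (xmin, ymin, xmax, ymax) tuples, and no buffering or plane-sweep inside nodes is modelled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumed layout for this sketch: (xmin, ymin, xmax, ymax)
MBR = Tuple[float, float, float, float]


def intersects(a: MBR, b: MBR) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Node:
    mbr: MBR
    children: List["Node"] = field(default_factory=list)  # empty for data entries
    oid: Optional[int] = None  # object id, set only on data entries

    @property
    def is_entry(self) -> bool:
        return not self.children


def stt_join(a: Node, b: Node, out: List[Tuple[int, int]]) -> None:
    """Descend both trees together, pruning pairs whose MBRs are disjoint."""
    if not intersects(a.mbr, b.mbr):
        return
    if a.is_entry and b.is_entry:
        out.append((a.oid, b.oid))
    elif a.is_entry:
        for cb in b.children:
            stt_join(a, cb, out)
    elif b.is_entry:
        for ca in a.children:
            stt_join(ca, b, out)
    else:
        # Each node pair costs up to |a.children| * |b.children| comparisons,
        # which is why the number of entries per node matters.
        for ca in a.children:
            for cb in b.children:
                stt_join(ca, cb, out)


leaf1 = Node((0, 0, 4, 4), [Node((0, 0, 2, 2), oid=1), Node((3.5, 3.5, 4, 4), oid=2)])
leaf2 = Node((6, 6, 10, 10), [Node((6, 6, 7.5, 7.5), oid=3)])
tree_a = Node((0, 0, 10, 10), [leaf1, leaf2])
tree_b = Node((1, 1, 8, 8), [Node((1, 1, 3, 3), oid=10), Node((7, 7, 8, 8), oid=11)])

result: List[Tuple[int, int]] = []
stt_join(tree_a, tree_b, result)
print(result)
```

The sketch also handles trees of different heights by descending only the deeper side once the other side has reached a data entry.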
Rule 2 - STT
STT performs best when nodes have a low number of entries.
However, fewer entries per node means more nodes in total, which sets a lower limit for the rule.
Optimal fanout: between 50 and 75
[Figure: Response time varying node fanout. X axis: fanout (0 to 125); Y axis: response time (s); series: go_soilusage X go_hydro, la_streets X la_rr, usa_rr X usa_rivers]
Rule 2 - STT
The performance of the STT algorithm stays constant as the memory buffer size increases.
Except for very small values
Set any value greater than 4 * height of the R-trees
Iterative Stripped Spatial Join (ISSJ)
Iterative Spatial Join (Jacox & Samet) + strips
Strips divide the space and reduce c
Objects that cross strip boundaries can occur
The sorting can be either internal or external
The performance depends on the memory available, the number of strips, and replicas.
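The trade-off between the number of strips and the number of replicas can be sketched as follows. This is an illustration under assumptions, not the ISSJ implementation: strips are taken as equal-width intervals on the x axis, an object is copied into every strip it overlaps, and the replication factor is assumed to be the number of stored copies divided by the number of original objects. The names `assign_to_strips` and `replication_factor` are hypothetical.

```python
from typing import List, Tuple

# Assumed layout for this sketch: (xmin, ymin, xmax, ymax)
MBR = Tuple[float, float, float, float]


def assign_to_strips(mbrs: List[MBR], n_strips: int,
                     lo: float, hi: float) -> List[List[int]]:
    """Copy each object id into every equal-width x strip it overlaps."""
    width = (hi - lo) / n_strips
    strips: List[List[int]] = [[] for _ in range(n_strips)]
    for oid, m in enumerate(mbrs):
        first = min(n_strips - 1, max(0, int((m[0] - lo) / width)))
        last = min(n_strips - 1, max(0, int((m[2] - lo) / width)))
        for strip in range(first, last + 1):
            strips[strip].append(oid)
    return strips


def replication_factor(strips: List[List[int]], n_objects: int) -> float:
    """Stored copies divided by original objects (1.0 means no replicas)."""
    return sum(len(s) for s in strips) / n_objects


objects = [(0, 0, 1, 1), (3, 0, 6, 1), (9, 0, 10, 1)]
for n in (1, 4, 12):
    strips = assign_to_strips(objects, n, lo=0, hi=12)
    print(n, max(len(s) for s in strips), replication_factor(strips, len(objects)))
```

In this small example, more strips lower the largest strip population but raise the replication factor, which is the tension the rule for ISSJ has to balance.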
Rule 3 - ISSJ
ISSJ performs best with a large number of strips, so that each strip holds few objects. The gain is limited by the replicas that additional strips create.
[Figure: Algorithm ISSJ - Response time x Number of strips. X axis: number of strips (1 to 2048); Y axis: response time (s)]
[Figure: Algorithm ISSJ - Replication factor x Number of strips. X axis: number of strips (1 to 2048); Y axis: replication factor; series: ge_hypsogr X ge_roads, la_rr X la_streets]
Rule 3 - ISSJ
It is important to allocate enough memory to sort each set internally
[Figure: Algorithm ISSJ - Response time x available memory size. X axis: available memory size (# pages), 25 to 1000; Y axis: response time (s); series: gr_rivers X gr_roads, ge_hypsogr X ge_roads, la_rr X la_streets]
Partition Based Spatial Merge Join (PBSM)
Calculates the number of partitions
Uses a regular grid to divide the space
Partitions can overflow
The performance depends on the number of replicas and overflowed partitions
The number of object comparisons (c) is reduced by the space subdivision
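The two quantities that drive PBSM performance, replicas and overflowed partitions, can be illustrated with a simplified sketch. It assumes one grid cell per partition (the mapping from grid cells to partitions is not described on the slide), MBRs as (xmin, ymin, xmax, ymax) tuples, and a hypothetical `capacity` parameter standing for the number of objects a partition can hold in memory. The function name `grid_partition` is also hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed layout for this sketch: (xmin, ymin, xmax, ymax)
MBR = Tuple[float, float, float, float]
Cell = Tuple[int, int]  # (row, column)


def grid_partition(mbrs: List[MBR], cols: int, rows: int,
                   universe: MBR, capacity: int):
    """Assign objects to grid cells; return (cells, replicas, overflowed cells)."""
    xmin, ymin, xmax, ymax = universe
    cell_w = (xmax - xmin) / cols
    cell_h = (ymax - ymin) / rows

    def clamp(value: int, limit: int) -> int:
        return min(limit - 1, max(0, value))

    cells: Dict[Cell, List[int]] = {}
    for oid, m in enumerate(mbrs):
        c0 = clamp(int((m[0] - xmin) / cell_w), cols)
        c1 = clamp(int((m[2] - xmin) / cell_w), cols)
        r0 = clamp(int((m[1] - ymin) / cell_h), rows)
        r1 = clamp(int((m[3] - ymin) / cell_h), rows)
        # An object overlapping several cells is replicated into each of them.
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                cells.setdefault((r, c), []).append(oid)
    replicas = sum(len(v) for v in cells.values()) - len(mbrs)
    overflowed = sorted(cell for cell, objs in cells.items() if len(objs) > capacity)
    return cells, replicas, overflowed


objects = [(0, 0, 1, 1), (1.5, 0.5, 2.5, 1.5), (0.2, 0.2, 0.8, 0.8), (3, 3, 3.5, 3.5)]
cells, replicas, overflowed = grid_partition(objects, cols=2, rows=2,
                                             universe=(0, 0, 4, 4), capacity=2)
print(cells, replicas, overflowed)
```

Here one object straddles two cells (one replica) and the cell at row 0, column 0 holds three objects, exceeding the assumed capacity of two.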
Rule 4 - PBSM
PBSM performs better with a large number of partitions, obtained either by configuring a small memory size or by setting a lower bound on the number of partitions.
[Figure: Algorithm PBSM - Response time x memory size. X axis: available memory size (# pages), 50 to 1000; Y axis: response time (s); series: gr_rivers X gr_roads, ge_hypso X ge_roads, la_rr X la_street]
Rule 4 - PBSM
This rule is limited by the number of replicas, which increase the number of objects processed.
[Figure: Algorithm PBSM - Response time x Memory size. X axis: available memory size (# pages), 25 to 4500; Y axis: response time (s), 0 to 75; series: ca_street X ca_rivers, A100K X B2250K]
Histogram-based Hash Stripped Join (HHSJ)
A histogram of the object distribution guides the space partitioning to avoid overflow
Replicas are counted in the histogram
The objects are kept in a hash file and are loaded into memory only once
The performance is affected by the number of replicas and the space subdivision
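How a histogram can guide partitioning is sketched below. The slides do not give the HHSJ partitioning procedure, so this is an assumed interpretation: a fine-grained histogram counts, for each bucket on the x axis, the objects overlapping it (so replicas are included), and adjacent buckets are merged greedily into strips while the summed count stays within a hypothetical `capacity`. The names `build_histogram` and `strips_from_histogram` are also hypothetical.

```python
from typing import List, Tuple

# Assumed layout for this sketch: (xmin, ymin, xmax, ymax)
MBR = Tuple[float, float, float, float]


def build_histogram(mbrs: List[MBR], n_buckets: int,
                    lo: float, hi: float) -> List[int]:
    """Count, per x bucket, the objects overlapping it (replicas included)."""
    width = (hi - lo) / n_buckets
    hist = [0] * n_buckets
    for m in mbrs:
        first = min(n_buckets - 1, max(0, int((m[0] - lo) / width)))
        last = min(n_buckets - 1, max(0, int((m[2] - lo) / width)))
        for bucket in range(first, last + 1):
            hist[bucket] += 1
    return hist


def strips_from_histogram(hist: List[int], capacity: int) -> List[Tuple[int, int]]:
    """Greedily merge adjacent buckets; return (first_bucket, last_bucket) per strip.

    Summing bucket counts counts an object once per bucket it spans, so the
    sum is an upper bound on the strip population. A single bucket larger
    than `capacity` still becomes a strip of its own.
    """
    strips = []
    start = 0
    load = 0
    for bucket, count in enumerate(hist):
        if bucket > start and load + count > capacity:
            strips.append((start, bucket - 1))
            start, load = bucket, 0
        load += count
    strips.append((start, len(hist) - 1))
    return strips


hist = build_histogram([(0, 0, 1, 1), (3, 0, 6, 1), (9, 0, 10, 1)], 4, lo=0, hi=12)
print(hist)
print(strips_from_histogram([3, 5, 4, 5, 5, 2], capacity=9))
```

With the counts 3, 5, 4, 5, 5, 2 and a capacity of 9, buckets are merged pairwise into three strips, none of which exceeds the capacity.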
Rule 5 - HHSJ
HHSJ performs better with a large number of partitions and a large number of strips.
Rule 5 - HHSJ
This rule is also limited by the number of replicas, which increase the number of objects processed.
[Figure: Algorithm HHSJ - Response time x # of strips. X axis: number of strips (1 to 2048); Y axis: response time (s)]
[Figure: Algorithm HHSJ - Replication factor x # of strips. X axis: number of strips (1 to 2048); Y axis: replication factor; series: ge_hypsogr X ge_roads, la_rr X la_streets]
Conclusions & Future Work
Our main contribution
The use of rules can reduce the response time of individual algorithms, in some cases by more than 50%.
The rules can be incorporated into a real GDBMS
Future work
Run the tests on 3D data sets
Include other spatial operations
Implement in PostGIS
Contact & questions
Miguel Fornari
www.inf.ufrgs.br/~fornari
João Comba
www.inf.ufrgs.br/~comba
Cirano Iochpe
www.inf.ufrgs.br/~ciochpe