A Rule-Based Optimizer for Spatial Join Operations Miguel Fornari João Luiz Comba Cirano Iochpe Instituto de Informática Universidade Federal do Rio Grande do Sul Porto Alegre - Brazil



Outline

1. Introduction and motivation

2. Spatial Hash Algorithms

3. The Validation System Architecture

4. The Rules

5. Conclusions


Introduction and Motivation

The spatial join operation is fundamental and expensive in GIS

It combines two sets of spatial features based on a spatial predicate

A DBMS traditionally improves performance using heuristic rules and cost expressions

A spatial DBMS can include a dedicated module for spatial operations


Goal

Reduce the response time of spatial join algorithms in the filter step

A set of rules to optimize the performance of some well-known algorithms

Which parameters are relevant?

What is the best value for each important parameter?

SJ Algorithms: according to the file organization

[Figure: classification of spatial join (SJ) algorithms by file organization and technique. The labels below are recovered from the diagram; the row and column assignment of each algorithm is not recoverable from the text.]

File organizations: Sequential files, Sorted files, Indexed files, One indexed file, Hashed files

Techniques: Nested loop, Query loop, File sorting, File indexing, Space subdivision, Hashing (?)

Algorithms shown:

Nested Loop [5]

Index Nested [5]

Scalable Sweeping-based Spatial Join [7]

Iterative Spatial Join [8]

Sort Sweep-Based Spatial Join [10]

Synchronized Tree Traversal [3]

Priority Queue-Driven Process [9]

Priority Queue-Driven Process 2 [9]

Build & Match [4]

Seeded-Tree [5]

Slot Index Spatial Join [11]

Partition Based Spatial Merge Join [2]

Spatial Hash Join [5]

Size Separation Spatial Join [6]

Histogram-based Hash Stripped Join [2]


SJ algorithms

For each algorithm, two cost expressions are important

I/O cost

CPU cost

Some costs are already known, but not all

All expressions written in a similar notation


The System Architecture

The performance analysis, although correct, simplifies many cases

Real cases are more complex

Real data sets obtained from the Internet

[Figure: system architecture. Components shown: Interface, Join Module, the join algorithms STT, PBSM, ISSJ and HHSJ, LRU Buffer, RTO Buffer, Data import, Data Generator.]


Plane-sweep algorithm

All SJ algorithms load objects into memory and run a plane-sweep algorithm to check whether pairs of objects satisfy the spatial predicate.

The traditional complexity is O(k + n log n), where k is the number of object intersections

However, O(c + n log n), where c is the number of comparisons performed, is more accurate.

[Figure: plane-sweep example with four objects, o1, o2, o3 and o4.]
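To make the difference between k and c concrete, here is a minimal sketch of a plane-sweep filter over two sets of rectangles that counts every comparison it performs. The `Rect` layout `(xmin, ymin, xmax, ymax)`, the function name `plane_sweep_join` and the choice of sweeping along the x axis are assumptions made for illustration; they are not taken from the slides.

```python
from typing import List, Tuple

# A rectangle (MBR) as (xmin, ymin, xmax, ymax); this layout is an assumption.
Rect = Tuple[float, float, float, float]


def plane_sweep_join(rs: List[Rect], ss: List[Rect]):
    """Return (pairs, c): intersecting (i, j) index pairs and the comparison count."""
    # Sorting both inputs by xmin gives the O(n log n) term.
    r_order = sorted(range(len(rs)), key=lambda i: rs[i][0])
    s_order = sorted(range(len(ss)), key=lambda j: ss[j][0])
    pairs, c = [], 0
    i = j = 0
    while i < len(r_order) and j < len(s_order):
        ri, sj = r_order[i], s_order[j]
        if rs[ri][0] <= ss[sj][0]:
            # rs[ri] is met first by the sweep line: scan S objects that
            # start inside its x-extent and test them on the y axis.
            k = j
            while k < len(s_order) and ss[s_order[k]][0] <= rs[ri][2]:
                c += 1  # one comparison per candidate, hit or miss
                s = ss[s_order[k]]
                if rs[ri][1] <= s[3] and s[1] <= rs[ri][3]:
                    pairs.append((ri, s_order[k]))
                k += 1
            i += 1
        else:
            # Symmetric case: ss[sj] is met first, scan R.
            k = i
            while k < len(r_order) and rs[r_order[k]][0] <= ss[sj][2]:
                c += 1
                r = rs[r_order[k]]
                if r[1] <= ss[sj][3] and ss[sj][1] <= r[3]:
                    pairs.append((r_order[k], sj))
                k += 1
            j += 1
    return pairs, c


rs = [(0, 0, 2, 2), (5, 5, 6, 6)]
ss = [(1, 1, 3, 3), (1, 10, 2, 11), (7, 7, 8, 8)]
pairs, c = plane_sweep_join(rs, ss)
print(pairs, c)  # [(0, 0)] 2
```

In this small input only one pair intersects (k = 1), but two comparisons are made (c = 2), because one candidate overlaps on x and fails on y. That wasted work is what the c term accounts for.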


Plane-sweep algorithm

Divide the space into strips

Count the number of objects in each strip

The width of each strip is the average size of the objects

Estimate of c = sum of the object counts of all strips

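[Figure: the same data set divided into six slices in two different ways, with the number of objects per slice.]

First subdivision: Slice 1: 3 objects, Slice 2: 5 objects, Slice 3: 4 objects, Slice 4: 5 objects, Slice 5: 5 objects, Slice 6: 2 objects

Second subdivision: Slice 1: 2 objects, Slice 2: 2 objects, Slice 3: 3 objects, Slice 4: 3 objects, Slice 5: 1 object, Slice 6: 2 objects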
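The estimation steps above can be sketched as follows. This is one possible reading of the slide, not the authors' implementation: the rectangle layout, the function name `estimate_c`, and the decision to count an object once in every strip it touches are all assumptions.

```python
import math
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed layout


def estimate_c(objects: List[Rect], axis: int = 0) -> int:
    """Estimate c by summing per-strip object counts along `axis` (0 = x, 1 = y)."""
    if not objects:
        return 0
    lo = [o[axis] for o in objects]
    hi = [o[axis + 2] for o in objects]
    origin = min(lo)
    # Strip width = average object extent along the chosen axis.
    width = sum(h - l for l, h in zip(lo, hi)) / len(objects)
    if width <= 0:  # degenerate input (points only): one count per object
        return len(objects)
    n_strips = max(1, math.ceil((max(hi) - origin) / width))
    counts = [0] * n_strips
    for l, h in zip(lo, hi):
        first = min(int((l - origin) // width), n_strips - 1)
        last = min(int((h - origin) // width), n_strips - 1)
        for s in range(first, last + 1):  # assumption: count in every strip touched
            counts[s] += 1
    return sum(counts)


objects = [(0, 0, 1, 1), (0.5, 0, 1.5, 1), (3, 0, 4, 1)]
print(estimate_c(objects, axis=0))  # 5
```

Applied to the counts printed on the slide, the same sum gives 3 + 5 + 4 + 5 + 5 + 2 = 24 for the first subdivision and 2 + 2 + 3 + 3 + 1 + 2 = 13 for the second.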


Rule 1 - Plane-sweep

The DBMS can estimate c for each axis and choose the axis with the smaller value of c, optimizing the plane-sweep.

The shape of the objects affects the response time

[Figure: Response time varying the shape of objects. X axis: shape, the ratio between side sizes, from P5:1 to P1:5. Y axis: response time (s), 0 to 5.5. Series: PBSM, ISSJ, STT, HHSJ.]
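Rule 1 reduces to a comparison between two estimates. The sketch below assumes the per-strip counts for each axis are already available; the function name `choose_sweep_axis` and the labels "x" and "y" are hypothetical, and the slides do not say which axis corresponds to which set of counts.

```python
from typing import List, Tuple


def choose_sweep_axis(counts_x: List[int], counts_y: List[int]) -> Tuple[str, int]:
    """Return the axis label with the smaller estimated c, and that estimate."""
    c_x, c_y = sum(counts_x), sum(counts_y)
    # Ties go to "x"; this tie-break is arbitrary.
    return ("x", c_x) if c_x <= c_y else ("y", c_y)


# Per-slice counts as printed in the strip-counting example; mapping them
# to "x" and "y" is an assumption made only for this illustration.
print(choose_sweep_axis([3, 5, 4, 5, 5, 2], [2, 2, 3, 3, 1, 2]))  # ('y', 13)
```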


Synchronized Tree Traversal (STT)

Well-known algorithm for R-Trees

The performance depends on the height of the R-Trees and the average size of the nodes.

The space division reduces the number of object comparisons (c).

Available memory is not important.
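As a rough illustration of the traversal, the sketch below joins two small in-memory trees by descending both at once and following only pairs of entries whose MBRs intersect. It is a simplified stand-in, not the published algorithm: the `Node` class, the nested loop over child pairs and the handling of trees of different height are assumptions, and the real algorithm works on disk pages and restricts the search space inside each node pair.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed layout


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Node:
    mbr: Rect
    children: List["Node"] = field(default_factory=list)  # empty for data entries
    oid: int = -1  # object id, meaningful only on data entries

    @property
    def is_entry(self) -> bool:
        return not self.children


def stt_join(r: Node, s: Node, out: List[Tuple[int, int]]) -> int:
    """Join the subtrees under r and s; return the number of MBR comparisons."""
    if r.is_entry and s.is_entry:
        out.append((r.oid, s.oid))  # candidate pair for the refinement step
        return 0
    # Descend both sides when possible; if one side is already a data entry
    # (trees of different height), keep it fixed and descend the other.
    r_side = [r] if r.is_entry else r.children
    s_side = [s] if s.is_entry else s.children
    comparisons = 0
    for rc in r_side:
        for sc in s_side:
            comparisons += 1
            if intersects(rc.mbr, sc.mbr):
                comparisons += stt_join(rc, sc, out)
    return comparisons


root_r = Node((0, 0, 6, 6), [
    Node((0, 0, 1, 1), [Node((0, 0, 1, 1), oid=1)]),
    Node((5, 5, 6, 6), [Node((5, 5, 6, 6), oid=2)]),
])
root_s = Node((0.5, 0.5, 9, 9), [
    Node((0.5, 0.5, 2, 2), [Node((0.5, 0.5, 2, 2), oid=10)]),
    Node((8, 8, 9, 9), [Node((8, 8, 9, 9), oid=20)]),
])
result: List[Tuple[int, int]] = []
n_comparisons = stt_join(root_r, root_s, result)
print(result, n_comparisons)  # [(1, 10)] 5
```

With more entries per node, each visited node pair costs more comparisons, which is consistent with the slide's remark that performance depends on node size as well as tree height.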


Rule 2 - STT

The STT algorithm is optimized by defining nodes with a low number of entries.

However, the total number of nodes then increases, which sets a lower limit for the rule.

The optimal fanout is between 50 and 75

[Figure: Response time varying node fanout. X axis: fanout, 0 to 125. Y axis: response time (s), 0 to 2.75. Series: go_soilusage X go_hydro, la_streets X la_rr, usa_rr X usa_rivers.]


Rule 2 - STT

The performance of the STT algorithm stays constant as the memory buffer size increases.

Except for very small values

Set any value greater than 4 * height of the R-Trees
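The two STT slides can be condensed into a small configuration helper. This is only a sketch: the function name `stt_settings`, the dictionary keys, and the choice of the taller tree when the two R-Trees differ in height are assumptions, since the slide does not say which height to use.

```python
from typing import Dict, Tuple, Union


def stt_settings(height_r: int, height_s: int) -> Dict[str, Union[int, Tuple[int, int]]]:
    """Suggest STT parameters following Rule 2 as stated on the slides."""
    # Assumption: with trees of different height, use the taller one.
    height = max(height_r, height_s)
    return {
        "fanout_range": (50, 75),            # optimal fanout range from the slide
        "min_buffer_pages": 4 * height + 1,  # smallest integer greater than 4 * height
    }


print(stt_settings(3, 4))  # {'fanout_range': (50, 75), 'min_buffer_pages': 17}
```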


Iterative Stripped SJ

Iterative SJ (Jacox & Samet) + strips

Strips divide the space and reduce c

Objects that cross strip boundaries can occur

The sorting can be either internal or external

The performance depends on the memory available, the number of strips, and replicas.
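The trade-off between the number of strips and the replicas can be illustrated with a small sketch that assigns rectangles to vertical strips and reports the replication factor. It covers only the strip assignment, not the iterative join itself; the function names, the use of vertical strips along x, and the definition of the replication factor as stored entries divided by original objects are assumptions.

```python
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed layout


def assign_to_strips(objects: List[Rect], n_strips: int,
                     xmin: float, xmax: float) -> Dict[int, List[int]]:
    """Map strip index -> object indexes; boundary-crossing objects are replicated."""
    width = (xmax - xmin) / n_strips
    strips: Dict[int, List[int]] = {s: [] for s in range(n_strips)}
    for idx, (ox0, _, ox1, _) in enumerate(objects):
        first = min(max(int((ox0 - xmin) // width), 0), n_strips - 1)
        last = min(max(int((ox1 - xmin) // width), 0), n_strips - 1)
        for s in range(first, last + 1):
            strips[s].append(idx)  # one copy per strip the object touches
    return strips


def replication_factor(strips: Dict[int, List[int]], n_objects: int) -> float:
    """Stored entries divided by original objects (1.0 means no replicas)."""
    return sum(len(ids) for ids in strips.values()) / n_objects


objects = [(0, 0, 1, 1), (3, 0, 6, 1), (9, 0, 10, 1)]
for n in (1, 2, 5):
    strips = assign_to_strips(objects, n, 0.0, 10.0)
    print(n, round(replication_factor(strips, len(objects)), 2))
```

In this toy input the factor grows from 1.0 with one strip to about 1.33 with two and 1.67 with five, while the largest strip shrinks. That is the tension the slide describes between fewer comparisons and more replicas.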


Rule 3 - ISSJ

The ISSJ algorithm is optimized by defining a large number of strips. The number of objects in each strip will then be small, but the gain is limited by the replicas that are added.

[Figure: Algorithm ISSJ - Response time x Number of strips. X axis: number of strips, 1 to 2048 in powers of two. Y axis: response time (s), 0 to 6.]

[Figure: Algorithm ISSJ - Replication factor x Number of strips. X axis: number of strips, 1 to 2048 in powers of two. Y axis: replication factor, 0 to 2.5. Series: ge_hypsogr X ge_roads, la_rr X la_streets.]


Rule 3 - ISSJ

It is important to allocate enough memory to sort each set internally

[Figure: Algorithm ISSJ - Response time x available memory size. X axis: available memory size (# pages), 25 to 1000. Y axis: response time (s), 0 to 4.5. Series: gr_rivers X gr_roads, ge_hypsogr X ge_roads, la_rr X la_streets.]


Partition Based Spatial Merge Join (PBSM)

Calculates the number of partitions

Uses a regular grid to divide the space

Partitions can overflow

The performance depends on the number of replicas and overflowed partitions

The number of object comparisons (c) is reduced by the space subdivision
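The grid subdivision, the replicas and the overflow condition can be sketched as below. This is a simplified model rather than PBSM itself: each grid cell is treated as one partition, whereas the published algorithm maps many tiles onto fewer partitions, and the names `pbsm_partition`, `universe`, `grid` and `capacity` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), assumed layout
Cell = Tuple[int, int]


def pbsm_partition(objects: List[Rect], universe: Rect, grid: Tuple[int, int],
                   capacity: int) -> Tuple[Dict[Cell, List[int]], List[Cell]]:
    """Assign MBRs to the cells of a regular grid.

    Returns (partitions, overflowed): cell -> object indexes, and the cells
    holding more than `capacity` entries.
    """
    ux0, uy0, ux1, uy1 = universe
    gx, gy = grid
    cw, ch = (ux1 - ux0) / gx, (uy1 - uy0) / gy
    partitions: Dict[Cell, List[int]] = defaultdict(list)
    for idx, (x0, y0, x1, y1) in enumerate(objects):
        cx0 = min(int((x0 - ux0) // cw), gx - 1)
        cx1 = min(int((x1 - ux0) // cw), gx - 1)
        cy0 = min(int((y0 - uy0) // ch), gy - 1)
        cy1 = min(int((y1 - uy0) // ch), gy - 1)
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                partitions[(cx, cy)].append(idx)  # replica in every cell touched
    overflowed = sorted(cell for cell, ids in partitions.items() if len(ids) > capacity)
    return dict(partitions), overflowed


objects = [(1, 1, 2, 2), (1, 3, 2, 4), (3, 3, 4, 4), (4, 4, 6, 6), (8, 8, 9, 9)]
parts, overflowed = pbsm_partition(objects, (0, 0, 10, 10), (2, 2), capacity=2)
replicas = sum(len(ids) for ids in parts.values()) - len(objects)
print(overflowed, replicas)  # [(0, 0)] 3
```

Here the object that straddles the grid center is stored four times (three replicas), and the lower-left cell overflows because it holds four entries against a capacity of two.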


Rule 4 - PBSM

PBSM is improved by using a high number of partitions, obtained either by declaring a small memory size or simply by setting a lower bound on the number of partitions.

[Figure: Algorithm PBSM - Response time x memory size. X axis: available memory size (# pages), 50 to 1000. Y axis: response time (s), 0 to 3.25. Series: gr_rivers X gr_roads, ge_hypso X ge_roads, la_rr X la_street.]


Rule 4 - PBSM

This rule is limited by the number of replicas, which increases the number of objects processed.

[Figure: Algorithm PBSM - Response time x Memory size. X axis: available memory size (# pages), 25 to 4500. Y axis: response time (s), 0 to 75. Series: ca_street X ca_rivers, A100K X B2250K.]


Histogram Hash Stripped Join

A histogram of object distribution guides the space partitioning to avoid overflow

Replicas are counted into the histogram

The objects are kept in a hash file and loaded into memory only once.

The performance is affected by the number of replicas and the space subdivision
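One way a histogram could guide the partitioning is sketched below: consecutive histogram bins are grouped greedily so that no partition exceeds a memory capacity. This illustrates the idea only and is not the HHSJ procedure; the function name `partitions_from_histogram`, the one-dimensional histogram and the greedy grouping are assumptions.

```python
from typing import List, Tuple


def partitions_from_histogram(counts: List[int], capacity: int) -> List[Tuple[int, int]]:
    """Group consecutive bins into (first_bin, last_bin) partitions.

    `counts` holds the number of entries per bin, replicas included. A
    partition is closed as soon as adding the next bin would exceed `capacity`.
    """
    if not counts:
        return []
    parts: List[Tuple[int, int]] = []
    start, load = 0, 0
    for b, n in enumerate(counts):
        if load + n > capacity and b > start:
            parts.append((start, b - 1))
            start, load = b, 0
        load += n
    parts.append((start, len(counts) - 1))
    return parts


print(partitions_from_histogram([3, 1, 1, 4, 2, 2], capacity=5))
# [(0, 2), (3, 3), (4, 5)]
```

Summing bin counts treats an object that spans two bins of the same partition as two entries, so the grouping is conservative. A single bin that already exceeds the capacity still overflows and would need a finer histogram.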


Rule 5 - HHSJ

HHSJ is improved by setting a large value for both the number of partitions and the number of strips.


Rule 5 - HHSJ

This rule is also limited by the number of replicas, which increases the number of objects processed.

[Figure: Algorithm HHSJ - Response time x # of strips. X axis: number of strips, 1 to 2048 in powers of two. Y axis: response time (s), 0 to 2.75.]

[Figure: Algorithm HHSJ - Replication factor x # of strips. X axis: number of strips, 1 to 2048 in powers of two. Y axis: replication factor, 0 to 2.75. Series: ge_hypsogr X ge_roads, la_rr X la_streets.]


Conclusions & Future Work

Our main contribution

Using the rules can reduce the response time of individual algorithms, in some cases by more than 50%.

The rules can be incorporated into a real GDBMS

Future work

Use 3D data sets in the tests

Include other spatial operations

Implement in PostGIS


Contact & questions

Miguel Fornari

[email protected]

www.inf.ufrgs.br/~fornari

João Comba

[email protected]

www.inf.ufrgs.br/~comba

Cirano Iochpe

[email protected]

www.inf.ufrgs.br/~ciochpe