
1

Should SDBMS support the Join Index?: A Case study from CrimeStat

Pradeep Mohan¹, Shashi Shekhar¹, Ned Levine², Ronald E. Wilson³, Betsy George¹, Mete Celik¹

¹University of Minnesota, Twin-Cities, {mohan,shekhar,bgeorge,mcelik}@cs.umn.edu

²Ned Levine and Associates, Houston, TX, [email protected]

³National Institute of Justice, Washington, D.C., [email protected]


2

Outline

Introduction
  Motivation
  Problem Statement
  Related Work

Contributions
  Self-Join Index
  Experimental Evaluation

Conclusion and Future Work

3

Motivation

Application Domains

Crime Analysis: Where are the burglary hotspots?

Epidemiology: Is cancer spatially clustered?

Transportation: Which major highways require traffic calming measures?

An Example

Query: Where are the burglary hotspots?

[Figure: two panels showing nodes N1 to N7; one panel is labeled Neighbor Relation R1]

4

W-Matrix and W-Queries

Queries that perform a repeated computation of the W-Matrix: W-Queries.

W-Queries

Moran’s I: $I = \frac{Z W_N Z^T}{Z Z^T}$

Geary’s C: [formula garbled in extraction; surviving symbols: $X$, $W_N$, $Z Z^T$]

G Statistic: $G = \frac{X W X^T}{X X^T}$

K-Function

Hotspots

X: a set of observations
Z: X normalized with respect to the mean

W-Matrix (Neighborhood Graph)

      N1    N2    N3    N4    N5    N6    N7
N1    0     1     1     0     1     0     0
N2    1     0     1     0     1     1     0
N3    1     1     0     0     0     0     0
N4    0     0     0     0     1     1     0
N5    1     1     0     1     0     1     0
N6    0     1     0     1     1     0     1
N7    0     0     0     0     0     1     0

WN: Row Normalized W-Matrix

      N1    N2    N3    N4    N5    N6    N7
N1    0     0.33  0.33  0     0.33  0     0
N2    0.25  0     0.25  0     0.25  0.25  0
N3    0.5   0.5   0     0     0     0     0
N4    0     0     0     0     0.5   0.5   0
N5    0.25  0.25  0     0.25  0     0.25  0
N6    0     0.25  0     0.25  0.25  0     0.25
N7    0     0     0     0     0     1     0
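A minimal sketch of how the row-normalized W-Matrix and Moran's I fit together, assuming the seven-node example graph and the matrix form of Moran's I shown on this slide. The observation vector `X` is hypothetical and only serves to exercise the formula; the function names are illustrative, not part of CrimeStat.

```python
# Binary W-Matrix of the seven-node example (rows and columns are N1..N7).
W = [
    [0, 1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 1, 1, 0],
    [1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 1, 0],
]


def row_normalize(w):
    """Divide each row by its sum so every non-empty row sums to 1."""
    normalized = []
    for row in w:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def morans_i(x, w_n):
    """Moran's I = (Z W_N Z^T) / (Z Z^T), where Z = X - mean(X)."""
    mean = sum(x) / len(x)
    z = [v - mean for v in x]
    n = len(z)
    numerator = sum(z[i] * w_n[i][j] * z[j] for i in range(n) for j in range(n))
    denominator = sum(v * v for v in z)
    return numerator / denominator


W_N = row_normalize(W)

# ASSUMPTION: hypothetical crime counts for N1..N7, not taken from the slides.
X = [12, 10, 11, 3, 7, 4, 2]

print(round(morans_i(X, W_N), 3))
```

Every evaluation of a statistic like this needs the full W-Matrix, which is why recomputing W on the fly for each query becomes the dominant cost.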

5

W-Operations

Notion of neighbors, successors and predecessors.

[Figure: example graph over nodes N1 to N7, shown as the input and as the output of each example operation]

Operations

Category        Operation                                Example
Neighbors       get-all-neighbors()                      get-all-neighbors(N2)
Successor(s)    get-all-successors()                     get-all-successors(N2)
                get-a-successor()                        get-a-successor(N2,Node-id)
Predecessor(s)  get-all-predecessors()                   get-all-predecessors(N2)
                get-a-predecessor()                      get-a-predecessor(N2,Node-id)
Composite       get-all-predecessors-of-a-successor()    get-all-predecessors-of-a-successor(N2, Node-id)
                get-a-predecessor-of-a-successor()       get-a-predecessor-of-a-successor(N2,Node-id,Node-id)
Others          Delete()                                 Delete(N2,N1,N3)
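The operations above can be sketched over an in-memory adjacency structure. The slides do not define successors and predecessors, so this sketch ASSUMES that a successor is a neighbor with a larger node id and a predecessor is a neighbor with a smaller node id. The single-argument `delete` is likewise a simplification of the slide's `Delete()`, and all Python names are hypothetical.

```python
# Neighbor sets of the seven-node example (ids 1..7 stand for N1..N7),
# read off the binary W-Matrix.
GRAPH = {
    1: {2, 3, 5},
    2: {1, 3, 5, 6},
    3: {1, 2},
    4: {5, 6},
    5: {1, 2, 4, 6},
    6: {2, 4, 5, 7},
    7: {6},
}


def get_all_neighbors(graph, node):
    """All nodes adjacent to `node`, in id order."""
    return sorted(graph[node])


def get_all_successors(graph, node):
    # ASSUMPTION: successors are the neighbors with a larger id.
    return [m for m in sorted(graph[node]) if m > node]


def get_all_predecessors(graph, node):
    # ASSUMPTION: predecessors are the neighbors with a smaller id.
    return [m for m in sorted(graph[node]) if m < node]


def get_all_predecessors_of_a_successor(graph, node, succ_id):
    """Composite operation: predecessors of one chosen successor of `node`."""
    if succ_id not in get_all_successors(graph, node):
        raise ValueError(f"{succ_id} is not a successor of {node}")
    return get_all_predecessors(graph, succ_id)


def delete(graph, node):
    """Remove `node` and every edge incident to it (mutates `graph`)."""
    for m in graph.pop(node):
        graph[m].discard(node)


print(get_all_neighbors(GRAPH, 2))
print(get_all_successors(GRAPH, 2))
print(get_all_predecessors_of_a_successor(GRAPH, 2, 5))
```

Under this ordering assumption each undirected edge is visited exactly once when iterating over successors, which is convenient for pair-counting queries.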

6

W-Query Processing Algorithms

Algorithm CalcRipleyK

  get-all-neighbors(N)

  Frequency ← Size(get-all-successors(N))

Algorithm Hotspots_JI

  Stage 1: Hotspot Identification

    Identify a Seed.

    get-all-neighbors(Seed)

    get-all-successors(Seed)

  Stage 2: Hotspot Refinement

    P ← get-a-predecessor-of-a-successor(Seed,succ-id)

    If P correlates better with the Successor than with the Seed:

      Remove the Successor from the successor list.

  Stage 3: Update Remaining Nodes

    For each S in Hotspot:

      Delete(S)

[Figure: Input is the neighborhood graph over nodes N1 to N7 (Neighbor Relation R1). Output is a plot of K (0 to 3000) against Distance (Miles), with the Complete Spatial Randomness curve for reference.]
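For reference, a minimal sketch of what a K-Function query computes. It uses the textbook Ripley's K estimator without edge correction, K(d) = (A / n²) × (number of ordered pairs within distance d), and a brute-force pair scan. It is not the CalcRipleyK algorithm from the slides, and the points, study area and function name are hypothetical.

```python
import math


def ripley_k(points, distances, area):
    """Naive Ripley's K without edge correction, one value per distance."""
    n = len(points)
    results = []
    for d in distances:
        ordered_pairs = 0
        for i in range(n):
            for j in range(i + 1, n):
                if math.dist(points[i], points[j]) <= d:
                    ordered_pairs += 2  # counts both (i, j) and (j, i)
        results.append(area * ordered_pairs / (n * n))
    return results


# ASSUMPTION: four hypothetical points on the corners of a unit square,
# inside a study region of area 4.
POINTS = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
K_VALUES = ripley_k(POINTS, [0.5, 1.0, 1.5], area=4.0)
print(K_VALUES)
```

Under Complete Spatial Randomness the expected value is K(d) = πd², so observed values above that curve suggest clustering at distance d. The pair scan is repeated for every distance, which is the repeated W computation that a pre-computed self-join index avoids.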

7

Problem Statement

Given: A spatial (crime) data warehouse. A set of W-Operations.

Find: A suitable spatial index type representation.

Objective: User response time is minimized.

Constraints: Dataset is updated infrequently. Concurrency control and recovery considerations are addressed separately.

[Figure: Input Data; Output & W Operations. Courtesy: Ned Levine and Associates]

8

Challenges

Scalability to Large Datasets

Query: Where are the burglary hotspots?

Dataset Size = 14852 Crime Reports

CrimeStat Libraries’ Response Time = 2 hours 30 minutes

9

Related Work: Classification

SDBMS Tools

SDBMS Tool        Spatial Indices Supported    Spatial Self-Join Indices
CrimeStat         NO                           NO
Oracle Spatial    R-Tree, Quad Tree            NO
SQL Server 2008   Grid Files                   NO
PostGIS           R-Tree                       NO
ESRI ArcSDE       Grid Files                   NO

Current R-Tree family index structures perform repeated on-the-fly W computation.

Computationally Expensive!

Our Approach: Pre-computed W (Self-join)

10

Contributions

Modeled W-Queries
  Proposed a set of W-Operations
  W-Query Processing Algorithms

Self-join Index
  Representation
  Algebraic Cost model: Operations

Experimental Evaluation
  Experimental Setup
  User Response time analysis

11

Self-Join Index: Representation

Key Observations

W-Matrix ↔ Self-join Neighborhood Graph

Classical Join Index: Edge List

Which representation can localize neighbor, successor and predecessor information?

Self-Join Adjacency List Index

Si   Adjacency List
N1   [N2:w12, N3:w13, N5:w15]
N2   [N1:w21, N3:w23, N5:w25, N6:w26]
N3   [N2:w32, N1:w31]
N4   [N5:w45, N6:w46]
N5   [N4:w54, N6:w56, N2:w52]
N6   [N7:w67, N5:w65, N4:w64, N2:w62]
N7   [N6:w76]

Edge List

[Table: one row per edge with columns Si, Sj and Wij; the surviving weight labels are w12, w13, w14, w15, w16, w21, w23, w25, w26, w31, w32 and w35]

[Figure: row-normalized W-Matrix over N1 to N7, captioned "W-Matrix : Neighborhood Graph or Self-join"]

Adjacency list LOCALIZES successor, predecessor and row-normalized information. Edge list SCATTERS these.
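The contrast between the two representations can be sketched in a few lines, assuming the seven-node example graph and row-normalized weights (each weight is 1 divided by the degree of Si). All function and variable names here are hypothetical, and the in-memory dictionary only stands in for the on-disk self-join adjacency list index.

```python
from collections import defaultdict

# Undirected neighbor pairs of the example (ids 1..7 stand for N1..N7).
PAIRS = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (2, 6),
         (4, 5), (4, 6), (5, 6), (6, 7)]


def build_edge_list(pairs):
    """Classical join index: one (Si, Sj, Wij) row per directed edge."""
    degree = defaultdict(int)
    for a, b in pairs:
        degree[a] += 1
        degree[b] += 1
    rows = []
    for a, b in pairs:
        rows.append((a, b, 1.0 / degree[a]))  # row-normalized weight w_ab
        rows.append((b, a, 1.0 / degree[b]))  # row-normalized weight w_ba
    return rows


def build_adjacency_index(edge_list):
    """Self-join adjacency list: all (Sj, Wij) entries grouped under Si."""
    index = defaultdict(list)
    for si, sj, wij in edge_list:
        index[si].append((sj, wij))
    return dict(index)


def neighbors_from_edge_list(edge_list, node):
    """Scans every row, because one node's rows are scattered in the list."""
    return [(sj, wij) for si, sj, wij in edge_list if si == node]


def neighbors_from_index(index, node):
    """Single lookup, because one node's entries are stored together."""
    return index[node]


EDGE_LIST = build_edge_list(PAIRS)
INDEX = build_adjacency_index(EDGE_LIST)
print(neighbors_from_index(INDEX, 6))
```

Both lookups return the same entries; the difference is that the edge list touches all rows per request while the adjacency list touches one entry, which is the locality the slide refers to.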

15

Experimental Evaluation: Experiment Setup

[Figure: experiment design. Parameters: Dataset Size, Size of the Police Precincts. Components: Self-Join Index Generator (SJALI), W Query Processing Algorithms with Candidate Algorithms (CalcRipleyK, Hotspots_JI), Response time Analysis.]

Experiment Goals: Compare candidates on response times.

Metric of Comparison: Response time

Workload: Baltimore Auto theft ’96 (Crime Report ID, Location, Date)

Hardware: Intel Xeon 3.2 GHz, 4 GB RAM

Candidates

CrimeStat Libraries

R-Tree: Tree Matching

Self-Join Index

16

Baltimore Auto-theft Dataset

Crime Report

Baltimore County Auto Thefts from Jan 1996 to Sept 1996: 14852 Crime Reports

Courtesy: Ned Levine and Associates (http://www.nedlevine.com/)

17

Response Time Analysis: Comparison with R-Tree

Response time comparison for hotspot identification.

Response time comparison for K-Function computation.

Questions:

How does the response time of the Ripley’s K function Query vary with dataset size ?

How does the response time of the Hotspot Identification Query vary with dataset size ?

Fixed Parameters

Hotspots: Hotspot min-Size Threshold = 10 Crime Reports

K Function: # of max-significance levels = 100

Overall Trend: Self-Join Index vs. R-Tree: response time reduced by a factor of 2.

18

Response Time Analysis: Comparison with CrimeStat

Response time comparison for hotspot identification.

Response time comparison for K-Function computation.

Questions:

How does the response time of the Ripley’s K function Query vary with dataset size?

How does the response time of the Hotspot Identification Query vary with dataset size?

Fixed Parameters

Hotspots: Hotspot min-Size Threshold = 10 Crime Reports

K Function: # of max-significance levels = 100

Overall Trend: Self-Join Index vs. CrimeStat: response time reduced by a factor of 40.

19

Conclusions

W-Queries are important in spatial statistics, e.g. crime analysis, public health, transportation.

W-Operations of W-Queries.

Self-join adjacency list index is more scalable than R-Tree and CrimeStat.

Future Work

Experimental Quantification
  I/O costs of W-Query Processing Algorithms.
  I/O Cost Models for W-Query Processing Algorithms.

Further I/O Optimization
  Extracting optimal page access sequences for processing W-Queries.
  Optimizing the number of W-Query operations.

Other W-Queries
  Local Moran’s I, Local Getis-Ord.

Larger datasets of >= 100000: will R-Tree be comparable?

20

Acknowledgment

Members of the Spatial Database and Data Mining Research Group University of Minnesota, Twin-Cities.

This work was supported by grants from NSF, USDOD and NIJ.

Thank You for your Questions, Comments and Patience!