1
Should SDBMS support the Join Index?: A Case study from CrimeStat
Pradeep Mohan¹, Shashi Shekhar¹, Ned Levine², Ronald E. Wilson³, Betsy George¹, Mete Celik¹
¹University of Minnesota, Twin-Cities, {mohan,shekhar,bgeorge,mcelik}@cs.umn.edu
²Ned Levine and Associates, Houston, TX, [email protected]
³National Institute of Justice, Washington D.C, [email protected]
2
Outline
Introduction
  Motivation
  Problem Statement
  Related Work
Contributions
  Self-Join Index
  Experimental Evaluation
Conclusion and Future Work
3
Motivation
Application Domains
Crime Analysis: Where are the burglary hotspots?
Epidemiology: Is cancer spatially clustered?
Transportation: Which major highways require traffic calming measures?
Query: Where are the Burglary hotspots ?
An Example
[Figure: neighborhood graph over nodes N1 to N7, defined by Neighbor Relation R1]
4
W-Matrix and W-Queries
K-Function
W-Queries: queries that repeatedly compute the W-Matrix.
W-Matrix
      N1  N2  N3  N4  N5  N6  N7
N1     0   1   1   0   1   0   0
N2     1   0   1   0   1   1   0
N3     1   1   0   0   0   0   0
N4     0   0   0   0   1   1   0
N5     1   1   0   1   0   1   0
N6     0   1   0   1   1   0   1
N7     0   0   0   0   0   1   0
W-Queries
Moran’s I: $I = \frac{Z W_N Z^T}{Z Z^T}$
Geary’s C: $C = \frac{X W_N X^T}{Z Z^T}$
G Statistic: $G = \frac{X W X^T}{X X^T}$
WN: Row-Normalized W-Matrix
      N1    N2    N3    N4    N5    N6    N7
N1    0     0.33  0.33  0     0.33  0     0
N2    0.25  0     0.25  0     0.25  0.25  0
N3    0.5   0.5   0     0     0     0     0
N4    0     0     0     0     0.5   0.5   0
N5    0.25  0.25  0     0.25  0     0.25  0
N6    0     0.25  0     0.25  0.25  0     0.25
N7    0     0     0     0     0     1     0
Neighborhood Graph
Hotspots
X: A set of observations
Z: X normalized with respect to the mean
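The row normalization and the matrix form of Moran’s I can be sketched in a few lines of Python. This is a minimal illustration, not the CrimeStat implementation: the W-Matrix is the seven-node example from this slide, while the observation vector X and the function names `row_normalize` and `morans_i` are assumptions made up for the example.

```python
# Binary W-Matrix of the seven-node example (rows and columns are N1..N7).
W = [
    [0, 1, 1, 0, 1, 0, 0],  # N1
    [1, 0, 1, 0, 1, 1, 0],  # N2
    [1, 1, 0, 0, 0, 0, 0],  # N3
    [0, 0, 0, 0, 1, 1, 0],  # N4
    [1, 1, 0, 1, 0, 1, 0],  # N5
    [0, 1, 0, 1, 1, 0, 1],  # N6
    [0, 0, 0, 0, 0, 1, 0],  # N7
]


def row_normalize(w):
    """Divide each row by its sum so that every non-empty row sums to 1."""
    normalized = []
    for row in w:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def morans_i(x, wn):
    """Moran's I as Z W_N Z^T / (Z Z^T), where Z is X minus its mean."""
    mean = sum(x) / len(x)
    z = [v - mean for v in x]
    n = len(z)
    numerator = sum(z[i] * wn[i][j] * z[j] for i in range(n) for j in range(n))
    denominator = sum(v * v for v in z)
    return numerator / denominator


WN = row_normalize(W)

# Hypothetical observations per node, invented for illustration only.
X = [9, 8, 7, 2, 5, 3, 1]

print([round(v, 2) for v in WN[0]])  # first row of WN
print(round(morans_i(X, WN), 3))
```

Every call to `morans_i` walks the full W-Matrix, which is the repeated computation that makes W-Queries expensive when W is rebuilt on the fly.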
5
W-Operations
[Figure: example neighborhood graph over nodes N1 to N7]
Notion of neighbors, successors and predecessors.
Operations
Category        Operation                               Example
Neighbors       get-all-neighbors()                     get-all-neighbors(N2)
Successor(s)    get-all-successors()                    get-all-successors(N2)
                get-a-successor()                       get-a-successor(N2, Node-id)
Predecessor(s)  get-all-predecessors()                  get-all-predecessors(N2)
                get-a-predecessor()                     get-a-predecessor(N2, Node-id)
Composite       get-all-predecessors-of-a-successor()   get-all-predecessors-of-a-successor(N2, Node-id)
                get-a-predecessor-of-a-successor()      get-a-predecessor-of-a-successor(N2, Node-id, Node-id)
Others          Delete()                                Delete(N2, N1, N3)
[Figure: input and output neighborhood graphs (nodes N1 to N7) for each operation]
6
W-Query Processing Algorithms
Algorithm CalcRipleyK
  get-all-neighbors(N)
  Frequency ← Size(get-all-successors(N))
Algorithm Hotspots_JI
  Stage 1: Hotspot Identification
    Identify a Seed.
    get-all-neighbors(Seed)
    get-all-successors(Seed)
  Stage 2: Hotspot Refinement
    P ← get-a-predecessor-of-a-successor(Seed, succ-id)
    If P correlates better with the Successor than with the Seed,
      remove the Successor from the successor list.
  Stage 3: Update Remaining Nodes
    For each S in Hotspot:
      Delete(S)
[Figure: Input: neighborhood graph over nodes N1 to N7 (Neighbor Relation R1)]
[Figure: Output: K (0 to 3000) plotted against Distance (Miles, 0 to 40), compared with Complete Spatial Randomness]
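For reference, the quantity that CalcRipleyK produces can be sketched without any index. The code below is a naive textbook form of Ripley’s K with no edge correction, K(d) = A / n² × (number of ordered pairs within distance d); it is not the slide’s indexed algorithm, and the function name `ripley_k`, the sample points, and the study area are assumptions for illustration.

```python
import math


def ripley_k(points, distances, area):
    """Naive Ripley's K without edge correction.

    For each distance d, count the ordered pairs (i, j), i != j, whose
    Euclidean distance is at most d, then scale by area / n^2.
    """
    n = len(points)
    k_values = []
    for d in distances:
        count = 0
        for i in range(n):
            for j in range(n):
                if i != j and math.dist(points[i], points[j]) <= d:
                    count += 1
        k_values.append(area * count / (n * n))
    return k_values


# Hypothetical point locations and study area, invented for illustration.
points = [(0, 0), (1, 0), (0, 1), (5, 5)]
print(ripley_k(points, [1.0, 1.5], area=36.0))  # → [9.0, 13.5]
```

The nested loops recompute every pairwise distance for every distance band. A precomputed neighbor structure replaces the inner loop with a lookup such as get-all-neighbors(N), which is the saving the deck attributes to the self-join index.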
7
Problem Statement
Given:
  A spatial (crime) data warehouse.
  A set of W-Operations.
Find:
  A suitable spatial index type representation.
Objective:
  User response time is minimized.
Constraints:
  Dataset is updated infrequently.
  Concurrency control and recovery considerations are addressed separately.
[Figures: Input Data; Output & W Operations. Courtesy: Ned Levine and Associates]
8
Challenges
Scalability to Large Datasets
Dataset Size = 14852 Crime Reports
CrimeStat Libraries’ Response Time = 2 hours 30 minutes
Query: Where are the Burglary hotspots ?
9
Related Work: Classification
SDBMS Tools
SDBMS Tool        Spatial Indices Supported   Spatial Self-Join Indices
CrimeStat         No                          No
Oracle Spatial    R-Tree, Quad Tree           No
SQL Server 2008   Grid files                  No
PostGIS           R-Tree                      No
ESRI ArcSDE       Grid files                  No
Current R-Tree family index structures recompute W on the fly for every query, which is computationally expensive.
Our Approach: Pre-computed W (Self-join).
10
Contributions
Modeled W-Queries
  Proposed a set of W-Operations
  W-Query Processing Algorithms
Self-join Index
  Representation
  Algebraic Cost model: Operations
Experimental Evaluation
  Experimental Setup
  User Response time analysis
11
Self-Join Index: Representation
Key Observations
Classical Join Index: Edge List
Which representation can localize neighbor, successor and predecessor information?
W-Matrix ↔ Self-join Neighborhood Graph
Self-Join Adjacency List Index
Si    Adjacency List
N1    [N2:w12, N3:w13, N5:w15]
N2    [N1:w21, N3:w23, N5:w25, N6:w26]
N3    [N2:w32, N1:w31]
N4    [N5:w45, N6:w46]
N5    [N4:w54, N6:w56, N2:w52]
N6    [N7:w67, N5:w65, N4:w64, N2:w62]
N7    [N6:w76]
Edge List
Si    Sj    Wij
N2    N1    w21
N2    N3    w23
N2    N5    w25
N2    N6    w26
N1    N2    w12
N1    N3    w13
N1    N5    w15
N1    N4    w14
N3    N1    w31
N3    N2    w32
N3    N5    w35
N1    N6    w16
Adjacency list LOCALIZES successor, predecessor and row-normalized information.
Edge List SCATTERS these.
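The difference between the two layouts can be sketched in Python. This is an illustrative sketch only: the neighbor sets come from the example W-Matrix, the weights are its row-normalized values, and the names `adjacency_index`, `edge_list`, `get_all_neighbors`, and `get_all_neighbors_from_edges` are assumptions that loosely mirror the get-all-neighbors() operation rather than any actual CrimeStat or SDBMS API.

```python
# Neighbor sets of the seven-node example, read off the W-Matrix.
NEIGHBORS = {
    "N1": ["N2", "N3", "N5"],
    "N2": ["N1", "N3", "N5", "N6"],
    "N3": ["N1", "N2"],
    "N4": ["N5", "N6"],
    "N5": ["N1", "N2", "N4", "N6"],
    "N6": ["N2", "N4", "N5", "N7"],
    "N7": ["N6"],
}

# Adjacency list index: one entry per node holding (neighbor, weight) pairs,
# with row-normalized weights (each node's weights sum to 1).
adjacency_index = {
    node: [(nbr, 1 / len(nbrs)) for nbr in nbrs]
    for node, nbrs in NEIGHBORS.items()
}

# Edge list (classical join index layout): one (Si, Sj, Wij) row per pair.
edge_list = [
    (node, nbr, weight)
    for node, pairs in adjacency_index.items()
    for nbr, weight in pairs
]


def get_all_neighbors(index, node):
    """Single lookup: all neighbors of `node` are stored together."""
    return [nbr for nbr, _ in index[node]]


def get_all_neighbors_from_edges(edges, node):
    """Full scan: rows for `node` may sit anywhere in the edge list."""
    return [sj for si, sj, _ in edges if si == node]


print(get_all_neighbors(adjacency_index, "N2"))
print(get_all_neighbors_from_edges(edge_list, "N2"))
```

Both calls return the same neighbors, but the adjacency list answers with one keyed access while the edge list has to inspect every row unless it carries an extra index on Si.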
W-Matrix: Neighborhood Graph or Self-join
      N1    N2    N3    N4    N5    N6    N7
N1    0     0.33  0.33  0     0.33  0     0
N2    0.25  0     0.25  0     0.25  0.25  0
N3    0.5   0.5   0     0     0     0     0
N4    0     0     0     0     0.5   0.5   0
N5    0.25  0.25  0     0.25  0     0.25  0
N6    0     0.25  0     0.25  0.25  0     0.25
N7    0     0     0     0     0     1     0
15
Experimental Evaluation: Experiment Setup
[Figure: experiment setup. Labeled components: Dataset Size, Size of the Police Precincts, Self-Join Index Generator, SJALI, W-Query Processing Algorithms, Candidate Algorithms (CalcRipleyK, Hotspots_JI), Response Time Analysis]
Experiment Goals: Compare candidates on response times.
Metric of Comparison: Response time
Workload: Baltimore Auto theft ’96 (Crime Report ID, Location, Date)
Hardware: Intel Xeon 3.2 GHz, 4 GB RAM
Candidates
CrimeStat Libraries
R-Tree: Tree Matching
Self-Join Index
16
Baltimore Auto-theft Dataset
Crime Report
Baltimore County Auto Thefts from Jan 1996 to Sept 1996: 14852 Crime Reports
Courtesy: Ned Levine and Associates (www.nedlevine.com)
17
Response Time Analysis: Comparison with R-Tree
Response time comparison for hotspot identification.
Response time comparison for K-Function computation.
Questions:
How does the response time of the Ripley’s K function query vary with dataset size?
How does the response time of the hotspot identification query vary with dataset size?
Fixed Parameters
  Hotspots: Hotspot min-Size Threshold = 10 Crime Reports
  K Function: # of max-significance levels = 100
Overall Trend: Self-join Index vs. R-Tree: response time reduced by a factor of 2.
18
Response Time Analysis: Comparison with CrimeStat
Response time comparison for hotspot identification.
Response time comparison for K-Function computation.
Questions:
How does the response time of the Ripley’s K function query vary with dataset size?
How does the response time of the hotspot identification query vary with dataset size?
Fixed Parameters
  Hotspots: Hotspot min-Size Threshold = 10 Crime Reports
  K Function: # of max-significance levels = 100
Overall Trend: Self-join Index vs. CrimeStat: response time reduced by a factor of 40.
19
Conclusions
  W-Queries are important in spatial statistics, e.g. crime analysis, public health, and transportation.
  W-Operations of W-Queries.
  Self-join adjacency list index is more scalable than R-Tree and CrimeStat.
Future Work
  Experimental Quantification
    I/O costs of W-Query Processing Algorithms.
    I/O cost models for W-Query Processing Algorithms.
  Further I/O Optimization
    Extracting optimal page access sequences for processing W-Queries.
    Optimizing the number of W-Query operations.
  Other W-Queries
    Local Moran’s I, Local Getis-Ord.
  Larger datasets (>= 100000): will R-Tree be comparable?
20
Acknowledgment
Members of the Spatial Database and Data Mining Research Group University of Minnesota, Twin-Cities.
This Work was supported by Grants from NSF, USDOD and NIJ.
Thank You for your Questions, Comments and Patience!