A PARALLEL FORMULATION OF THE SPATIAL AUTO-REGRESSION MODEL
FOR MINING LARGE GEO-SPATIAL DATASETS
HPDM 2004 Workshop at SIAM Data Mining Conference
Barış M. Kazar, Shashi Shekhar, David J. Lilja, Daniel Boley
Army High Performance Computing and Research Center (AHPCRC)
Minnesota Supercomputing Institute (MSI)
Digital Technology Center (DTC)
University of Minnesota
04.24.2004
Overview
• Motivation
• Classical and New Data-Mining Techniques
• Problem Definition
• Our Approach
• Experimental Results
• Conclusions and Future Work
Motivation
• Widespread use of spatial databases
  – Mining spatial patterns
  – The 1855 Asiatic Cholera in London [Griffith]
• Fair Lending [NYT, R. Nader]
  – Correlation of bank locations with loan activity in poor neighborhoods
• Retail Outlets [NYT, Walmart, McDonald's, etc.]
  – Determining locations of stores by relating neighborhood maps with customer databases
• Crime Hot Spot Analysis [NYT, NIJ CML]
  – Explaining clusters of sexual assaults by locating addresses of sex offenders
• Ecology [Uygar]
  – Explaining location of bird nests based on structural environmental variables
Key Concept: Neighborhood Matrix (W)
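$$
\text{neighbors}(i,j) =
\begin{cases}
\text{NORTH: } (i-1,\, j), & 2 \le i \le p,\ 1 \le j \le q \\
\text{EAST: } (i,\, j+1), & 1 \le i \le p,\ 1 \le j \le q-1 \\
\text{SOUTH: } (i+1,\, j), & 1 \le i \le p-1,\ 1 \le j \le q \\
\text{WEST: } (i,\, j-1), & 1 \le i \le p,\ 2 \le j \le q
\end{cases}
$$
W allows other neighborhood definitions
• distance based
• 8 and more neighbors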
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
Space + 4-neighborhood
Binary W (16-by-16; the slide highlights the 6th row):
0100100000000000
1010010000000000
0101001000000000
0010000100000000
1000010010000000
0100101001000000
0010010100100000
0001001000010000
0000100001001000
0000010010100100
0000001001010010
0000000100100001
0000000010000100
0000000001001010
0000000000100101
0000000000010010
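Row-normalized W (16-by-16; the slide highlights the 6th row):
0   1/2 0   0   1/2 0   0   0   0   0   0   0   0   0   0   0
1/3 0   1/3 0   0   1/3 0   0   0   0   0   0   0   0   0   0
0   1/3 0   1/3 0   0   1/3 0   0   0   0   0   0   0   0   0
0   0   1/2 0   0   0   0   1/2 0   0   0   0   0   0   0   0
1/3 0   0   0   0   1/3 0   0   1/3 0   0   0   0   0   0   0
0   1/4 0   0   1/4 0   1/4 0   0   1/4 0   0   0   0   0   0
0   0   1/4 0   0   1/4 0   1/4 0   0   1/4 0   0   0   0   0
0   0   0   1/3 0   0   1/3 0   0   0   0   1/3 0   0   0   0
0   0   0   0   1/3 0   0   0   0   1/3 0   0   1/3 0   0   0
0   0   0   0   0   1/4 0   0   1/4 0   1/4 0   0   1/4 0   0
0   0   0   0   0   0   1/4 0   0   1/4 0   1/4 0   0   1/4 0
0   0   0   0   0   0   0   1/3 0   0   1/3 0   0   0   0   1/3
0   0   0   0   0   0   0   0   1/2 0   0   0   0   1/2 0   0
0   0   0   0   0   0   0   0   0   1/3 0   0   1/3 0   1/3 0
0   0   0   0   0   0   0   0   0   0   1/3 0   0   1/3 0   1/3
0   0   0   0   0   0   0   0   0   0   0   1/2 0   0   1/2 0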
Given:
• Spatial framework
• Attributes
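The 4-neighborhood W for a p-by-q grid can be generated rather than typed in. The following is a minimal sketch, assuming cells are numbered row by row as in the 4-by-4 grid on this slide (0-based in the code); the function names are illustrative and are not taken from the authors' Fortran 77 software.

```python
def grid_neighborhood_matrix(p, q):
    """Binary 4-neighbor W for a p-by-q grid, cells numbered row by row from 0."""
    n = p * q
    w = [[0] * n for _ in range(n)]
    for i in range(p):
        for j in range(q):
            cell = i * q + j
            # North, south, west, east; a neighbor counts only if it is on the grid.
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < p and 0 <= nj < q:
                    w[cell][ni * q + nj] = 1
    return w


def row_normalize(w):
    """Divide each row by its row sum so that every nonempty row sums to 1."""
    normalized = []
    for row in w:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


binary_w = grid_neighborhood_matrix(4, 4)
normalized_w = row_normalize(binary_w)

# The 6th row (index 5) belongs to an interior cell with four neighbors.
sixth_row = "".join(str(value) for value in binary_w[5])
print(sixth_row)           # 0100101001000000
print(normalized_w[5][1])  # 0.25
```

Corner cells have two neighbors, edge cells three, and interior cells four, which is why the row-normalized entries are 1/2, 1/3, and 1/4.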
Classical and New Data-Mining Techniques
Name | Model | Classification Accuracy
Classical Linear Regression | $y = x\beta + \varepsilon$ | Low
Spatial Auto-Regression | $y = \rho W y + x\beta + \varepsilon$ | High

$\rho$: the spatial auto-regression (auto-correlation) parameter
$W$: $n$-by-$n$ neighborhood matrix over spatial framework
• Solving Spatial Auto-regression Model
  – $\rho = 0$, $\varepsilon = 0$: Least Squares Problem
  – $\beta = 0$, $\varepsilon = 0$: Eigenvalue Problem
  – General case: Computationally expensive
• Maximum Likelihood Estimation
$$
\ln(L) = \ln\left|I - \rho W\right| - \frac{n \ln(2\pi)}{2} - \frac{n \ln(\sigma^2)}{2} - SSE
$$
• Need parallel implementation to scale up
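The log-determinant term is what makes the general case expensive, and it is also why the eigenvalues of W are worth computing once. A minimal sketch follows, assuming the log-likelihood has the form ln(L) = ln|I − ρW| − n ln(2π)/2 − n ln(σ²)/2 − SSE and using the identity ln|I − ρW| = Σ ln(1 − ρλᵢ) over the eigenvalues λᵢ of W. The function names and the `sse_term` argument (a precomputed value of the SSE term) are hypothetical.

```python
import math


def log_det_from_eigenvalues(eigenvalues, rho):
    """ln|I - rho*W| computed as the sum of ln(1 - rho*lambda_i)."""
    return sum(math.log(1.0 - rho * lam) for lam in eigenvalues)


def log_likelihood(eigenvalues, rho, sigma2, sse_term):
    """Log-likelihood under the assumed form; sse_term is supplied by the caller."""
    n = len(eigenvalues)
    return (log_det_from_eigenvalues(eigenvalues, rho)
            - n * math.log(2.0 * math.pi) / 2.0
            - n * math.log(sigma2) / 2.0
            - sse_term)


# Toy check: W = [[0, 1], [1, 0]] has eigenvalues +1 and -1,
# and det(I - rho*W) = 1 - rho**2.
toy_log_det = log_det_from_eigenvalues([1.0, -1.0], 0.5)
```

Once the eigenvalues are known, each new candidate ρ costs only O(n) work for this term instead of a fresh determinant.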
Related Work & Our Contributions
• Related work: Li, 1996
  – Limitations: Solved 1-D problem
• Our Contributions
  – Parallel solution for 2-D problems
  – Portable software
    » Fortran 77
  – An application of hybrid parallelism
    » MPI messaging system
    » Compiler directives of OpenMP
A Serial Solution
[Diagram: three-stage serial pipeline. Stage A (Compute Eigenvalues) takes x, y, W, n, and the range of ρ, and passes the eigenvalues of W to Stage B. Stage B (Golden Section Search, Calculate ML Function) passes ρ̂ to Stage C. Stage C (Least Squares) also takes x, y, W, n and outputs ρ̂, β̂, σ̂².]
• Compute Eigenvalues (Stage A)
  – Produces dense W neighborhood matrix
  – Forms synthetic data y
  – Makes W symmetric
  – Householder transformation: convert dense symmetric matrix to tri-diagonal matrix
  – QL transformation: compute all eigenvalues of tri-diagonal matrix
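Stage B searches for the ρ that maximizes the likelihood function. The following is a minimal sketch of a golden section search for the maximizer of a unimodal function on an interval; the function name, the tolerance, and the toy objective are illustrative stand-ins, not the authors' implementation or their likelihood function.

```python
import math


def golden_section_maximize(f, lo, hi, tol=1e-8):
    """Return an approximate maximizer of a unimodal function f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0  # about 0.618
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:
            # The maximum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # The maximum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2.0


# Toy objective with its peak at rho = 0.3, searched over [0, 1].
rho_hat = golden_section_maximize(lambda rho: -(rho - 0.3) ** 2, 0.0, 1.0)
```

Each iteration reuses one of the two interior evaluations, so the likelihood function is evaluated only once per step.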
Serial Response Times (sec)
• Stage A is the bottleneck; Stages B and C contribute very little to the response time
[Figure: stacked bar chart of Stage A, Stage B, and Stage C times. x-axis: Problem Sizes on Different Machines (2500, 6400, 10000, each on SGI Origin, IBM SP, and IBM Regatta); y-axis: Time (sec), 0 to 7000.]
Problem Definition
Given:
• A sequential solution procedure: "Serial Dense Matrix Approach" for one-dimensional geo-spaces
Find:
• Parallel formulation of the Serial Dense Matrix Approach for multi-dimensional geo-spaces
Constraints:
• $\varepsilon \sim N(0, \sigma^2 I)$, IID
• Reasonably efficient parallel implementation
• Parallel platform
• Size of W (large vs. small and dense vs. sparse)
Objective:
• Portable & scalable software
Our Approach – Parallel Spatial Auto-Regression
• Function vs. Data Partitioning
  – Function partitioning: Each processor works on the same data with different instructions
  – Data partitioning (applied): Each processor works on different data with the same instructions
• Implementation Platform: Fortran with MPI & OpenMP APIs
• No machine-specific compiler directives
  – Portability
  – Helps software development and technology transfer
• Other Performance Tuning
  – Static terms computed once
Data Partitioning in a Smaller Scale
• 4 processors are used and chunk size can be determined by the user
• W is 16-by-16 and partitioned across processors
[Figure: the row-normalized W shown twice, with its rows labeled 16, 15, ..., 1. Contiguous panel: consecutive blocks of four rows are assigned to P1, P2, P3, and P4. Round-robin panel with chunk size 1: rows are assigned to P1, P2, P3, P4 in turn.]
Per-processor totals (round-robin vs. contiguous):
• P1: 40 vs. 58
• P2: 36 vs. 42
• P3: 32 vs. 26
• P4: 28 vs. 10
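The per-processor totals on this slide can be reproduced with a few lines of code. This is a minimal sketch, assuming the labels 16, 15, ..., 1 are per-row work counts (for example, the trip counts of a triangular loop); the function names are illustrative, and processors are indexed from 0, so index 0 corresponds to P1.

```python
def contiguous_partition(n_rows, n_procs):
    """Assign consecutive blocks of rows to processors 0..n_procs-1."""
    block = max(1, n_rows // n_procs)
    return [min(row // block, n_procs - 1) for row in range(n_rows)]


def round_robin_partition(n_rows, n_procs, chunk=1):
    """Deal chunks of `chunk` consecutive rows to processors in turn."""
    return [(row // chunk) % n_procs for row in range(n_rows)]


def load_per_processor(owners, row_work, n_procs):
    """Sum the work of the rows owned by each processor."""
    loads = [0] * n_procs
    for owner, work in zip(owners, row_work):
        loads[owner] += work
    return loads


# Assumed work per row: 16 for the first row down to 1 for the last.
row_work = list(range(16, 0, -1))

contiguous_loads = load_per_processor(
    contiguous_partition(16, 4), row_work, 4)
round_robin_loads = load_per_processor(
    round_robin_partition(16, 4, chunk=1), row_work, 4)

print(contiguous_loads)   # [58, 42, 26, 10]
print(round_robin_loads)  # [40, 36, 32, 28]
```

Under this assumption the most loaded processor carries 58 units with contiguous partitioning but only 40 with round-robin, which illustrates why round-robin balances a triangular workload better.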
Data Partitioning & Synchronization
• A: Contiguous for rectangular loops & round-robin with chunk size 4
• B: Contiguous
• C: Contiguous
• The arrows are also synchronization points for the parallel solution
• There are synchronization points within the boxes as well
[Diagram: Stage A (Compute Eigenvalues) → Stage B (Golden Section Search, Calculate ML Function) → Stage C (Least Squares), with the same inputs and outputs as in the serial solution.]
Experimental Design
Factor Name | Parameter Domain
Language | f77 w/ OpenMP & MPI
Problem Size (n) | 2500, 6400, and 10000 observation points
Neighborhood Structure | 2-D w/ 4-neighbors
Method | Maximum Likelihood for exact SAM
Auto-regression Parameter | [0,1)
Load-Balancing, SLB | Contiguous (B = n/p); Round-robin w/ B = {1, 4, 8, 16}; Combined (Contiguous + Round-robin)
Load-Balancing, DLB | Dynamic w/ B = {n/p, 1, 4, 8, 16}; Guided w/ B = {1, 4, 8, 16}
Load-Balancing, MLB | Affinity w/ B = {n/p, 1, 4, 8, 16}
Hardware Platform | IBM Regatta w/ 47.5 GB Main Memory; 32 1.3 GHz Power4 architecture processors
Number of Processors | 1, 4, and 8
Experimental Results – Effect of Load Balancing
[Figure: Effect of Load-Balancing Techniques on Speedup for Problem Size 10000. x-axis: Number of Processors (1, 4, 8); y-axis: Speedup (0 to 8); series: mixed1, Static B=8, Dynamic B=8, Affinity B=1, Guided B=16.]
Experimental Results- Effect of Problem Size
[Figure: Impact of Problem Size on Speedup Using Affinity Scheduling on 8 Processors. x-axis: Problem Size (2500, 6400, 10000); y-axis: Speedup (0 to 8); series: affinity B=n/p, affinity B=1, affinity B=4, affinity B=8, affinity B=16.]
Experimental Results- Effect of Chunk Size
[Figure: Effect of Chunk Size on Speedup Using Static Scheduling on 8 Processors. x-axis: Chunk Size (1, 4, 8, 16, n/p); y-axis: Speedup (0 to 8); series: PS=2500, PS=6400, PS=10000.]
[Figure: Effect of Chunk Size on Speedup Using Dynamic Scheduling on 8 Processors. x-axis: Chunk Size (1, 4, 8, 16, n/p); y-axis: Speedup (0 to 8); series: PS=2500, PS=6400, PS=10000.]
• There is a critical value of the chunk size at which the speedup reaches its maximum.
• This value is higher for dynamic scheduling to compensate for the scheduling overhead.
• The workload is more evenly distributed across processors at the critical chunk size value.
Experimental Results- Effect of # of Processors
[Figure: Effect of Number of Processors on Speedup (PS=10000). x-axis: Load-Balancing (Scheduling) Techniques (mixed1, mixed2; static B=1, 4, 8, 16, n/p; dynamic B=1, 4, 8, 16, n/p; affinity B=1, 4, 8, 16, n/p; guided B=1, 4, 8, 16); y-axis: Speedup (0 to 8); series: 4 and 8 processors.]
Summary
• Developed a parallel formulation of the spatial auto-regression model
• Estimates maximum likelihood of regular square tessellation 1-D and 2-D planar surface partitionings for location prediction problems
• Used dense eigenvalue computation and hybrid parallel programming
Future Work
1. Understand reasons for inefficiencies
   – Algebraic cost model for speedup measurements on different architectures
2. Fine-tune the implemented parallel formulation
   – Consider alternate parallel formulations
3. Parallelize other serial solutions using sparse-matrix techniques
   – Chebyshev Polynomial approximation
   – Markov Chain Monte Carlo Estimator
Acknowledgments & Final Word
• Army High Performance Computing Research Center (AHPCRC)
• Minnesota Supercomputing Institute (MSI)
• Digital Technology Center (DTC)
• Spatial Database Group Members
• ARCTiC Labs Group Members
• Dr. Sanjay Chawla
• Dr. Kelley Pace
• Dr. James LeSage
THANK YOU VERY MUCH
Questions?