
PhD Prelim Oral Exam

Parallelizing Eigen-Value Computation Based Exact Spatial Auto-Regression Model Solution

Baris M. Kazar
Advisors: Dr. Shashi Shekhar, Dr. David J. Lilja

AHPCRC, Dept. of Electrical & Computer Eng’g.

University of Minnesota

[email protected]
http://www.cs.umn.edu/~kazar

09/20/2004

Biography

• Education
  – M.S., Electrical and Computer Engineering, UMN-TC, 2000
  – Took WPE; started Ph.D. thesis
  – Ph.D. Candidate, Electrical and Computer Engineering, UMN-TC (expected 2005)

• Research Interests
  – Data and knowledge engineering, spatial database management, spatial data mining, parallel processing, geographic information systems, and spatial statistics


Biography

• Publications
  – High Performance Spatial Data-Mining, B. M. Kazar, S. Shekhar, and D. J. Lilja, AHPCRC Tech Report no. 2003-125 (poster at ACM SIGPLAN Principles and Practice of Parallel Programming (PPoPP) Conference, June 2003).
  – A Parallel Formulation of the Spatial Auto-Regression Model for Mining Large Geo-Spatial Databases, B. M. Kazar, S. Shekhar, D. J. Lilja, and D. Boley, AHPCRC Tech Report no. 2004-103, March 2004 (International Workshop on High Performance and Distributed Mining (HPDM) at SIAM Data Mining Conference, April 2004).
  – Comparing Exact and Approximate Spatial Auto-Regression Model Solutions for Spatial Data Analysis, B. M. Kazar, S. Shekhar, D. J. Lilja, and K. Pace, AHPCRC Tech Report no. 2004-126 (GIScience 2004 Conference, October 2004).
  – Scalable Parallel Approximate Formulations of Multi-Dimensional Spatial Auto-Regression Models for Spatial Data Mining, S. Shekhar, B. M. Kazar, and D. J. Lilja, to appear as a summary paper in the 24th Army Science Conference.


Outline

• Spatial Data-Mining (SDM)
  – Motivation
  – Spatial Autocorrelation (SA)
  – SDM Provides Better Model
  – But, Computation Costs are Much Higher
  – Computational Challenge
• Problem Definition
• Our Approach
• Algebraic Analysis
• Experimental Results
• Summary & Future Work


Motivation

• Widespread use of spatial databases → mining spatial patterns, e.g., the 1855 Asiatic cholera outbreak in London [Griffith]
• Fair Lending [NYT, R. Nader]: correlation of bank locations with loan activity in poor neighborhoods
• Retail outlets [NYT; Walmart, McDonald's, etc.]: determining store locations by relating neighborhood maps with customer databases
• Crime hot-spot analysis [NYT, NIJ CML]: explaining clusters of sexual assaults by locating addresses of sex offenders
• Ecology [Uygar]: explaining locations of bird nests based on structural environmental variables


Spatial Auto-correlation (SA)

• Randomly distributed data (no SA): spatial distribution satisfying the assumptions of classical data analysis
• Cluster-distributed data: spatial distribution NOT satisfying the assumptions of classical data analysis

[Figure: two pixel maps. Left: pixel property with independent identical distribution → random nest locations. Right: pixel property with spatial auto-correlation → clustered nest locations.]


SDM Provides Better Model!

• Linear Regression → SAR
• The spatial auto-regression (SAR) model has higher accuracy and removes the IID assumption of linear regression

  Linear regression:  y = xβ + ε
  SAR model:          y = ρWy + xβ + ε

But, Computation Costs are Much Higher

• Linear regression takes 2 seconds for problem size 10000 on the IBM Regatta
• Stage A is the bottleneck; Stages B and C contribute very little to response time

[Figure: stacked bar chart of time (sec), split into Stage A, Stage B, and Stage C, for problem sizes 2500, 6400, and 10000 on SGI Origin, IBM SP, and IBM Regatta; y-axis from 0 to 7000 sec, with Stage A dominating.]

Computational Challenge

• Maximum-Likelihood Estimation = MINimizing the function

  MIN_{|ρ|<1} { −ln|I − ρW| + (n/2)·ln[ (1/n)·((I − ρW)y)^T (I − x(x^T x)^{-1} x^T)^T (I − x(x^T x)^{-1} x^T) (I − ρW)y ] }

  where the first term is the log-det term,
  ρ: the spatial auto-regression (auto-correlation) parameter,
  W: n-by-n neighborhood matrix over the spatial framework.

• Solving the SAR model
  – ρ = 0 → least-squares problem
  – β = 0, ε = 0 → eigen-value problem
  – General case → computationally expensive due to the log-det term in the ML function
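The solver must evaluate this objective for many candidate values of ρ. A minimal Python sketch of the function (the deck's implementation is Fortran 77; the name sar_ml_objective and the full-column-rank assumption on x are illustrative):

```python
# Minimal sketch of the ML objective, assuming x has full column rank
# and |rho| < 1; np.linalg.slogdet evaluates the log-det term directly.
import numpy as np

def sar_ml_objective(rho, y, x, W):
    n = len(y)
    I = np.eye(n)
    # M = I - x (x^T x)^{-1} x^T projects onto the orthogonal complement
    # of span(x); it is symmetric and idempotent, so r^T r equals the
    # quadratic form inside the ln[...] above.
    M = I - x @ np.linalg.inv(x.T @ x) @ x.T
    r = M @ ((I - rho * W) @ y)
    sign, logdet = np.linalg.slogdet(I - rho * W)
    return -logdet + (n / 2) * np.log((r @ r) / n)
```

Evaluated this way, every candidate ρ costs a fresh dense determinant, which is exactly why Stage A of the implemented algorithm precomputes the eigen-values of W.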

Outline

• Spatial Data-Mining (SDM)

• Problem Definition
  – Key Concept: Neighborhood Matrix (W)
  – Related Work
  – Our Contributions

• Our Approach

• Algebraic Analysis

• Experimental Results

• Summary & Future Work


Problem Definition

Given:
• A sequential solution procedure: the "Serial Dense Matrix Approach" for one-dimensional geo-spaces [Li, 1996]

Find:
• Parallel formulations for multi-dimensional geo-spaces

Objective:
• Scalable and efficient software
• Maximize speedup (T_serial / T_parallel)

Constraints:
• Size of W (large vs. small, and dense vs. sparse)
• ε ~ N(0, σ²I) IID
• Reasonably efficient parallel implementation in multi-dimensional geo-spaces
• Parallel platform
• Memory limitations

Key Concept: Neighborhood Matrix (W)

Given:
• Spatial framework (e.g., a regular grid of cells 1..16)
• Attributes

4-neighborhood definition on the grid: the neighbors of cell (i, j) are (i−1, j) to the NORTH, (i+1, j) to the SOUTH, (i, j+1) to the EAST, and (i, j−1) to the WEST, when they exist.

W allows other neighborhood definitions:
• distance-based
• 8-neighbors

[Figure: a 4-by-4 grid of cells numbered 1..16 under the 4-neighborhood, the resulting 16-by-16 binary W, and the row-normalized W obtained by dividing each row of the binary W by its row sum. The highlighted 6th row of the binary W is 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 (cell 6 neighbors cells 2, 5, 7, and 10); in the row-normalized W those four entries become 1/4.]
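A minimal Python sketch of constructing this W for a q-by-q grid (the deck's code is Fortran 77; the function name neighborhood_matrix is illustrative):

```python
# Build the binary 4-neighbor W for a q-by-q regular grid, then
# row-normalize it so each row sums to 1.
import numpy as np

def neighborhood_matrix(q):
    n = q * q
    W = np.zeros((n, n))
    for i in range(q):
        for j in range(q):
            cell = i * q + j
            for di, dj in ((-1, 0), (1, 0), (0, 1), (0, -1)):  # N, S, E, W
                ni, nj = i + di, j + dj
                if 0 <= ni < q and 0 <= nj < q:
                    W[cell, ni * q + nj] = 1.0
    return W

W = neighborhood_matrix(4)                  # the 16-by-16 binary W
W_norm = W / W.sum(axis=1, keepdims=True)   # row-normalized W
print(W[5])        # 6th row: 1s at cells 2, 5, 7, 10 (0-based 1, 4, 6, 9)
print(W_norm[5])   # same row with entries 1/4
```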

Related Work

• Related work: [Li, 1996]
  – Solved the 1-D problem
  – Used the CMSSL linear algebra library on CM-5 supercomputers, which are no longer available for use

• Limitations:
  – Not applicable to 2-D or 3-D geo-spaces
  – Not portable

Our Contributions

• Parallel solutions for 2-D, 3-D (multi-dimensional) very large problems

• Scalable and efficient software
  – Fortran 77
  – An application of hybrid parallelism (MPI & OpenMP)

By the final exam we will also have:
• Other alternative solutions for SAR
• Determination of which solution dominates when
• Ranking of solutions with respect to:
  – ρ and β scaling, which affects accuracy
  – Computational complexity
  – Memory requirement

Outline

• Spatial Data-Mining (SDM)
• Problem Definition
• Our Approach
  – Dimensions of Design Space & Details
    • Implementation Platform
    • Implemented Algorithm & Operation Count
    • Parallel Formulation
    • Load-Balancing
• Algebraic Analysis
• Experimental Results
• Summary & Future Work

Our Approach

• Parallel formulations of SAR model solutions for multi-dimensional (2-D and 3-D) geo-spaces

• Scalability and efficiency

• Design space dimensions:
  – Implementation Platform
  – Implemented (Four) Algorithms
  – Parallel Formulation
  – Load-Balancing

Implementation Platform

Four options for this dimension:

• C with OpenMP API

• C++ with OpenMP API

• Java with OpenMP API

• Fortran 77/90/95 with OpenMP API
  – Column-major programming language
  – Thinking in terms of vectors


Implemented Algorithm

• Stage A computes the eigen-values:
  – Produces the dense neighborhood matrix W
  – Forms synthetic data y (optional)
  – Makes W symmetric
  – Householder transformation converts the dense symmetric matrix to a tri-diagonal matrix
  – QL transformation computes all eigen-values of the tri-diagonal matrix

• The eigen-values λ_i of W make the log-det term cheap to re-evaluate for every candidate ρ:

  ln|I − ρW| = Σ_{i=1..n} ln(1 − ρλ_i)

[Flowchart: inputs x, y, W, n, and the range of ρ → Stage A (Compute Eigenvalues) → Stage B (Golden Section Search; Calculate ML Function from the eigen-values of W) → Stage C (Least Squares) → outputs ρ̂_best and β̂_best.]
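Putting the three stages together, a compact Python sketch (numpy's eigvalsh performs the Householder and QL steps internally; the deck's Fortran 77 code parallelizes those loops, and all names here are illustrative):

```python
import numpy as np

def stage_a_eigenvalues(W):
    # Householder reduction to tri-diagonal form + QL iteration, as
    # performed internally by eigvalsh (W must already be symmetric).
    return np.linalg.eigvalsh(W)

def ml_objective(rho, y, x, W, lam):
    # Same objective as before, but ln|I - rho W| = sum_i ln(1 - rho*lam_i),
    # so no determinant is recomputed per candidate rho.
    n = len(y)
    M = np.eye(n) - x @ np.linalg.inv(x.T @ x) @ x.T
    r = M @ (y - rho * (W @ y))
    return -np.sum(np.log(1.0 - rho * lam)) + (n / 2) * np.log((r @ r) / n)

def stage_b_golden_section(f, lo, hi, tol=1e-6):
    # Golden section search for the minimizer of a unimodal f on [lo, hi].
    g = (np.sqrt(5.0) - 1) / 2
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return (a + b) / 2

def stage_c_least_squares(rho, y, x, W):
    # beta_hat = (x^T x)^{-1} x^T (I - rho W) y, via a least-squares solve.
    beta, *_ = np.linalg.lstsq(x, y - rho * (W @ y), rcond=None)
    return beta

# Usage, given x, y, W, n, and a range for rho:
#   lam = stage_a_eigenvalues(W)
#   rho_best = stage_b_golden_section(
#       lambda r: ml_objective(r, y, x, W, lam), -0.99, 0.99)
#   beta_best = stage_c_least_squares(rho_best, y, x, W)
```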

Operation Counts for Exact SAR Model Soln

• The Householder transformation, i.e., reduction to tri-diagonal form, is the most complex operation

  Computation cost:   (2n³/3p)·τ_DGEMM + (2n³/3p)·τ_DGEMV
  Communication cost: (8.25·n)·τ_lat + (2.5·log₂ p)·(n²/p)·τ_band

Parallel Formulation

• Function Partitioning: Each processor works on the same data with different instructions

• Data partitioning: Each processor works on different data with the same instructions

[Diagram: in a serial run, one PE executes Function 1, Function 2, and Function 3 over data sets DS1, DS2, and DS3 in sequence. In a function-partitioned parallel run, PE #1, PE #2, and PE #3 each execute a different function. In a data-partitioned parallel run, PE #1, PE #2, and PE #3 all execute Function 1, each on its own data set (DS1, DS2, DS3).]
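A toy Python illustration of the data-partitioning side (the real implementation uses OpenMP threads and MPI, not Python processes):

```python
# Every worker (PE) runs the same function on a different data set;
# function partitioning would instead give each PE a different function.
from multiprocessing import Pool
import numpy as np

def function_1(chunk):
    return chunk.sum()      # same instructions on every PE

if __name__ == "__main__":
    data = np.arange(12.0)
    ds1, ds2, ds3 = np.array_split(data, 3)   # DS1, DS2, DS3
    with Pool(3) as pool:                     # PE #1, PE #2, PE #3
        print(pool.map(function_1, [ds1, ds2, ds3]))
```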

Data Parallel Formulation

• Allows finer-granularity parallelism, i.e., loop-level parallelism
• Can be implemented by dividing data into chunks (see the sketch below):
  – Column-wise (for column-major programming languages)
  – Row-wise (for row-major programming languages)
  – Checker-board-wise
• Loop parallelization; multiple loops in each stage
• Each box handles a set of columns
• Each thread in the parallel region is timed, to see load imbalance via the guide77 tool on IBM Regatta

[Flowchart: Start Program → Stage 1 (Computing Eigenvalues) → Stage 2 (Golden Section Search) → Stage 3 (Least Squares) → End Program, with synchronization points between stages and serial regions separating the parallel regions.]
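For example, the column-wise chunking recommended above for a column-major language looks like this in a small Python sketch (np.array_split stands in for the OpenMP loop decomposition):

```python
import numpy as np

W = np.arange(64.0).reshape(8, 8)          # stand-in for the dense W
p = 4
col_chunks = np.array_split(W, p, axis=1)  # column-wise, for column-major (Fortran) layouts
row_chunks = np.array_split(W, p, axis=0)  # row-wise, for row-major (C) layouts
print([c.shape for c in col_chunks])       # [(8, 2), (8, 2), (8, 2), (8, 2)]
```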

Load-Balancing Techniques

• The chunk size B << n/p, e.g., 4, 8, and 16 in our study
• Dynamic: threads are assigned chunks on a "first-come, first-do" basis
• Affinity: two levels of chunks; threads may execute other threads' partitions (chunk stealing)

  Scheme                      Chunk size             # of chunks
  SLB:  Contiguous            n/p                    p
  SLB:  Round Robin           B                      n/B
  DLB:  Dynamic w/o Chunk     1                      n
  DLB:  Dynamic w/ Chunk      B                      n/B
  DLB:  Guided w/o Chunk      decreasing from n/p    < n/B
  DLB:  Guided w/ Chunk       decreasing to B        < n/B
  DLB:  Affinity w/o Chunk    n/(2p)                 < 2p
  DLB:  Affinity w/ Chunk     B                      n/B
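As a small illustration of the "decreasing from n/p" entries, a Python sketch of how a guided schedule's chunk size decays (the OpenMP runtime does this internally; the remaining/p decay rule used here is an assumption):

```python
def guided_chunks(n, p, min_chunk=1):
    # Each new chunk takes (remaining iterations) / p, never below min_chunk.
    remaining, sizes = n, []
    while remaining > 0:
        size = max(remaining // p, min_chunk)
        sizes.append(size)
        remaining -= size
    return sizes

print(guided_chunks(16, 4))   # [4, 3, 2, 1, 1, 1, 1, 1, 1, 1]: 10 chunks, fewer than n
```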

Load-Balancing: Which Data to Partition?

• Candidates: y, W, x, β, ε
• W is partitioned across processors

  y = ρ·W·y + x·β + ε
  with dimensions y: n-by-1, ρ: 1-by-1, W: n-by-n, x: n-by-k, β: k-by-1, ε: n-by-1

Load-Balancing: How to Partition? (Small-Scale Example)

• 4 processors are used, and the chunk size can be determined by the user
• W is 16-by-16 and partitioned across processors
• Per-processor workloads, round-robin with chunk size 1 vs. contiguous: P1 (40 vs. 58), P2 (36 vs. 42), P3 (32 vs. 26), P4 (28 vs. 10)

[Figure: the 16-by-16 row-normalized W with its rows, numbered 16 down to 1, assigned to P1..P4 once round-robin with chunk size 1 and once contiguously; only the lower half of the symmetric matrix is used, so higher-numbered rows carry more work. The workload counts above follow (see the sketch below).]
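The workload numbers above can be reproduced with a few lines of Python, since row i of the lower triangle holds i entries (function names are illustrative):

```python
# Rows are processed from 16 down to 1; work for row i is i entries
# of the lower triangle.
rows = list(range(16, 0, -1))

def round_robin_work(rows, p, chunk=1):
    work = [0] * p
    for pos, r in enumerate(rows):
        work[(pos // chunk) % p] += r
    return work

def contiguous_work(rows, p):
    size = len(rows) // p
    return [sum(rows[i * size:(i + 1) * size]) for i in range(p)]

print(round_robin_work(rows, 4))   # [40, 36, 32, 28]: nearly balanced
print(contiguous_work(rows, 4))    # [58, 42, 26, 10]: badly imbalanced
```

This triangular workload is why contiguous partitioning fails while round-robin stays close to balanced.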

Outline

• Spatial Data-Mining (SDM)

• Problem Definition

• Our Approach

• Algebraic Analysis
  – Cost Model
  – Ranking of Load-Balancing Techniques

• Experimental Results

• Summary & Future Work


Algebraic Cost Model for Exact SAR Model

T_serial ≈ (2n³/3)·τ_DGEMM

T_prll-static  = (1 + ξ_static)·(T_serial/p) + T_comm
T_prll-dynamic = (1 + ξ_dynamic)·(T_serial/p) + T_comm

T_comm ≈ (n/B)·(8.25 + 2.5·log₂ p)·τ_lat + n·τ_band

τ_DGEMM, τ_DGEMV: time to execute one flop in a rank-n DGEMM and DGEMV of ScaLAPACK, respectively
τ_lat: time to prepare a message for transmission
τ_band: time taken by the message to traverse the network to its destination (1/bandwidth)
ξ_static, ξ_static-rr, ξ_dynamic, ξ_affinity: load-imbalance factors for static, static round-robin, dynamic, and affinity scheduling

ξ_static >> ξ_static-rr > ξ_dynamic > ξ_affinity

[Diagram: execution alternates between serial regions and parallel regions of per-process local work, with static (SLB) and dynamic (DLB) load-balancing governing how the local work is assigned.]
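A minimal sketch of evaluating this cost model, assuming the expressions above; the τ values are machine-specific inputs, and the numbers in the example calls are made-up placeholders:

```python
import math

def t_parallel(n, p, B, xi, tau_dgemm, tau_lat, tau_band):
    # Serial work: Householder-dominated, ~(2/3) n^3 flops at tau_dgemm each.
    t_serial = (2.0 * n**3 / 3.0) * tau_dgemm
    # Communication: per-chunk message costs plus a bandwidth term.
    t_comm = (n / B) * (8.25 + 2.5 * math.log2(p)) * tau_lat + n * tau_band
    return (1 + xi) * t_serial / p + t_comm

# A smaller load-imbalance factor xi (e.g., affinity) beats a larger one (static):
print(t_parallel(10000, 8, 4, xi=0.30, tau_dgemm=1e-9, tau_lat=1e-6, tau_band=1e-8))
print(t_parallel(10000, 8, 4, xi=0.05, tau_dgemm=1e-9, tau_lat=1e-6, tau_band=1e-8))
```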

Ranking of Load-Balancing Techniques

Partial ranking:
• Load balance: good (affinity, dynamic), medium (round-robin), poor (contiguous, guided)
• Synchronization cost: low (contiguous, round-robin), high (dynamic, affinity, guided)

Total ranking (best to worst): affinity, round-robin, dynamic, contiguous, guided

• The experimental work for the parallel exact SAR model, which took ~2 CPU-years, and the algebraic cost model agree with each other.

Outline

• Spatial Data-Mining (SDM)
• Problem Definition
• Our Approach
• Algebraic Analysis
• Experiment Design & Results
  – Goals & Experiment Design Questions
  – Experimental Design-1 with Synthetic Dataset
  – Experimental Design-2 with Real Dataset
  – Summary of Experiment Results
• Summary & Future Work

Goals

• Hypothesis: the new parallel formulations proposed in this study will outperform the previous parallel implementation in terms of:
  – Speedup (S),
  – Scalability and efficiency,
  – Problem size (PS), and
  – Memory requirement

• Experiment design answers:
  1. Which load-balancing method provides the best speedup?
  2. How does problem size impact speedup?
  3. How does chunk size affect speedup?
  4. How does the number of processors affect speedup?
  5. Which data-partitioning method provides the best speedup?

Experimental Design

[Flowchart: a Generator/Splitter produces synthetic datasets (2-D with 4-neighbors; sizes 2500, 6400, and 10K; ρ = 0.203, β = 3.75) and splits them into learning data and testing data. The parallel (OpenMP + MPI in f77) eigen-value based exact SAR model solution builds the model on the learning data under each load-balancing technique; the model is then evaluated and the performance data analyzed/summarized.]

• Model quality (accuracy) assessed for learning data only
• Performance measured on an IBM Regatta with 47.5 GB main memory and 32 1.3-GHz Power4 processors
• Predicted model parameters: ρ = 0.2033, β = 3.7485
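A Python sketch of the generator's data step (illustrative; the actual generator is part of the Fortran code): draw x and ε, then solve (I − ρW)y = xβ + ε for y.

```python
import numpy as np

rng = np.random.default_rng(0)
q = 50; n = q * q                         # problem size 2500 (a 50-by-50 grid)
idx = np.arange(n).reshape(q, q)
W = np.zeros((n, n))
for a, b in ((idx[:-1, :], idx[1:, :]), (idx[:, :-1], idx[:, 1:])):
    W[a.ravel(), b.ravel()] = 1.0         # south/east neighbor links
    W[b.ravel(), a.ravel()] = 1.0         # and their mirrors
W /= W.sum(axis=1, keepdims=True)         # row-normalized 4-neighbor W

rho, beta = 0.2033, 3.7485
x = rng.normal(size=(n, 1))
eps = rng.normal(size=n)
y = np.linalg.solve(np.eye(n) - rho * W, x[:, 0] * beta + eps)
# (x, y, W) can now be split into learning/testing data and fed to the
# three-stage solver sketched earlier.
```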

Experimental Results: Effect of Load Balancing

Fixed:
• PS = 10000

• Best results for each class of load-balancing technique are presented
• Affinity resulted in the best speedup & efficiency
• Guided resulted in the worst speedup & efficiency

[Figure: "Effect of Load-Balancing Techniques on Speedup for Problem Size 10000": speedup (0 to 9) vs. number of processors (1, 4, 8) for mixed1, static B=8, dynamic B=8, affinity B=1, and guided B=16, plotted against linear speedup.]

Experimental Results: Effect of Problem Size

Fixed:
• 8 processors
• Load-balancing technique set to affinity

• Interesting trend for affinity with chunk size 1
• Memory limitations

[Figure: "Impact of Problem Size on Speedup Using Affinity Scheduling on 8 Processors": speedup (0 to 8) vs. problem size (2500, 6400, 10000) for affinity B=n/p, B=1, B=4, B=8, and B=16.]

Experimental Results: Effect of Chunk Size

• Fixed: 8 processors
• There is a critical chunk-size value at which the speedup reaches its maximum
• This value is higher for dynamic scheduling, to compensate for the scheduling overhead
• The workload is more evenly distributed at the critical chunk-size value

[Figures: "Effect of Chunk Size on Speedup Using Static Scheduling on 8 Processors" and "Effect of Chunk Size on Speedup Using Dynamic Scheduling on 8 Processors": speedup (0 to 8) vs. chunk size (1, 4, 8, 16, n/p) for PS = 2500, 6400, and 10000.]

Experimental Results: Effect of Number of Processors

Fixed:
• PS = 10000

• Average speedup across all scheduling techniques is 3.43 for the 4-processor case and 5.91 for the 8-processor case
• Affinity scheduling shows the best speedup, on average 7x on 8 processors

[Figure: "Effect of Number of Processors on Speedup (PS=10000)": speedup (0 to 8) on 4 and 8 processors for each load-balancing (scheduling) technique: mixed1, mixed2; static B=1, 4, 8, 16, n/p; dynamic B=1, 4, 8, 16, n/p; affinity B=1, 4, 8, 16, n/p; guided B=1, 4, 8, 16.]

Problem Size 10000 on 16 PEs

[Figure: "Maximum Likelihood Based Parallel SAR, Problem Size 10K": speedup (0 to 11) vs. number of processors (1 to 16) for schemes mixed, cont, rr04, dyn1, dyn4, afc4, gui4, and gui8.]

Serial times (sec): mixed: 2885.4, cont: 2917.1, rr04: 3515.1, dyn1: 3401.7, dyn4: 3326.2, afc4: 3246.9, gui4: 3291.1, gui8: 3198.6

Summary of Results

• Speed-ups
  – Best schemes achieve a speed-up of about 10.5 on 16 processors
  – An order-of-magnitude improvement over serial solutions

• Efficiency
  – 0.6 for the best load-balancing scheme with 16 PEs
  – 0.93 for the best load-balancing scheme with 8 PEs
  – We did not use machine-specific optimizations
  – Trade-off between portability and efficiency

• Speed-up ranking
  – Affinity with chunk size 1 gives the best speedup
  – Static round-robin with chunk size 4 is next
  – Contiguous and guided schemes provide the least speedup
  – Contiguous fails due to non-uniform workload
  – Guided and dynamic round-robin have higher run-time cost

Outline

• Spatial Data-Mining (SDM)

• Problem Definition

• Our Approach

• Algebraic Analysis

• Experimental Results

• Summary & Future Work
  – Future Work
  – Acknowledgments

Future Work

• Efficiency
  – Identify the reasons for inefficiency with larger numbers of PEs and fix them
  – Sparse eigen-value computation is needed

• Hybrid implementations to use more processors

• Other alternative solutions for SAR
  – Scaling the exact SAR model solution by applying direct sparse algorithms such as sparse LU decomposition

• Determining the conditions when a solution dominates, i.e., ranking of all SAR solutions
  – ρ and β scaling, which affects accuracy
  – Computational complexity
  – Memory requirement

• Response of solutions to different inputs, i.e., visualization of SAR model solutions by varying:
  – Degree of auto-correlation
  – Regression coefficients

Acknowledgments

• AHPCRC
• Minnesota Supercomputing Institute
• Spatial Database Group Members
• ARCTiC Labs Group Members
• Dr. Dan Boley
• Dr. Sanjay Chawla
• Dr. Vipin Kumar
• Dr. James LeSage
• Dr. Kelley Pace
• Dr. Paul Schrater
• Dr. Pen-Chung Yew

THANK YOU VERY MUCH
Questions?
