AA220/CS238
Parallel Methods in Numerical Analysis
A Parallel Sparse Direct Solver
(Symmetric Positive Definite System)
Kincho H. Law
Professor of Civil and Environmental Engineering
Stanford University
Email: [email protected]
May 16, 2003
Parallel Computers
(Machines with multiple processors)
[Diagram: several processors, each with its own cache, local memory, and I/O, connected to a global shared memory.]
Focus of Discussion:
• Processor Assignment
• Sparse Numerical Factorization
(David Mackay, 1992)
Processor Assignment (Naïve Strategy)
[Figure: the 5x5 grid example (global node numbers 1-25), the structure of its matrix factor with fill-in entries marked F, and the elimination tree of the matrix factor ("Tree Structure of Matrix Factor"). Each node carries the processor number (0-3) assigned to it; under the naïve strategy the last-numbered separator columns (nodes 21-25) are all assigned to processor 0.]
Processor Assignment
[Figure: the same grid, matrix factor, and elimination tree as above, with an improved assignment that distributes the last-numbered separator columns (nodes 21-25) among processors 0-3 rather than placing them all on processor 0.]
Matrix Factorization

Block (partitioned) form:

K = [ K_11  K_1* ]  =  [ K_11   0 ] [ D_11^{-1}  0 ] [ K_11  K_1* ]
    [ K_*1  K_** ]     [ K_*1   I ] [ 0          H ] [ 0     I    ]

where D_11 = K_11 and H = K_** - K_*1 D_11^{-1} K_1*, i.e., coefficient by coefficient,

H_ij = H_ij - K_ik D_kk^{-1} K_kj

Column-by-column recurrence for K = L D L^T:

D_kk = K_kk - \sum_{i=1}^{k-1} L_ki D_ii L_ki^T

L_jk = ( K_jk - \sum_{i=1}^{k-1} L_ji D_ii L_ki ) D_kk^{-1}

[Figure: two panels showing, while factoring column k, the previously computed columns that are accessed (the entries L_ji and L_ki); one panel marks the coefficients that are modified, the other the coefficients that are not modified.]
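A minimal dense NumPy sketch of the column-by-column LDL^T recurrence above; it is for illustration only (the solver itself works on sparse column blocks), and the function name ldlt is ours rather than the solver's API.

import numpy as np

def ldlt(K):
    """Column-by-column LDL^T following the recurrence above:
    returns unit lower-triangular L and diagonal D with K = L diag(D) L^T."""
    n = K.shape[0]
    L, D = np.eye(n), np.zeros(n)
    for k in range(n):
        # D_kk = K_kk - sum_{i<k} L_ki D_ii L_ki
        D[k] = K[k, k] - np.sum(L[k, :k] ** 2 * D[:k])
        for j in range(k + 1, n):
            # L_jk = (K_jk - sum_{i<k} L_ji D_ii L_ki) / D_kk
            L[j, k] = (K[j, k] - np.sum(L[j, :k] * D[:k] * L[k, :k])) / D[k]
    return L, D

# quick check on a random symmetric positive definite matrix
A = np.random.default_rng(0).standard_normal((6, 6))
K = A @ A.T + 6.0 * np.eye(6)
L, D = ldlt(K)
assert np.allclose(L @ np.diag(D) @ L.T, K)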
Numerical Factorization : Phase I (sequential)
For each column block assigned to a processor
1. Perform (profile) factorization on principal
block submatrix
2. Update row segments by Forward Solve
3. Form dot products among row segments
a. Fan out dot products to update
coefficients in same processor
b. Save dot products in buffers to be
fanned in to other processors
K^{(i)} = L^{(i)} D^{(i)} (L^{(i)})^T                      (factor the principal block of column block i)

L_{j*}^T = (D^{(i)})^{-1} (L^{(i)})^{-1} K_{j*}^T          (forward solve to update row segment j)
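A sequential sketch of the three Phase I steps for one column block owned by a processor, assuming a dense principal block and row segments stored in a dictionary keyed by global row number; the function names and the owned, local_update, and send_buffer arguments are illustrative placeholders, not the solver's actual data structures.

import numpy as np

def factor_column_block(K_bb, K_rows):
    """Phase I, steps 1-2 for one column block.
    K_bb: dense symmetric principal block; K_rows: {global row r: row segment K_rb}.
    Returns L_bb (unit lower-triangular), D_bb (diagonal), and the factored
    row segments L_rb with L_rb^T = D_bb^{-1} L_bb^{-1} K_rb^T."""
    n = K_bb.shape[0]
    A = K_bb.astype(float).copy()
    L_bb, D_bb = np.eye(n), np.zeros(n)
    for k in range(n):                              # 1. (profile) LDL^T of the principal block
        D_bb[k] = A[k, k]
        L_bb[k + 1:, k] = A[k + 1:, k] / D_bb[k]
        A[k + 1:, k + 1:] -= np.outer(L_bb[k + 1:, k], A[k, k + 1:])
    L_rows = {}
    for r, K_rb in K_rows.items():                  # 2. update row segments by forward solve
        w = K_rb.astype(float).copy()
        for k in range(n):
            w[k + 1:] -= w[k] * L_bb[k + 1:, k]     # w = L_bb^{-1} K_rb^T
        L_rows[r] = w / D_bb                        # L_rb^T = D_bb^{-1} w
    return L_bb, D_bb, L_rows

def row_segment_dot_products(L_rows, D_bb, owned, local_update, send_buffer):
    """Phase I, step 3: dot products among row segments.  The term
    L_rb diag(D_bb) L_sb^T must later be subtracted from coefficient (r, s)."""
    rows = sorted(L_rows)
    for a, r in enumerate(rows):
        for s in rows[a:]:
            dot = (L_rows[r] * D_bb) @ L_rows[s]
            if owned(r, s):                          # 3a. fan out within this processor
                local_update[(r, s)] = local_update.get((r, s), 0.0) + dot
            else:                                    # 3b. buffer, to be fanned in elsewhere
                send_buffer[(r, s)] = send_buffer.get((r, s), 0.0) + dot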
Numerical Factorization : Phase II (parallel)
For each column block shared by processors
1. Fan in dot products and update principal block and
row segments
2. Parallel factorization of column block:
a. Factor current row (in principal block) within
processor
b. Broadcast row to other processors
c. Receive row from other processors
d. Perform dot products and update coefficients (in
principal block) within processor
e. Perform dot products and update row segments
in processor
3. Form dot products among row segments
a. Form dot products between current row and
other row segments in the processor
Among the shared processors,
a. Circulate current row segment to next processor
b. Receive row segment from neighboring processor
c. Form dot products between received row
segment and row segments in processor
(Dot products are fanned out to update coefficients in the
same processor but saved in a buffer to be fanned in
to other processors)
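A minimal mpi4py sketch of step 2 above (parallel factorization of a principal block shared by several processors), assuming full dense storage of the block and a simple row wrap-map; it illustrates the factor/broadcast/receive/update pattern only, not the solver's actual data structure or communication scheme.

from mpi4py import MPI
import numpy as np

def shared_block_ldlt(comm, K):
    """In-place LDL^T of a dense symmetric block K (full storage) shared by the
    processors in comm, with row k owned by rank (k % size).  On return, the
    diagonal holds D and the strict lower triangle of each owned row holds L."""
    rank, size = comm.Get_rank(), comm.Get_size()
    n = K.shape[0]
    for k in range(n):
        owner = k % size
        # the owner's row k is already fully eliminated; its diagonal entry is D_kk
        row_k = K[k, :].copy() if rank == owner else np.empty(n)
        comm.Bcast(row_k, root=owner)                 # 2b/2c: broadcast / receive row k
        d_kk = row_k[k]
        for j in range(k + 1, n):
            if j % size != rank:
                continue                              # only touch locally owned rows
            l_jk = K[j, k] / d_kk                     # L_jk = K_jk / D_kk
            K[j, k] = l_jk                            # store the multiplier (column of L)
            K[j, k + 1:] -= l_jk * row_k[k + 1:]      # update remaining coefficients of row j
    return K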
Parallel Forward Solve (Lz = y) : Phase I (sequential)
For each column block (b) assigned to a processor
1. Perform a (profile) forward solve with the principal
submatrix in the column block, i.e., compute

z_*^{(b)} = (L_{**}^{(b)})^{-1} y_*^{(b)}

2. Modify solution obtained in Step 1 with row segments
(within the processor)
a. Update solution coefficients assigned to processor
b. Store solution coefficients not assigned to processor
in buffer to be sent to other processors

z_i = y_i - L_{i*}^{(b)} z_*^{(b)}
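A sequential sketch of the Phase I forward solve for one owned column block, mirroring the factorization sketch above; owned and send_buffer are again illustrative placeholders rather than the solver's actual API.

import numpy as np

def forward_solve_phase1(L_bb, y_b, L_rows, y, owned, send_buffer):
    """1. z_b = L_bb^{-1} y_b (profile forward solve with the principal block).
    2. Modify the remaining right-hand side with the row segments L_rb."""
    n = len(y_b)
    z_b = y_b.astype(float).copy()
    for i in range(n):                               # unit lower-triangular forward solve
        z_b[i] -= L_bb[i, :i] @ z_b[:i]
    for r, L_rb in L_rows.items():
        contrib = L_rb @ z_b
        if owned(r):
            y[r] -= contrib                          # 2a: coefficient assigned to this processor
        else:                                        # 2b: buffer, to be sent to other processors
            send_buffer[r] = send_buffer.get(r, 0.0) + contrib
    return z_b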
Parallel Forward Solve (Lz = y) : Phase II (parallel)
For each column block (b) shared by processors
1. Update solution vector in column block
a. Broadcast solution coefficients to other
processors
b. Receive solution coefficients and update
solution vector:  z_i = y_i - L_{i*}^{(b)} z_*^{(b)}
2. Parallel forward solve with principal submatrix
a. Broadcast z_i to other processors
b. Receive z_i
c. Update solution coefficients in column
block:  z_j = z_j - L_{ji}^{(b)} z_i
3. Multiply solution coefficients with row
segments
a. Update solution coefficients in same
processor
b. Store solution coefficients not assigned
to processor in the buffer to be sent to
other processors
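A minimal mpi4py sketch of step 2 above (the parallel forward solve with a principal block shared by several processors), assuming the same dense storage and row wrap-map as the factorization sketch; the broadcast of each z_i is the step 2a/2b exchange.

from mpi4py import MPI
import numpy as np

def shared_block_forward_solve(comm, L, y):
    """Solve L z = y for a unit lower-triangular principal block shared by the
    processors in comm; row k is owned by rank (k % size).  Only owned rows of
    L need be valid on each rank; every rank returns the full z."""
    rank, size = comm.Get_rank(), comm.Get_size()
    n = len(y)
    z = np.array(y, dtype=float)
    buf = np.empty(1)
    for k in range(n):
        owner = k % size
        if rank == owner:
            buf[0] = z[k]                     # unit diagonal: z_k is already final
        comm.Bcast(buf, root=owner)           # 2a/2b: broadcast / receive z_k
        z[k] = buf[0]
        for j in range(k + 1, n):
            if j % size == rank:
                z[j] -= L[j, k] * buf[0]      # 2c: update owned coefficients
    return z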
Parallel Backward Solve (D L^T x = z)
Reverse of Parallel Forward Solve
1. Phase I (parallel) : Compute
solution coefficients in column
blocks shared by processors
2. Phase II (sequential) : Compute
solution coefficients in column
blocks within processor
Parallel Finite Element Program - SPMD
(David Mackay 1992, Made Suarjana 1994, Narayana Aluru 1995)
[Diagram: a User Interface (UI) (mesh generation, B.C., etc.) and post-processing are tied, through a Coarse Grain Choreographer and a nonlinear and/or adaptive solver (domain decomposition, data structure), to identical compute nodes (Node 0 ... Node p). Each node runs the same program: a Fine Grain Choreographer, an element library with element characteristics, matrix and RHS formation and assembly, and a linear solver (direct or iterative). The nodes are connected through communication routines.]
Element Assignment and Stiffness Generation
[Figure: the 5x5 grid example with global node numbers and, at each node, the list of processors contributing stiffness coefficients to that node, for two element-to-processor assignments:
(a) Assignment of elements without duplication - communication is required for assembly.
(b) Assignment of elements with duplication - communication is not required for assembly.
A companion panel repeats the node-to-processor assignment (processors 0-3) used by the solver.]
Preliminary Results (MPI Implementation)
(joint work with Ahmed Elgamal, Jinchi Lu, Jun Peng)
Blue Horizon, IBM’s SMP Power3 parallel computer
• San Diego Supercomputer Center
• 8 (375 MHz) processors per node sharing 4
Gbytes memory
Longhorn, IBM Power4 parallel computer
• University of Texas, Austin
• 4 (1.3 GHz) processors per node sharing 8
Gbytes memory
Junior, Sun's Enterprise E6500 parallel computer
• 8 UltraSPARC (400 MHz) processors sharing
16 Gbytes memory
A 25 x 25 x 25 Grid Problem
25 x 25 x 25 grid; 67,600 equations; times in seconds on Blue Horizon
Proc Init. Form matrix Form RHS Num Fact F/B Solve Total
1 28.61 24.46 10.53 623.33 1.66 689.30
2 57.16 13.38 5.09 327.41 1.06 414.10
4 50.59 7.83 2.38 170.32 0.62 231.20
8 39.38 4.84 1.26 84.64 0.37 134.48
16 37.83 3.52 0.61 47.27 0.26 89.83
32 36.45 2.76 0.31 29.05 0.25 69.10
64 36.85 2.45 0.23 22.87 0.67 66.32
Soil-pile Interaction Model (3x3 pile group)
130,020 equations; 29,120 elements; 96,845,738 nonzeros in factor L
Solution time in seconds for a single step (speedup measured relative to 2 processors)

Processors   Numerical Factorization   Forward/Backward Solve   Solution Phase   Total Exec. Time
 2                332.67                     1.41                    370.42           455.91
 4                166.81                     0.78                    187.72           286.97
 8                 85.20                     0.45                     97.71           186.67
16                 50.73                     0.29                     59.39           147.55
32                 27.80                     0.23                     34.61           124.3
64                 18.41                     0.26                     24.40           116.21
Stone Column Centrifuge Test Model
364,800 equations; 340,514,320 nonzeros in factor L
Solution time in seconds for a single step (speedup measured relative to 4 processors)

Processors   Numerical Factorization   Forward/Backward Solve   Solution Phase   Total Exec. Time
 4               1246.08                     2.76                  1306.87          1769.00
 8                665.66                     1.56                   702.09          1150.17
16                354.99                     0.98                   378.35           841.38
32                208.90                     0.67                   225.93           668.02
64                125.05                     0.66                   142.33           583.98
Timing results for the stone column model on Blue Horizon
Number of equations = 364,800; Total nonzeros = 332,634,544
(Note: different ordering scheme than previous slide)

Timing in seconds                                   32 procs    64 procs   128 procs
Initialization phase:                                  435.2       418.6       425.5
  Geometry input                                       273.4       272.8       272.9
  Preprocessing                                        161.4       145.5       152.4
  Soil model initialization                              0.4         0.3         0.3
Solution phase:                                     11,835.7     7,643.8     5,381.0
  Formation of matrix                                  373.5       247.2       203.6
  Formation of RHS                                   1,671.6       889.4       458.9
  Updating stresses                                    175.6        92.5        46.8
  Numerical factorization                            7,902.3     4,928.5     3,192.2
  Forward and back solve                               510.7       475.6       594.3
  Calculation of element output                          0.0         0.0         0.0
  Subtotal (of the above six)                       10,822.9     6,848.1     4,734.4
Total execution time (Init. phase + Solution phase) 12,298.5     8,080.4     5,819.5
Number of time steps: 35 (at 0.01 second intervals)
Number of numerical factorizations performed             45          52          52
Number of forward & back solves performed                843         877         876
Time per numerical factorization                    175.6062    94.77923    61.38769
Time per forward and backward solve                  0.60586    0.542315    0.678379
Simulation of Shallow Foundation Settlement
[Figures: shallow foundation in a 3D half-space (unit: m), modeled with 20-8 noded brick elements (solid and fluid nodes); base input motion, acceleration (g) versus time (sec); final deformed mesh of the 4480-element mesh (scale factor 10; blue zone - densified area, green zone - liquefiable sand).]
Timing Results for the Shallow Foundation Settlement Problem
Number of Equations: 67,716; Number of Elements: 4,480 (20-8 node brick elements)
Timing collected on DataStar (IBM SP machine at San Diego Supercomputer Center)

Timing in Seconds                                   4 procs     8 procs    16 procs    32 procs
Initialization Phase
  Geometry and Mesh Input                             24.08
  Adjacency Structure Set Up                          17.51
  Multi-Level Nested Dissection (METIS)                1.66
  Elimination Tree and Post-Ordering                   0.15
  Symbolic Factorization                               1.18
  Solver Set-Up: Inter-Processor Communication         0.84        1.20        2.24        3.85
  Solver Set-Up: Matrix Storage and Indexing          15.56        7.35        3.15        1.62
  Total Initialization Cost                           57.48       53.86       50.79       50.78
Solution Phase (1000 time steps, 10 seconds)
  LHS Matrix Formation (57)                          489.99      297.13      191.49      148.45
  RHS Formation (3201)                             3,978.64    2,026.54    1,036.14      582.32
  Update Stresses (1028)                             637.49      325.05      168.30       91.38
  Numerical Factorization (57)                     4,390.14    2,271.00    1,142.22      589.46
  Forward and Backward Solves (2145)                 553.55      357.76      281.53      330.24
  Total for Solution Cost                         10,049.81    5,277.48    2,819.68    1,741.85
Total Execution Time (including miscellaneous)    10,263.57    5,492.74    3,113.42    1,962.85
Performance Measurements
[Plots: total execution time (sec) versus number of processors; speedup factor (relative to 4 processors) for the total execution time; and factorization-phase speedup (relative to 4 processors) versus number of processors.]
[Plot: foundation vertical displacement (m) versus time (sec) for the 75-, 500-, 960-, and 4480-element meshes.]
Forward Solve versus Matrix-Vector Multiplication
(Principal Submatrix Block shared among processors)

Lz = y   <=>   z = L^{-1} y

L = L_1 L_2 ... L_n ,     L^{-1} = L_n^{-1} ... L_2^{-1} L_1^{-1}

where L_i is the elementary unit lower-triangular factor holding column i of L,

L_i = [ I  0       0 ]          L_i^{-1} = [ I  0        0 ]
      [ 0  1       0 ]                     [ 0  1        0 ]
      [ 0  l_{*i}  I ]                     [ 0  -l_{*i}  I ]

[Figure: nonzero patterns contrasting the data-dependent forward solve Lz = y with the data-independent matrix-vector multiplication z = L^{-1} y.]
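A small NumPy check of the identity above: the forward solve is a chain of dependent steps, while z = L^{-1} y is a single (easily parallelized) matrix-vector product, with L^{-1} obtained as the product of the elementary inverses.

import numpy as np

rng = np.random.default_rng(0)
n = 6
L = np.tril(rng.standard_normal((n, n)), -1) + np.eye(n)   # unit lower triangular
y = rng.standard_normal(n)

# forward solve Lz = y (sequential, data dependent)
z = y.copy()
for i in range(n):
    z[i] -= L[i, :i] @ z[:i]

# L^{-1} = L_n^{-1} ... L_2^{-1} L_1^{-1}, where L_i^{-1} is L_i with the
# sub-diagonal part of column i negated
Linv = np.eye(n)
for i in range(n - 1, -1, -1):
    Li_inv = np.eye(n)
    Li_inv[i + 1:, i] = -L[i + 1:, i]
    Linv = Linv @ Li_inv

assert np.allclose(Linv, np.linalg.inv(L))
assert np.allclose(Linv @ y, z)        # the matrix-vector multiply reproduces the forward solve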
Computing Inverse of Matrix Factor L

Definition:                (L^{(i)})^{-1} = L_i^{-1} L_{i-1}^{-1} ... L_1^{-1} = L_i^{-1} (L^{(i-1)})^{-1}

Form row i of L:           L_{i*}^T = (D^{(i-1)})^{-1} (L^{(i-1)})^{-1} K_{i*}^T

Update diagonal:           D_ii = D_ii - \sum_{j=1}^{i-1} L_ij D_jj L_ij^T

Set:                       (L_i^{-1})_{i*} = -L_{i*} ,   (L_i^{-1})_{ii} = 1

Compute row i of L^{-1}:   (L^{-1})_{i*} = (L_i^{-1})_{i*} (L^{(i-1)})^{-1} = -L_{i*} (L^{(i-1)})^{-1}

• No increase in communication time
• Matrix-vector multiplications
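A small dense sketch of the row-by-row construction of L^{-1} read from the recurrence above; in the solver this would be applied to the principal block of each shared column block, with no extra communication.

import numpy as np

def partial_inverse(L):
    """Build L^{-1} one row at a time: row i of L^{-1} is -L_{i*} (L^{(i-1)})^{-1},
    with a unit diagonal."""
    n = L.shape[0]
    Linv = np.eye(n)
    for i in range(1, n):
        Linv[i, :i] = -L[i, :i] @ Linv[:i, :i]
    return Linv

L = np.tril(np.random.default_rng(1).standard_normal((5, 5)), -1) + np.eye(5)
assert np.allclose(partial_inverse(L), np.linalg.inv(L))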
Parallel Lanczos Algorithm
(for the generalized eigenproblem K x = lambda M x)

1. Choose a shift sigma
2. Factor K_sigma = K - sigma M = L D L^T            <- Parallel Factorization
3. Choose initial starting vector r
4. p = M r
5. beta_1 = (r^T p)^{1/2}
6. q_1 = r / beta_1
7. p = p / beta_1
8. FOR j = 1, 2, ... DO
9.     r = K_sigma^{-1} M q_j = K_sigma^{-1} p       <- Parallel Forward and Backward Solution,
                                                        or Parallel Matrix-Vector Multiplications
10.    r = r - beta_j q_{j-1}
11.    alpha_j = p^T r                               <- Dot products: form locally (parallel),
                                                        sum globally (broadcast)
12.    r = r - alpha_j q_j
13.    p = M r                                       <- Parallel Matrix-Vector Multiplication
14.    beta_{j+1} = (r^T p)^{1/2}                    <- Dot product
15.    q_{j+1} = r / beta_{j+1}
16.    p = p / beta_{j+1}
17. ENDFOR

(Other Operations: Re-orthogonalization, Tridiagonal Eigensolves, Ritz Vector Formation)
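A small dense NumPy sketch of the Lanczos loop above, marking where the parallel kernels (factorization, forward/backward solves or matrix-vector multiplies, and dot products) plug in; the problem size, the shift, and the dense inverse standing in for the LDL^T factor are illustrative only, and re-orthogonalization is omitted.

import numpy as np

rng = np.random.default_rng(0)
n, n_steps, sigma = 50, 20, 0.0
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)                 # SPD stiffness matrix
M = np.diag(rng.uniform(1.0, 2.0, n))       # SPD (lumped) mass matrix

K_sigma = K - sigma * M                     # step 2: factor K_sigma (the dense inverse
Kinv = np.linalg.inv(K_sigma)               # here stands in for the parallel LDL^T factor)

r = rng.standard_normal(n)                  # step 3: starting vector
p = M @ r                                   # step 4: parallel matrix-vector multiply
beta = np.sqrt(r @ p)                       # step 5: local dot product + global sum
q_prev, q = np.zeros(n), r / beta           # step 6
p = p / beta                                # step 7
alphas, betas = [], []
for j in range(n_steps):                    # steps 8-17
    r = Kinv @ p                            # step 9: forward/backward solve, or matrix-
                                            #         vector multiply with the partial inverse
    r -= beta * q_prev                      # step 10
    alpha = p @ r                           # step 11: dot product
    r -= alpha * q                          # step 12
    p = M @ r                               # step 13: parallel matrix-vector multiply
    beta = np.sqrt(r @ p)                   # step 14: dot product
    q_prev, q = q, r / beta                 # step 15
    p = p / beta                            # step 16
    alphas.append(alpha)
    betas.append(beta)

# Ritz values of the tridiagonal T approximate 1/(lambda - sigma) for K x = lambda M x
T = np.diag(alphas) + np.diag(betas[:-1], 1) + np.diag(betas[:-1], -1)
theta = np.linalg.eigvalsh(T)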
Example Results: iPSC II, PVM Implementation
Structural Dome Model (16,026 equations) : 40 Lanczos Steps, Time in seconds

                             Factorization (without partial inverse)    Factorization (with partial inverse)
No. of Processors                4      8      16     32                    4      8      16     32
Basic Lanczos Steps:
  Form Shift                    0.79   0.46   0.27   0.18                  0.79   0.46   0.27   0.18
  Factor K_sigma               14.91   9.13   6.26   5.06                 15.14   9.34   6.39   5.20
  Data Initialization           0.15   0.08   0.04   0.02                  0.15   0.08   0.04   0.02
  Vector Solution              16.22  10.22   8.10   8.26                 15.55   9.29   6.58   5.83
  Form alpha, beta and r        5.96   3.15   1.93   0.97                  5.96   3.24   1.68   0.97
Miscellaneous Steps:
  Re-orthogonalization          2.40   1.44   0.78   0.46                  2.70   1.44   0.77   0.46
  Tridiagonal Eigensolves       1.20   1.20   1.20   1.20                  1.20   1.20   1.20   1.20
  Form Ritz Vectors             1.04   0.62   0.40   0.29                  1.04   0.62   0.40   0.29
Total Solution Time            43.08  26.35  18.99  16.45                 42.63  25.72  17.35  14.63
Example Results: Paragon Machine, PVM Implementation
Finite Element Grid Models: 40 Lanczos Steps, Time in Seconds

No. of Proc.    (without inverse)    (with inverse)
180x180 mesh (65,514 equations)
  8                 63.20                60.44
 16                 35.08                32.33
 32                 22.38                18.46
 64                 17.07                11.87
128                 15.62                 8.84
200x200 mesh (80,794 equations)
 16                 43.58                40.14
 32                 26.47                22.32
 64                 20.09                14.16
128                 18.07                 9.91
220x220 mesh (97,674 equations)
 16                 51.99                48.48
 32                 31.48                26.88
 64                 23.18                16.39
128                 20.99                11.80
General Remarks
Sparse direct solution methods can be efficient in a (modest-size) parallel
environment
• Need >2,500 equations per processor for good performance (general
experience with Intel's Hypercube, Paragon, ..., IBM Power
machines, Sun Enterprise)
• Memory bound - the more memory per processor, the better the
utilization of the sparse solver
Other Parallel Sparse Solvers: Parallel SuperLU, MUMPS, etc.
Problems with multiple RHS (eigensolvers, modified Newton, etc.)
• Compute the inverse of the principal block matrix factor
• Increases local computation, but keeps the same communication overhead
• Transforms the data-dependent parallel forward/backward solves into
data-independent matrix-vector multiplications
Distributed environment
• Task farming model (De Santiago 1996) -- fault tolerance and recovery
• Internet-based "collaborative" component-based model (Peng 2002)