AA220/CS238
Parallel Methods in Numerical Analysis
A Parallel Sparse Direct Solver
(Symmetric Positive Definite System)
Kincho H. Law
Professor of Civil and Environmental Engineering
Stanford University
Email: [email protected]
May 16, 2003
Parallel Computers
(Machines with multiple processors)
[Diagram: several processors, each with its own cache, local memory, and I/O, connected to a global shared memory.]
Focus of Discussion:
• Processor Assignment
• Sparse Numerical Factorization
(David Mackay, 1992)
Processor Assignment (Naïve Strategy)
[Figure: the 5x5 grid example (global node numbers 1-25), the structure of its matrix factor with fill-in entries marked F, and the elimination tree of the matrix factor ("Tree Structure of Matrix Factor"). Each node carries the processor number (0-3) assigned to it; under the naïve strategy the last-numbered separator columns (nodes 21-25) are all assigned to processor 0.]
Processor Assignment
[Figure: the same grid, matrix factor, and elimination tree as above, with an improved assignment that distributes the last-numbered separator columns (nodes 21-25) among processors 0-3 rather than placing them all on processor 0.]
Matrix Factorization

Block (partitioned) form:

K = [ K_11  K_1* ]  =  [ K_11   0 ] [ D_11^{-1}  0 ] [ K_11  K_1* ]
    [ K_*1  K_** ]     [ K_*1   I ] [ 0          H ] [ 0     I    ]

where D_11 = K_11 and H = K_** - K_*1 D_11^{-1} K_1*, i.e., coefficient by coefficient,

H_ij = H_ij - K_ik D_kk^{-1} K_kj

Column-by-column recurrence for K = L D L^T:

D_kk = K_kk - \sum_{i=1}^{k-1} L_ki D_ii L_ki^T

L_jk = ( K_jk - \sum_{i=1}^{k-1} L_ji D_ii L_ki ) D_kk^{-1}

[Figure: two panels showing, while factoring column k, the previously computed columns that are accessed (the entries L_ji and L_ki); one panel marks the coefficients that are modified, the other the coefficients that are not modified.]
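A minimal dense NumPy sketch of the column-by-column LDL^T recurrence above; it is for illustration only (the solver itself works on sparse column blocks), and the function name ldlt is ours rather than the solver's API.

import numpy as np

def ldlt(K):
    """Column-by-column LDL^T following the recurrence above:
    returns unit lower-triangular L and diagonal D with K = L diag(D) L^T."""
    n = K.shape[0]
    L, D = np.eye(n), np.zeros(n)
    for k in range(n):
        # D_kk = K_kk - sum_{i<k} L_ki D_ii L_ki
        D[k] = K[k, k] - np.sum(L[k, :k] ** 2 * D[:k])
        for j in range(k + 1, n):
            # L_jk = (K_jk - sum_{i<k} L_ji D_ii L_ki) / D_kk
            L[j, k] = (K[j, k] - np.sum(L[j, :k] * D[:k] * L[k, :k])) / D[k]
    return L, D

# quick check on a random symmetric positive definite matrix
A = np.random.default_rng(0).standard_normal((6, 6))
K = A @ A.T + 6.0 * np.eye(6)
L, D = ldlt(K)
assert np.allclose(L @ np.diag(D) @ L.T, K)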
Numerical Factorization : Phase I (sequential)
For each column block assigned to a processor
1. Perform (profile) factorization on principal
block submatrix
2. Update row segments by Forward Solve
3. Form dot products among row segments
a. Fan out dot products to update
coefficients in same processor
b. Save dot products in buffers to be
fanned in to other processors
K^{(i)} = L^{(i)} D^{(i)} (L^{(i)})^T                      (factor the principal block of column block i)

L_{j*}^T = (D^{(i)})^{-1} (L^{(i)})^{-1} K_{j*}^T          (forward solve to update row segment j)
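A sequential sketch of the three Phase I steps for one column block owned by a processor, assuming a dense principal block and row segments stored in a dictionary keyed by global row number; the function names and the owned, local_update, and send_buffer arguments are illustrative placeholders, not the solver's actual data structures.

import numpy as np

def factor_column_block(K_bb, K_rows):
    """Phase I, steps 1-2 for one column block.
    K_bb: dense symmetric principal block; K_rows: {global row r: row segment K_rb}.
    Returns L_bb (unit lower-triangular), D_bb (diagonal), and the factored
    row segments L_rb with L_rb^T = D_bb^{-1} L_bb^{-1} K_rb^T."""
    n = K_bb.shape[0]
    A = K_bb.astype(float).copy()
    L_bb, D_bb = np.eye(n), np.zeros(n)
    for k in range(n):                              # 1. (profile) LDL^T of the principal block
        D_bb[k] = A[k, k]
        L_bb[k + 1:, k] = A[k + 1:, k] / D_bb[k]
        A[k + 1:, k + 1:] -= np.outer(L_bb[k + 1:, k], A[k, k + 1:])
    L_rows = {}
    for r, K_rb in K_rows.items():                  # 2. update row segments by forward solve
        w = K_rb.astype(float).copy()
        for k in range(n):
            w[k + 1:] -= w[k] * L_bb[k + 1:, k]     # w = L_bb^{-1} K_rb^T
        L_rows[r] = w / D_bb                        # L_rb^T = D_bb^{-1} w
    return L_bb, D_bb, L_rows

def row_segment_dot_products(L_rows, D_bb, owned, local_update, send_buffer):
    """Phase I, step 3: dot products among row segments.  The term
    L_rb diag(D_bb) L_sb^T must later be subtracted from coefficient (r, s)."""
    rows = sorted(L_rows)
    for a, r in enumerate(rows):
        for s in rows[a:]:
            dot = (L_rows[r] * D_bb) @ L_rows[s]
            if owned(r, s):                          # 3a. fan out within this processor
                local_update[(r, s)] = local_update.get((r, s), 0.0) + dot
            else:                                    # 3b. buffer, to be fanned in elsewhere
                send_buffer[(r, s)] = send_buffer.get((r, s), 0.0) + dot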
Numerical Factorization : Phase II (parallel)
For each column block shared by processors
1. Fan in dot products and update principal block and
row segments
2. Parallel factorization of column block:
a. Factor current row (in principal block) within
processor
b. Broadcast row to other processors
c. Receive row from other processors
d. Perform dot products and update coefficients (in
principal block) within processor
e. Perform dot products and update row segments
in processor
3. Form dot products among row segments
a. Form dot products between current row and
other row segments in the processor
Among the shared processors,
a. Circulate current row segment to next processor
b. Receive row segment from neighboring processor
c. Form dot products between received row
segment and row segments in processor
(Dot products are fanned out to update coefficients in the
same processor but saved in a buffer to be fanned in
to other processors)
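A minimal mpi4py sketch of step 2 above (parallel factorization of a principal block shared by several processors), assuming full dense storage of the block and a simple row wrap-map; it illustrates the factor/broadcast/receive/update pattern only, not the solver's actual data structure or communication scheme.

from mpi4py import MPI
import numpy as np

def shared_block_ldlt(comm, K):
    """In-place LDL^T of a dense symmetric block K (full storage) shared by the
    processors in comm, with row k owned by rank (k % size).  On return, the
    diagonal holds D and the strict lower triangle of each owned row holds L."""
    rank, size = comm.Get_rank(), comm.Get_size()
    n = K.shape[0]
    for k in range(n):
        owner = k % size
        # the owner's row k is already fully eliminated; its diagonal entry is D_kk
        row_k = K[k, :].copy() if rank == owner else np.empty(n)
        comm.Bcast(row_k, root=owner)                 # 2b/2c: broadcast / receive row k
        d_kk = row_k[k]
        for j in range(k + 1, n):
            if j % size != rank:
                continue                              # only touch locally owned rows
            l_jk = K[j, k] / d_kk                     # L_jk = K_jk / D_kk
            K[j, k] = l_jk                            # store the multiplier (column of L)
            K[j, k + 1:] -= l_jk * row_k[k + 1:]      # update remaining coefficients of row j
    return K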
Parallel Forward Solve (Lz = y) : Phase I (sequential)
For each column block (b) assigned to a processor
1. Perform a (profile) forward solve with the principal
submatrix in the column block, i.e., compute

z_*^{(b)} = (L_{**}^{(b)})^{-1} y_*^{(b)}

2. Modify solution obtained in Step 1 with row segments
(within the processor)
a. Update solution coefficients assigned to processor
b. Store solution coefficients not assigned to processor
in buffer to be sent to other processors

z_i = y_i - L_{i*}^{(b)} z_*^{(b)}
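A sequential sketch of the Phase I forward solve for one owned column block, mirroring the factorization sketch above; owned and send_buffer are again illustrative placeholders rather than the solver's actual API.

import numpy as np

def forward_solve_phase1(L_bb, y_b, L_rows, y, owned, send_buffer):
    """1. z_b = L_bb^{-1} y_b (profile forward solve with the principal block).
    2. Modify the remaining right-hand side with the row segments L_rb."""
    n = len(y_b)
    z_b = y_b.astype(float).copy()
    for i in range(n):                               # unit lower-triangular forward solve
        z_b[i] -= L_bb[i, :i] @ z_b[:i]
    for r, L_rb in L_rows.items():
        contrib = L_rb @ z_b
        if owned(r):
            y[r] -= contrib                          # 2a: coefficient assigned to this processor
        else:                                        # 2b: buffer, to be sent to other processors
            send_buffer[r] = send_buffer.get(r, 0.0) + contrib
    return z_b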
Parallel Forward Solve (Lz = y) : Phase II (parallel)
For each column block (b) shared by processors
1. Update solution vector in column block
a. Broadcast solution coefficients to other
processors
b. Receive solution coefficients and update
solution vector:  z_i = y_i - L_{i*}^{(b)} z_*^{(b)}
2. Parallel forward solve with principal submatrix
a. Broadcast z_i to other processors
b. Receive z_i
c. Update solution coefficients in column
block:  z_j = z_j - L_{ji}^{(b)} z_i
3. Multiply solution coefficients with row
segments
a. Update solution coefficients in same
processor
b. Store solution coefficients not assigned
to processor in the buffer to be sent to
other processors
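A minimal mpi4py sketch of step 2 above (the parallel forward solve with a principal block shared by several processors), assuming the same dense storage and row wrap-map as the factorization sketch; the broadcast of each z_i is the step 2a/2b exchange.

from mpi4py import MPI
import numpy as np

def shared_block_forward_solve(comm, L, y):
    """Solve L z = y for a unit lower-triangular principal block shared by the
    processors in comm; row k is owned by rank (k % size).  Only owned rows of
    L need be valid on each rank; every rank returns the full z."""
    rank, size = comm.Get_rank(), comm.Get_size()
    n = len(y)
    z = np.array(y, dtype=float)
    buf = np.empty(1)
    for k in range(n):
        owner = k % size
        if rank == owner:
            buf[0] = z[k]                     # unit diagonal: z_k is already final
        comm.Bcast(buf, root=owner)           # 2a/2b: broadcast / receive z_k
        z[k] = buf[0]
        for j in range(k + 1, n):
            if j % size == rank:
                z[j] -= L[j, k] * buf[0]      # 2c: update owned coefficients
    return z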
Parallel Backward Solve (D L^T x = z)
Reverse of Parallel Forward Solve
1. Phase I (parallel) : Compute
solution coefficients in column
blocks shared by processors
2. Phase II (sequential) : Compute
solution coefficients in column
blocks within processor
Parallel Finite Element Program - SPMD
(David Mackay 1992, Made Suarjana 1994, Narayana Aluru 1995)
[Diagram: a User Interface (UI) (mesh generation, B.C., etc.) and post-processing are tied, through a Coarse Grain Choreographer and a nonlinear and/or adaptive solver (domain decomposition, data structure), to identical compute nodes (Node 0 ... Node p). Each node runs the same program: a Fine Grain Choreographer, an element library with element characteristics, matrix and RHS formation and assembly, and a linear solver (direct or iterative). The nodes are connected through communication routines.]
Element Assignment and Stiffness Generation
[Figure: the 5x5 grid example with global node numbers and, at each node, the list of processors contributing stiffness coefficients to that node, for two element-to-processor assignments:
(a) Assignment of elements without duplication - communication is required for assembly.
(b) Assignment of elements with duplication - communication is not required for assembly.
A companion panel repeats the node-to-processor assignment (processors 0-3) used by the solver.]
Preliminary Results (MPI Implementation)
(joint work with Ahmed Elgamal, Jinchi Lu, Jun Peng)
Blue Horizon, IBM’s SMP Power3 parallel computer
• San Diego Supercomputer Center
• 8 (375 MHz) processors per node sharing 4
Gbytes memory
Longhorn, IBM Power4 parallel computer
• University of Texas, Austin
• 4 (1.3 GHz) processors per node sharing 8
Gbytes memory
Junior, Sun's Enterprise E6500 parallel computer
• 8 UltraSPARC (400 MHz) processors sharing
16 Gbytes memory
A 25 x 25 x 25 Grid Problem
25 x 25 x 25 grid; 67,600 equations; times in seconds on Blue Horizon
Proc Init. Form matrix Form RHS Num Fact F/B Solve Total
1 28.61 24.46 10.53 623.33 1.66 689.30
2 57.16 13.38 5.09 327.41 1.06 414.10
4 50.59 7.83 2.38 170.32 0.62 231.20
8 39.38 4.84 1.26 84.64 0.37 134.48
16 37.83 3.52 0.61 47.27 0.26 89.83
32 36.45 2.76 0.31 29.05 0.25 69.10
64 36.85 2.45 0.23 22.87 0.67 66.32
Soil-pile Interaction Model (3x3 pile group)
130,020 equations; 29,120 elements; 96,845,738 nonzeros in factor L
Solution time in seconds for a single step (speedup measured relative to 2 processors)

Processors   Numerical Factorization   Forward/Backward Solve   Solution Phase   Total Exec. Time
 2                332.67                     1.41                    370.42           455.91
 4                166.81                     0.78                    187.72           286.97
 8                 85.20                     0.45                     97.71           186.67
16                 50.73                     0.29                     59.39           147.55
32                 27.80                     0.23                     34.61           124.3
64                 18.41                     0.26                     24.40           116.21
Stone Column Centrifuge Test Model
364,800 equations; 340,514,320 nonzeros in factor L
Solution time in seconds for a single step (speedup measured relative to 4 processors)

Processors   Numerical Factorization   Forward/Backward Solve   Solution Phase   Total Exec. Time
 4               1246.08                     2.76                  1306.87          1769.00
 8                665.66                     1.56                   702.09          1150.17
16                354.99                     0.98                   378.35           841.38
32                208.90                     0.67                   225.93           668.02
64                125.05                     0.66                   142.33           583.98
Timing results for the stone column model on Blue Horizon
Number of equations = 364,800; Total nonzeros = 332,634,544
(Note: different ordering scheme than previous slide)

Timing in seconds                                   32 procs    64 procs   128 procs
Initialization phase:                                  435.2       418.6       425.5
  Geometry input                                       273.4       272.8       272.9
  Preprocessing                                        161.4       145.5       152.4
  Soil model initialization                              0.4         0.3         0.3
Solution phase:                                     11,835.7     7,643.8     5,381.0
  Formation of matrix                                  373.5       247.2       203.6
  Formation of RHS                                   1,671.6       889.4       458.9
  Updating stresses                                    175.6        92.5        46.8
  Numerical factorization                            7,902.3     4,928.5     3,192.2
  Forward and back solve                               510.7       475.6       594.3
  Calculation of element output                          0.0         0.0         0.0
  Subtotal (of the above six)                       10,822.9     6,848.1     4,734.4
Total execution time (Init. phase + Solution phase) 12,298.5     8,080.4     5,819.5
Number of time steps: 35 (at 0.01 second intervals)
Number of numerical factorizations performed             45          52          52
Number of forward & back solves performed                843         877         876
Time per numerical factorization                    175.6062    94.77923    61.38769
Time per forward and backward solve                  0.60586    0.542315    0.678379
Simulation of Shallow Foundation Settlement
[Figures: shallow foundation in a 3D half-space (unit: m), modeled with 20-8 noded brick elements (solid and fluid nodes); base input motion, acceleration (g) versus time (sec); final deformed mesh of the 4480-element mesh (scale factor 10; blue zone - densified area, green zone - liquefiable sand).]
Timing Results for the Shallow Foundation Settlement Problem
Number of Equations: 67,716; Number of Elements: 4,480 (20-8 node brick elements)
Timing collected on DataStar (IBM SP machine at San Diego Supercomputer Center)

Timing in Seconds                                   4 procs     8 procs    16 procs    32 procs
Initialization Phase
  Geometry and Mesh Input                             24.08
  Adjacency Structure Set Up                          17.51
  Multi-Level Nested Dissection (METIS)                1.66
  Elimination Tree and Post-Ordering                   0.15
  Symbolic Factorization                               1.18
  Solver Set-Up: Inter-Processor Communication         0.84        1.20        2.24        3.85
  Solver Set-Up: Matrix Storage and Indexing          15.56        7.35        3.15        1.62
  Total Initialization Cost                           57.48       53.86       50.79       50.78
Solution Phase (1000 time steps, 10 seconds)
  LHS Matrix Formation (57)                          489.99      297.13      191.49      148.45
  RHS Formation (3201)                             3,978.64    2,026.54    1,036.14      582.32
  Update Stresses (1028)                             637.49      325.05      168.30       91.38
  Numerical Factorization (57)                     4,390.14    2,271.00    1,142.22      589.46
  Forward and Backward Solves (2145)                 553.55      357.76      281.53      330.24
  Total for Solution Cost                         10,049.81    5,277.48    2,819.68    1,741.85
Total Execution Time (including miscellaneous)    10,263.57    5,492.74    3,113.42    1,962.85
Performance Measurements
[Plots: total execution time (sec) versus number of processors; speedup factor (relative to 4 processors) for the total execution time; and factorization-phase speedup (relative to 4 processors) versus number of processors.]
[Plot: foundation vertical displacement (m) versus time (sec) for the 75-, 500-, 960-, and 4480-element meshes.]
Forward Solve versus Matrix-Vector Multiplication
(Principal Submatrix Block shared among processors)

Lz = y   <=>   z = L^{-1} y

L = L_1 L_2 ... L_n ,     L^{-1} = L_n^{-1} ... L_2^{-1} L_1^{-1}

where L_i is the elementary unit lower-triangular factor holding column i of L,

L_i = [ I  0       0 ]          L_i^{-1} = [ I  0        0 ]
      [ 0  1       0 ]                     [ 0  1        0 ]
      [ 0  l_{*i}  I ]                     [ 0  -l_{*i}  I ]

[Figure: nonzero patterns contrasting the data-dependent forward solve Lz = y with the data-independent matrix-vector multiplication z = L^{-1} y.]
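A small NumPy check of the identity above: the forward solve is a chain of dependent steps, while z = L^{-1} y is a single (easily parallelized) matrix-vector product, with L^{-1} obtained as the product of the elementary inverses.

import numpy as np

rng = np.random.default_rng(0)
n = 6
L = np.tril(rng.standard_normal((n, n)), -1) + np.eye(n)   # unit lower triangular
y = rng.standard_normal(n)

# forward solve Lz = y (sequential, data dependent)
z = y.copy()
for i in range(n):
    z[i] -= L[i, :i] @ z[:i]

# L^{-1} = L_n^{-1} ... L_2^{-1} L_1^{-1}, where L_i^{-1} is L_i with the
# sub-diagonal part of column i negated
Linv = np.eye(n)
for i in range(n - 1, -1, -1):
    Li_inv = np.eye(n)
    Li_inv[i + 1:, i] = -L[i + 1:, i]
    Linv = Linv @ Li_inv

assert np.allclose(Linv, np.linalg.inv(L))
assert np.allclose(Linv @ y, z)        # the matrix-vector multiply reproduces the forward solve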
Computing Inverse of Matrix Factor L

Definition:                (L^{(i)})^{-1} = L_i^{-1} L_{i-1}^{-1} ... L_1^{-1} = L_i^{-1} (L^{(i-1)})^{-1}

Form row i of L:           L_{i*}^T = (D^{(i-1)})^{-1} (L^{(i-1)})^{-1} K_{i*}^T

Update diagonal:           D_ii = D_ii - \sum_{j=1}^{i-1} L_ij D_jj L_ij^T

Set:                       (L_i^{-1})_{i*} = -L_{i*} ,   (L_i^{-1})_{ii} = 1

Compute row i of L^{-1}:   (L^{-1})_{i*} = (L_i^{-1})_{i*} (L^{(i-1)})^{-1} = -L_{i*} (L^{(i-1)})^{-1}

• No increase in communication time
• Matrix-vector multiplications
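A small dense sketch of the row-by-row construction of L^{-1} read from the recurrence above; in the solver this would be applied to the principal block of each shared column block, with no extra communication.

import numpy as np

def partial_inverse(L):
    """Build L^{-1} one row at a time: row i of L^{-1} is -L_{i*} (L^{(i-1)})^{-1},
    with a unit diagonal."""
    n = L.shape[0]
    Linv = np.eye(n)
    for i in range(1, n):
        Linv[i, :i] = -L[i, :i] @ Linv[:i, :i]
    return Linv

L = np.tril(np.random.default_rng(1).standard_normal((5, 5)), -1) + np.eye(5)
assert np.allclose(partial_inverse(L), np.linalg.inv(L))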
Parallel Lanczos Algorithm
(for the generalized eigenproblem K x = lambda M x)

1. Choose a shift sigma
2. Factor K_sigma = K - sigma M = L D L^T            <- Parallel Factorization
3. Choose initial starting vector r
4. p = M r
5. beta_1 = (r^T p)^{1/2}
6. q_1 = r / beta_1
7. p = p / beta_1
8. FOR j = 1, 2, ... DO
9.     r = K_sigma^{-1} M q_j = K_sigma^{-1} p       <- Parallel Forward and Backward Solution,
                                                        or Parallel Matrix-Vector Multiplications
10.    r = r - beta_j q_{j-1}
11.    alpha_j = p^T r                               <- Dot products: form locally (parallel),
                                                        sum globally (broadcast)
12.    r = r - alpha_j q_j
13.    p = M r                                       <- Parallel Matrix-Vector Multiplication
14.    beta_{j+1} = (r^T p)^{1/2}                    <- Dot product
15.    q_{j+1} = r / beta_{j+1}
16.    p = p / beta_{j+1}
17. ENDFOR

(Other Operations: Re-orthogonalization, Tridiagonal Eigensolves, Ritz Vector Formation)
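A small dense NumPy sketch of the Lanczos loop above, marking where the parallel kernels (factorization, forward/backward solves or matrix-vector multiplies, and dot products) plug in; the problem size, the shift, and the dense inverse standing in for the LDL^T factor are illustrative only, and re-orthogonalization is omitted.

import numpy as np

rng = np.random.default_rng(0)
n, n_steps, sigma = 50, 20, 0.0
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)                 # SPD stiffness matrix
M = np.diag(rng.uniform(1.0, 2.0, n))       # SPD (lumped) mass matrix

K_sigma = K - sigma * M                     # step 2: factor K_sigma (the dense inverse
Kinv = np.linalg.inv(K_sigma)               # here stands in for the parallel LDL^T factor)

r = rng.standard_normal(n)                  # step 3: starting vector
p = M @ r                                   # step 4: parallel matrix-vector multiply
beta = np.sqrt(r @ p)                       # step 5: local dot product + global sum
q_prev, q = np.zeros(n), r / beta           # step 6
p = p / beta                                # step 7
alphas, betas = [], []
for j in range(n_steps):                    # steps 8-17
    r = Kinv @ p                            # step 9: forward/backward solve, or matrix-
                                            #         vector multiply with the partial inverse
    r -= beta * q_prev                      # step 10
    alpha = p @ r                           # step 11: dot product
    r -= alpha * q                          # step 12
    p = M @ r                               # step 13: parallel matrix-vector multiply
    beta = np.sqrt(r @ p)                   # step 14: dot product
    q_prev, q = q, r / beta                 # step 15
    p = p / beta                            # step 16
    alphas.append(alpha)
    betas.append(beta)

# Ritz values of the tridiagonal T approximate 1/(lambda - sigma) for K x = lambda M x
T = np.diag(alphas) + np.diag(betas[:-1], 1) + np.diag(betas[:-1], -1)
theta = np.linalg.eigvalsh(T)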
Example Results: iPSC II, PVM Implementation
Structural Dome Model (16,026 equations) : 40 Lanczos Steps, Time in seconds

                             Factorization (without partial inverse)    Factorization (with partial inverse)
No. of Processors                4      8      16     32                    4      8      16     32
Basic Lanczos Steps:
  Form Shift                    0.79   0.46   0.27   0.18                  0.79   0.46   0.27   0.18
  Factor K_sigma               14.91   9.13   6.26   5.06                 15.14   9.34   6.39   5.20
  Data Initialization           0.15   0.08   0.04   0.02                  0.15   0.08   0.04   0.02
  Vector Solution              16.22  10.22   8.10   8.26                 15.55   9.29   6.58   5.83
  Form alpha, beta and r        5.96   3.15   1.93   0.97                  5.96   3.24   1.68   0.97
Miscellaneous Steps:
  Re-orthogonalization          2.40   1.44   0.78   0.46                  2.70   1.44   0.77   0.46
  Tridiagonal Eigensolves       1.20   1.20   1.20   1.20                  1.20   1.20   1.20   1.20
  Form Ritz Vectors             1.04   0.62   0.40   0.29                  1.04   0.62   0.40   0.29
Total Solution Time            43.08  26.35  18.99  16.45                 42.63  25.72  17.35  14.63
Example Results: Paragon Machine, PVM Implementation
Finite Element Grid Models: 40 Lanczos Steps, Time in Seconds

No. of Proc.    (without inverse)    (with inverse)
180x180 mesh (65,514 equations)
  8                 63.20                60.44
 16                 35.08                32.33
 32                 22.38                18.46
 64                 17.07                11.87
128                 15.62                 8.84
200x200 mesh (80,794 equations)
 16                 43.58                40.14
 32                 26.47                22.32
 64                 20.09                14.16
128                 18.07                 9.91
220x220 mesh (97,674 equations)
 16                 51.99                48.48
 32                 31.48                26.88
 64                 23.18                16.39
128                 20.99                11.80
General Remarks
Sparse direct solution methods can be efficient in a (modest-size) parallel
environment
• Need >2,500 equations per processor for good performance (general
experience with Intel's Hypercube, Paragon, ..., IBM Power
machines, Sun Enterprise)
• Memory bound - the more memory per processor, the better the
utilization of the sparse solver
Other Parallel Sparse Solvers: Parallel SuperLU, MUMPS, etc.
Problems with multiple RHS (eigensolvers, modified Newton, etc.)
• Compute the inverse of the principal block matrix factor
• Increases local computation, but keeps the same communication overhead
• Transforms the data-dependent parallel forward/backward solves into
data-independent matrix-vector multiplications
Distributed environment
• Task farming model (De Santiago 1996) -- fault tolerance and recovery
• Internet-based "collaborative" component-based model (Peng 2002)