mumps on thousands of cores: feedback on the use of direct solvers in ddm

Pierre Jolivet, IRIT–CNRS
MUMPS User Days, June 2
introduction
hpddm
• open-source, https://github.com/hpddm/hpddm,
• handles (optimized) Schwarz or substructuring methods,
• provides adaptive coarse space constructions,
• interfaced with MUMPS, PaStiX, PARDISO, Dissection, SuiteSparse, BoomerAMG,
• interfaced with LAPACK and ARPACK,
• can be used with FreeFem++, Feel++, or as is in C, C++, Python or Fortran.
ds(e)l for fe methods
Solving Laplace's equation with Feel++:

auto mesh = loadMesh(_mesh = new Mesh);
auto Vh = Pch(mesh);
auto u = Vh->element(), v = Vh->element();
auto f = expr("2*x*y+cos(y):x:y");
// a(u, v) = ∫_Ω ∇u · ∇v
auto a = form2(_trial = Vh, _test = Vh);
a = integrate(_range = elements(mesh),
              _expr = gradt(u) * trans(grad(v)));
// l(v) = ∫_Ω f v
auto l = form1(_test = Vh);
l = integrate(_range = elements(mesh),
              _expr = f * id(v));
// u = 0 on ∂Ω
a += on(_range = boundaryfaces(mesh),
        _rhs = l, _element = u, _expr = cst(0.0));
a.solve(_rhs = l, _solution = u);
overlapping schwarz methods
[Figure: domain Ω split into two overlapping subdomains Ω₁ and Ω₂.]

Consider the linear system Au = f ∈ K^n.
Given a decomposition (N₁, N₂) of ⟦1; n⟧, define:
• the restriction operator R_i from ⟦1; n⟧ onto N_i,
• R_i^T, the extension by 0 from N_i into ⟦1; n⟧.
Then define:
u_i = R_i u,   A_ij = R_i A R_j^T.
overlapping schwarz methods

Duplicated unknowns are coupled via a partition of unity:
I = Σ_{i=1}^N R_i^T D_i R_i.
Then, u^{m+1} = Σ_{i=1}^N R_i^T D_i u_i^{m+1}, and the restricted additive Schwarz (RAS) preconditioner reads
M_RAS^{-1} = Σ_{i=1}^N R_i^T D_i A_ii^{-1} R_i.   [Cai and Sarkis 1999]

[Figure: partition-of-unity weights D_i, equal to 1/2 on duplicated unknowns and 1 elsewhere.]
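The RAS formula is easy to prototype. Below is a minimal NumPy sketch (illustrative only, not from the talk): a 1D Laplacian on 8 unknowns split into two overlapping index sets, with each D_i built from the multiplicity of the unknowns so that the partition-of-unity identity holds.

```python
import numpy as np

# Toy problem (hypothetical): 1D Laplacian on 8 unknowns,
# two overlapping subdomains N_1 = {0..4}, N_2 = {3..7}.
n = 8
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
N = [np.arange(0, 5), np.arange(3, 8)]

def restriction(idx, n):
    """Boolean restriction matrix R_i from {0..n-1} onto the subdomain idx."""
    R = np.zeros((len(idx), n))
    R[np.arange(len(idx)), idx] = 1.0
    return R

R = [restriction(idx, n) for idx in N]

# Partition of unity: weight each unknown by 1/multiplicity,
# so that sum_i R_i^T D_i R_i = I.
mult = np.zeros(n)
for idx in N:
    mult[idx] += 1.0
D = [np.diag(1.0 / mult[idx]) for idx in N]
assert np.allclose(sum(Ri.T @ Di @ Ri for Ri, Di in zip(R, D)), np.eye(n))

def apply_ras(v):
    """M_RAS^{-1} v = sum_i R_i^T D_i A_ii^{-1} R_i v."""
    out = np.zeros_like(v)
    for Ri, Di in zip(R, D):
        Aii = Ri @ A @ Ri.T          # local matrix A_ii = R_i A R_i^T
        out += Ri.T @ (Di @ np.linalg.solve(Aii, Ri @ v))
    return out
```

Used as a stationary iteration u ← u + M_RAS^{-1}(f − Au), this converges on the toy problem; in practice M_RAS^{-1} is instead used as a preconditioner for a Krylov method, with the local solves A_ii^{-1} delegated to a direct solver.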
substructuring preconditioners

[Figure: tearing of Ω into Ω₁, Ω₂, Ω₃, with interface unknowns numbered locally in each subdomain.]

• Subdomain tearing [Gosselet and Rey 2006].
• Local numbering, separating interior (i) and interface (b) unknowns:
  A^(k) = [ A_ii  A_ib ; A_bi  A_bb ].
• Elimination of interior d.o.f.:
  S^(k) = A_bb − A_bi A_ii^{-1} A_ib.
• Jump operators {B^(i)}_{i=1}^3, with primal constraints [Mandel 1993] or dual constraints [Farhat and Roux 1991].
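The condensation step S^(k) = A_bb − A_bi A_ii^{-1} A_ib can be checked on a small dense example. A NumPy sketch (illustrative, not the talk's implementation): build a random SPD local matrix, form the Schur complement, and verify that the condensed interface problem reproduces the interface part of the full solve.

```python
import numpy as np

# Hypothetical local matrix A^{(k)} with interior (i) and interface (b) blocks.
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)         # SPD local matrix
ni = 4                              # first ni unknowns are interior
Aii, Aib = A[:ni, :ni], A[:ni, ni:]
Abi, Abb = A[ni:, :ni], A[ni:, ni:]

# Schur complement after eliminating the interior d.o.f.
S = Abb - Abi @ np.linalg.solve(Aii, Aib)

# Sanity check: the condensed interface system S u_b = f_b - A_bi A_ii^{-1} f_i
# yields the same interface solution as the full solve.
f = rng.standard_normal(6)
u = np.linalg.solve(A, f)
gb = f[ni:] - Abi @ np.linalg.solve(Aii, f[:ni])
assert np.allclose(np.linalg.solve(S, gb), u[ni:])
```

In a substructuring method this elimination is done per subdomain, typically by asking the direct solver (e.g. MUMPS' Schur complement feature, listed later in the deck) for S^(k) directly.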
limitations of one-level methods

One-level methods don't require exchange of global information.
This hampers numerical scalability of such preconditioners:

κ(M^{-1}A) ⩽ C (1/H²) (1 + H/δ)

• level of overlap δ,
• characteristic size of a subdomain H.
[Le Tallec 1994; Toselli and Widlund 2005]
two-level preconditioners
two-level preconditioners

A common technique in the fields of DDM, MG, and deflation: introduce an auxiliary "coarse" problem.
Let Z be a rectangular matrix. Define
E = Z^T A Z.
Z has O(N) columns, hence E is much smaller than A.
Enrich the original preconditioner, e.g. additively:
P^{-1} = M^{-1} + Z E^{-1} Z^T,
cf. [Tang et al. 2009].
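As a sanity check of P^{-1} = M^{-1} + Z E^{-1} Z^T, here is a small NumPy sketch. The Jacobi choice for M and the piecewise-constant coarse space Z are assumptions for illustration, not the talk's setup:

```python
import numpy as np

# Hypothetical SPD system and a one-level preconditioner M (Jacobi here).
n = 12
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
Minv = np.diag(1.0 / np.diag(A))

# Coarse space: Z has very few columns (2 here), so E = Z^T A Z is tiny.
Z = np.zeros((n, 2))
Z[: n // 2, 0] = 1.0                # constant on the first "subdomain"
Z[n // 2 :, 1] = 1.0                # constant on the second
E = Z.T @ A @ Z

def apply_two_level(u):
    """Additive two-level preconditioner: P^{-1} u = M^{-1} u + Z E^{-1} Z^T u."""
    return Minv @ u + Z @ np.linalg.solve(E, Z.T @ u)
```

Since M^{-1} is SPD and Z E^{-1} Z^T is symmetric positive semi-definite when A is SPD, P^{-1} remains SPD and can still be used with CG.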
computing a coarse operator correction

How to apply Z E^{-1} Z^T to u ∈ K^n?
• local operations, then MPI_Gather: form Z^T u, of size m ≪ n;
• a small linear solve: E^{-1}(Z^T u);
• MPI_Scatter, then local operations: form Z(E^{-1} Z^T u).
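The gather/solve/scatter pipeline can be mimicked serially. In the sketch below (hypothetical sizes), the sum over chunks stands in for the MPI_Gather/reduction of the local products Z_i^T u_i, and the per-chunk products stand in for the local work done after MPI_Scatter:

```python
import numpy as np

# Hedged serial mock-up: each "rank" owns a contiguous slice of u and of Z;
# the small m x m coarse solve with E is done once (e.g. on a master rank).
n, m, nprocs = 12, 3, 4
rng = np.random.default_rng(1)
Z = rng.standard_normal((n, m))
A = np.diag(np.arange(1.0, n + 1))          # stand-in operator
E = Z.T @ A @ Z
u = rng.standard_normal(n)

chunks = np.array_split(np.arange(n), nprocs)
# 1) local products Z_i^T u_i, then MPI_Gather (sum-reduction) on the master:
ZTu = sum(Z[c].T @ u[c] for c in chunks)
# 2) small (m x m) linear solve on the master, e.g. with a direct solver:
y = np.linalg.solve(E, ZTu)
# 3) MPI_Scatter y back; each rank computes its slice of Z y:
correction = np.concatenate([Z[c] @ y for c in chunks])
assert np.allclose(correction, Z @ np.linalg.solve(E, Z.T @ u))
```

In practice the m × m solve with E would be delegated to a direct solver such as MUMPS, possibly itself distributed over a subset of the processes.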
numerical results
machines used for scaling runs
Curie Thin Nodes
• 5040 compute nodes (2 eight-core Intel Sandy Bridge).
• IB QDR full fat-tree.
Turing
• 6 BlueGene/Q racks.
strong scaling (stokes’ equation)
1 subdomain/MPI process, --ranks-per-node = 8.
[Figure: run time (seconds, from 20 to 500) vs. # of processes (1 024 to 8 192) for the 2D and 3D cases, compared against linear speedup.]
Credits: R. Haferssas
block low-rank approximations

Motivation
During setup, a lot of time is spent in the direct solver.
block low-rank approximations
1 subdomain/MPI process, 2 OpenMP threads/MPI process.
[Figure: breakdown of run time (seconds) vs. # of processes (256 to 8 192); left: 2.1M d.o.f./subdomain in 2D (P4 FE); right: 280k d.o.f./subdomain in 3D (P2 FE). Time split between factorization, deflation vectors, coarse operator, and Krylov method.]
diffusion equation
[Bar chart: run time (seconds, up to 100) of the LLᵀ factorization (430k d.o.f.) for PARDISO, MUMPS FR, MUMPS BLR with ε = 10⁻¹², 10⁻⁸, 10⁻⁴, and PaStiX; per-bar annotations: 0.3, 0.8, 0.8, 0.7, 0.8, 1.2.]
stokes’ equation
[Bar chart: run time (seconds, up to 250) of the LDLᵀ factorization (503k d.o.f.) for PARDISO, MUMPS FR, MUMPS BLR with ε = 10⁻¹², 10⁻⁸, 10⁻⁴, and PaStiX; per-bar annotations: 2.3, 1.2, 1.2, 1.2, 1.2, 2.6.]
solution phase with multiple right-hand sides

Motivation
DD preconditioner for tomographic imaging:
∇ × (∇ × E) − μ₀(ω²ε + iωσ)E = 0.
scalability of the preconditioner
[Figure: setup and solve times (seconds, from 10 to 500) vs. # of processes (512 to 4 096), compared against linear speedup; iteration counts in parentheses: (54), (61), (73), (94).]
Credits: P.-H. Tournier
• 119 million double-precision complex unknowns
• degree 2 edge elements
stokes’ equation (lu factorization)
[Figure: parallel efficiency E_{P,p} = p · T_{1,1} / (P · T_{P,p}), from 100% to 500%, vs. # of right-hand sides p (1 to 128), for P = 1, 2, 4, 8, 16; top: MUMPS (T_{1,1} = 1.2s), bottom: PARDISO (T_{1,1} = 3.4s).]
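The efficiency metric plotted here, E_{P,p} = p · T_{1,1} / (P · T_{P,p}), compares the blocked multi-RHS solve against perfect strong scaling of p independent single-RHS solves; values above 100% mean the blocked solve amortizes work across right-hand sides. A trivial illustrative helper:

```python
def efficiency(p, P, T11, TPp):
    """Parallel efficiency E_{P,p} = p * T_{1,1} / (P * T_{P,p}).

    T11: time for 1 RHS on 1 process; TPp: time for p RHSs on P processes.
    Values above 1.0 (100%) mean blocked RHSs amortize the solve.
    """
    return p * T11 / (P * TPp)

# If solving 8 RHSs on 4 processes takes no longer than 1 RHS on 1 process,
# the efficiency is 200%:
assert efficiency(8, 4, 1.2, 1.2) == 2.0
```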
maxwell’s equation (ldlt factorization)
[Figure: parallel efficiency from 100% to 375% vs. # of right-hand sides p (1 to 128), for P = 1, 2, 4, 8, 16; top: MUMPS (T_{1,1} = 1.1s), bottom: PARDISO (T_{1,1} = 0.9s).]
block methods for maxwell’s equation
alternative     p    solve     # of it.   per RHS   eff.
1. GMRES        1    3 078.4    20 068      627      −
2. GCRO-DR      1    1 836.9    10 701      334      1.7
3. BGMRES       32     724.8       158      −        4.2
4. BGCRO-DR     8      677.6       524      131      4.5
5. BGCRO-DR     32     992.3       127      −        3.1

• (m, k) = (50, 10) for solving 32 RHSs
• 2048 subdomains and 2 threads per subdomain
• alternative #1 to #5 =⇒ 158× fewer iterations
• working on all 32 RHSs is costly (#4 vs. #5)
features and wish list
summary of used features
• LU, LDLᵀ
• distributed or centralized solution
• BLR
• multiple RHS
• detection of pivots and size of nullspace
• computation of Schur complement
• working host / distributed input
wish list

• distributed RHS
• efficient handling of block matrices

Thank you for MUMPS and for your attention!
Introduction
Overlapping Schwarz methods
Substructuring preconditioners
Limitations of one-level methods
Two-level preconditioners
Mathematical point of view
Numerical results
Large-scale scalability studies
BLR
Multiple right-hand sides
Features and wish list