Source: …mumps.enseeiht.fr/doc/ud_2017/Jolivet_Talk.pdf

MUMPS on thousands of cores: feedback on the use of direct solvers in DDM

Pierre Jolivet, IRIT–CNRS

MUMPS User Days, June 2

introduction

hpddm

• open-source, https://github.org/hpddm/hpddm
• handles (optimized) Schwarz or substructuring methods,
• provides adaptive coarse space constructions,
• interfaced with MUMPS, PaStiX, PARDISO, Dissection, SuiteSparse, BoomerAMG,
• interfaced with LAPACK and ARPACK,
• can be used with FreeFem++, Feel++, or as is in C, C++, Python or Fortran.

ds(e)l for fe methods

Solving Laplace’s equation with Feel++

auto mesh = loadMesh(_mesh = new Mesh);
auto Vh = Pch(mesh);
auto u = Vh->element(), v = Vh->element();
auto f = expr("2*x*y+cos(y):x:y");
// a(u, v) = ∫_Ω ∇u · ∇v
auto a = form2(_trial = Vh, _test = Vh);
a = integrate(_range = elements(mesh),
              _expr = gradt(u) * trans(grad(v)));
// l(v) = ∫_Ω f v
auto l = form1(_test = Vh);
l = integrate(_range = elements(mesh),
              _expr = f * id(v));
// u = 0 on ∂Ω
a += on(_range = boundaryfaces(mesh),
        _rhs = l, _element = u,
        _expr = cst(0.0));
a.solve(_rhs = l, _solution = u);

overlapping schwarz methods

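[Figure: the domain Ω, decomposed into two subdomains Ω1 and Ω2]

Consider the linear system $Au = f \in \mathbb{K}^n$.

Given a decomposition $(N_1, N_2)$ of $\llbracket 1; n \rrbracket$, define:
• the restriction operator $R_i$ from $\llbracket 1; n \rrbracket$ into $N_i$,
• $R_i^T$ as the extension by 0 from $N_i$ into $\llbracket 1; n \rrbracket$.

Then define:

$u_i = R_i u, \qquad A_{ij} = R_i A R_j^T.$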
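The operators above never need to be stored as matrices: a restriction is a gather on an index list and its transpose is a scatter into a zero vector. The following is a minimal serial sketch under that reading, not HPDDM's actual data structures. The names `restrict_to`, `extend_by_zero`, `sub_block`, the dense storage and the 0-based index sets are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>;         // dense, Matrix[row][col]; illustration only
using IndexSet = std::vector<std::size_t>;  // N_i as a list of 0-based global indices

// u_i = R_i u: keep only the entries of u whose index belongs to N_i.
Vector restrict_to(const IndexSet& Ni, const Vector& u) {
    Vector ui(Ni.size());
    for (std::size_t k = 0; k < Ni.size(); ++k) ui[k] = u[Ni[k]];
    return ui;
}

// R_i^T u_i: extend a local vector by 0 to the n global unknowns.
Vector extend_by_zero(const IndexSet& Ni, const Vector& ui, std::size_t n) {
    Vector u(n, 0.0);
    for (std::size_t k = 0; k < Ni.size(); ++k) u[Ni[k]] = ui[k];
    return u;
}

// A_ij = R_i A R_j^T: rows of A indexed by N_i, columns indexed by N_j.
Matrix sub_block(const Matrix& A, const IndexSet& Ni, const IndexSet& Nj) {
    Matrix Aij(Ni.size(), Vector(Nj.size()));
    for (std::size_t r = 0; r < Ni.size(); ++r)
        for (std::size_t c = 0; c < Nj.size(); ++c)
            Aij[r][c] = A[Ni[r]][Nj[c]];
    return Aij;
}
```

With n = 4, N_1 = {0, 1, 2} and N_2 = {2, 3}, unknown 2 is duplicated in both subdomains, which is the overlapping situation the next slides deal with.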



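Duplicated unknowns coupled via a partition of unity:

$I = \sum_{i=1}^{N} R_i^T D_i R_i.$

Then, $u^{m+1} = \sum_{i=1}^{N} R_i^T D_i u_i^{m+1}$.

$M^{-1}_{\mathrm{RAS}} = \sum_{i=1}^{N} R_i^T D_i A_{ii}^{-1} R_i$ [Cai and Sarkis 1999]

[Figure: partition of unity weights (values 1/2 and 1) on the unknowns of the two subdomains]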
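To make the restricted additive Schwarz formula concrete, here is a small serial sketch that applies $\sum_i R_i^T D_i A_{ii}^{-1} R_i$ to a vector. It is an illustration under stated assumptions, not the HPDDM implementation: the `Subdomain` struct, the dense storage, and the hand-written Gaussian elimination standing in for a sparse direct solver such as MUMPS are all hypothetical choices.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>;  // dense, Matrix[row][col]; illustration only

struct Subdomain {
    std::vector<std::size_t> indices;  // N_i, 0-based global indices
    Vector weights;                    // diagonal of D_i (partition of unity)
};

// Small dense solve (Gaussian elimination with partial pivoting).
// Stand-in for the local direct solver; assumes a nonsingular matrix.
Vector dense_solve(Matrix A, Vector b) {
    const std::size_t n = b.size();
    for (std::size_t k = 0; k < n; ++k) {
        std::size_t p = k;
        for (std::size_t r = k + 1; r < n; ++r)
            if (std::fabs(A[r][k]) > std::fabs(A[p][k])) p = r;
        std::swap(A[k], A[p]);
        std::swap(b[k], b[p]);
        for (std::size_t r = k + 1; r < n; ++r) {
            const double f = A[r][k] / A[k][k];
            for (std::size_t c = k; c < n; ++c) A[r][c] -= f * A[k][c];
            b[r] -= f * b[k];
        }
    }
    Vector x(n);
    for (std::size_t k = n; k-- > 0;) {
        double s = b[k];
        for (std::size_t c = k + 1; c < n; ++c) s -= A[k][c] * x[c];
        x[k] = s / A[k][k];
    }
    return x;
}

// z = sum_i R_i^T D_i A_ii^{-1} R_i r
Vector apply_ras(const Matrix& A, const std::vector<Subdomain>& subs, const Vector& r) {
    Vector z(r.size(), 0.0);
    for (const Subdomain& s : subs) {
        const std::size_t m = s.indices.size();
        Matrix Aii(m, Vector(m));
        Vector ri(m);
        for (std::size_t a = 0; a < m; ++a) {
            ri[a] = r[s.indices[a]];  // R_i r
            for (std::size_t b = 0; b < m; ++b)
                Aii[a][b] = A[s.indices[a]][s.indices[b]];  // R_i A R_i^T
        }
        const Vector zi = dense_solve(Aii, ri);  // A_ii^{-1} R_i r
        for (std::size_t a = 0; a < m; ++a)
            z[s.indices[a]] += s.weights[a] * zi[a];  // R_i^T D_i (...)
    }
    return z;
}
```

For a 4-unknown example with N_1 = {0, 1, 2}, N_2 = {2, 3} and weights (1, 1, 1/2) and (1/2, 1), the weights of the duplicated unknown sum to 1, which is exactly the partition-of-unity identity stated above. In a parallel code each subdomain would live on its own process and the final accumulation would become a neighbor exchange.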




substructuring preconditioners

[Figure, after Gosselet and Rey 2006: a domain split into three subdomains Ω1, Ω2, Ω3, shown in successive steps]

• Subdomain tearing.
• Local numbering (node $j$ of subdomain $k$ written $j^{(k)}$): $A^{(k)} = \begin{bmatrix} A_{ii} & A_{ib} \\ A_{bi} & A_{bb} \end{bmatrix}$.
• Elimination of interior d.o.f.: $S^{(k)} = A_{bb} - A_{bi} A_{ii}^{-1} A_{ib}$.
• Jump operators $\{B^{(i)}\}_{i=1}^{3}$, primal constraints [Mandel 1993].
• Jump operators $\{B^{(i)}\}_{i=1}^{3}$, dual constraints [Farhat and Roux 1991].
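The Schur complement $S^{(k)} = A_{bb} - A_{bi} A_{ii}^{-1} A_{ib}$ used in the elimination step follows from block elimination of the interior unknowns. The derivation below is a short reminder of that step. The local right-hand sides $f_i$, $f_b$ and the 2×2 numerical example are assumptions added for illustration and do not appear on the slides.

```latex
% Block elimination of the interior unknowns u_i of subdomain k
\begin{align*}
A_{ii} u_i + A_{ib} u_b &= f_i
  &&\Longrightarrow\quad u_i = A_{ii}^{-1} (f_i - A_{ib} u_b), \\
A_{bi} u_i + A_{bb} u_b &= f_b
  &&\Longrightarrow\quad
  \underbrace{(A_{bb} - A_{bi} A_{ii}^{-1} A_{ib})}_{S^{(k)}} \, u_b
  = f_b - A_{bi} A_{ii}^{-1} f_i.
\end{align*}
% Hypothetical case with one interior and one boundary unknown
\[
A^{(k)} = \begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}
\quad\Longrightarrow\quad
S^{(k)} = 2 - (-1) \cdot \tfrac{1}{2} \cdot (-1) = \tfrac{3}{2}.
\]
```

The application of $A_{ii}^{-1}$ is where a direct solver enters: the interior block is factorized once and reused for every product with $S^{(k)}$.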

limitations of one-level methods

One-level methods don’t require exchange of global information.

This hampers numerical scalability of such preconditioners:

$\kappa(M^{-1}A) \leqslant C \, \frac{1}{H^2} \left(1 + \frac{H}{\delta}\right)$

• level of overlap $\delta$,
• characteristic size of a subdomain $H$.

[Le Tallec 1994; Toselli and Widlund 2005]
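A quick numerical reading of the bound shows why it is a problem for large process counts. The values of $H$ and $\delta$ below are hypothetical, and since the constant $C$ is unspecified only the factor multiplying it is compared.

```latex
\[
\frac{1}{H^2}\left(1 + \frac{H}{\delta}\right):
\qquad
H = \tfrac{1}{4},\ \delta = \tfrac{1}{16} \;\Rightarrow\; 16 \cdot (1 + 4) = 80,
\qquad
H = \tfrac{1}{8},\ \delta = \tfrac{1}{32} \;\Rightarrow\; 64 \cdot (1 + 4) = 320.
\]
```

Halving the subdomain size while keeping the ratio $H/\delta$ fixed multiplies the bound by four, so adding subdomains degrades the bound even when the relative overlap is kept constant.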

two-level preconditioners


A common technique in the field of DDM, MG, deflation: introduce an auxiliary “coarse” problem.

Let $Z$ be a rectangular matrix. Define

$E = Z^T A Z.$

$Z$ has $O(N)$ columns, hence $E$ is much smaller than $A$.

Enrich the original preconditioner, e.g. additively

$P^{-1} = M^{-1} + Z E^{-1} Z^T,$

cf. [Tang et al. 2009].
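One way to see what the coarse term $Z E^{-1} Z^T$ contributes is to check how it acts on the range of $Z$. The short derivation below assumes only that $E = Z^T A Z$ is invertible (for instance $A$ symmetric positive definite and $Z$ of full column rank, which is an assumption not stated on the slide).

```latex
\begin{align*}
(Z E^{-1} Z^T A)\, Z &= Z E^{-1} (Z^T A Z) = Z E^{-1} E = Z, \\
(Z E^{-1} Z^T A)^2 &= Z E^{-1} (Z^T A Z) E^{-1} Z^T A = Z E^{-1} Z^T A.
\end{align*}
```

So $Z E^{-1} Z^T A$ is a projection that leaves the range of $Z$ unchanged: any error component of the form $Zc$ is reproduced exactly by the coarse term, which is why the choice of the columns of $Z$ matters so much.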



computing a coarse operator correction

How to apply $Z E^{-1} Z^T$ to $u \in \mathbb{K}^n$?

• $Z^T u$: local operations and MPI_Gather, yielding a vector of size $m \ll n$,
• $E^{-1} Z^T u$: linear solve,
• $Z E^{-1} Z^T u$: MPI_Scatter and local operations.
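The three steps can be sketched in serial C++ as follows. This is a minimal illustration, not the distributed implementation discussed in the talk: the dense storage, the function names `coarse_operator` and `coarse_correction`, and the hand-written `dense_solve` standing in for a direct solver are all assumptions, and the places where a parallel code would gather and scatter are only marked in comments.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Vector = std::vector<double>;
using Matrix = std::vector<Vector>;  // dense, Matrix[row][col]; illustration only

// Small dense solve (Gaussian elimination with partial pivoting).
// Stand-in for the coarse direct solver; assumes a nonsingular matrix.
Vector dense_solve(Matrix A, Vector b) {
    const std::size_t n = b.size();
    for (std::size_t k = 0; k < n; ++k) {
        std::size_t p = k;
        for (std::size_t r = k + 1; r < n; ++r)
            if (std::fabs(A[r][k]) > std::fabs(A[p][k])) p = r;
        std::swap(A[k], A[p]);
        std::swap(b[k], b[p]);
        for (std::size_t r = k + 1; r < n; ++r) {
            const double f = A[r][k] / A[k][k];
            for (std::size_t c = k; c < n; ++c) A[r][c] -= f * A[k][c];
            b[r] -= f * b[k];
        }
    }
    Vector x(n);
    for (std::size_t k = n; k-- > 0;) {
        double s = b[k];
        for (std::size_t c = k + 1; c < n; ++c) s -= A[k][c] * x[c];
        x[k] = s / A[k][k];
    }
    return x;
}

// E = Z^T A Z, with A of size n x n and Z of size n x m (m << n).
Matrix coarse_operator(const Matrix& A, const Matrix& Z) {
    const std::size_t n = Z.size(), m = Z[0].size();
    Matrix AZ(n, Vector(m, 0.0)), E(m, Vector(m, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < m; ++j) AZ[i][j] += A[i][k] * Z[k][j];
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = 0; i < m; ++i)
            for (std::size_t j = 0; j < m; ++j) E[i][j] += Z[k][i] * AZ[k][j];
    return E;
}

// Returns Z E^{-1} Z^T u.
Vector coarse_correction(const Matrix& Z, const Matrix& E, const Vector& u) {
    const std::size_t n = Z.size(), m = Z[0].size();
    Vector w(m, 0.0);  // w = Z^T u (in parallel: local products, then a gather)
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < m; ++j) w[j] += Z[k][j] * u[k];
    const Vector y = dense_solve(E, w);  // y = E^{-1} w (small linear solve)
    Vector z(n, 0.0);  // z = Z y (in parallel: a scatter, then local products)
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < m; ++j) z[k] += Z[k][j] * y[j];
    return z;
}
```

In this sketch $E$ is assembled once and solved from scratch at every call. In practice one would factorize $E$ once during setup and reuse the factors at each iteration, which is where a solver such as MUMPS comes in.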



numerical results

machines used for scaling runs

Curie Thin Nodes

• 5040 compute nodes (2 eight-core Intel Sandy Bridge).
• IB QDR full fat-tree.

Turing

• 6 BlueGene/Q racks.

strong scaling (stokes’ equation)

1 subdomain/MPI process, --ranks-per-node = 8.

[Figure: run time (seconds) vs. # of processes (1 024 to 8 192), for the 2D and 3D problems, compared with linear speedup. Credits: R. Haferssas]

block low-rank approximations

Motivation

During setup, a lot of time is spent in the direct solver.

block low-rank approximations

1 subdomain/MPI process, 2 OpenMP threads/MPI process.

[Figure: run time (seconds) vs. # of processes (256 to 8 192), split into factorization, deflation vectors, coarse operator, and Krylov method. Panels: 2.1M d.o.f./sbdmn in 2D (P4 FE) and 280k d.o.f./sbdmn in 3D (P2 FE).]

diffusion equation

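[Figure: bar chart of run time (seconds, axis from 0 to 100) for PARDISO, MUMPS FR, BLR with ε = $10^{-12}$, BLR with ε = $10^{-8}$, BLR with ε = $10^{-4}$, and PaStiX. Values printed on the chart, in extraction order: 0.3, 0.8, 0.8, 0.7, 0.8, 1.2. $LL^T$ factorization (430k d.o.f.)]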

stokes’ equation

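[Figure: bar chart of run time (seconds, axis from 0 to 250) for PARDISO, MUMPS FR, BLR with ε = $10^{-12}$, BLR with ε = $10^{-8}$, BLR with ε = $10^{-4}$, and PaStiX. Values printed on the chart, in extraction order: 2.3, 1.2, 1.2, 1.2, 1.2, 2.6. $LDL^T$ factorization (503k d.o.f.)]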

solution phase with multiple right-hand sides

Motivation

DD preconditioner for tomographic imaging.

$\nabla \times (\nabla \times E) - \mu_0 (\omega^2 \varepsilon + i \omega \sigma) E = 0$

scalability of the preconditioner

[Figure: run time (seconds, log scale from 10 to 500) vs. # of processes (512, 1 024, 2 048, 4 096), showing setup and solve times against linear speedup; annotations on the plot: (54), (61), (73), (94). Credits: P.-H. Tournier]

• 119 million double-precision complex unknowns
• degree 2 edge elements

stokes’ equation (lu factorization)

$E_{P,p} = \frac{p \cdot T_{1,1}}{P \cdot T_{P,p}}$

[Figure: efficiency $E_{P,p}$ (100% to 500%) vs. # of right-hand sides $p$ (1 to 128), for P = 1, 2, 4, 8, 16. Panels: MUMPS ($T_{1,1}$ = 1.2 s) and PARDISO ($T_{1,1}$ = 3.4 s).]
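As a worked reading of the efficiency $E_{P,p} = p \cdot T_{1,1} / (P \cdot T_{P,p})$, take the MUMPS reference time $T_{1,1}$ = 1.2 s from the slide and a purely hypothetical measurement $T_{4,32}$ = 4.8 s (this second number is invented for the arithmetic and is not a result from the talk).

```latex
\[
E_{4,32} = \frac{p \cdot T_{1,1}}{P \cdot T_{P,p}}
         = \frac{32 \times 1.2}{4 \times 4.8}
         = \frac{38.4}{19.2}
         = 2 = 200\%.
\]
```

The numerator is the cost of solving the $p$ right-hand sides one at a time with $P = 1$, so a value above 100% means that treating the right-hand sides as a block more than pays for the $P$-fold increase in resources.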

maxwell’s equation (ldlt factorization)

$E_{P,p} = \frac{p \cdot T_{1,1}}{P \cdot T_{P,p}}$

[Figure: efficiency $E_{P,p}$ (100% to 375%) vs. # of right-hand sides $p$ (1 to 128), for P = 1, 2, 4, 8, 16. Panels: MUMPS ($T_{1,1}$ = 1.1 s) and PARDISO ($T_{1,1}$ = 0.9 s).]

block methods for maxwell’s equation

alternative   p    solve     # of it.   per RHS   eff.
GMRES         1    3 078.4   20 068     627       −
GCRO-DR       1    1 836.9   10 701     334       1.7
BGMRES        32   724.8     158        −         4.2
BGCRO-DR      8    677.6     524        131       4.5
BGCRO-DR      32   992.3     127        −         3.1

• (m, k) = (50, 10) for solving 32 RHSs
• 2048 subdomains and 2 threads per subdomain
• alternative #1 to #5 =⇒ 158× fewer iterations
• working on all 32 RHSs is costly (#4 vs. #5)
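The slide does not spell out how the "per RHS" and "eff." columns are obtained. The reading below is an assumption that happens to reproduce the tabulated numbers: "eff." as the GMRES solve time divided by the solve time of the alternative, and "per RHS" as the iteration count scaled by $p/32$.

```latex
\begin{align*}
\text{eff.} &:\quad
  \frac{3078.4}{1836.9} \approx 1.7, \quad
  \frac{3078.4}{724.8} \approx 4.2, \quad
  \frac{3078.4}{677.6} \approx 4.5, \quad
  \frac{3078.4}{992.3} \approx 3.1, \\
\text{per RHS} &:\quad
  \frac{20\,068}{32} \approx 627, \quad
  \frac{10\,701}{32} \approx 334, \quad
  \frac{524 \times 8}{32} = 131, \\
\text{\#1 vs. \#5} &:\quad
  \frac{20\,068}{127} \approx 158.
\end{align*}
```

Under this reading, alternative #5 needs the fewest iterations yet is slower than #4, which is the point of the last bullet: each iteration on a block of 32 right-hand sides is expensive.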





features and wish list

summary of used features

• LU, $LDL^T$
• distributed or centralized solution
• BLR
• multiple RHS
• detection of pivots and size of nullspace
• computation of Schur complement
• working host/distributed input

wish list

• distributed RHS
• efficient handling of block matrices

Thank you for MUMPS and for your attention!
