46
mumps on thousands of cores: feedback on the use of direct solvers in ddm Pierre Jolivet, IRIT–CNRS MUMPS User Days June 2

mumpsonthousandsofcores: feedbackon …mumps.enseeiht.fr/doc/ud_2017/Jolivet_Talk.pdf · overlappingschwarzmethods Ω Considerthelinearsystem: Au= f2 Kn. GivenadecompositionofJ1;nK,(N1;N2),define:therestrictionoperatorRifromJ1;nK

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

  • mumps on thousands of cores: feedback on

    the use of direct solvers in ddm

    Pierre Jolivet, IRIT–CNRS

    MUMPS User DaysJune 2

  • introduction

  • hpddm

    • open-source, https://github.org/hpddm/hpddm• handles (optimized) Schwarz or substructuring methods,• provides adaptive coarse space constructions,• interfaced with MUMPS, PaStiX, PARDISO, Dissection,SuiteSparse, BoomerAMG,

    • interfaced with LAPACK and ARPACK,• can be used with FreeFem++, Feel++, or as is in C, C++,Python or Fortran.

    1

    https://github.org/hpddm/hpddm

  • ds(e)l for fe methods

    Solving Laplace’s equation with Feel++auto mesh = loadMesh(_mesh = new Mesh);auto Vh = Pch(mesh);auto u = Vh->element(), v = Vh->element();auto f = expr("2*x*y+cos(y):x:y");// a(u, v) =

    ∫Ω∇u · ∇v

    auto a = form2(_trial = Vh, _test = Vh);a = integrate(_range = elements(mesh),

    _expr = gradt(u) * trans(grad(v)));// l(v) =

    ∫Ωfv

    auto l = form1(_test = Vh);l = integrate(_range = elements(mesh),

    _expr = f * id(v));// u = 0 on ∂Ωa += on(_range = boundaryfaces(mesh),

    _rhs = l, _element = u,_expr = cst(0.0));

    a.solve(_rhs = l, _solution = u);

    2

  • overlapping schwarz methods

    Consider the linear system: Au = f ∈ Kn.

    Given a decomposition of J1;nK, (N1,N2), define:• the restriction operator Ri from J1;nK into Ni,• RTi as the extension by 0 from Ni into J1;nK.Then define:

    ui = Riu Aij = RiARTj .

    3

  • overlapping schwarz methods

    Ω2Ω1

    Consider the linear system: Au = f ∈ Kn.Given a decomposition of J1;nK, (N1,N2), define:• the restriction operator Ri from J1;nK into Ni,• RTi as the extension by 0 from Ni into J1;nK.

    Then define:

    ui = Riu Aij = RiARTj .

    3

  • overlapping schwarz methods

    Ω2Ω1

    Consider the linear system: Au = f ∈ Kn.Given a decomposition of J1;nK, (N1,N2), define:• the restriction operator Ri from J1;nK into Ni,• RTi as the extension by 0 from Ni into J1;nK.Then define:

    ui = Riu Aij = RiARTj .

    3

  • overlapping schwarz methods

    Duplicated unknowns coupled via a partition of unity:

    I =N∑i=1

    RTi DiRi.

    Then, um+1 =N∑i=1

    RTi Dium+1i . M

    −1RAS =

    N∑i=1

    RTi DiA−1ii Ri

    [Cai and Sarkis 1999]

    12

    1

    12 1

    4

  • overlapping schwarz methods

    Duplicated unknowns coupled via a partition of unity:

    I =N∑i=1

    RTi DiRi.

    Then, um+1 =N∑i=1

    RTi Dium+1i .

    M−1RAS =N∑i=1

    RTi DiA−1ii Ri

    [Cai and Sarkis 1999]

    12

    1

    12 1

    4

  • overlapping schwarz methods

    Duplicated unknowns coupled via a partition of unity:

    I =N∑i=1

    RTi DiRi.

    Then, um+1 =N∑i=1

    RTi Dium+1i . M

    −1RAS =

    N∑i=1

    RTi DiA−1ii Ri

    [Cai and Sarkis 1999]

    12

    1

    12 1

    4

  • substructuring preconditioners

    Ω1

    Ω2Ω3

    [Gosselet and Rey 2006] Subdomain tearing

    5

  • substructuring preconditioners

    Ω1

    Ω2Ω3

    1(1)

    2(1)3(1)

    4(1)5(1)

    1(2)2(2)3(2)

    4(2)5(2)

    6(2)

    1(3)2(3)

    3(3)

    4(3)

    5(3)6(3)

    7(3)

    [Gosselet and Rey 2006] Local numbering A(k) =[Aii AibAbi Abb

    ]5

  • substructuring preconditioners

    Ω1

    Ω2Ω3

    1(1)b 3(1)b

    2(1)b

    4(1)b

    1(2)b3(2)b2

    (2)b

    4(2)b3(3)b

    1(3)b

    2(3)b

    [Gosselet and Rey 2006] Elimination of interior d.o.f.S(k) = Abb − AbiA−1ii Aib

    5

  • substructuring preconditioners

    1(1)b 3(1)b

    2(1)b

    4(1)b

    1(2)b3(2)b2

    (2)b

    4(2)b3(3)b

    1(3)b

    2(3)b 1 3

    4

    5

    2

    Jump operators: {B(i)}3i=1 Primal constraints[Mandel 1993]

    5

  • substructuring preconditioners

    1(1)b 3(1)b

    2(1)b

    4(1)b

    1(2)b3(2)b2

    (2)b

    4(2)b3(3)b

    1(3)b

    2(3)b 12

    4

    53

    6

    7

    Jump operators: {B(i)}3i=1 Dual constraints[Farhat and Roux 1991]

    5

  • limitations of one-level methods

    One-level methods don’t require exchange of globalinformation.

    This hampers numerical scalability of such preconditioners:

    κ(M−1A) ⩽ C 1H2

    (1 +

    )• level of overlap δ,• characteristic size of a subdomain H.

    [Le Tallec 1994; Toselli and Widlund 2005]

    6

  • limitations of one-level methods

    One-level methods don’t require exchange of globalinformation.

    This hampers numerical scalability of such preconditioners:

    κ(M−1A) ⩽ C 1H2

    (1 +

    )• level of overlap δ,• characteristic size of a subdomain H.

    [Le Tallec 1994; Toselli and Widlund 2005]

    6

  • two-level preconditioners

  • two-level preconditioners

    A common technique in the field of DDM, MG, deflation:

    introduce an auxiliary “coarse” problem.

    Let Z be a rectangular matrix. Define

    E = ZTAZ.

    Z has O(N) columns, hence E is much smaller than A.

    Enrich the original preconditioner, e.g. additively

    P−1 = M−1 + ZE−1ZT,

    cf. [Tang et al. 2009].

    7

  • two-level preconditioners

    A common technique in the field of DDM, MG, deflation:

    introduce an auxiliary “coarse” problem.

    Let Z be a rectangular matrix. Define

    E = ZTAZ.

    Z has O(N) columns, hence E is much smaller than A.Enrich the original preconditioner, e.g. additively

    P−1 = M−1 + ZE−1ZT,

    cf. [Tang et al. 2009].

    7

  • computing a coarse operator correction

    How to apply ZE−1ZT to u ∈ Kn ?

    n

    ZTu = × = m≪ n

    E−1ZTu = \ =

    × =

    = ZE−1ZTu

    operations & MPI_Gather + linear solve + MPI_Scatter & operations

    8

  • computing a coarse operator correction

    How to apply

    ZE−1

    ZT to u ∈ Kn ?

    n

    ZTu = × = m≪ n

    E−1ZTu = \ =

    × =

    = ZE−1ZTu

    operations & MPI_Gather

    + linear solve + MPI_Scatter & operations

    8

  • computing a coarse operator correction

    How to apply

    Z

    E−1ZT to u ∈ Kn ?

    n

    ZTu = × = m≪ n

    E−1ZTu = \ =

    × =

    = ZE−1ZTu

    operations & MPI_Gather + linear solve

    + MPI_Scatter & operations

    8

  • computing a coarse operator correction

    How to apply ZE−1ZT to u ∈ Kn ?

    n

    ZTu = × = m≪ n

    E−1ZTu = \ =

    × =

    = ZE−1ZTu

    operations & MPI_Gather + linear solve + MPI_Scatter & operations

    8

  • numerical results

  • machines used for scaling runs

    Curie Thin Nodes

    • 5040 compute nodes (2 eight-core Intel Sandy Bridge).• IB QDR full fat-tree.

    Turing

    • 6 BlueGene/Q racks.

    9

  • strong scaling (stokes’ equation)

    1 subdomain/MPI process, --ranks-per-node = 8.

    1 0242 048

    4 0968 192

    20

    40

    100

    200

    500

    # of processes

    Run

    time(sec

    onds

    )

    Linear speedup3D 2D

    Credits: R. Haferssas

    10

  • block low-rank approximationsMotivation

    During setup, lot of time spent in the direct solver.

    11

  • block low-rank approximations

    1 subdomain/MPI process, 2 OpenMP threads/MPI process.

    256512

    1 0242 048

    4 0968 192

    0

    50

    100

    150

    200

    # of processes

    Run

    time(sec

    onds

    )

    2.1M d.o.f.sbdmn in 2D (P4 FE)

    256512

    1 0242 048

    4 0968 192

    0

    100

    200

    # of processes

    280k d.o.f.sbdmn in 3D (P2 FE)

    Factorization Deflation vectors Coarse operator Krylov method12

  • diffusion equation

    PARDISO

    MUMPS FR

    BLRε =

    10 −12

    BLRε =

    10 −8

    BLRε =

    10 −4

    PaStiX

    0

    20

    40

    60

    80

    100

    0.3

    0.8

    0.8

    0.7

    0.8

    1.2Run

    time(sec

    onds

    )

    LLT factorization (430k d.o.f.)

    13

  • stokes’ equation

    PARDISO

    MUMPS FR

    BLRε =

    10 −12

    BLRε =

    10 −8

    BLRε =

    10 −4

    PaStiX

    0

    50

    100

    150

    200

    2502.3

    1.2

    1.2

    1.2

    1.2

    2.6

    Run

    time(sec

    onds

    )

    LDLT factorization (503k d.o.f.)

    14

  • solution phase with multiple right-hand sidesMotivation

    DD preconditioner for tomographic imaging.

    ∇× (∇× E)− µ0(ω2ε+ iωσ

    )E = 0

    15

  • scalability of the preconditioner

    5121 024

    2 0484 096

    500

    200

    50

    10

    (54)

    (61)

    (73) (94)

    # of processes

    Run

    time(sec

    onds

    )Setup Solve Linear speedup

    Credits: P.-H. Tournier

    • 119 million double-precision complex unknowns• degree 2 edge elements

    16

  • stokes’ equation (lu factorization)

    100%250%375%500%

    MUMPS T1,1: 1.2s

    EP,p =p · T1,1P · TP,p

    P = 1 P = 2 P = 4 P = 8 P = 16

    1 2 4 8 16 32 64 128

    100%250%375%500%

    PARDISO T1,1: 3.4s

    # of right-hand sides (p) 17

  • maxwell’s equation (ldlt factorization)

    100%

    250%375% MUMPS T1,1: 1.1s

    EP,p =p · T1,1P · TP,p

    P = 1 P = 2 P = 4 P = 8 P = 16

    1 2 4 8 16 32 64 128

    100%

    250%375% PARDISO T1,1: 0.9s

    # of right-hand sides (p) 18

  • block methods for maxwell’s equation

    alternative p solve # of it. per RHS eff.

    GMRES 1 3 078.4 20 068 627 −GCRO-DR 1 1 836.9 10 701 334 1.7BGMRES 32 724.8 158 − 4.2

    BGCRO-DR 8 677.6 524 131 4.5BGCRO-DR 32 992.3 127 − 3.1

    • (m, k) = (50, 10) for solving 32 RHSs• 2048 subdomains and 2 threads per subdomain

    • alternative #1 to #5 =⇒ 158× fewer iterations• working on all 32 RHSs is costly (#4 vs. #5)

    19

  • block methods for maxwell’s equation

    alternative p solve # of it. per RHS eff.

    GMRES 1 3 078.4 20 068 627 −GCRO-DR 1 1 836.9 10 701 334 1.7BGMRES 32 724.8 158 − 4.2

    BGCRO-DR 8 677.6 524 131 4.5BGCRO-DR 32 992.3 127 − 3.1

    • (m, k) = (50, 10) for solving 32 RHSs• 2048 subdomains and 2 threads per subdomain

    • alternative #1 to #5 =⇒ 158× fewer iterations• working on all 32 RHSs is costly (#4 vs. #5)

    19

  • block methods for maxwell’s equation

    alternative p solve # of it. per RHS eff.

    GMRES 1 3 078.4 20 068 627 −GCRO-DR 1 1 836.9 10 701 334 1.7BGMRES 32 724.8 158 − 4.2

    BGCRO-DR 8 677.6 524 131 4.5BGCRO-DR 32 992.3 127 − 3.1

    • (m, k) = (50, 10) for solving 32 RHSs• 2048 subdomains and 2 threads per subdomain

    • alternative #1 to #5 =⇒ 158× fewer iterations• working on all 32 RHSs is costly (#4 vs. #5)

    19

  • block methods for maxwell’s equation

    alternative p solve # of it. per RHS eff.

    GMRES 1 3 078.4 20 068 627 −GCRO-DR 1 1 836.9 10 701 334 1.7BGMRES 32 724.8 158 − 4.2

    BGCRO-DR 8 677.6 524 131 4.5BGCRO-DR 32 992.3 127 − 3.1

    • (m, k) = (50, 10) for solving 32 RHSs• 2048 subdomains and 2 threads per subdomain

    • alternative #1 to #5 =⇒ 158× fewer iterations• working on all 32 RHSs is costly (#4 vs. #5)

    19

  • block methods for maxwell’s equation

    alternative p solve # of it. per RHS eff.

    GMRES 1 3 078.4 20 068 627 −GCRO-DR 1 1 836.9 10 701 334 1.7BGMRES 32 724.8 158 − 4.2

    BGCRO-DR 8 677.6 524 131 4.5BGCRO-DR 32 992.3 127 − 3.1

    • (m, k) = (50, 10) for solving 32 RHSs• 2048 subdomains and 2 threads per subdomain

    • alternative #1 to #5 =⇒ 158× fewer iterations• working on all 32 RHSs is costly (#4 vs. #5)

    19

  • block methods for maxwell’s equation

    alternative p solve # of it. per RHS eff.

    GMRES 1 3 078.4 20 068 627 −GCRO-DR 1 1 836.9 10 701 334 1.7BGMRES 32 724.8 158 − 4.2

    BGCRO-DR 8 677.6 524 131 4.5BGCRO-DR 32 992.3 127 − 3.1

    • (m, k) = (50, 10) for solving 32 RHSs• 2048 subdomains and 2 threads per subdomain• alternative #1 to #5 =⇒ 158× fewer iterations

    • working on all 32 RHSs is costly (#4 vs. #5)

    19

  • block methods for maxwell’s equation

    alternative p solve # of it. per RHS eff.

    GMRES 1 3 078.4 20 068 627 −GCRO-DR 1 1 836.9 10 701 334 1.7BGMRES 32 724.8 158 − 4.2

    BGCRO-DR 8 677.6 524 131 4.5BGCRO-DR 32 992.3 127 − 3.1

    • (m, k) = (50, 10) for solving 32 RHSs• 2048 subdomains and 2 threads per subdomain• alternative #1 to #5 =⇒ 158× fewer iterations• working on all 32 RHSs is costly (#4 vs. #5)

    19

  • features and whish list

  • summary of used features

    • LU, LDLT

    • distributed or centralized solution• BLR• multiple RHS• detection of pivots and size of nullspace• computation of Schur complement• working host/distributed input

    20

  • whish list

    • distributed RHS• efficient handling of block matrices

    Thank you for MUMPSand for your attention!

    21

  • whish list

    • distributed RHS• efficient handling of block matrices

    Thank you for MUMPSand for your attention!

    21

    IntroductionOverlapping Schwarz methodsSubstructuring preconditionersLimitations of one-level methods

    Two-level preconditionersMathematical point of view

    Numerical resultsLarge-scale scalability studiesBLRMultiple right-hand sides

    Features and whish list