Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments
Luis E. Garcia-Castillo (1), Daniel Garcia-Donoro (2), Adrian Amor Martin (1)
(1) Departamento de Teoría de la Señal y Comunicaciones, Universidad Carlos III de Madrid, Spain. [luise,aamor]@tsc.uc3m.es
(2) Xidian University, Xi'an, China. [email protected]
GREMA-UC3M, Higher Order Finite Element Code, v1.2



UC3M: a young university, established in 1989

3 campuses in the Madrid region:
• Getafe, 11 km from the capital
• Leganés, 12 km from the capital
• Colmenarejo, 45 km from the capital

3 schools (Bachelor programs):
• Social and Legal Sciences (GC)
• Humanities, Communication and Library Sciences (GC)
• Polytechnic School (LC)

1 Center for Advanced Studies (for Master programs)

Outline

1. Antecedents
2. Parallel Higher-Order FEM Code: Electromagnetic Modeling Features; Computational Features and Implementation; GUI with HPCaaS Interface
3. MUMPS Interface: Getting Experience with MUMPS; Present Stage of Development; Some Issues with Memory; Specialized Interfaces
4. Applications & Performance
5. Work in Progress and Future Work

Antecedents

More than 20 years of experience on numerical methods for EM (mainly FEM, but also others). Contributions on:
• Curl-conforming basis functions
• Non-standard mesh truncation technique (FE-IIEE) for scattering and radiation problems
• Adaptivity: h and hp strategies
• Hybridization with MoM and high-frequency techniques such as PO/PTD and GTD/UTD

Code written from scratch, mainly during the PhD thesis of D. Garcia-Donoro, with parallel processing (MPI) and HPC in mind:
• Inclusion of well-proven research techniques developed within the research group
• "Reasonably" friendly, so it can be used by non-developers

FEM Formulation: Double-Curl Vector Wave PDE

Formulation based on the double-curl vector wave equation (use of E or H):

\nabla \times \left( \bar{f}_r^{-1} \, \nabla \times \mathbf{V} \right) - k_0^2 \, \bar{g}_r \, \mathbf{V} = -j k_0 h_0 \, \mathbf{P} + \nabla \times \left( \bar{f}_r^{-1} \, \mathbf{L} \right)

Table: formulation magnitudes and parameters

         | V | f̄_r | ḡ_r | h   | P  | L  | Γ_D   | Γ_N
Form E   | E | μ̄_r | ε̄_r | η   | J  | M  | Γ_PEC | Γ_PMC
Form H   | H | ε̄_r | μ̄_r | 1/η | M  | −J | Γ_PMC | Γ_PEC

FEM Formulation: Boundary Conditions and Excitations

The boundary conditions considered are of Dirichlet, Neumann, and Cauchy types:

\hat{n} \times \mathbf{V} = \Psi_D  over Γ_D   (1)
\hat{n} \times \left( \bar{f}_r^{-1} \nabla \times \mathbf{V} \right) = \Psi_N  over Γ_N   (2)
\hat{n} \times \left( \bar{f}_r^{-1} \nabla \times \mathbf{V} \right) + \gamma \, \hat{n} \times \hat{n} \times \mathbf{V} = \Psi_C  over Γ_C   (3)

• Periodic boundary conditions on a unit cell (infinite-array approach)
• Analytic boundary conditions for waveports of common waveguides, and a numerical waveport for arbitrary waveguides by means of a 2D eigenvalue/eigenmode characterization
• Lumped RLC (resistors, coils, and capacitors) elements and ports
• Impressed electric and magnetic currents; plane waves

FEM Formulation: Variational Formulation

Use of H(curl) spaces:

H(curl)_0 = { W ∈ H(curl) : \hat{n} \times W = 0 on Γ_D }   (4)
H(curl) = { W ∈ L^2 : \nabla \times W ∈ L^2 }   (5)

and the Galerkin method: find V ∈ H(curl) such that c(F, V) = l(F) for all F ∈ H(curl)_0, with

c(\mathbf{F}, \mathbf{V}) = \int_\Omega (\nabla \times \mathbf{F}) \cdot \left( \bar{f}_r^{-1} \nabla \times \mathbf{V} \right) d\Omega \; - \; k_0^2 \int_\Omega \mathbf{F} \cdot \bar{g}_r \mathbf{V} \, d\Omega \; + \; \gamma \int_{\Gamma_C} (\hat{n} \times \mathbf{F}) \cdot (\hat{n} \times \mathbf{V}) \, d\Gamma_C

l(\mathbf{F}) = -j k_0 h_0 \int_\Omega \mathbf{F} \cdot \mathbf{P} \, d\Omega \; - \; \int_{\Gamma_N} \mathbf{F} \cdot \Psi_N \, d\Gamma_N \; - \; \int_{\Gamma_C} \mathbf{F} \cdot \Psi_C \, d\Gamma_C \; - \; \int_\Omega \mathbf{F} \cdot \nabla \times \left( \bar{f}_r^{-1} \mathbf{L} \right) d\Omega

FEM Discretization: Higher-Order Curl-Conforming Elements

Own family of higher-order isoparametric curl-conforming finite elements (tetrahedron, prism, and hexahedron, the latter under test). Rigorous implementation of Nédélec's mixed-order elements.

[Figure: 2nd-order versions of the tetrahedron and prism, with local DOF numbering in the (ξ, η, ζ) reference coordinates]

Mesh Truncation with FE-IIEE: Problem Set-Up

Open-region problems are handled (optionally) by means of FE-IIEE (Finite Element-Iterative Integral Equation Evaluation) ⇒ an asymptotically exact absorbing boundary condition.

[Figure: problem set-up. The FEM domain Ω_FEM, bounded by the surface S, contains media (ε_1, μ_1) and (ε_2, μ_2), PEC and PMC objects, and impressed currents J, M; equivalent currents J_eq, M_eq are defined on an interior surface S′, and the exterior domain Ω_EXT carries the incident fields E_inc, H_inc.]

Mesh Truncation with FE-IIEE: Algorithm

Local BC for FEM (sparse matrices):

\hat{n} \times \left( \bar{f}_r^{-1} \nabla \times \mathbf{V} \right) + \gamma \, \hat{n} \times \hat{n} \times \mathbf{V} = \Psi^{INC} + \Psi^{SCAT}  over S

Iterative estimation of Ψ^{SCAT} by the exterior equivalence principle on S′:

\mathbf{V}^{FE\text{-}IIEE} = \oiint_{S'} \left( \mathbf{L}_{eq} \times \nabla G \right) dS' \; - \; j k_0 h_0 \oiint_{S'} \mathbf{P}_{eq} \left( G + \frac{1}{k_0^2} \nabla \nabla G \right) dS'

\nabla \times \mathbf{V}^{FE\text{-}IIEE} = j k_0 h_0 \oiint_{S'} \left( \mathbf{P}_{eq} \times \nabla G \right) dS' \; - \; \oiint_{S'} \mathbf{L}_{eq} \left( k_0^2 G + \nabla \nabla G \right) dS'

\Psi^{SCAT} = \hat{n} \times \left( \bar{f}_r^{-1} \nabla \times \mathbf{V}^{FE\text{-}IIEE} \right) + \gamma \, \hat{n} \times \hat{n} \times \mathbf{V}^{FE\text{-}IIEE}

Mesh Truncation with FE-IIEE: Flow Chart

[Flow chart: the FEM code for non-open problems (mesh → computation and assembly of element matrices → imposition of the BCs not related to S → sparse solver → postprocess) is wrapped in an outer loop. An initial BC Ψ^(0)(r) is imposed on S; after each solve, V and ∇×V on Γ_S yield the equivalent currents J_eq^(i)(r′), M_eq^(i)(r′), from which the BC on S is upgraded to Ψ^(i+1)(r).]
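The outer loop factorizes the FEM system once and then only updates the RHS at each FE-IIEE pass. A minimal Python sketch of that structure, with scipy's sparse LU standing in for the direct solver and a hypothetical `radiate` callback standing in for the equivalence-principle integrals:

```python
# Sketch of the FE-IIEE outer loop: factorize once, update only the RHS.
# `radiate` is a hypothetical stand-in for the equivalence-principle
# integrals that map the current solution to the scattered-field BC term.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def fe_iiee_solve(A, b_inc, radiate, tol=1e-4, max_iter=50):
    lu = spla.splu(A.tocsc())            # factorize the FEM matrix once
    v_old = lu.solve(b_inc)              # solve with the initial BC on S
    for _ in range(max_iter):
        b = b_inc + radiate(v_old)       # upgrade the BC on S -> new RHS
        v = lu.solve(b)                  # re-solve with the stored factors
        if np.linalg.norm(v - v_old) <= tol * np.linalg.norm(v):
            return v                     # convergence criterion satisfied
        v_old = v
    return v_old

# toy check: with A = 2I and radiate(v) = 0.1 v, the fixed point is v = 1/1.9
A_demo = 2.0 * sp.identity(4, format="csc")
v_demo = fe_iiee_solve(A_demo, np.ones(4), lambda w: 0.1 * w)
```

The point of the structure is that the expensive factorization is amortized over all FE-IIEE iterations; only forward/backward substitutions are repeated.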

Computational Features
• Code written using modern Fortran constructs (F2003)
• Strong emphasis on code maintainability through OOP (Object-Oriented Programming) paradigms
• Numerical verification by the Method of Manufactured Solutions
• Numerical validation by tests against EM benchmark problems
• Hybrid MPI+OpenMP programming
• Direct solver interfaces (HSL, MUMPS, MKL Pardiso, ...)
• Graphical User Interface (GUI) with HPCaaS interface
• Linux & Windows versions

Computational Features: Towards HPC
• "Rethink" some of the OOP constructions (e.g., arrays of small derived types, ...)
• Global mesh object → local mesh objects on each processor
• Specialized direct solver interfaces
• Problems of several tens of millions of unknowns on more than one thousand cores

Parallel Flow Chart of the Code

[Flow chart: all processes p0, p1, p2, ..., pn read the input data and create the global mesh object; after the numbering of DOFs, processes p1...pn deallocate the global mesh object while p0 initializes the solver; each process creates its local mesh object and extracts the boundary mesh; all processes calculate and fill the LHS matrix and perform the matrix factorization; p0 calculates and fills the RHS; the system of equations is solved; if FE-IIEE is enabled, the scattered field is calculated, p0 updates the global RHS, and the system is solved again until convergence or the iteration limit is reached; the loop is repeated for any remaining frequencies, followed by postprocessing.]

Graphical User Interface: Features
• GUI based on GiD, a general-purpose pre- and post-processor (http://gid.cimne.upc.es)
• Creation (or import) of the geometry model of the problem
• Mesh generation
• Assignment of material properties and boundary conditions
• Visualization of results
• Integration with Posidonia (in-house HPCaaS)

Posidonia: In-house HPCaaS Solution
• Eases the use of HPC platforms
• Remote job submission to HPC infrastructures
• Designed with security, user-friendliness, collaborative computing, and mobility in mind
• Manages all communication with the remote computer system (file transfers, ...)
• Interacts with its batch system (job scheduler)
• History repository of simulations
• Notification when a submitted job is completed
• Transparent download of the results for local visualization
• Posidonia is also available as a stand-alone desktop/Android/Web solution (also for general use with other simulators and/or applications)

A. Amor-Martin, I. Martinez-Fernandez, L. E. Garcia-Castillo, "Posidonia: A Tool for HPC and Remote Scientific Simulations," IEEE Antennas and Propagation Magazine, 166–177, Dec. 2015.

GUI with HPCaaS Interface

[Screenshot of the GUI with the HPCaaS interface]

Getting Experience with MUMPS: From "MUMPS, do it all" to ...

Matrix format:
• Elemental
• Assembled (centralized on process 0)
• Assembled (distributed among processes)
• Asking MUMPS for Schur complements and "playing" with them (outside MUMPS)

RHS and solution:
• Dense RHS
• Sparse RHS (large number of RHS vectors)
• Centralized solution
• Distributed solution (waiting for the distributed-RHS feature ...)

MUMPS Interface: Present Stage of Development (version 5.0.2)
• MUMPS initialization
• Call to ParMETIS (or PT-Scotch) to partition the matrix among processors
   – Other alternatives for partitioning have been considered due to memory problems (commented on in the following slides)
• Computation of the FEM matrix coefficients associated with each local process
• Input of the matrix coefficients to MUMPS in distributed assembled format
• Call to MUMPS for the matrix factorization
• Computation of the FEM RHS coefficients on process 0 (in blocks of 10–20 vectors) in sparse format
• Call to MUMPS for the system solve
• (FE-IIEE enabled) Iterative update of the RHS and system solve until the error criterion is satisfied
• MUMPS finalization

* Frequent use of the out-of-core (OOC) capabilities of MUMPS
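The distributed assembled format can be mimicked outside MUMPS: each process contributes local (row, column, value) triplets, and duplicate entries are summed on assembly, as MUMPS does for assembled matrices. A small sketch with illustrative names (this is not the MUMPS API):

```python
# Sketch: merging per-process COO triplets into one global matrix, with
# duplicate entries summed -- mimicking MUMPS's distributed assembled input.
# The (irn, jcn) indices are 1-based, as in MUMPS.
import numpy as np
import scipy.sparse as sp

def merge_local_triplets(n, local_chunks):
    """local_chunks: one (irn, jcn, a) triplet set per 'process'."""
    irn = np.concatenate([c[0] for c in local_chunks]) - 1   # to 0-based
    jcn = np.concatenate([c[1] for c in local_chunks]) - 1
    val = np.concatenate([c[2] for c in local_chunks])
    # coo_matrix sums duplicated (i, j) entries on conversion to CSR
    return sp.coo_matrix((val, (irn, jcn)), shape=(n, n)).tocsr()

# toy check: two "processes" both contribute to entry (1, 1)
chunk0 = (np.array([1, 2]), np.array([1, 2]), np.array([1.0, 2.0]))
chunk1 = (np.array([1]), np.array([1]), np.array([3.0]))
A_demo = merge_local_triplets(2, [chunk0, chunk1])
```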

Some Issues with Memory: Memory Allocation inside MUMPS

Memory issue:
• A peak in memory use during the analysis phase has been detected (distributed assembled format)
• Found to be due to a memory allocation inside MUMPS routines, related to the maximum MAXS among processors

Listing 1: file zana_aux_par.F

1589  SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
1647    MAXS = ordLAST(I)-ordFIRST(I)+1
1653    ALLOCATE(SIPES(max(1,MAXS), NPROCS))
1864  END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH

Some Issues with Memory (cont.): Memory Allocation inside MUMPS

Memory issue, example: for a 45,000,000-DOF problem using 192 processes and 4 bytes per integer, a MAXS bandwidth of 45,000,000 ⇒ 34.56 GB of memory per process.

Workaround:
• Matrix partition based on rows instead of elements of the mesh
   – slightly worse LU fill-in (size of cofactors) than with the partition based on elements
• Change the ordering of DOFs given as input to MUMPS (to be done)

It may be that we are doing something completely wrong ...
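The per-process figure can be checked directly: SIPES is an integer array of shape (max(1, MAXS), NPROCS), so its footprint is MAXS × NPROCS × 4 bytes.

```python
# Size of the SIPES allocation: MAXS x NPROCS integers at 4 bytes each.
def sipes_gigabytes(maxs, nprocs, bytes_per_int=4):
    return maxs * nprocs * bytes_per_int / 1e9

peak = sipes_gigabytes(45_000_000, 192)   # 34.56 GB per process
```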

Some Issues with Memory: Large Number of RHS & Solution Vectors

Large number of RHS vectors: analysis of a given problem under a large number of excitations. Examples:
• Monostatic radar cross section (RCS) prediction
• Large arrays of antennas

Present MUMPS interface: treatment of RHS & solution vectors in blocks (typically 10–20 vectors at a time)
• Use of sparse format for the RHS
• Use of centralized solution vectors
   – The reason for treating RHS & solution vectors in blocks is to limit the memory needed to store the solution vectors
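The block treatment can be sketched as follows, with scipy's sparse LU standing in for the MUMPS factorization and a hypothetical `consume` hook standing in for the postprocessing of each solution block:

```python
# Sketch: many sparse RHS vectors solved in small blocks so that only one
# block of dense solution columns is held in memory at a time.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def solve_in_blocks(A, B_sparse, block_size, consume):
    """B_sparse: n x nrhs sparse matrix, one excitation per column.
    consume(X) is a hypothetical hook (e.g. far-field evaluation) applied
    to each dense solution block before it is discarded."""
    lu = spla.splu(A.tocsc())                 # factorize once
    nrhs = B_sparse.shape[1]
    for j0 in range(0, nrhs, block_size):
        block = B_sparse[:, j0:j0 + block_size].toarray()  # densify one block
        consume(lu.solve(block))              # solve block, postprocess, free

# toy check: 4 unit excitations against A = 2I, in blocks of 2
blocks = []
solve_in_blocks(2.0 * sp.identity(4, format="csc"),
                sp.identity(4, format="csc"),
                block_size=2, consume=lambda X: blocks.append(X.copy()))
```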

Some Issues with Memory (cont.): Large Number of RHS & Solution Vectors

Distributed solution planned for the near future:
• Update of the centralized RHS by FE-IIEE ⇒ use of a centralized solution is "natural" (easy in terms of code maintenance)
• Wish list: distributed RHS. Is the distributed-RHS feature planned for near-future versions of MUMPS?

Specialized MUMPS Interfaces: Repetitive Solver

[Figures: Vivaldi antenna element; the antenna array; (a) unit cell and (b) unit-cell mesh; virtual mesh of the antenna array]

Specialized MUMPS Interfaces (cont.): Repetitive Solver

Algorithm:
1. Computation of the Schur complement of the unit cell
2. Assembly of the Schur complements of all "virtual" cells ⇒ interface problem
3. Addition of boundary conditions to the interface problem
4. Solve the interface problem
5. Solve the interior unit-cell problems
   – identical matrices with different right-hand sides
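A dense-algebra sketch of these steps, under the simplifying assumption that cells do not couple through shared interface DOFs (so the assembled interface problem is block-diagonal and can be solved cell by cell); `Aii`, `Aib`, `Abi`, `Abb` are the interior/interface blocks of the unit-cell matrix:

```python
# Sketch of the repetitive solver: one Schur complement S of the unit cell
# serves every "virtual" cell; each cell then needs only an interface solve
# followed by an interior solve. Dense NumPy stands in for MUMPS.
import numpy as np

def repetitive_solve(Aii, Aib, Abi, Abb, f_cells):
    """f_cells: list of per-cell RHS pairs (f_interior, f_interface)."""
    S = Abb - Abi @ np.linalg.solve(Aii, Aib)          # step 1: Schur complement
    sols = []
    for f_i, f_b in f_cells:                           # identical matrices,
        g = f_b - Abi @ np.linalg.solve(Aii, f_i)      # different RHS
        x_b = np.linalg.solve(S, g)                    # step 4: interface solve
        x_i = np.linalg.solve(Aii, f_i - Aib @ x_b)    # step 5: interior solve
        sols.append((x_i, x_b))
    return sols

# toy check against the full 2x2 cell matrix [[2, 1], [1, 3]] with RHS (1, 1)
sols_demo = repetitive_solve(np.array([[2.0]]), np.array([[1.0]]),
                             np.array([[1.0]]), np.array([[3.0]]),
                             [(np.array([1.0]), np.array([1.0]))])
```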

Specialized MUMPS Interfaces (cont.): Repetitive Solver

Features and remarks:
• Advantages: savings in time and memory
• Under certain circumstances (number of cells equal to a power of 2, and no BC), all leaves of a given level of the tree are identical:
   – further saving in time
   – large saving in memory
• Boundary conditions (BC) alter this one-branch tree behavior ⇒ BC may be left up to the root of the tree, or "algebraic symmetry" can be explored

Specialized MUMPS Interfaces (cont.): Repetitive Solver

[Figure]

Specialized MUMPS Interfaces: Hybrid Direct & Iterative Solver
• Multifrontal algorithm on only a few levels
• Iterative solution from the last level of the multifrontal algorithm
• Can be understood as the direct solver acting as a preconditioner for the iterative solver
• Natural approach to some DDM strategies
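The "direct solver as preconditioner" idea can be illustrated with scipy stand-ins: an incomplete LU factorization (playing the role of a truncated multifrontal solve) preconditioning a GMRES iteration:

```python
# Sketch: hybrid direct & iterative solve -- an approximate factorization
# (here, ILU) used as preconditioner for a Krylov iteration (GMRES).
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def hybrid_solve(A, b):
    ilu = spla.spilu(A.tocsc(), drop_tol=1e-4)           # cheap approximate factors
    M = spla.LinearOperator(A.shape, matvec=ilu.solve)   # preconditioner M ~ A^{-1}
    x, info = spla.gmres(A, b, M=M)                      # Krylov iteration
    return x, info

# toy check: 1D Laplacian, RHS chosen so the exact solution is all ones
n = 50
A_demo = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
x_demo, info_demo = hybrid_solve(A_demo, A_demo @ np.ones(n))
```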

Specialized MUMPS Interfaces: Dealing with the Lack of Availability of LU Cofactors
• Calls to multiple (typically sequential) MUMPS instances for the Schur complements
• Assembling the Schur complements
• Finalizing the MUMPS instances
• Solving the interface problem
• Creating new MUMPS instances to solve the interior problems

Specialized MUMPS Interfaces (cont.): Dealing with the Lack of Availability of LU Cofactors

MUMPS instances for interior problems (idea inspired by work led by Prof. Paszynski):
• Reproduction (or restoration) of the interior matrix equation and the interior right-hand side
• Calls to multiple (typically sequential) MUMPS instances to factorize/solve the interior problems
• Use of Dirichlet conditions for the interface unknowns
• Preliminary tests show that the approach pays off in memory (as expected) while remaining competitive in time

Regular MUMPS & MULTISOLVER: Time and Memory Comparison

[Plot: solve time in seconds (0 to 1000) vs. number of unknowns (1×10^6 to 6×10^6) for MULTISOLVER and MUMPS]

Regular MUMPS & MULTISOLVER (cont.): Time and Memory Comparison

[Plot: memory in GB (0 to 140) vs. number of unknowns (1×10^6 to 6×10^6) for MULTISOLVER and MUMPS, against the system limit of 128 GB]

HPC Environment: Cluster of Xidian University (XDHPC)

140 compute nodes, each with:
• Two twelve-core Intel Xeon 2690 v2 2.2 GHz CPUs
• 64 GB of RAM
• 1.8 TB of hard disk
56 Gbps InfiniBand network

Waveguide Problem: Low-Pass Filter with Higher-Order Mode Suppression
• Band: 10–16 GHz
• Length: 218 mm
• 3,245 K tetrahedra
• 22 M unknowns
• Wall time: 73 min per frequency point

I. Arregui et al., "High-power low-pass harmonic filters with higher-order TEn0 and non-TEn0 mode suppression: design method and multipactor characterization," IEEE Trans. Microw. Theory Techn., Dec. 2013.

Waveguide Problem (cont.): Low-Pass Filter with Higher-Order Mode Suppression

[Figures: filter geometry and simulated results]

Scattering Problem: Bistatic RCS of a Car
• Bistatic RCS at 1.5 GHz
• Tyres modeled as dielectric (ε_r = 4.0)
• Several incident plane waves from different directions

Scattering Problem (cont.): Bistatic RCS of a Car
• 2.7 M tetrahedra
• 17.3 M unknowns
• Wall time: 59 min per frequency point (46 compute nodes)

Scattering Problem (cont.): Bistatic RCS of a Car

[Figures: bistatic RCS patterns for an incident plane wave arriving from behind and for an incident plane wave arriving from the front]

Parallel Scalability: Factorization and FE-IIEE Stages

[Plots: speedup vs. number of cores (120 to 480) for 12, 6, and 4 threads per process. Left: factorization + solving phase. Right: FE-IIEE mesh truncation phase.]

Benchmark: bistatic RCS of an Impala

Parallel Scalability: Whole Code

[Plot: factorization + solving + FE-IIEE speedup vs. number of cores (120 to 480) for 12, 6, and 4 threads per process]

Work in Progress and Future Work

Work in progress:
• Hierarchical basis functions of variable order p
• h-adaptivity ⇒ support for hp meshes

Future work:
• Conformal and non-conformal DDM
• Hybrid (direct + iterative) solver

Thanks for your attention

Thanks to the MUMPS team

Page 2: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

UC3M a young University established in 1989

3 Campuses in Madrid Regionbull Getafe 11km far from capitalbull Leganeacutes 12km far from capitalbull Colmenarejo 45km far from capital 3 Schools ( Bachelor programs ) bull Social and Legal Sciences (GC)bull Humanities Communication and Library

Sciences (GC)bull Polytechnic School (LC)

1 Center for Advanced Studiesbull For Master programs

Colmenarejo

Leganeacutes Getafe

Outline

1 Antecedents

2 Parallel Higher-Order FEM CodeElectromagnetic Modeling FeaturesComputational Features and ImplementationGUI with HPCaaS Interface

3 MUMPS InterfaceGetting Experience with MUMPSPresent Stage of DevelopmentSome Issues with MemorySpecialized Interfaces

4 Applications amp Performance

5 Work in Progress and Future Work

GREMA-UC3M Higher Order Finite Element Code v12

Antecedents

More than 20 years of experience onnumerical methods for EM (mainlyFEM but also others) Contributionson

I Curl-conforming basis functionsI Non-standard mesh truncation

technique (FE-IIEE) for scatteringand radiation problems

I Adaptivity h and hp strategiesI Hybridization with MoM and high

frequency techniques such asPOPTD and GTDUTD

I

Code writing from scratch mainlyduring PhD thesis of DGarcia-DonoroParallel processing (MPI) and HPC inmind

Inclusion of well-provenresearch techniquesdeveloped within the researchgroupldquoReasonablerdquo friendly to beused by non-developers

GREMA-UC3M Higher Order Finite Element Code v12

FEM FormulationDouble-Curl Vector Wave PDE

Formulation based on double curl vector wave equation (use of E or H)

nablatimes (fminus1r nablatimes V

)minus k2

0 gr V = minusjk0H0P +nablatimes fminus1r Q

Table Formulation magnitudes and parameters

V macrfr macrgr h P L ΓD ΓN

Form E E macrmicror macrεr η J M ΓPEC ΓPMC

Form H H macrεr macrmicror1η M minusJ ΓPMC ΓPEC

GREMA-UC3M Higher Order Finite Element Code v12

FEM FormulationBoundary Conditions and Excitations

The boundary conditions considered are of Dirichlet Neumann andCauchy types

ntimes V = ΨD over ΓD (1)

ntimes(

macrfrminus1nablatimes V

)= ΨN over ΓN (2)

ntimes(

macrfrminus1nablatimes V

)+ γ ntimes ntimes V = ΨC over ΓC (3)

Periodic Boundary Conditions on unit cell (infinite array approach)

Analytic boundary conditions for waveports of common waveguides andalso numerical waveport for arbitrary waveguides by means of 2Deigenvalueeigenmode characterization

Lumped RLC (resistance coils and capacitors) elements an ports

Impressed electric and magnetic currents plane waves

GREMA-UC3M Higher Order Finite Element Code v12

FEM FormulationVariational Formulation

Use of H(curl) spaces

H(curl)0 = W isin H(curl) ntimesW = 0 on ΓD (4)

H(curl) = W isin L2 nablatimesW isin L2 (5)and Galerkin method

Find V isin H(curl) such that c(FV) = l(F) forallF isin H(curl)0

c(FV) =

int

Ω

(nablatimes F) middot(

macrfrminus1nablatimes V

)dΩminus k2

0

int

Ω

(F middot macrgr V

)dΩ +

γ

int

ΓC

(ntimes F) middot (ntimes V) dΓC

l(F) = minusjk0h0

int

Ω

F middot P dΩ minusint

ΓN

F middotΨN dΓN minusint

ΓC

F middotΨC dΓC

minusint

Ω

F middotnablatimes(

macrfrminus1

L)

GREMA-UC3M Higher Order Finite Element Code v12

FEM DiscretizationHigher-Order Curl-Conforming Element

Own family of higher order isoparametric curl-conforming finite elements(tetrahedron prism hexahedron mdashunder testmdash)Rigorous implementation of Nedelecrsquos mixed order elements

4

2

31

13 14

15

16

18

17

19 20

ζ

ξ

η1

2 3

4

7

8

9

10

11

12

6 5

1

2

3

4

5

6

1

2

3

4

56

7

89

10

1112

16

15

17

18

20

19

22

21

2324

25

26

27

28

29

30

31

32

3334

35

36

ζ

ξ

η

14

13

Example 2nd order versions of tetra and prism

GREMA-UC3M Higher Order Finite Element Code v12

Mesh Truncation with FE-IIEEProblem Set-Up

Open region problems (optionally) by means of FE-IIEE (Finite Element- Iterative Integral Equation Evaluation)rArr Asymptotically exact absorbing boundary condition

ε1 micro1

Medium 1

ε2 micro2

Medium 2

PEC

PMC

PEC

J

nprime

n

M

S prime

Meq

Jeq

S

Exterior Domain(ΩEXT)EincHinc

FEM Domain(ΩFEM)

GREMA-UC3M Higher Order Finite Element Code v12

Mesh Truncation with FE-IIEEAlgorithm

Local BC for FEM (sparse matrices)

ntimes(

macrfrminus1nablatimes V

)+ γ ntimes ntimes V = ΨINC + ΨSCAT over S

Iterative estimation of ΨINC by exterior Equivalence Principle on Sprime

VFE-IIEE =

intintcopy

Sprime(Leq timesnablaG) middot dSprime minus jk0h0

intintcopy

Sprime

(Oeq

(G +

1k2

0nablanablaG

))middot dSprime

nabla times VFE-IIEE = jk0h0

intintcopy

Sprime(Oeq timesnablaG) dSprime minus

intintcopy

Sprime

(Leq

(k2

0 G + nablanablaG))

middot dSprime

ΨSCAT = ntimes(

macrfrminus1nablatimes VFE-IIEE

)+ γ ntimes ntimes VFE-IIEE

GREMA-UC3M Higher Order Finite Element Code v12

Mesh Truncation with FE-IIEEFlow Chart

Assembly ofElement Matrices

Imposition of BCNon Related to S

PostprocessSparseSolver

InitialBC on S

Computation ofElement Matrices

Upgrade of BC on

Mesh

FEM code for nonminusopen problems

Ψ(0)(r)

S Ψ(i+1)(r)

J(i)eq (r prime)M(i)

eq (r prime)rArr

V (r isin ΓS)nablatimesV (r isin ΓS)

GREMA-UC3M Higher Order Finite Element Code v12

Computational Features

Computational FeaturesCode written using modern Fortran constructions (F2003)Strong emphasis in code maintainability by use of OOP (ObjectOriented Programming) paradigmsNumerical verification by use of the Method of Manufactured SolutionsNumerical validation by tests with EM benchmark problemsHybrid MPI+OpenMP programmingDirect solver interfaces (HSL MUMPS MKL Pardiso )Graphical User Interface (GUI) with HPCaaS InterfaceLinux amp Windows versions

GREMA-UC3M Higher Order Finite Element Code v12

Computational FeaturesTowards HPC

Towards HPCldquoRethinkrdquo some of the OOP constructions (eg arrays of small derivedtypes )Global mesh objectrarr local mesh objects on each processorSpecialized direct solver interfaces Problems of several tens of millions of unknowns on more than onethousand cores

GREMA-UC3M Higher Order Finite Element Code v12

Parallel Flow Chart of the Code

MPI Division

FE-IIEE enabled

Yes

No

p0 p1 p2 pn ReadInput Data

p0 p1 p2 pn Create global mesh object

Numberingof DOFs

p1 p2 pn Deallocateglobal mesh

object

p0

Initializesolver

p0 p1 p2 pn Create local mesh object

p0 p1 p2 pn Extractboundary

mesh

p0 p1 p2 pn

p0 p1 p2 pn Calculate ampfill LHS matrix

p0 p1 p2 pn Matrixfactorization

p0 Calculate ampfill RHS matrix

Solve systemof equations

p0 p1 p2 pn

Yes

Yes

Calculatescatteringfield

p0 Updateglobal RHS

FE-IIEE enabled

p0 p1 p2 pn

More frequencies

Postprocess

No

No

Solve systemof equations

p0 p1 p2 pn

ConvIter achieved

Yes

No

GREMA-UC3M Higher Order Finite Element Code v12

Graphical User InterfaceFeatures

FeaturesGUI based on a general purpose pre- and post-processor called GiDhttpgidcimneupces

Creation (or importation) of the geometry model of the problemMesh generationAssignation of material properties and boundary conditionsVisualization of resultsIntegration with Posidona (in-house HPCaaS)

GREMA-UC3M Higher Order Finite Element Code v12

PosidoniaIn-house HPCaaS Solution

Easing the use of HPC platformsRemote job-submission to HPC infraestructuresDesigned with security user-friendliness collaborative-computing andmobility in mindManagement of all the communication with the remote computer system(file transfers )Interaction with its batch system (job scheduler)History repository of simulationsNotification when job submitted is completedTransparent downloading of the results to visualize them locallyPosidonia also available as stand-alone desktopAndroidWeb solution(also for general use with other simulator andor applications)

A Amor-Martin I Martinez-Fernandez L E Garcia-Castillo ldquoPosidonia A Tool for HPC andRemote Scientific Simulationsrdquo IEEE Antennas and Propagation Magazine 6166ndash177 Dec2015

GREMA-UC3M Higher Order Finite Element Code v12

GUI with HPCaaS InterfaceScreenshot

GREMA-UC3M Higher Order Finite Element Code v12

Getting Experience with MUMPSFrom ldquoMUMPS do it allrdquo to

Matrix FormatElementalAssembled (centralized on process 0)Assembled (distributed among processes)Asking to MUMPS for Schur complements and ldquoplayingrdquo with them(outside MUMPS)

RHS and solutionDense RHSSparse RHS (large number of RHS vectors)Centralized solutionDistributed solution (waiting for distributed RHS feature )

GREMA-UC3M Higher Order Finite Element Code v12

MUMPS InterfacePresent Stage of Development (version 502)

MUMPS initializationCall to ParMETIS (or PT-Scotch) to partition matrix among processors

I Other alternatives for partitioning have been considered due to memoryproblems (commented in following slides)

Computation of FEM matrix coefficients associated to each localprocessInput of matrix coefficients to MUMPS in distributed assembled formatCall to MUMPS for matrix factorizationComputation of FEM RHS coefficients on process 0 (in blocks of 10-20vectors) in sparse formatCall to MUMPS for system solve(FE-IIEE enabled) Iteratively update of RHS and system solve until errorcriterion is satisfiedMUMPS finalization

lowast Frequent use of out-of-core (OOC) capabilities of MUMPS

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with MemoryMemory Allocation inside MUMPS

Memory IssueA peak memory use during analysis phase has been detected(distributed assembled)Found out to be due to memory allocation inside MUMPS routinesrelated to maximum MAXS among processors

Listing 1 file zana aux parF

1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH1647 MAXS = ordLAST(I)-ordFIRST(I)+11653 ALLOCATE(SIPES(max(1MAXS) NPROCS))1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with Memory (cont)Memory Allocation inside MUMPS

Memory IssueExample 45000000 dof problem using 192 processes and 4 bytes perintegerMAXS bandwidth is 45000000rArr 3456 GB memory per process

WorkaroundMatrix partition based on rows instead of elements of the mesh

I Slightly worse LU fill-in (size of cofactors) than with partition based onelements

Change ordering of dof as input to MUMPS (to be done)

It may be the case we are doing something completely wrong

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with MemoryLarge Number of RHS amp Solution Vectors

Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples

Monostatic radar cross section (RCS) predictionLarge arrays of antennas

Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)

Use of sparse format for RHSUse of centralized solution vectors

I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors

Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesRepetitive Solver

Vivaldi antenna

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

(a) Unit Cell (b) Unit Cell Mesh

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Figure Virtual Mesh of Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface

problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems

I Identical matrices with different right hand sides

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical

I Further saving in timeI Large saving in memory

Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored


Specialized MUMPS Interfaces (cont.)
Repetitive Solver


Specialized MUMPS Interfaces
Hybrid Direct & Iterative Solver

Hybrid Direct & Iterative Solver
Multifrontal algorithm on only a few levels
Iterative solution from the last level of the multifrontal algorithm
It can be understood as the direct solver acting as a preconditioner for the iterative solver
Natural approach to some DDM strategies
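The "direct solver as preconditioner" idea can be sketched with a standard preconditioned conjugate gradient loop. In this sketch a block-diagonal exact solve stands in for stopping the multifrontal elimination after a few levels; this is an assumption for illustration, not the group's actual implementation:

```python
import numpy as np

def pcg(A, b, M_solve, tol=1e-10, maxit=200):
    """Preconditioned conjugate gradients; M_solve applies the preconditioner,
    here playing the role of the truncated (few-level) direct factorization."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_solve(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxit):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_solve(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# SPD test matrix; preconditioner = exact solves on the two diagonal blocks,
# ignoring their coupling (stand-in for an incomplete multifrontal sweep).
rng = np.random.default_rng(2)
n = 40
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)
b = rng.standard_normal(n)
B1, B2 = A[:n // 2, :n // 2], A[n // 2:, n // 2:]

def M_solve(r):
    return np.concatenate([np.linalg.solve(B1, r[:n // 2]),
                           np.linalg.solve(B2, r[n // 2:])])

x = pcg(A, b, M_solve)
assert np.allclose(A @ x, b, atol=1e-6)
```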


Specialized MUMPS Interfaces
Dealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU Cofactors
Calls to multiple (typically sequential) MUMPS instances for Schur complements
Assembling Schur complements
Finalizing MUMPS instances
Solve the interface problem
Create new MUMPS instances to solve the interior problems


Specialized MUMPS Interfaces (cont.)
Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problems
Idea inspired by work led by Prof. Paszynski

Reproduction (or restoration) of the interior matrix equation and interior right-hand side
Call to multiple (typically sequential) MUMPS instances to factorize/solve the interior problems
Use of Dirichlet conditions for interface unknowns
Preliminary tests show that the approach is worthwhile in memory (as expected) while remaining competitive in time


Regular MUMPS & MULTISOLVER
Time and Memory Comparison

[Figure: solve time in seconds (0 to 1000) vs. number of unknowns (1×10^6 to 6×10^6) for MULTISOLVER and MUMPS]


Regular MUMPS & MULTISOLVER (cont.)
Time and Memory Comparison

[Figure: memory in GB (0 to 140) vs. number of unknowns (1×10^6 to 6×10^6) for MULTISOLVER and MUMPS; system limit: 128 GB]


HPC Environment
Cluster of Xidian University (XDHPC)

140 compute nodes
- Two twelve-core Intel Xeon 2690 V2 2.2 GHz CPUs
- 64 GB of RAM
- 18 TB of hard disk
56 Gbps InfiniBand network
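A quick tally of the aggregate capacity implied by these node specifications:

```python
# Aggregate capacity of the XDHPC cluster described above.
nodes = 140
cores_per_node = 2 * 12            # two twelve-core CPUs per node
total_cores = nodes * cores_per_node
total_ram_gb = nodes * 64

print(total_cores, total_ram_gb)   # 3360 cores, 8960 GB of RAM in total
```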


Waveguide Problem
Low Pass Filter with Higher-Order Mode Suppression

10–16 GHz band
Length: 218 mm
324.5 K tetrahedrons
2.2 M unknowns
Wall time: 73 min per frequency point

I. Arregui et al., "High-power low-pass harmonic filters with higher-order TEn0 and non-TEn0 mode suppression: design method and multipactor characterization," IEEE Trans. Microw. Theory Techn., Dec. 2013


Waveguide Problem (cont.)
Low Pass Filter with Higher-Order Mode Suppression


Waveguide Problem (cont.)
Low Pass Filter with Higher-Order Mode Suppression


Scattering Problem
Bistatic RCS of Car

Bistatic RCS at 1.5 GHz
Tyres modeled as dielectric (εr = 4.0)
Several incident plane waves from different directions


Scattering Problem (cont.)
Bistatic RCS of Car

2.7 M tetrahedrons
17.3 M unknowns
Wall time: 59 min per frequency point (46 compute nodes)


Scattering Problem (cont.)
Bistatic RCS of Car

Incident plane wave arriving from behind


Scattering Problem (cont.)
Bistatic RCS of Car

Incident plane wave arriving from the front


Scattering Problem (cont.)
Bistatic RCS of Car

Incident plane wave arriving from the front


Parallel Scalability
Factorization and FE-IIEE Stages

[Figure: factorization + solving speedup vs. number of cores (120, 240, 360, 480) for 12, 6, and 4 threads per process]

[Figure: FE-IIEE (mesh truncation) speedup vs. number of cores (120, 240, 360, 480) for 12, 6, and 4 threads per process]

Benchmark: Bistatic RCS of Impala


Parallel Scalability
Whole Code

[Figure: factorization + solving + FE-IIEE speedup vs. number of cores (120, 240, 360, 480) for 12, 6, and 4 threads per process]
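For reference, the speedup and parallel efficiency reported in such plots are computed as follows, using the smallest run as reference (the timings below are made up for illustration):

```python
# speedup on c cores = T(c_ref) / T(c); efficiency rescales by c / c_ref.
def speedup(t_ref, t_c):
    return t_ref / t_c

def efficiency(t_ref, c_ref, t_c, c):
    return speedup(t_ref, t_c) * c_ref / c

t120, t480 = 100.0, 40.0   # hypothetical wall times (s) on 120 and 480 cores
assert speedup(t120, t480) == 2.5
assert efficiency(t120, 120, t480, 480) == 0.625
```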


Work in Progress and Future Work

Work in Progress
Hierarchical basis functions of variable order p
h-adaptivity ⇒ support for hp meshes

Future Work
Conformal and non-conformal DDM
Hybrid (direct + iterative) solver


Thanks for your attention

Thanks to the MUMPS team


Page 4: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Antecedents

More than 20 years of experience onnumerical methods for EM (mainlyFEM but also others) Contributionson

I Curl-conforming basis functionsI Non-standard mesh truncation

technique (FE-IIEE) for scatteringand radiation problems

I Adaptivity h and hp strategiesI Hybridization with MoM and high

frequency techniques such asPOPTD and GTDUTD

I

Code writing from scratch mainlyduring PhD thesis of DGarcia-DonoroParallel processing (MPI) and HPC inmind

Inclusion of well-provenresearch techniquesdeveloped within the researchgroupldquoReasonablerdquo friendly to beused by non-developers

GREMA-UC3M Higher Order Finite Element Code v12

FEM FormulationDouble-Curl Vector Wave PDE

Formulation based on double curl vector wave equation (use of E or H)

nablatimes (fminus1r nablatimes V

)minus k2

0 gr V = minusjk0H0P +nablatimes fminus1r Q

Table Formulation magnitudes and parameters

V macrfr macrgr h P L ΓD ΓN

Form E E macrmicror macrεr η J M ΓPEC ΓPMC

Form H H macrεr macrmicror1η M minusJ ΓPMC ΓPEC

GREMA-UC3M Higher Order Finite Element Code v12

FEM FormulationBoundary Conditions and Excitations

The boundary conditions considered are of Dirichlet Neumann andCauchy types

ntimes V = ΨD over ΓD (1)

ntimes(

macrfrminus1nablatimes V

)= ΨN over ΓN (2)

ntimes(

macrfrminus1nablatimes V

)+ γ ntimes ntimes V = ΨC over ΓC (3)

Periodic Boundary Conditions on unit cell (infinite array approach)

Analytic boundary conditions for waveports of common waveguides andalso numerical waveport for arbitrary waveguides by means of 2Deigenvalueeigenmode characterization

Lumped RLC (resistance coils and capacitors) elements an ports

Impressed electric and magnetic currents plane waves

GREMA-UC3M Higher Order Finite Element Code v12

FEM FormulationVariational Formulation

Use of H(curl) spaces

H(curl)0 = W isin H(curl) ntimesW = 0 on ΓD (4)

H(curl) = W isin L2 nablatimesW isin L2 (5)and Galerkin method

Find V isin H(curl) such that c(FV) = l(F) forallF isin H(curl)0

c(FV) =

int

Ω

(nablatimes F) middot(

macrfrminus1nablatimes V

)dΩminus k2

0

int

Ω

(F middot macrgr V

)dΩ +

γ

int

ΓC

(ntimes F) middot (ntimes V) dΓC

l(F) = minusjk0h0

int

Ω

F middot P dΩ minusint

ΓN

F middotΨN dΓN minusint

ΓC

F middotΨC dΓC

minusint

Ω

F middotnablatimes(

macrfrminus1

L)

GREMA-UC3M Higher Order Finite Element Code v12

FEM DiscretizationHigher-Order Curl-Conforming Element

Own family of higher order isoparametric curl-conforming finite elements(tetrahedron prism hexahedron mdashunder testmdash)Rigorous implementation of Nedelecrsquos mixed order elements

4

2

31

13 14

15

16

18

17

19 20

ζ

ξ

η1

2 3

4

7

8

9

10

11

12

6 5

1

2

3

4

5

6

1

2

3

4

56

7

89

10

1112

16

15

17

18

20

19

22

21

2324

25

26

27

28

29

30

31

32

3334

35

36

ζ

ξ

η

14

13

Example 2nd order versions of tetra and prism

GREMA-UC3M Higher Order Finite Element Code v12

Mesh Truncation with FE-IIEEProblem Set-Up

Open region problems (optionally) by means of FE-IIEE (Finite Element- Iterative Integral Equation Evaluation)rArr Asymptotically exact absorbing boundary condition

ε1 micro1

Medium 1

ε2 micro2

Medium 2

PEC

PMC

PEC

J

nprime

n

M

S prime

Meq

Jeq

S

Exterior Domain(ΩEXT)EincHinc

FEM Domain(ΩFEM)

GREMA-UC3M Higher Order Finite Element Code v12

Mesh Truncation with FE-IIEEAlgorithm

Local BC for FEM (sparse matrices)

ntimes(

macrfrminus1nablatimes V

)+ γ ntimes ntimes V = ΨINC + ΨSCAT over S

Iterative estimation of ΨINC by exterior Equivalence Principle on Sprime

VFE-IIEE =

intintcopy

Sprime(Leq timesnablaG) middot dSprime minus jk0h0

intintcopy

Sprime

(Oeq

(G +

1k2

0nablanablaG

))middot dSprime

nabla times VFE-IIEE = jk0h0

intintcopy

Sprime(Oeq timesnablaG) dSprime minus

intintcopy

Sprime

(Leq

(k2

0 G + nablanablaG))

middot dSprime

ΨSCAT = ntimes(

macrfrminus1nablatimes VFE-IIEE

)+ γ ntimes ntimes VFE-IIEE

GREMA-UC3M Higher Order Finite Element Code v12

Mesh Truncation with FE-IIEEFlow Chart

Assembly ofElement Matrices

Imposition of BCNon Related to S

PostprocessSparseSolver

InitialBC on S

Computation ofElement Matrices

Upgrade of BC on

Mesh

FEM code for nonminusopen problems

Ψ(0)(r)

S Ψ(i+1)(r)

J(i)eq (r prime)M(i)

eq (r prime)rArr

V (r isin ΓS)nablatimesV (r isin ΓS)

GREMA-UC3M Higher Order Finite Element Code v12

Computational Features

Computational FeaturesCode written using modern Fortran constructions (F2003)Strong emphasis in code maintainability by use of OOP (ObjectOriented Programming) paradigmsNumerical verification by use of the Method of Manufactured SolutionsNumerical validation by tests with EM benchmark problemsHybrid MPI+OpenMP programmingDirect solver interfaces (HSL MUMPS MKL Pardiso )Graphical User Interface (GUI) with HPCaaS InterfaceLinux amp Windows versions

GREMA-UC3M Higher Order Finite Element Code v12

Computational FeaturesTowards HPC

Towards HPCldquoRethinkrdquo some of the OOP constructions (eg arrays of small derivedtypes )Global mesh objectrarr local mesh objects on each processorSpecialized direct solver interfaces Problems of several tens of millions of unknowns on more than onethousand cores

GREMA-UC3M Higher Order Finite Element Code v12

Parallel Flow Chart of the Code

MPI Division

FE-IIEE enabled

Yes

No

p0 p1 p2 pn ReadInput Data

p0 p1 p2 pn Create global mesh object

Numberingof DOFs

p1 p2 pn Deallocateglobal mesh

object

p0

Initializesolver

p0 p1 p2 pn Create local mesh object

p0 p1 p2 pn Extractboundary

mesh

p0 p1 p2 pn

p0 p1 p2 pn Calculate ampfill LHS matrix

p0 p1 p2 pn Matrixfactorization

p0 Calculate ampfill RHS matrix

Solve systemof equations

p0 p1 p2 pn

Yes

Yes

Calculatescatteringfield

p0 Updateglobal RHS

FE-IIEE enabled

p0 p1 p2 pn

More frequencies

Postprocess

No

No

Solve systemof equations

p0 p1 p2 pn

ConvIter achieved

Yes

No

GREMA-UC3M Higher Order Finite Element Code v12

Graphical User InterfaceFeatures

FeaturesGUI based on a general purpose pre- and post-processor called GiDhttpgidcimneupces

Creation (or importation) of the geometry model of the problemMesh generationAssignation of material properties and boundary conditionsVisualization of resultsIntegration with Posidona (in-house HPCaaS)

GREMA-UC3M Higher Order Finite Element Code v12

PosidoniaIn-house HPCaaS Solution

Easing the use of HPC platformsRemote job-submission to HPC infraestructuresDesigned with security user-friendliness collaborative-computing andmobility in mindManagement of all the communication with the remote computer system(file transfers )Interaction with its batch system (job scheduler)History repository of simulationsNotification when job submitted is completedTransparent downloading of the results to visualize them locallyPosidonia also available as stand-alone desktopAndroidWeb solution(also for general use with other simulator andor applications)

A Amor-Martin I Martinez-Fernandez L E Garcia-Castillo ldquoPosidonia A Tool for HPC andRemote Scientific Simulationsrdquo IEEE Antennas and Propagation Magazine 6166ndash177 Dec2015

GREMA-UC3M Higher Order Finite Element Code v12

GUI with HPCaaS InterfaceScreenshot

GREMA-UC3M Higher Order Finite Element Code v12

Getting Experience with MUMPSFrom ldquoMUMPS do it allrdquo to

Matrix FormatElementalAssembled (centralized on process 0)Assembled (distributed among processes)Asking to MUMPS for Schur complements and ldquoplayingrdquo with them(outside MUMPS)

RHS and solutionDense RHSSparse RHS (large number of RHS vectors)Centralized solutionDistributed solution (waiting for distributed RHS feature )

GREMA-UC3M Higher Order Finite Element Code v12

MUMPS InterfacePresent Stage of Development (version 502)

MUMPS initializationCall to ParMETIS (or PT-Scotch) to partition matrix among processors

I Other alternatives for partitioning have been considered due to memoryproblems (commented in following slides)

Computation of FEM matrix coefficients associated to each localprocessInput of matrix coefficients to MUMPS in distributed assembled formatCall to MUMPS for matrix factorizationComputation of FEM RHS coefficients on process 0 (in blocks of 10-20vectors) in sparse formatCall to MUMPS for system solve(FE-IIEE enabled) Iteratively update of RHS and system solve until errorcriterion is satisfiedMUMPS finalization

lowast Frequent use of out-of-core (OOC) capabilities of MUMPS

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with MemoryMemory Allocation inside MUMPS

Memory IssueA peak memory use during analysis phase has been detected(distributed assembled)Found out to be due to memory allocation inside MUMPS routinesrelated to maximum MAXS among processors

Listing 1 file zana aux parF

1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH1647 MAXS = ordLAST(I)-ordFIRST(I)+11653 ALLOCATE(SIPES(max(1MAXS) NPROCS))1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH

Some Issues with Memory (cont.): Memory Allocation inside MUMPS

Memory issue, example: a 45,000,000-dof problem using 192 processes and 4 bytes per integer. With a MAXS bandwidth of 45,000,000 => 34.56 GB of memory per process.
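The 34.56 GB figure follows directly from the shape of that allocation, MAXS x NPROCS integers on every process:

```python
# SIPES(max(1,MAXS), NPROCS) with 4-byte integers, allocated on every process
MAXS = 45_000_000       # worst-case interval size from the example above
NPROCS = 192
BYTES_PER_INT = 4

per_process_bytes = MAXS * NPROCS * BYTES_PER_INT
per_process_gb = per_process_bytes / 1e9   # decimal gigabytes, as on the slide

assert per_process_bytes == 34_560_000_000
print(f"{per_process_gb:.2f} GB per process")   # -> 34.56 GB per process
```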

Workaround: matrix partition based on rows instead of on elements of the mesh.

- Slightly worse LU fill-in (size of the cofactors) than with the partition based on elements

Change the ordering of the dof given as input to MUMPS (to be done).

It may be the case that we are doing something completely wrong ...

Some Issues with Memory: Large Number of RHS & Solution Vectors

Large number of RHS vectors: analysis of a given problem under a large number of excitations. Examples:

- Monostatic radar cross section (RCS) prediction
- Large antenna arrays

Present MUMPS interface: treatment of RHS & solution vectors in blocks (typically 10-20 vectors at a time).

- Use of sparse format for the RHS
- Use of centralized solution vectors

The reason behind the treatment of RHS & solution vectors in blocks is to limit the memory needed to store the solution vectors.
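The saving from blocking is just the ratio of columns kept in memory at once. With illustrative numbers (the 17.3 M dofs and 720 excitations are assumptions for the example; the 20-vector block size is the slides'):

```python
DOFS = 17_300_000   # hypothetical problem size
NRHS = 720          # hypothetical number of excitations (e.g. sweep angles)
BLOCK = 20          # block size actually used (10-20 per the slides)
BYTES = 16          # one complex double-precision entry

all_at_once_gb = DOFS * NRHS * BYTES / 1e9   # store every solution vector
blocked_gb = DOFS * BLOCK * BYTES / 1e9      # store only the current block

# The saving is exactly NRHS/BLOCK (36x with these numbers)
assert abs(all_at_once_gb / blocked_gb - NRHS / BLOCK) < 1e-9
print(f"{all_at_once_gb:.1f} GB at once vs {blocked_gb:.2f} GB per block")
```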

Some Issues with Memory (cont.): Large Number of RHS & Solution Vectors

Distributed solution planned for the near future:
- Update of the centralized RHS by FE-IIEE => use of a centralized solution is "natural" (easy in terms of code maintenance)
- Wish list: distributed RHS
- Is the distributed RHS feature planned for near-future versions of MUMPS?

Specialized MUMPS Interfaces: Repetitive Solver

Vivaldi antenna

Antenna Array

Specialized MUMPS Interfaces (cont.): Repetitive Solver

Antenna Array

Specialized MUMPS Interfaces (cont.): Repetitive Solver

(a) Unit Cell (b) Unit Cell Mesh

Specialized MUMPS Interfaces (cont.): Repetitive Solver

Figure: Virtual Mesh of the Antenna Array

Specialized MUMPS Interfaces (cont.): Repetitive Solver

Algorithm:
1. Computation of the Schur complement of the unit cell
2. Assembly of the Schur complements of all "virtual" cells => interface problem
3. Addition of boundary conditions to the interface problem
4. Solve the interface problem
5. Solve the interior unit-cell problems

- Identical matrices with different right-hand sides
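A toy dense version of steps 1-5, using an invented 1D "unit cell" (5 nodes, 2 of them interfaces) so the algebra stays small; the flow is the one described, with a single interior factorization reused by every identical cell:

```python
import numpy as np

NCELLS = 8

# Unit cell: 1D Poisson chain of 4 unit springs, 5 nodes; nodes 0 and 4 are the
# cell interfaces, nodes 1-3 are interior.  Local load: f = 1 (halved at ends).
K = np.diag([1., 2., 2., 2., 1.]) + np.diag([-1.] * 4, 1) + np.diag([-1.] * 4, -1)
f = np.array([0.5, 1., 1., 1., 0.5])
ii, bb = [1, 2, 3], [0, 4]
Aii, Aib = K[np.ix_(ii, ii)], K[np.ix_(ii, bb)]
Abi, Abb = K[np.ix_(bb, ii)], K[np.ix_(bb, bb)]
fi, fb = f[ii], f[bb]

# 1) Schur complement of the (single, shared) unit cell
Aii_inv = np.linalg.inv(Aii)          # one interior factorization for all cells
S = Abb - Abi @ Aii_inv @ Aib
g = fb - Abi @ Aii_inv @ fi

# 2) Assemble Schur contributions of all "virtual" cells => interface problem
nI = NCELLS + 1                       # interface nodes between/around the cells
SI, gI = np.zeros((nI, nI)), np.zeros(nI)
for c in range(NCELLS):
    idx = [c, c + 1]
    SI[np.ix_(idx, idx)] += S
    gI[idx] += g

# 3) Boundary conditions: homogeneous Dirichlet at both ends of the chain
keep = list(range(1, nI - 1))

# 4) Solve the interface problem
uI = np.zeros(nI)
uI[keep] = np.linalg.solve(SI[np.ix_(keep, keep)], gI[keep])

# 5) Solve the interiors: identical matrices, different right-hand sides
u = np.zeros(4 * NCELLS + 1)
for c in range(NCELLS):
    ub = uI[[c, c + 1]]
    u[4 * c + 1:4 * c + 4] = Aii_inv @ (fi - Aib @ ub)
u[::4] = uI

# Cross-check against a directly assembled global solve
N = 4 * NCELLS + 1
KG = np.diag([2.] * N) + np.diag([-1.] * (N - 1), 1) + np.diag([-1.] * (N - 1), -1)
fG = np.ones(N)
uG = np.zeros(N)
uG[1:-1] = np.linalg.solve(KG[1:-1, 1:-1], fG[1:-1])
assert np.allclose(u, uG)
```

Because static condensation is algebraically exact, the Schur route reproduces the global solve to machine precision while only ever factorizing one cell interior.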

Specialized MUMPS Interfaces (cont.): Repetitive Solver

Features and remarks:
- Advantages: savings in time and memory
- Under certain circumstances (number of cells equal to a power of 2 and no BC), all leaves of a given level of the tree are identical
  - Further saving in time
  - Large saving in memory
- Boundary conditions (BC) alter this one-branch tree behavior => BC may be left up to the root of the tree
- Or "algebraic symmetry" can be explored

Specialized MUMPS Interfaces (cont.): Repetitive Solver

Specialized MUMPS Interfaces: Hybrid Direct & Iterative Solver

Hybrid direct & iterative solver:
- Multifrontal algorithm on only a few levels
- Iterative solution from the last level of the multifrontal algorithm
- It can be understood as the direct solver acting as a preconditioner of the iterative solver
- A natural approach to some DDM strategies
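One way to read "direct solver acting as preconditioner": factorize only a cheap part of the problem (here, the diagonal blocks, a stand-in for a few multifrontal levels) and iterate on the full system. The block structure and sizes below are invented; only the preconditioned iteration pattern is the point:

```python
import numpy as np

rng = np.random.default_rng(1)
nb, bs = 6, 8                        # 6 illustrative blocks of size 8
n = nb * bs

# Symmetric, strongly diagonally dominant test matrix so the iteration converges
A = rng.standard_normal((n, n))
A = 0.5 * (A + A.T)
A += np.diag(2.0 * np.abs(A).sum(axis=1))

# "Direct" part: factor only the diagonal blocks; each explicit block inverse
# stands in for its LU factorization.
Minv = np.zeros_like(A)
for k in range(nb):
    s = slice(k * bs, (k + 1) * bs)
    Minv[s, s] = np.linalg.inv(A[s, s])

# "Iterative" part: preconditioned Richardson iteration x += M^{-1} (b - A x)
b = rng.standard_normal(n)
x = np.zeros(n)
for it in range(200):
    r = b - A @ x
    if np.linalg.norm(r) < 1e-10 * np.linalg.norm(b):
        break
    x += Minv @ r

assert np.linalg.norm(b - A @ x) <= 1e-10 * np.linalg.norm(b)
```

In practice one would use a Krylov method (GMRES/CG) instead of Richardson, with the partial multifrontal factorization applied at each iteration; the stationary loop above just keeps the sketch dependency-free.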

Specialized MUMPS Interfaces: Dealing with the Lack of Availability of LU Cofactors

Lack of availability of LU cofactors:
- Calls to multiple (typically sequential) MUMPS instances for the Schur complements
- Assembling the Schur complements
- Finalizing the MUMPS instances
- Solving the interface problem
- Creating new MUMPS instances to solve the interior problems

Specialized MUMPS Interfaces (cont.): Dealing with the Lack of Availability of LU Cofactors

MUMPS instances for interior problems (idea inspired by work led by Prof. Paszynski):

- Reproduction (or restoration) of the interior matrix equation and the interior right-hand side
- Calls to multiple (typically sequential) MUMPS instances to factorize/solve the interior problems
- Use of Dirichlet conditions for the interface unknowns
- Preliminary tests show that the approach pays off in memory (as expected) while remaining competitive in time

Regular MUMPS & MULTISOLVER: Time and Memory Comparison

[Figure: wall-clock time (seconds, 0-1000) vs. number of unknowns (1x10^6 to 6x10^6) for MULTISOLVER and regular MUMPS.]

Regular MUMPS & MULTISOLVER (cont.): Time and Memory Comparison

[Figure: memory (GB, 0-140) vs. number of unknowns (1x10^6 to 6x10^6) for MULTISOLVER and MUMPS; system limit: 128 GB.]

HPC Environment: Cluster of Xidian University (XDHPC)

Cluster of Xidian University:
- 140 compute nodes, each with:
  - Two twelve-core Intel Xeon 2690 V2 2.2 GHz CPUs
  - 64 GB of RAM
  - 1.8 TB of hard disk
- 56 Gbps InfiniBand network

Waveguide Problem: Low-Pass Filter with Higher-Order Mode Suppression

- 10-16 GHz
- Length: 218 mm
- 324.5 K tetrahedra
- 2.2 M unknowns
- Wall time: 7.3 min per frequency point

I. Arregui et al., "High-power low-pass harmonic filters with higher-order TEn0 and non-TEn0 mode suppression: design method and multipactor characterization," IEEE Trans. Microw. Theory Techn., Dec. 2013.

Waveguide Problem (cont.): Low-Pass Filter with Higher-Order Mode Suppression

Waveguide Problem (cont.): Low-Pass Filter with Higher-Order Mode Suppression

Scattering Problem: Bistatic RCS of a Car

- Bistatic RCS at 1.5 GHz
- Tyres modeled as dielectric (epsilon_r = 4.0)
- Several incident plane waves from different directions

Scattering Problem (cont.): Bistatic RCS of a Car

- 2.7 M tetrahedra
- 17.3 M unknowns
- Wall time: 5.9 min per frequency point (46 compute nodes)

Scattering Problem (cont.): Bistatic RCS of a Car

Incident plane wave arriving from behind.

Scattering Problem (cont.): Bistatic RCS of a Car

Incident plane wave arriving from the front.

Scattering Problem (cont.): Bistatic RCS of a Car

Incident plane wave arriving from the front.

Parallel Scalability: Factorization and FE-IIEE Stages

[Figure: factorization + solving speedup vs. number of cores (120-480) for 12, 6, and 4 threads per process, with annotated values 4.9, 5.2, 5.7, and 7.0; and FE-IIEE (mesh truncation) speedup vs. number of cores, with annotated values 7.8, 8.5, and 10.0.]

Benchmark: bistatic RCS of an Impala.
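For reference, speedup and parallel efficiency in plots like these are wall-time ratios against the smallest run. A tiny helper with purely hypothetical timings (the 120-core baseline matches the plots' leftmost point; the times themselves are invented):

```python
# Hypothetical wall times (seconds) for the same job on increasing core counts
cores = [120, 240, 360, 480]
times = [1000.0, 560.0, 420.0, 350.0]   # invented numbers, for illustration only

base_c, base_t = cores[0], times[0]
speedup = [base_t / t for t in times]
# Efficiency relative to the 120-core baseline
efficiency = [s * base_c / c for s, c in zip(speedup, cores)]

assert speedup[0] == 1.0
assert all(e <= 1.0 for e in efficiency)   # sublinear scaling
```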

Parallel Scalability: Whole Code

[Figure: whole-code speedup (factorization + solving + FE-IIEE) vs. number of cores (120-480) for 12, 6, and 4 threads per process, with annotated values 6.2, 6.4, 7.0, and 8.0.]

Work in Progress and Future Work

Work in progress:
- Hierarchical basis functions of variable order p
- h-adaptivity => support for hp meshes

Future work:
- Conformal and non-conformal DDM
- Hybrid (direct + iterative) solver

Thanks for your attention

Thanks to the MUMPS team

FEM Formulation: Double-Curl Vector Wave PDE

Formulation based on the double-curl vector wave equation (use of E or H):

\[
\nabla\times\left(\bar{\bar{f}}_r^{-1}\,\nabla\times\mathbf{V}\right)-k_0^2\,\bar{\bar{g}}_r\,\mathbf{V}
= -jk_0 h_0\,\mathbf{P}+\nabla\times\left(\bar{\bar{f}}_r^{-1}\,\mathbf{L}\right)
\]

Table: formulation magnitudes and parameters

  Form. E:  V = E,  f_r = mu_r,   g_r = eps_r,  h = eta,    P = J,  L = M,   Gamma_D = Gamma_PEC,  Gamma_N = Gamma_PMC
  Form. H:  V = H,  f_r = eps_r,  g_r = mu_r,   h = 1/eta,  P = M,  L = -J,  Gamma_D = Gamma_PMC,  Gamma_N = Gamma_PEC

FEM Formulation: Boundary Conditions and Excitations

The boundary conditions considered are of Dirichlet, Neumann, and Cauchy type:

\[
\hat{n}\times\mathbf{V}=\Psi_D \quad \text{on } \Gamma_D \qquad (1)
\]
\[
\hat{n}\times\left(\bar{\bar{f}}_r^{-1}\,\nabla\times\mathbf{V}\right)=\Psi_N \quad \text{on } \Gamma_N \qquad (2)
\]
\[
\hat{n}\times\left(\bar{\bar{f}}_r^{-1}\,\nabla\times\mathbf{V}\right)+\gamma\,\hat{n}\times\hat{n}\times\mathbf{V}=\Psi_C \quad \text{on } \Gamma_C \qquad (3)
\]

- Periodic boundary conditions on a unit cell (infinite-array approach)
- Analytic boundary conditions for waveports of common waveguides, and a numerical waveport for arbitrary waveguides by means of a 2D eigenvalue/eigenmode characterization
- Lumped RLC (resistances, coils, and capacitors) elements and ports
- Impressed electric and magnetic currents, plane waves

FEM Formulation: Variational Formulation

Use of H(curl) spaces:

\[
H(\mathrm{curl})_0 = \{\mathbf{W}\in H(\mathrm{curl}):\ \hat{n}\times\mathbf{W}=0 \text{ on } \Gamma_D\} \qquad (4)
\]
\[
H(\mathrm{curl}) = \{\mathbf{W}\in L^2,\ \nabla\times\mathbf{W}\in L^2\} \qquad (5)
\]

and the Galerkin method: find \(\mathbf{V}\in H(\mathrm{curl})\) such that \(c(\mathbf{F},\mathbf{V})=l(\mathbf{F})\ \forall\,\mathbf{F}\in H(\mathrm{curl})_0\), with

\[
c(\mathbf{F},\mathbf{V})=\int_{\Omega}(\nabla\times\mathbf{F})\cdot\left(\bar{\bar{f}}_r^{-1}\,\nabla\times\mathbf{V}\right)d\Omega
- k_0^2\int_{\Omega}\mathbf{F}\cdot\bar{\bar{g}}_r\,\mathbf{V}\,d\Omega
+ \gamma\int_{\Gamma_C}(\hat{n}\times\mathbf{F})\cdot(\hat{n}\times\mathbf{V})\,d\Gamma_C
\]

\[
l(\mathbf{F})=-jk_0 h_0\int_{\Omega}\mathbf{F}\cdot\mathbf{P}\,d\Omega
- \int_{\Gamma_N}\mathbf{F}\cdot\Psi_N\,d\Gamma_N
- \int_{\Gamma_C}\mathbf{F}\cdot\Psi_C\,d\Gamma_C
- \int_{\Omega}\mathbf{F}\cdot\nabla\times\left(\bar{\bar{f}}_r^{-1}\,\mathbf{L}\right)d\Omega
\]

FEM Discretization: Higher-Order Curl-Conforming Element

- Own family of higher-order isoparametric curl-conforming finite elements (tetrahedron, prism, hexahedron; the latter under test)
- Rigorous implementation of Nedelec's mixed-order elements
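For first-kind (mixed-order) Nedelec tetrahedra, the standard dimension formula is p(p+2)(p+3)/2 functions per element, which reproduces the familiar counts (6 edge functions at p = 1, 20 at p = 2):

```python
def nedelec_tet_dofs(p: int) -> int:
    """Dimension of the first-kind (mixed-order) Nedelec space on a tetrahedron."""
    return p * (p + 2) * (p + 3) // 2

# Quick check against the well-known low-order counts
assert [nedelec_tet_dofs(p) for p in (1, 2, 3)] == [6, 20, 45]
```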

[Figure: local DOF numbering of the 2nd-order versions of the tetrahedron and the prism, in reference coordinates (xi, eta, zeta).]

Mesh Truncation with FE-IIEE: Problem Set-Up

Open-region problems are (optionally) handled by means of FE-IIEE (Finite Element - Iterative Integral Equation Evaluation) => an asymptotically exact absorbing boundary condition.

[Figure: problem set-up. FEM domain Omega_FEM bounded by surface S, with an auxiliary surface S' carrying the equivalent currents J_eq, M_eq; exterior domain Omega_EXT with incident fields E_inc, H_inc; media (eps_1, mu_1) and (eps_2, mu_2); PEC/PMC objects; impressed sources J, M; normals n, n'.]

Mesh Truncation with FE-IIEE: Algorithm

Local BC for FEM (sparse matrices):

\[
\hat{n}\times\left(\bar{\bar{f}}_r^{-1}\,\nabla\times\mathbf{V}\right)+\gamma\,\hat{n}\times\hat{n}\times\mathbf{V}
= \Psi^{\mathrm{INC}}+\Psi^{\mathrm{SCAT}} \quad \text{over } S
\]

Iterative estimation of \(\Psi^{\mathrm{SCAT}}\) by the exterior equivalence principle on \(S'\):

\[
\mathbf{V}^{\text{FE-IIEE}}=\oint_{S'}\left(\mathbf{L}_{eq}\times\nabla G\right)\cdot d\mathbf{S}'
- jk_0 h_0\oint_{S'}\left[\mathbf{O}_{eq}\left(G+\tfrac{1}{k_0^2}\nabla\nabla G\right)\right]\cdot d\mathbf{S}'
\]

\[
\nabla\times\mathbf{V}^{\text{FE-IIEE}}=jk_0 h_0\oint_{S'}\left(\mathbf{O}_{eq}\times\nabla G\right)d\mathbf{S}'
- \oint_{S'}\left[\mathbf{L}_{eq}\left(k_0^2\,G+\nabla\nabla G\right)\right]\cdot d\mathbf{S}'
\]

\[
\Psi^{\mathrm{SCAT}}=\hat{n}\times\left(\bar{\bar{f}}_r^{-1}\,\nabla\times\mathbf{V}^{\text{FE-IIEE}}\right)
+ \gamma\,\hat{n}\times\hat{n}\times\mathbf{V}^{\text{FE-IIEE}}
\]

Mesh Truncation with FE-IIEE: Flow Chart

[Flow chart: a standard FEM code for non-open problems (computation of element matrices, assembly, imposition of BC not related to S, sparse solver, postprocess) is augmented with the FE-IIEE loop: an initial BC Psi(0)(r) is set on S; after each solve, V and curl V on Gamma_S yield the equivalent currents J(i)_eq(r'), M(i)_eq(r'), which produce the upgraded BC Psi(i+1)(r) on S.]

Computational Features

- Code written using modern Fortran constructs (F2003)
- Strong emphasis on code maintainability through OOP (Object-Oriented Programming) paradigms
- Numerical verification by the Method of Manufactured Solutions
- Numerical validation by tests with EM benchmark problems
- Hybrid MPI+OpenMP programming
- Direct solver interfaces (HSL, MUMPS, MKL Pardiso, ...)
- Graphical User Interface (GUI) with an HPCaaS interface
- Linux & Windows versions

Computational Features: Towards HPC

Towards HPC:
- "Rethink" some of the OOP constructions (e.g., arrays of small derived types, ...)
- Global mesh object -> local mesh objects on each processor
- Specialized direct solver interfaces
- Problems of several tens of millions of unknowns on more than one thousand cores

Parallel Flow Chart of the Code

[Flow chart: all processes (p0 ... pn) read the input data and create the global mesh object; DOFs are numbered; p1 ... pn deallocate the global mesh object while p0 initializes the solver; every process creates its local mesh object and extracts the boundary mesh; all processes calculate and fill the LHS matrix and perform the factorization; p0 calculates and fills the RHS matrix; the system of equations is solved by all processes. If FE-IIEE is enabled, the scattered field is calculated, p0 updates the global RHS, and the solve is repeated until convergence or the iteration limit is achieved; finally, postprocessing, looping over any remaining frequencies.]

Graphical User Interface: Features

Features:
- GUI based on GiD, a general-purpose pre- and post-processor: http://gid.cimne.upc.es
- Creation (or importation) of the geometry model of the problem
- Mesh generation
- Assignment of material properties and boundary conditions
- Visualization of results
- Integration with Posidonia (in-house HPCaaS)

GREMA-UC3M Higher Order Finite Element Code v12

PosidoniaIn-house HPCaaS Solution

Easing the use of HPC platformsRemote job-submission to HPC infraestructuresDesigned with security user-friendliness collaborative-computing andmobility in mindManagement of all the communication with the remote computer system(file transfers )Interaction with its batch system (job scheduler)History repository of simulationsNotification when job submitted is completedTransparent downloading of the results to visualize them locallyPosidonia also available as stand-alone desktopAndroidWeb solution(also for general use with other simulator andor applications)

A Amor-Martin I Martinez-Fernandez L E Garcia-Castillo ldquoPosidonia A Tool for HPC andRemote Scientific Simulationsrdquo IEEE Antennas and Propagation Magazine 6166ndash177 Dec2015

GREMA-UC3M Higher Order Finite Element Code v12

GUI with HPCaaS InterfaceScreenshot

GREMA-UC3M Higher Order Finite Element Code v12

Getting Experience with MUMPSFrom ldquoMUMPS do it allrdquo to

Matrix FormatElementalAssembled (centralized on process 0)Assembled (distributed among processes)Asking to MUMPS for Schur complements and ldquoplayingrdquo with them(outside MUMPS)

RHS and solutionDense RHSSparse RHS (large number of RHS vectors)Centralized solutionDistributed solution (waiting for distributed RHS feature )

GREMA-UC3M Higher Order Finite Element Code v12

MUMPS InterfacePresent Stage of Development (version 502)

MUMPS initializationCall to ParMETIS (or PT-Scotch) to partition matrix among processors

I Other alternatives for partitioning have been considered due to memoryproblems (commented in following slides)

Computation of FEM matrix coefficients associated to each localprocessInput of matrix coefficients to MUMPS in distributed assembled formatCall to MUMPS for matrix factorizationComputation of FEM RHS coefficients on process 0 (in blocks of 10-20vectors) in sparse formatCall to MUMPS for system solve(FE-IIEE enabled) Iteratively update of RHS and system solve until errorcriterion is satisfiedMUMPS finalization

lowast Frequent use of out-of-core (OOC) capabilities of MUMPS

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with MemoryMemory Allocation inside MUMPS

Memory IssueA peak memory use during analysis phase has been detected(distributed assembled)Found out to be due to memory allocation inside MUMPS routinesrelated to maximum MAXS among processors

Listing 1 file zana aux parF

1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH1647 MAXS = ordLAST(I)-ordFIRST(I)+11653 ALLOCATE(SIPES(max(1MAXS) NPROCS))1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with Memory (cont)Memory Allocation inside MUMPS

Memory IssueExample 45000000 dof problem using 192 processes and 4 bytes perintegerMAXS bandwidth is 45000000rArr 3456 GB memory per process

WorkaroundMatrix partition based on rows instead of elements of the mesh

I Slightly worse LU fill-in (size of cofactors) than with partition based onelements

Change ordering of dof as input to MUMPS (to be done)

It may be the case we are doing something completely wrong

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with MemoryLarge Number of RHS amp Solution Vectors

Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples

Monostatic radar cross section (RCS) predictionLarge arrays of antennas

Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)

Use of sparse format for RHSUse of centralized solution vectors

I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors

Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesRepetitive Solver

Vivaldi antenna

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

(a) Unit Cell (b) Unit Cell Mesh

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Figure Virtual Mesh of Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface

problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems

I Identical matrices with different right hand sides

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical

I Further saving in timeI Large saving in memory

Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver

Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski

Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVERTime and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0 0

4 0 0

6 0 0

8 0 0

1 0 0 0Se

cond

s

u n k n o w n s

M U L T I S O L V E R M U M P S

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0

4 0

6 0

8 0

1 0 0

1 2 0

1 4 0Me

mory

in GB

u n k n o w n s

M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 6: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

FEM FormulationBoundary Conditions and Excitations

The boundary conditions considered are of Dirichlet Neumann andCauchy types

ntimes V = ΨD over ΓD (1)

ntimes(

macrfrminus1nablatimes V

)= ΨN over ΓN (2)

ntimes(

macrfrminus1nablatimes V

)+ γ ntimes ntimes V = ΨC over ΓC (3)

Periodic Boundary Conditions on unit cell (infinite array approach)

Analytic boundary conditions for waveports of common waveguides andalso numerical waveport for arbitrary waveguides by means of 2Deigenvalueeigenmode characterization

Lumped RLC (resistance coils and capacitors) elements an ports

Impressed electric and magnetic currents plane waves

GREMA-UC3M Higher Order Finite Element Code v12

FEM FormulationVariational Formulation

Use of H(curl) spaces

H(curl)0 = W isin H(curl) ntimesW = 0 on ΓD (4)

H(curl) = W isin L2 nablatimesW isin L2 (5)and Galerkin method

Find V isin H(curl) such that c(FV) = l(F) forallF isin H(curl)0

c(FV) =

int

Ω

(nablatimes F) middot(

macrfrminus1nablatimes V

)dΩminus k2

0

int

Ω

(F middot macrgr V

)dΩ +

γ

int

ΓC

(ntimes F) middot (ntimes V) dΓC

l(F) = minusjk0h0

int

Ω

F middot P dΩ minusint

ΓN

F middotΨN dΓN minusint

ΓC

F middotΨC dΓC

minusint

Ω

F middotnablatimes(

macrfrminus1

L)

GREMA-UC3M Higher Order Finite Element Code v12

FEM DiscretizationHigher-Order Curl-Conforming Element

Own family of higher order isoparametric curl-conforming finite elements(tetrahedron prism hexahedron mdashunder testmdash)Rigorous implementation of Nedelecrsquos mixed order elements

4

2

31

13 14

15

16

18

17

19 20

ζ

ξ

η1

2 3

4

7

8

9

10

11

12

6 5

1

2

3

4

5

6

1

2

3

4

56

7

89

10

1112

16

15

17

18

20

19

22

21

2324

25

26

27

28

29

30

31

32

3334

35

36

ζ

ξ

η

14

13

Example 2nd order versions of tetra and prism

GREMA-UC3M Higher Order Finite Element Code v12

Mesh Truncation with FE-IIEEProblem Set-Up

Open region problems (optionally) by means of FE-IIEE (Finite Element- Iterative Integral Equation Evaluation)rArr Asymptotically exact absorbing boundary condition

ε1 micro1

Medium 1

ε2 micro2

Medium 2

PEC

PMC

PEC

J

nprime

n

M

S prime

Meq

Jeq

S

Exterior Domain(ΩEXT)EincHinc

FEM Domain(ΩFEM)

GREMA-UC3M Higher Order Finite Element Code v12

Mesh Truncation with FE-IIEEAlgorithm

Local BC for FEM (sparse matrices)

ntimes(

macrfrminus1nablatimes V

)+ γ ntimes ntimes V = ΨINC + ΨSCAT over S

Iterative estimation of ΨINC by exterior Equivalence Principle on Sprime

VFE-IIEE =

intintcopy

Sprime(Leq timesnablaG) middot dSprime minus jk0h0

intintcopy

Sprime

(Oeq

(G +

1k2

0nablanablaG

))middot dSprime

nabla times VFE-IIEE = jk0h0

intintcopy

Sprime(Oeq timesnablaG) dSprime minus

intintcopy

Sprime

(Leq

(k2

0 G + nablanablaG))

middot dSprime

ΨSCAT = ntimes(

macrfrminus1nablatimes VFE-IIEE

)+ γ ntimes ntimes VFE-IIEE

GREMA-UC3M Higher Order Finite Element Code v12

Mesh Truncation with FE-IIEEFlow Chart

Assembly ofElement Matrices

Imposition of BCNon Related to S

PostprocessSparseSolver

InitialBC on S

Computation ofElement Matrices

Upgrade of BC on

Mesh

FEM code for nonminusopen problems

Ψ(0)(r)

S Ψ(i+1)(r)

J(i)eq (r prime)M(i)

eq (r prime)rArr

V (r isin ΓS)nablatimesV (r isin ΓS)

GREMA-UC3M Higher Order Finite Element Code v12

Computational Features

Computational FeaturesCode written using modern Fortran constructions (F2003)Strong emphasis in code maintainability by use of OOP (ObjectOriented Programming) paradigmsNumerical verification by use of the Method of Manufactured SolutionsNumerical validation by tests with EM benchmark problemsHybrid MPI+OpenMP programmingDirect solver interfaces (HSL MUMPS MKL Pardiso )Graphical User Interface (GUI) with HPCaaS InterfaceLinux amp Windows versions

GREMA-UC3M Higher Order Finite Element Code v12

Computational FeaturesTowards HPC

Towards HPCldquoRethinkrdquo some of the OOP constructions (eg arrays of small derivedtypes )Global mesh objectrarr local mesh objects on each processorSpecialized direct solver interfaces Problems of several tens of millions of unknowns on more than onethousand cores

GREMA-UC3M Higher Order Finite Element Code v12

Parallel Flow Chart of the Code

MPI Division

FE-IIEE enabled

Yes

No

p0 p1 p2 pn ReadInput Data

p0 p1 p2 pn Create global mesh object

Numberingof DOFs

p1 p2 pn Deallocateglobal mesh

object

p0

Initializesolver

p0 p1 p2 pn Create local mesh object

p0 p1 p2 pn Extractboundary

mesh

p0 p1 p2 pn

p0 p1 p2 pn Calculate ampfill LHS matrix

p0 p1 p2 pn Matrixfactorization

p0 Calculate ampfill RHS matrix

Solve systemof equations

p0 p1 p2 pn

Yes

Yes

Calculatescatteringfield

p0 Updateglobal RHS

FE-IIEE enabled

p0 p1 p2 pn

More frequencies

Postprocess

No

No

Solve systemof equations

p0 p1 p2 pn

ConvIter achieved

Yes

No

GREMA-UC3M Higher Order Finite Element Code v12

Graphical User InterfaceFeatures

FeaturesGUI based on a general purpose pre- and post-processor called GiDhttpgidcimneupces

Creation (or importation) of the geometry model of the problemMesh generationAssignation of material properties and boundary conditionsVisualization of resultsIntegration with Posidona (in-house HPCaaS)

GREMA-UC3M Higher Order Finite Element Code v12

PosidoniaIn-house HPCaaS Solution

Easing the use of HPC platformsRemote job-submission to HPC infraestructuresDesigned with security user-friendliness collaborative-computing andmobility in mindManagement of all the communication with the remote computer system(file transfers )Interaction with its batch system (job scheduler)History repository of simulationsNotification when job submitted is completedTransparent downloading of the results to visualize them locallyPosidonia also available as stand-alone desktopAndroidWeb solution(also for general use with other simulator andor applications)

A Amor-Martin I Martinez-Fernandez L E Garcia-Castillo ldquoPosidonia A Tool for HPC andRemote Scientific Simulationsrdquo IEEE Antennas and Propagation Magazine 6166ndash177 Dec2015

GREMA-UC3M Higher Order Finite Element Code v12

GUI with HPCaaS InterfaceScreenshot

GREMA-UC3M Higher Order Finite Element Code v12

Getting Experience with MUMPSFrom ldquoMUMPS do it allrdquo to

Matrix FormatElementalAssembled (centralized on process 0)Assembled (distributed among processes)Asking to MUMPS for Schur complements and ldquoplayingrdquo with them(outside MUMPS)

RHS and solutionDense RHSSparse RHS (large number of RHS vectors)Centralized solutionDistributed solution (waiting for distributed RHS feature )

GREMA-UC3M Higher Order Finite Element Code v12

MUMPS Interface: Present Stage of Development (version 5.0.2)

- MUMPS initialization.
- Call to ParMETIS (or PT-Scotch) to partition the matrix among processors.
  - Other alternatives for partitioning have been considered due to memory problems (commented on in the following slides).
- Computation of the FEM matrix coefficients associated with each local process.
- Input of the matrix coefficients to MUMPS in distributed assembled format.
- Call to MUMPS for matrix factorization.
- Computation of the FEM RHS coefficients on process 0 (in blocks of 10-20 vectors) in sparse format.
- Call to MUMPS for the system solve.
- (FE-IIEE enabled) Iterative update of the RHS and system solve until the error criterion is satisfied.
- MUMPS finalization.

* Frequent use of the out-of-core (OOC) capabilities of MUMPS.

Some Issues with Memory: Memory Allocation inside MUMPS

Memory issue:
- A peak of memory use during the analysis phase has been detected (distributed assembled input).
- Found to be due to a memory allocation inside MUMPS routines, related to the maximum MAXS among processors.

Listing 1: file zana_aux_par.F

    1589  SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
    ...
    1647    MAXS = ord%LAST(I)-ord%FIRST(I)+1
    ...
    1653    ALLOCATE(SIPES(max(1,MAXS), NPROCS))
    ...
    1864  END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH

Some Issues with Memory (cont.): Memory Allocation inside MUMPS

Memory issue, example: a 45,000,000-dof problem using 192 processes and 4 bytes per integer. MAXS (bandwidth) is 45,000,000 => 34.56 GB of memory per process.

Workaround:
- Matrix partition based on rows instead of on elements of the mesh.
  - Slightly worse LU fill-in (size of cofactors) than with the partition based on elements.
- Change the ordering of dof given as input to MUMPS (to be done).

It may be the case that we are doing something completely wrong ...

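The per-process figure follows directly from the SIPES allocation: max(1, MAXS) × NPROCS integers on each process. A minimal sketch of the arithmetic (the function name is ours, not from MUMPS):

```python
def sipes_bytes(maxs, nprocs, bytes_per_int=4):
    """Size of the SIPES(max(1,MAXS), NPROCS) work array on one process, in bytes."""
    return max(1, maxs) * nprocs * bytes_per_int

# 45,000,000-dof problem on 192 processes with 4-byte integers:
gigabytes = sipes_bytes(45_000_000, 192) / 1e9  # -> 34.56 GB per process
```

As we understand it, the row-based workaround below shrinks the per-process range ord%LAST(I)-ord%FIRST(I), and with it MAXS, which is what collapses this term.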

Some Issues with Memory: Large Number of RHS & Solution Vectors

Large number of RHS vectors — analysis of a given problem under a large number of excitations. Examples:
- Monostatic radar cross section (RCS) prediction.
- Large arrays of antennas.

Present MUMPS interface:
- Treatment of RHS & solution vectors in blocks (typically 10-20 vectors at a time).
- Use of sparse format for the RHS.
- Use of centralized solution vectors.
  - The reason behind the treatment of RHS & solution vectors in blocks is to limit the memory needed to store the solution vectors.

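The blocking scheme above can be outlined in a few lines; an illustrative sketch (names are ours, not the actual Fortran interface):

```python
def solve_in_blocks(solve_block, n_rhs, block_size=16):
    """Solve for n_rhs right-hand sides, block_size columns at a time,
    so only one block of solution vectors is held in memory at once."""
    solutions = []
    for start in range(0, n_rhs, block_size):
        cols = list(range(start, min(start + block_size, n_rhs)))
        solutions.extend(solve_block(cols))  # e.g. one solver call per block
    return solutions

# toy "solver": the solution for column j is just 2*j
sols = solve_in_blocks(lambda cols: [2 * j for j in cols], n_rhs=37, block_size=10)
# len(sols) == 37
```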

Some Issues with Memory (cont.): Large Number of RHS & Solution Vectors

Distributed solution planned for the near future:
- Update of the centralized RHS by FE-IIEE => use of a centralized solution is "natural" (easy in terms of code maintenance).
- Wish list: distributed RHS. Is the distributed RHS feature planned for near-future versions of MUMPS?

Specialized MUMPS Interfaces: Repetitive Solver

Figures: Vivaldi antenna; antenna array.

Specialized MUMPS Interfaces (cont.): Repetitive Solver

Figure: antenna array.

Specialized MUMPS Interfaces (cont.): Repetitive Solver

Figure: (a) unit cell; (b) unit cell mesh.

Specialized MUMPS Interfaces (cont.): Repetitive Solver

Figure: virtual mesh of the antenna array.

Specialized MUMPS Interfaces (cont.): Repetitive Solver

Algorithm:
1. Computation of the Schur complement of the unit cell.
2. Assembly of the Schur complements of all "virtual" cells => interface problem.
3. Addition of boundary conditions to the interface problem.
4. Solve the interface problem.
5. Solve the interior unit-cell problems.
   - Identical matrices with different right-hand sides.

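Steps 1-5 can be seen on a toy problem with scalar "blocks" — a minimal sketch of static condensation (not the production code, which obtains the Schur complements through MUMPS):

```python
# One "cell" with interior unknown xi and interface unknown xb:
#   [ aii  aib ] [xi]   [fi]
#   [ abi  abb ] [xb] = [fb]
def schur_solve(aii, aib, abi, abb, fi, fb):
    s  = abb - abi * aib / aii   # step 1: Schur complement of the interior block
    gb = fb - abi * fi / aii     # condensed interface RHS
    xb = gb / s                  # step 4: interface solve
    xi = (fi - aib * xb) / aii   # step 5: interior solve (same aii, new RHS)
    return xi, xb

xi, xb = schur_solve(4.0, 1.0, 2.0, 3.0, 10.0, 8.0)
assert abs(4.0 * xi + 1.0 * xb - 10.0) < 1e-12  # row 1 of the original system
assert abs(2.0 * xi + 3.0 * xb - 8.0) < 1e-12   # row 2
```

With many identical cells, step 1 is done once and reused; only the RHS differs per cell, which is why step 5 amounts to repeated solves with the same factorized interior matrix.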

Specialized MUMPS Interfaces (cont.): Repetitive Solver

Features and remarks:
- Advantages: savings in time and memory.
- Under certain circumstances (number of cells equal to a power of 2 and no BC), all leaves of a given level of the tree are identical:
  - further savings in time;
  - large savings in memory.
- Boundary conditions (BC) alter this one-branch-tree behavior => BC may be left up to the root of the tree, or "algebraic symmetry" can be explored.

Specialized MUMPS Interfaces (cont.): Repetitive Solver (figure)

Specialized MUMPS Interfaces: Hybrid Direct & Iterative Solver

- Multifrontal algorithm on only a few levels.
- Iterative solution from the last level of the multifrontal algorithm.
- It can be understood as the direct solver acting as a preconditioner for the iterative solver.
- A natural approach to some DDM strategies.

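The idea — an exact (direct) solve of part of the problem preconditioning an outer iteration — can be caricatured on a 2×2 system, where an exact solve of each diagonal "block" preconditions a Richardson iteration. This is an illustrative analogy, not the multifrontal scheme itself:

```python
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = [0.0, 0.0]
for _ in range(60):
    # residual of the full system
    r = [b[i] - sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
    # "direct" (here: exact 1x1) solve per diagonal block as preconditioner
    x = [x[i] + r[i] / A[i][i] for i in range(2)]

# converges to the exact solution x = (1/11, 7/11)
assert abs(x[0] - 1 / 11) < 1e-9 and abs(x[1] - 7 / 11) < 1e-9
```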

Specialized MUMPS Interfaces: Dealing with the Lack of Availability of LU Cofactors

- Calls to multiple (typically sequential) MUMPS instances for the Schur complements.
- Assembling the Schur complements.
- Finalizing the MUMPS instances.
- Solve the interface problem.
- Create new MUMPS instances to solve the interior problems.

Specialized MUMPS Interfaces (cont.): Dealing with the Lack of Availability of LU Cofactors

MUMPS instances for interior problems (idea inspired by work led by Prof. Paszynski):
- Reproduction (or restoration) of the interior matrix equation and the interior right-hand side.
- Call to multiple (typically sequential) MUMPS instances to factorize/solve the interior problems.
- Use of Dirichlet conditions for the interface unknowns.
- Preliminary tests show that the approach is worthy in memory (as expected) and competitive in time.

Regular MUMPS & MULTISOLVER: Time and Memory Comparison

Figure: solve time (seconds, 0-1000) vs. number of unknowns (1e6 to 6e6) for MULTISOLVER and regular MUMPS.

Regular MUMPS & MULTISOLVER (cont.): Time and Memory Comparison

Figure: memory (GB, 0-140) vs. number of unknowns (1e6 to 6e6) for MULTISOLVER and regular MUMPS; the system limit (128 GB) is marked.

HPC Environment: Cluster of Xidian University (XDHPC)

- 140 compute nodes, each with:
  - two twelve-core Intel Xeon 2690 V2 2.2 GHz CPUs;
  - 64 GB of RAM;
  - 1.8 TB of hard disk.
- 56 Gbps InfiniBand network.

Waveguide Problem: Low Pass Filter with Higher-Order Mode Suppression

- Band: [10-16] GHz.
- Length: 218 mm.
- 324.5 K tetrahedra.
- 2.2 M unknowns.
- Wall time: 73 min per frequency point.

Reference: Arregui et al., "High-power low-pass harmonic filters with higher-order TEn0 and non-TEn0 mode suppression: design method and multipactor characterization," IEEE Trans. Microw. Theory Techn., Dec. 2013.

Waveguide Problem (cont.): Low Pass Filter with Higher-Order Mode Suppression (result figures)

Scattering Problem: Bistatic RCS of Car

- Bistatic RCS at 1.5 GHz.
- Tyres modeled as dielectric (eps_r = 4.0).
- Several incident plane waves from different directions.

Scattering Problem (cont.): Bistatic RCS of Car

- 2.7 M tetrahedra.
- 17.3 M unknowns.
- Wall time: 59 min per frequency point (46 compute nodes).

Scattering Problem (cont.): Bistatic RCS of Car

Figures: incident plane wave arriving from behind; incident plane wave arriving from the front.

Parallel Scalability: Factorization and FE-IIEE Stages

Figure: speedup of the factorization + solving phase vs. number of cores (120, 240, 360, 480), for 12, 6, and 4 threads per process (annotated values: 49, 52, 57, 70).
Figure: speedup of the FE-IIEE (mesh truncation) phase vs. number of cores, for the same thread counts (annotated values: 78, 85, 100).

Benchmark: bistatic RCS of an Impala.

Parallel Scalability: Whole Code

Figure: speedup of the whole code (factorization + solving + FE-IIEE) vs. number of cores (120, 240, 360, 480), for 12, 6, and 4 threads per process (annotated values: 62, 64, 70, 80).

Work in Progress and Future Work

Work in progress:
- Hierarchical basis functions of variable order p.
- h-adaptivity => support for hp meshes.

Future work:
- Conformal and non-conformal DDM.
- Hybrid (direct + iterative) solver.

Thanks for your attention

Thanks to the MUMPS team


FEM Formulation: Variational Formulation

Use of H(curl) spaces,

$H(\mathrm{curl})_0 = \left\{ \mathbf{W} \in H(\mathrm{curl}) \,:\, \hat{n}\times\mathbf{W} = 0 \ \text{on}\ \Gamma_D \right\}$   (4)

$H(\mathrm{curl}) = \left\{ \mathbf{W} \in L^2 \,:\, \nabla\times\mathbf{W} \in L^2 \right\}$   (5)

and the Galerkin method: find $\mathbf{V} \in H(\mathrm{curl})$ such that $c(\mathbf{F},\mathbf{V}) = l(\mathbf{F}) \quad \forall\, \mathbf{F} \in H(\mathrm{curl})_0$, with

$c(\mathbf{F},\mathbf{V}) = \int_\Omega (\nabla\times\mathbf{F}) \cdot \left(\bar{\bar{f}}_r^{-1}\,\nabla\times\mathbf{V}\right) \mathrm{d}\Omega \;-\; k_0^2 \int_\Omega \mathbf{F} \cdot \bar{\bar{g}}_r\,\mathbf{V}\, \mathrm{d}\Omega \;+\; \gamma \int_{\Gamma_C} (\hat{n}\times\mathbf{F}) \cdot (\hat{n}\times\mathbf{V})\, \mathrm{d}\Gamma_C$

$l(\mathbf{F}) = -\,j k_0 \eta_0 \int_\Omega \mathbf{F}\cdot\mathbf{P}\, \mathrm{d}\Omega \;-\; \int_{\Gamma_N} \mathbf{F}\cdot\boldsymbol{\Psi}_N\, \mathrm{d}\Gamma_N \;-\; \int_{\Gamma_C} \mathbf{F}\cdot\boldsymbol{\Psi}_C\, \mathrm{d}\Gamma_C \;-\; \int_\Omega \mathbf{F}\cdot\nabla\times\left(\bar{\bar{f}}_r^{-1}\mathbf{L}\right) \mathrm{d}\Omega$

FEM Discretization: Higher-Order Curl-Conforming Elements

- Own family of higher-order isoparametric curl-conforming finite elements (tetrahedron, prism, hexahedron — the latter under test).
- Rigorous implementation of Nedelec's mixed-order elements.

Figure: node numbering of the 2nd-order versions of the tetrahedron and prism in reference coordinates (ξ, η, ζ).

Mesh Truncation with FE-IIEE: Problem Set-Up

Open-region problems are handled (optionally) by means of FE-IIEE (Finite Element - Iterative Integral Equation Evaluation) => an asymptotically exact absorbing boundary condition.

Figure: problem set-up — FEM domain Ω_FEM containing media (ε₁, μ₁) and (ε₂, μ₂), PEC and PMC objects, and sources J, M; truncation boundary S (normal n); interior auxiliary surface S′ (normal n′) carrying equivalent currents J_eq, M_eq that radiate into the exterior domain Ω_EXT (incident fields E_inc, H_inc).

Mesh Truncation with FE-IIEE: Algorithm

Local BC for the FEM (sparse matrices):

$\hat{n}\times\left(\bar{\bar{f}}_r^{-1}\,\nabla\times\mathbf{V}\right) + \gamma\,\hat{n}\times\hat{n}\times\mathbf{V} = \boldsymbol{\Psi}^{\mathrm{INC}} + \boldsymbol{\Psi}^{\mathrm{SCAT}} \quad \text{over } S$

Iterative estimation of $\boldsymbol{\Psi}^{\mathrm{SCAT}}$ by the exterior equivalence principle on $S'$:

$\mathbf{V}_{\text{FE-IIEE}} = \oiint_{S'} \left(\mathbf{L}_{eq} \times \nabla G\right)\cdot \mathrm{d}\mathbf{S}' \;-\; j k_0 \eta_0 \oiint_{S'} \mathbf{O}_{eq}\left(G + \frac{1}{k_0^2}\,\nabla\nabla G\right)\cdot \mathrm{d}\mathbf{S}'$

$\nabla\times\mathbf{V}_{\text{FE-IIEE}} = j k_0 \eta_0 \oiint_{S'} \left(\mathbf{O}_{eq} \times \nabla G\right) \mathrm{d}\mathbf{S}' \;-\; \oiint_{S'} \mathbf{L}_{eq}\left(k_0^2\, G + \nabla\nabla G\right)\cdot \mathrm{d}\mathbf{S}'$

$\boldsymbol{\Psi}^{\mathrm{SCAT}} = \hat{n}\times\left(\bar{\bar{f}}_r^{-1}\,\nabla\times\mathbf{V}_{\text{FE-IIEE}}\right) + \gamma\,\hat{n}\times\hat{n}\times\mathbf{V}_{\text{FE-IIEE}}$

Mesh Truncation with FE-IIEE: Flow Chart

The standard FEM code for non-open problems (mesh -> computation of element matrices -> assembly of element matrices -> imposition of BC not related to S -> sparse solver -> postprocess) is augmented with a loop: an initial BC Ψ⁽⁰⁾(r) is imposed on S; after each solve, V and ∇×V on Γ_S yield the equivalent currents J_eq⁽ⁱ⁾(r′), M_eq⁽ⁱ⁾(r′), which are radiated to upgrade the BC on S to Ψ⁽ⁱ⁺¹⁾(r).

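The loop above is a fixed-point iteration on the boundary condition. A schematic sketch with scalar stand-ins for the FEM solve and the boundary-condition update (both hypothetical placeholders, not the actual operators):

```python
def fe_iiee(solve, update_bc, psi0, tol=1e-10, max_iter=50):
    """Iterate: FEM solve with the current BC on S, then radiate equivalent
    currents from S' to upgrade the BC, until the BC stops changing."""
    psi = psi0
    for _ in range(max_iter):
        v = solve(psi)            # sparse FEM solve (the BC enters the RHS)
        psi_next = update_bc(v)   # Psi^(i+1) from J_eq, M_eq on S'
        if abs(psi_next - psi) < tol:
            break
        psi = psi_next
    return v, psi

# scalar toy: v = psi + 1, psi_next = 0.5*v  ->  fixed point psi = 1, v = 2
v, psi = fe_iiee(lambda p: p + 1.0, lambda w: 0.5 * w, psi0=0.0)
assert abs(psi - 1.0) < 1e-8 and abs(v - 2.0) < 1e-8
```

The sparse system matrix is unchanged across iterations — only the RHS is updated — which is why a single MUMPS factorization can be reused for every solve of the loop.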

Computational Features

- Code written using modern Fortran constructs (F2003).
- Strong emphasis on code maintainability through OOP (Object-Oriented Programming) paradigms.
- Numerical verification by the Method of Manufactured Solutions.
- Numerical validation by tests with EM benchmark problems.
- Hybrid MPI+OpenMP programming.
- Direct solver interfaces (HSL, MUMPS, MKL Pardiso, ...).
- Graphical User Interface (GUI) with HPCaaS interface.
- Linux & Windows versions.

Computational Features: Towards HPC

- "Rethink" some of the OOP constructions (e.g., arrays of small derived types, ...).
- Global mesh object -> local mesh objects on each processor.
- Specialized direct solver interfaces.
- Problems of several tens of millions of unknowns on more than one thousand cores.

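The global-to-local mesh split can be illustrated with a contiguous block partition of global element ids per rank — an illustrative scheme only (the real code partitions through ParMETIS/PT-Scotch):

```python
def local_elements(n_elems, nprocs, rank):
    """Contiguous block of global element ids owned by `rank`;
    the first n_elems % nprocs ranks get one extra element."""
    base, extra = divmod(n_elems, nprocs)
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return list(range(start, start + size))

# every element is owned exactly once across the ranks
owned = [e for r in range(4) for e in local_elements(10, 4, r)]
assert owned == list(range(10))
```

Once each rank holds only its own element list, the global mesh object can be deallocated everywhere, which is the memory saving the slide refers to.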

Parallel Flow Chart of the Code

All processes (p0, p1, ..., pn) read the input data and create the global mesh object; after the numbering of DOFs, p1...pn deallocate the global mesh object while p0 initializes the solver. Each process then creates its local mesh object, extracts the boundary mesh, and calculates and fills its part of the LHS matrix, which is factorized in parallel. Process p0 calculates and fills the RHS matrix, and the system of equations is solved by all processes. If FE-IIEE is enabled, the scattering field is calculated, p0 updates the global RHS, and the solve is repeated until iterative convergence is achieved. The loop over frequencies continues while more frequencies remain, followed by postprocessing.

Graphical User Interface: Features

- GUI based on a general-purpose pre- and post-processor called GiD: http://gid.cimne.upc.es
- Creation (or importation) of the geometry model of the problem.
- Mesh generation.
- Assignment of material properties and boundary conditions.
- Visualization of results.
- Integration with Posidonia (in-house HPCaaS).



Specialized MUMPS Interfaces (cont)Repetitive Solver

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver

Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski

Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVERTime and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0 0

4 0 0

6 0 0

8 0 0

1 0 0 0Se

cond

s

u n k n o w n s

M U L T I S O L V E R M U M P S

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0

4 0

6 0

8 0

1 0 0

1 2 0

1 4 0Me

mory

in GB

u n k n o w n s

M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 9: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Mesh Truncation with FE-IIEE: Problem Set-Up

Open-region problems are handled (optionally) by means of FE-IIEE (Finite Element - Iterative Integral Equation Evaluation) ⇒ an asymptotically exact absorbing boundary condition.

[Figure: problem set-up. The FEM domain Ω_FEM, bounded by the truncation surface S with outward normal n, contains media (ε1, μ1) and (ε2, μ2), PEC and PMC objects, and impressed sources J, M. Equivalent currents J_eq, M_eq on the auxiliary surface S′ (normal n′) radiate into the exterior domain Ω_EXT, where the incident fields E^inc, H^inc are defined.]

GREMA-UC3M Higher Order Finite Element Code v12

Mesh Truncation with FE-IIEE: Algorithm

Local BC for FEM (sparse matrices):

\hat{n} \times \left( \bar{\bar{f}}_r^{-1} \nabla \times \mathbf{V} \right) + \gamma \, \hat{n} \times \hat{n} \times \mathbf{V} = \Psi^{\mathrm{INC}} + \Psi^{\mathrm{SCAT}} \quad \text{over } S

Iterative estimation of \Psi^{\mathrm{SCAT}} by the exterior Equivalence Principle on S':

\mathbf{V}_{\text{FE-IIEE}} = \oiint_{S'} \left( \mathbf{L}_{\mathrm{eq}} \times \nabla G \right) \cdot d\mathbf{S}' - j k_0 \eta_0 \oiint_{S'} \mathbf{O}_{\mathrm{eq}} \left( G + \frac{1}{k_0^2} \nabla\nabla G \right) \cdot d\mathbf{S}'

\nabla \times \mathbf{V}_{\text{FE-IIEE}} = j k_0 \eta_0 \oiint_{S'} \left( \mathbf{O}_{\mathrm{eq}} \times \nabla G \right) dS' - \oiint_{S'} \mathbf{L}_{\mathrm{eq}} \left( k_0^2 \, G + \nabla\nabla G \right) \cdot d\mathbf{S}'

\Psi^{\mathrm{SCAT}} = \hat{n} \times \left( \bar{\bar{f}}_r^{-1} \nabla \times \mathbf{V}_{\text{FE-IIEE}} \right) + \gamma \, \hat{n} \times \hat{n} \times \mathbf{V}_{\text{FE-IIEE}}

Mesh Truncation with FE-IIEE: Flow Chart

[Figure: flow chart. A standard FEM code for non-open problems (computation and assembly of element matrices, imposition of the BCs not related to S, sparse solver, postprocess) is augmented with: an initial BC Ψ^(0)(r) on S; after each solve, V(r ∈ Γ_S) and ∇×V(r ∈ Γ_S) yield the equivalent currents J_eq^(i)(r′), M_eq^(i)(r′), which upgrade the BC on S to Ψ^(i+1)(r).]
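The loop in the chart is a fixed-point iteration on the boundary term. A minimal pure-Python sketch of that structure (solve_fem and update_boundary are toy stand-ins for the sparse FEM solve and the equivalence-principle radiation integrals, not the real kernels):

```python
# Schematic of the FE-IIEE loop: solve with the current boundary data,
# recompute the boundary data from the solution, repeat until converged.
# The two operators below are invented contractions, for illustration only.

def solve_fem(psi_s):
    # stand-in for the sparse FEM solve with boundary data psi_s on S
    return [0.5 * p + 1.0 for p in psi_s]

def update_boundary(v):
    # stand-in for the radiation integrals on S' that upgrade the BC
    return [0.5 * x for x in v]

def fe_iiee(psi0, tol=1e-10, max_iter=100):
    psi = psi0
    for it in range(max_iter):
        v = solve_fem(psi)            # FEM solve
        psi_new = update_boundary(v)  # upgrade of BC on S
        err = max(abs(a - b) for a, b in zip(psi_new, psi))
        psi = psi_new
        if err < tol:                 # error criterion satisfied
            break
    return psi, it

psi, iters = fe_iiee([0.0, 0.0])
```

With these toy operators the iteration contracts to the fixed point psi = 2/3; in the real code the solve/update pair plays the same roles.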

Computational Features

- Code written using modern Fortran constructs (F2003)
- Strong emphasis on code maintainability through OOP (Object-Oriented Programming) paradigms
- Numerical verification by the Method of Manufactured Solutions
- Numerical validation by tests with EM benchmark problems
- Hybrid MPI+OpenMP programming
- Direct solver interfaces (HSL, MUMPS, MKL PARDISO, ...)
- Graphical User Interface (GUI) with HPCaaS interface
- Linux & Windows versions

Computational Features: Towards HPC

- "Rethink" some of the OOP constructions (e.g., arrays of small derived types, ...)
- Global mesh object → local mesh objects on each processor
- Specialized direct solver interfaces
- Problems of several tens of millions of unknowns on more than one thousand cores

Parallel Flow Chart of the Code

[Figure: parallel flow chart. All processes (p0, p1, p2, ..., pn) read the input data and create the global mesh object; after numbering of DOFs, processes p1..pn deallocate the global mesh object while p0 initializes the solver; all processes create their local mesh objects and, if FE-IIEE is enabled, extract the boundary mesh; all processes calculate and fill the LHS matrix and perform the matrix factorization; p0 calculates and fills the RHS matrix; the system of equations is solved by all processes; with FE-IIEE enabled, the scattering field is calculated, p0 updates the global RHS, and the solve is repeated until convergence or the iteration limit is reached; the whole loop repeats while more frequencies remain, followed by postprocessing.]

Graphical User Interface: Features

- GUI based on a general-purpose pre- and post-processor called GiD: http://gid.cimne.upc.es
- Creation (or import) of the geometry model of the problem
- Mesh generation
- Assignment of material properties and boundary conditions
- Visualization of results
- Integration with Posidonia (in-house HPCaaS)

Posidonia: In-house HPCaaS Solution

Easing the use of HPC platforms:
- Remote job submission to HPC infrastructures
- Designed with security, user-friendliness, collaborative computing, and mobility in mind
- Management of all the communication with the remote computer system (file transfers, ...)
- Interaction with its batch system (job scheduler)
- History repository of simulations
- Notification when a submitted job is completed
- Transparent download of the results to visualize them locally
- Posidonia is also available as a stand-alone desktop/Android/Web solution (also for general use with other simulators and/or applications)

A. Amor-Martin, I. Martinez-Fernandez, L. E. Garcia-Castillo, "Posidonia: A Tool for HPC and Remote Scientific Simulations," IEEE Antennas and Propagation Magazine, no. 6, pp. 166-177, Dec. 2015.

GUI with HPCaaS Interface: Screenshot

[Figure: screenshot of the GUI with the HPCaaS interface]

Getting Experience with MUMPS: From "MUMPS, do it all" to ...

Matrix format:
- Elemental
- Assembled (centralized on process 0)
- Assembled (distributed among processes)
- Asking MUMPS for Schur complements and "playing" with them (outside MUMPS)

RHS and solution:
- Dense RHS
- Sparse RHS (large number of RHS vectors)
- Centralized solution
- Distributed solution (waiting for the distributed RHS feature, ...)
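The "assembled" formats above amount to summing element contributions into global (row, column, value) triplets. A minimal pure-Python sketch of that assembly step (the two 2-dof elements and their matrices are invented for illustration):

```python
# Assemble two overlapping element matrices into global COO triplets,
# accumulating duplicate (row, col) entries, as an assembled-format
# solver input expects. Element data here is made up.
from collections import defaultdict

elements = [
    # (global dof numbers, dense 2x2 element matrix)
    ([0, 1], [[2.0, -1.0], [-1.0, 2.0]]),
    ([1, 2], [[2.0, -1.0], [-1.0, 2.0]]),
]

coo = defaultdict(float)
for dofs, ke in elements:
    for a, i in enumerate(dofs):
        for b, j in enumerate(dofs):
            coo[(i, j)] += ke[a][b]   # shared dof 1 accumulates both cells

triplets = sorted((i, j, v) for (i, j), v in coo.items())
```

In the distributed assembled case each process builds only the triplets of its own partition; the accumulation logic is the same.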

MUMPS Interface: Present Stage of Development (version 5.0.2)

- MUMPS initialization
- Call to ParMETIS (or PT-Scotch) to partition the matrix among processors
  - Other alternatives for partitioning have been considered due to memory problems (commented on in the following slides)
- Computation of the FEM matrix coefficients associated with each local process
- Input of the matrix coefficients to MUMPS in distributed assembled format
- Call to MUMPS for matrix factorization
- Computation of the FEM RHS coefficients on process 0 (in blocks of 10-20 vectors) in sparse format
- Call to MUMPS for the system solve
- (If FE-IIEE is enabled) iterative update of the RHS and system solve until the error criterion is satisfied
- MUMPS finalization

(*) Frequent use of the out-of-core (OOC) capabilities of MUMPS

Some Issues with Memory: Memory Allocation inside MUMPS

Memory issue:
- A peak in memory use during the analysis phase has been detected (distributed assembled input)
- Found to be due to a memory allocation inside MUMPS routines, related to the maximum MAXS among processors

Listing 1: file zana_aux_par.F

    1589  SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
          ...
    1647  MAXS = ord%LAST(I) - ord%FIRST(I) + 1
          ...
    1653  ALLOCATE(SIPES(max(1,MAXS), NPROCS))
          ...
    1864  END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH

Some Issues with Memory (cont.): Memory Allocation inside MUMPS

Memory issue, example:
- 45,000,000-dof problem using 192 processes and 4 bytes per integer
- MAXS (bandwidth) of 45,000,000 ⇒ 34.56 GB of memory per process

Workaround:
- Matrix partition based on rows instead of elements of the mesh
  - Slightly worse LU fill-in (size of the cofactors) than with the partition based on elements
- Change the ordering of the dof given as input to MUMPS (to be done)

It may be the case that we are doing something completely wrong...
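The quoted figure follows directly from the SIPES allocation in Listing 1, in the worst case where one process owns the whole range:

```python
# Size of the per-process SIPES(max(1,MAXS), NPROCS) allocation in
# ZMUMPS_BUILD_LOC_GRAPH for the case quoted on the slide.
maxs = 45_000_000      # MAXS equals the full bandwidth
nprocs = 192
bytes_per_int = 4

size_gb = maxs * nprocs * bytes_per_int / 1e9
print(size_gb)  # 34.56 GB per process
```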

Some Issues with Memory: Large Number of RHS & Solution Vectors

Large number of RHS vectors: analysis of a given problem under a large number of excitations. Examples:
- Monostatic radar cross section (RCS) prediction
- Large arrays of antennas

Present MUMPS interface:
- Treatment of RHS & solution vectors in blocks (typically 10-20 vectors at a time)
- Use of sparse format for the RHS
- Use of centralized solution vectors
- The reason behind the treatment of RHS & solution vectors in blocks is to limit the memory needed to store the solution vectors
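The blocking logic can be sketched as follows (pure Python; solve_block is a hypothetical placeholder for the solver call, and only the column blocking is the point):

```python
def solve_in_blocks(solve_block, rhs_columns, block_size=16):
    # Process the RHS columns in blocks of `block_size` (typically 10-20)
    # so that only one block of solution vectors is in memory at a time.
    solutions = []
    for start in range(0, len(rhs_columns), block_size):
        block = rhs_columns[start:start + block_size]
        solutions.extend(solve_block(block))  # one solver call per block
    return solutions

# toy check: "solving" is the identity here
cols = [[float(i)] for i in range(50)]
out = solve_in_blocks(lambda b: b, cols, block_size=16)
```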

Some Issues with Memory (cont.): Large Number of RHS & Solution Vectors

Distributed solution planned for the near future?
- The update of the centralized RHS by FE-IIEE ⇒ use of a centralized solution is "natural" (easy in terms of code maintenance)
- Wish list: distributed RHS. Is the distributed RHS feature planned for near-future versions of MUMPS?

Specialized MUMPS Interfaces: Repetitive Solver

[Figures: Vivaldi antenna; antenna array; (a) unit cell and (b) unit cell mesh; virtual mesh of the antenna array]

Specialized MUMPS Interfaces (cont.): Repetitive Solver

Algorithm:
1. Computation of the Schur complement of the unit cell
2. Assembly of the Schur complements of all "virtual" cells ⇒ interface problem
3. Addition of boundary conditions to the interface problem
4. Solve the interface problem
5. Solve the interior unit-cell problems

Identical matrices with different right-hand sides.
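The algorithm can be illustrated with scalar blocks, so the linear algebra stays elementary: every "cell" has the same 2x2 matrix [[a_ii, a_ib], [a_bi, a_bb]] (interior/boundary partition), its Schur complement is computed once and reused, and two cells share one interface unknown. All numbers are invented for illustration; the real code obtains the Schur complements from the solver.

```python
# Repetitive-solver sketch with scalar blocks. Two identical cells
# share a single interface unknown xb; only the RHS differs per cell.
a_ii, a_ib, a_bi, a_bb = 4.0, 1.0, 1.0, 3.0
b = [1.0, 2.0]          # interior RHS of cell 1 and cell 2
b_interface = 3.0       # RHS of the shared interface equation

# 1) Schur complement of the unit cell (computed once)
s = a_bb - a_bi * a_ib / a_ii

# 2)-4) assemble the two identical Schur contributions and solve the
#       (here 1x1) interface problem
rhs_b = b_interface - sum(a_bi / a_ii * bk for bk in b)
xb = rhs_b / (2 * s)

# 5) solve the interior problems: identical matrices, different RHS
x_int = [(bk - a_ib * xb) / a_ii for bk in b]
```

Substituting the result back into the three assembled equations reproduces the RHS exactly, which is the point of the Schur elimination.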

Specialized MUMPS Interfaces (cont.): Repetitive Solver

Features and remarks:
- Advantages: savings in time and memory
- Under certain circumstances (number of cells equal to a power of 2 and no BC), all leaves of a given level of the tree are identical
  - Further savings in time
  - Large savings in memory
- Boundary conditions (BC) alter this one-branch tree behavior ⇒ BC may be left up to the root of the tree
- Or "algebraic symmetry" can be explored


Specialized MUMPS Interfaces: Hybrid Direct & Iterative Solver

- Multifrontal algorithm on only a few levels
- Iterative solution from the last level of the multifrontal algorithm
- It can be understood as the direct solver acting as a preconditioner for the iterative solver
- Natural approach to some DDM strategies
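The "direct solver as preconditioner" idea can be sketched with a preconditioned Richardson iteration, x ← x + M(b − Ax), where M approximates A⁻¹. Here M is a crude stand-in (the inverse of the diagonal of A, as if the factorization had been truncated); the matrix and numbers are invented for illustration:

```python
# Toy preconditioned Richardson iteration on a 2x2 system.
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def precond_richardson(A, b, M_diag, tol=1e-12, max_iter=200):
    x = [0.0] * len(b)
    for _ in range(max_iter):
        r = [bi - ri for bi, ri in zip(b, matvec(A, x))]  # residual
        if max(abs(ri) for ri in r) < tol:
            break
        # apply the approximate direct solve M to the residual
        x = [xi + mi * ri for xi, mi, ri in zip(x, M_diag, r)]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
M_diag = [1.0 / 4.0, 1.0 / 3.0]   # truncated-factorization stand-in
x = precond_richardson(A, b, M_diag)
```

The better M approximates the true factorization, the fewer iterations are needed; with the exact factors the loop converges in one step.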

Specialized MUMPS Interfaces: Dealing with the Lack of Availability of LU Cofactors

- Calls to multiple (typically sequential) MUMPS instances for the Schur complements
- Assembly of the Schur complements
- Finalization of the MUMPS instances
- Solution of the interface problem
- Creation of new MUMPS instances to solve the interior problems

Specialized MUMPS Interfaces (cont.): Dealing with the Lack of Availability of LU Cofactors

MUMPS instances for the interior problems (idea inspired by work led by Prof. Paszynski):
- Reproduction (or restoration) of the interior matrix equation and the interior right-hand side
- Call to multiple (typically sequential) MUMPS instances to factorize/solve the interior problems
- Use of Dirichlet conditions for the interface unknowns
- Preliminary tests show that the approach pays off in memory (as expected) while remaining competitive in time

Regular MUMPS & MULTISOLVER: Time and Memory Comparison

[Figure: solution time in seconds (0-1000) versus number of unknowns (1x10^6 to 6x10^6) for MULTISOLVER and regular MUMPS]

[Figure: memory in GB (0-140) versus number of unknowns (1x10^6 to 6x10^6) for MULTISOLVER and regular MUMPS, with the system limit (128 GB) marked]

HPC Environment: Cluster of Xidian University (XDHPC)

- 140 compute nodes, each with:
  - two twelve-core Intel Xeon 2690 v2 2.2 GHz CPUs
  - 64 GB of RAM
  - 18 TB of hard disk
- 56 Gbps InfiniBand network
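For reference, the aggregate capacity implied by the per-node figures above:

```python
# Aggregate capacity of XDHPC from the per-node figures on the slide.
nodes = 140
cores_per_node = 2 * 12      # two twelve-core CPUs per node
ram_per_node_gb = 64

total_cores = nodes * cores_per_node
total_ram_gb = nodes * ram_per_node_gb
print(total_cores, total_ram_gb)  # 3360 cores, 8960 GB of RAM
```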

Waveguide Problem: Low-Pass Filter with Higher-Order Mode Suppression

- Frequency band: [10, 16] GHz
- Length: 218 mm
- 324.5 K tetrahedrons
- 2.2 M unknowns
- Wall time: 7.3 min per frequency point

I. Arregui et al., "High-power low-pass harmonic filters with higher-order TEn0 and non-TEn0 mode suppression: design method and multipactor characterization," IEEE Trans. Microw. Theory Techn., Dec. 2013.

Waveguide Problem (cont.): Low-Pass Filter with Higher-Order Mode Suppression

[Figures]

Scattering Problem: Bistatic RCS of Car

- Bistatic RCS at 1.5 GHz
- Tyres modeled as dielectric (εr = 4.0)
- Several incident plane waves from different directions
- 2.7 M tetrahedrons
- 17.3 M unknowns
- Wall time: 5.9 min per frequency point (46 compute nodes)

Scattering Problem (cont.): Bistatic RCS of Car

[Figures: bistatic RCS for an incident plane wave arriving from behind, and for an incident plane wave arriving from the front]

Parallel Scalability: Factorization and FE-IIEE Stages

[Figure (left): speedup of the factorization + solving stage versus number of cores (120 to 480), for 12, 6, and 4 threads per process; labeled speedups of 4.9, 5.2, 5.7, and 7.0]

[Figure (right): speedup of the FE-IIEE mesh-truncation stage versus number of cores, same thread configurations; labeled speedups of 7.8, 8.5, and 10.0]

Benchmark: bistatic RCS of an Impala

Parallel Scalability: Whole Code

[Figure: speedup of the whole code (factorization + solving + FE-IIEE) versus number of cores (120 to 480), for 12, 6, and 4 threads per process; labeled speedups of 6.2, 6.4, 7.0, and 8.0]

Work in Progress and Future Work

Work in progress:
- Hierarchical basis functions of variable order p
- h-adaptivity ⇒ support for hp meshes

Future work:
- Conformal and non-conformal DDM
- Hybrid (direct + iterative) solver

Thanks for your attention

Thanks to the MUMPS team


q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 11: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Mesh Truncation with FE-IIEE: Flow Chart

[Flow chart] A standard FEM code for non-open problems (computation and assembly of element matrices, imposition of the BC not related to S, sparse solver, postprocess) is wrapped in the FE-IIEE loop: an initial BC Psi(0)(r) is imposed on S; from each solution, equivalent currents Jeq(i)(r') and Meq(i)(r') yield V(r in Gamma_S) and curl V(r in Gamma_S), which upgrade the BC on S to Psi(i+1)(r).

GREMA-UC3M Higher Order Finite Element Code v12
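The loop above amounts to a fixed-point iteration in which the FEM matrix is factorized once and only the right-hand side changes between iterations. A cartoon of that structure in NumPy (not the actual Fortran/MUMPS code; `update_rhs` abstracts the computation of equivalent currents and the upgraded boundary condition, and the explicit inverse stands in for the stored LU factors):

```python
import numpy as np

def fe_iiee_solve(A, b0, update_rhs, tol=1e-10, max_iter=100):
    """Fixed-point loop that reuses one factorization for every solve.

    The explicit inverse below stands in for the LU factors a direct
    solver such as MUMPS keeps after factorization: each FE-IIEE
    iteration costs only a back-substitution with an updated RHS.
    """
    Ainv = np.linalg.inv(A)        # "factorize once"
    x = Ainv @ b0                  # solve with the initial BC on S
    for _ in range(max_iter):
        b = update_rhs(x)          # equivalent currents -> upgraded BC
        x_new = Ainv @ b           # solve phase only, no refactorization
        if np.linalg.norm(x_new - x) <= tol * np.linalg.norm(x_new):
            return x_new
        x = x_new
    return x
```

With a contractive `update_rhs` such as `lambda x: b0 + C @ x`, the iteration converges to the solution of `(A - C) x = b0`, which mimics the convergence criterion of the slide's loop.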

Computational Features

- Code written using modern Fortran constructs (Fortran 2003)
- Strong emphasis on code maintainability through OOP (Object-Oriented Programming) paradigms
- Numerical verification by means of the Method of Manufactured Solutions
- Numerical validation by tests with EM benchmark problems
- Hybrid MPI+OpenMP programming
- Direct solver interfaces (HSL, MUMPS, MKL PARDISO, ...)
- Graphical User Interface (GUI) with HPCaaS interface
- Linux & Windows versions

Computational Features: Towards HPC

- "Rethink" some of the OOP constructions (e.g., arrays of small derived types, ...)
- Global mesh object → local mesh objects on each processor
- Specialized direct solver interfaces
- Problems of several tens of millions of unknowns on more than one thousand cores

Parallel Flow Chart of the Code

[Flow chart] All processes (p0 ... pn) read the input data and create the global mesh object. After the numbering of DOFs, processes p1 ... pn deallocate the global mesh object while p0 initializes the solver. Every process then creates its local mesh object, extracts the boundary mesh, calculates and fills its part of the LHS matrix, and takes part in the matrix factorization. Process p0 calculates and fills the RHS matrix, and the system of equations is solved by all processes. If FE-IIEE is enabled, the scattering field is calculated, p0 updates the global RHS, and the solve is repeated until convergence or the iteration limit is achieved. The loop continues over the remaining frequencies before postprocessing.

Graphical User Interface: Features

- GUI based on a general-purpose pre- and post-processor called GiD (http://gid.cimne.upc.es)
- Creation (or import) of the geometry model of the problem
- Mesh generation
- Assignment of material properties and boundary conditions
- Visualization of results
- Integration with Posidonia (in-house HPCaaS)

Posidonia: In-house HPCaaS Solution

- Eases the use of HPC platforms
- Remote job submission to HPC infrastructures
- Designed with security, user-friendliness, collaborative computing, and mobility in mind
- Management of all the communication with the remote computer system (file transfers, ...)
- Interaction with its batch system (job scheduler)
- History repository of simulations
- Notification when a submitted job is completed
- Transparent downloading of the results to visualize them locally
- Posidonia is also available as a stand-alone desktop/Android/Web solution (also for general use with other simulators and/or applications)

A. Amor-Martin, I. Martinez-Fernandez, L. E. Garcia-Castillo, "Posidonia: A Tool for HPC and Remote Scientific Simulations," IEEE Antennas and Propagation Magazine, no. 6, pp. 166-177, Dec. 2015.

GUI with HPCaaS Interface: Screenshot

[Screenshot]

Getting Experience with MUMPS: From "MUMPS, do it all" to ...

Matrix format:
- Elemental
- Assembled (centralized on process 0)
- Assembled (distributed among processes)
- Asking MUMPS for Schur complements and "playing" with them (outside MUMPS)

RHS and solution:
- Dense RHS
- Sparse RHS (large number of RHS vectors)
- Centralized solution
- Distributed solution (waiting for the distributed RHS feature ...)
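In the distributed assembled format each process hands MUMPS its local share of coefficient triplets; entries for the same position may appear on several processes and are summed on assembly. A tiny plain-Python model of that convention (the "processes" are just lists here, and the 1-based indexing follows the MUMPS user guide):

```python
# Each "process" owns a list of (row, col, value) triplets, 1-based as
# in MUMPS.  Duplicated (row, col) entries across processes are summed
# by the solver during analysis/factorization.
local_triplets = [
    [(1, 1, 2.0), (1, 2, 1.0)],               # process 0
    [(2, 1, 1.0), (2, 2, 2.0), (1, 1, 1.0)],  # process 1, duplicate (1,1)
]

def assemble_dense(chunks, n):
    """Reference assembly: sum all triplets into a dense n x n matrix."""
    A = [[0.0] * n for _ in range(n)]
    for chunk in chunks:
        for i, j, v in chunk:
            A[i - 1][j - 1] += v
    return A

A = assemble_dense(local_triplets, 2)
print(A)   # [[3.0, 1.0], [1.0, 2.0]]
```

The duplicated (1, 1) entry illustrates why a FEM code can pass element-boundary contributions from different processes without pre-summing them.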

MUMPS Interface: Present Stage of Development (version 5.0.2)

- MUMPS initialization
- Call to ParMETIS (or PT-Scotch) to partition the matrix among processors
  - Other alternatives for partitioning have been considered due to memory problems (commented on in the following slides)
- Computation of the FEM matrix coefficients associated with each local process
- Input of the matrix coefficients to MUMPS in distributed assembled format
- Call to MUMPS for matrix factorization
- Computation of the FEM RHS coefficients on process 0 (in blocks of 10-20 vectors) in sparse format
- Call to MUMPS for system solve
- (If FE-IIEE is enabled) Iterative update of the RHS and system solve until the error criterion is satisfied
- MUMPS finalization

(*) Frequent use of the out-of-core (OOC) capabilities of MUMPS.

Some Issues with Memory: Memory Allocation inside MUMPS

Memory issue:
- A peak in memory use has been detected during the analysis phase (distributed assembled input)
- Found to be due to a memory allocation inside MUMPS routines, related to the maximum MAXS among processors

Listing 1: file zana_aux_par.F

    1589  SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
          ...
    1647  MAXS = ordLAST(I)-ordFIRST(I)+1
          ...
    1653  ALLOCATE(SIPES(max(1,MAXS), NPROCS))
          ...
    1864  END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH

Some Issues with Memory (cont.): Memory Allocation inside MUMPS

Memory issue. Example: a 45,000,000-dof problem using 192 processes and 4 bytes per integer; if the MAXS bandwidth is 45,000,000 ⇒ 34.56 GB of memory per process.

Workaround:
- Matrix partition based on rows instead of elements of the mesh
  - Slightly worse LU fill-in (size of cofactors) than with the partition based on elements
- Change the ordering of the dof given as input to MUMPS (to be done)

It may be the case that we are doing something completely wrong ...
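The per-process figure quoted above is simply the size of the SIPES(max(1,MAXS), NPROCS) array: one integer per (row, process) pair. A quick back-of-the-envelope check in plain Python:

```python
# Peak allocation of SIPES(max(1,MAXS), NPROCS) in ZMUMPS_BUILD_LOC_GRAPH:
# one default integer per (row, process) pair, allocated on each process.
dof = 45_000_000          # unknowns; worst case, MAXS spans all of them
nprocs = 192              # MPI processes
bytes_per_integer = 4

bytes_per_process = dof * nprocs * bytes_per_integer
gb_per_process = bytes_per_process / 1e9
print(gb_per_process)     # 34.56 (GB per process)
```

This is why the workaround targets MAXS (the per-process row range) rather than NPROCS.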

Some Issues with Memory: Large Number of RHS & Solution Vectors

Large number of RHS vectors: analysis of a given problem under a large number of excitations. Examples:
- Monostatic radar cross section (RCS) prediction
- Large arrays of antennas

Present MUMPS interface: treatment of RHS & solution vectors in blocks (typically 10-20 vectors at a time):
- Use of sparse format for the RHS
- Use of centralized solution vectors
  - The reason behind the treatment of RHS & solution vectors in blocks is to limit the memory needed to store the solution vectors
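The blocked treatment caps the number of solution vectors in flight at the block size. A schematic NumPy version (in the real code each block is one MUMPS solve with a sparse RHS; `solve_in_blocks` and its generator interface are illustrative, not the actual API):

```python
import numpy as np

def solve_in_blocks(A, B, block_size=20):
    """Yield solutions of A X = B in column blocks of at most block_size.

    Each block of solution vectors can be postprocessed and discarded
    before the next one is computed, so only block_size columns are
    held in memory at a time, which is the point of the blocked RHS
    treatment described above.
    """
    n_rhs = B.shape[1]
    for start in range(0, n_rhs, block_size):
        stop = min(start + block_size, n_rhs)
        # In the real code: one call to the direct solver's solve phase.
        yield start, np.linalg.solve(A, B[:, start:stop])
```

For 45 excitations and a block size of 20 this produces blocks of 20, 20, and 5 columns.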

Some Issues with Memory (cont.): Large Number of RHS & Solution Vectors

Distributed solution planned for the near future:
- Update of the centralized RHS by FE-IIEE ⇒ use of a centralized solution is "natural" (easy in terms of code maintenance)
- Wish list: distributed RHS. Is the distributed RHS feature planned for near-future versions of MUMPS?

Specialized MUMPS Interfaces: Repetitive Solver

[Figures] Vivaldi antenna element; antenna array; (a) unit cell and (b) unit cell mesh; virtual mesh of the antenna array.

Specialized MUMPS Interfaces (cont.): Repetitive Solver

Algorithm:
1. Computation of the Schur complement of the unit cell
2. Assembly of the Schur complements of all "virtual" cells ⇒ interface problem
3. Addition of boundary conditions to the interface problem
4. Solve the interface problem
5. Solve the interior unit-cell problems
   - Identical matrices with different right-hand sides
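For a single cell with interior unknowns i and interface unknowns b, the algorithm above reduces to block elimination: form S = Abb - Abi Aii^-1 Aib once, solve the interface problem, then solve the interior problem (same matrix Aii, new right-hand side). A toy NumPy version (in the code these steps are MUMPS Schur-complement and solve calls; the explicit inverse stands in for the stored factors):

```python
import numpy as np

def schur_solve(A, b, nb):
    """Solve A x = b by a Schur complement on the last nb unknowns.

    A = [[Aii, Aib],
         [Abi, Abb]]  with i = interior, b = interface unknowns.
    """
    ni = A.shape[0] - nb
    Aii, Aib = A[:ni, :ni], A[:ni, ni:]
    Abi, Abb = A[ni:, :ni], A[ni:, ni:]
    bi, bb = b[:ni], b[ni:]

    # Step 1: Schur complement of the (unit-cell) interior block.
    Aii_inv = np.linalg.inv(Aii)          # stands in for the LU factors
    S = Abb - Abi @ Aii_inv @ Aib
    g = bb - Abi @ (Aii_inv @ bi)

    # Steps 2-4: assemble and solve the interface problem.
    xb = np.linalg.solve(S, g)

    # Step 5: interior solve, same matrix Aii, different right-hand side.
    xi = Aii_inv @ (bi - Aib @ xb)
    return np.concatenate([xi, xb])
```

In the repetitive solver, the same S (and the same interior factorization) is reused for every "virtual" cell, which is where the time and memory savings come from.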

Specialized MUMPS Interfaces (cont.): Repetitive Solver

Features and remarks:
- Advantages: saving in time and memory
- Under certain circumstances (number of cells equal to a power of 2 and no BC), all leaves of a given level of the assembly tree are identical
  - Further saving in time
  - Large saving in memory
- Boundary conditions (BC) alter this "one-branch tree" behavior ⇒ BC may be left up to the root of the tree, or "algebraic symmetry" can be explored

Specialized MUMPS Interfaces: Hybrid Direct & Iterative Solver

- Multifrontal algorithm on only a few levels
- Iterative solution from the last level of the multifrontal algorithm
- It can be understood as the direct solver acting as a preconditioner for the iterative solver
- Natural approach to some DDM strategies
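A cartoon of the idea, with the "few multifrontal levels" caricatured as exact solves on independent diagonal blocks used to precondition a conjugate gradient iteration (NumPy sketch under the assumption of a symmetric positive definite system; the real code works on the unsymmetric complex FEM matrices through MUMPS):

```python
import numpy as np

def hybrid_solve(A, b, blocks, tol=1e-10, max_iter=500):
    """Preconditioned conjugate gradient where the preconditioner is an
    exact (direct) solve on independent diagonal blocks: a cartoon of
    stopping the multifrontal elimination after a few levels and
    iterating from there."""
    inv_blocks = [np.linalg.inv(A[np.ix_(ix, ix)]) for ix in blocks]

    def apply_M(r):
        # "Direct solver as preconditioner": exact solve per block.
        z = np.zeros_like(r)
        for ix, inv in zip(blocks, inv_blocks):
            z[ix] = inv @ r[ix]
        return z

    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_M(r)
    p = z.copy()
    for _ in range(max_iter):
        Ap = A @ p
        alpha = (r @ z) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol * np.linalg.norm(b):
            return x
        z_new = apply_M(r_new)
        beta = (r_new @ z_new) / (r @ z)
        p = z_new + beta * p
        r, z = r_new, z_new
    return x
```

The block structure maps naturally onto subdomains, which is why the slide calls this a natural approach to some DDM strategies.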

Specialized MUMPS Interfaces: Dealing with the Lack of Availability of LU Cofactors

- Calls to multiple (typically sequential) MUMPS instances for the Schur complements
- Assembling the Schur complements
- Finalizing the MUMPS instances
- Solving the interface problem
- Creating new MUMPS instances to solve the interior problems

Specialized MUMPS Interfaces (cont.): Dealing with the Lack of Availability of LU Cofactors

MUMPS instances for interior problems (idea inspired by work led by Prof. Paszynski):
- Reproduction (or restoration) of the interior matrix equation and interior right-hand side
- Calls to multiple (typically sequential) MUMPS instances to factorize/solve the interior problems
- Use of Dirichlet conditions for the interface unknowns
- Preliminary tests show that the approach is worthwhile in memory (as expected) and competitive in time

Regular MUMPS & MULTISOLVER: Time and Memory Comparison

[Plot] Solve time in seconds (0 to 1000) versus number of unknowns (1x10^6 to 6x10^6) for MULTISOLVER and regular MUMPS.

Regular MUMPS & MULTISOLVER (cont.): Time and Memory Comparison

[Plot] Memory in GB (0 to 140) versus number of unknowns (1x10^6 to 6x10^6) for MULTISOLVER and MUMPS, against the system limit of 128 GB.

HPC Environment: Cluster of Xidian University (XDHPC)

140 compute nodes:
- Two twelve-core Intel Xeon 2690 V2 2.2 GHz CPUs
- 64 GB of RAM
- 18 TB of hard disk

56 Gbps InfiniBand network

Waveguide Problem: Low Pass Filter with Higher-Order Mode Suppression

- [10-16] GHz
- Length: 218 mm
- 324.5 K tetrahedra
- 2.2 M unknowns
- Wall time: 7.3 min per frequency point

Arregui et al., "High-power low-pass harmonic filters with higher-order TEn0 and non-TEn0 mode suppression: design method and multipactor characterization," IEEE Trans. Microw. Theory Techn., Dec. 2013.

Waveguide Problem (cont.): Low Pass Filter with Higher-Order Mode Suppression

[Figures]

Scattering Problem: Bistatic RCS of Car

- Bistatic RCS at 1.5 GHz
- Tyres modeled as dielectric (epsilon_r = 4.0)
- Several incident plane waves from different directions

Scattering Problem (cont.): Bistatic RCS of Car

- 2.7 M tetrahedra
- 17.3 M unknowns
- Wall time: 5.9 min per frequency point (46 compute nodes)

Scattering Problem (cont.): Bistatic RCS of Car

[Field plots] Incident plane wave arriving from behind, and from the front.

Parallel Scalability: Factorization and FE-IIEE Stages

[Plots] Speedup versus number of cores (120, 240, 360, 480) for 12, 6, and 4 threads per process: factorization + solving phase (annotated speedups 4.9, 5.2, 5.7, and 7.0) and FE-IIEE mesh-truncation phase (annotated speedups 7.8, 8.5, and 10.0).

Benchmark: Bistatic RCS of Impala
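The figures in these graphs are ratios of wall times against a reference run; for completeness, speedup and parallel efficiency are computed as below (plain Python; the timings in the example are made up for illustration, not taken from the benchmark):

```python
def speedup(t_ref, t_n):
    """Speedup of a run with wall time t_n versus the reference run t_ref."""
    return t_ref / t_n

def efficiency(t_ref, n_ref, t_n, n_cores):
    """Parallel efficiency relative to the n_ref-core reference run."""
    return speedup(t_ref, t_n) * n_ref / n_cores

# Hypothetical timings: a job taking 1000 s on 24 cores and 250 s on 96.
print(speedup(1000.0, 250.0))             # 4.0
print(efficiency(1000.0, 24, 250.0, 96))  # 1.0
```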

Parallel Scalability: Whole Code

[Plot] Speedup of the whole code (factorization + solving + FE-IIEE) versus number of cores (120, 240, 360, 480) for 12, 6, and 4 threads per process; annotated speedups 6.2, 6.4, 7.0, and 8.0.

Work in Progress and Future Work

Work in progress:
- Hierarchical basis functions of variable order (p)
- h-adaptivity ⇒ support for hp meshes

Future work:
- Conformal and non-conformal DDM
- Hybrid (direct + iterative) solver

Thanks for your attention

Thanks to the MUMPS team

Page 12: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Computational Features

Computational FeaturesCode written using modern Fortran constructions (F2003)Strong emphasis in code maintainability by use of OOP (ObjectOriented Programming) paradigmsNumerical verification by use of the Method of Manufactured SolutionsNumerical validation by tests with EM benchmark problemsHybrid MPI+OpenMP programmingDirect solver interfaces (HSL MUMPS MKL Pardiso )Graphical User Interface (GUI) with HPCaaS InterfaceLinux amp Windows versions

GREMA-UC3M Higher Order Finite Element Code v12

Computational FeaturesTowards HPC

Towards HPCldquoRethinkrdquo some of the OOP constructions (eg arrays of small derivedtypes )Global mesh objectrarr local mesh objects on each processorSpecialized direct solver interfaces Problems of several tens of millions of unknowns on more than onethousand cores

GREMA-UC3M Higher Order Finite Element Code v12

Parallel Flow Chart of the Code

MPI Division

FE-IIEE enabled

Yes

No

p0 p1 p2 pn ReadInput Data

p0 p1 p2 pn Create global mesh object

Numberingof DOFs

p1 p2 pn Deallocateglobal mesh

object

p0

Initializesolver

p0 p1 p2 pn Create local mesh object

p0 p1 p2 pn Extractboundary

mesh

p0 p1 p2 pn

p0 p1 p2 pn Calculate ampfill LHS matrix

p0 p1 p2 pn Matrixfactorization

p0 Calculate ampfill RHS matrix

Solve systemof equations

p0 p1 p2 pn

Yes

Yes

Calculatescatteringfield

p0 Updateglobal RHS

FE-IIEE enabled

p0 p1 p2 pn

More frequencies

Postprocess

No

No

Solve systemof equations

p0 p1 p2 pn

ConvIter achieved

Yes

No

GREMA-UC3M Higher Order Finite Element Code v12

Graphical User InterfaceFeatures

FeaturesGUI based on a general purpose pre- and post-processor called GiDhttpgidcimneupces

Creation (or importation) of the geometry model of the problemMesh generationAssignation of material properties and boundary conditionsVisualization of resultsIntegration with Posidona (in-house HPCaaS)

GREMA-UC3M Higher Order Finite Element Code v12

PosidoniaIn-house HPCaaS Solution

Easing the use of HPC platformsRemote job-submission to HPC infraestructuresDesigned with security user-friendliness collaborative-computing andmobility in mindManagement of all the communication with the remote computer system(file transfers )Interaction with its batch system (job scheduler)History repository of simulationsNotification when job submitted is completedTransparent downloading of the results to visualize them locallyPosidonia also available as stand-alone desktopAndroidWeb solution(also for general use with other simulator andor applications)

A Amor-Martin I Martinez-Fernandez L E Garcia-Castillo ldquoPosidonia A Tool for HPC andRemote Scientific Simulationsrdquo IEEE Antennas and Propagation Magazine 6166ndash177 Dec2015

GREMA-UC3M Higher Order Finite Element Code v12

GUI with HPCaaS InterfaceScreenshot

GREMA-UC3M Higher Order Finite Element Code v12

Getting Experience with MUMPSFrom ldquoMUMPS do it allrdquo to

Matrix FormatElementalAssembled (centralized on process 0)Assembled (distributed among processes)Asking to MUMPS for Schur complements and ldquoplayingrdquo with them(outside MUMPS)

RHS and solutionDense RHSSparse RHS (large number of RHS vectors)Centralized solutionDistributed solution (waiting for distributed RHS feature )

GREMA-UC3M Higher Order Finite Element Code v12

MUMPS InterfacePresent Stage of Development (version 502)

MUMPS initializationCall to ParMETIS (or PT-Scotch) to partition matrix among processors

I Other alternatives for partitioning have been considered due to memoryproblems (commented in following slides)

Computation of FEM matrix coefficients associated to each localprocessInput of matrix coefficients to MUMPS in distributed assembled formatCall to MUMPS for matrix factorizationComputation of FEM RHS coefficients on process 0 (in blocks of 10-20vectors) in sparse formatCall to MUMPS for system solve(FE-IIEE enabled) Iteratively update of RHS and system solve until errorcriterion is satisfiedMUMPS finalization

lowast Frequent use of out-of-core (OOC) capabilities of MUMPS

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with MemoryMemory Allocation inside MUMPS

Memory IssueA peak memory use during analysis phase has been detected(distributed assembled)Found out to be due to memory allocation inside MUMPS routinesrelated to maximum MAXS among processors

Listing 1 file zana aux parF

1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH1647 MAXS = ordLAST(I)-ordFIRST(I)+11653 ALLOCATE(SIPES(max(1MAXS) NPROCS))1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with Memory (cont)Memory Allocation inside MUMPS

Memory IssueExample 45000000 dof problem using 192 processes and 4 bytes perintegerMAXS bandwidth is 45000000rArr 3456 GB memory per process

WorkaroundMatrix partition based on rows instead of elements of the mesh

I Slightly worse LU fill-in (size of cofactors) than with partition based onelements

Change ordering of dof as input to MUMPS (to be done)

It may be the case we are doing something completely wrong

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with MemoryLarge Number of RHS amp Solution Vectors

Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples

Monostatic radar cross section (RCS) predictionLarge arrays of antennas

Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)

Use of sparse format for RHSUse of centralized solution vectors

I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors

Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesRepetitive Solver

Vivaldi antenna

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

(a) Unit Cell (b) Unit Cell Mesh

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Figure Virtual Mesh of Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface

problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems

I Identical matrices with different right hand sides

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical

I Further saving in timeI Large saving in memory

Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver

Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski

Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVERTime and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0 0

4 0 0

6 0 0

8 0 0

1 0 0 0Se

cond

s

u n k n o w n s

M U L T I S O L V E R M U M P S

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0

4 0

6 0

8 0

1 0 0

1 2 0

1 4 0Me

mory

in GB

u n k n o w n s

M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 13: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Computational FeaturesTowards HPC

Towards HPCldquoRethinkrdquo some of the OOP constructions (eg arrays of small derivedtypes )Global mesh objectrarr local mesh objects on each processorSpecialized direct solver interfaces Problems of several tens of millions of unknowns on more than onethousand cores

GREMA-UC3M Higher Order Finite Element Code v12

Parallel Flow Chart of the Code

MPI Division

FE-IIEE enabled

Yes

No

p0 p1 p2 pn ReadInput Data

p0 p1 p2 pn Create global mesh object

Numberingof DOFs

p1 p2 pn Deallocateglobal mesh

object

p0

Initializesolver

p0 p1 p2 pn Create local mesh object

p0 p1 p2 pn Extractboundary

mesh

p0 p1 p2 pn

p0 p1 p2 pn Calculate ampfill LHS matrix

p0 p1 p2 pn Matrixfactorization

p0 Calculate ampfill RHS matrix

Solve systemof equations

p0 p1 p2 pn

Yes

Yes

Calculatescatteringfield

p0 Updateglobal RHS

FE-IIEE enabled

p0 p1 p2 pn

More frequencies

Postprocess

No

No

Solve systemof equations

p0 p1 p2 pn

ConvIter achieved

Yes

No

GREMA-UC3M Higher Order Finite Element Code v12

Graphical User InterfaceFeatures

FeaturesGUI based on a general purpose pre- and post-processor called GiDhttpgidcimneupces

Creation (or importation) of the geometry model of the problemMesh generationAssignation of material properties and boundary conditionsVisualization of resultsIntegration with Posidona (in-house HPCaaS)

GREMA-UC3M Higher Order Finite Element Code v12

Posidonia: In-house HPCaaS Solution

Easing the use of HPC platforms:
- Remote job submission to HPC infrastructures
- Designed with security, user-friendliness, collaborative computing, and mobility in mind
- Management of all the communication with the remote computer system (file transfers, ...)
- Interaction with its batch system (job scheduler)
- History repository of simulations
- Notification when a submitted job is completed
- Transparent downloading of the results to visualize them locally
- Posidonia is also available as a stand-alone desktop/Android/Web solution (also for general use with other simulators and/or applications)

A. Amor-Martin, I. Martinez-Fernandez, L. E. Garcia-Castillo, "Posidonia: A Tool for HPC and Remote Scientific Simulations," IEEE Antennas and Propagation Magazine, no. 6, pp. 166-177, Dec. 2015.


GUI with HPCaaS Interface: Screenshot


Getting Experience with MUMPS: From "MUMPS, do it all" to ...

Matrix format:
- Elemental
- Assembled (centralized on process 0)
- Assembled (distributed among processes)
- Asking MUMPS for Schur complements and "playing" with them (outside MUMPS)

RHS and solution:
- Dense RHS
- Sparse RHS (large number of RHS vectors)
- Centralized solution
- Distributed solution (waiting for the distributed RHS feature, ...)


MUMPS Interface: Present Stage of Development (version 5.0.2)

- MUMPS initialization
- Call to ParMETIS (or PT-Scotch) to partition the matrix among processors
  - Other alternatives for partitioning have been considered due to memory problems (commented on in the following slides)
- Computation of the FEM matrix coefficients associated with each local process
- Input of the matrix coefficients to MUMPS in distributed assembled format
- Call to MUMPS for matrix factorization
- Computation of the FEM RHS coefficients on process 0 (in blocks of 10-20 vectors) in sparse format
- Call to MUMPS for system solve
- (FE-IIEE enabled) Iterative update of the RHS and system solve until the error criterion is satisfied
- MUMPS finalization

* Frequent use of the out-of-core (OOC) capabilities of MUMPS
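The step sequence above can be sketched as runnable pseudocode. This is a hedged sketch of the control flow only: `assemble_lhs`, `factorize`, `solve_block`, and `update_rhs` are hypothetical stand-ins (injected callables) for the FEM assembly, the MUMPS factorization/solve calls, and the FE-IIEE boundary update.

```python
def run_frequency_point(assemble_lhs, factorize, solve_block, update_rhs,
                        excitations, feiiee=False, block=16, tol=1e-3, maxit=25):
    """Per-frequency flow: factorize once, then solve the excitation
    vectors in blocks, optionally iterating the FE-IIEE RHS update."""
    factors = factorize(assemble_lhs())   # matrix factorization, done once
    solution = []
    for first in range(0, len(excitations), block):
        rhs = excitations[first:first + block]
        x = solve_block(factors, rhs)     # factors are reused for every block
        for _ in range(maxit if feiiee else 0):
            rhs = update_rhs(x)           # FE-IIEE update of the RHS (process 0)
            x_new = solve_block(factors, rhs)
            converged = max(abs(a - b)
                            for xa, xb in zip(x_new, x)
                            for a, b in zip(xa, xb)) < tol
            x = x_new
            if converged:
                break
        solution.extend(x)
    return solution
```

The point of the structure is that the expensive factorization is amortized over all RHS blocks and, when FE-IIEE is enabled, over all its update/solve iterations.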


Some Issues with Memory: Memory Allocation inside MUMPS

Memory issue:
- A peak in memory use during the analysis phase has been detected (distributed assembled input)
- Found to be due to a memory allocation inside MUMPS routines, related to the maximum MAXS among processors

Listing 1: file zana_aux_par.F

1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
     ...
1647    MAXS = ord%LAST(I)-ord%FIRST(I)+1
     ...
1653    ALLOCATE(SIPES(max(1,MAXS), NPROCS))
     ...
1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH


Some Issues with Memory (cont.): Memory Allocation inside MUMPS

Memory issue example:
- 45,000,000-dof problem using 192 processes and 4 bytes per integer
- MAXS (bandwidth) is 45,000,000 ⇒ 34.56 GB of memory per process
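The 34.56 GB figure follows directly from the SIPES(max(1,MAXS), NPROCS) allocation in Listing 1. A quick arithmetic check (a sketch; `sipes_gigabytes` is just an illustrative helper, not MUMPS code):

```python
def sipes_gigabytes(maxs, nprocs, bytes_per_int=4):
    """Per-process size of the SIPES(max(1,MAXS), NPROCS) integer array, in GB."""
    return max(1, maxs) * nprocs * bytes_per_int / 1e9

# Worst case for the quoted example: one process owns the whole dof range,
# so MAXS equals the problem size.
print(sipes_gigabytes(45_000_000, 192))   # 34.56 GB per process
```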

Workaround:
- Matrix partition based on rows instead of elements of the mesh
  - Slightly worse LU fill-in (size of cofactors) than with the partition based on elements
- Change the ordering of the dofs given as input to MUMPS (to be done)

It may be the case that we are doing something completely wrong, ...


Some Issues with Memory: Large Number of RHS & Solution Vectors

Large number of RHS vectors:
- Analysis of a given problem under a large number of excitations. Examples:
  - Monostatic radar cross section (RCS) prediction
  - Large arrays of antennas

Present MUMPS interface:
- Treatment of RHS & solution vectors in blocks (typically 10-20 vectors at a time)
- Use of sparse format for the RHS
- Use of centralized solution vectors
- The reason behind the treatment of the RHS & solution vectors in blocks is to limit the memory needed to store the solution vectors
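The blocking itself is simple index bookkeeping; a sketch (with `rhs_blocks` as a hypothetical helper mirroring the 10-20 vector blocks above) makes the memory bound explicit: at most `block_size` centralized solution vectors are alive at any time.

```python
def rhs_blocks(n_rhs, block_size=16):
    """Yield (first, last) half-open index ranges over the RHS vectors, so
    that only block_size solution vectors are held centralized at once."""
    for first in range(0, n_rhs, block_size):
        yield first, min(first + block_size, n_rhs)
```

For example, `list(rhs_blocks(45, 20))` gives `[(0, 20), (20, 40), (40, 45)]`: three solves, never more than 20 solution vectors in memory.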


Some Issues with Memory (cont.): Large Number of RHS & Solution Vectors

Distributed solution planned for the near future:
- The centralized RHS is updated by FE-IIEE ⇒ use of a centralized solution is "natural" (easy in terms of code maintenance)
- Wish list: distributed RHS. Is the distributed RHS feature planned for near-future versions of MUMPS?


Specialized MUMPS Interfaces: Repetitive Solver

Vivaldi antenna

Antenna Array


Specialized MUMPS Interfaces (cont.): Repetitive Solver

Antenna Array


Specialized MUMPS Interfaces (cont.): Repetitive Solver

(a) Unit Cell (b) Unit Cell Mesh


Specialized MUMPS Interfaces (cont.): Repetitive Solver

Figure: Virtual Mesh of Antenna Array


Specialized MUMPS Interfaces (cont.): Repetitive Solver

Algorithm:
1. Computation of the Schur complement of the unit cell
2. Assembly of the Schur complements of all "virtual" cells ⇒ interface problem
3. Addition of boundary conditions to the interface problem
4. Solve the interface problem
5. Solve the interior unit-cell problems

- Identical matrices with different right-hand sides
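In standard notation (a sketch, with subscripts $i$ for the interior and $b$ for the boundary unknowns of the unit cell), step 1 computes the Schur complement

```latex
S = A_{bb} - A_{bi}\,A_{ii}^{-1}A_{ib} ,
```

steps 2-4 assemble the copies of $S$ (plus boundary conditions) into the interface system and solve it for $x_b$, and step 5 recovers each cell's interior unknowns from

```latex
x_i = A_{ii}^{-1}\left(b_i - A_{ib}\,x_b\right) .
```

Since every cell shares the same $A_{ii}$, one factorization serves all interior solves; only the right-hand sides differ, which is the remark above.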


Specialized MUMPS Interfaces (cont.): Repetitive Solver

Features and remarks:
- Advantages: savings in time and memory
- Under certain circumstances (number of cells equal to a power of 2 and no BC), all leaves of a given level of the tree are identical
  - Further savings in time
  - Large savings in memory
- Boundary conditions (BC) alter this one-branch tree behavior ⇒ BC may be left up to the root of the tree
- Alternatively, "algebraic symmetry" can be explored


Specialized MUMPS Interfaces (cont.): Repetitive Solver


Specialized MUMPS Interfaces: Hybrid Direct & Iterative Solver

Hybrid direct & iterative solver:
- Multifrontal algorithm on only a few levels
- Iterative solution from the last level of the multifrontal algorithm
- It can be understood as the direct solver acting as a preconditioner for the iterative solver
- Natural approach to some DDM strategies
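A minimal, self-contained sketch of that idea (a toy stand-in, not the code's actual implementation): the truncated direct factorization is represented by an approximate inverse `apply_Minv`, used as a preconditioner inside a stationary Richardson iteration.

```python
def preconditioned_richardson(A, b, apply_Minv, iters=50):
    """Solve A x = b iteratively; apply_Minv plays the role of the partial
    (few-level) multifrontal factorization acting as preconditioner."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        # residual r = b - A x
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        # "direct" part: apply the approximate factorization, z = M^{-1} r
        z = apply_Minv(r)
        x = [xi + zi for xi, zi in zip(x, z)]
    return x

# Toy usage: the diagonal of A stands in for the truncated factors.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = preconditioned_richardson(A, b, lambda r: [r[0] / 4.0, r[1] / 3.0])
```

The iteration converges whenever the spectral radius of I - M^{-1}A is below one; the more levels of the multifrontal tree kept in M, the closer M is to A and the faster the convergence.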


Specialized MUMPS Interfaces: Dealing with the Lack of Availability of LU Cofactors

Lack of availability of LU cofactors:
- Calls to multiple (typically sequential) MUMPS instances for the Schur complements
- Assembling the Schur complements
- Finalizing the MUMPS instances
- Solving the interface problem
- Creating new MUMPS instances to solve the interior problems


Specialized MUMPS Interfaces (cont.): Dealing with the Lack of Availability of LU Cofactors

MUMPS instances for interior problems (idea inspired by work led by Prof. Paszynski):
- Reproduction (or restoration) of the interior matrix equation and the interior right-hand side
- Call to multiple (typically sequential) MUMPS instances to factorize/solve the interior problems
- Use of Dirichlet conditions for the interface unknowns
- Preliminary tests show that the approach is worthwhile in memory (as expected) while remaining competitive in time


Regular MUMPS & MULTISOLVER: Time and Memory Comparison

[Figure: wall-clock time in seconds (0-1000) versus number of unknowns (1x10^6 to 6x10^6) for MULTISOLVER and MUMPS]


Regular MUMPS & MULTISOLVER (cont.): Time and Memory Comparison

[Figure: memory in GB (0-140) versus number of unknowns (1x10^6 to 6x10^6) for MULTISOLVER and MUMPS; system limit: 128 GB]


HPC Environment: Cluster of Xidian University (XDHPC)

Cluster of Xidian University:
- 140 compute nodes, each with:
  - Two twelve-core Intel Xeon 2690 v2 2.2 GHz CPUs
  - 64 GB of RAM
  - 1.8 TB of hard disk
- 56 Gbps InfiniBand network


Waveguide Problem: Low Pass Filter with Higher-Order Mode Suppression

- Band: 10-16 GHz
- Length: 218 mm
- 324.5 K tetrahedra
- 2.2 M unknowns
- Wall time: 73 min per frequency point

Arregui et al., "High-power low-pass harmonic filters with higher-order TEn0 and non-TEn0 mode suppression: design method and multipactor characterization," IEEE Trans. Microw. Theory Techn., Dec. 2013.


Waveguide Problem (cont.): Low Pass Filter with Higher-Order Mode Suppression


Waveguide Problem (cont.): Low Pass Filter with Higher-Order Mode Suppression


Scattering Problem: Bistatic RCS of Car

- Bistatic RCS at 1.5 GHz
- Tyres modeled as dielectric (εr = 4.0)
- Several incident plane waves from different directions


Scattering Problem (cont.): Bistatic RCS of Car

- 2.7 M tetrahedra
- 17.3 M unknowns
- Wall time: 59 min per frequency point (46 compute nodes)


Scattering Problem (cont.): Bistatic RCS of Car

Incident plane wave arriving from behind


Scattering Problem (cont.): Bistatic RCS of Car

Incident plane wave arriving from the front


Scattering Problem (cont.): Bistatic RCS of Car

Incident plane wave arriving from the front


Parallel Scalability: Factorization and FE-IIEE Stages

[Speedup graph corresponding to the factorization phase: factorization + solving speedup versus number of cores (120-480), for 12, 6, and 4 threads per process]

[Speedup graph corresponding to the mesh truncation phase: FE-IIEE speedup versus number of cores (120-480), for 12, 6, and 4 threads per process]

Benchmark: Bistatic RCS of Impala


Parallel Scalability: Whole Code

[Speedup graph corresponding to the whole code: factorization + solving + FE-IIEE speedup versus number of cores (120-480), for 12, 6, and 4 threads per process]


Work in Progress and Future Work

Work in progress:
- Hierarchical basis functions of variable order (p); h-adaptivity ⇒ support for hp meshes

Future work:
- Conformal and non-conformal DDM
- Hybrid (direct + iterative) solver


Thanks for your attention

Thanks to the MUMPS team

Page 14: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Parallel Flow Chart of the Code

MPI Division

FE-IIEE enabled

Yes

No

p0 p1 p2 pn ReadInput Data

p0 p1 p2 pn Create global mesh object

Numberingof DOFs

p1 p2 pn Deallocateglobal mesh

object

p0

Initializesolver

p0 p1 p2 pn Create local mesh object

p0 p1 p2 pn Extractboundary

mesh

p0 p1 p2 pn

p0 p1 p2 pn Calculate ampfill LHS matrix

p0 p1 p2 pn Matrixfactorization

p0 Calculate ampfill RHS matrix

Solve systemof equations

p0 p1 p2 pn

Yes

Yes

Calculatescatteringfield

p0 Updateglobal RHS

FE-IIEE enabled

p0 p1 p2 pn

More frequencies

Postprocess

No

No

Solve systemof equations

p0 p1 p2 pn

ConvIter achieved

Yes

No

GREMA-UC3M Higher Order Finite Element Code v12

Graphical User InterfaceFeatures

FeaturesGUI based on a general purpose pre- and post-processor called GiDhttpgidcimneupces

Creation (or importation) of the geometry model of the problemMesh generationAssignation of material properties and boundary conditionsVisualization of resultsIntegration with Posidona (in-house HPCaaS)

GREMA-UC3M Higher Order Finite Element Code v12

PosidoniaIn-house HPCaaS Solution

Easing the use of HPC platformsRemote job-submission to HPC infraestructuresDesigned with security user-friendliness collaborative-computing andmobility in mindManagement of all the communication with the remote computer system(file transfers )Interaction with its batch system (job scheduler)History repository of simulationsNotification when job submitted is completedTransparent downloading of the results to visualize them locallyPosidonia also available as stand-alone desktopAndroidWeb solution(also for general use with other simulator andor applications)

A Amor-Martin I Martinez-Fernandez L E Garcia-Castillo ldquoPosidonia A Tool for HPC andRemote Scientific Simulationsrdquo IEEE Antennas and Propagation Magazine 6166ndash177 Dec2015

GREMA-UC3M Higher Order Finite Element Code v12

GUI with HPCaaS InterfaceScreenshot

GREMA-UC3M Higher Order Finite Element Code v12

Getting Experience with MUMPSFrom ldquoMUMPS do it allrdquo to

Matrix FormatElementalAssembled (centralized on process 0)Assembled (distributed among processes)Asking to MUMPS for Schur complements and ldquoplayingrdquo with them(outside MUMPS)

RHS and solutionDense RHSSparse RHS (large number of RHS vectors)Centralized solutionDistributed solution (waiting for distributed RHS feature )

GREMA-UC3M Higher Order Finite Element Code v12

MUMPS InterfacePresent Stage of Development (version 502)

MUMPS initializationCall to ParMETIS (or PT-Scotch) to partition matrix among processors

I Other alternatives for partitioning have been considered due to memoryproblems (commented in following slides)

Computation of FEM matrix coefficients associated to each localprocessInput of matrix coefficients to MUMPS in distributed assembled formatCall to MUMPS for matrix factorizationComputation of FEM RHS coefficients on process 0 (in blocks of 10-20vectors) in sparse formatCall to MUMPS for system solve(FE-IIEE enabled) Iteratively update of RHS and system solve until errorcriterion is satisfiedMUMPS finalization

lowast Frequent use of out-of-core (OOC) capabilities of MUMPS

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with MemoryMemory Allocation inside MUMPS

Memory IssueA peak memory use during analysis phase has been detected(distributed assembled)Found out to be due to memory allocation inside MUMPS routinesrelated to maximum MAXS among processors

Listing 1 file zana aux parF

1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH1647 MAXS = ordLAST(I)-ordFIRST(I)+11653 ALLOCATE(SIPES(max(1MAXS) NPROCS))1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with Memory (cont)Memory Allocation inside MUMPS

Memory IssueExample 45000000 dof problem using 192 processes and 4 bytes perintegerMAXS bandwidth is 45000000rArr 3456 GB memory per process

WorkaroundMatrix partition based on rows instead of elements of the mesh

I Slightly worse LU fill-in (size of cofactors) than with partition based onelements

Change ordering of dof as input to MUMPS (to be done)

It may be the case we are doing something completely wrong

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with MemoryLarge Number of RHS amp Solution Vectors

Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples

Monostatic radar cross section (RCS) predictionLarge arrays of antennas

Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)

Use of sparse format for RHSUse of centralized solution vectors

I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors

Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesRepetitive Solver

Vivaldi antenna

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

(a) Unit Cell (b) Unit Cell Mesh

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Figure Virtual Mesh of Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface

problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems

I Identical matrices with different right hand sides

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical

I Further saving in timeI Large saving in memory

Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver

Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski

Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVERTime and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0 0

4 0 0

6 0 0

8 0 0

1 0 0 0Se

cond

s

u n k n o w n s

M U L T I S O L V E R M U M P S

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0

4 0

6 0

8 0

1 0 0

1 2 0

1 4 0Me

mory

in GB

u n k n o w n s

M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 15: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Graphical User InterfaceFeatures

FeaturesGUI based on a general purpose pre- and post-processor called GiDhttpgidcimneupces

Creation (or importation) of the geometry model of the problemMesh generationAssignation of material properties and boundary conditionsVisualization of resultsIntegration with Posidona (in-house HPCaaS)

GREMA-UC3M Higher Order Finite Element Code v12

PosidoniaIn-house HPCaaS Solution

Easing the use of HPC platformsRemote job-submission to HPC infraestructuresDesigned with security user-friendliness collaborative-computing andmobility in mindManagement of all the communication with the remote computer system(file transfers )Interaction with its batch system (job scheduler)History repository of simulationsNotification when job submitted is completedTransparent downloading of the results to visualize them locallyPosidonia also available as stand-alone desktopAndroidWeb solution(also for general use with other simulator andor applications)

A Amor-Martin I Martinez-Fernandez L E Garcia-Castillo ldquoPosidonia A Tool for HPC andRemote Scientific Simulationsrdquo IEEE Antennas and Propagation Magazine 6166ndash177 Dec2015

GREMA-UC3M Higher Order Finite Element Code v12

GUI with HPCaaS InterfaceScreenshot

GREMA-UC3M Higher Order Finite Element Code v12

Getting Experience with MUMPSFrom ldquoMUMPS do it allrdquo to

Matrix FormatElementalAssembled (centralized on process 0)Assembled (distributed among processes)Asking to MUMPS for Schur complements and ldquoplayingrdquo with them(outside MUMPS)

RHS and solutionDense RHSSparse RHS (large number of RHS vectors)Centralized solutionDistributed solution (waiting for distributed RHS feature )

GREMA-UC3M Higher Order Finite Element Code v12

MUMPS InterfacePresent Stage of Development (version 502)

MUMPS initializationCall to ParMETIS (or PT-Scotch) to partition matrix among processors

I Other alternatives for partitioning have been considered due to memoryproblems (commented in following slides)

Computation of FEM matrix coefficients associated to each localprocessInput of matrix coefficients to MUMPS in distributed assembled formatCall to MUMPS for matrix factorizationComputation of FEM RHS coefficients on process 0 (in blocks of 10-20vectors) in sparse formatCall to MUMPS for system solve(FE-IIEE enabled) Iteratively update of RHS and system solve until errorcriterion is satisfiedMUMPS finalization

lowast Frequent use of out-of-core (OOC) capabilities of MUMPS

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with MemoryMemory Allocation inside MUMPS

Memory IssueA peak memory use during analysis phase has been detected(distributed assembled)Found out to be due to memory allocation inside MUMPS routinesrelated to maximum MAXS among processors

Listing 1 file zana aux parF

1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH1647 MAXS = ordLAST(I)-ordFIRST(I)+11653 ALLOCATE(SIPES(max(1MAXS) NPROCS))1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with Memory (cont)Memory Allocation inside MUMPS

Memory IssueExample 45000000 dof problem using 192 processes and 4 bytes perintegerMAXS bandwidth is 45000000rArr 3456 GB memory per process

WorkaroundMatrix partition based on rows instead of elements of the mesh

I Slightly worse LU fill-in (size of cofactors) than with partition based onelements

Change ordering of dof as input to MUMPS (to be done)

It may be the case we are doing something completely wrong

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with MemoryLarge Number of RHS amp Solution Vectors

Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples

Monostatic radar cross section (RCS) predictionLarge arrays of antennas

Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)

Use of sparse format for RHSUse of centralized solution vectors

I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors

Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesRepetitive Solver

Vivaldi antenna

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

(a) Unit Cell (b) Unit Cell Mesh

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Figure Virtual Mesh of Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface

problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems

I Identical matrices with different right hand sides

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical

I Further saving in timeI Large saving in memory

Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver

Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski

Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVERTime and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0 0

4 0 0

6 0 0

8 0 0

1 0 0 0Se

cond

s

u n k n o w n s

M U L T I S O L V E R M U M P S

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0

4 0

6 0

8 0

1 0 0

1 2 0

1 4 0Me

mory

in GB

u n k n o w n s

M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont.): Bistatic RCS of Car

[Figure: incident plane wave arriving from behind]
[Figures (two slides): incident plane wave arriving from the front]

Parallel Scalability: Factorization and FE-IIEE Stages

[Figure: factorization + solving speedup vs. number of cores (120 to 480) for 12, 6 and 4 threads per process; labeled speedups of 49, 52, 57 and 70]
[Figure: FE-IIEE (mesh truncation) speedup vs. number of cores (120 to 480) for 12, 6 and 4 threads per process; labeled speedups of 78, 85 and 100]

Benchmark: bistatic RCS of an Impala

Parallel Scalability: Whole Code

[Figure: factorization + solving + FE-IIEE speedup vs. number of cores (120 to 480) for 12, 6 and 4 threads per process; labeled speedups of 62, 64, 70 and 80]

Work in Progress and Future Work

Work in progress:
- Hierarchical basis functions of variable order; p- and h-adaptivity ⇒ support for hp meshes

Future work:
- Conformal and non-conformal DDM
- Hybrid (direct + iterative) solver

Thanks for your attention

Thanks to the MUMPS team


Posidonia: In-house HPCaaS Solution

Easing the use of HPC platforms:
- Remote job submission to HPC infrastructures
- Designed with security, user-friendliness, collaborative computing and mobility in mind
- Manages all communication with the remote computer system (file transfers, ...)
- Interacts with its batch system (job scheduler)
- History repository of simulations
- Notification when a submitted job completes
- Transparent download of the results to visualize them locally
- Posidonia is also available as a stand-alone desktop/Android/Web solution (also for general use with other simulators and/or applications)

A. Amor-Martin, I. Martinez-Fernandez, L. E. Garcia-Castillo, "Posidonia: A Tool for HPC and Remote Scientific Simulations," IEEE Antennas and Propagation Magazine, (6):166-177, Dec. 2015.

GUI with HPCaaS Interface: Screenshot

[Figure: GUI screenshot]

Getting Experience with MUMPS: From "MUMPS do it all" to ...

Matrix format:
- Elemental
- Assembled (centralized on process 0)
- Assembled (distributed among processes)
- Asking MUMPS for Schur complements and "playing" with them (outside MUMPS)

RHS and solution:
- Dense RHS
- Sparse RHS (large number of RHS vectors)
- Centralized solution
- Distributed solution (waiting for a distributed RHS feature ...)

MUMPS Interface: Present Stage of Development (version 5.0.2)

- MUMPS initialization
- Call to ParMETIS (or PT-Scotch) to partition the matrix among processors
  - Other alternatives for partitioning have been considered due to memory problems (commented on in the following slides)
- Computation of the FEM matrix coefficients associated with each local process
- Input of the matrix coefficients to MUMPS in distributed assembled format
- Call to MUMPS for matrix factorization
- Computation of the FEM RHS coefficients on process 0 (in blocks of 10-20 vectors) in sparse format
- Call to MUMPS for system solve
- (FE-IIEE enabled) Iterative update of the RHS and system solve until the error criterion is satisfied
- MUMPS finalization

(*) Frequent use of the out-of-core (OOC) capabilities of MUMPS
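With FE-IIEE enabled, the sequence above boils down to: factorize once, then repeatedly update the RHS and re-solve with the same factors. A minimal sketch of that loop, using SciPy's `splu` as a stand-in for MUMPS; the operator `C` is a made-up placeholder for the FE-IIEE boundary update (chosen small so the fixed-point iteration converges), not the actual method.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 300
A = sp.diags([-np.ones(n - 1), 4.0 * np.ones(n), -np.ones(n - 1)],
             [-1, 0, 1], format="csc")     # stand-in FEM system matrix
b0 = np.ones(n)                            # initial RHS
C = 0.05 * sp.eye(n, format="csc")         # hypothetical boundary-update operator

lu = spla.splu(A)                          # analysis + factorization (done once)

x = lu.solve(b0)                           # first solve
for it in range(50):                       # FE-IIEE-style outer loop
    x_new = lu.solve(b0 + C @ x)           # update RHS, re-use the factorization
    if np.linalg.norm(x_new - x) < 1e-12 * np.linalg.norm(x_new):
        x = x_new
        break                              # error criterion satisfied
    x = x_new

print(it + 1)                              # outer iterations until convergence
```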


Some Issues with Memory: Memory Allocation inside MUMPS

Memory Issue:
- A peak in memory use during the analysis phase has been detected (distributed assembled input)
- Found to be due to a memory allocation inside MUMPS routines, related to the maximum MAXS among processors

Listing 1: file zana_aux_par.F

    1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
    1647   MAXS = ordLAST(I)-ordFIRST(I)+1
    1653   ALLOCATE(SIPES(max(1,MAXS), NPROCS))
    1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH

Some Issues with Memory (cont.): Memory Allocation inside MUMPS

Memory Issue, example: a 45,000,000-dof problem using 192 processes and 4 bytes per integer; MAXS (the bandwidth) is 45,000,000 ⇒ 34.56 GB of memory per process.
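The figure follows directly from the SIPES allocation in Listing 1; a quick sanity check of the arithmetic, using only the numbers quoted on the slide:

```python
# Back-of-the-envelope size of SIPES(max(1,MAXS), NPROCS) from Listing 1,
# with MAXS = 45,000,000, NPROCS = 192 and 4-byte integers.

def sipes_bytes(maxs: int, nprocs: int, bytes_per_int: int = 4) -> int:
    """Bytes of the SIPES work array allocated per process."""
    return max(1, maxs) * nprocs * bytes_per_int

gb = sipes_bytes(45_000_000, 192) / 1e9
print(f"{gb:.2f} GB per process")  # 34.56 GB per process
```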

Workaround: matrix partition based on rows instead of elements of the mesh
- Slightly worse LU fill-in (size of the cofactors) than with the element-based partition
- Change the ordering of the dofs given as input to MUMPS (to be done)

It may be the case that we are doing something completely wrong ...

Some Issues with Memory: Large Number of RHS & Solution Vectors

Large number of RHS vectors: analysis of a given problem under a large number of excitations. Examples:
- Monostatic radar cross section (RCS) prediction
- Large arrays of antennas

Present MUMPS interface: treatment of RHS & solution vectors in blocks (typically 10-20 vectors at a time)
- Use of sparse format for the RHS
- Use of centralized solution vectors
- The reason for treating the RHS & solution vectors in blocks is to limit the memory needed to store the solution vectors
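A sketch of the blocked treatment, with SciPy's dense LU as a stand-in for the MUMPS factorization (the matrix, sizes and block length are invented for illustration): factorize once, then stream the right-hand sides through in small blocks so that only one block of solution vectors is resident at a time.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n, n_rhs, block = 200, 100, 16

A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
B = rng.standard_normal((n, n_rhs))               # many excitations (RHS vectors)

lu = lu_factor(A)                                 # done once (MUMPS: factorization)

X = np.empty_like(B)
for j in range(0, n_rhs, block):                  # MUMPS: repeated solve phase
    X[:, j:j + block] = lu_solve(lu, B[:, j:j + block])

print(np.allclose(A @ X, B))                      # True
```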


Some Issues with Memory (cont.): Large Number of RHS & Solution Vectors

Distributed solution planned for the near future:
- The centralized RHS is updated by FE-IIEE ⇒ use of a centralized solution is "natural" (easy in terms of code maintenance)
- Wish list: distributed RHS. Is the distributed RHS feature planned for near-future versions of MUMPS?

Specialized MUMPS Interfaces: Repetitive Solver

[Figures: Vivaldi antenna; antenna array; (a) unit cell and (b) unit-cell mesh; virtual mesh of the antenna array]

Specialized MUMPS Interfaces (cont.): Repetitive Solver

Algorithm:
1. Computation of the Schur complement of the unit cell
2. Assembly of the Schur complements of all "virtual" cells ⇒ interface problem
3. Addition of boundary conditions to the interface problem
4. Solve the interface problem
5. Solve the interior unit-cell problems

Identical matrices with different right-hand sides.
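A toy NumPy rendition of steps 1-5 on a 1-D chain of identical cells (the cell matrix, loads and sizes are invented for illustration): one Schur complement is computed for the unit cell, assembled once per "virtual" cell into the interface problem, Dirichlet conditions close the interface system, and the interiors are recovered with the same interior matrix and different right-hand sides. The result is checked against a direct solve of the fully assembled system.

```python
import numpy as np

def cell_matrix(m):
    """(m+2)x(m+2) 1-D Laplacian of one cell: m interior + 2 interface nodes."""
    n = m + 2
    K = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    K[0, 0] = K[-1, -1] = 1.0
    return K

N, m = 4, 3                        # 4 identical cells, 3 interior nodes each
K = cell_matrix(m)
idx_i = np.arange(1, m + 1)        # interior dofs of a cell
idx_b = np.array([0, m + 1])       # interface dofs (left, right)
Aii, Aib = K[np.ix_(idx_i, idx_i)], K[np.ix_(idx_i, idx_b)]
Abi, Abb = K[np.ix_(idx_b, idx_i)], K[np.ix_(idx_b, idx_b)]

# Step 1: Schur complement of the unit cell (computed once)
S = Abb - Abi @ np.linalg.solve(Aii, Aib)
f_i = np.ones(m)
f_b = np.full(2, 0.5)              # interface loads split between neighbours
g_b = f_b - Abi @ np.linalg.solve(Aii, f_i)

# Step 2: assemble identical Schur complements into the interface problem
S_glob, g_glob = np.zeros((N + 1, N + 1)), np.zeros(N + 1)
for k in range(N):
    S_glob[k:k + 2, k:k + 2] += S
    g_glob[k:k + 2] += g_b

# Steps 3-4: Dirichlet BC (u = 0 at both ends), then solve the interface problem
free = np.arange(1, N)
u_if = np.zeros(N + 1)
u_if[free] = np.linalg.solve(S_glob[np.ix_(free, free)], g_glob[free])

# Step 5: recover interiors cell by cell (identical matrix, different RHS)
u = np.zeros(N * (m + 1) + 1)
u[::m + 1] = u_if                  # interface values at every (m+1)-th node
for k in range(N):
    ub = u_if[k:k + 2]
    u[k * (m + 1) + 1:k * (m + 1) + m + 1] = np.linalg.solve(Aii, f_i - Aib @ ub)

# Check against a direct solve of the fully assembled system
T = N * (m + 1) + 1
A = 2.0 * np.eye(T) - np.eye(T, k=1) - np.eye(T, k=-1)
inner = np.arange(1, T - 1)
u_ref = np.zeros(T)
u_ref[inner] = np.linalg.solve(A[np.ix_(inner, inner)], np.ones(T)[inner])
print(np.allclose(u, u_ref))       # True
```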


Specialized MUMPS Interfaces (cont.): Repetitive Solver

Features and remarks:
- Advantages: savings in time and memory
- Under certain circumstances (number of cells equal to a power of 2 and no BC), all leaves at a given level of the tree are identical
  - Further saving in time
  - Large saving in memory
- Boundary conditions (BC) alter this one-branch-tree behavior ⇒ BC may be left up to the root of the tree, or "algebraic symmetry" can be explored

Specialized MUMPS Interfaces: Hybrid Direct & Iterative Solver

- Multifrontal algorithm on only a few levels
- Iterative solution from the last level of the multifrontal algorithm
- Can be understood as the direct solver acting as a preconditioner for the iterative solver
- A natural approach to some DDM strategies
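The "direct solver acting as preconditioner" idea can be illustrated with a SciPy stand-in. This is a block-Jacobi analogue on an invented 1-D Laplacian, not the multifrontal-tree variant described above: each diagonal block, playing the role of a subdomain, is factorized exactly (direct part), and those factorizations precondition a conjugate-gradient iteration on the full system (iterative part).

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n, nblocks = 400, 4
A = sp.diags([-np.ones(n - 1), 2.0 * np.ones(n), -np.ones(n - 1)],
             [-1, 0, 1], format="csc")          # SPD 1-D Laplacian (toy system)

# Direct part: exact sparse LU of each diagonal block (a "subdomain")
bs = n // nblocks
lus = [spla.splu(A[i:i + bs, i:i + bs].tocsc()) for i in range(0, n, bs)]

def block_jacobi(r):
    """Apply the block-diagonal direct solves as a preconditioner."""
    z = np.empty_like(r)
    for k, lu in enumerate(lus):
        z[k * bs:(k + 1) * bs] = lu.solve(r[k * bs:(k + 1) * bs])
    return z

def pcg(A, b, apply_M, tol=1e-10, maxiter=2000):
    """Preconditioned conjugate gradients; returns (x, iteration count)."""
    x = np.zeros_like(b)
    r = b.copy()
    z = apply_M(r)
    p = z.copy()
    rz = r @ z
    for it in range(1, maxiter + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, it
        z = apply_M(r)
        rz, rz_old = r @ z, rz
        p = z + (rz / rz_old) * p
    return x, maxiter

b = np.ones(n)
x_pc, it_pc = pcg(A, b, block_jacobi)           # direct blocks as preconditioner
x_id, it_id = pcg(A, b, lambda r: r.copy())     # no preconditioner, for reference
print(it_pc, it_id)                             # block-Jacobi needs far fewer iterations
```

With non-overlapping blocks the preconditioned operator differs from the identity by a low-rank coupling term, so CG converges in a handful of iterations here.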


Specialized MUMPS Interfaces: Dealing with the Lack of Availability of LU Cofactors

Lack of availability of LU cofactors:
- Calls to multiple (typically sequential) MUMPS instances for the Schur complements
- Assembly of the Schur complements
- Finalization of the MUMPS instances
- Solve of the interface problem
- Creation of new MUMPS instances to solve the interior problems

Specialized MUMPS Interfaces (cont.): Dealing with the Lack of Availability of LU Cofactors

MUMPS instances for the interior problems (idea inspired by work led by Prof. Paszynski):
- Reproduction (or restoration) of the interior matrix equation and the interior right-hand side
- Calls to multiple (typically sequential) MUMPS instances to factorize/solve the interior problems
- Use of Dirichlet conditions for the interface unknowns
- Preliminary tests show that the approach pays off in memory (as expected) while remaining competitive in time

Regular MUMPS amp MULTISOLVERTime and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0 0

4 0 0

6 0 0

8 0 0

1 0 0 0Se

cond

s

u n k n o w n s

M U L T I S O L V E R M U M P S

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0

4 0

6 0

8 0

1 0 0

1 2 0

1 4 0Me

mory

in GB

u n k n o w n s

M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 17: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

GUI with HPCaaS InterfaceScreenshot

GREMA-UC3M Higher Order Finite Element Code v12

Getting Experience with MUMPSFrom ldquoMUMPS do it allrdquo to

Matrix FormatElementalAssembled (centralized on process 0)Assembled (distributed among processes)Asking to MUMPS for Schur complements and ldquoplayingrdquo with them(outside MUMPS)

RHS and solutionDense RHSSparse RHS (large number of RHS vectors)Centralized solutionDistributed solution (waiting for distributed RHS feature )

GREMA-UC3M Higher Order Finite Element Code v12

MUMPS InterfacePresent Stage of Development (version 502)

MUMPS initializationCall to ParMETIS (or PT-Scotch) to partition matrix among processors

I Other alternatives for partitioning have been considered due to memoryproblems (commented in following slides)

Computation of FEM matrix coefficients associated to each localprocessInput of matrix coefficients to MUMPS in distributed assembled formatCall to MUMPS for matrix factorizationComputation of FEM RHS coefficients on process 0 (in blocks of 10-20vectors) in sparse formatCall to MUMPS for system solve(FE-IIEE enabled) Iteratively update of RHS and system solve until errorcriterion is satisfiedMUMPS finalization

lowast Frequent use of out-of-core (OOC) capabilities of MUMPS

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with MemoryMemory Allocation inside MUMPS

Memory IssueA peak memory use during analysis phase has been detected(distributed assembled)Found out to be due to memory allocation inside MUMPS routinesrelated to maximum MAXS among processors

Listing 1 file zana aux parF

1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH1647 MAXS = ordLAST(I)-ordFIRST(I)+11653 ALLOCATE(SIPES(max(1MAXS) NPROCS))1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with Memory (cont)Memory Allocation inside MUMPS

Memory IssueExample 45000000 dof problem using 192 processes and 4 bytes perintegerMAXS bandwidth is 45000000rArr 3456 GB memory per process

WorkaroundMatrix partition based on rows instead of elements of the mesh

I Slightly worse LU fill-in (size of cofactors) than with partition based onelements

Change ordering of dof as input to MUMPS (to be done)

It may be the case we are doing something completely wrong

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with MemoryLarge Number of RHS amp Solution Vectors

Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples

Monostatic radar cross section (RCS) predictionLarge arrays of antennas

Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)

Use of sparse format for RHSUse of centralized solution vectors

I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors

Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesRepetitive Solver

Vivaldi antenna

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

(a) Unit Cell (b) Unit Cell Mesh

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Figure Virtual Mesh of Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface

problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems

I Identical matrices with different right hand sides

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical

I Further saving in timeI Large saving in memory

Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver

Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski

Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVERTime and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0 0

4 0 0

6 0 0

8 0 0

1 0 0 0Se

cond

s

u n k n o w n s

M U L T I S O L V E R M U M P S

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0

4 0

6 0

8 0

1 0 0

1 2 0

1 4 0Me

mory

in GB

u n k n o w n s

M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 18: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Getting Experience with MUMPSFrom ldquoMUMPS do it allrdquo to

Matrix FormatElementalAssembled (centralized on process 0)Assembled (distributed among processes)Asking to MUMPS for Schur complements and ldquoplayingrdquo with them(outside MUMPS)

RHS and solutionDense RHSSparse RHS (large number of RHS vectors)Centralized solutionDistributed solution (waiting for distributed RHS feature )

GREMA-UC3M Higher Order Finite Element Code v12

MUMPS InterfacePresent Stage of Development (version 502)

MUMPS initializationCall to ParMETIS (or PT-Scotch) to partition matrix among processors

I Other alternatives for partitioning have been considered due to memoryproblems (commented in following slides)

Computation of FEM matrix coefficients associated to each localprocessInput of matrix coefficients to MUMPS in distributed assembled formatCall to MUMPS for matrix factorizationComputation of FEM RHS coefficients on process 0 (in blocks of 10-20vectors) in sparse formatCall to MUMPS for system solve(FE-IIEE enabled) Iteratively update of RHS and system solve until errorcriterion is satisfiedMUMPS finalization

lowast Frequent use of out-of-core (OOC) capabilities of MUMPS

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with MemoryMemory Allocation inside MUMPS

Memory IssueA peak memory use during analysis phase has been detected(distributed assembled)Found out to be due to memory allocation inside MUMPS routinesrelated to maximum MAXS among processors

Listing 1 file zana aux parF

1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH1647 MAXS = ordLAST(I)-ordFIRST(I)+11653 ALLOCATE(SIPES(max(1MAXS) NPROCS))1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with Memory (cont)Memory Allocation inside MUMPS

Memory IssueExample 45000000 dof problem using 192 processes and 4 bytes perintegerMAXS bandwidth is 45000000rArr 3456 GB memory per process

WorkaroundMatrix partition based on rows instead of elements of the mesh

I Slightly worse LU fill-in (size of cofactors) than with partition based onelements

Change ordering of dof as input to MUMPS (to be done)

It may be the case we are doing something completely wrong

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with MemoryLarge Number of RHS amp Solution Vectors

Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples

Monostatic radar cross section (RCS) predictionLarge arrays of antennas

Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)

Use of sparse format for RHSUse of centralized solution vectors

I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors

Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesRepetitive Solver

Vivaldi antenna

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

(a) Unit Cell (b) Unit Cell Mesh

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Figure Virtual Mesh of Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface

problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems

I Identical matrices with different right hand sides

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical

I Further saving in timeI Large saving in memory

Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver

Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski

Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVERTime and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0 0

4 0 0

6 0 0

8 0 0

1 0 0 0Se

cond

s

u n k n o w n s

M U L T I S O L V E R M U M P S

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0

4 0

6 0

8 0

1 0 0

1 2 0

1 4 0Me

mory

in GB

u n k n o w n s

M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel Scalability: Factorization and FE-IIEE Stages

[Figure: factorization + solving speedup vs. number of cores (120-480), for 12, 6, and 4 threads per process]
[Figure: FE-IIEE (mesh truncation) speedup vs. number of cores (120-480), for 12, 6, and 4 threads per process]

Benchmark: Bistatic RCS of Impala

Parallel Scalability: Whole Code

[Figure: factorization + solving + FE-IIEE speedup of the whole code vs. number of cores (120-480), for 12, 6, and 4 threads per process]

Work in Progress and Future Work

Work in progress:
- Hierarchical basis functions of variable order p
- h-adaptivity ⇒ support for hp meshes

Future work:
- Conformal and non-conformal DDM
- Hybrid (direct + iterative) solver

Thanks for your attention.
Thanks to the MUMPS team.


MUMPS Interface: Present Stage of Development (version 5.0.2)

- MUMPS initialization
- Call to ParMETIS (or PT-Scotch) to partition the matrix among processors
  - Other alternatives for partitioning have been considered due to memory problems (commented on in the following slides)
- Computation of the FEM matrix coefficients associated with each local process
- Input of matrix coefficients to MUMPS in distributed assembled format
- Call to MUMPS for matrix factorization
- Computation of FEM RHS coefficients on process 0 (in blocks of 10-20 vectors) in sparse format
- Call to MUMPS for system solve
- (FE-IIEE enabled) Iterative update of the RHS and system solve until the error criterion is satisfied
- MUMPS finalization

* Frequent use of the out-of-core (OOC) capabilities of MUMPS
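The solve loop above can be sketched with a small stand-in for MUMPS: the matrix is factorized once, and FE-IIEE then repeatedly rebuilds the RHS from the current solution and re-solves with the stored factors until the solution stops changing. The NumPy sketch below is illustrative only, not the authors' code: the matrix `A`, the coupling operator `C`, and the tolerance are invented, and `np.linalg.solve` stands in for the reuse of the MUMPS factorization.

```python
import numpy as np

# Toy stand-in for the FEM system: a diagonally dominant matrix A and a
# base RHS b0. In the real code, A is factorized once by MUMPS.
n = 50
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b0 = np.ones(n)

# Invented "mesh truncation" coupling: FE-IIEE recomputes the RHS from the
# current solution; here a mild contraction so the loop provably converges.
C = 0.2 * np.eye(n)

x = np.linalg.solve(A, b0)           # initial solve (factors reused in MUMPS)
for it in range(50):
    b = b0 + C @ x                   # FE-IIEE-style RHS update
    x_new = np.linalg.solve(A, b)    # re-solve with the same factorization
    if np.linalg.norm(x_new - x) < 1e-10 * np.linalg.norm(x_new):
        x = x_new
        break
    x = x_new
```

At the fixed point the solution satisfies A x = b0 + C x, i.e. the system with the self-consistently updated right-hand side.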

Some Issues with Memory: Memory Allocation inside MUMPS

Memory Issue
- A peak in memory use during the analysis phase has been detected (distributed assembled input)
- Found to be due to a memory allocation inside MUMPS routines, related to the maximum MAXS among processors

Listing 1: file zana_aux_par.F

1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH
1647 MAXS = ord%LAST(I)-ord%FIRST(I)+1
1653 ALLOCATE(SIPES(max(1,MAXS), NPROCS))
1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH

Some Issues with Memory (cont.): Memory Allocation inside MUMPS

Memory Issue
- Example: 45,000,000-dof problem using 192 processes and 4 bytes per integer
- MAXS bandwidth is 45,000,000 ⇒ 34.56 GB of memory per process

Workaround
- Matrix partition based on rows instead of elements of the mesh
  - Slightly worse LU fill-in (size of cofactors) than with the partition based on elements
- Change the ordering of dof given as input to MUMPS (to be done)
- It may be the case that we are doing something completely wrong
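The per-process figure follows directly from the allocation in Listing 1: SIPES holds MAXS × NPROCS default integers on every process. A quick back-of-the-envelope check in plain Python, using the slide's numbers:

```python
# Size of ALLOCATE(SIPES(max(1,MAXS), NPROCS)) on one process, with the
# slide's numbers: MAXS spanning all 45,000,000 dof, 192 MPI processes,
# 4-byte default integers.
maxs = 45_000_000
nprocs = 192
bytes_per_integer = 4

sipes_bytes = maxs * nprocs * bytes_per_integer
print(sipes_bytes / 1e9, "GB per process")   # 34.56 GB per process
```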

Some Issues with Memory: Large Number of RHS & Solution Vectors

Large Number of RHS Vectors
- Analysis of a given problem under a large number of excitations. Examples:
  - Monostatic radar cross section (RCS) prediction
  - Large arrays of antennas

Present MUMPS Interface
- Treatment of RHS & solution vectors in blocks (typically 10-20 vectors at a time)
- Use of sparse format for the RHS
- Use of centralized solution vectors
  - The reason behind the treatment of RHS & solution vectors in blocks is to limit the memory needed to store the solution vectors

Some Issues with Memory (cont.): Large Number of RHS & Solution Vectors

Distributed Solution Planned for Near Future?
- Update of the centralized RHS by FE-IIEE ⇒ use of the centralized solution is "natural" (easy in terms of code maintenance)
- Wish list: distributed RHS
- Is the distributed RHS feature planned for near-future versions of MUMPS?

Specialized MUMPS Interfaces: Repetitive Solver

[Figures: Vivaldi antenna; antenna array; (a) unit cell and (b) unit cell mesh; virtual mesh of antenna array]

Specialized MUMPS Interfaces (cont.): Repetitive Solver

Algorithm
1. Computation of the Schur complement of the unit cell
2. Assembly of the Schur complements of all "virtual" cells ⇒ interface problem
3. Addition of boundary conditions to the interface problem
4. Solve the interface problem
5. Solve the interior unit-cell problems

- Identical matrices with different right-hand sides
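Steps 1-5 can be condensed into a small worked example. Below, a 1-D Laplacian chain of identical cells plays the role of the array (an invented model problem, not the talk's antenna matrices): the interface unknowns couple neighbouring cells, the interior block is eliminated through its Schur complement, the condensed interface problem is solved, and the interiors are recovered. Homogeneous Dirichlet ends stand in for step 3's boundary conditions.

```python
import numpy as np

n_cells, m = 6, 8                       # cells and interior dof per cell
n = n_cells * m + (n_cells - 1)         # total dof, interfaces between cells
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1-D Laplacian
f = np.ones(n)

# Interface dof sit between consecutive cells.
idx_b = np.array([c * (m + 1) + m for c in range(n_cells - 1)])
idx_i = np.setdiff1d(np.arange(n), idx_b)

A_II = A[np.ix_(idx_i, idx_i)]          # block diagonal: one block per cell
A_IB = A[np.ix_(idx_i, idx_b)]
A_BI = A[np.ix_(idx_b, idx_i)]
A_BB = A[np.ix_(idx_b, idx_b)]

# Steps 1-2: cell Schur complements, assembled into the interface problem.
Y = np.linalg.solve(A_II, np.column_stack([A_IB, f[idx_i]]))
S = A_BB - A_BI @ Y[:, :-1]             # assembled Schur complement
g = f[idx_b] - A_BI @ Y[:, -1]          # condensed RHS
# Steps 3-4: Dirichlet ends are built into A; solve the interface problem.
x_b = np.linalg.solve(S, g)
# Step 5: recover the interiors (cell by cell in the real solver).
x_i = np.linalg.solve(A_II, f[idx_i] - A_IB @ x_b)

x = np.empty(n)
x[idx_b], x[idx_i] = x_b, x_i
```

The same x is obtained from a direct solve of the full system; the gain in the repetitive solver is that every diagonal block of A_II is the same unit-cell matrix, so it is factorized only once.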

Specialized MUMPS Interfaces (cont.): Repetitive Solver

Features and Remarks
- Advantages: savings in time and memory
- Under certain circumstances (number of cells equal to a power of 2, and no BC), all leaves at a given level of the tree are identical
  - Further saving in time
  - Large saving in memory
- Boundary conditions (BC) alter this one-branch-tree behavior
  - ⇒ BC may be left up to the root of the tree
  - Or "algebraic symmetry" can be explored

Specialized MUMPS Interfaces (cont.): Repetitive Solver [figure]

Specialized MUMPS Interfaces: Hybrid Direct & Iterative Solver

- Multifrontal algorithm on only a few levels
- Iterative solution from the last level of the multifrontal algorithm
- It can be understood as the direct solver acting as a preconditioner for the iterative solver
- Natural approach to some DDM strategies
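A minimal sketch of that idea, with invented operators: stop the elimination early by factorizing only the block-diagonal (subdomain) part of the matrix directly, and use those direct solves as the preconditioner of a simple Richardson iteration on the full problem. In MUMPS terms the direct part would be the partial multifrontal factorization; here NumPy dense solves stand in, and the block size and iteration scheme are made up for the demo.

```python
import numpy as np

n, nb = 90, 30                               # problem size, direct-block size
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)

# "Few levels of the tree": factorize only the diagonal blocks directly;
# the coupling between blocks is left to the outer iteration.
blocks = [slice(s, s + nb) for s in range(0, n, nb)]

def apply_precond(r):
    """Direct solves on the subdomain blocks = the direct part of the hybrid."""
    z = np.zeros_like(r)
    for blk in blocks:
        z[blk] = np.linalg.solve(A[blk, blk], r[blk])
    return z

x = np.zeros(n)
for it in range(200):                        # preconditioned Richardson
    r = b - A @ x
    if np.linalg.norm(r) < 1e-10 * np.linalg.norm(b):
        break
    x += apply_precond(r)
```

Because the dropped inter-block coupling is weak relative to the diagonally dominant blocks, the stationary iteration contracts; a Krylov method (CG, GMRES) would normally replace Richardson in practice.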

Specialized MUMPS Interfaces: Dealing with the Lack of Availability of LU Cofactors

- Calls to multiple (typically sequential) MUMPS instances for the Schur complements
- Assembly of the Schur complements
- Finalization of the MUMPS instances
- Solve of the interface problem
- Creation of new MUMPS instances to solve the interior problems

Specialized MUMPS Interfaces (cont.): Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problems
- Idea inspired by work led by Prof. Paszynski
- Reproduction (or restoring) of the interior matrix equation and interior right-hand side
- Calls to multiple (typically sequential) MUMPS instances to factorize/solve the interior problems
- Use of Dirichlet conditions for the interface unknowns
- Preliminary tests show that the approach is worthwhile in memory (as expected) and competitive in time

Regular MUMPS & MULTISOLVER: Time and Memory Comparison

[Figure: solve time in seconds vs. unknowns (1x10^6 to 6x10^6) for MULTISOLVER and MUMPS]

Regular MUMPS & MULTISOLVER (cont.): Time and Memory Comparison

[Figure: memory in GB vs. unknowns (1x10^6 to 6x10^6) for MULTISOLVER and MUMPS; system limit 128 GB]

HPC Environment: Cluster of Xidian University (XDHPC)

- 140 compute nodes
  - Two twelve-core Intel Xeon 2690 V2 2.2 GHz CPUs
  - 64 GB of RAM
  - 1.8 TB of hard disk
- 56 Gbps InfiniBand network

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 20: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Some Issues with MemoryMemory Allocation inside MUMPS

Memory IssueA peak memory use during analysis phase has been detected(distributed assembled)Found out to be due to memory allocation inside MUMPS routinesrelated to maximum MAXS among processors

Listing 1 file zana aux parF

1589 SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH1647 MAXS = ordLAST(I)-ordFIRST(I)+11653 ALLOCATE(SIPES(max(1MAXS) NPROCS))1864 END SUBROUTINE ZMUMPS_BUILD_LOC_GRAPH

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with Memory (cont)Memory Allocation inside MUMPS

Memory IssueExample 45000000 dof problem using 192 processes and 4 bytes perintegerMAXS bandwidth is 45000000rArr 3456 GB memory per process

WorkaroundMatrix partition based on rows instead of elements of the mesh

I Slightly worse LU fill-in (size of cofactors) than with partition based onelements

Change ordering of dof as input to MUMPS (to be done)

It may be the case we are doing something completely wrong

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with MemoryLarge Number of RHS amp Solution Vectors

Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples

Monostatic radar cross section (RCS) predictionLarge arrays of antennas

Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)

Use of sparse format for RHSUse of centralized solution vectors

I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors

Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesRepetitive Solver

Vivaldi antenna

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

(a) Unit Cell (b) Unit Cell Mesh

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Figure Virtual Mesh of Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface

problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems

I Identical matrices with different right hand sides

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical

I Further saving in timeI Large saving in memory

Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver

Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski

Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVERTime and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0 0

4 0 0

6 0 0

8 0 0

1 0 0 0Se

cond

s

u n k n o w n s

M U L T I S O L V E R M U M P S

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0

4 0

6 0

8 0

1 0 0

1 2 0

1 4 0Me

mory

in GB

u n k n o w n s

M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 21: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Some Issues with Memory (cont)Memory Allocation inside MUMPS

Memory IssueExample 45000000 dof problem using 192 processes and 4 bytes perintegerMAXS bandwidth is 45000000rArr 3456 GB memory per process

WorkaroundMatrix partition based on rows instead of elements of the mesh

I Slightly worse LU fill-in (size of cofactors) than with partition based onelements

Change ordering of dof as input to MUMPS (to be done)

It may be the case we are doing something completely wrong

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with MemoryLarge Number of RHS amp Solution Vectors

Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples

Monostatic radar cross section (RCS) predictionLarge arrays of antennas

Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)

Use of sparse format for RHSUse of centralized solution vectors

I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors

Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesRepetitive Solver

Vivaldi antenna

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

(a) Unit Cell (b) Unit Cell Mesh

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Figure Virtual Mesh of Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface

problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems

I Identical matrices with different right hand sides

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical

I Further saving in timeI Large saving in memory

Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver

Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski

Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVERTime and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0 0

4 0 0

6 0 0

8 0 0

1 0 0 0Se

cond

s

u n k n o w n s

M U L T I S O L V E R M U M P S

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0

4 0

6 0

8 0

1 0 0

1 2 0

1 4 0Me

mory

in GB

u n k n o w n s

M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 22: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Some Issues with MemoryLarge Number of RHS amp Solution Vectors

Large Number of RHS vectorsAnalysis of a given problem under a large number of excitations Examples

Monostatic radar cross section (RCS) predictionLarge arrays of antennas

Present MUMPS InterfaceTreatement of RHS solution amp vectors in blocks (typically 10-20 vectors at atime)

Use of sparse format for RHSUse of centralized solution vectors

I The reason behind the treatment of RHS amp solution vectors in blocks is tolimit the memory needed to storage solution vectors

GREMA-UC3M Higher Order Finite Element Code v12

Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors

Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesRepetitive Solver

Vivaldi antenna

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

(a) Unit Cell (b) Unit Cell Mesh

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Figure Virtual Mesh of Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface

problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems

I Identical matrices with different right hand sides

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical

I Further saving in timeI Large saving in memory

Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver

Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski

Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVERTime and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0 0

4 0 0

6 0 0

8 0 0

1 0 0 0Se

cond

s

u n k n o w n s

M U L T I S O L V E R M U M P S

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS & MULTISOLVER (cont.): Time and Memory Comparison

[Figure: memory in GB (0–140) versus number of unknowns (1×10⁶ to 6×10⁶) for MULTISOLVER and regular MUMPS, with the 128 GB system limit marked]

HPC Environment: Cluster of Xidian University (XDHPC)
- 140 compute nodes, each with:
  - Two twelve-core Intel Xeon 2690 V2 2.2 GHz CPUs
  - 64 GB of RAM
  - 18 TB of hard disk
- 56 Gbps InfiniBand network

Waveguide Problem: Low-Pass Filter with Higher-Order Mode Suppression
- Band: 10–16 GHz
- Length: 218 mm
- 3245 K tetrahedra
- 22 M unknowns
- Wall time: 73 min per frequency point

I. Arregui et al., "High-power low-pass harmonic filters with higher-order TEn0 and non-TEn0 mode suppression: design method and multipactor characterization," IEEE Trans. Microw. Theory Techn., Dec. 2013.

Waveguide Problem (cont.): Low-Pass Filter with Higher-Order Mode Suppression
[Figures]

Scattering Problem: Bistatic RCS of Car
- Bistatic RCS at 1.5 GHz
- Tyres modeled as dielectric (εr = 4.0)
- Several incident plane waves from different directions

Scattering Problem (cont.): Bistatic RCS of Car
- 27 M tetrahedra
- 173 M unknowns
- Wall time: 59 min per frequency point (46 compute nodes)

Scattering Problem (cont.): Bistatic RCS of Car
[Figure: incident plane wave arriving from behind]
[Figures: incident plane wave arriving from the front]

Parallel Scalability: Factorization and FE-IIEE Stages

[Figure: speedup of the factorization + solving phase versus number of cores (120–480), for 12, 6, and 4 threads per process]

[Figure: speedup of the FE-IIEE (mesh truncation) phase versus number of cores (120–480), for 12, 6, and 4 threads per process]

Benchmark: Bistatic RCS of Impala

Parallel Scalability: Whole Code

[Figure: speedup of factorization + solving + FE-IIEE versus number of cores (120–480), for 12, 6, and 4 threads per process]

Work in Progress and Future Work

Work in Progress:
- Hierarchical basis functions of variable order p; h-adaptivity ⇒ support for hp meshes

Future Work:
- Conformal and non-conformal DDM
- Hybrid (direct + iterative) solver


Thanks for your attention

Thanks to the MUMPS team

Page 23: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Some Issues with Memory (cont)Large Number of RHS amp Solution Vectors

Distributed Solution Planned for Near FutureUpdate of centralized RHS by FE-IIEErArr use of centralized solution isldquonaturalrdquo (easy in terms of code maintenance)Wish list distributed RHSiquestis distributed RHS feature planned for near future versions of MUMPS

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesRepetitive Solver

Vivaldi antenna

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

(a) Unit Cell (b) Unit Cell Mesh

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Figure Virtual Mesh of Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface

problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems

I Identical matrices with different right hand sides

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical

I Further saving in timeI Large saving in memory

Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver

Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski

Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVERTime and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0 0

4 0 0

6 0 0

8 0 0

1 0 0 0Se

cond

s

u n k n o w n s

M U L T I S O L V E R M U M P S

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0

4 0

6 0

8 0

1 0 0

1 2 0

1 4 0Me

mory

in GB

u n k n o w n s

M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 24: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Specialized MUMPS InterfacesRepetitive Solver

Vivaldi antenna

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

(a) Unit Cell (b) Unit Cell Mesh

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Figure Virtual Mesh of Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface

problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems

I Identical matrices with different right hand sides

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical

I Further saving in timeI Large saving in memory

Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver

Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski

Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVERTime and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0 0

4 0 0

6 0 0

8 0 0

1 0 0 0Se

cond

s

u n k n o w n s

M U L T I S O L V E R M U M P S

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0

4 0

6 0

8 0

1 0 0

1 2 0

1 4 0Me

mory

in GB

u n k n o w n s

M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 25: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Specialized MUMPS Interfaces (cont)Repetitive Solver

Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

(a) Unit Cell (b) Unit Cell Mesh

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Figure Virtual Mesh of Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface

problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems

I Identical matrices with different right hand sides

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical

I Further saving in timeI Large saving in memory

Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver

Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski

Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVERTime and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0 0

4 0 0

6 0 0

8 0 0

1 0 0 0Se

cond

s

u n k n o w n s

M U L T I S O L V E R M U M P S

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0

4 0

6 0

8 0

1 0 0

1 2 0

1 4 0Me

mory

in GB

u n k n o w n s

M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 26: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Specialized MUMPS Interfaces (cont)Repetitive Solver

(a) Unit Cell (b) Unit Cell Mesh

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Figure Virtual Mesh of Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface

problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems

I Identical matrices with different right hand sides

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical

I Further saving in timeI Large saving in memory

Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver

Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski

Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVERTime and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0 0

4 0 0

6 0 0

8 0 0

1 0 0 0Se

cond

s

u n k n o w n s

M U L T I S O L V E R M U M P S

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0

4 0

6 0

8 0

1 0 0

1 2 0

1 4 0Me

mory

in GB

u n k n o w n s

M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 27: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Specialized MUMPS Interfaces (cont)Repetitive Solver

Figure Virtual Mesh of Antenna Array

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface

problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems

I Identical matrices with different right hand sides

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont.): Repetitive Solver

Features and Remarks:
- Advantages: savings in both time and memory
- Under certain circumstances (number of cells equal to a power of 2 and no BC), all leaves at a given level of the tree are identical
  - Further savings in time
  - Large savings in memory
- Boundary conditions (BC) alter this one-branch tree behavior ⇒ BC may be left up to the root of the tree
- Alternatively, "algebraic symmetry" can be explored
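The saving from identical leaves is that one factorization serves every cell. A toy illustration of that reuse, with `scipy.linalg.lu_factor`/`lu_solve` standing in for the sparse multifrontal factorization and an arbitrary stand-in matrix:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Factor the (shared) unit-cell matrix once, then reuse the factors for
# every cell's right-hand side.
rng = np.random.default_rng(2)
A = rng.standard_normal((8, 8)) + 8.0 * np.eye(8)

factors = lu_factor(A)                # one factorization ...
for _ in range(16):                   # ... reused for many "cells"
    b = rng.standard_normal(8)
    x = lu_solve(factors, b)
    assert np.allclose(A @ x, b)
```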


Specialized MUMPS Interfaces: Hybrid Direct & Iterative Solver

- Multifrontal algorithm on only a few levels
- Iterative solution from the last level of the multifrontal algorithm
- It can be understood as the direct solver acting as a preconditioner for the iterative solver
- A natural approach for some DDM strategies
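A sketch of the hybrid idea under simplifying assumptions: only the diagonal (subdomain) blocks are factorized "directly", and a preconditioned Richardson iteration handles the coupling between blocks. The model matrix, sizes, and helper names are all illustrative, not part of the actual code:

```python
import numpy as np

n, nb = 200, 4
size = n // nb
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # diagonally dominant
b = np.ones(n)

# "direct" part: one inverse per subdomain block (stand-in for an LU factor)
blocks = [np.linalg.inv(A[i*size:(i+1)*size, i*size:(i+1)*size])
          for i in range(nb)]

def apply_precond(r):
    # block-diagonal "direct solve" acting as the preconditioner
    z = np.empty_like(r)
    for i, Minv in enumerate(blocks):
        z[i*size:(i+1)*size] = Minv @ r[i*size:(i+1)*size]
    return z

# "iterative" part: Richardson iteration preconditioned by the block solves
x = np.zeros(n)
for _ in range(200):
    x += apply_precond(b - A @ x)

assert np.linalg.norm(b - A @ x) < 1e-8
```

In practice a Krylov method (e.g. GMRES) would replace the Richardson loop, but the division of labor is the same: direct factorization inside the subdomains, iteration across them.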

Specialized MUMPS Interfaces: Dealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU Cofactors:
- Calls to multiple (typically sequential) MUMPS instances for the Schur complements
- Assembly of the Schur complements
- Finalization of the MUMPS instances
- Solution of the interface problem
- Creation of new MUMPS instances to solve the interior problems

Specialized MUMPS Interfaces (cont.): Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problems:
- Idea inspired by work led by Prof. Paszynski
- Reproduction (or restoration) of the interior matrix equation and the interior right-hand side
- Calls to multiple (typically sequential) MUMPS instances to factorize/solve the interior problems
- Use of Dirichlet conditions for the interface unknowns
- Preliminary tests show that the approach is advantageous in memory (as expected) while remaining competitive in time
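The multi-instance workflow above can be sketched densely in NumPy: each "instance" contributes its subdomain's Schur complement, the contributions are assembled into the interface problem, and after the interface solve the interiors are recovered by imposing the interface values as Dirichlet data. Matrix data and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, nb = 5, 4, 3
N = n1 + n2 + nb
A = rng.standard_normal((N, N)) + N * np.eye(N)
A[:n1, n1:n1+n2] = 0.0               # interiors of different subdomains
A[n1:n1+n2, :n1] = 0.0               # do not couple directly
f = rng.standard_normal(N)

s = [slice(0, n1), slice(n1, n1+n2), slice(n1+n2, N)]   # I1, I2, interface
S = A[s[2], s[2]].copy()
g = f[s[2]].copy()
for k in (0, 1):                      # per-subdomain Schur contribution
    Akk, Akb, Abk = A[s[k], s[k]], A[s[k], s[2]], A[s[2], s[k]]
    S -= Abk @ np.linalg.solve(Akk, Akb)
    g -= Abk @ np.linalg.solve(Akk, f[s[k]])

x = np.empty(N)
x[s[2]] = np.linalg.solve(S, g)       # assembled interface problem
for k in (0, 1):                      # interior solves with Dirichlet data
    x[s[k]] = np.linalg.solve(A[s[k], s[k]],
                              f[s[k]] - A[s[k], s[2]] @ x[s[2]])

assert np.allclose(A @ x, f)
```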

Regular MUMPS & MULTISOLVER: Time and Memory Comparison

[Figure: solve time in seconds (0 to 1000) versus number of unknowns (1×10^6 to 6×10^6) for MULTISOLVER and MUMPS.]

Regular MUMPS & MULTISOLVER (cont.): Time and Memory Comparison

[Figure: memory in GB (0 to 140) versus number of unknowns (1×10^6 to 6×10^6) for MULTISOLVER and MUMPS; the system limit of 128 GB is marked.]

HPC Environment: Cluster of Xidian University (XDHPC)

- 140 compute nodes, each with:
  - Two twelve-core Intel Xeon 2690 V2 CPUs at 2.2 GHz
  - 64 GB of RAM
  - 18 TB of hard disk
- 56 Gbps InfiniBand network

Waveguide Problem: Low Pass Filter with Higher-Order Mode Suppression

- Frequency band: [10, 16] GHz
- Length: 218 mm
- 3245 K tetrahedra
- 22 M unknowns
- Wall time: 73 min per frequency point

Reference: I. Arregui et al., "High-power low-pass harmonic filters with higher-order TEn0 and non-TEn0 mode suppression: design method and multipactor characterization," IEEE Trans. Microw. Theory Techn., Dec. 2013.

Waveguide Problem (cont.): Low Pass Filter with Higher-Order Mode Suppression

Scattering Problem: Bistatic RCS of Car

- Bistatic RCS at 15 GHz
- Tyres modelled as dielectric (εr = 4.0)
- Several incident plane waves from different directions

Scattering Problem (cont.): Bistatic RCS of Car

- 27 M tetrahedra
- 173 M unknowns
- Wall time: 59 min per frequency point (46 compute nodes)

Scattering Problem (cont.): Bistatic RCS of Car

- Incident plane wave arriving from behind

Scattering Problem (cont.): Bistatic RCS of Car

- Incident plane wave arriving from the front

Parallel Scalability: Factorization and FE-IIEE Stages

[Figure: speedup of the factorization + solving stage versus number of cores (120 to 480) for 12, 6, and 4 threads per process; annotated speedups of 49, 52, 57, and 70. Caption: Speedup graph corresponding to the factorization phase.]

[Figure: FE-IIEE speedup versus number of cores (120 to 480) for 12, 6, and 4 threads per process; annotated speedups of 78, 85, and 100. Caption: Speedup graph corresponding to the mesh truncation phase.]

Benchmark: Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 28: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Specialized MUMPS Interfaces (cont)Repetitive Solver

Algorithm1 Computation of Schur complement of unit-cell2 Assembled of Schur complements of all ldquovirtualrdquo cellsrArr Interface

problem3 Addition of boundary conditions to interface problem4 Solve the interface problem5 Solve interior unit cell problems

I Identical matrices with different right hand sides

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical

I Further saving in timeI Large saving in memory

Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver

Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski

Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVERTime and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0 0

4 0 0

6 0 0

8 0 0

1 0 0 0Se

cond

s

u n k n o w n s

M U L T I S O L V E R M U M P S

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0

4 0

6 0

8 0

1 0 0

1 2 0

1 4 0Me

mory

in GB

u n k n o w n s

M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 29: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Specialized MUMPS Interfaces (cont)Repetitive Solver

Features and RemarksAdvantages saving in time and memoryUnder certain circumstances (number of cells equal to power of 2 andno BC) all leaves of a certain level of the tree are identical

I Further saving in timeI Large saving in memory

Boundary conditions (BC) alter this one branch tree behaviorrArr BC may be left up to the root of the treeOr ldquoalgebraic symmetryrdquo can be explored

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Repetitive Solver

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver

Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski

Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVERTime and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0 0

4 0 0

6 0 0

8 0 0

1 0 0 0Se

cond

s

u n k n o w n s

M U L T I S O L V E R M U M P S

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0

4 0

6 0

8 0

1 0 0

1 2 0

1 4 0Me

mory

in GB

u n k n o w n s

M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 30: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Specialized MUMPS Interfaces (cont)Repetitive Solver

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver

Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski

Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVERTime and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0 0

4 0 0

6 0 0

8 0 0

1 0 0 0Se

cond

s

u n k n o w n s

M U L T I S O L V E R M U M P S

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0

4 0

6 0

8 0

1 0 0

1 2 0

1 4 0Me

mory

in GB

u n k n o w n s

M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 31: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Specialized MUMPS InterfacesHybrid Direct amp Iterative Solver

Hybrid Direct amp Iterative SolverMultifrontal algorithm on only a few levelsIterative solution from the last level of multifrontal algorithmIt can be understood as the direct solver acting as preconditioner of theiterative solverNatural approach to some DDM strategies

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski

Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVERTime and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0 0

4 0 0

6 0 0

8 0 0

1 0 0 0Se

cond

s

u n k n o w n s

M U L T I S O L V E R M U M P S

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0

4 0

6 0

8 0

1 0 0

1 2 0

1 4 0Me

mory

in GB

u n k n o w n s

M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 32: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Specialized MUMPS InterfacesDealing with the Lack of Availability of LU Cofactors

Lack of Availability of LU CofactorsCalls to multiple (typicall sequential) MUMPS instances for SchurcomplementsAssembling Schur complementsFinalizing MUMPS instancesSolve interface problemCreate new MUMPS instances to solve the interior problems

GREMA-UC3M Higher Order Finite Element Code v12

Specialized MUMPS Interfaces (cont)Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problemsq Idea inspired by work leaded by Prof Paszynski

Reproduction (or restore) of interior matrix equation and interior righthand sideCall to multiple (typically sequential) MUMPS instances tofactorizesolve the interior problemsUse of Dirichlet conditions for interface unknownsPreliminary tests shows that the approach is worthy in memory(expected) but competitive in time

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVERTime and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0 0

4 0 0

6 0 0

8 0 0

1 0 0 0Se

cond

s

u n k n o w n s

M U L T I S O L V E R M U M P S

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0

4 0

6 0

8 0

1 0 0

1 2 0

1 4 0Me

mory

in GB

u n k n o w n s

M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel Scalability: Factorization and FE-IIEE Stages

[Figure: speedup of the factorization + solving phase vs. number of cores (120-480), for 12, 6, and 4 threads per process; annotated values 4.9, 5.2, 5.7, and 7.0]

[Figure: speedup of the FE-IIEE (mesh truncation) phase vs. number of cores (120-480), for 12, 6, and 4 threads per process; annotated values 7.8, 8.5, and 10.0]

Benchmark: Bistatic RCS of Impala
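A note on reading graphs like these: since a problem of this size cannot run on a single core, the speedup is relative to the smallest run rather than to a serial baseline. The sketch below (illustrative only; the wall times are hypothetical placeholders, not measurements from this talk) shows how such relative speedup and parallel efficiency figures are typically computed.

```python
def relative_speedup(t_base: float, t: float) -> float:
    """Speedup of a run with wall time t w.r.t. the baseline run."""
    return t_base / t

def relative_efficiency(t_base: float, c_base: int, t: float, c: int) -> float:
    """Parallel efficiency w.r.t. the baseline core count."""
    return (t_base * c_base) / (t * c)

# Hypothetical wall times (seconds) at 120, 240, 360 and 480 cores
times = {120: 1200.0, 240: 700.0, 360: 520.0, 480: 430.0}
c_base = min(times)
for c in sorted(times):
    s = relative_speedup(times[c_base], times[c])
    e = relative_efficiency(times[c_base], c_base, times[c], c)
    print(f"{c:3d} cores: speedup {s:.2f}, efficiency {e:.2f}")
```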

Parallel Scalability: Whole Code

[Figure: speedup of the whole code (factorization + solving + FE-IIEE) vs. number of cores (120-480), for 12, 6, and 4 threads per process; annotated values 6.2, 6.4, 7.0, and 8.0]

Work in Progress and Future Work

Work in Progress:
- Hierarchical basis functions of variable order (p- and h-adaptivity) ⇒ support for hp meshes

Future Work:
- Conformal and non-conformal DDM
- Hybrid (direct + iterative) solver


Specialized MUMPS Interfaces (cont.): Dealing with the Lack of Availability of LU Cofactors

MUMPS Instances for Interior Problems
- Idea inspired by work led by Prof. Paszynski
- Reproduce (restore) the interior matrix equation and interior right-hand side
- Call multiple (typically sequential) MUMPS instances to factorize/solve the interior problems
- Use Dirichlet conditions for the interface unknowns
- Preliminary tests show the approach is advantageous in memory (as expected) while remaining competitive in time
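The bullets above can be sketched as follows. This is a minimal illustration, not the authors' code: the DOF partition, the helper `solve_interior`, and the toy 1-D Laplacian are all assumptions for the example, and SciPy's `splu` stands in for a sequential MUMPS instance applied to one interior problem with Dirichlet data on the interface unknowns.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def solve_interior(A, b, interior, interface, g):
    """Solve A x = b on the interior DOFs with x = g prescribed
    (Dirichlet) on the interface DOFs."""
    A = A.tocsr()
    A_ii = A[interior][:, interior].tocsc()   # interior block
    A_ig = A[interior][:, interface]          # coupling to interface DOFs
    rhs = b[interior] - A_ig @ g              # move known interface values to the RHS
    lu = spla.splu(A_ii)                      # sequential direct factorization
    x = np.zeros(A.shape[0])                  # (MUMPS plays this role in the code)
    x[interface] = g
    x[interior] = lu.solve(rhs)
    return x

# Tiny 1-D Laplacian example: 5 DOFs, the two end DOFs act as "interface"
n = 5
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
b = np.zeros(n)
interior = np.array([1, 2, 3])
interface = np.array([0, 4])
g = np.array([0.0, 1.0])   # prescribed interface values
x = solve_interior(A, b, interior, interface, g)
# Discrete Laplace equation with u=0 and u=1 at the ends -> linear ramp
```

In the multi-instance setting, each subdomain's interior block would get its own (typically sequential) factorization, which is what keeps the per-instance memory footprint small.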

Regular MUMPS & MULTISOLVER: Time and Memory Comparison

[Figure: wall time in seconds (0-1000) vs. number of unknowns (1x10^6 to 6x10^6) for MULTISOLVER and MUMPS]

Regular MUMPS & MULTISOLVER (cont.): Time and Memory Comparison

[Figure: memory in GB (0-140) vs. number of unknowns (1x10^6 to 6x10^6) for MULTISOLVER and MUMPS; system limit of 128 GB marked]

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 34: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Regular MUMPS amp MULTISOLVERTime and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0 0

4 0 0

6 0 0

8 0 0

1 0 0 0Se

cond

s

u n k n o w n s

M U L T I S O L V E R M U M P S

GREMA-UC3M Higher Order Finite Element Code v12

Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0

4 0

6 0

8 0

1 0 0

1 2 0

1 4 0Me

mory

in GB

u n k n o w n s

M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 35: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Regular MUMPS amp MULTISOLVER (cont)Time and Memory Comparison

1 x 1 0 6 2 x 1 0 6 3 x 1 0 6 4 x 1 0 6 5 x 1 0 6 6 x 1 0 60

2 0

4 0

6 0

8 0

1 0 0

1 2 0

1 4 0Me

mory

in GB

u n k n o w n s

M U L T I S O L V E RM U M P S S Y S T E M L I M I T ( 1 2 8 G B )

GREMA-UC3M Higher Order Finite Element Code v12

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 36: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

HPC EnvironmentCluster of Xidian University (XDHPC)

Cluster of Xidian University140 compute nodes

I Two twelve-core Intel Xeon 2690 V2 22 GHz CPUsI 64 GB of RAMI 18 TB of hard disk

56 Gbps InfiniBand network

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 37: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Waveguide ProblemLow Pass Filter with Higher-Order Mode Suppression

[10 minus 16] GHz

Length 218 mm

3245 K tetrahedrons

22 M unknowns

Wall time 73 min per freq point

I Arregui et al ldquoHigh-power low-pass harmonic filters with higher-order TEn0 and non-TEn0mode suppression design method and multipactor characterizationrdquo IEEE Trans MicrowTheory Techn Dec 2013

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 38: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 39: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Waveguide Problem (cont)Low Pass Filter with Higher-Order Mode Suppression

GREMA-UC3M Higher Order Finite Element Code v12

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 40: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Scattering ProblemBistatic RCS of Car

Bistatic RCS at 15 GHz

Tyres modeles as dielectric (εr = 40)Several incident planes waves fromdifferent directions

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 41: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Scattering Problem (cont)Bistatic RCS of Car

27 M tetrahedrons

173 M unknownsWall time 59 min per freq point(46 compute nodes)

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team

Page 42: Higher-Order Finite Element Code for Electromagnetic ...mumps.enseeiht.fr/doc/ud_2017/Castillo_Talk.pdf · Higher-Order Finite Element Code for Electromagnetic Simulation on HPC Environments

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from behind

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Scattering Problem (cont)Bistatic RCS of Car

q Incident plane wave arriving from the front

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityFactorization and FE-IIEE Stages

1 2 0 2 4 0 3 6 0 4 8 00 60 81 01 21 41 61 82 02 22 42 62 83 0

4 9 5 2 5 7

7 0

factor

izatio

n + so

lving s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to thefactorization phase

1 2 0 2 4 0 3 6 0 4 8 0

1 0

1 5

2 0

2 5

3 0

3 5

4 0 1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

7 8 8 5

1 0 0

FE-IIE

E spe

edup

n u m b e r o f c o r e sSpeedup graph corresponding to the

mesh truncation phase

Benchmark Bistatic RCS of Impala

GREMA-UC3M Higher Order Finite Element Code v12

Parallel ScalabilityWhole Code

1 2 0 2 4 0 3 6 0 4 8 00 81 01 21 41 61 82 02 22 42 62 83 03 23 4

6 2 6 4 7 0

8 0 fac

toriza

tion +

solvin

g + FE

-IIEE s

peed

up

n u m b e r o f c o r e s

1 2 t h r e a d s p e r p r o c e s s 6 t h r e a d s p e r p r o c e s s 4 t h r e a d s p e r p r o c e s s

Speedup graph corresponding to the whole code

GREMA-UC3M Higher Order Finite Element Code v12

Work in Progress and Future Work

Work in ProgressHierarchical basis functions of variable order ph-adaptivityrArr support for hp meshes

Future WorkConformal and non-conformal DDMHybrid (direct + iterative) solver

GREMA-UC3M Higher Order Finite Element Code v12

Thanks for your attention

Thanks to the MUMPS team
