Trends in Sparse Linear Research and Software Developments in France



Page 1

Trends in Sparse Linear Research and Software Developments in France

P. Amestoy 1, M. Daydé 1, I. Duff 2, L. Giraud 1, A. Haidar 3, S. Lanteri 4, J.-Y. L’Excellent 5, P. Ramet 6

1 IRIT – INPT / ENSEEIHT
2 CERFACS & RAL
3 CERFACS
4 INRIA, nachos project-team
5 INRIA / LIP-ENSL
6 INRIA Futurs / LaBRI

Page 2

Overview of sparse linear algebra techniques

From Iain Duff

CERFACS & RAL

Page 3

Introduction

Solution of

Ax = b

where A is large (the dimension may be 10^6 or greater) and sparse
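As an illustration (added here, not part of the original slides), a matrix of this size is stored in a compressed format that keeps only the nonzeros. A minimal sketch in C of the common compressed sparse row (CSR) layout, together with the matrix-vector product that the iterative methods discussed later rely on:

```c
#include <stddef.h>

/* Minimal CSR storage for an n-by-n sparse matrix: only the nnz
 * nonzero entries are kept, so a system with n = 10^6 and a few
 * nonzeros per row fits comfortably in memory. */
typedef struct {
    size_t  n;       /* matrix dimension                               */
    size_t  nnz;     /* number of stored nonzeros                      */
    size_t *rowptr;  /* n+1 offsets: row i is [rowptr[i], rowptr[i+1]) */
    size_t *colind;  /* column index of each stored entry, size nnz    */
    double *val;     /* numerical value of each stored entry, size nnz */
} csr_matrix;

/* Sparse matrix-vector product y = A*x. */
void csr_matvec(const csr_matrix *A, const double *x, double *y)
{
    for (size_t i = 0; i < A->n; ++i) {
        double s = 0.0;
        for (size_t k = A->rowptr[i]; k < A->rowptr[i + 1]; ++k)
            s += A->val[k] * x[A->colind[k]];
        y[i] = s;
    }
}
```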

Page 4

Direct methods

Key idea: factorize the matrix into a product of matrices that are easy to invert (triangular), possibly with permutations to preserve sparsity and maintain numerical stability.

E.g. Gaussian elimination: PAQ = LU, where
• P and Q are row / column permutations,
• L is lower triangular (sparse),
• U is upper triangular (sparse).

Then forward / backward substitution: Ly = Pb, then U Q^T x = y (see the sketch below).
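A hedged sketch (not from the slides) of the forward-substitution step Ly = c on a sparse triangular factor, assuming CSR storage as in the earlier sketch and assuming the diagonal entry is stored last in each row:

```c
#include <stddef.h>

/* Forward substitution L y = c for a sparse lower-triangular L in
 * CSR form, under the assumption that each row is nonempty and its
 * diagonal entry is the last one stored. Backward substitution with
 * U is symmetric, sweeping rows from n-1 down to 0. */
void sparse_forward_subst(size_t n,
                          const size_t *rowptr,  /* n+1 row offsets */
                          const size_t *colind,  /* column indices  */
                          const double *val,     /* nonzero values  */
                          const double *c,       /* right-hand side */
                          double *y)             /* solution, size n */
{
    for (size_t i = 0; i < n; ++i) {
        double s = c[i];
        size_t k;
        /* subtract contributions of already-computed unknowns */
        for (k = rowptr[i]; k + 1 < rowptr[i + 1]; ++k)
            s -= val[k] * y[colind[k]];
        y[i] = s / val[k];  /* divide by the diagonal entry */
    }
}
```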

Good points
• Can solve very large problems, even in 3D
• Generally fast, even on fancy machines
• Robust and well packaged

Bad points
• Need A explicitly, either element-wise or assembled: storage requirements can be very high
• Can be costly

Page 5

Iterative methods

Generate a sequence of vectors converging towards the solution of the linear system.

E.g. iterative methods based on Krylov subspaces:
K_k(A; r_0) = span{r_0, A r_0, ..., A^{k-1} r_0}, where r_0 = b − A x_0.

The idea is then to choose a suitable x_k ∈ x_0 + K_k(A; r_0), for example so that it minimizes ||b − A x_k||_2 (GMRES).

There are many Krylov methods, depending on the criterion used for choosing x_k.

Good points
• May not require forming A explicitly (see the sketch below)
• Usually very low storage requirements
• Can be efficient on 3D problems

Bad points
• May require many iterations to converge
• May require preconditioning
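To make one Krylov method concrete (an addition to the slides), here is a minimal unpreconditioned conjugate gradient loop for symmetric positive definite A; GMRES plays the analogous role for general matrices. The matrix enters only through matrix-vector products, which is why A may never need to be formed:

```c
#include <stdlib.h>

/* The matrix is accessed only through a user-supplied product
 * y = A*x, e.g. a wrapper around csr_matvec from the earlier sketch. */
typedef void (*matvec_fn)(size_t n, const double *x, double *y, void *ctx);

static double dot(size_t n, const double *u, const double *v)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) s += u[i] * v[i];
    return s;
}

/* Unpreconditioned conjugate gradients for SPD A: the iterate x_k
 * lives in x_0 + K_k(A; r_0). Returns the number of iterations. */
int cg_solve(size_t n, matvec_fn Amul, void *ctx,
             const double *b, double *x, int maxit, double rtol)
{
    double *r = malloc(n * sizeof *r);
    double *p = malloc(n * sizeof *p);
    double *q = malloc(n * sizeof *q);

    Amul(n, x, q, ctx);                          /* q = A x_0         */
    for (size_t i = 0; i < n; ++i) p[i] = r[i] = b[i] - q[i];
    double rho  = dot(n, r, r);
    double stop = rtol * rtol * dot(n, b, b);    /* relative residual */

    int k = 0;
    while (k < maxit && rho > stop) {
        Amul(n, p, q, ctx);
        double alpha = rho / dot(n, p, q);       /* step length       */
        for (size_t i = 0; i < n; ++i) { x[i] += alpha * p[i];
                                         r[i] -= alpha * q[i]; }
        double rho_new = dot(n, r, r);
        for (size_t i = 0; i < n; ++i)           /* next direction    */
            p[i] = r[i] + (rho_new / rho) * p[i];
        rho = rho_new;
        ++k;
    }
    free(r); free(p); free(q);
    return k;
}
```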

Page 6

Hybrid methods

Not just preconditioning:
• In ILU, sparse approximate inverse, or incomplete Cholesky preconditioners, core techniques of direct methods are used.

We focus on:
• using a direct method/code combined with an iterative method.

Page 7

Hybrid methods (cont’d)

Generic examples of hybrid methods are:
• Domain decomposition: direct method on local subdomains and/or direct preconditioner on the interface
• Block iterative methods: direct solver on subblocks
• Factorization of a nearby problem as a preconditioner

Page 8

Sparse direct methods

Page 9

Sparse Direct Solvers

Usually three steps:
• Pre-processing: symbolic factorization
• Numerical factorization
• Forward-backward substitution

[Diagram: simulating a phenomenon leads to solving a sparse linear system (matrix ~100 MB); factorizing the matrix needs ~10 GB. MUMPS: 71 GFlops on 512 processors of a CRAY T3E.]

Page 10

MUMPS and sparse direct methods

MUMPS team

http://graal.ens-lyon.fr/MUMPS and http://mumps.enseeiht.fr

Page 11

History

Main contributors since 1996: Patrick Amestoy, Iain Duff, Abdou Guermouche, Jacko Koster, Jean-Yves L’Excellent, Stéphane Pralet

Current development team:
• Patrick Amestoy, ENSEEIHT-IRIT
• Abdou Guermouche, LABRI-INRIA
• Jean-Yves L’Excellent, INRIA
• Stéphane Pralet, now working for SAMTECH

PhD students:
• Emmanuel Agullo, ENS-Lyon
• Tzvetomila Slavova, CERFACS

Page 12

Users

Around 1000 users, 2 requests per day; academic or industrial.

Types of applications:
• Structural mechanics, CAD
• Fluid dynamics, magnetohydrodynamics, physical chemistry
• Wave propagation and seismic imaging, ocean modelling
• Acoustics and electromagnetics propagation
• Biology
• Finite element analysis, numerical optimization, simulation
• ...

Page 13

MUMPS: A MUltifrontal Massively Parallel Solver

MUMPS solves large systems of linear equations of the form Ax = b by factorizing A into A = LU or A = LDL^T

Symmetric or unsymmetric matrices (partial pivoting)

Parallel factorization and solution phases (uniprocessor version also available)

Iterative refinement and backward error analysis

Various matrix input formats:
• assembled format
• distributed assembled format
• sum of elemental matrices

Partial factorization and Schur complement matrix

Version for complex arithmetic

Several orderings interfaced: AMD, AMF, PORD, METIS, SCOTCH

Page 14

The multifrontal method (Duff and Reid, 1983)

Memory is divided into two parts (that can overlap in time):
• the factors
• the active memory

The elimination tree represents task dependencies (its definition is recalled below).
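For reference (an addition; this is the standard definition rather than slide content), the elimination tree is derived from the sparsity pattern of the factor: the parent of column j is the row index of its first subdiagonal nonzero.

```latex
% Elimination tree of a factorization A = LL^T (or LU with symmetric
% pattern): node j's parent is the first row below j in which column j
% of the factor L has a nonzero entry.
\operatorname{parent}(j) \;=\; \min\,\{\, i > j \;:\; \ell_{ij} \neq 0 \,\}
```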

Page 15

Implementation

Distributed multifrontal solver (MPI / F90 based)

Dynamic distributed scheduling to accommodate both numerical fill-in and multi-user environments

Use of BLAS, ScaLAPACK

A fully asynchronous distributed solver (VAMPIR trace, 8 processors)

Page 16

MUMPS: 3 main steps (plus initialization and termination), illustrated in the sketch below:

• JOB=-1: initialization; set solver type (LU, LDL^T) and default parameters
• JOB=1: analyse the matrix, build an ordering, prepare the factorization
• JOB=2: (parallel) numerical factorization A = LU
• JOB=3: (parallel) solution step; forward and backward substitutions (Ly = b, Ux = y)
• JOB=-2: termination; deallocate all MUMPS data structures
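A sketch (not from the slides) of this JOB-driven sequence through the MUMPS C interface, for a toy 2x2 assembled system; the structure fields follow the style of the dmumps_c examples and should be checked against the installed MUMPS version:

```c
#include <mpi.h>
#include "dmumps_c.h"          /* double-precision MUMPS C interface */

#define USE_COMM_WORLD (-987654) /* MUMPS convention for MPI_COMM_WORLD */

int main(int argc, char **argv)
{
    int myid;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* Tiny diagonal system A = diag(2,3), b = (4,9), so x = (2,3),
       in assembled (coordinate) format with 1-based indices. */
    MUMPS_INT irn[] = {1, 2};
    MUMPS_INT jcn[] = {1, 2};
    double    a[]   = {2.0, 3.0};
    double    rhs[] = {4.0, 9.0};

    DMUMPS_STRUC_C id;
    id.job = -1;                  /* JOB=-1: initialize               */
    id.par = 1;                   /* host participates in computation */
    id.sym = 0;                   /* unsymmetric: LU factorization    */
    id.comm_fortran = USE_COMM_WORLD;
    dmumps_c(&id);

    if (myid == 0) {              /* matrix defined on the host       */
        id.n = 2; id.nz = 2;
        id.irn = irn; id.jcn = jcn; id.a = a; id.rhs = rhs;
    }

    id.job = 1;  dmumps_c(&id);   /* JOB=1: analysis and ordering     */
    id.job = 2;  dmumps_c(&id);   /* JOB=2: numerical factorization   */
    id.job = 3;  dmumps_c(&id);   /* JOB=3: solve; rhs now holds x    */

    id.job = -2; dmumps_c(&id);   /* JOB=-2: free MUMPS structures    */
    MPI_Finalize();
    return 0;
}
```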

Page 17

Car body: 148,770 unknowns and 5,396,386 nonzeros (MSC.Software)

AVAILABILITY

MUMPS is available free of charge

It is used on a number of platforms (CRAY, SGI, IBM, Linux, ...) and is downloaded once a day on average (applications in chemistry, aeronautics, geophysics, ...)

If you are interested in obtaining MUMPS for your own use, please refer to the MUMPS home page

Some MUMPS users: Boeing, BRGM, CEA, Dassault, EADS, EDF, MIT, NASA, SAMTECH, ...

Page 18

COMPETITIVE PERFORMANCE

Comparison with SuperLU, extracted from ACM TOMS 2001 and obtained with S. Li.

Factorization time in seconds vs. number of processors:

Matrix   Solver    1    4      8     16    32    64    128
Bbmat    MUMPS     -    32.1   10.8  12.3  10.4  9.1   7.8
         SuperLU   -    132.9  72.5  39.8  23.5  15.6  11.1
Ecl32    MUMPS     -    23.9   13.4  9.7   6.6   5.6   5.4
         SuperLU   -    48.5   26.6  15.7  9.6   7.6   5.6

Recent performance results (ops = number of operations)

Factorization time in seconds of large matrices on the CRAY T3E (1 proc: not enough memory):

Matrix         Ops (x10^9)   1 proc   64 procs   128 procs
AUDIKW_1       5682          3262.8   54.6       35.9
BRGM           31010         -        283.9      -
CONESHL_mod    1640          1099.0   19.6       12.1
CONV3D64       23880         -        207.5      146.5
ULTRASOUND80   3915          1542.2   37.1       29.5

Page 19

Functionalities, Features

Recent features:
• Symmetric indefinite matrices: preprocessing and 2-by-2 pivots
• Hybrid scheduling
• 2D cyclic distributed Schur complement
• Sparse, multiple right-hand sides
• Singular matrices with detection of null pivots
• Interfaces to MUMPS: Fortran, C, Matlab (S. Pralet, while at ENSEEIHT-IRIT) and Scilab (A. Fèvre, INRIA)

Future functionalities:
• Out-of-core execution
• Parallel analysis phase
• Rank-revealing algorithms
• Hybrid direct-iterative solvers (with Luc Giraud)

Page 20

On-going research on out-of-core solvers
(Ph.D. E. Agullo, ENS Lyon, and Ph.D. M. Slavova, CERFACS)

• Use disk storage to solve very large problems
• Parallel out-of-core factorization
• Preprocessing to minimize the volume of I/O
• Scheduling for the out-of-core solution phase

[Figure: ratio of active to total memory peak on different numbers of processors for several large problems]

Page 21

Hybrid Scheduling

Both memory and workload information are used to obtain better behaviour in terms of estimated memory, memory used, and factorization time in the context of parallel factorization algorithms.

The estimated memory is much closer to the memory effectively used.

Estimated and effective memory (millions of reals) for the factorization on 64 processors
Max: maximum amount of memory over the processors; Avg: average memory per processor

Matrix              MUMPS standard      MUMPS hybrid
                    Estim.    Real      Estim.    Real
AUDIKW_1      Max   118.7     50.7      73.9      41.9
              Avg   76.2      31.4      49.5      32.1
BRGM          Max   406.6     -         257.6     175.1
              Avg   185.0     -         158.9     123.5
CONESHL_mod   Max   59.6      33.1      33.8      22.5
              Avg   25.2      16.8      21.6      16.2
CONV3D64      Max   93.7      88.4      86.9      81.0
              Avg   68.7      60.5      60.9      60.2
ULTRASOUND80  Max   43.0      38.9      29.3      27.2
              Avg   26.2      22.4      23.5      21.8

Page 22

Memory minimizing schedules

Multifrontal methods can use a large amount of temporary data.

By decoupling task allocation and task processing, we can reduce the amount of temporary data: a new optimal schedule has been proposed in this context (Guermouche and L'Excellent, ACM TOMS).

Memory gains:

[Figure: active memory ratio, new algorithm vs. Liu's ordering]

Remark: gains relative to Liu's algorithm are 27.1, 17.5 and 19.6 for matrices 8, 9 and 10 (Gupta matrices), respectively.

Page 23

Preprocessing for symmetric matrices (S. Pralet, ENSEEIHT-IRIT)

Preprocessing: new scaling available, symmetric weighted matching and automatic tuning of the preprocessing strategies

Pivoting strategy (2-by-2 pivots and static pivoting)

Improvements:
• factorization time
• robustness, in particular on KKT systems arising from optimization
• memory estimation

Factorization time on a Linux PC (Pentium 4, 2.80 GHz):

Matrix      n        nnz      Old    New
CONT-300    180095   539396   -      4.2
BOYD2       466316   890091   -      2.6
STOKES128   49666    295938   1.5    0.8

Page 24

Scotch, PaStiX

PaStiX Team

INRIA Futurs / LaBRI

Page 25

PaStiX solver: Functionalities

• LL^T, LDL^T, LU factorization (symmetric pattern) with supernodal implementation
• Static pivoting (maximum weight matching) + iterative refinement / CG / GMRES
• 1D/2D block distribution + full BLAS3
• Supports external ordering libraries (Scotch ordering provided)
• MPI/threads implementation (SMP node / cluster / multi-core / NUMA)
• Single/double precision + real/complex arithmetic
• Requires only C + MPI + POSIX threads
• Multiple right-hand sides (direct factorization)
• Incomplete factorization ILU(k) preconditioner

Page 26

PaStiX solver (cont’d)

Available on INRIA GForge:
• All-in-one source code
• Easy to install on Linux or AIX systems
• Simple API (WSMP-like)
• Thread-safe (can be called from multiple threads in multiple MPI communicators)

Current work:
• Use of parallel ordering (PT-Scotch) and parallel symbolic factorization
• Dynamic scheduling inside SMP nodes (static mapping)
• Out-of-core implementation
• Generic finite element assembly (domain decomposition associated with the matrix distribution)

Page 27

Direct solver chain (in PaStiX)

Scotch (ordering & amalgamation)
→ Fax (block symbolic factorization)
→ Blend (refinement & mapping)
→ Sopalin (factorizing & solving)

Data flow: graph → partition → symbol matrix → distributed solver matrix → distributed factorized solver matrix → distributed solution

Analysis (sequential steps); factorization and solve (parallel).

Page 28

Direct solver chain (in PaStiX)

Sparse matrix ordering (minimizes fill-in)
• Scotch: a hybrid algorithm
  • incomplete nested dissection
  • the resulting subgraphs being ordered with an approximate minimum degree method under constraints (HAMD)

Page 29

Direct solver chain (in PaStiX)

The symbolic block factorization: Q(G,P) → Q(G,P)* = Q(G*,P)
⇒ linear in the number of blocks!

Dense block structures → only a few extra pointers to store the matrix

Page 30

Direct solver chain (in PaStiX)

[Figure: example mapping of dense blocks onto processors 1-8]

CPU time prediction

Exact memory resources

Page 31

Direct solver chain (in PaStiX)

Modern architecture management (SMP nodes): hybrid threads/MPI implementation (all processors in the same SMP node work directly in shared memory)

Fewer MPI communications and a lower parallel memory overhead


Page 32

Incomplete factorization in PaStiX

Start from the acknowledgement that it is difficult to build a generic and robust preconditioner for:
• large-scale 3D problems
• high-performance computing

Derive direct solver techniques into a preconditioner.

What's new: (dense) block formulation

Incomplete block symbolic factorization:
• remove blocks with algebraic criteria
• use an amalgamation algorithm to get dense blocks

Provides incomplete LDL^T, Cholesky, and LU factorizations (with static pivoting for symmetric pattern)
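For background (an addition to the slides), the classical level-of-fill rule behind ILU(k): every entry carries a level, initialized to 0 for original nonzeros and to infinity otherwise, and fill whose level exceeds k is dropped.

```latex
% When entry (i,j) is updated through pivot row p during elimination:
\operatorname{lev}(i,j) \;\leftarrow\; \min\bigl(\operatorname{lev}(i,j),\;
    \operatorname{lev}(i,p) + \operatorname{lev}(p,j) + 1\bigr)
% ILU(k) keeps an entry only if lev(i,j) <= k; in particular ILU(0)
% keeps exactly the sparsity pattern of A.
```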

Page 33

Numerical experiments (TERA1)

Successful approach for a large collection of industrial test cases (PARASOL, Harwell-Boeing, CEA) on an IBM SP3

TERA1 supercomputer of CEA Ile-de-France (ES45 SMP nodes, 4 procs each)

COUPOLE40000:
• 26.5 x 10^6 unknowns
• 1.5 x 10^10 NNZL and 10.8 Tflops
• 356 procs: 34 s / 512 procs: 27 s / 768 procs: 20 s
  (> 500 Gflop/s, about 35% of peak performance)

Page 34

Numerical experiments (TERA10)

Successful approach on a 3D mesh problem with about 30 million unknowns on the TERA10 supercomputer

But memory is the bottleneck!

ODYSSEE code of French CEA/CESTA:
electromagnetism code (finite element method + integral equation), complex double precision, Schur complement

Page 35

Links

Scotch: http://gforge.inria.fr/projects/scotch
PaStiX: http://gforge.inria.fr/projects/pastix
MUMPS: http://mumps.enseeiht.fr/ and http://graal.ens-lyon.fr/MUMPS
ScAlApplix: http://www.labri.fr/project/scalapplix

ANR CIGC Numasis; ANR CIS Solstice & Aster

Latest publication, to appear in Parallel Computing: On finding approximate supernodes for an efficient ILU(k) factorization

For more publications, see: http://www.labri.fr/~ramet/

Page 36

Industrial applications

OSSAU code of French CEA/CESTA: 2D / 3D structural mechanics code

ODYSSEE code of French CEA/CESTA: electromagnetism code (finite element method + integral equation), complex double precision, Schur complement

Fluid mechanics: LU factorization with static pivoting (SuperLU-like approach)

Page 37

Other parallel sparse direct codes

Shared-memory codes

Code       Technique           Scope     Availability (www.)
MA41       Multifrontal        UNS       cse.clrc.ac.uk/Activity/HSL
MA49       Multifrontal QR     RECT      cse.clrc.ac.uk/Activity/HSL
PanelLLT   Left-looking        SPD       Ng
PARDISO    Left-right looking  UNS       Schenk
PSL        Left-looking        SPD/UNS   SGI product
SPOOLES    Fan-in              SYM/UNS   netlib.org/linalg/spooles
SuperLU    Left-looking        UNS       nersc.gov/xiaoye/SuperLU
WSMP       Multifrontal        SYM/UNS   IBM product

Page 38

Other parallel sparse direct codes

Distributed-memory codes

Code      Technique     Scope     Availability (www.)
CAPSS     Multifrontal  SPD       netlib.org/scalapack
MUMPS     Multifrontal  SYM/UNS   mumps.enseeiht.fr, graal.ens-lyon.fr/MUMPS
PaStiX    Fan-in        SYM/UNS   gforge.inria.fr/pastix
PSPASES   Multifrontal  SPD       cs.umn.edu/mjoshi/pspases
SPOOLES   Fan-in        SYM/UNS   netlib.org/linalg/spooles
SuperLU   Fan-out       UNS       nersc.gov/xiaoye/SuperLU
S+        Fan-out       UNS       cs.ucsb.edu/research/S+
WSMP      Multifrontal  SYM       IBM product

Page 39

Sparse solver for Ax = b: only a black box?

Preprocessing and postprocessing:
• Symmetric permutations to reduce fill: Ax = b becomes (PAP^T)(Px) = Pb
• Numerical pivoting, scaling to preserve numerical accuracy
• Maximum transversal (set large entries on the diagonal)
• Preprocessing for parallelism (influence of task mapping on parallelism)
• Iterative refinement, error analysis (see the sketch below)

Default (often automatic/adaptive) settings of the options are available. However, a better knowledge of the options can help the user to further improve memory usage, time to solution, and numerical accuracy.
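To illustrate the postprocessing step (an added sketch with hypothetical callback names), one pass of iterative refinement reuses the already computed factors to correct the solution against a residual formed with the original matrix:

```c
#include <stddef.h>

/* Hypothetical callbacks: `matvec` applies the original A, while
 * `factored_solve` applies the already computed LU factors. */
typedef void (*matvec_fn)(size_t n, const double *x, double *y, void *ctx);
typedef void (*solve_fn)(size_t n, const double *rhs, double *sol, void *ctx);

/* One step of iterative refinement. The residual uses the original
 * matrix, so the corrected x can recover accuracy lost to, e.g.,
 * static pivoting during the factorization. */
void refine_once(size_t n, matvec_fn matvec, solve_fn factored_solve,
                 void *ctx, const double *b, double *x,
                 double *r /* scratch */, double *d /* scratch */)
{
    matvec(n, x, r, ctx);                               /* r = A x      */
    for (size_t i = 0; i < n; ++i) r[i] = b[i] - r[i];  /* r = b - A x  */
    factored_solve(n, r, d, ctx);                       /* solve A d = r */
    for (size_t i = 0; i < n; ++i) x[i] += d[i];        /* x += d        */
}
```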

Page 40

The GRID-TLSE Project

A web expert site for sparse linear algebra

Page 41

Overview of GRID-TLSE: http://gridtlse.org

Supported by:
• ANR LEGO Project
• ANR SOLSTICE Project
• CNRS / JST Program: REDIMPS Project
• ACI GRID-TLSE Project

Partners: [logos]

Page 42

Sparse Matrices Expert Site?

Expert site: help users in choosing the right solvers and their parameters for a given problem

Chosen approach: expert scenarios which answer common user requests

Main goal: provide a friendly test environment for expert and non-expert users of sparse linear algebra software

Page 43

Sparse Matrices Expert Site?

Easy access to:
• Software and tools;
• A wide range of computer architectures;
• Matrix collections;
• Expert scenarios.

Also: provide a testbed for sparse linear algebra software

Page 44

Why use the grid?

Sparse linear algebra software makes use of sophisticated algorithms for (pre-/post-) processing the matrix.

Multiple parameters interact in the efficient execution of a sparse direct solver:
• Ordering;
• Amount of memory;
• Architecture of computer;
• Available libraries.

Determining the best combination of parameter values is a multi-parametric problem, well-suited for execution over a grid.

Page 45

Components

How do software X and Y compare in terms of memory and CPU on my favourite matrix A?

[Bar chart: memory and CPU of MUMPS vs. SuperLU on matrix GRE]

Software components:

• Weaver: high-level administrator for the deployment and exploitation of services on the grid

• Websolve: a Web interface to start services on the grid

• Middleware: DIET, developed within GRID-ASP (LIP, Loria Resedas, LIFC-SDRP) and soon ITBL (?)

Page 46

Hybrid Solvers

Page 47

Parallel hybrid iterative/direct solver for the solution of large sparse linear systems arising from 3D elliptic discretizations

L. Giraud 1, A. Haidar 2, S. Watson 3

1 ENSEEIHT, Parallel Algorithms and Optimization Group, 2 rue Camichel, 31071 Toulouse, France
2 CERFACS, Parallel Algorithm Project, 42 Avenue Coriolis, 31057 Toulouse, France
3 Departments of Computer Science and Mathematics, Virginia Polytechnic Institute, USA

Page 48

Non-overlapping domain decomposition

A natural approach for PDEs, extended to general sparse matrices.

Partition the problem into subdomains (subgraphs):
• Use a direct solver on the subdomains (MUMPS package)
• Robust algebraically preconditioned iterative solver on the interface (algebraic additive Schwarz preconditioner, possibly with sparsified and mixed-arithmetic variants)
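As an added note (standard domain decomposition algebra, not from the slides): eliminating the interior unknowns of each subdomain reduces the problem to a system on the interface unknowns, governed by the Schur complement, which is what the preconditioned iterative solver targets.

```latex
% Unknowns reordered into interior (I) and interface (\Gamma) blocks:
\begin{pmatrix} A_{II} & A_{I\Gamma}\\ A_{\Gamma I} & A_{\Gamma\Gamma} \end{pmatrix}
\begin{pmatrix} x_I \\ x_\Gamma \end{pmatrix}
=
\begin{pmatrix} b_I \\ b_\Gamma \end{pmatrix},
\qquad
S = A_{\Gamma\Gamma} - A_{\Gamma I}\,A_{II}^{-1}A_{I\Gamma},
\qquad
S\,x_\Gamma = b_\Gamma - A_{\Gamma I}\,A_{II}^{-1} b_I .
% The direct solver factorizes the block-diagonal A_{II} subdomain by
% subdomain; the iterative method runs only on the interface system.
```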

Page 49

Numerical behaviour of the preconditioners

[Figure: convergence history on a 43 million dof problem on 1000 System X processors, Virginia Tech]

Page 50

Parallel scaled scalability study

Parallel elapsed time with fixed sub-problem size (43,000 dof per subdomain) as the number of processors varies from 27 (1.1 x 10^6 dof) up to 1000 (43 x 10^6 dof)

Page 51

Hybrid iterative/direct strategies for solving large sparse linear systems resulting from the finite element discretization of the time-harmonic Maxwell equations

L. Giraud 1, A. Haidar 2, S. Lanteri 3

1 ENSEEIHT, Parallel Algorithms and Optimization Group, 2 rue Camichel, 31071 Toulouse, France
2 CERFACS, Parallel Algorithm Project, 42 Avenue Coriolis, 31057 Toulouse, France
3 INRIA, nachos project-team, 2004 Route des Lucioles, BP 93, 06902 Sophia Antipolis Cedex, France

Page 52

Context and objectives

Solution of time-harmonic electromagnetic wave propagation problems

Discretization in space:
• Discontinuous Galerkin time-harmonic methods
• Unstructured tetrahedral meshes
• Based on Pp nodal (Lagrange) interpolation
• Centered or upwind fluxes for the calculation of jump terms at cell boundaries

Discretization in space results in a large, sparse linear system with complex coefficients:
• Direct (sparse LU) solvers for 2D problems
• Parallel solvers are mandatory for 3D problems

Related publications:
• H. Fol (PhD thesis, 2006)
• V. Dolean, H. Fol, S. Lanteri and R. Perrussel (J. Comp. Appl. Math., to appear, 2007)

Page 53

Solution algorithm

Parallel hybrid iterative/direct solver:
• Domain decomposition framework
• Schwarz algorithm with characteristic interface conditions
• Sparse LU subdomain solver (MUMPS; P. R. Amestoy, I. S. Duff and J.-Y. L’Excellent, Comput. Meth. Appl. Mech. Engng., Vol. 184, 2000)
• Interface (Schur complement type) formulation
• Iterative interface solver (GMRES or BiCGstab)
• Algebraic block preconditioning of the interface system; exploits the structure of the system (free to construct and store)

Page 54

Scattering of a plane wave by a PEC cube

Plane wave frequency: 900 MHz

Tetrahedral mesh:
• # vertices = 67,590
• # elements = 373,632

Total number of DOF: 2,241,792

Page 55

Performance results on various numbers of processors (IBM JS21)

Scattering of a plane wave by a PEC cube: number of iterations

Precond.   8 procs   16 procs   32 procs
None       50        54         63
M1         24        25         26

Page 56

Scattering of a plane wave by a head

Plane wave frequency: 1800 MHz

Tetrahedral mesh:
• # vertices = 188,101
• # elements = 1,118,952

Total number of DOF: 6,713,712

Page 57

Performance results on various numbers of processors (Blue Gene/L)

Scattering of a plane wave by a head: number of iterations

Precond.   48 procs   64 procs   128 procs   256 procs
None       150        161        198         240
M1         40         42         51          62