Trends in Sparse Linear Research and Software Developments in France



Page 1

Trends in Sparse Linear Research and Software Developments in France

P. Amestoy 1, M. Daydé 1, I. Duff 2, L. Giraud 1, A. Haidar 3, S. Lanteri 4, J.-Y. L’Excellent 5, P. Ramet 6

1 IRIT – INPT / ENSEEIHT
2 CERFACS & RAL
3 CERFACS
4 INRIA, nachos project-team
5 INRIA / LIP-ENSL
6 INRIA Futurs / LaBRI

Page 2

Overview of sparse linear algebra techniques

From Iain Duff

CERFACS & RAL

Page 3

Introduction

Solution of

Ax = b

where A is large (the dimension may be 10^6 or greater) and sparse
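As an illustration (added here, not part of the original slides), a matrix of this size is stored in a compressed format that keeps only the nonzeros. A minimal sketch in C of the common compressed sparse row (CSR) layout, together with the matrix-vector product that the iterative methods discussed later rely on:

```c
#include <stddef.h>

/* Minimal CSR storage for an n-by-n sparse matrix: only the nnz
 * nonzero entries are kept, so a system with n = 10^6 and a few
 * nonzeros per row fits comfortably in memory. */
typedef struct {
    size_t  n;       /* matrix dimension                               */
    size_t  nnz;     /* number of stored nonzeros                      */
    size_t *rowptr;  /* n+1 offsets: row i is [rowptr[i], rowptr[i+1]) */
    size_t *colind;  /* column index of each stored entry, size nnz    */
    double *val;     /* numerical value of each stored entry, size nnz */
} csr_matrix;

/* Sparse matrix-vector product y = A*x. */
void csr_matvec(const csr_matrix *A, const double *x, double *y)
{
    for (size_t i = 0; i < A->n; ++i) {
        double s = 0.0;
        for (size_t k = A->rowptr[i]; k < A->rowptr[i + 1]; ++k)
            s += A->val[k] * x[A->colind[k]];
        y[i] = s;
    }
}
```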

Page 4

Direct methods

Key idea: factorize the matrix into a product of matrices that are easy to invert (triangular), possibly with permutations to preserve sparsity and maintain numerical stability.

E.g. Gaussian elimination: PAQ = LU, where
• P and Q are row / column permutations,
• L is lower triangular (sparse),
• U is upper triangular (sparse).

Then forward / backward substitution: Ly = Pb, then U Q^T x = y (see the sketch below).
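A hedged sketch (not from the slides) of the forward-substitution step Ly = c on a sparse triangular factor, assuming CSR storage as in the earlier sketch and assuming the diagonal entry is stored last in each row:

```c
#include <stddef.h>

/* Forward substitution L y = c for a sparse lower-triangular L in
 * CSR form, under the assumption that each row is nonempty and its
 * diagonal entry is the last one stored. Backward substitution with
 * U is symmetric, sweeping rows from n-1 down to 0. */
void sparse_forward_subst(size_t n,
                          const size_t *rowptr,  /* n+1 row offsets */
                          const size_t *colind,  /* column indices  */
                          const double *val,     /* nonzero values  */
                          const double *c,       /* right-hand side */
                          double *y)             /* solution, size n */
{
    for (size_t i = 0; i < n; ++i) {
        double s = c[i];
        size_t k;
        /* subtract contributions of already-computed unknowns */
        for (k = rowptr[i]; k + 1 < rowptr[i + 1]; ++k)
            s -= val[k] * y[colind[k]];
        y[i] = s / val[k];  /* divide by the diagonal entry */
    }
}
```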

Good points
• Can solve very large problems, even in 3D
• Generally fast, even on fancy machines
• Robust and well packaged

Bad points
• Need A explicitly, either element-wise or assembled: storage requirements can be very high
• Can be costly

Page 5

Iterative methods

Generate a sequence of vectors converging towards the solution of the linear system.

E.g. iterative methods based on Krylov subspaces:
K_k(A; r_0) = span{r_0, A r_0, ..., A^{k-1} r_0}, where r_0 = b − A x_0.

The idea is then to choose a suitable x_k ∈ x_0 + K_k(A; r_0), for example so that it minimizes ||b − A x_k||_2 (GMRES).

There are many Krylov methods, depending on the criterion used for choosing x_k.

Good points
• May not require forming A explicitly (see the sketch below)
• Usually very low storage requirements
• Can be efficient on 3D problems

Bad points
• May require many iterations to converge
• May require preconditioning
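To make one Krylov method concrete (an addition to the slides), here is a minimal unpreconditioned conjugate gradient loop for symmetric positive definite A; GMRES plays the analogous role for general matrices. The matrix enters only through matrix-vector products, which is why A may never need to be formed:

```c
#include <stdlib.h>

/* The matrix is accessed only through a user-supplied product
 * y = A*x, e.g. a wrapper around csr_matvec from the earlier sketch. */
typedef void (*matvec_fn)(size_t n, const double *x, double *y, void *ctx);

static double dot(size_t n, const double *u, const double *v)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) s += u[i] * v[i];
    return s;
}

/* Unpreconditioned conjugate gradients for SPD A: the iterate x_k
 * lives in x_0 + K_k(A; r_0). Returns the number of iterations. */
int cg_solve(size_t n, matvec_fn Amul, void *ctx,
             const double *b, double *x, int maxit, double rtol)
{
    double *r = malloc(n * sizeof *r);
    double *p = malloc(n * sizeof *p);
    double *q = malloc(n * sizeof *q);

    Amul(n, x, q, ctx);                          /* q = A x_0         */
    for (size_t i = 0; i < n; ++i) p[i] = r[i] = b[i] - q[i];
    double rho  = dot(n, r, r);
    double stop = rtol * rtol * dot(n, b, b);    /* relative residual */

    int k = 0;
    while (k < maxit && rho > stop) {
        Amul(n, p, q, ctx);
        double alpha = rho / dot(n, p, q);       /* step length       */
        for (size_t i = 0; i < n; ++i) { x[i] += alpha * p[i];
                                         r[i] -= alpha * q[i]; }
        double rho_new = dot(n, r, r);
        for (size_t i = 0; i < n; ++i)           /* next direction    */
            p[i] = r[i] + (rho_new / rho) * p[i];
        rho = rho_new;
        ++k;
    }
    free(r); free(p); free(q);
    return k;
}
```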

Page 6

Hybrid methods

Not just preconditioning:
• In ILU, sparse approximate inverse, or incomplete Cholesky preconditioners, core techniques of direct methods are used.

We focus on:
• using a direct method/code combined with an iterative method.

Page 7

Hybrid methods (cont’d)

Generic examples of hybrid methods are:
• Domain decomposition: direct method on local subdomains and/or direct preconditioner on the interface
• Block iterative methods: direct solver on subblocks
• Factorization of a nearby problem as a preconditioner

Page 8

Sparse direct methods

Page 9

Sparse Direct Solvers

Usually three steps:
• Pre-processing: symbolic factorization
• Numerical factorization
• Forward-backward substitution

[Diagram: simulating a phenomenon leads to solving a sparse linear system (matrix ~100 MB); factorizing the matrix needs ~10 GB. MUMPS: 71 GFlops on 512 processors of a CRAY T3E.]

Page 10

MUMPS and sparse direct methods

MUMPS team

http://graal.ens-lyon.fr/MUMPS and http://mumps.enseeiht.fr

Page 11

History

Main contributors since 1996: Patrick Amestoy, Iain Duff, Abdou Guermouche, Jacko Koster, Jean-Yves L’Excellent, Stéphane Pralet

Current development team:
• Patrick Amestoy, ENSEEIHT-IRIT
• Abdou Guermouche, LABRI-INRIA
• Jean-Yves L’Excellent, INRIA
• Stéphane Pralet, now working for SAMTECH

PhD students:
• Emmanuel Agullo, ENS-Lyon
• Tzvetomila Slavova, CERFACS

Page 12

Users

Around 1000 users, 2 requests per day; academic or industrial.

Types of applications:
• Structural mechanics, CAD
• Fluid dynamics, magnetohydrodynamics, physical chemistry
• Wave propagation and seismic imaging, ocean modelling
• Acoustics and electromagnetics propagation
• Biology
• Finite element analysis, numerical optimization, simulation
• ...

Page 13

MUMPS: A MUltifrontal Massively Parallel Solver

MUMPS solves large systems of linear equations of the form Ax = b by factorizing A into A = LU or A = LDL^T

Symmetric or unsymmetric matrices (partial pivoting)

Parallel factorization and solution phases (uniprocessor version also available)

Iterative refinement and backward error analysis

Various matrix input formats:
• assembled format
• distributed assembled format
• sum of elemental matrices

Partial factorization and Schur complement matrix

Version for complex arithmetic

Several orderings interfaced: AMD, AMF, PORD, METIS, SCOTCH

Page 14

The multifrontal method (Duff and Reid, 1983)

Memory is divided into two parts (that can overlap in time):
• the factors
• the active memory

The elimination tree represents task dependencies (its definition is recalled below).
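For reference (an addition; this is the standard definition rather than slide content), the elimination tree is derived from the sparsity pattern of the factor: the parent of column j is the row index of its first subdiagonal nonzero.

```latex
% Elimination tree of a factorization A = LL^T (or LU with symmetric
% pattern): node j's parent is the first row below j in which column j
% of the factor L has a nonzero entry.
\operatorname{parent}(j) \;=\; \min\,\{\, i > j \;:\; \ell_{ij} \neq 0 \,\}
```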

Page 15

Implementation

Distributed multifrontal solver (MPI / F90 based)

Dynamic distributed scheduling to accommodate both numerical fill-in and multi-user environments

Use of BLAS, ScaLAPACK

A fully asynchronous distributed solver (VAMPIR trace, 8 processors)

Page 16

MUMPS: 3 main steps (plus initialization and termination), illustrated in the sketch below:

• JOB=-1: initialization; set solver type (LU, LDL^T) and default parameters
• JOB=1: analyse the matrix, build an ordering, prepare the factorization
• JOB=2: (parallel) numerical factorization A = LU
• JOB=3: (parallel) solution step; forward and backward substitutions (Ly = b, Ux = y)
• JOB=-2: termination; deallocate all MUMPS data structures
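A sketch (not from the slides) of this JOB-driven sequence through the MUMPS C interface, for a toy 2x2 assembled system; the structure fields follow the style of the dmumps_c examples and should be checked against the installed MUMPS version:

```c
#include <mpi.h>
#include "dmumps_c.h"          /* double-precision MUMPS C interface */

#define USE_COMM_WORLD (-987654) /* MUMPS convention for MPI_COMM_WORLD */

int main(int argc, char **argv)
{
    int myid;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* Tiny diagonal system A = diag(2,3), b = (4,9), so x = (2,3),
       in assembled (coordinate) format with 1-based indices. */
    MUMPS_INT irn[] = {1, 2};
    MUMPS_INT jcn[] = {1, 2};
    double    a[]   = {2.0, 3.0};
    double    rhs[] = {4.0, 9.0};

    DMUMPS_STRUC_C id;
    id.job = -1;                  /* JOB=-1: initialize               */
    id.par = 1;                   /* host participates in computation */
    id.sym = 0;                   /* unsymmetric: LU factorization    */
    id.comm_fortran = USE_COMM_WORLD;
    dmumps_c(&id);

    if (myid == 0) {              /* matrix defined on the host       */
        id.n = 2; id.nz = 2;
        id.irn = irn; id.jcn = jcn; id.a = a; id.rhs = rhs;
    }

    id.job = 1;  dmumps_c(&id);   /* JOB=1: analysis and ordering     */
    id.job = 2;  dmumps_c(&id);   /* JOB=2: numerical factorization   */
    id.job = 3;  dmumps_c(&id);   /* JOB=3: solve; rhs now holds x    */

    id.job = -2; dmumps_c(&id);   /* JOB=-2: free MUMPS structures    */
    MPI_Finalize();
    return 0;
}
```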

Page 17

Car body: 148,770 unknowns and 5,396,386 nonzeros (MSC.Software)

AVAILABILITY

MUMPS is available free of charge

It is used on a number of platforms (CRAY, SGI, IBM, Linux, ...) and is downloaded once a day on average (applications in chemistry, aeronautics, geophysics, ...)

If you are interested in obtaining MUMPS for your own use, please refer to the MUMPS home page

Some MUMPS users: Boeing, BRGM, CEA, Dassault, EADS, EDF, MIT, NASA, SAMTECH, ...

Page 18

COMPETITIVE PERFORMANCE

Comparison with SuperLU, extracted from ACM TOMS 2001 and obtained with S. Li.

Factorization time in seconds vs. number of processors:

Matrix   Solver    1    4      8     16    32    64    128
Bbmat    MUMPS     -    32.1   10.8  12.3  10.4  9.1   7.8
         SuperLU   -    132.9  72.5  39.8  23.5  15.6  11.1
Ecl32    MUMPS     -    23.9   13.4  9.7   6.6   5.6   5.4
         SuperLU   -    48.5   26.6  15.7  9.6   7.6   5.6

Recent performance results (ops = number of operations)

Factorization time in seconds of large matrices on the CRAY T3E (1 proc: not enough memory):

Matrix         Ops (x10^9)   1 proc   64 procs   128 procs
AUDIKW_1       5682          3262.8   54.6       35.9
BRGM           31010         -        283.9      -
CONESHL_mod    1640          1099.0   19.6       12.1
CONV3D64       23880         -        207.5      146.5
ULTRASOUND80   3915          1542.2   37.1       29.5

Page 19

Functionalities, Features

Recent features:
• Symmetric indefinite matrices: preprocessing and 2-by-2 pivots
• Hybrid scheduling
• 2D cyclic distributed Schur complement
• Sparse, multiple right-hand sides
• Singular matrices with detection of null pivots
• Interfaces to MUMPS: Fortran, C, Matlab (S. Pralet, while at ENSEEIHT-IRIT) and Scilab (A. Fèvre, INRIA)

Future functionalities:
• Out-of-core execution
• Parallel analysis phase
• Rank-revealing algorithms
• Hybrid direct-iterative solvers (with Luc Giraud)

Page 20

On-going research on out-of-core solvers
(Ph.D. E. Agullo, ENS Lyon, and Ph.D. M. Slavova, CERFACS)

• Use disk storage to solve very large problems
• Parallel out-of-core factorization
• Preprocessing to minimize the volume of I/O
• Scheduling for the out-of-core solution phase

[Figure: ratio of active to total memory peak on different numbers of processors for several large problems]

Page 21

Hybrid Scheduling

Both memory and workload information are used to obtain better behaviour in terms of estimated memory, memory used, and factorization time in the context of parallel factorization algorithms.

The estimated memory is much closer to the memory effectively used.

Estimated and effective memory (millions of reals) for the factorization on 64 processors
Max: maximum amount of memory over the processors; Avg: average memory per processor

Matrix              MUMPS standard      MUMPS hybrid
                    Estim.    Real      Estim.    Real
AUDIKW_1      Max   118.7     50.7      73.9      41.9
              Avg   76.2      31.4      49.5      32.1
BRGM          Max   406.6     -         257.6     175.1
              Avg   185.0     -         158.9     123.5
CONESHL_mod   Max   59.6      33.1      33.8      22.5
              Avg   25.2      16.8      21.6      16.2
CONV3D64      Max   93.7      88.4      86.9      81.0
              Avg   68.7      60.5      60.9      60.2
ULTRASOUND80  Max   43.0      38.9      29.3      27.2
              Avg   26.2      22.4      23.5      21.8

Page 22

Memory minimizing schedules

Multifrontal methods can use a large amount of temporary data.

By decoupling task allocation and task processing, we can reduce the amount of temporary data: a new optimal schedule has been proposed in this context (Guermouche and L'Excellent, ACM TOMS).

Memory gains:

[Figure: active memory ratio, new algorithm vs. Liu's ordering]

Remark: gains relative to Liu's algorithm are 27.1, 17.5 and 19.6 for matrices 8, 9 and 10 (Gupta matrices), respectively.

Page 23

Preprocessing for symmetric matrices (S. Pralet, ENSEEIHT-IRIT)

Preprocessing: new scaling available, symmetric weighted matching and automatic tuning of the preprocessing strategies

Pivoting strategy (2-by-2 pivots and static pivoting)

Improvements:
• factorization time
• robustness, in particular on KKT systems arising from optimization
• memory estimation

Factorization time on a Linux PC (Pentium 4, 2.80 GHz):

Matrix      n        nnz      Old    New
CONT-300    180095   539396   -      4.2
BOYD2       466316   890091   -      2.6
STOKES128   49666    295938   1.5    0.8

Page 24

Scotch, PaStiX

PaStiX Team

INRIA Futurs / LaBRI

Page 25

PaStiX solver: Functionalities

• LL^T, LDL^T, LU factorization (symmetric pattern) with supernodal implementation
• Static pivoting (maximum weight matching) + iterative refinement / CG / GMRES
• 1D/2D block distribution + full BLAS3
• Supports external ordering libraries (Scotch ordering provided)
• MPI/threads implementation (SMP node / cluster / multi-core / NUMA)
• Single/double precision + real/complex arithmetic
• Requires only C + MPI + POSIX threads
• Multiple right-hand sides (direct factorization)
• Incomplete factorization ILU(k) preconditioner

Page 26

PaStiX solver (cont’d)

Available on INRIA GForge:
• All-in-one source code
• Easy to install on Linux or AIX systems
• Simple API (WSMP-like)
• Thread-safe (can be called from multiple threads in multiple MPI communicators)

Current work:
• Use of parallel ordering (PT-Scotch) and parallel symbolic factorization
• Dynamic scheduling inside SMP nodes (static mapping)
• Out-of-core implementation
• Generic finite element assembly (domain decomposition associated with the matrix distribution)

Page 27

Direct solver chain (in PaStiX)

Scotch (ordering & amalgamation)
→ Fax (block symbolic factorization)
→ Blend (refinement & mapping)
→ Sopalin (factorizing & solving)

Data flow: graph → partition → symbol matrix → distributed solver matrix → distributed factorized solver matrix → distributed solution

Analysis (sequential steps); factorization and solve (parallel).

Page 28

Direct solver chain (in PaStiX)

Sparse matrix ordering (minimizes fill-in)
• Scotch: a hybrid algorithm
  • incomplete nested dissection
  • the resulting subgraphs being ordered with an approximate minimum degree method under constraints (HAMD)

Page 29

Direct solver chain (in PaStiX)

The symbolic block factorization: Q(G,P) → Q(G,P)* = Q(G*,P)
⇒ linear in the number of blocks!

Dense block structures → only a few extra pointers to store the matrix

Page 30

Direct solver chain (in PaStiX)

[Figure: example mapping of dense blocks onto processors 1-8]

CPU time prediction

Exact memory resources

Page 31

Direct solver chain (in PaStiX)

Modern architecture management (SMP nodes): hybrid threads/MPI implementation (all processors in the same SMP node work directly in shared memory)

Fewer MPI communications and a lower parallel memory overhead


Page 32

Incomplete factorization in PaStiX

Start from the acknowledgement that it is difficult to build a generic and robust preconditioner for:
• large-scale 3D problems
• high-performance computing

Derive direct solver techniques into a preconditioner.

What's new: (dense) block formulation

Incomplete block symbolic factorization:
• remove blocks with algebraic criteria
• use an amalgamation algorithm to get dense blocks

Provides incomplete LDL^T, Cholesky, and LU factorizations (with static pivoting for symmetric pattern)
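For background (an addition to the slides), the classical level-of-fill rule behind ILU(k): every entry carries a level, initialized to 0 for original nonzeros and to infinity otherwise, and fill whose level exceeds k is dropped.

```latex
% When entry (i,j) is updated through pivot row p during elimination:
\operatorname{lev}(i,j) \;\leftarrow\; \min\bigl(\operatorname{lev}(i,j),\;
    \operatorname{lev}(i,p) + \operatorname{lev}(p,j) + 1\bigr)
% ILU(k) keeps an entry only if lev(i,j) <= k; in particular ILU(0)
% keeps exactly the sparsity pattern of A.
```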

Page 33

Numerical experiments (TERA1)

Successful approach for a large collection of industrial test cases (PARASOL, Harwell-Boeing, CEA) on an IBM SP3

TERA1 supercomputer of CEA Ile-de-France (ES45 SMP nodes, 4 procs each)

COUPOLE40000:
• 26.5 x 10^6 unknowns
• 1.5 x 10^10 NNZL and 10.8 Tflops
• 356 procs: 34 s / 512 procs: 27 s / 768 procs: 20 s
  (> 500 Gflop/s, about 35% of peak performance)

Page 34

Numerical experiments (TERA10)

Successful approach on a 3D mesh problem with about 30 million unknowns on the TERA10 supercomputer

But memory is the bottleneck!

ODYSSEE code of French CEA/CESTA:
electromagnetism code (finite element method + integral equation), complex double precision, Schur complement

Page 35

Links

Scotch: http://gforge.inria.fr/projects/scotch
PaStiX: http://gforge.inria.fr/projects/pastix
MUMPS: http://mumps.enseeiht.fr/ and http://graal.ens-lyon.fr/MUMPS
ScAlApplix: http://www.labri.fr/project/scalapplix

ANR CIGC Numasis; ANR CIS Solstice & Aster

Latest publication, to appear in Parallel Computing: On finding approximate supernodes for an efficient ILU(k) factorization

For more publications, see: http://www.labri.fr/~ramet/

Page 36

Industrial applications

OSSAU code of French CEA/CESTA: 2D / 3D structural mechanics code

ODYSSEE code of French CEA/CESTA: electromagnetism code (finite element method + integral equation), complex double precision, Schur complement

Fluid mechanics: LU factorization with static pivoting (SuperLU-like approach)

Page 37

Other parallel sparse direct codes

Shared-memory codes

Code       Technique           Scope     Availability (www.)
MA41       Multifrontal        UNS       cse.clrc.ac.uk/Activity/HSL
MA49       Multifrontal QR     RECT      cse.clrc.ac.uk/Activity/HSL
PanelLLT   Left-looking        SPD       Ng
PARDISO    Left-right looking  UNS       Schenk
PSL        Left-looking        SPD/UNS   SGI product
SPOOLES    Fan-in              SYM/UNS   netlib.org/linalg/spooles
SuperLU    Left-looking        UNS       nersc.gov/xiaoye/SuperLU
WSMP       Multifrontal        SYM/UNS   IBM product

Page 38

Other parallel sparse direct codes

Distributed-memory codes

Code      Technique     Scope     Availability (www.)
CAPSS     Multifrontal  SPD       netlib.org/scalapack
MUMPS     Multifrontal  SYM/UNS   mumps.enseeiht.fr, graal.ens-lyon.fr/MUMPS
PaStiX    Fan-in        SYM/UNS   gforge.inria.fr/pastix
PSPASES   Multifrontal  SPD       cs.umn.edu/mjoshi/pspases
SPOOLES   Fan-in        SYM/UNS   netlib.org/linalg/spooles
SuperLU   Fan-out       UNS       nersc.gov/xiaoye/SuperLU
S+        Fan-out       UNS       cs.ucsb.edu/research/S+
WSMP      Multifrontal  SYM       IBM product

Page 39

Sparse solver for Ax = b: only a black box?

Preprocessing and postprocessing:
• Symmetric permutations to reduce fill: Ax = b becomes (PAP^T)(Px) = Pb
• Numerical pivoting, scaling to preserve numerical accuracy
• Maximum transversal (set large entries on the diagonal)
• Preprocessing for parallelism (influence of task mapping on parallelism)
• Iterative refinement, error analysis (see the sketch below)

Default (often automatic/adaptive) settings of the options are available. However, a better knowledge of the options can help the user to further improve memory usage, time to solution, and numerical accuracy.
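To illustrate the postprocessing step (an added sketch with hypothetical callback names), one pass of iterative refinement reuses the already computed factors to correct the solution against a residual formed with the original matrix:

```c
#include <stddef.h>

/* Hypothetical callbacks: `matvec` applies the original A, while
 * `factored_solve` applies the already computed LU factors. */
typedef void (*matvec_fn)(size_t n, const double *x, double *y, void *ctx);
typedef void (*solve_fn)(size_t n, const double *rhs, double *sol, void *ctx);

/* One step of iterative refinement. The residual uses the original
 * matrix, so the corrected x can recover accuracy lost to, e.g.,
 * static pivoting during the factorization. */
void refine_once(size_t n, matvec_fn matvec, solve_fn factored_solve,
                 void *ctx, const double *b, double *x,
                 double *r /* scratch */, double *d /* scratch */)
{
    matvec(n, x, r, ctx);                               /* r = A x      */
    for (size_t i = 0; i < n; ++i) r[i] = b[i] - r[i];  /* r = b - A x  */
    factored_solve(n, r, d, ctx);                       /* solve A d = r */
    for (size_t i = 0; i < n; ++i) x[i] += d[i];        /* x += d        */
}
```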

Page 40

The GRID-TLSE Project

A web expert site for sparse linear algebra

Page 41

Overview of GRID-TLSE: http://gridtlse.org

Supported by:
• ANR LEGO Project
• ANR SOLSTICE Project
• CNRS / JST Program: REDIMPS Project
• ACI GRID-TLSE Project

Partners: [logos]

Page 42

Sparse Matrices Expert Site?

Expert site: help users in choosing the right solvers and their parameters for a given problem

Chosen approach: expert scenarios which answer common user requests

Main goal: provide a friendly test environment for expert and non-expert users of sparse linear algebra software

Page 43

Sparse Matrices Expert Site?

Easy access to:
• Software and tools;
• A wide range of computer architectures;
• Matrix collections;
• Expert scenarios.

Also: provide a testbed for sparse linear algebra software

Page 44

Why use the grid?

Sparse linear algebra software makes use of sophisticated algorithms for (pre-/post-) processing the matrix.

Multiple parameters interact in the efficient execution of a sparse direct solver:
• Ordering;
• Amount of memory;
• Architecture of computer;
• Available libraries.

Determining the best combination of parameter values is a multi-parametric problem, well-suited for execution over a grid.

Page 45

Components

How do software X and Y compare in terms of memory and CPU on my favourite matrix A?

[Bar chart: memory and CPU of MUMPS vs. SuperLU on matrix GRE]

Software components:

• Weaver: high-level administrator for the deployment and exploitation of services on the grid

• Websolve: a Web interface to start services on the grid

• Middleware: DIET, developed within GRID-ASP (LIP, Loria Resedas, LIFC-SDRP) and soon ITBL (?)

Page 46

Hybrid Solvers

Page 47

Parallel hybrid iterative/direct solver for the solution of large sparse linear systems arising from 3D elliptic discretizations

L. Giraud 1, A. Haidar 2, S. Watson 3

1 ENSEEIHT, Parallel Algorithms and Optimization Group, 2 rue Camichel, 31071 Toulouse, France
2 CERFACS, Parallel Algorithm Project, 42 Avenue Coriolis, 31057 Toulouse, France
3 Departments of Computer Science and Mathematics, Virginia Polytechnic Institute, USA

Page 48

Non-overlapping domain decomposition

A natural approach for PDEs, extended to general sparse matrices.

Partition the problem into subdomains (subgraphs):
• Use a direct solver on the subdomains (MUMPS package)
• Robust algebraically preconditioned iterative solver on the interface (algebraic additive Schwarz preconditioner, possibly with sparsified and mixed-arithmetic variants)
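As an added note (standard domain decomposition algebra, not from the slides): eliminating the interior unknowns of each subdomain reduces the problem to a system on the interface unknowns, governed by the Schur complement, which is what the preconditioned iterative solver targets.

```latex
% Unknowns reordered into interior (I) and interface (\Gamma) blocks:
\begin{pmatrix} A_{II} & A_{I\Gamma}\\ A_{\Gamma I} & A_{\Gamma\Gamma} \end{pmatrix}
\begin{pmatrix} x_I \\ x_\Gamma \end{pmatrix}
=
\begin{pmatrix} b_I \\ b_\Gamma \end{pmatrix},
\qquad
S = A_{\Gamma\Gamma} - A_{\Gamma I}\,A_{II}^{-1}A_{I\Gamma},
\qquad
S\,x_\Gamma = b_\Gamma - A_{\Gamma I}\,A_{II}^{-1} b_I .
% The direct solver factorizes the block-diagonal A_{II} subdomain by
% subdomain; the iterative method runs only on the interface system.
```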

Page 49

Numerical behaviour of the preconditioners

[Figure: convergence history on a 43 million dof problem on 1000 System X processors, Virginia Tech]

Page 50

Parallel scaled scalability study

Parallel elapsed time with fixed sub-problem size (43,000 dof per subdomain) as the number of processors varies from 27 (1.1 x 10^6 dof) up to 1000 (43 x 10^6 dof)

Page 51

Hybrid iterative/direct strategies for solving large sparse linear systems resulting from the finite element discretization of the time-harmonic Maxwell equations

L. Giraud 1, A. Haidar 2, S. Lanteri 3

1 ENSEEIHT, Parallel Algorithms and Optimization Group, 2 rue Camichel, 31071 Toulouse, France
2 CERFACS, Parallel Algorithm Project, 42 Avenue Coriolis, 31057 Toulouse, France
3 INRIA, nachos project-team, 2004 Route des Lucioles, BP 93, 06902 Sophia Antipolis Cedex, France

Page 52

Context and objectives

Solution of time-harmonic electromagnetic wave propagation problems

Discretization in space:
• Discontinuous Galerkin time-harmonic methods
• Unstructured tetrahedral meshes
• Based on Pp nodal (Lagrange) interpolation
• Centered or upwind fluxes for the calculation of jump terms at cell boundaries

Discretization in space results in a large, sparse linear system with complex coefficients:
• Direct (sparse LU) solvers for 2D problems
• Parallel solvers are mandatory for 3D problems

Related publications:
• H. Fol (PhD thesis, 2006)
• V. Dolean, H. Fol, S. Lanteri and R. Perrussel (J. Comp. Appl. Math., to appear, 2007)

Page 53

Solution algorithm

Parallel hybrid iterative/direct solver:
• Domain decomposition framework
• Schwarz algorithm with characteristic interface conditions
• Sparse LU subdomain solver (MUMPS; P. R. Amestoy, I. S. Duff and J.-Y. L’Excellent, Comput. Meth. Appl. Mech. Engng., Vol. 184, 2000)
• Interface (Schur complement type) formulation
• Iterative interface solver (GMRES or BiCGstab)
• Algebraic block preconditioning of the interface system; exploits the structure of the system (free to construct and store)

Page 54

Scattering of a plane wave by a PEC cube

Plane wave frequency: 900 MHz

Tetrahedral mesh:
• # vertices = 67,590
• # elements = 373,632

Total number of DOF: 2,241,792

Page 55

Performance results on various numbers of processors (IBM JS21)

Scattering of a plane wave by a PEC cube: number of iterations

Precond.   8 procs   16 procs   32 procs
None       50        54         63
M1         24        25         26

Page 56

Scattering of a plane wave by a head

Plane wave frequency: 1800 MHz

Tetrahedral mesh:
• # vertices = 188,101
• # elements = 1,118,952

Total number of DOF: 6,713,712

Page 57

Performance results on various numbers of processors (Blue Gene/L)

Scattering of a plane wave by a head: number of iterations

Precond.   48 procs   64 procs   128 procs   256 procs
None       150        161        198         240
M1         40         42         51          62