Scalable Parallel Multigrid for Finite Element Computations

Minisymposium MS8 @ SIAM CS&E 2007: Large Scale Parallel Multigrid

Lehrstuhl für Informatik 10 (Systemsimulation), Universität Erlangen-Nürnberg
www10.informatik.uni-erlangen.de
SIAM CS&E, Feb 2007

B. Bergen (LSS Erlangen, now Los Alamos), T. Gradl, Chr. Freundl (LSS Erlangen)
U. Rüde (LSS Erlangen, [email protected])
Overview

Motivation
• Supercomputers
• Example Applications of Large-Scale FE Simulation
Scalable Finite Element Solvers
• Scalable Algorithms: Multigrid
• Scalable Architectures:
  • Hierarchical Hybrid Grids (HHG)
  • Parallel Expression Templates for PDEs (ParExPDE)
Results
Outlook: Towards Peta-Scale FE Solvers
Part I

Motivation
HHG Motivation: Structured vs. Unstructured Grids (on Hitachi SR 8000)

gridlib/HHG MFlop/s rates for matrix-vector multiplication on one node, compared with highly tuned JDS results for sparse matrices.

Pet Dinosaur HLRB-I: Hitachi SR 8000 at the Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften. No. 5 in the TOP-500 at the time of its installation in 2000.
Our New Play Station: HLRB-II

HLRB-II: SGI Altix 4700 at the Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften. No. 18 in the TOP-500 of Nov. 2006.

• Shared memory architecture
• 4096 CPUs
• 17.5 TBytes RAM
• NUMAlink network
• 26.2 TFLOP/s peak performance
• 24.5 TFLOP/s LINPACK performance
• will be doubled after the 2007 upgrade

Test ground for scalability experiments
Motivation II: The DiMe Project

Data Local Iterative Methods for the Efficient Solution of Partial Differential Equations
• Cache optimizations for sparse matrix codes
• High-performance multigrid solvers
• Efficient (free surface) LBM codes for CFD

Collaboration project, funded 1996 - 20xy by the DFG
www10.informatik.uni-erlangen.de/de/Research/Projects/DiME/
Part II-a

Towards Scalable FE Software
Scalable Algorithms: Multigrid
Example: Flow-Induced Noise

Flow around a small square obstacle
Relatively simple geometries, but fine resolution is needed to resolve the physics
Images by M. Escobar (Dept. of Sensor Technology + LSTM)
What is Multigrid?

Has nothing to do with "grid computing". A general methodology:
• multi-scale (actually it is the "original")
• many different applications
• developed in the 1970s - ...

Useful e.g. for solving elliptic PDEs:
• large sparse systems of equations
• iterative
• convergence rate independent of problem size
• asymptotically optimal complexity -> algorithmic scalability!
• can solve e.g. a 2D Poisson problem in ~30 operations per grid point
• efficient parallelization - if one knows how to do it
• best basis for fully scalable FE solvers
Multigrid: V-Cycle
[V-cycle diagram: relax on the fine grid, compute the residual, restrict it to the coarser grid, solve there (by recursion), interpolate the correction back, and correct.]
Goal: solve A_h u_h = f_h using a hierarchy of grids
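To make the cycle structure concrete, below is a minimal, self-contained 1D Poisson V-cycle in C++ with a weighted Jacobi smoother. It is an illustrative sketch only (the 1D model problem and all names are our own choice), not the HHG or ParExPDE implementation.

#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

using Vec = std::vector<double>;   // interior grid values only

// nu sweeps of weighted Jacobi for (1/h^2) * tridiag(-1, 2, -1) u = f
void smooth(Vec& u, const Vec& f, double h, int nu) {
    const double w = 2.0 / 3.0;
    for (int s = 0; s < nu; ++s) {
        Vec un = u;
        for (std::size_t i = 0; i < u.size(); ++i) {
            double left  = (i > 0)            ? u[i - 1] : 0.0;
            double right = (i + 1 < u.size()) ? u[i + 1] : 0.0;
            un[i] = (1.0 - w) * u[i] + w * 0.5 * (left + right + h * h * f[i]);
        }
        u = un;
    }
}

Vec residual(const Vec& u, const Vec& f, double h) {     // r_h = f_h - A_h u_h
    Vec r(u.size());
    for (std::size_t i = 0; i < u.size(); ++i) {
        double left  = (i > 0)            ? u[i - 1] : 0.0;
        double right = (i + 1 < u.size()) ? u[i + 1] : 0.0;
        r[i] = f[i] - (2.0 * u[i] - left - right) / (h * h);
    }
    return r;
}

Vec restrictFW(const Vec& r) {                 // full weighting to the next coarser grid
    Vec rc((r.size() - 1) / 2);
    for (std::size_t i = 0; i < rc.size(); ++i)
        rc[i] = 0.25 * r[2 * i] + 0.5 * r[2 * i + 1] + 0.25 * r[2 * i + 2];
    return rc;
}

Vec interpolate(const Vec& ec, std::size_t nFine) {      // linear interpolation to the finer grid
    Vec e(nFine, 0.0);
    for (std::size_t i = 0; i < ec.size(); ++i) {
        e[2 * i + 1]  = ec[i];
        e[2 * i]     += 0.5 * ec[i];
        e[2 * i + 2] += 0.5 * ec[i];
    }
    return e;
}

// One V(nu1, nu2) cycle for A_h u_h = f_h; the recursion stops on a one-point grid.
void vCycle(Vec& u, const Vec& f, double h, int nu1, int nu2) {
    if (u.size() == 1) { u[0] = 0.5 * h * h * f[0]; return; }   // coarsest grid: solve exactly
    smooth(u, f, h, nu1);                          // pre-smoothing: relax on A_h u_h = f_h
    Vec r  = residual(u, f, h);                    // compute the residual
    Vec rc = restrictFW(r);                        // restrict the residual to the coarse grid
    Vec ec(rc.size(), 0.0);                        // coarse-grid correction, zero initial guess
    vCycle(ec, rc, 2.0 * h, nu1, nu2);             // solve the coarse problem by recursion
    Vec e = interpolate(ec, u.size());             // interpolate the correction
    for (std::size_t i = 0; i < u.size(); ++i) u[i] += e[i];    // correct
    smooth(u, f, h, nu2);                          // post-smoothing
}

int main() {
    const std::size_t n = 128;                     // number of intervals (a power of two)
    const double h = 1.0 / n, pi = 3.14159265358979323846;
    Vec u(n - 1, 0.0), f(n - 1);
    for (std::size_t i = 0; i < f.size(); ++i)     // -u'' = pi^2 sin(pi x), exact u = sin(pi x)
        f[i] = pi * pi * std::sin(pi * (i + 1) * h);
    for (int c = 0; c < 10; ++c) vCycle(u, f, h, 2, 2);
    std::cout << "u(0.5) = " << u[n / 2 - 1] << "  (exact continuous solution: 1)\n";
    return 0;
}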
Parallel High Performance FE Multigrid

Parallelize "plain vanilla" multigrid:
• partition the domain
• parallelize all operations on all grids
• use clever data structures

Do not worry (so much) about coarse grids:
• idle processors?
• short messages?
• sequential dependency in the grid hierarchy?

Why we do not use Domain Decomposition:
• DD without a coarse grid does not scale (algorithmically) and is inefficient for large problems / many processors
• DD with coarse grids is still less efficient than multigrid and is as difficult to parallelize
Part II-b

Towards Scalable FE Software
Scalable Architecture: Hierarchical Hybrid Grids and Parallel Expression Templates for PDEs
Hierarchical Hybrid Grids (HHG)

Unstructured input grid
• resolves the geometry of the problem domain
Patch-wise regular refinement
• generates nested grid hierarchies naturally suitable for geometric multigrid algorithms
New: modify storage formats and operations on the grid to exploit the regular substructures

Does an unstructured grid with 100 000 000 000 elements make sense?

HHG - Ultimate Parallel FE Performance!
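As one illustration of what "exploiting the regular substructures" can mean, the toy kernel below applies a constant-coefficient 7-point stencil to the structured interior of a refined patch stored as one contiguous array, with no sparse-matrix indexing at all. This is a hypothetical sketch of the idea, not the actual HHG data structures (which work with stencils arising from tetrahedral elements).

#include <cstddef>
#include <vector>

// Apply a constant-coefficient 7-point stencil to the interior of an
// n x n x n structured block stored as one contiguous array.
void applyStencil(const std::vector<double>& u, std::vector<double>& v,
                  std::size_t n, const double s[7]) {
    auto idx = [n](std::size_t i, std::size_t j, std::size_t k) { return (i * n + j) * n + k; };
    for (std::size_t i = 1; i + 1 < n; ++i)
        for (std::size_t j = 1; j + 1 < n; ++j)
            for (std::size_t k = 1; k + 1 < n; ++k)
                v[idx(i, j, k)] = s[0] * u[idx(i, j, k)]
                                + s[1] * u[idx(i - 1, j, k)] + s[2] * u[idx(i + 1, j, k)]
                                + s[3] * u[idx(i, j - 1, k)] + s[4] * u[idx(i, j + 1, k)]
                                + s[5] * u[idx(i, j, k - 1)] + s[6] * u[idx(i, j, k + 1)];
}

int main() {
    const std::size_t n = 65;                                  // one regularly refined patch
    std::vector<double> u(n * n * n, 1.0), v(n * n * n, 0.0);
    const double laplace[7] = {6.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0};
    applyStencil(u, v, n, laplace);
    return 0;
}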
HHG refinement example
Input Grid
HHG refinement example
Refinement Level one
HHG refinement example
Refinement Level Two
HHG refinement example

Structured Interior
HHG refinement example
Structured Interior
HHG refinement example
Edge Interior
HHG refinement example
Edge Interior
HHG for Parallelization

Use regular HHG patches for partitioning the domain
HHG Parallel Update Algorithm

for each vertex do
    apply operation to vertex
end for
update vertex primary dependencies

for each edge do
    copy from vertex interior
    apply operation to edge
    copy to vertex halo
end for
update edge primary dependencies

for each element do
    copy from edge/vertex interiors
    apply operation to element
    copy to edge/vertex halos
end for
update secondary dependencies
HHG for Parallelization
Parallel HHG Framework: Design Goals

To achieve good parallel scalability:
• minimize latency by reducing the number of messages that must be sent
• optimize for high-bandwidth interconnects ⇒ large messages
• avoid local copying into MPI buffers
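A minimal sketch of the communication pattern these goals suggest (hypothetical data layout and names, not the actual HHG interface): all values shared with a given neighbor are kept in one contiguous buffer, so each neighbor receives exactly one large message per update, posted directly from application memory without intermediate packing.

#include <mpi.h>
#include <vector>

struct Neighbor {
    int rank;
    std::vector<double> sendBuf;   // contiguous boundary data owned locally
    std::vector<double> recvBuf;   // contiguous halo data owned by the neighbor
};

// One nonblocking send and one nonblocking receive per neighbor, straight
// from/into the application buffers; no extra copy into MPI staging buffers.
void exchangeHalos(std::vector<Neighbor>& neighbors, MPI_Comm comm) {
    std::vector<MPI_Request> requests;
    requests.reserve(2 * neighbors.size());
    for (Neighbor& nb : neighbors) {
        requests.emplace_back();
        MPI_Irecv(nb.recvBuf.data(), static_cast<int>(nb.recvBuf.size()), MPI_DOUBLE,
                  nb.rank, 0, comm, &requests.back());
        requests.emplace_back();
        MPI_Isend(nb.sendBuf.data(), static_cast<int>(nb.sendBuf.size()), MPI_DOUBLE,
                  nb.rank, 0, comm, &requests.back());
    }
    MPI_Waitall(static_cast<int>(requests.size()), requests.data(), MPI_STATUSES_IGNORE);
}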
ParExPDE

• A library for the user-friendly, rapid development of numerical PDE solvers on parallel (super-)computers
• Provides a high-level and intuitive user interface without compromising on efficiency
• Regularly refined hexahedral grids
• Support for deformed edges
Introduction to Expression Templates

• Encapsulation of arithmetic expressions in a tree-like structure by using C++ template constructs
• The expression tree is resolved at compile time
• Avoidance of unnecessary copying and generation of temporary objects

Elegance & Performance
Introduction to Expression Templates

Evaluation of an expression:

template <class T>
void Vector::operator=(Expr<T>& expr) {
    for (int i = 0; i < _size; i++)
        _values[i] = expr.valueAt(i);
}
Expression tree for z = a * x + y:

Expr< ExprBinOp< ., ., OpAdd > >
  Expr< ExprBinOp< ., ., OpMult > >
    ExprLiteral( a )
    ExprVector( x )
  ExprVector( y )
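For readers unfamiliar with the technique, here is a minimal, self-contained sketch of the mechanism, using class names modeled on the diagram above (an illustrative toy, not the ParExPDE implementation): the right-hand side a * x + y is encoded in a nested template type, and the assignment evaluates it in a single fused loop without temporary vectors.

#include <cstddef>
#include <iostream>
#include <vector>

// CRTP base: everything that can appear in an expression derives from Expr<Self>.
template <class E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
    double valueAt(std::size_t i) const { return self().valueAt(i); }
};

struct ExprLiteral : Expr<ExprLiteral> {            // leaf: a scalar literal
    double v;
    explicit ExprLiteral(double v_) : v(v_) {}
    double valueAt(std::size_t) const { return v; }
};

struct ExprVector : Expr<ExprVector> {              // leaf: a vector of values
    std::vector<double> values;
    explicit ExprVector(std::size_t n, double init = 0.0) : values(n, init) {}
    double valueAt(std::size_t i) const { return values[i]; }

    template <class T>
    ExprVector& operator=(const Expr<T>& expr) {    // one fused loop, no temporaries
        for (std::size_t i = 0; i < values.size(); ++i)
            values[i] = expr.valueAt(i);
        return *this;
    }
};

struct OpMult { static double apply(double a, double b) { return a * b; } };
struct OpAdd  { static double apply(double a, double b) { return a + b; } };

template <class L, class R, class Op>               // inner node: evaluates lazily
struct ExprBinOp : Expr<ExprBinOp<L, R, Op>> {
    const L& l; const R& r;
    ExprBinOp(const L& l_, const R& r_) : l(l_), r(r_) {}
    double valueAt(std::size_t i) const { return Op::apply(l.valueAt(i), r.valueAt(i)); }
};

template <class L, class R>
ExprBinOp<L, R, OpMult> operator*(const Expr<L>& l, const Expr<R>& r) {
    return ExprBinOp<L, R, OpMult>(l.self(), r.self());
}
template <class L, class R>
ExprBinOp<L, R, OpAdd> operator+(const Expr<L>& l, const Expr<R>& r) {
    return ExprBinOp<L, R, OpAdd>(l.self(), r.self());
}

int main() {
    ExprVector x(8, 2.0), y(8, 1.0), z(8);
    ExprLiteral a(3.0);
    z = a * x + y;                      // every entry becomes 3*2 + 1 = 7
    std::cout << z.valueAt(0) << "\n";  // prints 7
    return 0;
}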
Part II-c

Towards Scalable FE Software
Performance Results
Node Performance is Difficult! (B. Gropp)

DiMe project: cache-aware multigrid (1996 - ...)
Performance of a 3D MG smoother for a 7-point stencil, in MFlop/s, on Itanium (1.4 GHz):

grid size        513³    257³    129³     65³     33³     17³
standard          579     490     677     715    1344    1072
no blocking       819     849    1065     995    1417    2445
2x blocking      1282    1284    1319    1312    1913    2400
3x blocking      2049    2134    2140    2167    2389    2420
Array Padding
Temporal blocking - in EPIC assembly language
Software pipelining in the extreme (M. Stürmer - J. Treibig)
Node Performance is Possible!
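To illustrate the kind of loop restructuring behind the blocking results above, the sketch below spatially blocks a 7-point Jacobi sweep so that a small tile of three grid planes stays in cache while the outer dimension is streamed. It is a generic illustration under our own naming, not the DiMe kernels themselves (which additionally use temporal blocking of several sweeps, array padding, and software pipelining).

#include <algorithm>
#include <cstddef>
#include <vector>

// One blocked Jacobi sweep for the 7-point Laplacian on an n^3 grid
// (k is the unit-stride direction). The (jb x kb) tile is chosen so that
// three planes of the tile fit into cache and are reused while i streams.
void jacobiSweepBlocked(const std::vector<double>& u, std::vector<double>& un,
                        const std::vector<double>& f, std::size_t n, double h,
                        std::size_t jb, std::size_t kb) {
    auto idx = [n](std::size_t i, std::size_t j, std::size_t k) { return (i * n + j) * n + k; };
    const double h2 = h * h;
    for (std::size_t jj = 1; jj < n - 1; jj += jb)          // block the j loop
        for (std::size_t kk = 1; kk < n - 1; kk += kb)      // block the k loop
            for (std::size_t i = 1; i < n - 1; ++i)         // stream through i within the tile
                for (std::size_t j = jj; j < std::min(jj + jb, n - 1); ++j)
                    for (std::size_t k = kk; k < std::min(kk + kb, n - 1); ++k)
                        un[idx(i, j, k)] = (h2 * f[idx(i, j, k)]
                            + u[idx(i - 1, j, k)] + u[idx(i + 1, j, k)]
                            + u[idx(i, j - 1, k)] + u[idx(i, j + 1, k)]
                            + u[idx(i, j, k - 1)] + u[idx(i, j, k + 1)]) / 6.0;
}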
Single-Processor HHG Performance on Itanium for Relaxation of a Tetrahedral Finite Element Mesh
Multigrid Convergence Rates
Scaleup: HHG and ParExPDE
Speedup: HHG and ParExPDE
HHG: Parallel Scalability
Parallel scalability of a Poisson problem discretized by tetrahedral finite elements.
Times are for six V(3,3) cycles on an SGI Altix (Itanium-2, 1.6 GHz).

#Procs   #DOFs x 10^6   #Els x 10^6   #Input Els    GFLOP/s    Time [s]
    64          2,144        12,884        6,144      100/75         68
   128          4,288        25,769       12,288     200/147         69
   256          8,577        51,539       24,576     409/270         76
   512         17,167       103,079       49,152     762/545         75
  1024         17,167       103,079       49,152   1,456/964         43
Largest problem solved to date: 1.28 x 10^11 DOFs (45,900 input elements) on 3825 procs: 7 s per V(2,2) cycle.
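For perspective, two numbers derived directly from the figures above (simple arithmetic, not additional measurements):

\frac{1.28 \times 10^{11}\ \text{DOFs}}{3825\ \text{processes}} \approx 3.3 \times 10^{7}\ \text{DOFs per process},
\qquad
\frac{1.28 \times 10^{11}\ \text{DOFs}}{7\ \text{s}} \approx 1.8 \times 10^{10}\ \text{unknowns updated per second in one V(2,2) cycle.}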
Why Multigrid?

A direct elimination banded solver has complexity O(N^2.3) for a 3-D problem. This becomes

338,869,529,114,764,631,553,758 = O(10^23)

operations for our problem size. At one Teraflop/s this would result in a runtime of 10,000 years, which could be reduced to 10 years on a Petaflop/s system.
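A quick back-of-the-envelope check of these runtimes, using the operation count stated above:

\frac{3.4 \times 10^{23}\ \text{operations}}{10^{12}\ \text{flop/s}} \approx 3.4 \times 10^{11}\ \text{s} \approx 1.1 \times 10^{4}\ \text{years},
\qquad
\frac{3.4 \times 10^{23}\ \text{operations}}{10^{15}\ \text{flop/s}} \approx 3.4 \times 10^{8}\ \text{s} \approx 11\ \text{years.}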
SIAM J. Scientific Computing: Special Issue on CS&E
http://www.siam.org/journals/sisc.php

Guest Editors-in-Chief: Chris Johnson, David Keyes, Ulrich Ruede

Call for papers:
• Modeling techniques
• Simulation techniques
• Analysis techniques
• Tools for realistic problems

Deadline: April 30, 2007

Guest Editorial Board: Gyan Bhanot, IBM Watson Research Center, Yorktown Heights; Rupak Biswas, NASA Ames Research Center, Moffett Field, CA; Edmond Chow, D. E. Shaw Research, LLC; Phil Colella, Lawrence Berkeley National Laboratory; Yuefan Deng, Stony Brook University, Stony Brook, NY / Brookhaven National Lab; Lori Freitag Diachin, Lawrence Livermore National Laboratory; Omar Ghattas, The University of Texas at Austin; Laurence Halpern, Univ. Paris XIII; Robert Harrison, Oak Ridge National Laboratory; Bruce Hendrickson, Sandia National Laboratories; Kirk Jordan, IBM, Cambridge MA; Tammy Kolda, Sandia National Laboratories; Louis Komzsik, UGS Corp., Cypress, CA, USA; Ulrich Langer, Johann Radon Institute for Computational and Applied Mathematics (RICAM), Linz; Hans Petter Langtangen, Simula Research Laboratory, Oslo; Steven Lee, Lawrence Livermore National Laboratory; Kengo Nakajima, University of Tokyo; Aiichiro Nakano, University of Southern California, Los Angeles; Esmond G. Ng, Lawrence Berkeley National Laboratory; Stefan Turek, University of Dortmund, Germany; Andy Wathen, Oxford University; Margaret Wright, New York University; David P. Young, Boeing.
Part IV

Conclusions
Conclusions

Supercomputer Performance is Easy!
• If parallel efficiency is bad, choose a slower serial algorithm: it is probably easier to parallelize and will make your speedups look much more impressive.
• Introduce the "CrunchMe" variable for getting high Flops rates. Advanced method: disguise CrunchMe by using an inefficient (but compute-intensive) algorithm from the start.
• Introduce the "HitMe" variable to get good cache hit rates. Advanced version: disguise HitMe within "clever data structures" that introduce a lot of overhead.
• Never cite "time-to-solution": who cares whether you solve a real-life problem anyway; it is the MachoFlops that interest the people who pay for your research.
• Never waste your time by trying to use a complicated algorithm in parallel (such as multigrid): the more primitive the algorithm, the easier it is to maximize your MachoFlops.
References

• B. Bergen, F. Hülsemann, U. Rüde: Is 1.7 x 10^10 unknowns the largest finite element system that can be solved today? SuperComputing, Nov. 2005. (Answer: no, it's 1.3 x 10^11 since Dec. 2006.) ISC Award 2006 for Application Scalability.
• B. Bergen, T. Gradl, F. Hülsemann, U. Rüde: A massively parallel multigrid method for finite elements. Computing in Science & Engineering, Vol. 8, No. 6.
Acknowledgements

Collaborators
• In Erlangen: WTM, LSE, LSTM, LGDV, RRZE, Neurozentrum, Radiologie, etc. Especially for foams: C. Körner (WTM)
• International: Utah, Technion, Constanta, Ghent, Boulder, München, Zürich, ...

Dissertation projects
• U. Fabricius (AMG methods and software engineering for parallelization)
• C. Freundl (parallel expression templates for PDE solvers)
• J. Härtlein (expression templates for FE applications)
• N. Thürey (LBM, free surfaces)
• T. Pohl (parallel LBM)
• ... and 6 more

19 Diplom/Master theses; Studien/Bachelor theses
• Especially for performance analysis and optimization of LBM: J. Wilke, K. Iglberger, S. Donath
• ... and 23 more

KONWIHR, DFG, NATO, BMBF, Elitenetzwerk Bayern
Bavarian Graduate School in Computational Engineering (with TUM, since 2004)
Special international PhD program: Identifikation, Optimierung und Steuerung für technische Anwendungen (with Bayreuth and Würzburg), since Jan. 2006
Talk is Over. Questions?

Preview of presentation in MS 92
BGCE Student Paper Prize
MS 83, Thu 9:45 am, Laguna Beach I/II - B1
MS 92, Thu 3:00 pm, Pacific II-B