41
1 Scalable Scalable Parallel Parallel Multigrid for Multigrid for Finite Element Finite Element Computations Computations Minisymposium MS8 @ SIAM CS&E 2007 Minisymposium MS8 @ SIAM CS&E 2007 Large Scale Parallel Multigrid Large Scale Parallel Multigrid Lehrstuhl für Informatik 10 (Systemsimulation) Universität Erlangen-Nürnberg www10.informatik.uni-erlangen.de SIAM CS&E, Feb 2007 B. Bergen (LSS Erlangen/ now Los Alamos) T. Gradl T. Gradl , , Chr. Freundl Chr. Freundl (LSS Erlangen) U. Rüde (LSS Erlangen, [email protected])

Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

1

Scalable Scalable Parallel Parallel Multigrid forMultigrid forFinite Element Finite Element ComputationsComputations

Minisymposium MS8 @ SIAM CS&E 2007Minisymposium MS8 @ SIAM CS&E 2007Large Scale Parallel MultigridLarge Scale Parallel Multigrid

Lehrstuhl für Informatik 10 (Systemsimulation)Universität Erlangen-Nürnberg

www10.informatik.uni-erlangen.deSIAM CS&E, Feb 2007

B. Bergen (LSS Erlangen/ now Los Alamos)T. GradlT. Gradl, , Chr. FreundlChr. Freundl (LSS Erlangen)

U. Rüde (LSS Erlangen, [email protected])

Page 2: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

2

OverviewOverviewMotivation

SupercomputersExample Applications of Large-Scale FE Simulation

Scalable Finite Element SolversScalable Algorithms: MultigridScalable ArchitectScalable Architectures:ures:

•• HHierarchical HHybrid GGrids (HHG)•• ParParallel ExExpression Templates for PDE PDE (ParExPDE)

ResultsOutlook

Towards Peta-Scale FE Solvers

Page 3: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

3

Part IPart I

MotivationMotivation

Page 4: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

4

HHG MotivationHHG MotivationStructured Structured vsvs. Unstructured Grids. Unstructured Grids

(on Hitachi SR 8000)(on Hitachi SR 8000)

gridlib/HHG MFlops rates for matrix-vectormultiplication on one nodecompared with highly tuned JDS resultsfor sparse matricesa

Pet Dinosaur HLRB-I:Hitachi SR 8000

at the Leibniz-Rechenzentrumder Bayerischen Akademie der

WissenschaftenNo. 5 in TOP-500 at time of

installation in 2000

Page 5: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

5

Our New Play Station: HLRB-IIOur New Play Station: HLRB-II

HLRB-II: SGI Altix 4700 at the Leibniz-Rechenzentrum der

Bayerischen Akademie der WissenschaftenNo. 18 in in TOP-500 of Nov. 2006

Shared memory architecture4096 CPUs17.5 TBytes RAMNUMAlink network26.2 TFLOP/s PeakPerformance24.5 TFLOP/s LINPACKperformancewill be doubled after 2007upgrade

Test ground for scalabilityexperiments

Page 6: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

6

Motivation II: DiMe - ProjectMotivation II: DiMe - Project

Cache-optimizations for sparse matrixcodesHigh Performance Multigrid SolversEfficient (free surface) LBM codes forCFD

Collaboration projectfunded 1996 - 20xy by DFG

www10.informatik.uni-erlangen.de/de/Research/Projects/DiME/

Data Local Iterative Methods for theEfficient Solution of Partial Differential Equations

Page 7: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

7

Part II-aPart II-a

Towards Scalable FE SoftwareTowards Scalable FE Software

Scalable Algorithms:Scalable Algorithms:MultigridMultigrid

Page 8: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

8

Example: Flow Induced NoiseExample: Flow Induced NoiseFlow around a small square obstacle

Relatively simple geometries, but fine resolution for resolving physics

Images by M. Escobar (Dept. of Sensor Technology + LSTM)

Page 9: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

9

What is Multigrid?What is Multigrid?Has nothing to do with „grid computing“A general methodology

multi - scale (actually it is the „original“)many different applicationsdevelopped in the 1970s - ...

Useful e.g. for solving elliptic PDEslarge sparse systems of equationsiterativeconvergence rate independent of problem sizeasymptotically optimal complexity -> algorithmic scalability!can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpointefficient parallelization - if one knows how to do itbest basis for fully scalable FE solvers

Page 10: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

10

MultigridMultigrid: : V-CycleV-Cycle

Relax on

Residual

Restrict

Correct

Solve

Interpolate

by recursion

… …

Goal:Goal: solve Ah uh = f h using a hierarchy of grids

Page 11: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

11

Parallel High Performance FE MultigridParallel High Performance FE MultigridParallelize „plain vanilla“ multigrid

partition domainparallelize all operations on all gridsuse clever data structures

Do not worry (so much) about Coarse Gridsidle processors?short messages?sequential dependency in grid hierarchy?

Why we do not use Domain DecompositionDD without coarse grid does not scale (algorithmically) and isinefficient for large problems/ many procssorsDD with coarse grids is still less efficient than multigrid and is asdifficult to parallelize

Page 12: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

12

Part II - bPart II - b

Towards Scalable FE SoftwareTowards Scalable FE Software

Scalable ArchitectureScalable Architecture

Hierarchical Hybrid Grids andHierarchical Hybrid Grids andParallel Expression Templates forParallel Expression Templates for

PDEsPDEs

Page 13: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

13

HHierarchical ierarchical Hybrid Hybrid Grids (HHG)Grids (HHG)Unstructured input grid

Resolves geometry of problem domainPatch-wise regular refinement

generates nested grid hierarchies naturally suitable forgeometric multigrid algorithms

New:Modify storage formats and operations on the grid toexploit the regular substructures

Does an unstructured grid with 100 000 000 000 elementsmake sense?

HHG - Ultimate Parallel FE Performance!

Page 14: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

14

HHG refinement exampleHHG refinement example

Input Grid

Page 15: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

15

HHG Refinement exampleHHG Refinement example

Refinement Level one

Page 16: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

16

HHG Refinement exampleHHG Refinement example

Refinement Level Two

Page 17: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

17

HHG Refinement exampleHHG Refinement example

Structured InteriorStructured Interior

Page 18: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

18

HHG Refinement exampleHHG Refinement example

Structured Interior

Page 19: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

19

HHG Refinement exampleHHG Refinement example

Edge Interior

Page 20: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

20

HHG Refinement exampleHHG Refinement example

Edge Interior

Page 21: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

21

HHG for ParallelizationHHG for ParallelizationUse regular HHG patches for partitioning the domain

Page 22: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

22

HHG Parallel Update AlgorithmHHG Parallel Update Algorithmfor each vertex do

apply operation to vertexend for

for each edge docopy from vertex interiorapply operation to edgecopy to vertex halo

end for

for each element docopy from edge/vertex interiorsapply operation to elementcopy to edge/vertex halos

end for

update vertex primary dependenciesupdate vertex primary dependencies

update edge primary dependenciesupdate edge primary dependencies

update secondary dependenciesupdate secondary dependencies

Page 23: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

23

HHG for ParallelizationHHG for Parallelization

Page 24: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

24

Parallel HHG - FrameworkParallel HHG - FrameworkDesign GoalsDesign Goals

To achieve good parallel scalability:

Minimize latency by reducing the number ofmessages that must be sentOptimize for high bandwidth interconnects ⇒large messagesAvoid local copying into MPI buffers

Page 25: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

25

ParExPDEParExPDEA library for the user friendly,rapid development of numericalPDE solvers on parallel(super-)computersProvides a high level andintuitive user interface withoutcompromising on efficiencyRegularly refined hexahedralgridsSupport for deformed edges

Page 26: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

26

Introduction Introduction to Expression to Expression TemplatesTemplates

Encapsulation of arithmetic expressions in a tree-likestructure by using C++ template constructsEvaluation of expression happens at compile timeAvoidance of unnecessary copying and generation oftemporary objects

Elegance &Performance

Page 27: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

27

Introduction Introduction to Expression to Expression TemplatesTemplates

Evaluation of an expression:template <class T>void Vector::operator=(Expr<T>& expr) { for (int i=0; i<_size; i++) _values[i] = expr.valueAt(i);}

z = a * x + y

ExprLiteral( a ) ExprVector( x )

ExprVector( y )Expr< ExprBinOp< ., ., OpMult > >

Expr< ExprBinOp< ., ., OpAdd > >

Page 28: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

28

Part II - cPart II - c

Towards Scalable FE SoftwareTowards Scalable FE Software

Performance ResultsPerformance Results

Page 29: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

29

Node Performance is Difficult! Node Performance is Difficult! (B. Gropp)(B. Gropp)DiMe project: Cache-aware Multigrid (1996- ...)DiMe project: Cache-aware Multigrid (1996- ...)

1282128413191312191324002x blocking

2049213421402167238924203x blocking

819849106599514172445no blocking

57949067771513441072standard

513325731293653333173grid size

Performance of 3D-MG-Smoother for 7-pt stencil in Mflops on Itanium 1.4 GHz

Array Padding

Temporal blocking - in EPIC assembly language

Software pipelining in the extreme (M. Stürmer - J. Treibig)

Node Performance is Possible!

Page 30: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

30

Single Processor HHG Performance on Itanium forSingle Processor HHG Performance on Itanium forRelaxation of a Tetrahedral Finite Element MeshRelaxation of a Tetrahedral Finite Element Mesh

Page 31: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

31

Multigrid Convergence Multigrid Convergence RatesRates

Page 32: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

32

Scaleup Scaleup HHG and HHG and ParExPDEParExPDE

Page 33: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

33

Speedup Speedup HHG and HHG and ParExPDEParExPDE

Page 34: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

34

HHG: Parallel ScalabilityHHG: Parallel Scalability

431,456/96449152103,07917,1671024

75762/54549152103,07917,167512

76409/2702457651,5398,577256

69200/1471228825,7694,288128

68100/75614412,8842,14464

Time [s]GFLOP/s#Input Els#Els x 106#DOFS x 106#Procs

Parallel scalability of Poisson problemdiscretized by tetrahedral finite elements.

Times for six V(3,3) cycles on SGI Altix (Itanium-2 1.6 GHz).

Largest problem solved to date:1.28 x 1011 DOFS (45900 Input Els) on 3825 Procs: 7 s per V(2,2) cycle

Page 35: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

35

Why Multigrid?Why Multigrid?A direct elimination banded solver has complexityO(N2.3) for a 3-D problem.This becomes

338869529114764631553758 = O(1023)operations for our problem size

At one Teraflops this would result in a runtime of10000 years which could be reduced to 10 years ona Petaflops system

Page 36: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

36

SIAM J. Scientific ComputingSIAM J. Scientific Computing

Special Issue on CS&ESpecial Issue on CS&Ehttp://www.siam.org/journals/sisc.php

Guest Editors-in-Chief:Chris JohnsonDavid KeyesUlrich Ruede

Call for papers:Modeling techniquesSimulation techniquesAnalysis techniquesTools for realisticproblems

Deadline:April 30, 2007

Guest Editorial Board: Gyan Bhanot, IBM Watson Research Center,Yorktown Hights; Rupak Biswas, NASA Ames Research Center,Moffett Field, CA; Edmond Chow, D. E. Shaw Research, LLC; PhilColella, Lawrence Berkeley National Laboratory; Yuefan Deng, StonyBrook University, Stony Brook, NY Brookhaven National Lab; LoriFreitag Diachin, Lawrence Livermore National Laboratory; OmarGhattas, The University of Texas at Austin, Laurence Halpern, Univ.Paris XIII; Robert Harrison, Oak Ridge National Laboratory; BruceHendrickson, Sandia National Laboratories; Kirk Jordan, IBM,Cambridge MA; Tammy Kolda, Sandia National Laboratories; LouisKomzsik, UGS Corp., Cypress, CA, USA; Ulrich Langer, JohannRadon Institute for Computational and Applied Mathematics (RICAM)Linz; Hans Petter Langtangen, Simula Research Laboratory, Oslo;Steven Lee, Lawrence Livermore National Laboratory; KengoNakajima, University of Tokyo; Aiichiro Nakano, University ofSouthern California, Los Angeles; Esmong G. Ng, LawrenceBerkeley National Laboratory; Stefan Turek, University ofDortmund, Germany; Andy Wathen, Oxford University; MargaretWright, New York University; David P. Young, Boeing.

Page 37: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

37

Part IVPart IV

ConclusionsConclusions

Page 38: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

38

ConclusionsConclusionsSupercomputer Performance is Easy!

If parallel efficiency is bad, choose a slower serial algorithmit is probably easier to parallelizeand will make your speedups look much more impressive

Introduce the “CrunchMe” variable for getting high Flops ratesadvanced method: disguise CrunchMe by using an inefficient (but compute-intensive)algorithm from the start

Introduce the “HitMe” variable to get good cache hit ratesadvanced version: disguise HitMe within “clever data structures” that introduce a lot ofoverhead

Never cite “time-to-solution”who cares whether you solve a real life problem anywayit is the MachoFlops that interest the people who pay for your research

Never waste your time by trying to use a complicated algorithm in parallel (suchas multigrid)

the more primitive the algorithm the easier to maximize your MachoFlops.

Page 39: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

39

ReferencesReferences

B. Bergen, F. Hülsemann, U. Ruede: Is 1.7× 1010 unknowns thelargest finite element system that can be solved today?SuperComputing, Nov‘ 2005.(Answer: no, it‘s 1.3 x 1011 since Dec. 2006)ISC Award: 2006 for Application Scalability.B. Bergen, T. Gradl, F. Hülsemann, U. Ruede: A massivelyparallel multigrid method for finite elementsComputing in Science & Engineering, Vol. 8, No. 6

Page 40: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

40

AcknowledgementsAcknowledgementsCollaborators

In Erlangen: WTM, LSE, LSTM, LGDV, RRZE, Neurozentrum, Radiologie, etc.Especially for foams: C. Körner (WTM)International: Utah, Technion, Constanta, Ghent, Boulder, München, Zürich, ...

Dissertationen ProjectsU. Fabricius (AMG-Verfahren and SW-Engineering for parallelization)C. Freundl (Parelle Expression Templates for PDE-solver)J. Härtlein (Expression Templates for FE-Applications)N. Thürey (LBM, free surfaces)T. Pohl (Parallel LBM)... and 6 more

19 Diplom- /Master- ThesisStudien- /Bachelor- Thesis

Especially for Performance-Analysis/ Optimization for LBM• J. Wilke, K. Iglberger, S. Donath

... and 23 more

KONWIHR, KONWIHR, DFG, DFG, NATO, BMBFNATO, BMBFElitenetzwerk BayernElitenetzwerk Bayern

Bavarian Graduate School in Computational EngineeringBavarian Graduate School in Computational Engineering (with TUM, since 2004) (with TUM, since 2004)Special International Special International PhD programPhD program: : Identifikation, Optimierung und Steuerung für technischeIdentifikation, Optimierung und Steuerung für technischeAnwendungenAnwendungen ( (with with Bayreuth and Würzburg) sinceBayreuth and Würzburg) since Jan. 2006. Jan. 2006.

Page 41: Scalable Parallel Multigrid for Finite Element …can solve e.g. 2D Poisson Problem in ~ 30 operations per gridpoint efficient parallelization - if one knows how to do it best basis

41

Talk is OverTalk is Over

Questions?Questions?

Preview of presentation in MS 92

BGCE Student Paper PrizeMS 83, Thu 9:45 am, Laguna Beach I/II - B1MS 92, Thu 3:00 pm, Pacific II-B