Scalable Parallel Multigrid for Finite Element Computations

Minisymposium MS8 @ SIAM CS&E 2007: Large Scale Parallel Multigrid

Lehrstuhl für Informatik 10 (Systemsimulation), Universität Erlangen-Nürnberg
www10.informatik.uni-erlangen.de
SIAM CS&E, Feb 2007

B. Bergen (LSS Erlangen, now Los Alamos), T. Gradl, Chr. Freundl (LSS Erlangen)
U. Rüde (LSS Erlangen, [email protected])
Overview

Motivation
• Supercomputers
• Example Applications of Large-Scale FE Simulation
Scalable Finite Element Solvers
• Scalable Algorithms: Multigrid
• Scalable Architectures:
  • Hierarchical Hybrid Grids (HHG)
  • Parallel Expression Templates for PDEs (ParExPDE)
Results
Outlook: Towards Peta-Scale FE Solvers
Part I

Motivation
HHG Motivation: Structured vs. Unstructured Grids (on Hitachi SR 8000)

gridlib/HHG MFlop/s rates for matrix-vector multiplication on one node, compared with highly tuned JDS results for sparse matrices.

Pet Dinosaur HLRB-I: Hitachi SR 8000 at the Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften. No. 5 in the TOP-500 at the time of its installation in 2000.
Our New Play Station: HLRB-II

HLRB-II: SGI Altix 4700 at the Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften. No. 18 in the TOP-500 of Nov. 2006.

• Shared memory architecture
• 4096 CPUs
• 17.5 TBytes RAM
• NUMAlink network
• 26.2 TFLOP/s peak performance
• 24.5 TFLOP/s LINPACK performance
• will be doubled after the 2007 upgrade

Test ground for scalability experiments
Motivation II: The DiMe Project

Data Local Iterative Methods for the Efficient Solution of Partial Differential Equations
• Cache optimizations for sparse matrix codes
• High-performance multigrid solvers
• Efficient (free surface) LBM codes for CFD

Collaboration project, funded 1996 - 20xy by the DFG
www10.informatik.uni-erlangen.de/de/Research/Projects/DiME/
Part II-a

Towards Scalable FE Software
Scalable Algorithms: Multigrid
Example: Flow-Induced Noise

Flow around a small square obstacle
Relatively simple geometries, but fine resolution is needed to resolve the physics
Images by M. Escobar (Dept. of Sensor Technology + LSTM)
What is Multigrid?

Has nothing to do with "grid computing". A general methodology:
• multi-scale (actually it is the "original")
• many different applications
• developed in the 1970s - ...

Useful e.g. for solving elliptic PDEs:
• large sparse systems of equations
• iterative
• convergence rate independent of problem size
• asymptotically optimal complexity -> algorithmic scalability!
• can solve e.g. a 2D Poisson problem in ~30 operations per grid point
• efficient parallelization - if one knows how to do it
• best basis for fully scalable FE solvers
Multigrid: V-Cycle
[V-cycle diagram: relax on the fine grid, compute the residual, restrict it to the coarser grid, solve there (by recursion), interpolate the correction back, and correct.]
Goal: solve A_h u_h = f_h using a hierarchy of grids
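To make the cycle structure concrete, below is a minimal, self-contained 1D Poisson V-cycle in C++ with a weighted Jacobi smoother. It is an illustrative sketch only (the 1D model problem and all names are our own choice), not the HHG or ParExPDE implementation.

#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

using Vec = std::vector<double>;   // interior grid values only

// nu sweeps of weighted Jacobi for (1/h^2) * tridiag(-1, 2, -1) u = f
void smooth(Vec& u, const Vec& f, double h, int nu) {
    const double w = 2.0 / 3.0;
    for (int s = 0; s < nu; ++s) {
        Vec un = u;
        for (std::size_t i = 0; i < u.size(); ++i) {
            double left  = (i > 0)            ? u[i - 1] : 0.0;
            double right = (i + 1 < u.size()) ? u[i + 1] : 0.0;
            un[i] = (1.0 - w) * u[i] + w * 0.5 * (left + right + h * h * f[i]);
        }
        u = un;
    }
}

Vec residual(const Vec& u, const Vec& f, double h) {     // r_h = f_h - A_h u_h
    Vec r(u.size());
    for (std::size_t i = 0; i < u.size(); ++i) {
        double left  = (i > 0)            ? u[i - 1] : 0.0;
        double right = (i + 1 < u.size()) ? u[i + 1] : 0.0;
        r[i] = f[i] - (2.0 * u[i] - left - right) / (h * h);
    }
    return r;
}

Vec restrictFW(const Vec& r) {                 // full weighting to the next coarser grid
    Vec rc((r.size() - 1) / 2);
    for (std::size_t i = 0; i < rc.size(); ++i)
        rc[i] = 0.25 * r[2 * i] + 0.5 * r[2 * i + 1] + 0.25 * r[2 * i + 2];
    return rc;
}

Vec interpolate(const Vec& ec, std::size_t nFine) {      // linear interpolation to the finer grid
    Vec e(nFine, 0.0);
    for (std::size_t i = 0; i < ec.size(); ++i) {
        e[2 * i + 1]  = ec[i];
        e[2 * i]     += 0.5 * ec[i];
        e[2 * i + 2] += 0.5 * ec[i];
    }
    return e;
}

// One V(nu1, nu2) cycle for A_h u_h = f_h; the recursion stops on a one-point grid.
void vCycle(Vec& u, const Vec& f, double h, int nu1, int nu2) {
    if (u.size() == 1) { u[0] = 0.5 * h * h * f[0]; return; }   // coarsest grid: solve exactly
    smooth(u, f, h, nu1);                          // pre-smoothing: relax on A_h u_h = f_h
    Vec r  = residual(u, f, h);                    // compute the residual
    Vec rc = restrictFW(r);                        // restrict the residual to the coarse grid
    Vec ec(rc.size(), 0.0);                        // coarse-grid correction, zero initial guess
    vCycle(ec, rc, 2.0 * h, nu1, nu2);             // solve the coarse problem by recursion
    Vec e = interpolate(ec, u.size());             // interpolate the correction
    for (std::size_t i = 0; i < u.size(); ++i) u[i] += e[i];    // correct
    smooth(u, f, h, nu2);                          // post-smoothing
}

int main() {
    const std::size_t n = 128;                     // number of intervals (a power of two)
    const double h = 1.0 / n, pi = 3.14159265358979323846;
    Vec u(n - 1, 0.0), f(n - 1);
    for (std::size_t i = 0; i < f.size(); ++i)     // -u'' = pi^2 sin(pi x), exact u = sin(pi x)
        f[i] = pi * pi * std::sin(pi * (i + 1) * h);
    for (int c = 0; c < 10; ++c) vCycle(u, f, h, 2, 2);
    std::cout << "u(0.5) = " << u[n / 2 - 1] << "  (exact continuous solution: 1)\n";
    return 0;
}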
Parallel High Performance FE Multigrid

Parallelize "plain vanilla" multigrid:
• partition the domain
• parallelize all operations on all grids
• use clever data structures

Do not worry (so much) about coarse grids:
• idle processors?
• short messages?
• sequential dependency in the grid hierarchy?

Why we do not use Domain Decomposition:
• DD without a coarse grid does not scale (algorithmically) and is inefficient for large problems / many processors
• DD with coarse grids is still less efficient than multigrid and is as difficult to parallelize
Part II-b

Towards Scalable FE Software
Scalable Architecture: Hierarchical Hybrid Grids and Parallel Expression Templates for PDEs
Hierarchical Hybrid Grids (HHG)

Unstructured input grid
• resolves the geometry of the problem domain
Patch-wise regular refinement
• generates nested grid hierarchies naturally suitable for geometric multigrid algorithms
New: modify storage formats and operations on the grid to exploit the regular substructures

Does an unstructured grid with 100 000 000 000 elements make sense?

HHG - Ultimate Parallel FE Performance!
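As one illustration of what "exploiting the regular substructures" can mean, the toy kernel below applies a constant-coefficient 7-point stencil to the structured interior of a refined patch stored as one contiguous array, with no sparse-matrix indexing at all. This is a hypothetical sketch of the idea, not the actual HHG data structures (which work with stencils arising from tetrahedral elements).

#include <cstddef>
#include <vector>

// Apply a constant-coefficient 7-point stencil to the interior of an
// n x n x n structured block stored as one contiguous array.
void applyStencil(const std::vector<double>& u, std::vector<double>& v,
                  std::size_t n, const double s[7]) {
    auto idx = [n](std::size_t i, std::size_t j, std::size_t k) { return (i * n + j) * n + k; };
    for (std::size_t i = 1; i + 1 < n; ++i)
        for (std::size_t j = 1; j + 1 < n; ++j)
            for (std::size_t k = 1; k + 1 < n; ++k)
                v[idx(i, j, k)] = s[0] * u[idx(i, j, k)]
                                + s[1] * u[idx(i - 1, j, k)] + s[2] * u[idx(i + 1, j, k)]
                                + s[3] * u[idx(i, j - 1, k)] + s[4] * u[idx(i, j + 1, k)]
                                + s[5] * u[idx(i, j, k - 1)] + s[6] * u[idx(i, j, k + 1)];
}

int main() {
    const std::size_t n = 65;                                  // one regularly refined patch
    std::vector<double> u(n * n * n, 1.0), v(n * n * n, 0.0);
    const double laplace[7] = {6.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0};
    applyStencil(u, v, n, laplace);
    return 0;
}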
HHG refinement example
Input Grid
HHG refinement example
Refinement Level one
HHG refinement example
Refinement Level Two
HHG refinement example

Structured Interior
HHG refinement example
Structured Interior
HHG refinement example
Edge Interior
HHG refinement example
Edge Interior
HHG for Parallelization

Use regular HHG patches for partitioning the domain
HHG Parallel Update Algorithm

for each vertex do
    apply operation to vertex
end for
update vertex primary dependencies

for each edge do
    copy from vertex interior
    apply operation to edge
    copy to vertex halo
end for
update edge primary dependencies

for each element do
    copy from edge/vertex interiors
    apply operation to element
    copy to edge/vertex halos
end for
update secondary dependencies
HHG for Parallelization
Parallel HHG Framework: Design Goals

To achieve good parallel scalability:
• minimize latency by reducing the number of messages that must be sent
• optimize for high-bandwidth interconnects ⇒ large messages
• avoid local copying into MPI buffers
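A minimal sketch of the communication pattern these goals suggest (hypothetical data layout and names, not the actual HHG interface): all values shared with a given neighbor are kept in one contiguous buffer, so each neighbor receives exactly one large message per update, posted directly from application memory without intermediate packing.

#include <mpi.h>
#include <vector>

struct Neighbor {
    int rank;
    std::vector<double> sendBuf;   // contiguous boundary data owned locally
    std::vector<double> recvBuf;   // contiguous halo data owned by the neighbor
};

// One nonblocking send and one nonblocking receive per neighbor, straight
// from/into the application buffers; no extra copy into MPI staging buffers.
void exchangeHalos(std::vector<Neighbor>& neighbors, MPI_Comm comm) {
    std::vector<MPI_Request> requests;
    requests.reserve(2 * neighbors.size());
    for (Neighbor& nb : neighbors) {
        requests.emplace_back();
        MPI_Irecv(nb.recvBuf.data(), static_cast<int>(nb.recvBuf.size()), MPI_DOUBLE,
                  nb.rank, 0, comm, &requests.back());
        requests.emplace_back();
        MPI_Isend(nb.sendBuf.data(), static_cast<int>(nb.sendBuf.size()), MPI_DOUBLE,
                  nb.rank, 0, comm, &requests.back());
    }
    MPI_Waitall(static_cast<int>(requests.size()), requests.data(), MPI_STATUSES_IGNORE);
}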
ParExPDE

• A library for the user-friendly, rapid development of numerical PDE solvers on parallel (super-)computers
• Provides a high-level and intuitive user interface without compromising on efficiency
• Regularly refined hexahedral grids
• Support for deformed edges
Introduction to Expression Templates

• Encapsulation of arithmetic expressions in a tree-like structure by using C++ template constructs
• The expression tree is resolved at compile time
• Avoidance of unnecessary copying and generation of temporary objects

Elegance & Performance
Introduction to Expression Templates

Evaluation of an expression:

template <class T>
void Vector::operator=(Expr<T>& expr) {
    for (int i = 0; i < _size; i++)
        _values[i] = expr.valueAt(i);
}
Expression tree for z = a * x + y:

Expr< ExprBinOp< ., ., OpAdd > >
  Expr< ExprBinOp< ., ., OpMult > >
    ExprLiteral( a )
    ExprVector( x )
  ExprVector( y )
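For readers unfamiliar with the technique, here is a minimal, self-contained sketch of the mechanism, using class names modeled on the diagram above (an illustrative toy, not the ParExPDE implementation): the right-hand side a * x + y is encoded in a nested template type, and the assignment evaluates it in a single fused loop without temporary vectors.

#include <cstddef>
#include <iostream>
#include <vector>

// CRTP base: everything that can appear in an expression derives from Expr<Self>.
template <class E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
    double valueAt(std::size_t i) const { return self().valueAt(i); }
};

struct ExprLiteral : Expr<ExprLiteral> {            // leaf: a scalar literal
    double v;
    explicit ExprLiteral(double v_) : v(v_) {}
    double valueAt(std::size_t) const { return v; }
};

struct ExprVector : Expr<ExprVector> {              // leaf: a vector of values
    std::vector<double> values;
    explicit ExprVector(std::size_t n, double init = 0.0) : values(n, init) {}
    double valueAt(std::size_t i) const { return values[i]; }

    template <class T>
    ExprVector& operator=(const Expr<T>& expr) {    // one fused loop, no temporaries
        for (std::size_t i = 0; i < values.size(); ++i)
            values[i] = expr.valueAt(i);
        return *this;
    }
};

struct OpMult { static double apply(double a, double b) { return a * b; } };
struct OpAdd  { static double apply(double a, double b) { return a + b; } };

template <class L, class R, class Op>               // inner node: evaluates lazily
struct ExprBinOp : Expr<ExprBinOp<L, R, Op>> {
    const L& l; const R& r;
    ExprBinOp(const L& l_, const R& r_) : l(l_), r(r_) {}
    double valueAt(std::size_t i) const { return Op::apply(l.valueAt(i), r.valueAt(i)); }
};

template <class L, class R>
ExprBinOp<L, R, OpMult> operator*(const Expr<L>& l, const Expr<R>& r) {
    return ExprBinOp<L, R, OpMult>(l.self(), r.self());
}
template <class L, class R>
ExprBinOp<L, R, OpAdd> operator+(const Expr<L>& l, const Expr<R>& r) {
    return ExprBinOp<L, R, OpAdd>(l.self(), r.self());
}

int main() {
    ExprVector x(8, 2.0), y(8, 1.0), z(8);
    ExprLiteral a(3.0);
    z = a * x + y;                      // every entry becomes 3*2 + 1 = 7
    std::cout << z.valueAt(0) << "\n";  // prints 7
    return 0;
}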
Part II-c

Towards Scalable FE Software
Performance Results
Node Performance is Difficult! (B. Gropp)

DiMe project: cache-aware multigrid (1996 - ...)
Performance of a 3D MG smoother for a 7-point stencil, in MFlop/s, on Itanium (1.4 GHz):

grid size        513³    257³    129³     65³     33³     17³
standard          579     490     677     715    1344    1072
no blocking       819     849    1065     995    1417    2445
2x blocking      1282    1284    1319    1312    1913    2400
3x blocking      2049    2134    2140    2167    2389    2420
Array Padding
Temporal blocking - in EPIC assembly language
Software pipelining in the extreme (M. Stürmer - J. Treibig)
Node Performance is Possible!
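To illustrate the kind of loop restructuring behind the blocking results above, the sketch below spatially blocks a 7-point Jacobi sweep so that a small tile of three grid planes stays in cache while the outer dimension is streamed. It is a generic illustration under our own naming, not the DiMe kernels themselves (which additionally use temporal blocking of several sweeps, array padding, and software pipelining).

#include <algorithm>
#include <cstddef>
#include <vector>

// One blocked Jacobi sweep for the 7-point Laplacian on an n^3 grid
// (k is the unit-stride direction). The (jb x kb) tile is chosen so that
// three planes of the tile fit into cache and are reused while i streams.
void jacobiSweepBlocked(const std::vector<double>& u, std::vector<double>& un,
                        const std::vector<double>& f, std::size_t n, double h,
                        std::size_t jb, std::size_t kb) {
    auto idx = [n](std::size_t i, std::size_t j, std::size_t k) { return (i * n + j) * n + k; };
    const double h2 = h * h;
    for (std::size_t jj = 1; jj < n - 1; jj += jb)          // block the j loop
        for (std::size_t kk = 1; kk < n - 1; kk += kb)      // block the k loop
            for (std::size_t i = 1; i < n - 1; ++i)         // stream through i within the tile
                for (std::size_t j = jj; j < std::min(jj + jb, n - 1); ++j)
                    for (std::size_t k = kk; k < std::min(kk + kb, n - 1); ++k)
                        un[idx(i, j, k)] = (h2 * f[idx(i, j, k)]
                            + u[idx(i - 1, j, k)] + u[idx(i + 1, j, k)]
                            + u[idx(i, j - 1, k)] + u[idx(i, j + 1, k)]
                            + u[idx(i, j, k - 1)] + u[idx(i, j, k + 1)]) / 6.0;
}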
Single-Processor HHG Performance on Itanium for Relaxation of a Tetrahedral Finite Element Mesh
Multigrid Convergence Rates
Scaleup: HHG and ParExPDE
Speedup: HHG and ParExPDE
HHG: Parallel Scalability
Parallel scalability of a Poisson problem discretized by tetrahedral finite elements.
Times are for six V(3,3) cycles on an SGI Altix (Itanium-2, 1.6 GHz).

#Procs   #DOFs x 10^6   #Els x 10^6   #Input Els    GFLOP/s    Time [s]
    64          2,144        12,884        6,144      100/75         68
   128          4,288        25,769       12,288     200/147         69
   256          8,577        51,539       24,576     409/270         76
   512         17,167       103,079       49,152     762/545         75
  1024         17,167       103,079       49,152   1,456/964         43
Largest problem solved to date: 1.28 x 10^11 DOFs (45,900 input elements) on 3825 procs: 7 s per V(2,2) cycle.
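For perspective, two numbers derived directly from the figures above (simple arithmetic, not additional measurements):

\frac{1.28 \times 10^{11}\ \text{DOFs}}{3825\ \text{processes}} \approx 3.3 \times 10^{7}\ \text{DOFs per process},
\qquad
\frac{1.28 \times 10^{11}\ \text{DOFs}}{7\ \text{s}} \approx 1.8 \times 10^{10}\ \text{unknowns updated per second in one V(2,2) cycle.}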
Why Multigrid?

A direct elimination banded solver has complexity O(N^2.3) for a 3-D problem. This becomes

338,869,529,114,764,631,553,758 = O(10^23)

operations for our problem size. At one Teraflop/s this would result in a runtime of 10,000 years, which could be reduced to 10 years on a Petaflop/s system.
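A quick back-of-the-envelope check of these runtimes, using the operation count stated above:

\frac{3.4 \times 10^{23}\ \text{operations}}{10^{12}\ \text{flop/s}} \approx 3.4 \times 10^{11}\ \text{s} \approx 1.1 \times 10^{4}\ \text{years},
\qquad
\frac{3.4 \times 10^{23}\ \text{operations}}{10^{15}\ \text{flop/s}} \approx 3.4 \times 10^{8}\ \text{s} \approx 11\ \text{years.}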
SIAM J. Scientific Computing: Special Issue on CS&E
http://www.siam.org/journals/sisc.php

Guest Editors-in-Chief: Chris Johnson, David Keyes, Ulrich Ruede

Call for papers:
• Modeling techniques
• Simulation techniques
• Analysis techniques
• Tools for realistic problems

Deadline: April 30, 2007

Guest Editorial Board: Gyan Bhanot, IBM Watson Research Center, Yorktown Heights; Rupak Biswas, NASA Ames Research Center, Moffett Field, CA; Edmond Chow, D. E. Shaw Research, LLC; Phil Colella, Lawrence Berkeley National Laboratory; Yuefan Deng, Stony Brook University, Stony Brook, NY / Brookhaven National Lab; Lori Freitag Diachin, Lawrence Livermore National Laboratory; Omar Ghattas, The University of Texas at Austin; Laurence Halpern, Univ. Paris XIII; Robert Harrison, Oak Ridge National Laboratory; Bruce Hendrickson, Sandia National Laboratories; Kirk Jordan, IBM, Cambridge MA; Tammy Kolda, Sandia National Laboratories; Louis Komzsik, UGS Corp., Cypress, CA, USA; Ulrich Langer, Johann Radon Institute for Computational and Applied Mathematics (RICAM), Linz; Hans Petter Langtangen, Simula Research Laboratory, Oslo; Steven Lee, Lawrence Livermore National Laboratory; Kengo Nakajima, University of Tokyo; Aiichiro Nakano, University of Southern California, Los Angeles; Esmond G. Ng, Lawrence Berkeley National Laboratory; Stefan Turek, University of Dortmund, Germany; Andy Wathen, Oxford University; Margaret Wright, New York University; David P. Young, Boeing.
Part IV

Conclusions
Conclusions

Supercomputer Performance is Easy!
• If parallel efficiency is bad, choose a slower serial algorithm: it is probably easier to parallelize and will make your speedups look much more impressive.
• Introduce the "CrunchMe" variable for getting high Flops rates. Advanced method: disguise CrunchMe by using an inefficient (but compute-intensive) algorithm from the start.
• Introduce the "HitMe" variable to get good cache hit rates. Advanced version: disguise HitMe within "clever data structures" that introduce a lot of overhead.
• Never cite "time-to-solution": who cares whether you solve a real-life problem anyway; it is the MachoFlops that interest the people who pay for your research.
• Never waste your time by trying to use a complicated algorithm in parallel (such as multigrid): the more primitive the algorithm, the easier it is to maximize your MachoFlops.
References

• B. Bergen, F. Hülsemann, U. Rüde: Is 1.7 x 10^10 unknowns the largest finite element system that can be solved today? SuperComputing, Nov. 2005. (Answer: no, it's 1.3 x 10^11 since Dec. 2006.) ISC Award 2006 for Application Scalability.
• B. Bergen, T. Gradl, F. Hülsemann, U. Rüde: A massively parallel multigrid method for finite elements. Computing in Science & Engineering, Vol. 8, No. 6.
Acknowledgements

Collaborators
• In Erlangen: WTM, LSE, LSTM, LGDV, RRZE, Neurozentrum, Radiologie, etc. Especially for foams: C. Körner (WTM)
• International: Utah, Technion, Constanta, Ghent, Boulder, München, Zürich, ...

Dissertation projects
• U. Fabricius (AMG methods and software engineering for parallelization)
• C. Freundl (parallel expression templates for PDE solvers)
• J. Härtlein (expression templates for FE applications)
• N. Thürey (LBM, free surfaces)
• T. Pohl (parallel LBM)
• ... and 6 more

19 Diplom/Master theses; Studien/Bachelor theses
• Especially for performance analysis and optimization of LBM: J. Wilke, K. Iglberger, S. Donath
• ... and 23 more

KONWIHR, DFG, NATO, BMBF, Elitenetzwerk Bayern
Bavarian Graduate School in Computational Engineering (with TUM, since 2004)
Special international PhD program: Identifikation, Optimierung und Steuerung für technische Anwendungen (with Bayreuth and Würzburg), since Jan. 2006
Talk is Over. Questions?

Preview of presentation in MS 92
BGCE Student Paper Prize
MS 83, Thu 9:45 am, Laguna Beach I/II - B1
MS 92, Thu 3:00 pm, Pacific II-B