LLNL-PRES-688866 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Investigating interoperability and performance portability of select LLNL numerical libraries DOE Center of Excellence Performance Portability Meeting
Glendale, Arizona, April 20, 2016
Slaven Peles, John Loffeld, Carol S. Woodward and Ulrike Yang
Outline
§ Challenges
§ Description of LLNL software stack
§ SUNDIALS library
§ Preliminary performance testing results
§ Future work and conclusion
Challenges
Porting existing codes to heterogeneous hardware architectures
§ Implement numerical algorithms in a way that makes best use of heterogeneous hardware architecture.
§ Develop code that can evolve along with the new hardware -- separate platform-specific from algorithmic parts (RAJA, Kokkos).
§ Total cost of ownership:
— How easy is it to deploy the code in new environments?
— How easy is it to add new features?
— What is the maintenance cost?
§ Technical debt management.
Maximizing performance is but one of several challenges that need to be addressed when moving to new architectures.
LLNL Software Stack
Libraries currently being ported to heterogeneous architectures
§ MFEM: a free, lightweight, scalable C++ library for finite element methods.
§ SUNDIALS: a suite of state-of-the-art numerical integrators and nonlinear solvers.
§ hypre: a library for solving large, sparse linear systems of equations on massively parallel computers.
Maintaining interoperability and performance portability of the software stack is more challenging on heterogeneous architectures.
The combined use of MFEM, hypre and SUNDIALS is critical for the efficient solution of a wide variety of transient PDEs, such as nonlinear elasticity and magnetohydrodynamics.
Numerical simulation and data flow
Use case: implicit integration scheme with iterative linear solver
[Flow diagram: the time integrator step (SUNDIALS) advances until the final time is reached; each step invokes the nonlinear solver step (SUNDIALS), which iterates until converged; each nonlinear iteration invokes the linear solver step (hypre), which iterates until converged and returns the update dx. Finite element tools (MFEM) provide function and Jacobian evaluation, returning the updated residual vector f and Jacobian J, plus the preconditioner P; the updated solution vector x is passed back up the chain.]
§ Time integrator and nonlinear solver are agnostic of vector data layout.
§ Numerical integrators and nonlinear solvers may invoke fairly complex step size control logic.
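The nesting in the diagram can be sketched in a few lines. The following is a minimal, illustrative Python sketch (not SUNDIALS code): backward Euler on the scalar test ODE y' = -y^3, with a Newton iteration whose linear "solve" is a scalar division standing in for the iterative linear solver stage.

```python
def f(y):          # right-hand side of the ODE y' = f(y)
    return -y**3

def f_prime(y):    # Jacobian of f (a scalar here)
    return -3.0 * y**2

def implicit_euler_step(y_prev, h, tol=1e-12, max_newton=20):
    """One backward-Euler step: solve y - y_prev - h*f(y) = 0 for y."""
    y = y_prev  # initial guess: previous solution
    for _ in range(max_newton):
        residual = y - y_prev - h * f(y)       # nonlinear residual
        if abs(residual) < tol:
            break
        jacobian = 1.0 - h * f_prime(y)        # J = I - h * df/dy
        dy = -residual / jacobian              # "linear solver step"
        y += dy                                # Newton update
    return y

def integrate(y0, h, t_final):
    """Time integrator loop: step until the final time is reached."""
    y, t = y0, 0.0
    while t < t_final:
        y = implicit_euler_step(y, h)
        t += h
    return y
```

For a PDE-sized system, the scalar division becomes a preconditioned iterative linear solve, which is where the hypre stage of the diagram enters.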
How to lay out the computation?
Considering constraints of heterogeneous architectures
§ GPU processing power >> CPU processing power.
§ GPU memory << CPU memory.
§ Moving data between CPU and GPU is expensive.
§ Proposed computation layouts:

  Numerical Integrator | Linear Solver | Function Evaluation
  GPU                  | GPU           | GPU
  CPU                  | GPU           | GPU
  CPU                  | GPU           | CPU
  CPU                  | CPU           | GPU
  CPU/GPU              | CPU/GPU       | CPU/GPU

§ Best layout is most likely problem dependent.
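The trade-off in the table can be made concrete with a back-of-the-envelope cost model. The sketch below is illustrative only (all rates are hypothetical, not measured): a layout pays for its compute time plus one bus transfer per vector that crosses the CPU-GPU boundary each step.

```python
def layout_cost(n_elems, flops_per_elem, device_gflops,
                transfers_per_step, bus_gbps, word=8):
    """Rough per-step cost (seconds) of a computation layout:
    compute time plus CPU<->GPU transfer time over the bus."""
    compute = n_elems * flops_per_elem / (device_gflops * 1e9)
    traffic = transfers_per_step * n_elems * word / (bus_gbps * 1e9)
    return compute + traffic

# Hypothetical rates: 1 TFLOP/s device, 10 GB/s bus, 1M-element vectors.
all_gpu = layout_cost(10**6, 10, 1000.0, 0, 10.0)  # everything on GPU
split   = layout_cost(10**6, 10, 1000.0, 2, 10.0)  # FE across the bus
```

Even with identical compute rates, two transfers per step dominate: the split layout is roughly two orders of magnitude slower in this toy model, echoing the bullet that moving data between CPU and GPU is expensive.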
SUNDIALS
Suite of state-of-the-art numerical integrators and nonlinear solvers
§ Forward-looking, extensible object-oriented design with simple and clean linear solver and vector interfaces.
§ Designed to be incorporated into existing codes.
§ Modular structure allows users to supply their own data structures.
§ Scales well in simulations on over 500,000 cores.
§ Supplied with serial, MPI and thread-parallel (OpenMP and Pthreads) structures.
§ CMake support for configuration and build.
§ Freely available, released under BSD license; over 4,500 downloads per year.
§ Modules and functionality:
— ODE integrators: (CVODE) variable order and step stiff BDF and non-stiff Adams; (ARKode) variable step implicit, explicit, and additive Runge-Kutta for IMEX approaches.
— DAE integrator: (IDA) variable order and step stiff BDF.
— CVODES and IDAS include forward and adjoint sensitivity capabilities.
— KINSOL nonlinear solver: Newton-Krylov and accelerated fixed point and Picard methods.
SUNDIALS
Used in industrial and academic applications worldwide
§ Power grid modeling (RTE France, ISU)
§ Simulation of clutches and powertrain parts (LuK GmbH & Co.)
§ Electrical and heat generation within battery cells (CD-adapco)
§ 3D parallel fusion (SMU, U. York, LLNL)
§ Implicit hydrodynamics in core collapse supernova (Stony Brook)
§ Dislocation dynamics (LLNL)
§ Sensitivity analysis of chemically reacting flows (Sandia)
§ Large-scale subsurface flows (CO Mines, LLNL)
§ Optimization in simulation of energy-producing algae (NREL)
§ Micromagnetic simulations (U. Southampton)
[Images: magnetic reconnection, core collapse supernova, dislocation dynamics, subsurface flow]
Interfacing SUNDIALS with other software

Vector interface:
§ Specifies:
— 3 constructors/destructors.
— 3 utility functions.
— 9 streaming operators.
— 10 reduction operators.
§ Entire interaction with application data is carried out through these 19 operators.
§ All are level-1 BLAS operators.
§ Individual modules require only a subset of these operators.

Linear solver interface:
§ Specifies the following five functions: init, setup, solve, perf and free.
§ SUNDIALS only requests linear solves at specific points. It is independent of the linear solve strategy.
§ Implementation of the hypre linear solver interface is in progress.

Object-oriented design and well-defined interfaces simplify porting SUNDIALS to new platforms.
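The point of the vector interface can be shown with a toy example. The sketch below is not the actual N_Vector API; the class and function names are invented for illustration. It shows a solver (a simple Richardson iteration, standing in for a Krylov method) written entirely against a handful of vector operations, so changing the data layout only requires reimplementing those operations, not the solver.

```python
class SimpleVector:
    """Illustrative stand-in for an N_Vector-like object: the solver
    below touches data only through these operations."""
    def __init__(self, data):
        self.data = list(data)

    def clone(self):                    # constructor-style operator
        return SimpleVector(self.data)

    def linear_sum(self, a, x, b, y):   # streaming: self = a*x + b*y
        self.data = [a * xi + b * yi for xi, yi in zip(x.data, y.data)]

    def dot(self, other):               # reduction: <self, other>
        return sum(si * oi for si, oi in zip(self.data, other.data))

def norm(v):
    return v.dot(v) ** 0.5

def richardson_solve(matvec, b, tol=1e-10, omega=0.5, max_iter=200):
    """Solve A x = b by Richardson iteration, using only the vector
    operations above (never touching v.data directly)."""
    x = b.clone()
    r = b.clone()
    for _ in range(max_iter):
        Ax = matvec(x)
        r.linear_sum(1.0, b, -1.0, Ax)   # r = b - A x
        if norm(r) < tol:
            break
        x.linear_sum(1.0, x, omega, r)   # x = x + omega * r
    return x
```

Swapping `SimpleVector` for a GPU-resident implementation of the same operators would leave `richardson_solve` untouched, which is the portability argument made on this slide.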
SUNDIALS vector kernels performance testing
Standalone kernel performance
[Plots: fraction of theoretical peak bandwidth vs. number of vector elements (thousands) for the axpy and dot product kernels on CPU and GPU, and the GPU speedup factor over CPU for both kernels.]
§ Compared CUDA implementation with corresponding threaded MKL calls.
§ Peak bandwidth of the GPU is ~3x larger than CPU bandwidth on the test hardware.
§ axpy shows speedup on GPU for N > 10^4, dot product for N > 10^6.
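The "fraction of peak bandwidth" metric in these plots comes from simple byte counting. The sketch below is illustrative (the rates in the example are hypothetical, not the test hardware's): axpy streams three words per element (read x, read y, write y) while a dot product streams two, which is why the two kernels saturate bandwidth at different vector lengths.

```python
def axpy_bytes(n, word=8):
    # y = a*x + y reads x and y and writes y: 3 memory ops per element
    return 3 * word * n

def dot_bytes(n, word=8):
    # dot(x, y) reads x and y: 2 memory ops per element
    return 2 * word * n

def peak_fraction(bytes_moved, seconds, peak_gbps):
    """Fraction of theoretical peak bandwidth actually sustained."""
    achieved = bytes_moved / seconds / 1e9   # GB/s sustained
    return achieved / peak_gbps
```

For example, an axpy over 10^6 doubles moves 24 MB; if it runs in 0.1 ms it sustains 240 GB/s, and against a hypothetical 288 GB/s peak that is a fraction of about 0.83.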
Anderson Acceleration Solver Performance
A simple nonlinear solver method implemented in SUNDIALS
[Plots: run times for CPU and GPU versions (function cost not timed), for Its = 16 with m = 16 and with m = 4.]
§ For vectors of fewer than 10,000 elements, CPU versions take less time than the GPU version.
§ CPU version costs remain approximately constant until vector lengths reach 100.
§ GPU version cost is constant until the vector length is 10,000 – the length at which the work per vector dominates the overhead per vector operation.
§ Times approach linear growth with vector length.
§ When both CPU and GPU versions are in the linear regime, we expect the ratio between timings to be approximately the ratio of bandwidths.
§ Threading reduces runtime on the CPU.
§ GPU gives more benefit on large problems.
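For reference, the method being timed can be sketched compactly. Below is a minimal, illustrative scalar Anderson acceleration with depth m = 1 (SUNDIALS's KINSOL implements the general vector version with arbitrary depth m); it mixes the two most recent fixed-point evaluations using a least-squares weight.

```python
def anderson_m1(g, x0, tol=1e-12, max_iter=50):
    """Scalar Anderson acceleration with depth m = 1 for the
    fixed point x = g(x); f(x) = g(x) - x is the residual."""
    x_prev = x0
    gx = g(x_prev)
    f_prev = gx - x_prev
    x = gx                       # plain fixed-point step to start
    for _ in range(max_iter):
        f = g(x) - x
        if abs(f) < tol:
            break
        df = f - f_prev
        theta = f / df if df != 0 else 0.0   # least-squares weight
        # mix the two most recent g-evaluations: g(x) and g(x_prev)
        x_new = (1.0 - theta) * (x + f) + theta * (x_prev + f_prev)
        x_prev, f_prev = x, f
        x = x_new
    return x
```

In the vector setting each iteration costs a handful of dot products and axpy-like updates over the m stored residuals, which is why the timings above are dominated by exactly the vector kernels benchmarked on the previous slide.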
Anderson Acceleration Solver Performance
CPU-GPU computation layout
§ Applied 16 iterations with a simple function, f(x) = x; touches 2 vectors.
§ CPU runs used all 16 cores and 16 threads.
§ Timed 4 combinations:
— Both function evaluation (FE) and vector operations on the same side of the bus.
— And on opposite sides: CPU(GPU) = vectors on CPU, FE on GPU.
§ Fastest times occur when vectors are on the GPU.
§ FE on the opposite side of the bus causes a data transfer every iteration.
§ For this "lightweight" function, it is not worth computing it on the GPU if the vector operations are not also on the GPU (in fact, this gives the worst performance).
Anderson Acceleration Solver Performance
Data transfer overhead
[Plot: cost to transfer one vector across the PCI bus as a % of the cost of AA operations (mostly vector ops), for various vector sizes.]
§ 16 iterations with m = 16.
§ For small vectors, the cost to transfer is a much higher % of AA time on the CPU than on the GPU.
§ For large vectors, the cost to transfer is a far larger % of AA time on the GPU.
§ AA is fast on the GPU, so the bandwidth cost of data transfer significantly increases compared to the cost of vector ops on the GPU.
Other attempts to port SUNDIALS to GPU
Stone & Davis 2013
§ Test case: multi-run simulation of a combustion kinetics problem cast in terms of stiff ODEs. Model and solver both run on the GPU.
§ Single ODE system is small (19 equations), number of runs is large.
Excerpt from the paper page reproduced on the slide:

"…performance of the RKF45 algorithm can be directly compared with the baseline DVODE and GPU-based CVODE algorithms. Figures 4 and 5 show the run time and speed up of the serial CPU and parallel GPU ODE solvers. The speed up is presented as a function of the number of ODEs solved per kernel, Node, and is relative to the serial CPU DVODE run time. Some variation in cost is observed between the codes for small Node due to differences in the ICs; however, these differences are largely averaged away after 10^3 ODEs. Beyond that point, both DVODE and RKF45 scale linearly with Node. Overall, the DVODE run time is approximately 50% faster, despite taking twice the number of integration steps (on average). The cost saving in DVODE comes largely from the reduced number of RHS evaluations per integration step. For instance, DVODE required only 1.78 RHS evaluations per step on average, compared with six for RKF45 for Node = 50,000. The number of failed integration steps in RKF45 was minor at this point: only 5.6% of the integration steps failed, forcing refinement. As noted previously, the sequential cost-saving measures taken by DVODE (e.g., Jacobian recycling) may prove counterproductive in the many-core GPU environment. However, the CUDA RKF45 implementations must overcome nearly a 50% performance penalty to break even with DVODE.

C. Ordinary Differential Equation Performance: GPU

The performance of the GPU implementations of the CVODE and RKF45 ODE solvers is now analyzed relative to the baseline DVODE and RKF45 solvers executed serially on the CPU. Referring again to Fig. 5, the CUDA-CVODE one-thread performance is seen to be many times slower than DVODE until Node exceeds 10^3. After this breakeven point, the speed up with the CUDA-CVODE one-thread grows slowly, eventually reaching a steady 7.7x speed up over the baseline DVODE CPU solver. CUDA-RKF45 one-thread follows a similar scaling trend but is consistently 2.3x faster than CUDA-CVODE one-thread over the entire range of Node. The breakeven point with the CUDA-RKF45 one-thread solver is between 10^2 and 10^3 ODEs; the speed up is already 2.4x at only 10^3 ODEs. The maximum speed up for the CUDA-RKF45 one-thread solver is 20.2x over the serial DVODE run time. This 20.2x speed up matches closely to the CUDA RHS speed up previously reported, suggesting that the RHS function is the limiting factor in the throughput using the CUDA-RKF45 one-thread method. Note that CUDA-RKF45 one-thread is 28.6x faster than the serial CPU implementation of RKF45. Both one-thread versions of CUDA-CVODE and CUDA-RKF45 suffer from poor performance when Node is small. This is consistent with the one-thread RHS-only results shown earlier in Fig. 2. The CUDA-RKF45 one-block breakeven point is only slightly greater than 10 ODEs: far sooner than either one-thread ODE implementation and also sooner than was observed for the one-block RHS-only results. CUDA-RKF45 one-block quickly reaches a maximum speed up of 10.7x relative to DVODE at N ≈ 10^4. Again, the CUDA-RKF45 one-block maximum speed up matches closely to the one-block RHS implementation (approximately 11x speed up), clarifying that the RHS function is the limiting factor for both CUDA-RKF45 implementations. The CUDA-CVODE one-block performance is nearly identical to CUDA-RKF45 one-block for small Node but achieves only a 7.3x speed up for large Node.

The relative overhead cost can be inferred by referring back to Fig. 2. The absolute overhead for the ODE solvers is the same as the RHS-only performance test. Recall that the RHS function must be called at least 60 times (a minimum of 10 time steps) by the RKF45 ODE solver. There are similar lower limits for CVODE as well. This effectively amortizes the overhead reported in Fig. 2 over many more RHS function evaluations and reduces the relative overhead. The CUDA-CVODE one-block overhead accounts for only 1.5% of the total run time when Node is less than 100 and quickly drops below 0.1% for large Node. Obviously, the data transfer and memory allocation overhead has little impact on the peak CUDA ODE solver performance.

The preceding benchmarks showed the performance of the various ODE solvers on a database of ICs taken from actual LEM simulations. In these simulations, hundreds of LEM cells are used to discretize the LEM computational domain. Many different LEM simulations are therefore solved concurrently when Node is much greater than 10^3. Recall that the LEM can be viewed as a 1-D DNS method, and therefore the concentration profiles and temperature should vary smoothly throughout the domain. For example, Fig. 1 showed a non-premixed combustion simulation with 241 LEM cells. Because the profiles vary smoothly, neighboring LEM cells within …"
[Figure residue from the reproduced page; recoverable captions:]
Fig. 3 Numerical difference between DVODE and RKF45 over 50,000 ODEs. Difference shown in the L2 and L∞ norms.
Fig. 4 Total run time of the baseline CPU DVODE and RKF45 solvers and their CUDA counterparts.
Fig. 5 Speed up of RKF45 and the CUDA ODE solvers relative to the baseline CPU DVODE run time.
§ Used BDF (CVODE) and Runge-Kutta integrators.
§ Parallelization strategies:
— One ODE system per thread.
— One ODE system per block.
§ Reported speedup for CVODE is 7x in the per-block scheme, compared to the serial case.
Stone, Christopher P., and Roger L. Davis. "Techniques for solving stiff chemical kinetics on graphical processing units." Journal of Propulsion and Power 29.4 (2013): 764-773.
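The one-ODE-system-per-thread scheme can be sketched with a plain loop standing in for the GPU grid. This is an illustrative Python sketch, not the authors' CUDA code, and it uses a fixed-step classical RK4 rather than the adaptive RKF45 of the paper.

```python
def rk4_step(f, y, h):
    """Classic fourth-order Runge-Kutta step for a scalar ODE y' = f(y)."""
    k1 = f(y)
    k2 = f(y + 0.5 * h * k1)
    k3 = f(y + 0.5 * h * k2)
    k4 = f(y + h * k3)
    return y + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def solve_batch(f, y0_batch, h, n_steps):
    """Integrate many independent ODE systems.  Each outer loop
    iteration is independent of the others, so on a GPU it would map
    to one thread (or one block) per system, as in the paper's
    one-ODE-per-thread / one-ODE-per-block schemes."""
    results = []
    for y0 in y0_batch:           # the parallel dimension
        y = y0
        for _ in range(n_steps):  # sequential time stepping
            y = rk4_step(f, y, h)
        results.append(y)
    return results
```

The key observation from the paper carries over: throughput comes from the number of independent systems, so speedup only materializes once the batch is large enough (thousands of ODE systems) to fill the device.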
Next Steps
§ Prototype and implement basic data structures for the software stack (vector, matrix):
— Implemented in hypre.
— SUNDIALS and MFEM use wrappers around hypre objects.
§ Separate partitioning from numerical algorithms:
— Parallelize using OpenMP 4 pragmas.
— Use RAJA underneath.
— Use Kokkos underneath.
§ Maintain interoperability of the stack.
§ Port the stack to heterogeneous platforms.
§ Carry out performance analysis of the stack in different configurations and on different hardware.
Conclusions
§ Heterogeneous hardware architectures are promising, but there are still many challenges to overcome. We now operate in an environment with many more degrees of freedom.
§ LLNL Software Stack libraries need to be ported as a whole. Work on individual libraries cannot be done in isolation.
§ Conventional wisdom may not apply – we may need to redefine computational strategies, not simply port the same algorithms to new architectures.
§ Lots of heavy lifting ahead of us: need to create test cases, benchmark performance, analyze and improve algorithms.