LLNL-PRES-688866 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Investigating interoperability and performance portability of select LLNL numerical libraries DOE Center of Excellence Performance Portability Meeting
Glendale, Arizona, April 20, 2016
Slaven Peles, John Loffeld, Carol S. Woodward and Ulrike Yang
Outline
§ Challenges
§ Description of LLNL software stack
§ SUNDIALS library
§ Preliminary performance testing results
§ Future work and conclusion
Challenges
Porting existing codes to heterogeneous hardware architectures
§ Implement numerical algorithms in a way that makes best use of heterogeneous hardware architecture.
§ Develop code that can evolve along with the new hardware -- separate platform-specific from algorithmic parts (RAJA, Kokkos).
§ Total cost of ownership:
— How easy is it to deploy the code in new environments?
— How easy is it to add new features?
— What is the maintenance cost?
§ Technical debt management.
Maximizing performance is but one of several challenges that need to be addressed when moving to new architectures.
LLNL Software Stack
Libraries currently being ported to heterogeneous architectures
§ MFEM: a free, lightweight, scalable C++ library for finite element methods.
§ SUNDIALS: a suite of state-of-the-art numerical integrators and nonlinear solvers.
§ hypre: a library for solving large, sparse linear systems of equations on massively parallel computers.
Maintaining interoperability and performance portability of the software stack is more challenging on heterogeneous architectures.
The combined use of MFEM, hypre and SUNDIALS is critical for the efficient solution of a wide variety of transient PDEs, such as nonlinear elasticity and magnetohydrodynamics.
Numerical simulation and data flow
Use case: implicit integration scheme with iterative linear solver
[Flow diagram: the time integrator step (SUNDIALS) advances until the final time is reached; each step invokes the nonlinear solver step (SUNDIALS), which iterates until converged; each nonlinear iteration invokes the linear solver step (hypre), which iterates until converged and returns the update dx. Finite element tools (MFEM) provide function and Jacobian evaluation, returning the updated residual vector f and Jacobian J, plus the preconditioner P; the updated solution vector x is passed back up the chain.]
§ Time integrator and nonlinear solver are agnostic of vector data layout.
§ Numerical integrators and nonlinear solvers may invoke fairly complex step size control logic.
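The nesting in the diagram can be sketched in a few lines. The following is a minimal, illustrative Python sketch (not SUNDIALS code): backward Euler on the scalar test ODE y' = -y^3, with a Newton iteration whose linear "solve" is a scalar division standing in for the iterative linear solver stage.

```python
def f(y):          # right-hand side of the ODE y' = f(y)
    return -y**3

def f_prime(y):    # Jacobian of f (a scalar here)
    return -3.0 * y**2

def implicit_euler_step(y_prev, h, tol=1e-12, max_newton=20):
    """One backward-Euler step: solve y - y_prev - h*f(y) = 0 for y."""
    y = y_prev  # initial guess: previous solution
    for _ in range(max_newton):
        residual = y - y_prev - h * f(y)       # nonlinear residual
        if abs(residual) < tol:
            break
        jacobian = 1.0 - h * f_prime(y)        # J = I - h * df/dy
        dy = -residual / jacobian              # "linear solver step"
        y += dy                                # Newton update
    return y

def integrate(y0, h, t_final):
    """Time integrator loop: step until the final time is reached."""
    y, t = y0, 0.0
    while t < t_final:
        y = implicit_euler_step(y, h)
        t += h
    return y
```

For a PDE-sized system, the scalar division becomes a preconditioned iterative linear solve, which is where the hypre stage of the diagram enters.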
How to lay out the computation?
Considering constraints of heterogeneous architectures
§ GPU processing power >> CPU processing power.
§ GPU memory << CPU memory.
§ Moving data between CPU and GPU is expensive.
§ Proposed computation layouts:

  Numerical Integrator | Linear Solver | Function Evaluation
  GPU                  | GPU           | GPU
  CPU                  | GPU           | GPU
  CPU                  | GPU           | CPU
  CPU                  | CPU           | GPU
  CPU/GPU              | CPU/GPU       | CPU/GPU

§ Best layout is most likely problem dependent.
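The trade-off in the table can be made concrete with a back-of-the-envelope cost model. The sketch below is illustrative only (all rates are hypothetical, not measured): a layout pays for its compute time plus one bus transfer per vector that crosses the CPU-GPU boundary each step.

```python
def layout_cost(n_elems, flops_per_elem, device_gflops,
                transfers_per_step, bus_gbps, word=8):
    """Rough per-step cost (seconds) of a computation layout:
    compute time plus CPU<->GPU transfer time over the bus."""
    compute = n_elems * flops_per_elem / (device_gflops * 1e9)
    traffic = transfers_per_step * n_elems * word / (bus_gbps * 1e9)
    return compute + traffic

# Hypothetical rates: 1 TFLOP/s device, 10 GB/s bus, 1M-element vectors.
all_gpu = layout_cost(10**6, 10, 1000.0, 0, 10.0)  # everything on GPU
split   = layout_cost(10**6, 10, 1000.0, 2, 10.0)  # FE across the bus
```

Even with identical compute rates, two transfers per step dominate: the split layout is roughly two orders of magnitude slower in this toy model, echoing the bullet that moving data between CPU and GPU is expensive.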
SUNDIALS
Suite of state-of-the-art numerical integrators and nonlinear solvers
§ Forward-looking, extensible object-oriented design with simple and clean linear solver and vector interfaces.
§ Designed to be incorporated into existing codes.
§ Modular structure allows users to supply their own data structures.
§ Scales well in simulations on over 500,000 cores.
§ Supplied with serial, MPI and thread-parallel (OpenMP and Pthreads) structures.
§ CMake support for configuration and build.
§ Freely available, released under BSD license; over 4,500 downloads per year.
§ Modules and functionality:
— ODE integrators: (CVODE) variable order and step stiff BDF and non-stiff Adams; (ARKode) variable step implicit, explicit, and additive Runge-Kutta for IMEX approaches.
— DAE integrator: (IDA) variable order and step stiff BDF.
— CVODES and IDAS include forward and adjoint sensitivity capabilities.
— KINSOL nonlinear solver: Newton-Krylov and accelerated fixed point and Picard methods.
SUNDIALS
Used in industrial and academic applications worldwide
§ Power grid modeling (RTE France, ISU)
§ Simulation of clutches and powertrain parts (LuK GmbH & Co.)
§ Electrical and heat generation within battery cells (CD-adapco)
§ 3D parallel fusion (SMU, U. York, LLNL)
§ Implicit hydrodynamics in core collapse supernova (Stony Brook)
§ Dislocation dynamics (LLNL)
§ Sensitivity analysis of chemically reacting flows (Sandia)
§ Large-scale subsurface flows (CO Mines, LLNL)
§ Optimization in simulation of energy-producing algae (NREL)
§ Micromagnetic simulations (U. Southampton)
[Images: magnetic reconnection, core collapse supernova, dislocation dynamics, subsurface flow]
Interfacing SUNDIALS with other software

Vector interface:
§ Specifies:
— 3 constructors/destructors.
— 3 utility functions.
— 9 streaming operators.
— 10 reduction operators.
§ Entire interaction with application data is carried out through these 19 operators.
§ All are level-1 BLAS operators.
§ Individual modules require only a subset of these operators.

Linear solver interface:
§ Specifies the following five functions: init, setup, solve, perf and free.
§ SUNDIALS only requests linear solves at specific points. It is independent of the linear solve strategy.
§ Implementation of the hypre linear solver interface is in progress.

Object-oriented design and well-defined interfaces simplify porting SUNDIALS to new platforms.
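The point of the vector interface can be shown with a toy example. The sketch below is not the actual N_Vector API; the class and function names are invented for illustration. It shows a solver (a simple Richardson iteration, standing in for a Krylov method) written entirely against a handful of vector operations, so changing the data layout only requires reimplementing those operations, not the solver.

```python
class SimpleVector:
    """Illustrative stand-in for an N_Vector-like object: the solver
    below touches data only through these operations."""
    def __init__(self, data):
        self.data = list(data)

    def clone(self):                    # constructor-style operator
        return SimpleVector(self.data)

    def linear_sum(self, a, x, b, y):   # streaming: self = a*x + b*y
        self.data = [a * xi + b * yi for xi, yi in zip(x.data, y.data)]

    def dot(self, other):               # reduction: <self, other>
        return sum(si * oi for si, oi in zip(self.data, other.data))

def norm(v):
    return v.dot(v) ** 0.5

def richardson_solve(matvec, b, tol=1e-10, omega=0.5, max_iter=200):
    """Solve A x = b by Richardson iteration, using only the vector
    operations above (never touching v.data directly)."""
    x = b.clone()
    r = b.clone()
    for _ in range(max_iter):
        Ax = matvec(x)
        r.linear_sum(1.0, b, -1.0, Ax)   # r = b - A x
        if norm(r) < tol:
            break
        x.linear_sum(1.0, x, omega, r)   # x = x + omega * r
    return x
```

Swapping `SimpleVector` for a GPU-resident implementation of the same operators would leave `richardson_solve` untouched, which is the portability argument made on this slide.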
SUNDIALS vector kernels performance testing
Standalone kernel performance
[Plots: fraction of theoretical peak bandwidth vs. number of vector elements (thousands) for the axpy and dot product kernels on CPU and GPU, and the GPU speedup factor over CPU for both kernels.]
§ Compared CUDA implementation with corresponding threaded MKL calls.
§ Peak bandwidth of the GPU is ~3x larger than CPU bandwidth on the test hardware.
§ axpy shows speedup on GPU for N > 10^4, dot product for N > 10^6.
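The "fraction of peak bandwidth" metric in these plots comes from simple byte counting. The sketch below is illustrative (the rates in the example are hypothetical, not the test hardware's): axpy streams three words per element (read x, read y, write y) while a dot product streams two, which is why the two kernels saturate bandwidth at different vector lengths.

```python
def axpy_bytes(n, word=8):
    # y = a*x + y reads x and y and writes y: 3 memory ops per element
    return 3 * word * n

def dot_bytes(n, word=8):
    # dot(x, y) reads x and y: 2 memory ops per element
    return 2 * word * n

def peak_fraction(bytes_moved, seconds, peak_gbps):
    """Fraction of theoretical peak bandwidth actually sustained."""
    achieved = bytes_moved / seconds / 1e9   # GB/s sustained
    return achieved / peak_gbps
```

For example, an axpy over 10^6 doubles moves 24 MB; if it runs in 0.1 ms it sustains 240 GB/s, and against a hypothetical 288 GB/s peak that is a fraction of about 0.83.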
Anderson Acceleration Solver Performance
A simple nonlinear solver method implemented in SUNDIALS
[Plots: run times for CPU and GPU versions (function cost not timed), for Its = 16 with m = 16 and with m = 4.]
§ For vectors of fewer than 10,000 elements, CPU versions take less time than the GPU version.
§ CPU version costs remain approximately constant until vector lengths reach 100.
§ GPU version cost is constant until the vector length is 10,000 – the length at which the work per vector dominates the overhead per vector operation.
§ Times approach linear growth with vector length.
§ When both CPU and GPU versions are in the linear regime, we expect the ratio between timings to be approximately the ratio of bandwidths.
§ Threading reduces runtime on the CPU.
§ GPU gives more benefit on large problems.
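For reference, the method being timed can be sketched compactly. Below is a minimal, illustrative scalar Anderson acceleration with depth m = 1 (SUNDIALS's KINSOL implements the general vector version with arbitrary depth m); it mixes the two most recent fixed-point evaluations using a least-squares weight.

```python
def anderson_m1(g, x0, tol=1e-12, max_iter=50):
    """Scalar Anderson acceleration with depth m = 1 for the
    fixed point x = g(x); f(x) = g(x) - x is the residual."""
    x_prev = x0
    gx = g(x_prev)
    f_prev = gx - x_prev
    x = gx                       # plain fixed-point step to start
    for _ in range(max_iter):
        f = g(x) - x
        if abs(f) < tol:
            break
        df = f - f_prev
        theta = f / df if df != 0 else 0.0   # least-squares weight
        # mix the two most recent g-evaluations: g(x) and g(x_prev)
        x_new = (1.0 - theta) * (x + f) + theta * (x_prev + f_prev)
        x_prev, f_prev = x, f
        x = x_new
    return x
```

In the vector setting each iteration costs a handful of dot products and axpy-like updates over the m stored residuals, which is why the timings above are dominated by exactly the vector kernels benchmarked on the previous slide.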
Anderson Acceleration Solver Performance
CPU-GPU computation layout
§ Applied 16 iterations with a simple function, f(x) = x; touches 2 vectors.
§ CPU runs used all 16 cores and 16 threads.
§ Timed 4 combinations:
— Both function evaluation (FE) and vector operations on the same side of the bus.
— And on opposite sides: CPU(GPU) = vectors on CPU, FE on GPU.
§ Fastest times occur when vectors are on the GPU.
§ FE on the opposite side of the bus causes a data transfer every iteration.
§ For this "lightweight" function, it is not worth computing it on the GPU if the vector operations are not also on the GPU (in fact, this gives the worst performance).
Anderson Acceleration Solver Performance
Data transfer overhead
[Plot: cost to transfer one vector across the PCI bus as a % of the cost of AA operations (mostly vector ops), for various vector sizes.]
§ 16 iterations with m = 16.
§ For small vectors, the cost to transfer is a much higher % of AA time on the CPU than on the GPU.
§ For large vectors, the cost to transfer is a far larger % of AA time on the GPU.
§ AA is fast on the GPU, so the bandwidth cost of data transfer significantly increases compared to the cost of vector ops on the GPU.
Other attempts to port SUNDIALS to GPU
Stone & Davis 2013
§ Test case: multi-run simulation of a combustion kinetics problem cast in terms of stiff ODEs. Model and solver both run on the GPU.
§ Single ODE system is small (19 equations), number of runs is large.
Excerpt from the paper page reproduced on the slide:

"…performance of the RKF45 algorithm can be directly compared with the baseline DVODE and GPU-based CVODE algorithms. Figures 4 and 5 show the run time and speed up of the serial CPU and parallel GPU ODE solvers. The speed up is presented as a function of the number of ODEs solved per kernel, Node, and is relative to the serial CPU DVODE run time. Some variation in cost is observed between the codes for small Node due to differences in the ICs; however, these differences are largely averaged away after 10^3 ODEs. Beyond that point, both DVODE and RKF45 scale linearly with Node. Overall, the DVODE run time is approximately 50% faster, despite taking twice the number of integration steps (on average). The cost saving in DVODE comes largely from the reduced number of RHS evaluations per integration step. For instance, DVODE required only 1.78 RHS evaluations per step on average, compared with six for RKF45 for Node = 50,000. The number of failed integration steps in RKF45 was minor at this point: only 5.6% of the integration steps failed, forcing refinement. As noted previously, the sequential cost-saving measures taken by DVODE (e.g., Jacobian recycling) may prove counterproductive in the many-core GPU environment. However, the CUDA RKF45 implementations must overcome nearly a 50% performance penalty to break even with DVODE.

C. Ordinary Differential Equation Performance: GPU

The performance of the GPU implementations of the CVODE and RKF45 ODE solvers is now analyzed relative to the baseline DVODE and RKF45 solvers executed serially on the CPU. Referring again to Fig. 5, the CUDA-CVODE one-thread performance is seen to be many times slower than DVODE until Node exceeds 10^3. After this breakeven point, the speed up with the CUDA-CVODE one-thread grows slowly, eventually reaching a steady 7.7x speed up over the baseline DVODE CPU solver. CUDA-RKF45 one-thread follows a similar scaling trend but is consistently 2.3x faster than CUDA-CVODE one-thread over the entire range of Node. The breakeven point with the CUDA-RKF45 one-thread solver is between 10^2 and 10^3 ODEs; the speed up is already 2.4x at only 10^3 ODEs. The maximum speed up for the CUDA-RKF45 one-thread solver is 20.2x over the serial DVODE run time. This 20.2x speed up matches closely to the CUDA RHS speed up previously reported, suggesting that the RHS function is the limiting factor in the throughput using the CUDA-RKF45 one-thread method. Note that CUDA-RKF45 one-thread is 28.6x faster than the serial CPU implementation of RKF45. Both one-thread versions of CUDA-CVODE and CUDA-RKF45 suffer from poor performance when Node is small. This is consistent with the one-thread RHS-only results shown earlier in Fig. 2. The CUDA-RKF45 one-block breakeven point is only slightly greater than 10 ODEs: far sooner than either one-thread ODE implementation and also sooner than was observed for the one-block RHS-only results. CUDA-RKF45 one-block quickly reaches a maximum speed up of 10.7x relative to DVODE at N ≈ 10^4. Again, the CUDA-RKF45 one-block maximum speed up matches closely to the one-block RHS implementation (approximately 11x speed up), clarifying that the RHS function is the limiting factor for both CUDA-RKF45 implementations. The CUDA-CVODE one-block performance is nearly identical to CUDA-RKF45 one-block for small Node but achieves only a 7.3x speed up for large Node.

The relative overhead cost can be inferred by referring back to Fig. 2. The absolute overhead for the ODE solvers is the same as the RHS-only performance test. Recall that the RHS function must be called at least 60 times (a minimum of 10 time steps) by the RKF45 ODE solver. There are similar lower limits for CVODE as well. This effectively amortizes the overhead reported in Fig. 2 over many more RHS function evaluations and reduces the relative overhead. The CUDA-CVODE one-block overhead accounts for only 1.5% of the total run time when Node is less than 100 and quickly drops below 0.1% for large Node. Obviously, the data transfer and memory allocation overhead has little impact on the peak CUDA ODE solver performance.

The preceding benchmarks showed the performance of the various ODE solvers on a database of ICs taken from actual LEM simulations. In these simulations, hundreds of LEM cells are used to discretize the LEM computational domain. Many different LEM simulations are therefore solved concurrently when Node is much greater than 10^3. Recall that the LEM can be viewed as a 1-D DNS method, and therefore the concentration profiles and temperature should vary smoothly throughout the domain. For example, Fig. 1 showed a non-premixed combustion simulation with 241 LEM cells. Because the profiles vary smoothly, neighboring LEM cells within …"
[Figure residue from the reproduced page; recoverable captions:]
Fig. 3 Numerical difference between DVODE and RKF45 over 50,000 ODEs. Difference shown in the L2 and L∞ norms.
Fig. 4 Total run time of the baseline CPU DVODE and RKF45 solvers and their CUDA counterparts.
Fig. 5 Speed up of RKF45 and the CUDA ODE solvers relative to the baseline CPU DVODE run time.
§ Used BDF (CVODE) and Runge-Kutta integrators.
§ Parallelization strategies:
— One ODE system per thread.
— One ODE system per block.
§ Reported speedup for CVODE is 7x in the per-block scheme, compared to the serial case.
Stone, Christopher P., and Roger L. Davis. "Techniques for solving stiff chemical kinetics on graphical processing units." Journal of Propulsion and Power 29.4 (2013): 764-773.
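The one-ODE-system-per-thread scheme can be sketched with a plain loop standing in for the GPU grid. This is an illustrative Python sketch, not the authors' CUDA code, and it uses a fixed-step classical RK4 rather than the adaptive RKF45 of the paper.

```python
def rk4_step(f, y, h):
    """Classic fourth-order Runge-Kutta step for a scalar ODE y' = f(y)."""
    k1 = f(y)
    k2 = f(y + 0.5 * h * k1)
    k3 = f(y + 0.5 * h * k2)
    k4 = f(y + h * k3)
    return y + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def solve_batch(f, y0_batch, h, n_steps):
    """Integrate many independent ODE systems.  Each outer loop
    iteration is independent of the others, so on a GPU it would map
    to one thread (or one block) per system, as in the paper's
    one-ODE-per-thread / one-ODE-per-block schemes."""
    results = []
    for y0 in y0_batch:           # the parallel dimension
        y = y0
        for _ in range(n_steps):  # sequential time stepping
            y = rk4_step(f, y, h)
        results.append(y)
    return results
```

The key observation from the paper carries over: throughput comes from the number of independent systems, so speedup only materializes once the batch is large enough (thousands of ODE systems) to fill the device.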
Next Steps
§ Prototype and implement basic data structures for the software stack (vector, matrix):
— Implemented in hypre.
— SUNDIALS and MFEM use wrappers around hypre objects.
§ Separate partitioning from numerical algorithms:
— Parallelize using OpenMP 4 pragmas.
— Use RAJA underneath.
— Use Kokkos underneath.
§ Maintain interoperability of the stack.
§ Port the stack to heterogeneous platforms.
§ Carry out performance analysis of the stack in different configurations and on different hardware.
Conclusions
§ Heterogeneous hardware architectures are promising, but there are still many challenges to overcome. We now operate in an environment with many more degrees of freedom.
§ LLNL Software Stack libraries need to be ported as a whole. Work on individual libraries cannot be done in isolation.
§ Conventional wisdom may not apply – we may need to redefine computational strategies, not simply port the same algorithms to new architectures.
§ Lots of heavy lifting ahead of us: need to create test cases, benchmark performance, analyze and improve algorithms.