LBNL‐1014E
NERSC-6 Workload Analysis and Benchmark Selection Process

Katie Antypas, John Shalf, and Harvey Wasserman
National Energy Research Scientific Computing Center Division Ernest Orlando Lawrence Berkeley National Laboratory
Berkeley, CA 94720
August 13, 2008
This work was supported by the U.S. Department of Energy’s Office of Science, Office of Advanced Scientific Computing Research, Mathematical, Information and Computational
Sciences Division under Contract No. DE-AC02-05CH11231.
Abstract
This report describes efforts carried out during early 2008 to determine some of the science drivers for the "NERSC-6" next-generation high-performance computing system acquisition. Although the starting point was existing Greenbooks from DOE and the NERSC User Group, the main contribution of this work is an analysis of the current NERSC computational workload combined with requirements information elicited from key users and other scientists about expected needs in the 2009-2011 timeframe. The NERSC workload is described in terms of science areas, computer codes supporting research within those areas, and descriptions of the key algorithms that comprise the codes. This work was carried out in large part to help select a small set of benchmark programs that accurately capture the science and algorithmic characteristics of the workload. The report concludes with a description of the codes selected and some preliminary performance data for them on several important systems.
NERSC-6 Workload Analysis and Benchmark Selection Process
Introduction

NERSC's science-driven strategy for increasing research productivity focuses on the balanced and timely introduction of the best new technologies to benefit the broadest subset of the NERSC workload. The goal in procuring new systems is to enable new science discoveries via computations of a scale that greatly exceeds what is possible on current systems. The key to understanding the requirements for a new system is to translate scientific needs expressed by simulation scientists into a set of computational demands, and from these into a set of hardware and software attributes required to support them. This combination of requirements and demands is abstracted into a set of representative application benchmarks that by necessity also capture characteristic styles of coding and thus assure a workload-driven evaluation. The importance of workload-based performance analysis was best expressed by Ingrid Bucher and Joanne Martin in a 1982 LANL technical report, where they stated, "Benchmarks are only useful insofar as they model the intended computational workload."

At NERSC these carefully chosen benchmarks go beyond representing a set of individual scientific and computational needs. The benchmarks also comprise the Sustained System Performance (SSP) metric, an aggregate measure of the real, delivered performance of a computing system. The SSP is evaluated at discrete points in time and also as an integrated value to give the system's potency, meaning an estimate of how well the system will perform the expected work over some time period. Taken together, the NERSC SSP benchmarks, their derived aggregate measures, and the entire NERSC workload-driven evaluation methodology create a strong connection between science requirements, how the machines are used, and the tests we use.

This document begins with an examination of the key science drivers for DOE high-end computation that were identified in the 2005 Greenbook report. We identify any changes in the science drivers and commensurate requirements that might affect the selection of next-generation systems. We then provide an introduction to NERSC's workload analysis and benchmark selection process. Finally, we present an analysis of the performance of the selected benchmark applications. A brief snapshot of benchmark performance as of May 2008 is presented. These are preliminary data, measured, in general, with no attempt at optimization; in other words, representing what a user might observe upon initial porting of the codes. In some cases the data contain estimates, either because the full run is too long or because it requires more cores than we have access to on the given machine.
Overview of the NERSC Workload

NERSC serves a diverse user community of over 3000 users and 400 distinct projects. The workload comprises some 600 codes serving the diverse science needs of the DOE Office of Science research community. Despite the diversity of codes, the median job size at NERSC is a substantial fraction of the overall size of NERSC's flagship XT4 computing system. NERSC's workload covers the needs of 15 science areas across six DOE Office of Science divisions, in addition to users from other research and academic institutions outside of the DOE. The pie charts in Figure 1 show the percentages of cycles at NERSC awarded to the various DOE offices and science areas.
Figure 1: Awards by DOE office (left) and awards by science area (right).
In this study we focused on the largest components of the NERSC workload, which include Fusion, Chemistry, Materials Science, Climate Research, Lattice Gauge Theory, Accelerator Physics, Astrophysics, Life Sciences, and Nuclear Physics, as shown in Figure 2.
Figure 2: The NERSC workload study focused on dominant contributors to the NERSC workload.
Climate Science

Climate science computing at NERSC includes a variety of prediction and simulation activities, including critical support of the U.S. submission to the Intergovernmental Panel on Climate Change (IPCC). Figure 3 shows that climate science allocations are dominated by CAM, the Community Climate System Model (CCSM3), and weather modeling codes. There are INCITE allocations that are also large players, but because INCITE is not a steady-state allocation we do not consider these projects further. With the INCITE allocations removed, the dominance of CAM and CCSM3 is clear (Table 1). Furthermore, CAM is the dominant computational component of the fully coupled CCSM3 climate model. Therefore, CAM is a good proxy for the most demanding components of the climate workload.
Figure 3: Distribution of climate allocation with and without the contribution of the INCITE allocations. The INCITE awards for "forecast" and "filter" codes are much larger than they are in the steady-state workload.
Table 1: Breakdown of the NERSC workload in Climate Sciences. CCSM and CAM account for >74% of the workload. Since CAM is the dominant component of CCSM, CAM is clearly a good proxy for the climate requirements.
In order to be useful for climate research, CAM must achieve a target performance of 1000x faster than real time. This places a stringent requirement on the achieved performance of the climate application on the target system architecture. The current mesh resolution ultimately limits the exposable parallelism of the application. However, if the resolution is increased, the size of the simulation timesteps must be reduced to satisfy the Courant stability condition. Increasing the resolution by 2x requires 4x more time due to the shorter timesteps. Therefore, the climate application pushes towards higher per-processor sustained performance, which runs counter to current microprocessor architectural trends.
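As a rough illustration of the Courant-condition arithmetic above, the following sketch uses hypothetical constants (wave speed, Courant number) to show how refining the mesh shrinks the stable explicit timestep, so a fixed simulated period needs proportionally more steps, on top of the extra per-step work from the larger number of grid points:

```python
# Illustrative cost model for the Courant (CFL) constraint discussed above.
# All constants here are hypothetical; only the scaling matters: halving the
# grid spacing halves the largest stable explicit timestep.

def cfl_timestep(dx, wave_speed, courant=0.5):
    """Largest stable explicit timestep for grid spacing dx."""
    return courant * dx / wave_speed

def steps_needed(dx, wave_speed, simulated_time):
    """Number of timesteps to cover a fixed simulated period."""
    return simulated_time / cfl_timestep(dx, wave_speed)

base = steps_needed(1.0, wave_speed=1.0, simulated_time=100.0)
refined = steps_needed(0.5, wave_speed=1.0, simulated_time=100.0)
print(refined / base)   # 2.0: the timestep penalty alone, before grid growth
```

The 2x step-count factor is the part of the cost that cannot be recovered by adding processors to a fixed-size grid, which is why the text notes the pressure toward higher per-processor sustained performance.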
Although spectral CAM dominates the current workload, the IPCC community is rapidly migrating to the finite-volume formulation of CAM (fvCAM). Finite-volume CAM is more scalable than spectral CAM and is better able to exploit loop-level parallelism using a hybrid OpenMP+MPI programming model, and can therefore take advantage of SMP nodes. The hybrid model may mitigate the need for improved per-processor serial performance but would require wider SMP nodes. In anticipation of the larger mesh resolution targets, we selected the D mesh (~0.5 degree), which is likely to be the production resolution for some time.
Materials Science

Simulations in materials science play an indispensable role in nanoscience research, an area identified as one of the seven highest priorities for the Office of Science. Of particular interest recently are studies of quantum dots, due to their emerging importance in the miniaturization of electronic and optoelectronic devices, and computational needs seem virtually unlimited. For example, two specific applications that scientists in this area foresee include the following:

• Electronic structures and magnetic materials. The current state of the art is a 500-atom structure, which requires 1.0 Tflop/s sustained performance (typically for several hours) and 100 GB main memory. Future requirements (a hard disk drive simulation, for instance) are for a 5000-atom structure, which will require roughly 30 Tflop/s and 4 TB main memory.

• Molecular dynamics calculations. The current state of the art is for a 10^9-atom structure, which requires 1 Tflop/s sustained and 50 GB memory. Future requirements (an alloy microstructure analysis) will require 20 Tflop/s sustained and 5 TB main memory.
The materials workload presents a dizzying array of codes, as shown in Figure 4. There are 65 accounts under the BES/materials science category, which account for 20% of the 324 total NERSC accounts. The total HPC allocation for these accounts was 4.1M hours before Bassi was online and 8.8M hours after Bassi was online. Although this allocation as a fraction of the total is smaller than it was a few years ago, it still represents 13% of the total NERSC allocation (66.7M hours after Bassi was online). The materials science percentage has decreased partly because of the creation of special programs like SciDAC and INCITE. In this science area there is an average of about one code per account (or group of users). However, since the same code can be used by different accounts, one code is used by 2.15 user groups on average. Except for the VASP code, which is used by 23 groups, the majority of the codes are used by fewer than 5 groups, and many of the codes are used by only one group. This is a very diverse community, with many groups using their own codes.
Figure 4: The materials science area presents a dizzying array of codes, none of which dominates the workload.
The different codes can be classified into six categories based on their physics and the corresponding algorithms. These categories, and their corresponding allocations, are: density functional theory (DFT, 74%); beyond-DFT (GW+BSE, 6.9%); quantum Monte Carlo (QMC, 6.7%); classical molecular dynamics (CMD, 6.4%); classical Monte Carlo (CMC, 3.1%); and other partial differential equations (PDE, 2.9%). There is an overwhelming preference for the DFT method, owing to the current success of that method in ab initio materials science simulations. However, the DFT method itself has different numerical approaches such as plane-wave DFT, Green's function DFT, localized basis and orbital DFT, muffin-tin sphere type DFT, and real-space grid DFT. The most popular (both in terms of number of codes and HPC hours) is plane-wave DFT, for which there are 12 codes accounting for 1.6M hours.
Density Functional Theory (DFT)

In the DFT method, one needs to solve the single-particle Schrödinger equation (a second-order partial differential equation). Typically, 5-10% of the lowest eigenvectors are needed from this solution, and the number of eigenvectors is proportional to the number of electrons in the system. For example, for a thousand-atom system, a few thousand eigenvectors are needed. There is a key difference between the complexity of this approach and that of most engineering problems (e.g., fluid dynamics, climate simulation, combustion, electromagnetics), where only a small, fixed number of time-evolving or eigenvector fields are solved. In DFT, the Schrödinger equation itself depends on the eigenvectors through the density function (which gives rise to the name density functional theory). Thus, it is a nonlinear problem, but one that can be solved by self-consistent iteration of the linearized eigenstate problem represented by Schrödinger's equation, or by direct nonlinear minimization. Currently, most large-scale calculations are done using self-consistent iterations. Numerically, what distinguishes the different DFT methods and codes are the different basis sets used to describe the wavefunctions (the eigenvectors).
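The self-consistent iteration described above can be sketched with a deliberately tiny 2x2 model: the Hamiltonian depends on the density, the density comes from the occupied eigenvector, and the two are iterated to a fixed point. Every number and the linear-mixing scheme below are hypothetical; a production code performs the same loop with matrices of dimension in the millions.

```python
import math

# Toy self-consistent-field (SCF) loop: H depends on the density n, which in
# turn comes from the lowest eigenvector of H. Parameters are illustrative.

def lowest_eigvec(a, b, c):
    """Lowest eigenpair of the symmetric 2x2 matrix [[a, b], [b, c]], b != 0."""
    lam = 0.5 * (a + c) - math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    v = (-b, a - lam)                  # satisfies (A - lam I) v = 0
    norm = math.hypot(v[0], v[1])
    return lam, (v[0] / norm, v[1] / norm)

def scf(u=1.0, t=0.5, mix=0.5, tol=1e-10):
    n = 0.5                            # initial density guess on site 0
    for it in range(200):
        # The Hamiltonian depends on the density: this is the DFT nonlinearity
        _, psi = lowest_eigvec(u * n, -t, 0.0)
        n_new = psi[0] ** 2            # density from the occupied eigenvector
        if abs(n_new - n) < tol:
            return n_new, it
        n = (1 - mix) * n + mix * n_new    # linear mixing for stability
    raise RuntimeError("SCF did not converge")

density, iters = scf()
print(round(density, 6), iters)
```

At convergence the density reproduces itself through the eigenproblem, which is exactly the self-consistency condition the text describes.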
Plane-wave DFT

Plane-wave DFT uses plane waves to describe the wavefunctions, while real-space DFT uses a regular real-space grid. Due to the sharp peak of the potential near atomic nuclei, special care is needed to choose different basis sets. Besides the plane-wave and real-space grid, other conventional basis sets include: the atomic orbital basis set, where the eigenvectors of the atomic Schrödinger equation are used to describe the wavefunctions in a solid or molecule; Gaussian basis sets, which are more often used in quantum chemistry due to their analytical properties; the muffin-tin basis, where a spherical hole is cast out near each nucleus and spherical harmonics and Bessel functions are used to describe the wavefunction inside the hole; augmented plane waves, where spherical Bessel functions near the nuclei are connected with the plane waves in the interstitial regions and used as the basis set; and wavelet basis sets.

In terms of the methods used to solve the eigenstate problem, both iterative schemes and direct eigensolvers have been used in different codes. In plane-wave DFT, an iterative method (e.g., conjugate gradient) is often used; in the atomic orbital, Gaussian, and augmented plane-wave (FLAPW) methods, direct dense solvers (ScaLAPACK) are often used; and in real-space grid methods, sparse matrix solvers are used.

There are three key components of the iterative plane-wave method: the 3D global FFT for converting between real and reciprocal space, orthogonalization of basis functions using dense linear algebra, and pseudopotential calculations using dense linear algebra. The most time-consuming steps are the matrix-vector multiplication and vector-vector multiplication for orthogonalization. For highly parallel computation, the FFT is the primary bottleneck. However, if the problem size is increased, the computational requirements of the dense linear algebra grow as O(N^3), whereas the FFT requirements grow as O(N^2). Therefore, weak-scaling a DFT problem can produce dramatic improvements in the delivered flop rates given the locality of the linear-algebra calculation; but the benefits of weak scaling are highly problem dependent. For example, time-domain experiments do not benefit at all from this effect.
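The O(N^3)-versus-O(N^2) argument above can be made concrete with a toy cost model. The prefactors below are made up purely for illustration; the point is that the dense-linear-algebra share of the work approaches 100% as the problem is weak-scaled, which is why delivered flop rates improve:

```python
# Toy cost model for the two dominant plane-wave DFT components described in
# the text: dense linear algebra ~ N**3 vs. 3D FFT ~ N**2. The constants
# c_dense and c_fft are hypothetical placeholders, not measured values.

def dense_flops(n, c_dense=1.0):
    return c_dense * n ** 3      # orthogonalization + pseudopotential

def fft_flops(n, c_fft=50.0):
    return c_fft * n ** 2        # global 3D FFT

for n in (100, 1000, 10000):
    frac_dense = dense_flops(n) / (dense_flops(n) + fft_flops(n))
    print(n, round(frac_dense, 3))
```

Even with a large FFT prefactor, the FFT (and its global communication) shrinks to a few percent of the work at large N, while at small N and high concurrency it remains the bottleneck, matching the text.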
GW+BSE

GW+BSE is one approach to calculating excited states and optical spectra. It operates primarily on large, dense matrices. The most time-consuming parts are generating and diagonalizing these matrices. The diagonalization is often done using a dense eigensolver (with libraries such as ScaLAPACK). The dimension of the matrix is proportional to the square of the number of electrons in the system.
Quantum Monte Carlo (QMC)

Quantum Monte Carlo (QMC) uses stochastic random-walk methods to carry out the multidimensional integration of the many-body wavefunctions. Since it needs an ensemble sum over different independent walkers, it is embarrassingly parallel. QMC is a very accurate method, but it suffers from statistical noise; thus, it is difficult to use for atomic dynamics (where the forces on the atoms are needed).
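The embarrassingly parallel structure and the statistical noise mentioned above can both be seen in a minimal sketch: independent walkers (here each just integrating a simple 1-D function standing in for a many-body integral) produce estimates that are averaged, with error shrinking as 1/sqrt(samples). The integrand and all sizes are hypothetical:

```python
import random
import statistics

# Each "walker" is an independent Monte Carlo estimate; the final answer is
# the ensemble average. Walkers share no data, so they parallelize trivially.

def walker_estimate(rng, samples=1000):
    # Integrate f(x) = x**2 over [0, 1]; the exact value is 1/3.
    return sum(rng.random() ** 2 for _ in range(samples)) / samples

def qmc_ensemble(n_walkers=64, seed=0):
    estimates = []
    for w in range(n_walkers):
        rng = random.Random(seed + w)     # independent stream per walker
        estimates.append(walker_estimate(rng))
    mean = statistics.fmean(estimates)
    stderr = statistics.stdev(estimates) / n_walkers ** 0.5
    return mean, stderr

mean, stderr = qmc_ensemble()
print(round(mean, 3), round(stderr, 5))
```

The residual `stderr` is the statistical noise the text refers to: it decays only as the square root of the total sample count, which is what makes noise-sensitive quantities such as atomic forces hard to extract.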
Classical Molecular Dynamics and Classical Monte Carlo

In classical molecular dynamics (MD), the force-calculation step is done in parallel. Since the classical force-field formalism is local in nature (except for the electrostatic force), efficient parallelization is possible, as in the NAMD code. Classical Monte Carlo is sometimes used instead of classical MD; it is thus more concerned with the time-evolving process than with an ensemble sum (as in quantum MC). As a result, parallelization is not so trivial. There are many recent developments in parallel schemes for classical MC (besides the possible approach of parallel evaluation of the total energy, as in classical MD). Other PDE codes include Maxwell equation solvers (e.g., in photonics studies) and time-evolving differential equations for grain boundary and defect dynamics.

Once the materials science workload is reorganized using this taxonomy for its computational methods, the distribution of algorithms becomes clearer (Figure 5). The density functional theory (DFT) codes clearly dominate allocations with 72% of the overall workload. Of that subcategory, plane-wave DFT codes (which include VASP, PARATEC, Qbox, and PEtot) are the dominant component. The DFT codes (particularly plane-wave DFT) have very similar computational requirements, as previously discussed, with a common set of dominant components in their execution profile including:

• the 3D FFT, which requires global communication operations that dominate at high concurrencies and low atom counts;

• dense linear algebra for orthogonalization of wavefunctions (which dominates for large numbers of electronic states in large atomic systems with many electrons but has very localized communication);

• dense linear algebra for calculating the pseudopotential (which also dominates for large numbers of atoms).
Since VASP is not available for public distribution and Qbox is export-controlled, we chose PARATEC as the best available proxy for the requirements of the materials science workload. We also get to share effort with the NSF Track-1 and Track-2 procurement benchmarks, which have also adopted PARATEC as a proxy for the materials science component of their workloads.
Figure 5: Materials science codes categorized by algorithm.
Chemistry

Despite chemistry being one of the most mature simulation fields, the role of computation as a fundamental tool to aid in the understanding of chemical processes continues to grow steadily; one estimate places the number of electronic structure calculations published yearly at around 8,000. The dominant single code in the chemistry workload is S3D, due to its substantial INCITE allocation. However, if we set aside INCITE applications because they are not steady-state components of the workload, we see that the chemistry workload is dominated by VASP, which is also dominant in the materials science workload. As a result, we chose to focus on other components of the chemistry workload to emphasize a different set of criteria in the algorithm space. S3D bears more similarity to the computational requirements of codes in the astrophysics workload, where there are many CFD simulations. We decided to evaluate S3D as a component of that workload simply due to the similarities in the computational methods.
Table 2: The top 50% of the chemistry workload shows a diversity of codes.
Table 2 shows that the top 50% of the chemistry workload contains a diversity of codes. The leader, ZORI, is a QMC code and, as such, emphasizes per-processor performance but is not very demanding of the interconnect. The remaining codes are either DFT codes (very similar to materials science) or quantum chemistry codes. For quantum chemistry the most scalable codes are NWChem and GAMESS, and since these are very demanding of the interconnect they can be useful in differentiating machines based on support for efficient one-sided messaging. GAMESS was selected for NERSC benchmarking. It is also employed by the DOD-MOD TI-0x procurements, which allows us to collaborate with DOD and allows vendors to leverage their benchmarking efforts for multiple system bids.
Fusion

Fusion has long been the dominant component of the NERSC workload, accounting for more than 27% of the overall workload in 2007. The workload is currently dominated by five codes, which account for more than 50% of the overall fusion workload (Table 3). Those codes, OSIRIS, GEM, NIMROD, M3D, and GTC, fall into two basic categories. Three of them, OSIRIS, GTC, and GEM, are particle-in-cell (PIC) codes, whereas M3D and NIMROD are both continuum codes (PDE solvers on semi-structured grids). The percentage of per-processor peak achieved by particle-in-cell codes is generally low because the gather/scatter operations required to deposit the charge of each particle on the grid that will be used to calculate the field involve a large number of random accesses to memory, making the codes sensitive to memory access latency. GTC is certainly representative in this regard. However, unlike other PIC codes, GTC does not require a global 3D FFT, so performance and scalability issues associated with this aspect of PIC codes must be covered by benchmarks from other components of the NERSC workload.
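The charge-deposition (scatter) step described above can be sketched in 1-D. Each particle's charge is split between its two neighboring grid points with linear weights; because particle positions are effectively random, the grid updates hit memory in random order, which is the latency sensitivity the text describes. Sizes and weighting scheme below are illustrative:

```python
import random

# 1-D PIC charge deposition (scatter) with linear weighting on a periodic
# unit-length domain. Grid updates land at effectively random indices.

def deposit_charge(positions, charge, n_cells):
    grid = [0.0] * n_cells
    for x in positions:
        xi = x * n_cells                       # position in cell units
        i = int(xi)
        frac = xi - i                          # weights for the two neighbors
        grid[i % n_cells] += charge * (1.0 - frac)       # random-index write
        grid[(i + 1) % n_cells] += charge * frac         # and its neighbor
    return grid

rng = random.Random(42)
particles = [rng.random() for _ in range(10000)]
grid = deposit_charge(particles, charge=1.0, n_cells=64)
# Total charge is conserved regardless of deposition order
print(abs(sum(grid) - 10000.0) < 1e-6)
```

In a real 3-D PIC code each deposit touches eight (or more) grid points per particle, so the random-access pattern, not the arithmetic, dominates the cost.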
Table 3: The top 50% of the fusion time usage is dominated by 5 codes.
GTC now utilizes two parallel decompositions: a one-dimensional domain decomposition in the toroidal direction and a particle distribution within each domain. Processors assigned to the same domain have a copy of the local grid but only a fraction of the particles, and are linked with each other by a domain-specific communicator used when updating grid quantities. An Allreduce operation is required within each domain to sum up the contributions of each processor, which can lead to lower performance in certain cases. Although it is not clear that GTC adequately captures the parallel scaling characteristics of all other PIC codes, it emerges as a clear winning candidate for benchmarking by virtue of its excellent scalability, ease of porting, lack of intellectual property encumbrances, and general simplicity. On the other hand, we found the fusion continuum codes to be exceedingly complex as benchmarks; one example of this complexity is the required use of external libraries not guaranteed to be present on typical vendor in-house benchmarking machines. However, we also found that the semi-implicit solvers employed by the continuum codes, presumed to be the most time-consuming portion of a run, had resource requirements similar to those of astrophysics codes that model stellar processes (MHD codes used for modeling core-collapse supernovae and astrophysical jets). Therefore, we chose to focus on the PIC codes as proxies for the fusion workload requirements.
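The two GTC decompositions described above can be modeled without MPI: ranks are split into toroidal domains, each rank deposits only its own particles onto a private copy of the domain grid, and a per-domain "Allreduce" sums the partial grids. Plain lists stand in for communicators here; rank counts and the round-robin assignment are hypothetical:

```python
# Toy model of GTC's decompositions: a domain split plus a particle split,
# with a per-domain reduction over partial grids (the Allreduce in the text).

def split_by_domain(n_ranks, n_domains):
    # Domain-specific "communicators": lists of the rank ids in each domain
    return [[r for r in range(n_ranks) if r % n_domains == d]
            for d in range(n_domains)]

def domain_allreduce(partial_grids):
    # Sum the partial grid contributions of every rank in one domain
    n_cells = len(partial_grids[0])
    return [sum(g[i] for g in partial_grids) for i in range(n_cells)]

n_ranks, n_domains, n_cells = 8, 2, 4
comms = split_by_domain(n_ranks, n_domains)
# Each rank holds only a fraction of the particles; here each deposits 1.0/cell
partial = {r: [1.0] * n_cells for r in range(n_ranks)}
grids = [domain_allreduce([partial[r] for r in ranks]) for ranks in comms]
print(grids[0])
```

The reduction involves only the ranks sharing a domain, so its cost grows with the particle split width, which is the "lower performance in certain cases" the text mentions.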
Accelerator Modeling

Simulation is playing an increasingly prominent and valuable role in the theory, design, and development of particle accelerators, detectors, and X-ray free-electron lasers, which are among the largest, most complex scientific instruments in the world. The accelerator workload is dominated by PIC codes, but sparse matrix codes also have strong representation through Omega3P (see Table 4).
Table 4: Top codes in accelerator modeling. The PIC codes VORPAL and OSIRIS dominate the accelerator modeling workload. Even as we look further down the list to 70% of the workload, the allocations are dominated by PIC codes. The exception is Omega3P, which is bottlenecked by its eigensolver, which depends on a sparse-matrix direct solver.
Sparse-matrix direct solvers have well-known scalability issues that will ultimately force such codes to adopt new algorithmic approaches (likely sparse iterative solvers). There was concern that selection of current sparse matrix code examples would not be reflective of the state-of-the-art computational methods that will emerge for these problems in the 2009 timeframe. Therefore, we concentrated our attention on the PIC component of the workload, which is clearly dominant for this science area.

The PIC codes in accelerator modeling have unique load-balancing problems and potential inter-processor communication issues that distinguish them from the plasma-simulation PIC codes represented by GTC. Whereas plasma microturbulence codes are generally well load balanced due to the largely uniform distribution of the plasma throughout the computational domain, many accelerator problems depend on bunching up the particles because of the narrow focus of the accelerator beam. This can lead to serious issues for load balancing. Additionally, some of the accelerator codes, notably those comprising the IMPACT suite from LBNL, use an FFT algorithm to solve Poisson's equation, with its associated global, all-to-all communication.

IMPACT-T was chosen as the benchmark representing beam dynamics simulation. Part of a suite of codes used for this purpose, IMPACT-T is used specifically for photoinjectors and linear accelerators. It uses a 2-D domain decomposition of the 3-D Cartesian grid and is one of the few codes used in the photoinjector community that has a parallel implementation. It has an object-oriented Fortran 90 coding style that is critical for providing some of its key features. For example, it has a flexible data structure that allows particles to be stored in "containers" with common characteristics such as particle species or photoinjector slices; this allows binning in the space-charge calculation for beams with large energy spreads.
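The FFT-based Poisson solve mentioned above can be sketched in 1-D with a slow O(N^2) discrete Fourier transform so the example stays dependency-free: transform the charge density, divide each mode by -k^2, and transform back. This is a simplified stand-in for the parallel 3-D transforms (and their all-to-all communication) in the IMPACT suite, not its actual solver:

```python
import cmath
import math

# FFT-style spectral Poisson solve, u'' = f on a periodic domain, reduced to
# 1-D. A naive DFT replaces the FFT purely to keep the sketch self-contained.

def dft(x, sign=-1):
    n = len(x)
    return [sum(x[j] * cmath.exp(sign * 2j * math.pi * k * j / n)
                for j in range(n)) for k in range(n)]

def poisson_periodic(f, length=2 * math.pi):
    n = len(f)
    fhat = dft(f)
    uhat = [0j] * n
    for k in range(1, n):                  # the k = 0 (mean) mode stays zero
        kk = k if k <= n // 2 else k - n   # signed wavenumber index
        kphys = 2 * math.pi * kk / length
        uhat[k] = fhat[k] / (-kphys ** 2)  # divide each mode by -k**2
    u = dft(uhat, sign=+1)
    return [v.real / n for v in u]         # inverse-transform normalization

# Check against an exact solution: u = sin(x) satisfies u'' = -sin(x)
n = 32
xs = [2 * math.pi * i / n for i in range(n)]
u = poisson_periodic([-math.sin(x) for x in xs])
err = max(abs(u[i] - math.sin(xs[i])) for i in range(n))
print(err < 1e-9)
```

In the parallel 3-D version, each of the three transform directions requires data that is distributed across all processors, which is the global all-to-all communication the text attributes to these codes.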
Astrophysics

In the past, NERSC resources have been brought to bear on some of the most significant problems in computational astrophysics, including supernova explosions, thermonuclear supernovae, and analysis of cosmic microwave background data. Work done by various researchers at LBNL has shown that I/O benchmarking is most effectively done using I/O kernel benchmarks such as IOR and stripped-down applications such as MADCAP, rather than using full application codes. This is in part because of the flexibility IOR offers in measuring a range of file and block sizes, I/O methods, and file-sharing approaches. Since I/O benchmarking has essentially been factored out from the main workload, we concentrate on the other solvers in the astrophysics workload, which include a variety of radiation hydrodynamics, MHD, and other astrophysical fluids calculations. PDE solvers on block-structured grids are a dominant component of this workload. Although many of these codes currently employ an explicit time-stepping scheme, many of the scientists in the field indicate they are moving as quickly as possible toward implicit schemes (such as Newton-Krylov schemes) that have more demanding communication requirements but offer a much better time-to-solution.

In particular, many implicit schemes are limited by fast global reductions. Virtually all modern Krylov subspace algorithms (i.e., anything in the family of conjugate gradient, biconjugate gradient, biconjugate gradient-stabilized, generalized minimum residual, etc.) for sparse linear systems require inner products of vectors in order to function. In Fortran, for two vectors X and Y, this would look like:

    inner_prod = 0.0
    do i = 1, n
      inner_prod = inner_prod + X(i)*Y(i)
    enddo

When the vectors are distributed across a parallel architecture, the do-loop sum runs only over the elements of the vector that exist on each processor. The global summation is then done over the entire process topology, with a local sum on each processor followed by a global reduction operation to get the total sum. Thus it would look roughly like:

    local_sum = 0.0
    do i = 1, nlocal
      local_sum = local_sum + X(i)*Y(i)
    enddo
    call MPI_ALLREDUCE(local_sum, inner_prod, 1, MPI_DOUBLE_PRECISION, &
                       MPI_SUM, MPI_COMM_WORLD, ierr)
Virtually all Krylov subspace algorithms require these kinds of inner-product summations. Most classes of implicit algorithms in turn rely heavily on these Krylov subspace algorithms: sparse linear systems, sparse nonlinear systems, optimization, sensitivity analysis, etc. Therefore, we felt it necessary to select a benchmark that emphasizes the demands of implicit time-stepping schemes, so that we do not select a machine that is only good for explicit fluid dynamics algorithms while ignoring the many new applications that utilize implicit approaches that are just starting to become viable. Unfortunately, at this time very few codes have been developed that fully implement implicit time-stepping schemes. There were no clearly dominant codes that used implicit time stepping; such codes are still in development and are not very portable. We selected the MAESTRO code in part because of close interactions with the developers. MAESTRO is a low Mach number astrophysics code used to simulate the long-time evolution leading up to a Type Ia supernova explosion. The NERSC-6 MAESTRO simulation examines convection in a white dwarf and includes both explicit and implicit solvers. MAESTRO is capable of decomposing a uniform grid of constant resolution into a specified number of non-overlapping grid patches, but in the NERSC-6 benchmark it does not refine the grid.
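The role of the inner products discussed above is easiest to see in a complete conjugate gradient iteration: every pass needs two dot products, and each one becomes a global Allreduce when the vectors are distributed. The sketch below uses a tiny dense 2x2 system purely for brevity; it is a generic CG, not MAESTRO's solver:

```python
# Minimal conjugate gradient for a symmetric positive-definite system,
# annotated where the global reductions sit in a distributed-memory run.

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))   # the Allreduce point in parallel

def matvec(a, x):
    return [dot(row, x) for row in a]

def conjugate_gradient(a, b, tol=1e-12, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                        # residual for the initial guess x = 0
    p = r[:]
    rs = dot(r, r)                  # global reduction no. 1 per iteration
    for _ in range(max_iter):
        ap = matvec(a, p)
        alpha = rs / dot(p, ap)     # global reduction no. 2 per iteration
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Small SPD system: 4x + y = 1, x + 3y = 2; the exact answer is (1/11, 7/11)
x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
print(abs(x[0] - 1 / 11) < 1e-6 and abs(x[1] - 7 / 11) < 1e-6)
```

Because the two reductions cannot be overlapped with the surrounding vector updates in the basic algorithm, every CG iteration is gated by global-latency operations, which is exactly why implicit schemes stress the interconnect in the way the text describes.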
AMR

Although adaptive mesh refinement (AMR) is a powerful technique with the potential to reduce the resources necessary to solve otherwise intractable problems in a variety of computational science fields, it brings with it a variety of additional performance challenges that are worthy of testing in a workload-driven evaluation environment. In particular, AMR methods have the potential to add non-uniform memory access and irregular inter-processor communication to what might otherwise have been relatively regular and uniform access patterns. We represent AMR in the NERSC-6 benchmark suite with a stripped-down application that exercises an elliptic solver in the Chombo AMR framework developed at LBNL. Chombo provides a distributed infrastructure for implementing finite difference methods for PDEs on block-structured, adaptively refined rectangular grids. The benchmark problem uses a solver with a fixed number of iterations and weak-scales the problem from 256 to 4096 cores. For more information see http://www.nersc.gov/projects/SDSA/software/?benchmark=AMR.
Nuclear Physics

It is no surprise that MILC and other variants of lattice QCD codes dominate the workload. MILC was also adopted by the NSF procurements as a benchmark, so we are able to leverage the investment in packaging this benchmark for the procurement.
Biology and Life Sciences

Codes in this group seem to fall into three basic categories. The first consists of codes for sequence searching, comparison, identification, and/or prediction. Basically informatics-type codes, these differ from typical HPC workloads and are not represented to any extent in NERSC benchmark suites. These codes had a moderate allocation in 2007, but we concluded that explicitly representing them in the benchmark suite was unnecessary because the codes are basically constrained by clock speed (integer processing rates, actually) and thus do not present a sufficiently challenging or unique problem to the architecture.

The second category consists of a variety of molecular dynamics codes, the most important of which are AMBER, NAMD, and CHARMM. The capability these codes offer is merged with requirements for protein folding analysis to create what is referred to as molecular dynameomics, one of the most challenging areas of computational science in general. The characteristics of these codes are, in a macro sense, similar to, if not the same as, the molecular dynamics codes used in materials science; however, individual life-science-related molecular dynamics simulations are likely to be less scalable and more inter-processor-communication limited than those of materials science, since inter-atomic potentials are likely to be more subtle and longer range.

The third category consists of codes that also are, or look like, materials science or chemistry codes, either for quantum Monte Carlo, classical quantum chemistry, or DFT methods. The computational characteristics of these have already been discussed and likely do not differ for life science applications. The total life science NERSC allocation in recent years has remained small, and so no code has been chosen to uniquely represent this area.
Benchmark Selection

Benchmark selection is an extraordinarily difficult process. As per Ingrid Bucher's observation, "benchmarks are only useful insofar as they model the workload that will run on a given system." To that end, we use information from our workload analysis to aid in the selection of benchmarks that are a proxy for the anticipated NERSC system requirements. Ideally, the benchmarks must cover the requirements of the target HPC workload, but there are a number of practical considerations that must be taken into account during the selection process.

Coverage: We assess how well the codes cover the science areas that dominate the NERSC workload as well as how well they cover the algorithm space. This creates a two-dimensional matrix of computational methods and science areas that must be covered.

Portability: The selected codes must be able to run on a wide variety of systems to allow valid comparisons across architectures. The implementation of the codes must not be too architecture-specific, and the "build" systems for these codes must be robust so that they can be used across many system architectures.

Scalability: It is important to emphasize applications that justify scalable HPC resources. Over time, codes that are unable to scale with the resources tend to have a diminishing role in the HPC workload. In selecting scalable codes, we are not selecting those that scale easily; rather, we select those that present some of the most demanding requirements but are otherwise engineered to expose sufficient parallelism to run on highly parallel systems.

Open Source: Since the benchmarks must be distributed without restriction to vendors for benchmarking purposes, no proprietary or export-controlled code can be used. Only open-source applications that allow us to share snapshots with vendors can be considered.

Availability: Packaging appropriate benchmark problems requires close collaboration with the developers for assistance and support. Without the assistance of the user community, we cannot select appropriate problems for evaluating systems, and hence will be unable to gauge the value of a given system.
Table 5: Matrix of codes for the first cut of the selection process. The codes were ranked based on their scores as candidate benchmarks. Codes that could not be widely distributed or were difficult to port were gradually eliminated until the final 6 benchmarks remained.

With these considerations in mind, the benchmark team selected problems based on their ability to fulfill these attributes and ranked them based on their scores in each of these areas as candidate benchmark codes. After gaining experience with the codes by building them, or through detailed discussions with the developers, we gradually reduced the list down to six target benchmark codes. The final problem sizes and benchmark configurations were determined after the final codes were selected.
These benchmarks are used individually to understand and compare system performance. Although our analysis of the individual application performance has found specific bottlenecks on each system, distilling these down to a set of specific hardware features is unlikely to capture the complex interactions between full applications and correspondingly complex system architectures. Therefore, the performance of full applications, with their associated complex behavior, is used to assess the value of systems. We feel that using a carefully selected set of full applications as a proxy for full workload performance is better able to encode the DOE Office of Science workload requirements for the purpose of comparing HPC systems.
Table 6: Final benchmark selection matrix together with problem sizes and concurrencies.
Benchmark | Science Area              | Algorithm Space                                            | Base Case Concurrency       | Problem Description                           | Lang       | Libraries
CAM       | Climate (BER)             | Navier-Stokes CFD                                          | 56, 240; strong scaling     | D grid (~0.5 deg resolution); 240 timesteps   | F90        | netCDF
GAMESS    | Quantum Chem (BES)        | Dense linear algebra, DFT                                  | 256, 1024 (same as TI-09)   | rms gradient, MP2 gradient                    | F77        | DDI, BLAS
GTC       | Fusion (FES)              | PIC, finite difference                                     | 512, 2048; weak scaling     | 100 particles per cell                        | F90        | --
IMPACT-T  | Accelerator Physics (HEP) | Largely PIC, FFT component                                 | 256, 1024; strong scaling   | 50 particles per cell (currently)             | F90        | --
MAESTRO   | Astrophysics (HEP)        | Low Mach number hydro; block-structured-grid multiphysics  | 512, 2048; weak scaling     | 64x64x128 grid pts per proc; 10 timesteps     | F90        | Boxlib
MILC      | Lattice Gauge Physics (NP)| Conjugate gradient, sparse matrix; FFT                     | 256, 1024, 8192; weak scaling | 8x8x8x9 local grid, ~70,000 iters           | C, assem.  | --
PARATEC   | Material Science (BES)    | DFT; FFT, BLAS3                                            | 256, 1024; strong scaling   | 686 atoms, 1372 bands, 20 iters               | F90        | ScaLAPACK
Science Area        | Dense Linear Algebra | Sparse Linear Algebra | Spectral Methods (FFT) | Particle Methods | Structured Grids | Unstructured or AMR Grids
Accelerator Science |                      | X                     | X IMPACT-T             | X IMPACT-T       | X IMPACT-T       |
Astrophysics        | X                    | X MAESTRO             | X                      |                  | X MAESTRO        | X MAESTRO
Chemistry           | X GAMESS             |                       |                        | X                |                  |
Climate             |                      |                       | X CAM                  |                  | X CAM            |
Combustion          |                      |                       |                        |                  | X MAESTRO        | X AMR Elliptic
Fusion              |                      |                       |                        | X GTC            | X GTC            |
Lattice Gauge       | X MILC               | X MILC                | X MILC                 |                  | X MILC           |
Material Science    | X PARATEC            |                       | X PARATEC              |                  | X PARATEC        |

Table 7: Algorithm coverage by the selected SSP benchmarks. An X marks an algorithm class present in that science area's workload; a code name indicates the selected benchmark that exercises it.
Evolution from NERSC-5 Application Benchmarks

We point out that the selected application benchmarks have changed from those used for the NERSC-5 procurement. The following considerations led to these changes:
WorkloadEvolution:TheworkloadhaschangedmodestlysincetheNERSC‐5procurement.Twoapplicationsfromthelastbenchmarksuite(NAMDandMadBENCH)wereremovedbecauseoftheirreducedcontributiontotheworkloadandwerereplacedbyMAESTROandIMPACT‐Trespectively.Multicore:PowerhasendedDennardscaling,sofutureperformanceimprovementsofHPCsystemshingeondoublingthenumberofprocessorcoresevery18monthsratherthanthehistoricalclock‐frequencyscaling.Weprojectthenext‐generationNERSCsystemwillbe3–5xmorecapablethanNERSC‐5.Consequently,theconcurrencyofthebenchmarkshasincreasedby4xovertheNERSC‐5suite.StrongScaling:Overthepast15years,parallelcomputinghasthrivedonweak‐scalingofproblems.However,withthestallinCPUclockfrequencies,thereisincreasedpressuretoimprovestrong‐scalingperformanceofalgorithms.Thisisbecausefortime‐domainmethods,increasesinproblemresolutionmustoftenbeaccompaniedbycorrespondingdecreasesintime‐
21
stepsizetosatisfythecourantstabilitycondition.Whereasthetime‐stepdecreasescouldbeaccommodatedbyperformanceimprovementsofindividualcores,nextgenerationalgorithmswillneedtoimprovetheirstepratesusingincreasedparallelism(strongscaling).Consequently,ourprobleminputdecksfortheNERSC‐6applicationbenchmarksnowemphasizestrong‐scalingrequirements.ImplicitTimestepping:ManyPDEsolversonblock‐structuredgridsemployexplicittimesteppingmethods,whichperformwellwhenweak‐scaledonmassivelyconcurrentsystems.However,itisverydifficulttoimprovestrong‐scalingperformanceinlightofthestallinCPUclockfrequencies.Thiswillprovideincreasedpressuretomovetowardsimplicittimesteppingschemes,whicharefarmorechallengingtoscaletohighconcurrencies.Evidenceofthistrendhasbeendocumentedinthe2005GreenbookandsubsequentreportsfromtheFusionSimulationProject(FSP)andotherDOEreports.WehavecodedtherequirementintoourbenchmarksuiteusingMAESTRO,whichimplementsbothimplicitandexplicittimesteppingschemes.AMR/MultiscalePhysics:Althoughitisnotamajorcomponentofourworkloadyet,thepresenceofscalableAMRcodeshasincreasedsubstantiallyinourworkloads.Giventheincreasedemphasisonmulti‐physicsmultiscaleproblemsinSciDACandincreasedneedforstrong‐scaling,weexpectthesemethodstobecomeincreasinglyimportantcomponentoffutureworkloads.SupportforLightweight/OneSidedMessaging:Assystemsscaletoevenlargerconcurrencies,theperformanceoftheinterconnectforverysmallmessagesbecomesincreasinglyimportant.Already,ourcurrentworkloadincludesmanychemistryapplicationsthatdependonGlobalArrays(GA)andotherlibrariesthatemulateaglobalsharedaddressspace.Thesecodebenefitgreatlyfromefficientsupportoflight‐weightmessaging,includingone‐sidedmessageconstructs.TheserequirementsarerepresentedbythechoiceofGAMESSfortheapplicationsuite.
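The arithmetic behind the strong-scaling pressure can be sketched as follows (a minimal illustration of the Courant condition; the unit wave speed, Courant number of 0.5, and factor-of-two refinement are illustrative assumptions, not parameters from this report):

```python
# Why strong scaling follows from the Courant (CFL) condition for an
# explicit time-domain solver: the stable step satisfies dt <= C*dx/v,
# so refining the grid forces a proportionally smaller time step.
def cfl_timestep(dx, wave_speed=1.0, courant_number=0.5):
    """Largest stable explicit time step for grid spacing dx."""
    return courant_number * dx / wave_speed

def relative_cost_3d(refinement):
    """Work increase for a 3-D time-domain solver refined by `refinement`
    in each dimension: cells grow as r**3 and, because dt shrinks
    linearly with dx, the number of steps grows as r, giving r**4."""
    cells = refinement ** 3
    steps = refinement
    return cells * steps

print(cfl_timestep(1.0))       # 0.5
print(cfl_timestep(0.5))       # 0.25: halving dx halves the stable dt
print(relative_cost_3d(2))     # 16x total work for 2x resolution
```

Absent faster cores, the extra factor of `refinement` in step count can only be recovered by stepping faster in parallel, which is precisely the strong-scaling requirement described above.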
NERSC Composite Benchmarks (Sustained System Performance)

The highest-concurrency runs for these application benchmarks are also used together in the Sustained System Performance (SSP) composite benchmark, which is discussed in a companion document describing both the SSP and ESP tests. The geometric mean of the performance of these benchmarks is used to estimate the likely delivered performance of the evaluated systems for the target NERSC workload. This approach thus fulfills the observation by Ingrid Bucher that "benchmarks are only useful insofar as they model the intended workload." Because these benchmarks encode our workload requirements, the SSP provides a good metric for estimating sustained performance on the anticipated workload.
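As a sketch of how such a composite is formed (the per-code weighting and normalization of the actual SSP are defined in the companion document, not here; the rates below are measured values quoted later in this report, mixed across concurrencies purely for illustration):

```python
import math

def geometric_mean(values):
    """Geometric mean, computed in log space for numerical robustness."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Illustrative per-benchmark sustained rates (GFLOP/s) on one system.
rates = {"CAM": 130.6, "GAMESS": 214.0, "GTC": 565.0,
         "IMPACT-T": 130.0, "MAESTRO": 245.0, "MILC": 291.0}
ssp_estimate = geometric_mean(rates.values())
print(round(ssp_estimate, 1))
```

A geometric mean is the natural choice here: it rewards balanced performance across all codes, since a system cannot hide a very poor rate on one benchmark behind an excellent rate on another.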
Analysis of NERSC-6 Benchmark Codes

This section provides more detailed descriptions of the NERSC-6 application benchmark codes and their problem sets, along with some observed characteristics of the codes and some preliminary performance data from various systems. The benchmarks serve three purposes in the procurement project. First, each NERSC-6 benchmark and its respective test problem has been carefully chosen and developed to represent a particular subset and/or specific characteristic of the expected DOE Office of Science workload on the NERSC-6 system. As it has been stated, "For better or for worse, benchmarks shape a field,"1 and thus a key component of the NERSC workload-driven evaluation strategy is making available real application codes that capture the performance-critical aspects of the workload. Second, the benchmarks give vendors the opportunity to provide NERSC with concrete data (in an RFP response) on the performance and scalability of the proposed systems on programmatically important applications. We expect that vendor proposal responses will include the results of running NERSC-6 benchmarks on existing hardware as well as projections to future, currently unavailable systems. In this role, the benchmarks play an essential part in the proposal evaluation process. The NERSC staff are well qualified to evaluate any projections supplied by vendors, having carried out extensive technology assessments of prospective vendor systems for the timeframe of the NERSC-6 project. Third, the benchmarks will be used as an integral part of the system acceptance test and as part of a continuous measurement of hardware and software performance throughout the operational lifetime of the NERSC-6 system. The NERSC evaluation strategy uses benchmark programs of varying complexity, ranging from kernels and microbenchmarks through stripped-down applications and full applications. Each has its place: we use kernels and microbenchmarks to gain understanding, but applications to evaluate effectiveness and performance. The remainder of this document concentrates on the NERSC-6 full application codes.
For each of the codes a few key characteristics are presented, most of which have been obtained using the Integrated Performance Monitoring (IPM) tool, an application profiling layer that uses a unique hashing approach to allow a fixed memory footprint and minimal CPU usage. Key characteristics measured include the type and frequency of application-issued MPI calls, the buffer sizes used for point-to-point and collective communications, the communication topology of each application, and in some cases a time profile of the interprocessor communications. Another key parameter, of a completely different type, is the computational intensity: the ratio of the number of floating-point operations executed to the number of memory operations.
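For a concrete sense of this metric, consider the DAXPY kernel y = a*x + y (a standard textbook example, not one of the IPM measurements reported here): per element it performs two floating-point operations against three memory operations.

```python
def computational_intensity(flops, memory_ops):
    """Ratio of floating-point operations to memory operations."""
    return flops / memory_ops

# DAXPY, y[i] = a*x[i] + y[i]: per element, 2 flops (one multiply,
# one add) and 3 memory operations (load x[i], load y[i], store y[i]).
n = 1_000_000
ci = computational_intensity(flops=2 * n, memory_ops=3 * n)
print(ci)  # 2/3: fewer flops than memory operations
```

Codes with low computational intensity such as this are limited by memory bandwidth and latency rather than by peak arithmetic rate, which is why the metric recurs throughout the per-code discussions below.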
1 Patterson, D., CS252 Lecture Notes, University of California Berkeley, Spring, 1998.
CAM: CCSM Community Atmosphere Model

The Community Atmosphere Model (CAM) is the atmospheric component of the Community Climate System Model (CCSM) developed at NCAR and elsewhere for the weather and climate research communities. Although generally used in production as part of a coupled system in CCSM, CAM can be run as a standalone (uncoupled) model, as it is at NERSC. The NERSC-6 benchmark runs CAM version 3.1 at D resolution (about 0.5 degree) using a finite volume (FV) dynamical core. Two problems are run using strong scaling at 56 and 240 cores. Atmospheric models consist of two principal components, the dynamics and the physics. The dynamics, for which the solver is referred to as the "dynamical core," are the large-scale part of a model: the atmospheric equations of motion affecting wind, pressure, and temperature that are resolved on the underlying grid. The physics are characterized by subgrid-scale processes such as radiation, moisture convection, friction, and boundary layer interactions that are taken into consideration implicitly (via parameterizations). In CAM3 as it is run at NERSC, the dynamics are solved using an explicit time integration and a finite-volume discretization that is local and entirely in physical space. Hydrostatic equilibrium is assumed and a Lagrangian vertical coordinate is used, which together effectively reduce the dimensionality from three to two.
Relationship to NERSC Workload

CAM is used for both short-term weather prediction and as part of a large climate modeling effort to accurately detect and attribute climate change, predict future climate, and engineer mitigation strategies. An example of a "breakthrough" computation involving CAM might be its use in a fully coupled ocean/land/atmosphere/ice climate simulation with 0.125-degree (approximately 12 kilometer) resolution involving an ensemble of eight to ten independent runs. Doubling the horizontal resolution in CAM increases the computational cost eightfold. The computational cost of CAM in the CCSM, holding resolution constant, has increased 4x since 1996. More computational complexity is coming in the form of, for example, super-parameterizations of moist processes, and ultimately the elimination of parameterization and the use of grid-based methods.
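The eightfold figure can be recovered with a small dimensional argument (a sketch assuming vertical resolution is held fixed, which is consistent with the hydrostatic model described above):

```python
def cam_cost_factor(horizontal_refinement):
    """Cost growth when CAM's horizontal resolution is refined by a
    factor r: grid columns grow as r**2 (two horizontal dimensions),
    and the stable time step shrinks as 1/r, so the number of time
    steps grows as r, giving r**3 overall."""
    r = horizontal_refinement
    return r ** 2 * r

print(cam_cost_factor(2))  # 8: doubling resolution costs eightfold
print(cam_cost_factor(4))  # 64: e.g. 0.5-degree -> 0.125-degree
```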
Parallelization

The solution procedure for the FV dynamical core involves two parts: (1) dynamics and tracer transport within each control volume and (2) remapping of prognostic data from the Lagrangian frame back to the original vertical coordinate. The remapping occurs on a time scale several times longer than that of the transport. The version of CAM used by NERSC uses a formalism effectively containing two different two-dimensional domain decompositions. In the first, the dynamics are decomposed over latitude and vertical level. Because this decomposition is inappropriate for the remap phase, and for the vertical integration step for calculation of pressure, for which there exist vertical dependences, a longitude-latitude decomposition is used for these phases of the computation. Optimized transposes move data from the program structures required by one phase to the other. One key to this yz-to-xy transpose scheme is having the number of processors in the latitudinal direction be the same for both of the two-dimensional decompositions. Figure 6 presents the topological connectivity of communication for fv-CAM; each point in the graph indicates processors exchanging messages, and the color indicates the volume of data exchanged. Figures 7 and 8 give an indication of the extent to which different communication primitives are used in the code and their respective times on one system. Figures 9 and 10 give an indication of the sizes of the MPI messages in an actual run on the NERSC Cray XT4 system, Franklin.
Figure 6. Communication topology and intensity for fvCAM.

Figure 7. MPI call counts for fvCAM.

Figure 8. Percent of time spent in various MPI routines for fvCAM on 240 processors on the NERSC Bassi system.
Figure 9. MPI buffer sizes for point-to-point messages in fvCAM.

Figure 10. MPI buffer sizes for collective messages in fvCAM.
Importance for NERSC-6

For NERSC-6 benchmarking, CAM offers the following characteristics: complexity (a relatively flat performance profile inhibits simple optimization); relatively low computational intensity stresses on-node/processor data movement; relatively long MPI messages stress interconnect point-to-point bandwidth; strong scaling and the decomposition method limit scalability; moderately sensitive to TLB performance and VM page size.
Performance

Table 8 presents measured performance for CAM on several systems. The "GFLOPs" column represents the aggregate computing rate, based on the reference floating-point operation count from Franklin and the time on the specific machine. The efficiency column ("Effic.") represents the percentage of peak performance for the number of cores used to run the benchmark.
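As a sketch of the arithmetic behind the efficiency column (the 5.2 GFLOP/s per-core peak used below is our assumed figure for Franklin's 2.6 GHz dual-core Opterons at 2 flops/cycle, not a value quoted in this report):

```python
def efficiency(gflops, cores, peak_per_core_gflops):
    """Percent of theoretical peak achieved by an aggregate rate."""
    return 100.0 * gflops / (cores * peak_per_core_gflops)

# CAM on 56 Franklin cores at 33.5 aggregate GFLOP/s, with an assumed
# per-core peak of 5.2 GFLOP/s (2.6 GHz x 2 flops/cycle).
print(round(efficiency(33.5, 56, 5.2)))  # ~12, matching Table 8
```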
         IBM Power5       IBM BG/P         QC Opteron       DC Opteron
         Bassi                             Cray XT4         Cray XT4
                                           Jaguar           Franklin
Cores    GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.
56       30       7%      6        3%      32       7%      33.5     12%
240      100      6%      49       3%      143      7%      130.6    11%

Table 8. Measured Aggregate Performance and Percent of Peak for CAM.
GAMESS: General Atomic and Molecular Electronic Structure System

The GAMESS (General Atomic and Molecular Electronic Structure System) code from the Gordon research group at the Department of Energy Ames Lab at Iowa State University is one of the most important tools available for various ab initio quantum chemistry simulations. Several kinds of SCF wavefunctions can be computed with configuration interaction, second-order perturbation theory, and coupled-cluster approaches, as well as the density functional theory approximation. A variety of molecular properties, ranging from simple dipole moments to frequency-dependent hyperpolarizabilities, may be computed. Many basis sets are stored internally, together with effective core potentials, so all elements up to radon may be included in molecules. The two benchmarks used here test DFT energy and gradient, RHF energy, MP2 energy, and MP2 gradient.
Relationship to NERSC Workload

GAMESS is available and is used on all NERSC systems. It is most commonly used by allocations from the BES area. As a benchmark, GAMESS represents a variety of conventional quantum chemistry codes that use Gaussian basis sets to solve the Hartree-Fock, MP2, and coupled-cluster equations. These codes also now include the capability to perform density functional computations. Codes such as GAMESS can be used to study relatively large molecules, in which case large processor configurations can be used, but frequently they are also used to calculate many atomic configurations (e.g., to map out the potential manifold in a chemical reaction) for small molecules.
Parallelization

GAMESS uses an SPMD approach but includes its own underlying communication library, called the Distributed Data Interface (DDI), to present the abstraction of a global shared memory with one-sided data transfers, even on systems with physically distributed memory. In part, the DDI philosophy is to add more processors not just for their compute performance but also to aggregate more total memory. In fact, on systems where the hardware does not contain support for efficient one-sided messages, an MPI implementation of DDI is used in which only one-half of the processors allocated compute while the other half are essentially data movers.
Benchmark Problems

Rather than using GAMESS for scaling studies, two separate runs are done testing different parts of the program. The NERSC "medium" case, run on 64 cores for NERSC-5 and expected to run on 256 cores for NERSC-6, is a B3LYP(5)/6-311G(d,p) calculation with an RHF SCF gradient. The "large" case, 384 cores for NERSC-5 and expected to run on 1024 cores for NERSC-6, is a 6-311++G(d,p) MP2 computation on a five-atom system with T symmetry.
Importance for NERSC-6

For NERSC-6 benchmarking, GAMESS offers the following characteristics: complexity (a relatively flat performance profile inhibits simple optimization, but the built-in communication layer affords an opportunity for key system-dependent optimization); considerable stride-1 memory access; the built-in communication layer yields performance characteristics not visible from MPI (typical of Global Array codes, for example).
Performance

Since the problem sets for NERSC-6 GAMESS have not been finalized, the data presented in Table 9 are for the NERSC-5 problems. Performance is not expected to differ significantly.
         IBM Power5       IBM BG/P         QC Opteron       DC Opteron
         Bassi                             Cray XT4         Cray XT4
                                           Jaguar           Franklin
Cores    GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.
64       26       5%      --       --      18       3%      23       7%
384      121      4%      39       2%      177      6%      214      11%

Table 9. Measured Aggregate Performance and Percent of Peak for GAMESS.
GTC: 3D Gyrokinetic Toroidal Code

GTC is a three-dimensional code used to study microturbulence in magnetically confined toroidal fusion plasmas via the particle-in-cell (PIC) method. It solves the gyro-averaged Vlasov equation in real space using global gyrokinetic techniques and an electrostatic approximation. The Vlasov equation describes the evolution of a system of particles under the effects of self-consistent electromagnetic fields. The unknown is the flux, f(t,x,v), which is a function of time t, position x, and velocity v, and represents the distribution function of particles (electrons, ions, etc.) in phase space. This model assumes a collisionless plasma; i.e., the particles interact with one another only through a self-consistent field, and thus the algorithm scales as N instead of N², where N is the number of particles.

A "classical" PIC algorithm uses the notion of discrete "particles" to sample a charge distribution function. The particles interact via a grid, on which the potential is calculated from charges "deposited" at the grid points. Four steps are carried out iteratively:
• "SCATTER," or deposit, charges on the grid. A particle typically contributes to many grid points, and many particles contribute to any one grid point.
• Solve the Poisson equation.
• "GATHER" forces on each particle from the resultant potential function.
• "PUSH" (move) the particles to new locations on the grid.

In GTC, point-charge particles are replaced by charged rings using a four-point gyro-averaging scheme, as shown in Figure 11.
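The scatter/gather/push cycle above can be sketched in one dimension (a minimal illustration of the classical PIC pattern with linear weighting on a periodic grid, not GTC's four-point gyro-averaging; positions and charges are invented):

```python
def scatter(positions, charge, ngrid):
    """Deposit charge on a periodic 1-D grid with linear weighting:
    each particle contributes to its two nearest grid points via the
    indirect (data-dependent) addressing characteristic of PIC."""
    rho = [0.0] * ngrid
    for x in positions:
        left = int(x) % ngrid
        frac = x - int(x)
        rho[left] += charge * (1.0 - frac)
        rho[(left + 1) % ngrid] += charge * frac
    return rho

def gather(field, positions):
    """Interpolate a grid field back to each particle's position."""
    ngrid = len(field)
    out = []
    for x in positions:
        left = int(x) % ngrid
        frac = x - int(x)
        out.append(field[left] * (1.0 - frac)
                   + field[(left + 1) % ngrid] * frac)
    return out

def push(positions, velocities, dt, ngrid):
    """Move particles and apply periodic wrap-around."""
    return [(x + v * dt) % ngrid for x, v in zip(positions, velocities)]

rho = scatter([1.5, 3.0], charge=1.0, ngrid=8)
print(sum(rho))  # 2.0: deposited charge is conserved
```

The scattered writes to `rho` through computed indices are the source of the random memory access pattern that the "Importance for NERSC-6" discussion below attributes to charge deposition.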
Relation to NERSC Workload

GTC is used for fusion energy research via the DOE SciDAC program and for the international fusion collaboration. Support for it comes from the DOE Office of Fusion Energy Science. It is used for studying neoclassical and turbulent transport in tokamaks and stellarators as well as for investigating hot-particle physics, toroidal Alfven modes, and neoclassical tearing modes.
Figure 11. Representation of the charge accumulation to the grid in GTC.
Parallelization

In older versions of GTC, the primary parallel programming model was a 1-D domain decomposition, with each MPI process assigned a toroidal section and MPI_SENDRECV used to shift particles moving between domains. Newer versions of GTC, including the current NERSC-6 benchmark, use a fixed 1-D domain decomposition with 64 domains plus a particle-based decomposition. At greater than 64 processors, each domain in the 1-D domain decomposition has more than one processor associated with it, and each processor holds a fraction of the total number of particles in that domain. This requires an all_reduce operation to collect all of the charges from all particles in a given domain. The communication topology of the NERSC-6 version of GTC, using the combined toroidal-domain/particle decomposition method, is shown in Figure 12. Figures 13, 14, and 15 show other communication characteristics.
Figure 12. MPI message topology for the NERSC-6 version of GTC.

Figure 13. MPI call counts for the NERSC-6 version of GTC.

Figure 14. MPI message buffer size distribution based on calls for the NERSC-6 version of GTC.

Figure 15. Time spent in MPI functions in the NERSC-6 version of GTC on Franklin.
Importance for NERSC-6

For NERSC-6 benchmarking, GTC offers the following characteristics: charge deposition in PIC codes uses indirect addressing and therefore stresses random access to memory; the particle decomposition stresses collective communications with small messages; point-to-point communications are strongly bandwidth bound; runs adequately on relatively simple-topology communication networks; important loop constructs with a large number of arrays lead to highly sensitive TLB behavior.
Performance

Table 10, which lists some measured performance data, suggests that GTC is capable of achieving relatively high percentages of peak performance on some systems, primarily due to a relatively high computational intensity and a low percentage of communication contribution. It also appears that Opteron processors have a performance advantage on this code.
         IBM Power5       IBM BG/P         QC Opteron       DC Opteron
         Bassi                             Cray XT4         Cray XT4
                                           Jaguar           Franklin
Cores    GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.
512      428      11%     98       6%      511      12%     565      21%
2048     n/a      n/a     391      6%      2023     12%     2438     23%

Table 10. Measured Aggregate Performance and Percent of Peak for GTC.
IMPACT-T: Parallel 3-D Beam Dynamics

IMPACT-T (Integrated Map and Particle Accelerator Tracking, Time) is one member of a suite of computational tools for the prediction and performance enhancement of accelerators. These tools are intended to predict the dynamics of beams in the presence of optical elements and space-charge forces, the calculation of electromagnetic modes and wakefields of cavities, the cooling induced by co-moving beams, and the acceleration of beams by intense fields in plasmas generated by beams or lasers. The IMPACT-T code uses a parallel, relativistic PIC method for simulating 3-D beam dynamics using time as the independent variable. (The design of RF accelerators is normally performed using position, either the arc length or z-direction, rather than time, as the independent variable.) IMPACT-T includes the arbitrary overlap of fields from a comprehensive set of beamline elements, can model both standing- and traveling-wave structures, and can include a wide variety of external magnetic focusing elements such as solenoids, dipoles, quadrupoles, etc. IMPACT-T is unique in the accelerator modeling field in several ways. It uses space-charge solvers based on an integrated Green function to efficiently and accurately treat beams with large aspect ratios, and a shifted Green function to efficiently treat image-charge effects of a cathode. It uses energy binning in the space-charge calculation to model beams with large energy spread. It also has a flexible data structure (implemented in object-oriented Fortran) that allows particles to be stored in containers with common characteristics; for photoinjector simulations the containers represent multiple slices, but in other applications they could correspond, e.g., to particles of different species. While the core PIC method used in IMPACT-T is essentially the same as that used for plasma simulation (see GTC, above), many details of the implementation differ; in particular, the spectral/Green's function potential solver (vs. the finite difference method used in GTC) results in a significant global inter-processor communication pattern.
Relationship to NERSC Workload

Past applications of the IMPACT code suite include SNS linac modeling, J-PARC linac commissioning, the RIA driver linac design, the CERN superconducting linac design, the LEDA halo experiment, the BNL photoinjector, and the FNAL NICADD photoinjector. Recent code enhancements have expanded IMPACT's applicability, and it is now being used (along with other codes) on the LCLS project and the FERMI@Elettra project. A recent DOE INCITE grant allowed IMPACT to be used in the design of free electron lasers.
Parallelization

A two-dimensional domain decomposition in the y-z directions is used, along with a dynamic load balancing scheme based on domain. The implementation uses MPI-1 message passing. For some problems, most of the communication due to particles crossing processor boundaries is local; however, in the simulation of colliding beams, particles can move a long distance, and more than one pair of exchanges might be required for a single particle update. IMPACT-T does not use the particle-field parallel decomposition used by other PIC codes. Hockney's FFT algorithm is used to solve Poisson's equation with open boundary conditions. In this algorithm, the number of grid points is doubled in each dimension, with the charge density on the original grid kept the same and the charge density elsewhere set to zero. Some communication characteristics for IMPACT-T are presented in Tables 11 and 12 and Figures 16-18.

MPI Event       Buffer Size (Bytes)   Cumulative Percent of Total Wall Clock Time
MPI_Alltoallv   131K-164K             15%
MPI_Send        8K                    3%
MPI_Allreduce   4-8                   4%

Table 11. Important MPI Message Buffer Sizes for IMPACT-T.

Routine Name                             Percent of Total Time   Purpose
beambunchclass_kick2t_beambunch          32%                     Interpolate the space-charge fields E and B in the lab frame to individual particles + external fields.
beamlineelemclass_getfldt_beamlineelem   29%                     Get external field.
ptclmgerclass_ptsmv2_ptclmger            10%                     Move particles from one processor to four neighboring processors.

Table 12. Profile of Top Subroutines in IMPACT-T on the Cray XT4.
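The grid doubling in Hockney's algorithm can be sketched in one dimension (a toy free-space convolution with an invented kernel, not IMPACT-T's actual solver): padding the charge density onto a doubled grid lets a periodic FFT convolution reproduce the open-boundary result exactly, since wrap-around contributions land only in the discarded half.

```python
import numpy as np

def open_boundary_convolution(rho, green):
    """Compute phi[i] = sum_j green[|i-j|] * rho[j] on an open domain
    of n points using FFTs on a doubled (2n) periodic grid."""
    n = len(rho)
    m = 2 * n                      # Hockney: double the grid
    rho_pad = np.zeros(m)
    rho_pad[:n] = rho              # original charge kept, zero elsewhere
    # Mirror the Green's function onto the doubled periodic grid so
    # that g_pad[(i - j) mod m] = green[|i - j|] for 0 <= i, j < n.
    g_pad = np.zeros(m)
    g_pad[:n] = green
    g_pad[n + 1:] = green[1:][::-1]
    phi = np.fft.ifft(np.fft.fft(rho_pad) * np.fft.fft(g_pad)).real
    return phi[:n]

# Check against the direct O(n^2) sum with a toy kernel G(k) = 1/(1+k).
n = 16
rho = np.random.default_rng(0).random(n)
green = 1.0 / (1.0 + np.arange(n))
direct = np.array([sum(green[abs(i - j)] * rho[j] for j in range(n))
                   for i in range(n)])
print(np.allclose(open_boundary_convolution(rho, green), direct))  # True
```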
Importance for NERSC-6

For NERSC-6 benchmarking, IMPACT-T offers the following characteristics: the object-oriented Fortran 90 coding style presents compiler optimization challenges; the FFT Poisson solver stresses collective communications with moderate message sizes; greater complexity than other PIC codes due to external fields and open boundary conditions; relatively moderate computational intensity; the fixed global problem size causes smaller message sizes and increasing importance of MPI latency at higher concurrencies.
Figure 16. MPI profile on 1024 cores of Franklin.

Figure 17. MPI call counts.

Figure 18. Message topology for IMPACT-T.
Performance

Although they are both PIC codes, performance for IMPACT-T (Table 13) is quite different from that of GTC, with IMPACT-T generally achieving a lower rate due to the increased importance of global communications and a considerably lower computational intensity.

         IBM Power5       IBM BG/P         QC Opteron       DC Opteron
         Bassi                             Cray XT4         Cray XT4
                                           Jaguar           Franklin
Cores    GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.
256      143      7%      34       4%      111      5%      130      10%
1024     n/a      n/a     174      5%      513      6%      638      12%

Table 13. Measured Aggregate Performance and Percent of Peak for IMPACT-T.
MAESTRO: Low Mach Number Stellar Hydrodynamics Code

MAESTRO is used for simulating astrophysical flows such as those leading up to ignition in Type Ia supernovae. It contains a new algorithm specifically designed to neglect acoustic waves in the modeling of low Mach number convective fluids while retaining compressibility effects due to the nuclear reactions and background stratification that take place in the centuries-long period prior to explosion. MAESTRO allows for large time steps, finite-amplitude density and temperature perturbations, and the hydrostatic evolution of the base state of the star. The basic discretization in MAESTRO combines a symmetric operator-split treatment of chemistry and transport with a density-weighted approximate projection method to impose the evolution constraint. The resulting integration proceeds on the time scale of the relatively slow advective transport. Faster diffusion and chemistry processes are treated implicitly in time. This integration scheme is embedded in an adaptive mesh refinement algorithm based on a hierarchical system of rectangular, non-overlapping grid patches at multiple levels with different resolution; however, in the NERSC-6 benchmark the grid does not adapt. A multigrid solver is used.
Parallelization

Parallelization in MAESTRO is via a 3-D domain decomposition in which data and work are apportioned using a coarse-grained distribution strategy to balance the load and minimize communication costs. Options for data distribution include a knapsack algorithm, with an additional step to balance communications, and Morton-ordering space-filling curves. Studies have shown that the time-to-solution for the low Mach number algorithm is roughly two orders of magnitude faster than for a more traditional compressible reacting flow solver. MAESTRO uses BoxLib, a foundation library of Fortran 90 modules from LBNL that facilitates the development of block-structured finite difference algorithms, with rich data structures for describing operations on data defined in regions of index space that are unions of non-intersecting rectangles. It is particularly useful in unsteady problems where the regions of interest may change in response to an evolving solution. The MAESTRO communication topology pattern (Figure 19) is quite unusual. Other communication characteristics are shown in Figures 20-22.
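Morton (Z-order) indexing, one of the distribution options mentioned above, can be sketched as follows (a generic bit-interleaving routine with invented box coordinates, not BoxLib's implementation):

```python
def morton3d(x, y, z, bits=10):
    """Interleave the bits of integer coordinates (x, y, z) into a
    Morton (Z-order) key. Sorting grid boxes by this key keeps
    spatially nearby boxes nearby in the ordering, which improves
    locality when boxes are distributed across processors."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (3 * b)
        key |= ((y >> b) & 1) << (3 * b + 1)
        key |= ((z >> b) & 1) << (3 * b + 2)
    return key

# Hypothetical box coordinates, ordered along the Z-order curve.
boxes = [(2, 0, 0), (1, 1, 1), (0, 0, 0), (0, 1, 0), (1, 0, 0)]
print(sorted(boxes, key=lambda b: morton3d(*b)))
# [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 1), (2, 0, 0)]
```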
Relationship to NERSC Workload

MAESTRO and the algorithms it represents are used in at least two areas at NERSC: supernova ignition and turbulent flame combustion studies, having SciDAC, INCITE, and base allocations.
Figure 19. Communication topology for MAESTRO from IPM.

Figure 20. MPI call counts for MAESTRO.

Figure 21. Distribution of MPI times on 512 cores of Franklin from IPM.

Figure 22. Distribution of MPI buffer sizes by time from IPM on Franklin.
Importance for NERSC-6

For NERSC-6 benchmarking, MAESTRO offers the following characteristics: the unusual communication topology should stress simple-topology interconnects and represent characteristics associated with irregular or refined grids; very low computational intensity stresses memory performance, especially latency; implicit solver technology stresses global communications; a wide range of message sizes from short to relatively moderate.
Performance

MAESTRO achieves a relatively low overall rate on the systems shown in Table 14. The computational intensity is very low, most likely as a result of the non-floating-point overhead required by the adaptive mesh framework. The data also suggest that MPI communications performance may be a bottleneck, since the efficiency decreases for the larger problem of this weak-scaling set.

         IBM Power5       IBM BG/P         QC Opteron       DC Opteron
         Bassi                             Cray XT4         Cray XT4
                                           Jaguar           Franklin
Cores    GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.
512      178      5%      52*      3%      230      5%      245      9%
2048     n/a      n/a     --       --      406      2%      437      4%

Table 14. Measured Aggregate Performance and Percent of Peak for MAESTRO.
MILC: MIMD Lattice Computation

The benchmark code MILC represents part of a set of codes written by the MIMD Lattice Computation (MILC) collaboration to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics. It performs simulations of four-dimensional SU(3) lattice gauge theory on MIMD parallel machines. Strong interactions are responsible for binding quarks into protons and neutrons and holding them all together in the atomic nucleus.

The MILC collaboration has produced application codes to study several different QCD research areas, only one of which, ks_dynamical simulation with conventional dynamical Kogut-Susskind quarks, is used here.

QCD discretizes space and evaluates field variables on the sites and links of a regular hypercube lattice in four-dimensional space-time. Each link between nearest neighbors in this lattice is associated with a three-dimensional SU(3) complex matrix for a given field.

QCD involves integrating an equation of motion for hundreds or thousands of time steps, which requires inverting a large, sparse matrix at each step of the integration. The sparse matrix problem is solved using a conjugate gradient method, but because the linear system is nearly singular, many CG iterations are required for convergence. Within a processor, the four-dimensional nature of the problem requires gathers from widely separated locations in memory. The matrix in the linear system being solved contains sets of complex three-dimensional "link" matrices, one per 4-D lattice link, but only links between odd sites and even sites are non-zero. The inversion by CG requires repeated three-dimensional complex matrix-vector multiplications, which reduce to dot products of three pairs of three-dimensional complex vectors. The code separates the real and imaginary parts, producing six dot products of pairs of six-dimensional real vectors. Each such dot product consists of five multiply-add operations and one multiply.
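The flop accounting above can be made concrete (a generic 3x3 complex matrix-vector product and a tally following the decomposition described in the text, not MILC's optimized code):

```python
def su3_matvec(m, v):
    """Multiply a 3x3 complex matrix by a 3-component complex vector,
    the kernel applied for every lattice link in the CG inversion."""
    return [m[i][0] * v[0] + m[i][1] * v[1] + m[i][2] * v[2]
            for i in range(3)]

# Each of the 3 output components is a dot product of two 3-component
# complex vectors; separating real and imaginary parts turns it into
# 2 dot products of six-dimensional real vectors, so 6 real dot
# products in all. Each costs 5 multiply-adds plus 1 multiply.
real_dots = 6
madds, muls = 5, 1
flops_per_matvec = real_dots * (2 * madds + muls)
print(flops_per_matvec)  # 66 flops per link matrix-vector product
```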
Relationship to NERSC Workload

MILC has widespread physics community use and a large allocation of resources on NERSC systems. Funded through High Energy Physics Theory, it supports research that addresses fundamental questions in high energy and nuclear physics and is directly related to major experimental programs in these fields.
Parallelization

The primary parallel programming model for MILC is a 4-D domain decomposition with each MPI process assigned an equal number of sub-lattices of contiguous sites. In a four-dimensional problem, each site has eight nearest neighbors. The code is organized so that message-passing routines are compartmentalized from what is built as a library of single-processor linear algebra routines. Many recent production MILC runs have used the RHMC algorithm, an improvement in the molecular dynamics evolution process for QCD that reduces computational cost dramatically.

All three NERSC-6 MILC tests are set up to actually do two runs each. This is to more accurately represent the work that the CG solve must do in actual QCD simulations. The CG solver will take insufficient iterations to converge if one starts with an ordered system, so we first do a short run with a few steps, a larger step size, and a loose convergence criterion. This lets the lattice evolve from totally ordered. In the NERSC-6 version of the code, this portion of the run is timed as "INIT_TIME," and it takes about two minutes on the NERSC Cray XT4. Then, starting with this "primed" lattice, we increase the accuracy for the CG solve, and the iteration count per solve goes up to a more representative value. This is the portion of the code that we time. Between 65,000 and 75,000 CG iterations are done (depending on problem size). The three NERSC-6 problems use weak scaling, and the only difference between the three is the size of the lattice; i.e., there are no differences in steps per trajectory or number of trajectories. In Table 15, the local lattice size is based on the target concurrency. Note that these sizes were chosen as a compromise between perceived science needs and a variety of practical considerations. Scaling studies with larger concurrencies can easily be done using MILC. Estimates of QCD runs in the timeframe of the NERSC-6 system include some of the lattices in the NERSC-6 benchmark but also up to about 96^3 x 192. Communication characteristics are presented in Figures 23-26.

Target Concurrency   Lattice Size (Global)   Lattice Size (Local)
256 cores            32x32x32x36             8x8x8x9
1024 cores           64x64x32x72             8x8x8x9
8192 cores           64x64x64x144            8x8x8x9

Table 15. NERSC-6 MILC lattice sizes.
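The weak-scaling construction of Table 15 can be checked with a one-line volume ratio (a sketch; the decomposition simply tiles the global 4-D lattice with equal local sub-lattices):

```python
from math import prod

def subdomains(global_lattice, local_lattice):
    """Number of MPI ranks implied by a 4-D lattice decomposition:
    the ratio of global to local lattice volumes."""
    return prod(global_lattice) // prod(local_lattice)

# 256-core benchmark: 32x32x32x36 global over 8x8x8x9 local tiles.
print(subdomains((32, 32, 32, 36), (8, 8, 8, 9)))   # 256
# 8192-core benchmark: 64x64x64x144 global, same local volume.
print(subdomains((64, 64, 64, 144), (8, 8, 8, 9)))  # 8192
```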
Figure 23. MPI call counts for MILC.

Figure 24. Communication topology for MILC from IPM on Franklin.

Figure 25. Distribution of communication times for MILC from IPM on 1024 cores of Franklin.

Figure 26. MPI message buffer size distribution based on time for MILC on Franklin from IPM.
Importance for NERSC-6

For NERSC-6 benchmarking, MILC offers the following characteristics: the CG solver with the small subgrid sizes used stresses the interconnect with small messages for both point-to-point and collective operations; extremely dependent on memory bandwidth and prefetching; high computational intensity.
Performance

Table 16 shows that MILC, too, achieves a relatively high percentage of peak on some systems. The dual-core Opteron results include some code tuned for that system, but quad-core Opteron code is not yet available. The increasing effect of communications at larger concurrencies is obvious from the Franklin and Jaguar results on this weak-scaled problem.

         IBM Power5       IBM BG/P         QC Opteron       DC Opteron
         Bassi                             Cray XT4         Cray XT4
                                           Jaguar           Franklin
Cores    GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.
256      488      25%     113      13%     203      9%      291      22%
1024     n/a      n/a     456      13%     513      6%      1101     21%
8192     n/a      n/a     n/a      n/a     3179     5%      5783     14%

Table 16. Measured Aggregate Performance and Percent of Peak for MILC.
PARATEC: Parallel Total Energy Code

The benchmark code PARATEC (PARAllel Total Energy Code) performs ab initio quantum-mechanical total energy calculations using pseudopotentials and a plane wave basis set. Total energy minimization of electrons is done with a self-consistent field (SCF) method. Force calculations are also done to relax the atoms into equilibrium positions. PARATEC uses an all-band (unconstrained) conjugate gradient (CG) approach to solve the Kohn-Sham equations of density functional theory (DFT) and obtain the ground-state electron wavefunctions. In solving the Kohn-Sham equations using a plane wave basis, part of the calculation is carried out in Fourier space, where the wavefunction for each electron is represented by a sphere of points, and the remainder is in real space. Specialized parallel three-dimensional FFTs are used to transform the wavefunctions between real and Fourier space, and a key optimization in PARATEC is to transform only the non-zero grid elements. The code can use optimized libraries for both basic linear algebra and fast Fourier transforms, but due to its global communication requirements, architectures with a poor balance between bisection bandwidth and computational rate will suffer performance degradation at higher concurrencies on PARATEC. Nevertheless, due to favorable scaling, high computational intensity, and other optimizations, the code generally achieves a high percentage of peak performance on both superscalar and vector systems.
In the benchmark problems supplied here, 20 conjugate gradient iterations are performed; however, a real run might typically do up to 60.
Relationship to NERSC Workload

A recent survey of NERSC ERCAP requests for materials science applications showed that DFT codes similar to PARATEC accounted for nearly 80% of all HPC cycles delivered to the Materials Science community. Supported by DOE BES, PARATEC is an excellent proxy for the application requirements of the entire Materials Science community. PARATEC simulations can also be used to predict nuclear magnetic resonance shifts. The overall goal is to simulate the synthesis and predict the properties of multi-component nanosystems.
Parallelization

In a plane-wave simulation, a grid of points represents each electron's wavefunction. Parallel decomposition can be over n(g), the number of grid points per electron (typically O(100,000) per electron); n(i), the number of electrons (typically O(800) per system simulated); or n(k), the number of sampling points (O(1-10)). PARATEC uses MPI and parallelizes over grid points, thereby achieving a fine-grain level of parallelism.

In Fourier space, each electron's wavefunction grid forms a sphere. Figure 27 depicts a visualization of the parallel data layout on three processors. Each processor holds several columns, which are lines along the z-axis of the FFT grid. Load balancing is important because much of the compute-intensive part of the calculation is carried out in Fourier space. To get good load balancing, the columns are first sorted in descending order of length and then assigned, one at a time, to the processor containing the fewest points. The real-space data layout of the wavefunctions is on a standard Cartesian grid, where each processor holds a contiguous part of the space arranged in columns, shown in Figure 27b.

Custom three-dimensional FFTs transform between these two data layouts. Data are kept arranged in columns as the three-dimensional FFT is performed, by taking one-dimensional FFTs along the z, y, and x directions with parallel data transposes between each set of one-dimensional FFTs. These transposes require global inter-processor communication and represent the most important impediment to high scalability. The communication topology for PARATEC is shown in Figure 30.

The FFT portion of the code scales approximately as n² log(n), and the dense linear algebra portion scales approximately as n³; therefore, the overall computation-to-communication ratio scales as n, where n is the number of atoms in the simulation. In general, the speed at which the FFT proceeds dominates the runtime. Table 19, below, compares the speed of the FFT and BLAS-dominated Projector portions of the code for various machines. The data show that PARATEC is an extremely useful benchmark for discriminating between machines based on both computation and communication speed. Communication characteristics are shown in Figures 28–30 and in Table 17.
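The transform scheme described above (successive one-dimensional FFTs separated by data transposes) can be sketched serially with numpy. In PARATEC each transpose is a global inter-processor exchange; in this minimal sketch it is simply an in-memory axis permutation, and the grid size is arbitrary.

```python
import numpy as np

def fft3d_by_1d(a):
    """Compute a 3-D FFT as three rounds of 1-D FFTs with transposes
    between rounds, mirroring PARATEC's column-oriented scheme.
    Serial sketch: in the parallel code each transpose is a global
    inter-processor data exchange."""
    # Round 1: 1-D FFTs along the z (last) axis of each column.
    a = np.fft.fft(a, axis=2)
    # Transpose so the y axis becomes contiguous, then transform it.
    a = a.transpose(0, 2, 1)          # axes now ordered (x, z, y)
    a = np.fft.fft(a, axis=2)
    # Transpose so the x axis becomes contiguous, then transform it.
    a = a.transpose(2, 1, 0)          # axes now ordered (y, z, x)
    a = np.fft.fft(a, axis=2)
    # Undo the permutations to return data in (x, y, z) order.
    return a.transpose(2, 0, 1)

rng = np.random.default_rng(0)
grid = rng.standard_normal((8, 8, 8))
# The decomposed transform agrees with the library 3-D FFT.
assert np.allclose(fft3d_by_1d(grid), np.fft.fftn(grid))
```

Because the 1-D transforms along different axes commute, the result matches a direct three-dimensional FFT; the cost of the scheme in the parallel code lies almost entirely in the transposes.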
Figure 27. A three-processor example of PARATEC's parallel data layout for the wavefunctions of each electron in (a) Fourier space and (b) real space.
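The column-assignment heuristic described in the Parallelization section (sort columns by length, longest first; give each to the least-loaded processor) can be sketched as follows. The column lengths here are invented for illustration, not taken from a real PARATEC run.

```python
import heapq

def assign_columns(column_lengths, nprocs):
    """Greedy load balancing: sort columns by length (longest first),
    then repeatedly give the next column to the processor that
    currently holds the fewest grid points."""
    heap = [(0, p) for p in range(nprocs)]   # (points held, processor id)
    heapq.heapify(heap)
    owner = {}
    for col, length in sorted(enumerate(column_lengths),
                              key=lambda kv: kv[1], reverse=True):
        points, p = heapq.heappop(heap)      # least-loaded processor
        owner[col] = p
        heapq.heappush(heap, (points + length, p))
    return owner

# Hypothetical column lengths for a small sphere of Fourier coefficients.
lengths = [9, 7, 7, 5, 5, 3, 3, 1]
owner = assign_columns(lengths, nprocs=3)
loads = [sum(lengths[c] for c, p in owner.items() if p == q) for q in range(3)]
print(loads)   # [15, 13, 12]
```

Even this simple longest-first heuristic keeps the per-processor point counts within a few points of each other, which is what matters for the Fourier-space compute phase.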
Figure 28. MPI call counts for PARATEC.

Figure 29. Distribution of MPI times for PARATEC on 256 cores of Franklin, from IPM.

Figure 30. Communication topology for PARATEC, from IPM.
Message Buffer Size    Count on 256 Cores    Count on 1024 Cores
Total Message Count    428,318               1,940,665
16 ≤ MsgSz < 256       --                    114,432
256 ≤ MsgSz < 4KB      20,337                1,799,211
4KB ≤ MsgSz < 64KB     403,917               4,611
64KB ≤ MsgSz < 1MB     1,256                 22,412
1MB ≤ MsgSz < 16MB     2,808                 --

Table 17. Important MPI Message Buffer Sizes for PARATEC (from IPM on Franklin)
Importance for NERSC-6

For NERSC-6 benchmarking, PARATEC offers the following characteristics:

- A hand-coded spectral solver with all-to-all data transpositions stresses global communications bandwidth.
- Messages are mostly point-to-point, and since in a single 3-D FFT the size of the data packets scales as the inverse of the square of the number of processors, messages are relatively short.
- The code can be very memory-capacity limited.
- A tuning parameter offers the opportunity to trade memory usage for interprocessor communication intensity.
- It requires quality implementations of system libraries (FFTW/ScaLAPACK/BLACS/BLAS).
- Use of BLAS3 and 1-D FFT libraries results in high cache reuse and a high percentage of per-processor peak performance.
- The fixed global problem size causes smaller message sizes and increasing importance of MPI latency at higher concurrencies.
- It is moderately sensitive to TLB performance.
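The inverse-square message-size scaling can be sketched numerically. The 512³ grid and 16-byte complex elements below are assumptions for illustration only (the real PARATEC grid depends on the input system), but the resulting payloads fall into the same IPM buckets that dominate Table 17 at each concurrency.

```python
# Illustrative estimate of per-message payload in an all-to-all FFT
# transpose: the n^3 grid is exchanged among P^2 processor pairs.
# The 512^3 grid and 16-byte complex element size are assumptions,
# not measured PARATEC parameters.
n = 512                      # grid points per dimension (assumed)
bytes_per_elem = 16          # one double-precision complex value

def transpose_msg_bytes(p):
    """Approximate bytes per message when P processors transpose n^3 points."""
    return n**3 * bytes_per_elem // p**2

print(transpose_msg_bytes(256))    # 32768 bytes: 4KB <= MsgSz < 64KB bucket
print(transpose_msg_bytes(1024))   # 2048 bytes: 256 <= MsgSz < 4KB bucket
```

Quadrupling the processor count shrinks each message by a factor of 16, which is why latency, rather than bandwidth, increasingly dominates at higher concurrencies.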
Performance

Tables 18 and 19 show performance data for PARATEC. The code calculates its own floating-point operation rate for two key components: the "Projector" portion of the code, which is dominated by BLAS3 routines, and the FFTs, which use tuned system libraries but depend on global communications. The comparison among systems, especially between the dual-core and quad-core Cray XT4 systems, is quite interesting.

        IBM Power5        IBM              QC Opteron        DC Opteron
        Bassi             BG/P             Cray XT4 Jaguar   Cray XT4 Franklin
Cores   GFLOPs   Effic.   GFLOPs   Effic.  GFLOPs   Effic.   GFLOPs   Effic.
256     30       7%       517      --      665      31%      729      55%
1024    n/a      n/a      710      20%     2044     24%      2234     42%

Table 18. Measured Aggregate Performance and Percent of Peak for PARATEC
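The efficiency column of Table 18 can be recomputed from the aggregate rates. The per-core peak values used below (5.2 GFLOP/s for Franklin's 2.6 GHz dual-core Opteron, 8.4 GFLOP/s for Jaguar's 2.1 GHz quad-core Opteron) are assumptions based on the processors' clock rates and flops per cycle, not figures taken from this report.

```python
# Recompute percent-of-peak from Table 18's 256-core aggregate rates.
# Per-core peak GFLOP/s values are assumptions (clock rate x flops/cycle),
# not numbers stated in the report.
systems = {
    # name: (measured aggregate GFLOP/s at 256 cores, peak GFLOP/s per core)
    "Franklin (2.6 GHz DC Opteron)": (729, 5.2),
    "Jaguar (2.1 GHz QC Opteron)":   (665, 8.4),
}

for name, (agg, peak_per_core) in systems.items():
    eff = agg / (256 * peak_per_core)
    print(f"{name}: {eff:.0%}")
# Franklin comes out near 55% and Jaguar near 31%, matching Table 18.
```

The same arithmetic applied to the 1024-core row reproduces the 42% and 24% entries, which is a useful sanity check on the assumed per-core peaks.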
Machine    FFT Rate        Projector Rate   Overall Rate
           (MFLOPS/core)   (MFLOPS/core)    (Agg. GFLOPS)
Franklin   213             1045             729
Jaguar     129             1581             665
BG/P       377             561              517
HLRB-II    694             993              890
Bassi      1028            1287             1213

Table 19. Performance Profile for PARATEC (256 cores)
Summary

Table 20 presents a summary of a few of the important characteristics of the codes. In addition to doing a good job of representing key science areas within the NERSC Office of Science workload, the codes provide a good range of architectural "stress points" such as computational intensity and MPI message intensity, type, and size. As such, the codes provide an effective way of differentiating between systems and will be useful for all three goals related to the NERSC-6 project. Finally, the codes provide a strong suite of scalable benchmarks, capable of being run at a wide range of concurrencies.

Code                             CAM     GAMESS   GTC    IMPACT-T   MAESTRO   MILC    PARATEC
CI*                              0.67    0.61     1.15   0.77       0.24      1.39    1.50
Cray XT4 % peak per core
  (largest case)                 13%     12%      24%    14%        5%        14%     44%
Cray XT4 % MPI, medium           29%     4%       9%     20%        1%        20%     27%
Cray XT4 % MPI, large            35%     6%       --     40%        --        23%     64%
Cray XT4 % MPI, extra large      n/a     n/a      n/a    n/a        n/a       30%     n/a
Cray XT4 avg msg size, medium    113KB   n/a      1MB    35KB       2KB       16KB    34KB

Table 20. Comparison of Computational Characteristics for NERSC-6 Benchmarks. *CI is the computational intensity, the ratio of the number of floating-point operations to the number of memory operations.
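The CI row of Table 20 comes from hardware-counter measurements, but the definition can be illustrated by counting operations in a simple kernel. The daxpy-like loop below uses idealized textbook counts (no cache effects), and is a hypothetical example rather than a kernel from any of the benchmark codes.

```python
# Idealized flop and memory-operation counts for y[i] = a*x[i] + y[i].
# Per iteration: 1 multiply + 1 add = 2 flops; 2 loads + 1 store = 3
# memory operations. These are textbook counts, not hardware-counter
# measurements like those behind Table 20.
n = 1000
flops = 2 * n
mem_ops = 3 * n
ci = flops / mem_ops
print(f"CI = {ci:.2f}")   # CI = 0.67, comparable to the lower-CI codes
```

Kernels dominated by BLAS3, by contrast, reuse each loaded operand many times and so reach CI values well above 1, which is part of why PARATEC (CI 1.50) sustains such a high fraction of peak.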
Bibliography (partial)

Adrian T. Wong, Leonid Oliker, William T. C. Kramer, Teresa L. Kaltz, and David H. Bailey, "Evaluating System Effectiveness in High Performance Computing Systems," Proceedings of SC2000, November 2000.

Adrian T. Wong, Leonid Oliker, William T. C. Kramer, Teresa L. Kaltz, and David H. Bailey, "System Utilization Benchmark on the Cray T3E and IBM SP," 5th Workshop on Job Scheduling Strategies for Parallel Processing, Cancun, Mexico, May 2000.

William Kramer and Clint Ryan, "Performance Variability on Highly Parallel Architectures," International Conference on Computational Science 2003, Melbourne, Australia and St. Petersburg, Russia, June 2-4, 2003.

Ingrid Y. Bucher, "The Computational Speed of Supercomputers," Proceedings of the 1983 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1983.

I. Y. Bucher and J. L. Martin, "Methodology for Characterizing a Scientific Workload," in CPEUG '82, pages 121-126, 1982.

Horst D. Simon, William T. C. Kramer, David H. Bailey, Michael J. Banda, E. Wes Bethel, Jonathon T. Carter, James M. Craw, William J. Fortney, John A. Hules, Nancy L. Meyer, Juan C. Meza, Esmond G. Ng, Lynn E. Rippe, William C. Saphir, Francesca Verdier, Howard A. Walter, and Katherine A. Yelick, "Science-Driven Computing: NERSC's Plan for 2006-2010," Lawrence Berkeley National Laboratory Report LBNL-57582, May 2005.

Horst Simon, William Kramer, William Saphir, John Shalf, David Bailey, Leonid Oliker, Michael Banda, C. William McCurdy, John Hules, Andrew Canning, Marc Day, Philip Colella, David Serafini, Michael Wehner, and Peter Nugent, "Science-Driven System Architecture: A New Process for Leadership Class Computing," J. Earth Sim., Vol. 2, January 2005.

Leonid Oliker, Andrew Canning, Jonathan Carter, Costin Iancu, Michael Lijewski, Shoaib Kamil, John Shalf, Hongzhang Shan, Erich Strohmaier, Stephane Ethier, and Tom Goodale, "Scientific Application Performance on Candidate PetaScale Platforms," International Parallel & Distributed Processing Symposium (IPDPS), March 24-30, 2007, Long Beach, CA.
Leonid Oliker, Andrew Canning, Jonathan Carter, John Shalf, David Skinner, Stephane Ethier, Rupak Biswas, Jahed Djomehri, and Rob Van der Wijngaart, "Performance Evaluation of the SX6 Vector Architecture for Scientific Computations," http://crd.lbl.gov/~oliker/drafts/SX6_eval.pdf

A. S. Almgren, J. B. Bell, and M. Zingale, "MAESTRO: A Low Mach Number Stellar Hydrodynamics Code," Journal of Physics: Conference Series 78 (2007) 012085.

J. B. Bell, A. J. Aspden, M. S. Day, and M. J. Lijewski, "Numerical simulation of low Mach number reacting flows," Journal of Physics: Conference Series 78 (2007) 012004.

Y. Ding, Z. Huang, C. Limborg-Deprey, and J. Qiang, "LCLS Beam Dynamics Studies with the 3-D Parallel Impact-T Code," Stanford Linear Accelerator Center Publication SLAC-PUB-12673, presented at the 22nd Particle Accelerator Conference, Albuquerque, New Mexico, U.S.A., June 25-29, 2007.

Ji Qiang, Robert D. Ryne, Salman Habib, and Viktor Decyk, "An Object-Oriented Parallel Particle-in-Cell Code for Beam Dynamics Simulation in Linear Accelerators," Journal of Computational Physics 163, 434-451 (2000).

S. C. Jardin, Editor, "DOE Greenbook: Needs and Directions in High Performance Computing for the Office of Science," Princeton Plasma Physics Laboratory Report PPPL-4090, June 2005. Available from http://www.nersc.gov/news/greenbook/.

William Kramer, John Shalf, and Erich Strohmaier, "The NERSC Sustained System Performance (SSP) Metric," Lawrence Berkeley National Laboratory Report LBNL-58868, http://repositories.cdlib.org/lbnl/LBNL-58868, 2005.

L. Oliker, J. Shalf, J. Carter, A. Canning, S. Kamil, M. Lijewski, and S. Ethier, "Performance Characteristics of Potential Petascale Scientific Applications," in Petascale Computing: Algorithms and Applications (David Bader, editor), Chapman & Hall/CRC Press, 2007.

Lin-Wang Wang, "A Survey of Codes and Algorithms used in NERSC Materials Science Allocations," Lawrence Berkeley National Laboratory Report LBNL-61051, 2005.

A. Mirin and P. Worley, "Extending Scalability of the Community Atmosphere Model," Journal of Physics: Conference Series 78 (2007) 012082.

Arnold Kritz, David Keyes, et al., "Fusion Simulation Project Workshop Report," available from http://www.ofes.fusion.doe.gov/programdocuments.shtml.

K. Yelick, "Programming in UPC (Unified Parallel C)," CS267 Lecture, Spring 2006, lecture notes available from http://upc.lbl.gov/publications/.
For more information on the benchmarks, see http://www.nersc.gov/projects/SDSA/software/?benchmark=NERSC5