LBNL‐1014E
NERSC-6 Workload Analysis and Benchmark Selection Process

Katie Antypas, John Shalf, and Harvey Wasserman
National Energy Research Scientific Computing Center Division Ernest Orlando Lawrence Berkeley National Laboratory
Berkeley, CA 94720
August 13, 2008
This work was supported by the U.S. Department of Energy’s Office of Science, Office of Advanced Scientific Computing Research, Mathematical, Information and Computational
Sciences Division under Contract No. DE-AC02-05CH11231.
Abstract
This report describes efforts carried out during early 2008 to determine some of the science drivers for the "NERSC-6" next-generation high-performance computing system acquisition. Although the starting point was existing Greenbooks from DOE and the NERSC User Group, the main contribution of this work is an analysis of the current NERSC computational workload combined with requirements information elicited from key users and other scientists about expected needs in the 2009-2011 timeframe. The NERSC workload is described in terms of science areas, computer codes supporting research within those areas, and descriptions of the key algorithms that comprise the codes. This work was carried out in large part to help select a small set of benchmark programs that accurately capture the science and algorithmic characteristics of the workload. The report concludes with a description of the codes selected and some preliminary performance data for them on several important systems.
NERSC-6 Workload Analysis and Benchmark Selection Process
Introduction

NERSC's science-driven strategy for increasing research productivity focuses on the balanced and timely introduction of the best new technologies to benefit the broadest subset of the NERSC workload. The goal in procuring new systems is to enable new science discoveries via computations of a scale that greatly exceeds what is possible on current systems. The key to understanding the requirements for a new system is to translate scientific needs expressed by simulation scientists into a set of computational demands, and from these into a set of hardware and software attributes required to support them. This combination of requirements and demands is abstracted into a set of representative application benchmarks that by necessity also capture characteristic styles of coding and thus assure a workload-driven evaluation. The importance of workload-based performance analysis was best expressed by Ingrid Bucher and Joanne Martin in a 1982 LANL technical report, where they stated, "Benchmarks are only useful insofar as they model the intended computational workload."

At NERSC these carefully chosen benchmarks go beyond representing a set of individual scientific and computational needs. The benchmarks also comprise the Sustained System Performance (SSP) metric, an aggregate measure of the real, delivered performance of a computing system. The SSP is evaluated at discrete points in time and also as an integrated value to give the system's potency, meaning an estimate of how well the system will perform the expected work over some time period. Taken together, the NERSC SSP benchmarks, their derived aggregate measures, and the entire NERSC workload-driven evaluation methodology create a strong connection between science requirements, how the machines are used, and the tests we use.

This document begins with an examination of the key science drivers for DOE high-end computation that were identified in the 2005 Greenbook report. We identify any changes in the science drivers and commensurate requirements that might affect the selection of next-generation systems. We then provide an introduction to NERSC's workload analysis and benchmark selection process. Finally, we present an analysis of the performance of the selected benchmark applications. A brief snapshot of benchmark performance as of May 2008 is presented. These are preliminary data, measured, in general, with no attempt at optimization; in other words, representing what a user might observe upon initial porting of the codes. In some cases the data contain estimates, either because the full run is too long or because it requires more cores than we have access to on the given machine.
Overview of the NERSC Workload

NERSC serves a diverse user community of over 3000 users and 400 distinct projects. The workload comprises some 600 codes serving the diverse science needs of the DOE Office of Science research community. Despite the diversity of codes, the median job size at NERSC is a substantial fraction of the overall size of NERSC's flagship XT4 computing system. NERSC's workload covers the needs of 15 science areas across six DOE Office of Science divisions, in addition to users from other research and academic institutions outside of the DOE. The pie charts in Figure 1 show the percentages of cycles at NERSC awarded to the various DOE offices and science areas.
Figure 1: Awards by DOE office (left) and awards by science area (right).
In this study we focused on the largest components of the NERSC workload, which include Fusion, Chemistry, Materials Science, Climate Research, Lattice Gauge Theory, Accelerator Physics, Astrophysics, Life Sciences, and Nuclear Physics, as shown in Figure 2.
Figure 2: The NERSC workload study focused on dominant contributors to the NERSC workload.
Climate Science

Climate science computing at NERSC includes a variety of prediction and simulation activities, including critical support of the U.S. submission to the Intergovernmental Panel on Climate Change (IPCC). Figure 3 shows that climate science allocations are dominated by CAM, the Community Climate System Model (CCSM3), and weather modeling codes. There are INCITE allocations that are also large players, but because INCITE is not a steady-state allocation we do not consider these projects further. With the INCITE allocations removed, the dominance of CAM and CCSM3 is clear (Table 1). Furthermore, CAM is the dominant computational component of the fully coupled CCSM3 climate model. Therefore, CAM is a good proxy for the most demanding components of the climate workload.
Figure 3: Distribution of climate allocation with and without the contribution of the INCITE allocations. The INCITE awards for "forecast" and "filter" codes are much larger than they are in the steady-state workload.
Table 1: Breakdown of the NERSC workload in Climate Sciences. CCSM and CAM account for >74% of the workload. Since CAM is the dominant component of CCSM, CAM is clearly a good proxy for the climate requirements.
In order to be useful for climate research, CAM must achieve a target performance of 1000x faster than real time. This places a stringent requirement on the achieved performance of the climate application on the target system architecture. The current mesh resolution ultimately limits the exposable parallelism of the application. However, if the resolution is increased, the size of the simulation timesteps must be reduced to satisfy the Courant stability condition. Increasing the resolution by 2x requires 4x more time due to the shorter timesteps. Therefore, the climate application pushes towards higher per-processor sustained performance, which runs counter to current microprocessor architectural trends.
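As a rough illustration of the Courant-condition arithmetic above, the following sketch uses hypothetical constants (wave speed, Courant number) to show how refining the mesh shrinks the stable explicit timestep, so a fixed simulated period needs proportionally more steps, on top of the extra per-step work from the larger number of grid points:

```python
# Illustrative cost model for the Courant (CFL) constraint discussed above.
# All constants here are hypothetical; only the scaling matters: halving the
# grid spacing halves the largest stable explicit timestep.

def cfl_timestep(dx, wave_speed, courant=0.5):
    """Largest stable explicit timestep for grid spacing dx."""
    return courant * dx / wave_speed

def steps_needed(dx, wave_speed, simulated_time):
    """Number of timesteps to cover a fixed simulated period."""
    return simulated_time / cfl_timestep(dx, wave_speed)

base = steps_needed(1.0, wave_speed=1.0, simulated_time=100.0)
refined = steps_needed(0.5, wave_speed=1.0, simulated_time=100.0)
print(refined / base)   # 2.0: the timestep penalty alone, before grid growth
```

The 2x step-count factor is the part of the cost that cannot be recovered by adding processors to a fixed-size grid, which is why the text notes the pressure toward higher per-processor sustained performance.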
Although spectral CAM dominates the current workload, the IPCC community is rapidly migrating to the finite-volume formulation of CAM (fvCAM). Finite-volume CAM is more scalable than spectral CAM and is better able to exploit loop-level parallelism using a hybrid OpenMP+MPI programming model, and can therefore take advantage of SMP nodes. The hybrid model may mitigate the need for improved per-processor serial performance but would require wider SMP nodes. In anticipation of the larger mesh resolution targets, we selected the D mesh (~0.5 degree), which is likely to be the production resolution for some time.
Materials Science

Simulations in materials science play an indispensable role in nanoscience research, an area identified as one of the seven highest priorities for the Office of Science. Of particular interest recently are studies of quantum dots, due to their emerging importance in the miniaturization of electronic and optoelectronic devices, and computational needs seem virtually unlimited. For example, two specific applications that scientists in this area foresee include the following:

• Electronic structures and magnetic materials. The current state of the art is a 500-atom structure, which requires 1.0 Tflop/s sustained performance (typically for several hours) and 100 GB main memory. Future requirements (a hard disk drive simulation, for instance) are for a 5000-atom structure, which will require roughly 30 Tflop/s and 4 TB main memory.

• Molecular dynamics calculations. The current state of the art is for a 10^9-atom structure, which requires 1 Tflop/s sustained and 50 GB memory. Future requirements (an alloy microstructure analysis) will require 20 Tflop/s sustained and 5 TB main memory.
The materials workload presents a dizzying array of codes, as shown in Figure 4. There are 65 accounts under the BES/materials science category, which account for 20% of the 324 total NERSC accounts. The total HPC allocation for these accounts was 4.1M hours before Bassi was online and 8.8M hours after Bassi was online. Although this allocation as a fraction of the total is smaller than it was a few years ago, it still represents 13% of the total NERSC allocation (66.7M hours after Bassi was online). The materials science percentage has decreased partly because of the creation of special programs like SciDAC and INCITE. In this science area there is an average of about one code per account (or group of users). However, since the same code can be used by different accounts, one code is used by 2.15 user groups on average. Except for the VASP code, which is used by 23 groups, the majority of the codes are used by fewer than 5 groups, and many of the codes are used by only one group. This is a very diverse community, with many groups using their own codes.
Figure 4: The materials science area presents a dizzying array of codes, none of which dominates the workload.
The different codes can be classified into six categories based on their physics and the corresponding algorithms. These categories, and their corresponding allocations, are: density functional theory (DFT, 74%); beyond-DFT (GW+BSE, 6.9%); quantum Monte Carlo (QMC, 6.7%); classical molecular dynamics (CMD, 6.4%); classical Monte Carlo (CMC, 3.1%); and other partial differential equations (PDE, 2.9%). There is an overwhelming preference for the DFT method, owing to the current success of that method in ab initio materials science simulations. However, the DFT method itself has different numerical approaches such as plane-wave DFT, Green's function DFT, localized basis and orbital DFT, muffin-tin sphere type DFT, and real-space grid DFT. The most popular (both in terms of number of codes and HPC hours) is plane-wave DFT, for which there are 12 codes accounting for 1.6M hours.
Density Functional Theory (DFT)

In the DFT method, one needs to solve the single-particle Schrödinger equation (a second-order partial differential equation). Typically, 5-10% of the lowest eigenvectors are needed from this solution, and the number of eigenvectors is proportional to the number of electrons in the system. For example, for a thousand-atom system, a few thousand eigenvectors are needed. There is a key difference between the complexity of this approach and that of most engineering problems (e.g., fluid dynamics, climate simulation, combustion, electromagnetics), where only a small, fixed number of time-evolving or eigenvector fields are solved. In DFT, the Schrödinger equation itself depends on the eigenvectors through the density function (which gives rise to the name density functional theory). Thus, it is a nonlinear problem, but one that can be solved by self-consistent iteration of the linearized eigenstate problem represented by Schrödinger's equation, or by direct nonlinear minimization. Currently, most large-scale calculations are done using self-consistent iterations. Numerically, what distinguishes the different DFT methods and codes are the different basis sets used to describe the wavefunctions (the eigenvectors).
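The self-consistent iteration described above can be sketched with a deliberately tiny 2x2 model: the Hamiltonian depends on the density, the density comes from the occupied eigenvector, and the two are iterated to a fixed point. Every number and the linear-mixing scheme below are hypothetical; a production code performs the same loop with matrices of dimension in the millions.

```python
import math

# Toy self-consistent-field (SCF) loop: H depends on the density n, which in
# turn comes from the lowest eigenvector of H. Parameters are illustrative.

def lowest_eigvec(a, b, c):
    """Lowest eigenpair of the symmetric 2x2 matrix [[a, b], [b, c]], b != 0."""
    lam = 0.5 * (a + c) - math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    v = (-b, a - lam)                  # satisfies (A - lam I) v = 0
    norm = math.hypot(v[0], v[1])
    return lam, (v[0] / norm, v[1] / norm)

def scf(u=1.0, t=0.5, mix=0.5, tol=1e-10):
    n = 0.5                            # initial density guess on site 0
    for it in range(200):
        # The Hamiltonian depends on the density: this is the DFT nonlinearity
        _, psi = lowest_eigvec(u * n, -t, 0.0)
        n_new = psi[0] ** 2            # density from the occupied eigenvector
        if abs(n_new - n) < tol:
            return n_new, it
        n = (1 - mix) * n + mix * n_new    # linear mixing for stability
    raise RuntimeError("SCF did not converge")

density, iters = scf()
print(round(density, 6), iters)
```

At convergence the density reproduces itself through the eigenproblem, which is exactly the self-consistency condition the text describes.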
Plane-wave DFT

Plane-wave DFT uses plane waves to describe the wavefunctions, while real-space DFT uses a regular real-space grid. Due to the sharp peak of the potential near atomic nuclei, special care is needed to choose different basis sets. Besides the plane-wave and real-space grid, other conventional basis sets include: the atomic orbital basis set, where the eigenvectors of the atomic Schrödinger equation are used to describe the wavefunctions in a solid or molecule; Gaussian basis sets, which are more often used in quantum chemistry due to their analytical properties; the muffin-tin basis, where a spherical hole is cast out near each nucleus and spherical harmonics and Bessel functions are used to describe the wavefunction inside the hole; augmented plane waves, where spherical Bessel functions near the nuclei are connected with the plane waves in the interstitial regions and used as the basis set; and wavelet basis sets.

In terms of the methods used to solve the eigenstate problem, both iterative schemes and direct eigensolvers have been used in different codes. In plane-wave DFT, an iterative method (e.g., conjugate gradient) is often used; in the atomic orbital, Gaussian, and augmented plane-wave (FLAPW) methods, direct dense solvers (ScaLAPACK) are often used; and in real-space grid methods, sparse matrix solvers are used.

There are three key components of the iterative plane-wave method: the 3D global FFT for converting between real and reciprocal space, orthogonalization of basis functions using dense linear algebra, and pseudopotential calculations using dense linear algebra. The most time-consuming steps are the matrix-vector multiplication and vector-vector multiplication for orthogonalization. For highly parallel computation, the FFT is the primary bottleneck. However, if the problem size is increased, the computational requirements of the dense linear algebra grow as O(N^3), whereas the FFT requirements grow as O(N^2). Therefore, weak-scaling a DFT problem can produce dramatic improvements in the delivered flop rates given the locality of the linear-algebra calculation; but the benefits of weak scaling are highly problem dependent. For example, time-domain experiments do not benefit at all from this effect.
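The O(N^3)-versus-O(N^2) argument above can be made concrete with a toy cost model. The prefactors below are made up purely for illustration; the point is that the dense-linear-algebra share of the work approaches 100% as the problem is weak-scaled, which is why delivered flop rates improve:

```python
# Toy cost model for the two dominant plane-wave DFT components described in
# the text: dense linear algebra ~ N**3 vs. 3D FFT ~ N**2. The constants
# c_dense and c_fft are hypothetical placeholders, not measured values.

def dense_flops(n, c_dense=1.0):
    return c_dense * n ** 3      # orthogonalization + pseudopotential

def fft_flops(n, c_fft=50.0):
    return c_fft * n ** 2        # global 3D FFT

for n in (100, 1000, 10000):
    frac_dense = dense_flops(n) / (dense_flops(n) + fft_flops(n))
    print(n, round(frac_dense, 3))
```

Even with a large FFT prefactor, the FFT (and its global communication) shrinks to a few percent of the work at large N, while at small N and high concurrency it remains the bottleneck, matching the text.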
GW+BSE

GW+BSE is one approach to calculating excited states and optical spectra. It operates primarily on large, dense matrices. The most time-consuming parts are generating and diagonalizing these matrices. The diagonalization is often done using a dense eigensolver (with libraries such as ScaLAPACK). The dimension of the matrix is proportional to the square of the number of electrons in the system.
Quantum Monte Carlo (QMC)

Quantum Monte Carlo (QMC) uses stochastic random-walk methods to carry out the multidimensional integration of the many-body wavefunctions. Since it needs an ensemble sum over different independent walkers, it is embarrassingly parallel. QMC is a very accurate method, but it suffers from statistical noise; thus, it is difficult to use for atomic dynamics (where the forces on the atoms are needed).
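The embarrassingly parallel structure and the statistical noise mentioned above can both be seen in a minimal sketch: independent walkers (here each just integrating a simple 1-D function standing in for a many-body integral) produce estimates that are averaged, with error shrinking as 1/sqrt(samples). The integrand and all sizes are hypothetical:

```python
import random
import statistics

# Each "walker" is an independent Monte Carlo estimate; the final answer is
# the ensemble average. Walkers share no data, so they parallelize trivially.

def walker_estimate(rng, samples=1000):
    # Integrate f(x) = x**2 over [0, 1]; the exact value is 1/3.
    return sum(rng.random() ** 2 for _ in range(samples)) / samples

def qmc_ensemble(n_walkers=64, seed=0):
    estimates = []
    for w in range(n_walkers):
        rng = random.Random(seed + w)     # independent stream per walker
        estimates.append(walker_estimate(rng))
    mean = statistics.fmean(estimates)
    stderr = statistics.stdev(estimates) / n_walkers ** 0.5
    return mean, stderr

mean, stderr = qmc_ensemble()
print(round(mean, 3), round(stderr, 5))
```

The residual `stderr` is the statistical noise the text refers to: it decays only as the square root of the total sample count, which is what makes noise-sensitive quantities such as atomic forces hard to extract.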
Classical Molecular Dynamics and Classical Monte Carlo

In classical molecular dynamics (MD), the force-calculation step is done in parallel. Since the classical force-field formalism is local in nature (except for the electrostatic force), efficient parallelization is possible, as in the NAMD code. Classical Monte Carlo is sometimes used instead of classical MD; it is thus more concerned with the time-evolving process than with an ensemble sum (as in quantum MC). As a result, parallelization is not so trivial. There are many recent developments in parallel schemes for classical MC (besides the possible approach of parallel evaluation of the total energy, as in classical MD). Other PDE codes include Maxwell equation solvers (e.g., in photonics studies) and time-evolving differential equations for grain boundary and defect dynamics.

Once the materials science workload is reorganized using this taxonomy for its computational methods, the distribution of algorithms becomes clearer (Figure 5). The density functional theory (DFT) codes clearly dominate allocations with 72% of the overall workload. Of that subcategory, plane-wave DFT codes (which include VASP, PARATEC, Qbox, and PEtot) are the dominant component. The DFT codes (particularly plane-wave DFT) have very similar computational requirements, as previously discussed, with a common set of dominant components in their execution profile including:

• the 3D FFT, which requires global communication operations that dominate at high concurrencies and low atom counts;

• dense linear algebra for orthogonalization of wavefunctions (which dominates for large numbers of electronic states in large atomic systems with many electrons but has very localized communication);

• dense linear algebra for calculating the pseudopotential (which also dominates for large numbers of atoms).
Since VASP is not available for public distribution and Qbox is export-controlled, we chose PARATEC as the best available proxy for the requirements of the materials science workload. We also get to share effort with the NSF Track-1 and Track-2 procurement benchmarks, which have also adopted PARATEC as a proxy for the materials science component of their workloads.
Figure 5: Materials science codes categorized by algorithm.
Chemistry

Despite chemistry being one of the most mature simulation fields, the role of computation as a fundamental tool to aid in the understanding of chemical processes continues to grow steadily; one estimate places the number of electronic structure calculations published yearly at around 8,000. The dominant single code in the chemistry workload is S3D, due to its substantial INCITE allocation. However, if we set aside INCITE applications because they are not steady-state components of the workload, we see that the chemistry workload is dominated by VASP, which is also dominant in the materials science workload. As a result, we chose to focus on other components of the chemistry workload to emphasize a different set of criteria in the algorithm space. S3D bears more similarity to the computational requirements of codes in the astrophysics workload, where there are many CFD simulations. We decided to evaluate S3D as a component of that workload simply due to the similarities in the computational methods.
Table 2: The top 50% of the chemistry workload shows a diversity of codes.
Table 2 shows that the top 50% of the chemistry workload contains a diversity of codes. The leader, ZORI, is a QMC code and, as such, emphasizes per-processor performance but is not very demanding of the interconnect. The remaining codes are either DFT codes (very similar to materials science) or quantum chemistry codes. For quantum chemistry the most scalable codes are NWChem and GAMESS, and since these are very demanding of the interconnect they can be useful in differentiating machines based on support for efficient one-sided messaging. GAMESS was selected for NERSC benchmarking. It is also employed by the DOD-MOD TI-0x procurements, which allows us to collaborate with DOD and allows vendors to leverage their benchmarking efforts for multiple system bids.
Fusion

Fusion has long been the dominant component of the NERSC workload, accounting for more than 27% of the overall workload in 2007. The workload is currently dominated by five codes, which account for more than 50% of the overall fusion workload (Table 3). Those codes, OSIRIS, GEM, NIMROD, M3D, and GTC, fall into two basic categories. Three of them, OSIRIS, GTC, and GEM, are particle-in-cell (PIC) codes, whereas M3D and NIMROD are both continuum codes (PDE solvers on semi-structured grids). The percentage of per-processor peak achieved by particle-in-cell codes is generally low because the gather/scatter operations required to deposit the charge of each particle on the grid that will be used to calculate the field involve a large number of random accesses to memory, making the codes sensitive to memory access latency. GTC is certainly representative in this regard. However, unlike other PIC codes, GTC does not require a global 3D FFT, so performance and scalability issues associated with this aspect of PIC codes must be covered by benchmarks from other components of the NERSC workload.
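The charge-deposition (scatter) step described above can be sketched in 1-D. Each particle's charge is split between its two neighboring grid points with linear weights; because particle positions are effectively random, the grid updates hit memory in random order, which is the latency sensitivity the text describes. Sizes and weighting scheme below are illustrative:

```python
import random

# 1-D PIC charge deposition (scatter) with linear weighting on a periodic
# unit-length domain. Grid updates land at effectively random indices.

def deposit_charge(positions, charge, n_cells):
    grid = [0.0] * n_cells
    for x in positions:
        xi = x * n_cells                       # position in cell units
        i = int(xi)
        frac = xi - i                          # weights for the two neighbors
        grid[i % n_cells] += charge * (1.0 - frac)       # random-index write
        grid[(i + 1) % n_cells] += charge * frac         # and its neighbor
    return grid

rng = random.Random(42)
particles = [rng.random() for _ in range(10000)]
grid = deposit_charge(particles, charge=1.0, n_cells=64)
# Total charge is conserved regardless of deposition order
print(abs(sum(grid) - 10000.0) < 1e-6)
```

In a real 3-D PIC code each deposit touches eight (or more) grid points per particle, so the random-access pattern, not the arithmetic, dominates the cost.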
Table 3: The top 50% of the fusion time usage is dominated by 5 codes.
GTC now utilizes two parallel decompositions: a one-dimensional domain decomposition in the toroidal direction and a particle distribution within each domain. Processors assigned to the same domain have a copy of the local grid but only a fraction of the particles, and are linked with each other by a domain-specific communicator used when updating grid quantities. An Allreduce operation is required within each domain to sum up the contributions of each processor, which can lead to lower performance in certain cases. Although it is not clear that GTC adequately captures the parallel scaling characteristics of all other PIC codes, it emerges as a clear winning candidate for benchmarking by virtue of its excellent scalability, ease of porting, lack of intellectual property encumbrances, and general simplicity. On the other hand, we found the fusion continuum codes to be exceedingly complex as benchmarks; one example of this complexity is the required use of external libraries not guaranteed to be present on typical vendor in-house benchmarking machines. However, we also found that the semi-implicit solvers employed by the continuum codes, presumed to be the most time-consuming portion of a run, had resource requirements similar to those of astrophysics codes that model stellar processes (MHD codes used for modeling core-collapse supernovae and astrophysical jets). Therefore, we chose to focus on the PIC codes as proxies for the fusion workload requirements.
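The two GTC decompositions described above can be modeled without MPI: ranks are split into toroidal domains, each rank deposits only its own particles onto a private copy of the domain grid, and a per-domain "Allreduce" sums the partial grids. Plain lists stand in for communicators here; rank counts and the round-robin assignment are hypothetical:

```python
# Toy model of GTC's decompositions: a domain split plus a particle split,
# with a per-domain reduction over partial grids (the Allreduce in the text).

def split_by_domain(n_ranks, n_domains):
    # Domain-specific "communicators": lists of the rank ids in each domain
    return [[r for r in range(n_ranks) if r % n_domains == d]
            for d in range(n_domains)]

def domain_allreduce(partial_grids):
    # Sum the partial grid contributions of every rank in one domain
    n_cells = len(partial_grids[0])
    return [sum(g[i] for g in partial_grids) for i in range(n_cells)]

n_ranks, n_domains, n_cells = 8, 2, 4
comms = split_by_domain(n_ranks, n_domains)
# Each rank holds only a fraction of the particles; here each deposits 1.0/cell
partial = {r: [1.0] * n_cells for r in range(n_ranks)}
grids = [domain_allreduce([partial[r] for r in ranks]) for ranks in comms]
print(grids[0])
```

The reduction involves only the ranks sharing a domain, so its cost grows with the particle split width, which is the "lower performance in certain cases" the text mentions.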
Accelerator Modeling

Simulation is playing an increasingly prominent and valuable role in the theory, design, and development of particle accelerators, detectors, and X-ray free-electron lasers, which are among the largest, most complex scientific instruments in the world. The accelerator workload is dominated by PIC codes, but sparse matrix codes also have strong representation through Omega3P (see Table 4).
Table 4: Top codes in accelerator modeling. The PIC codes VORPAL and OSIRIS dominate the accelerator modeling workload. Even as we look further down the list to 70% of the workload, the allocations are dominated by PIC codes. The exception is Omega3P, which is bottlenecked by its eigensolver, which depends on a sparse-matrix direct solver.
Sparse-matrix direct solvers have well-known scalability issues that will ultimately force such codes to adopt new algorithmic approaches (likely sparse iterative solvers). There was concern that selection of current sparse matrix code examples would not be reflective of the state-of-the-art computational methods that will emerge for these problems in the 2009 timeframe. Therefore, we concentrated our attention on the PIC component of the workload, which is clearly dominant for this science area.

The PIC codes in accelerator modeling have unique load-balancing problems and potential inter-processor communication issues that distinguish them from the plasma-simulation PIC codes represented by GTC. Whereas plasma microturbulence codes are generally well load balanced due to the largely uniform distribution of the plasma throughout the computational domain, many accelerator problems depend on bunching up the particles because of the narrow focus of the accelerator beam. This can lead to serious issues for load balancing. Additionally, some of the accelerator codes, notably those comprising the IMPACT suite from LBNL, use an FFT algorithm to solve Poisson's equation, with its associated global, all-to-all communication.

IMPACT-T was chosen as the benchmark representing beam dynamics simulation. Part of a suite of codes used for this purpose, IMPACT-T is used specifically for photoinjectors and linear accelerators. It uses a 2-D domain decomposition of the 3-D Cartesian grid and is one of the few codes used in the photoinjector community that has a parallel implementation. It has an object-oriented Fortran 90 coding style that is critical for providing some of its key features. For example, it has a flexible data structure that allows particles to be stored in "containers" with common characteristics such as particle species or photoinjector slices; this allows binning in the space-charge calculation for beams with large energy spreads.
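The FFT-based Poisson solve mentioned above can be sketched in 1-D with a slow O(N^2) discrete Fourier transform so the example stays dependency-free: transform the charge density, divide each mode by -k^2, and transform back. This is a simplified stand-in for the parallel 3-D transforms (and their all-to-all communication) in the IMPACT suite, not its actual solver:

```python
import cmath
import math

# FFT-style spectral Poisson solve, u'' = f on a periodic domain, reduced to
# 1-D. A naive DFT replaces the FFT purely to keep the sketch self-contained.

def dft(x, sign=-1):
    n = len(x)
    return [sum(x[j] * cmath.exp(sign * 2j * math.pi * k * j / n)
                for j in range(n)) for k in range(n)]

def poisson_periodic(f, length=2 * math.pi):
    n = len(f)
    fhat = dft(f)
    uhat = [0j] * n
    for k in range(1, n):                  # the k = 0 (mean) mode stays zero
        kk = k if k <= n // 2 else k - n   # signed wavenumber index
        kphys = 2 * math.pi * kk / length
        uhat[k] = fhat[k] / (-kphys ** 2)  # divide each mode by -k**2
    u = dft(uhat, sign=+1)
    return [v.real / n for v in u]         # inverse-transform normalization

# Check against an exact solution: u = sin(x) satisfies u'' = -sin(x)
n = 32
xs = [2 * math.pi * i / n for i in range(n)]
u = poisson_periodic([-math.sin(x) for x in xs])
err = max(abs(u[i] - math.sin(xs[i])) for i in range(n))
print(err < 1e-9)
```

In the parallel 3-D version, each of the three transform directions requires data that is distributed across all processors, which is the global all-to-all communication the text attributes to these codes.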
Astrophysics

In the past, NERSC resources have been brought to bear on some of the most significant problems in computational astrophysics, including supernova explosions, thermonuclear supernovae, and analysis of cosmic microwave background data. Work done by various researchers at LBNL has shown that I/O benchmarking is most effectively done using I/O kernel benchmarks such as IOR and stripped-down applications such as MADCAP, rather than using full application codes. This is in part because of the flexibility IOR offers in measuring a range of file and block sizes, I/O methods, and file-sharing approaches. Since I/O benchmarking has essentially been factored out from the main workload, we concentrate on the other solvers in the astrophysics workload, which include a variety of radiation hydrodynamics, MHD, and other astrophysical fluids calculations. PDE solvers on block-structured grids are a dominant component of this workload. Although many of these codes currently employ an explicit time-stepping scheme, many of the scientists in the field indicate they are moving as quickly as possible toward implicit schemes (such as Newton-Krylov schemes) that have more demanding communication requirements but offer a much better time-to-solution.

In particular, many implicit schemes are limited by fast global reductions. Virtually all modern Krylov subspace algorithms (i.e., anything in the family of conjugate gradient, biconjugate gradient, biconjugate gradient-stabilized, generalized minimum residual, etc.) for sparse linear systems require inner products of vectors in order to function. In Fortran, for two vectors X and Y, this would look like:

    inner_prod = 0.0
    do i = 1, n
      inner_prod = inner_prod + X(i)*Y(i)
    enddo

When the vectors are distributed across a parallel architecture, the do-loop sum runs only over the elements of the vector that exist on each processor. The global summation is then done over the entire process topology, with a local sum on each processor followed by a global reduction operation to get the total sum. Thus it would look roughly like:

    local_sum = 0.0
    do i = 1, nlocal
      local_sum = local_sum + X(i)*Y(i)
    enddo
    call MPI_ALLREDUCE(local_sum, inner_prod, 1, MPI_DOUBLE_PRECISION, &
                       MPI_SUM, MPI_COMM_WORLD, ierr)
Virtually all Krylov subspace algorithms require these kinds of inner-product summations. Most classes of implicit algorithms in turn rely heavily on these Krylov subspace algorithms: sparse linear systems, sparse nonlinear systems, optimization, sensitivity analysis, etc. Therefore, we felt it necessary to select a benchmark that emphasizes the demands of implicit time-stepping schemes, so that we do not select a machine that is only good for explicit fluid dynamics algorithms while ignoring the many new applications that utilize implicit approaches that are just starting to become viable. Unfortunately, at this time very few codes have been developed that fully implement implicit time-stepping schemes. There were no clearly dominant codes that used implicit time stepping; such codes are still in development and are not very portable. We selected the MAESTRO code in part because of close interactions with the developers. MAESTRO is a low Mach number astrophysics code used to simulate the long-time evolution leading up to a Type Ia supernova explosion. The NERSC-6 MAESTRO simulation examines convection in a white dwarf and includes both explicit and implicit solvers. MAESTRO is capable of decomposing a uniform grid of constant resolution into a specified number of non-overlapping grid patches, but in the NERSC-6 benchmark it does not refine the grid.
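The role of the inner products discussed above is easiest to see in a complete conjugate gradient iteration: every pass needs two dot products, and each one becomes a global Allreduce when the vectors are distributed. The sketch below uses a tiny dense 2x2 system purely for brevity; it is a generic CG, not MAESTRO's solver:

```python
# Minimal conjugate gradient for a symmetric positive-definite system,
# annotated where the global reductions sit in a distributed-memory run.

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))   # the Allreduce point in parallel

def matvec(a, x):
    return [dot(row, x) for row in a]

def conjugate_gradient(a, b, tol=1e-12, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                        # residual for the initial guess x = 0
    p = r[:]
    rs = dot(r, r)                  # global reduction no. 1 per iteration
    for _ in range(max_iter):
        ap = matvec(a, p)
        alpha = rs / dot(p, ap)     # global reduction no. 2 per iteration
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Small SPD system: 4x + y = 1, x + 3y = 2; the exact answer is (1/11, 7/11)
x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
print(abs(x[0] - 1 / 11) < 1e-6 and abs(x[1] - 7 / 11) < 1e-6)
```

Because the two reductions cannot be overlapped with the surrounding vector updates in the basic algorithm, every CG iteration is gated by global-latency operations, which is exactly why implicit schemes stress the interconnect in the way the text describes.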
AMR

Although adaptive mesh refinement (AMR) is a powerful technique with the potential to reduce the resources necessary to solve otherwise intractable problems in a variety of computational science fields, it brings with it a variety of additional performance challenges that are worthy of testing in a workload-driven evaluation environment. In particular, AMR methods have the potential to add non-uniform memory access and irregular inter-processor communication to what might otherwise have been relatively regular and uniform access patterns. We represent AMR in the NERSC-6 benchmark suite with a stripped-down application that exercises an elliptic solver in the Chombo AMR framework developed at LBNL. Chombo provides a distributed infrastructure for implementing finite difference methods for PDEs on block-structured, adaptively refined rectangular grids. The benchmark problem uses a solver with a fixed number of iterations and weak-scales the problem from 256 to 4096 cores. For more information see http://www.nersc.gov/projects/SDSA/software/?benchmark=AMR.
Nuclear Physics

It is no surprise that MILC and other variants of lattice QCD codes dominate the workload. MILC was also adopted by the NSF procurements as a benchmark, so we are able to leverage the investment in packaging this benchmark for the procurement.
Biology and Life Sciences

Codes in this group seem to fall into three basic categories. The first consists of codes for sequence searching, comparison, identification, and/or prediction. Basically informatics-type codes, these differ from typical HPC workloads and are not represented to any extent in NERSC benchmark suites. These codes had a moderate allocation in 2007, but we concluded that explicitly representing them in the benchmark suite was unnecessary because the codes are basically constrained by clock speed (integer processing rates, actually) and thus do not present a sufficiently challenging or unique problem to the architecture.

The second category consists of a variety of molecular dynamics codes, the most important of which are AMBER, NAMD, and CHARMM. The capability these codes offer is merged with requirements for protein folding analysis to create what is referred to as molecular dynameomics, one of the most challenging areas of computational science in general. The characteristics of these codes are, in a macro sense, similar to, if not the same as, the molecular dynamics codes used in materials science; however, individual life-science-related molecular dynamics simulations are likely to be less scalable and more inter-processor-communication limited than those of materials science, since inter-atomic potentials are likely to be more subtle and longer range.

The third category consists of codes that also are, or look like, materials science or chemistry codes, either for quantum Monte Carlo, classical quantum chemistry, or DFT methods. The computational characteristics of these have already been discussed and likely do not differ for life science applications. The total life science NERSC allocation in recent years has remained small, and so no code has been chosen to uniquely represent this area.
Benchmark Selection

Benchmark selection is an extraordinarily difficult process. As per Ingrid Bucher's observation, "benchmarks are only useful insofar as they model the workload that will run on a given system." To that end, we use information from our workload analysis to aid in the selection of benchmarks that are a proxy for the anticipated NERSC system requirements. Ideally, the benchmarks must cover the requirements of the target HPC workload, but there are a number of practical considerations that must be taken into account during the selection process.

Coverage: We assess how well the codes cover the science areas that dominate the NERSC workload as well as how well they cover the algorithm space. This creates a two-dimensional matrix of computational methods and science areas that must be covered.

Portability: The selected codes must be able to run on a wide variety of systems to allow valid comparisons across architectures. The implementation of the codes must not be too architecture-specific, and the "build" systems for these codes must be robust so that they can be used across many system architectures.

Scalability: It is important to emphasize applications that justify scalable HPC resources. Over time, codes that are unable to scale with the resources tend to have a diminishing role in the HPC workload. In selecting scalable codes, we are not selecting those that scale easily; rather, we select those that present some of the most demanding requirements but are otherwise engineered to expose sufficient parallelism to run on highly parallel systems.

Open Source: Since the benchmarks must be distributed without restriction to vendors for benchmarking purposes, no proprietary or export-controlled code can be used. Only open-source applications that allow us to share snapshots with vendors can be considered.

Availability: Packaging appropriate benchmark problems requires close collaboration with the developers for assistance and support. Without the assistance of the user community, we cannot select appropriate problems for evaluating systems, and hence will be unable to gauge the value of a given system.
Table 5: Matrix of codes for the first cut of the selection process. The codes were ranked based on their scores as candidate benchmarks. Codes that could not be widely distributed or were difficult to port were gradually eliminated until the final 6 benchmarks remained.

With these considerations in mind, the benchmark team selected problems based on their ability to fulfill these attributes and ranked them based on their scores in each of these areas as candidate benchmark codes. After gaining experience with the codes by building them, or through detailed discussions with the developers, we gradually reduced the list down to six target benchmark codes. The final problem sizes and benchmark configurations were determined after the final codes were selected.
These benchmarks are used individually to understand and compare system performance. Although our analysis of the individual application performance has found specific bottlenecks on each system, distilling these down to a set of specific hardware features is unlikely to capture the complex interactions between full applications and correspondingly complex system architectures. Therefore, the performance of full applications, with their associated complex behavior, is used to assess the value of systems. We feel that using a carefully selected set of full applications as a proxy for full workload performance is better able to encode the DOE Office of Science workload requirements for the purpose of comparing HPC systems.
Table 6: Final benchmark selection matrix together with problem sizes and concurrencies.
Benchmark | Science Area              | Algorithm Space                                            | Base Case Concurrency       | Problem Description                           | Lang       | Libraries
CAM       | Climate (BER)             | Navier-Stokes CFD                                          | 56, 240; strong scaling     | D grid (~0.5 deg resolution); 240 timesteps   | F90        | netCDF
GAMESS    | Quantum Chem (BES)        | Dense linear algebra, DFT                                  | 256, 1024 (same as TI-09)   | rms gradient, MP2 gradient                    | F77        | DDI, BLAS
GTC       | Fusion (FES)              | PIC, finite difference                                     | 512, 2048; weak scaling     | 100 particles per cell                        | F90        | --
IMPACT-T  | Accelerator Physics (HEP) | Largely PIC, FFT component                                 | 256, 1024; strong scaling   | 50 particles per cell (currently)             | F90        | --
MAESTRO   | Astrophysics (HEP)        | Low Mach number hydro; block-structured-grid multiphysics  | 512, 2048; weak scaling     | 64x64x128 grid pts per proc; 10 timesteps     | F90        | Boxlib
MILC      | Lattice Gauge Physics (NP)| Conjugate gradient, sparse matrix; FFT                     | 256, 1024, 8192; weak scaling | 8x8x8x9 local grid, ~70,000 iters           | C, assem.  | --
PARATEC   | Material Science (BES)    | DFT; FFT, BLAS3                                            | 256, 1024; strong scaling   | 686 atoms, 1372 bands, 20 iters               | F90        | ScaLAPACK
Science Area        | Dense Linear Algebra | Sparse Linear Algebra | Spectral Methods (FFT) | Particle Methods | Structured Grids | Unstructured or AMR Grids
Accelerator Science |                      | X                     | X IMPACT-T             | X IMPACT-T       | X IMPACT-T       |
Astrophysics        | X                    | X MAESTRO             | X                      |                  | X MAESTRO        | X MAESTRO
Chemistry           | X GAMESS             |                       |                        | X                |                  |
Climate             |                      |                       | X CAM                  |                  | X CAM            |
Combustion          |                      |                       |                        |                  | X MAESTRO        | X AMR Elliptic
Fusion              |                      |                       |                        | X GTC            | X GTC            |
Lattice Gauge       | X MILC               | X MILC                | X MILC                 |                  | X MILC           |
Material Science    | X PARATEC            |                       | X PARATEC              |                  | X PARATEC        |

Table 7: Algorithm coverage by the selected SSP benchmarks. An X marks an algorithm class present in that science area's workload; a code name indicates the selected benchmark that exercises it.
Evolution from NERSC-5 Application Benchmarks

We point out that the selected application benchmarks have changed from those used for the NERSC-5 procurement. The following considerations led to these changes:
WorkloadEvolution:TheworkloadhaschangedmodestlysincetheNERSC‐5procurement.Twoapplicationsfromthelastbenchmarksuite(NAMDandMadBENCH)wereremovedbecauseoftheirreducedcontributiontotheworkloadandwerereplacedbyMAESTROandIMPACT‐Trespectively.Multicore:PowerhasendedDennardscaling,sofutureperformanceimprovementsofHPCsystemshingeondoublingthenumberofprocessorcoresevery18monthsratherthanthehistoricalclock‐frequencyscaling.Weprojectthenext‐generationNERSCsystemwillbe3–5xmorecapablethanNERSC‐5.Consequently,theconcurrencyofthebenchmarkshasincreasedby4xovertheNERSC‐5suite.StrongScaling:Overthepast15years,parallelcomputinghasthrivedonweak‐scalingofproblems.However,withthestallinCPUclockfrequencies,thereisincreasedpressuretoimprovestrong‐scalingperformanceofalgorithms.Thisisbecausefortime‐domainmethods,increasesinproblemresolutionmustoftenbeaccompaniedbycorrespondingdecreasesintime‐
21
stepsizetosatisfythecourantstabilitycondition.Whereasthetime‐stepdecreasescouldbeaccommodatedbyperformanceimprovementsofindividualcores,nextgenerationalgorithmswillneedtoimprovetheirstepratesusingincreasedparallelism(strongscaling).Consequently,ourprobleminputdecksfortheNERSC‐6applicationbenchmarksnowemphasizestrong‐scalingrequirements.ImplicitTimestepping:ManyPDEsolversonblock‐structuredgridsemployexplicittimesteppingmethods,whichperformwellwhenweak‐scaledonmassivelyconcurrentsystems.However,itisverydifficulttoimprovestrong‐scalingperformanceinlightofthestallinCPUclockfrequencies.Thiswillprovideincreasedpressuretomovetowardsimplicittimesteppingschemes,whicharefarmorechallengingtoscaletohighconcurrencies.Evidenceofthistrendhasbeendocumentedinthe2005GreenbookandsubsequentreportsfromtheFusionSimulationProject(FSP)andotherDOEreports.WehavecodedtherequirementintoourbenchmarksuiteusingMAESTRO,whichimplementsbothimplicitandexplicittimesteppingschemes.AMR/MultiscalePhysics:Althoughitisnotamajorcomponentofourworkloadyet,thepresenceofscalableAMRcodeshasincreasedsubstantiallyinourworkloads.Giventheincreasedemphasisonmulti‐physicsmultiscaleproblemsinSciDACandincreasedneedforstrong‐scaling,weexpectthesemethodstobecomeincreasinglyimportantcomponentoffutureworkloads.SupportforLightweight/OneSidedMessaging:Assystemsscaletoevenlargerconcurrencies,theperformanceoftheinterconnectforverysmallmessagesbecomesincreasinglyimportant.Already,ourcurrentworkloadincludesmanychemistryapplicationsthatdependonGlobalArrays(GA)andotherlibrariesthatemulateaglobalsharedaddressspace.Thesecodebenefitgreatlyfromefficientsupportoflight‐weightmessaging,includingone‐sidedmessageconstructs.TheserequirementsarerepresentedbythechoiceofGAMESSfortheapplicationsuite.
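The arithmetic behind the strong-scaling pressure can be sketched as follows (a minimal illustration of the Courant condition; the unit wave speed, Courant number of 0.5, and factor-of-two refinement are illustrative assumptions, not parameters from this report):

```python
# Why strong scaling follows from the Courant (CFL) condition for an
# explicit time-domain solver: the stable step satisfies dt <= C*dx/v,
# so refining the grid forces a proportionally smaller time step.
def cfl_timestep(dx, wave_speed=1.0, courant_number=0.5):
    """Largest stable explicit time step for grid spacing dx."""
    return courant_number * dx / wave_speed

def relative_cost_3d(refinement):
    """Work increase for a 3-D time-domain solver refined by `refinement`
    in each dimension: cells grow as r**3 and, because dt shrinks
    linearly with dx, the number of steps grows as r, giving r**4."""
    cells = refinement ** 3
    steps = refinement
    return cells * steps

print(cfl_timestep(1.0))       # 0.5
print(cfl_timestep(0.5))       # 0.25: halving dx halves the stable dt
print(relative_cost_3d(2))     # 16x total work for 2x resolution
```

Absent faster cores, the extra factor of `refinement` in step count can only be recovered by stepping faster in parallel, which is precisely the strong-scaling requirement described above.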
NERSC Composite Benchmarks (Sustained System Performance)

The highest-concurrency runs for these application benchmarks are also used together in the Sustained System Performance (SSP) composite benchmark, which is discussed in a companion document describing both the SSP and ESP tests. The geometric mean of the performance of these benchmarks is used to estimate the likely delivered performance of the evaluated systems for the target NERSC workload. This approach thus fulfills the observation by Ingrid Bucher that "benchmarks are only useful insofar as they model the intended workload." Because these benchmarks encode our workload requirements, the SSP provides a good metric for estimating sustained performance on the anticipated workload.
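As a sketch of how such a composite is formed (the per-code weighting and normalization of the actual SSP are defined in the companion document, not here; the rates below are measured values quoted later in this report, mixed across concurrencies purely for illustration):

```python
import math

def geometric_mean(values):
    """Geometric mean, computed in log space for numerical robustness."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Illustrative per-benchmark sustained rates (GFLOP/s) on one system.
rates = {"CAM": 130.6, "GAMESS": 214.0, "GTC": 565.0,
         "IMPACT-T": 130.0, "MAESTRO": 245.0, "MILC": 291.0}
ssp_estimate = geometric_mean(rates.values())
print(round(ssp_estimate, 1))
```

A geometric mean is the natural choice here: it rewards balanced performance across all codes, since a system cannot hide a very poor rate on one benchmark behind an excellent rate on another.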
Analysis of NERSC-6 Benchmark Codes

This section provides more detailed descriptions of the NERSC-6 application benchmark codes and their problem sets, along with some observed characteristics of the codes and some preliminary performance data from various systems. The benchmarks serve three purposes in the procurement project. First, each NERSC-6 benchmark and its respective test problem has been carefully chosen and developed to represent a particular subset and/or specific characteristic of the expected DOE Office of Science workload on the NERSC-6 system. As it has been stated, "For better or for worse, benchmarks shape a field,"1 and thus a key component of the NERSC workload-driven evaluation strategy is making available real application codes that capture the performance-critical aspects of the workload. Second, the benchmarks give vendors the opportunity to provide NERSC with concrete data (in an RFP response) on the performance and scalability of the proposed systems on programmatically important applications. We expect that vendor proposal responses will include the results of running NERSC-6 benchmarks on existing hardware as well as projections to future, currently unavailable systems. In this role, the benchmarks play an essential part in the proposal evaluation process. The NERSC staff are well qualified to evaluate any projections supplied by vendors, having carried out extensive technology assessments of prospective vendor systems for the timeframe of the NERSC-6 project. Third, the benchmarks will be used as an integral part of the system acceptance test and as part of a continuous measurement of hardware and software performance throughout the operational lifetime of the NERSC-6 system. The NERSC evaluation strategy uses benchmark programs of varying complexity, ranging from kernels and microbenchmarks through stripped-down applications and full applications. Each has its place: we use kernels and microbenchmarks to gain understanding, but applications to evaluate effectiveness and performance. The remainder of this document concentrates on the NERSC-6 full application codes.
For each of the codes a few key characteristics are presented, most of which have been obtained using the Integrated Performance Monitoring (IPM) tool, an application profiling layer that uses a unique hashing approach to allow a fixed memory footprint and minimal CPU usage. Key characteristics measured include the type and frequency of application-issued MPI calls, the buffer sizes used for point-to-point and collective communications, the communication topology of each application, and in some cases a time profile of the interprocessor communications. Another key parameter, of a completely different type, is the computational intensity: the ratio of the number of floating-point operations executed to the number of memory operations.
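For a concrete sense of this metric, consider the DAXPY kernel y = a*x + y (a standard textbook example, not one of the IPM measurements reported here): per element it performs two floating-point operations against three memory operations.

```python
def computational_intensity(flops, memory_ops):
    """Ratio of floating-point operations to memory operations."""
    return flops / memory_ops

# DAXPY, y[i] = a*x[i] + y[i]: per element, 2 flops (one multiply,
# one add) and 3 memory operations (load x[i], load y[i], store y[i]).
n = 1_000_000
ci = computational_intensity(flops=2 * n, memory_ops=3 * n)
print(ci)  # 2/3: fewer flops than memory operations
```

Codes with low computational intensity such as this are limited by memory bandwidth and latency rather than by peak arithmetic rate, which is why the metric recurs throughout the per-code discussions below.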
1 Patterson, D., CS252 Lecture Notes, University of California Berkeley, Spring, 1998.
CAM: CCSM Community Atmosphere Model

The Community Atmosphere Model (CAM) is the atmospheric component of the Community Climate System Model (CCSM) developed at NCAR and elsewhere for the weather and climate research communities. Although generally used in production as part of a coupled system in CCSM, CAM can be run as a standalone (uncoupled) model, as it is at NERSC. The NERSC-6 benchmark runs CAM version 3.1 at D resolution (about 0.5 degree) using a finite volume (FV) dynamical core. Two problems are run using strong scaling at 56 and 240 cores. Atmospheric models consist of two principal components, the dynamics and the physics. The dynamics, for which the solver is referred to as the "dynamical core," are the large-scale part of a model: the atmospheric equations of motion affecting wind, pressure, and temperature that are resolved on the underlying grid. The physics are characterized by subgrid-scale processes such as radiation, moisture convection, friction, and boundary layer interactions that are taken into consideration implicitly (via parameterizations). In CAM3 as it is run at NERSC, the dynamics are solved using an explicit time integration and a finite-volume discretization that is local and entirely in physical space. Hydrostatic equilibrium is assumed and a Lagrangian vertical coordinate is used, which together effectively reduce the dimensionality from three to two.
Relationship to NERSC Workload

CAM is used for both short-term weather prediction and as part of a large climate modeling effort to accurately detect and attribute climate change, predict future climate, and engineer mitigation strategies. An example of a "breakthrough" computation involving CAM might be its use in a fully coupled ocean/land/atmosphere/ice climate simulation with 0.125-degree (approximately 12 kilometer) resolution involving an ensemble of eight to ten independent runs. Doubling the horizontal resolution in CAM increases the computational cost eightfold. The computational cost of CAM in the CCSM, holding resolution constant, has increased 4x since 1996. More computational complexity is coming in the form of, for example, super-parameterizations of moist processes, and ultimately the elimination of parameterization and the use of grid-based methods.
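The eightfold figure can be recovered with a small dimensional argument (a sketch assuming vertical resolution is held fixed, which is consistent with the hydrostatic model described above):

```python
def cam_cost_factor(horizontal_refinement):
    """Cost growth when CAM's horizontal resolution is refined by a
    factor r: grid columns grow as r**2 (two horizontal dimensions),
    and the stable time step shrinks as 1/r, so the number of time
    steps grows as r, giving r**3 overall."""
    r = horizontal_refinement
    return r ** 2 * r

print(cam_cost_factor(2))  # 8: doubling resolution costs eightfold
print(cam_cost_factor(4))  # 64: e.g. 0.5-degree -> 0.125-degree
```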
Parallelization

The solution procedure for the FV dynamical core involves two parts: (1) dynamics and tracer transport within each control volume and (2) remapping of prognostic data from the Lagrangian frame back to the original vertical coordinate. The remapping occurs on a time scale several times longer than that of the transport. The version of CAM used by NERSC uses a formalism effectively containing two different two-dimensional domain decompositions. In the first, the dynamics are decomposed over latitude and vertical level. Because this decomposition is inappropriate for the remap phase, and for the vertical integration step for calculation of pressure, for which there exist vertical dependences, a longitude-latitude decomposition is used for these phases of the computation. Optimized transposes move data from the program structures required by one phase to the other. One key to this yz-to-xy transpose scheme is having the number of processors in the latitudinal direction be the same for both of the two-dimensional decompositions. Figure 6 presents the topological connectivity of communication for fv-CAM; each point in the graph indicates processors exchanging messages, and the color indicates the volume of data exchanged. Figures 7 and 8 give an indication of the extent to which different communication primitives are used in the code and their respective times on one system. Figures 9 and 10 give an indication of the sizes of the MPI messages in an actual run on the NERSC Cray XT4 system, Franklin.
Figure 6. Communication topology and intensity for fvCAM.

Figure 7. MPI call counts for fvCAM.

Figure 8. Percent of time spent in various MPI routines for fvCAM on 240 processors on the NERSC Bassi system.
Figure 9. MPI buffer sizes for point-to-point messages in fvCAM.

Figure 10. MPI buffer sizes for collective messages in fvCAM.
Importance for NERSC-6

For NERSC-6 benchmarking, CAM offers the following characteristics: complexity (a relatively flat performance profile inhibits simple optimization); relatively low computational intensity stresses on-node/processor data movement; relatively long MPI messages stress interconnect point-to-point bandwidth; strong scaling and the decomposition method limit scalability; moderately sensitive to TLB performance and VM page size.
Performance

Table 8 presents measured performance for CAM on several systems. The "GFLOPs" column represents the aggregate computing rate, based on the reference floating-point operation count from Franklin and the time on the specific machine. The efficiency column ("Effic.") represents the percentage of peak performance for the number of cores used to run the benchmark.
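As a sketch of the arithmetic behind the efficiency column (the 5.2 GFLOP/s per-core peak used below is our assumed figure for Franklin's 2.6 GHz dual-core Opterons at 2 flops/cycle, not a value quoted in this report):

```python
def efficiency(gflops, cores, peak_per_core_gflops):
    """Percent of theoretical peak achieved by an aggregate rate."""
    return 100.0 * gflops / (cores * peak_per_core_gflops)

# CAM on 56 Franklin cores at 33.5 aggregate GFLOP/s, with an assumed
# per-core peak of 5.2 GFLOP/s (2.6 GHz x 2 flops/cycle).
print(round(efficiency(33.5, 56, 5.2)))  # ~12, matching Table 8
```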
         IBM Power5       IBM BG/P         QC Opteron       DC Opteron
         Bassi                             Cray XT4         Cray XT4
                                           Jaguar           Franklin
Cores    GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.
56       30       7%      6        3%      32       7%      33.5     12%
240      100      6%      49       3%      143      7%      130.6    11%

Table 8. Measured Aggregate Performance and Percent of Peak for CAM.
GAMESS: General Atomic and Molecular Electronic Structure System

The GAMESS (General Atomic and Molecular Electronic Structure System) code from the Gordon research group at the Department of Energy Ames Lab at Iowa State University is one of the most important tools available for various ab initio quantum chemistry simulations. Several kinds of SCF wavefunctions can be computed with configuration interaction, second-order perturbation theory, and coupled-cluster approaches, as well as the density functional theory approximation. A variety of molecular properties, ranging from simple dipole moments to frequency-dependent hyperpolarizabilities, may be computed. Many basis sets are stored internally, together with effective core potentials, so all elements up to radon may be included in molecules. The two benchmarks used here test DFT energy and gradient, RHF energy, MP2 energy, and MP2 gradient.
Relationship to NERSC Workload

GAMESS is available and is used on all NERSC systems. It is most commonly used by allocations from the BES area. As a benchmark, GAMESS represents a variety of conventional quantum chemistry codes that use Gaussian basis sets to solve the Hartree-Fock, MP2, and coupled-cluster equations. These codes also now include the capability to perform density functional computations. Codes such as GAMESS can be used to study relatively large molecules, in which case large processor configurations can be used, but frequently they are also used to calculate many atomic configurations (e.g., to map out the potential manifold in a chemical reaction) for small molecules.
Parallelization

GAMESS uses an SPMD approach but includes its own underlying communication library, called the Distributed Data Interface (DDI), to present the abstraction of a global shared memory with one-sided data transfers, even on systems with physically distributed memory. In part, the DDI philosophy is to add more processors not just for their compute performance but also to aggregate more total memory. In fact, on systems where the hardware does not contain support for efficient one-sided messages, an MPI implementation of DDI is used in which only one-half of the processors allocated compute while the other half are essentially data movers.
Benchmark Problems

Rather than using GAMESS for scaling studies, two separate runs are done testing different parts of the program. The NERSC "medium" case, run on 64 cores for NERSC-5 and expected to run on 256 cores for NERSC-6, is a B3LYP(5)/6-311G(d,p) calculation with an RHF SCF gradient. The "large" case, 384 cores for NERSC-5 and expected to run on 1024 cores for NERSC-6, is a 6-311++G(d,p) MP2 computation on a five-atom system with T symmetry.
Importance for NERSC-6

For NERSC-6 benchmarking, GAMESS offers the following characteristics: complexity (a relatively flat performance profile inhibits simple optimization, but the built-in communication layer affords an opportunity for key system-dependent optimization); considerable stride-1 memory access; the built-in communication layer yields performance characteristics not visible from MPI (typical of Global Array codes, for example).
Performance

Since the problem sets for NERSC-6 GAMESS have not been finalized, the data presented in Table 9 are for the NERSC-5 problems. Performance is not expected to differ significantly.
         IBM Power5       IBM BG/P         QC Opteron       DC Opteron
         Bassi                             Cray XT4         Cray XT4
                                           Jaguar           Franklin
Cores    GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.
64       26       5%      --       --      18       3%      23       7%
384      121      4%      39       2%      177      6%      214      11%

Table 9. Measured Aggregate Performance and Percent of Peak for GAMESS.
GTC: 3D Gyrokinetic Toroidal Code

GTC is a three-dimensional code used to study microturbulence in magnetically confined toroidal fusion plasmas via the particle-in-cell (PIC) method. It solves the gyro-averaged Vlasov equation in real space using global gyrokinetic techniques and an electrostatic approximation. The Vlasov equation describes the evolution of a system of particles under the effects of self-consistent electromagnetic fields. The unknown is the flux, f(t,x,v), which is a function of time t, position x, and velocity v, and represents the distribution function of particles (electrons, ions, etc.) in phase space. This model assumes a collisionless plasma; i.e., the particles interact with one another only through a self-consistent field, and thus the algorithm scales as N instead of N², where N is the number of particles.

A "classical" PIC algorithm uses the notion of discrete "particles" to sample a charge distribution function. The particles interact via a grid, on which the potential is calculated from charges "deposited" at the grid points. Four steps are carried out iteratively:
• "SCATTER," or deposit, charges on the grid. A particle typically contributes to many grid points, and many particles contribute to any one grid point.
• Solve the Poisson equation.
• "GATHER" forces on each particle from the resultant potential function.
• "PUSH" (move) the particles to new locations on the grid.

In GTC, point-charge particles are replaced by charged rings using a four-point gyro-averaging scheme, as shown in Figure 11.
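The scatter/gather/push cycle above can be sketched in one dimension (a minimal illustration of the classical PIC pattern with linear weighting on a periodic grid, not GTC's four-point gyro-averaging; positions and charges are invented):

```python
def scatter(positions, charge, ngrid):
    """Deposit charge on a periodic 1-D grid with linear weighting:
    each particle contributes to its two nearest grid points via the
    indirect (data-dependent) addressing characteristic of PIC."""
    rho = [0.0] * ngrid
    for x in positions:
        left = int(x) % ngrid
        frac = x - int(x)
        rho[left] += charge * (1.0 - frac)
        rho[(left + 1) % ngrid] += charge * frac
    return rho

def gather(field, positions):
    """Interpolate a grid field back to each particle's position."""
    ngrid = len(field)
    out = []
    for x in positions:
        left = int(x) % ngrid
        frac = x - int(x)
        out.append(field[left] * (1.0 - frac)
                   + field[(left + 1) % ngrid] * frac)
    return out

def push(positions, velocities, dt, ngrid):
    """Move particles and apply periodic wrap-around."""
    return [(x + v * dt) % ngrid for x, v in zip(positions, velocities)]

rho = scatter([1.5, 3.0], charge=1.0, ngrid=8)
print(sum(rho))  # 2.0: deposited charge is conserved
```

The scattered writes to `rho` through computed indices are the source of the random memory access pattern that the "Importance for NERSC-6" discussion below attributes to charge deposition.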
Relation to NERSC Workload

GTC is used for fusion energy research via the DOE SciDAC program and for the international fusion collaboration. Support for it comes from the DOE Office of Fusion Energy Science. It is used for studying neoclassical and turbulent transport in tokamaks and stellarators as well as for investigating hot-particle physics, toroidal Alfven modes, and neoclassical tearing modes.
Figure 11. Representation of the charge accumulation to the grid in GTC.
Parallelization

In older versions of GTC, the primary parallel programming model was a 1-D domain decomposition, with each MPI process assigned a toroidal section and MPI_SENDRECV used to shift particles moving between domains. Newer versions of GTC, including the current NERSC-6 benchmark, use a fixed 1-D domain decomposition with 64 domains plus a particle-based decomposition. At greater than 64 processors, each domain in the 1-D domain decomposition has more than one processor associated with it, and each processor holds a fraction of the total number of particles in that domain. This requires an all_reduce operation to collect all of the charges from all particles in a given domain. The communication topology of the NERSC-6 version of GTC, using the combined toroidal-domain/particle decomposition method, is shown in Figure 12. Figures 13, 14, and 15 show other communication characteristics.
Figure 12. MPI message topology for the NERSC-6 version of GTC.

Figure 13. MPI call counts for the NERSC-6 version of GTC.

Figure 14. MPI message buffer size distribution based on calls for the NERSC-6 version of GTC.

Figure 15. Time spent in MPI functions in the NERSC-6 version of GTC on Franklin.
Importance for NERSC-6

For NERSC-6 benchmarking, GTC offers the following characteristics: charge deposition in PIC codes uses indirect addressing and therefore stresses random access to memory; the particle decomposition stresses collective communications with small messages; point-to-point communications are strongly bandwidth bound; runs adequately on relatively simple-topology communication networks; important loop constructs with a large number of arrays lead to highly sensitive TLB behavior.
Performance

Table 10, which lists some measured performance data, suggests that GTC is capable of achieving relatively high percentages of peak performance on some systems, primarily due to a relatively high computational intensity and a low percentage of communication contribution. It also appears that Opteron processors have a performance advantage on this code.
         IBM Power5       IBM BG/P         QC Opteron       DC Opteron
         Bassi                             Cray XT4         Cray XT4
                                           Jaguar           Franklin
Cores    GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.
512      428      11%     98       6%      511      12%     565      21%
2048     n/a      n/a     391      6%      2023     12%     2438     23%

Table 10. Measured Aggregate Performance and Percent of Peak for GTC.
IMPACT-T: Parallel 3-D Beam Dynamics

IMPACT-T (Integrated Map and Particle Accelerator Tracking, Time) is one member of a suite of computational tools for the prediction and performance enhancement of accelerators. These tools are intended to predict the dynamics of beams in the presence of optical elements and space-charge forces, the calculation of electromagnetic modes and wakefields of cavities, the cooling induced by co-moving beams, and the acceleration of beams by intense fields in plasmas generated by beams or lasers. The IMPACT-T code uses a parallel, relativistic PIC method for simulating 3-D beam dynamics using time as the independent variable. (The design of RF accelerators is normally performed using position, either the arc length or z-direction, rather than time, as the independent variable.) IMPACT-T includes the arbitrary overlap of fields from a comprehensive set of beamline elements, can model both standing- and traveling-wave structures, and can include a wide variety of external magnetic focusing elements such as solenoids, dipoles, quadrupoles, etc. IMPACT-T is unique in the accelerator modeling field in several ways. It uses space-charge solvers based on an integrated Green function to efficiently and accurately treat beams with large aspect ratios, and a shifted Green function to efficiently treat image-charge effects of a cathode. It uses energy binning in the space-charge calculation to model beams with large energy spread. It also has a flexible data structure (implemented in object-oriented Fortran) that allows particles to be stored in containers with common characteristics; for photoinjector simulations the containers represent multiple slices, but in other applications they could correspond, e.g., to particles of different species. While the core PIC method used in IMPACT-T is essentially the same as that used for plasma simulation (see GTC, above), many details of the implementation differ; in particular, the spectral/Green's function potential solver (vs. the finite difference method used in GTC) results in a significant global inter-processor communication pattern.
Relationship to NERSC Workload

Past applications of the IMPACT code suite include SNS linac modeling, J-PARC linac commissioning, the RIA driver linac design, the CERN superconducting linac design, the LEDA halo experiment, the BNL photoinjector, and the FNAL NICADD photoinjector. Recent code enhancements have expanded IMPACT's applicability, and it is now being used (along with other codes) on the LCLS project and the FERMI@Elettra project. A recent DOE INCITE grant allowed IMPACT to be used in the design of free electron lasers.
Parallelization

A two-dimensional domain decomposition in the y-z directions is used, along with a dynamic load balancing scheme based on domain. The implementation uses MPI-1 message passing. For some problems, most of the communication due to particles crossing processor boundaries is local; however, in the simulation of colliding beams, particles can move a long distance, and more than one pair of exchanges might be required for a single particle update. IMPACT-T does not use the particle-field parallel decomposition used by other PIC codes. Hockney's FFT algorithm is used to solve Poisson's equation with open boundary conditions. In this algorithm, the number of grid points is doubled in each dimension, with the charge density on the original grid kept the same and the charge density elsewhere set to zero. Some communication characteristics for IMPACT-T are presented in Tables 11 and 12 and Figures 16-18.

MPI Event       Buffer Size (Bytes)   Cumulative Percent of Total Wall Clock Time
MPI_Alltoallv   131K-164K             15%
MPI_Send        8K                    3%
MPI_Allreduce   4-8                   4%

Table 11. Important MPI Message Buffer Sizes for IMPACT-T.

Routine Name                             Percent of Total Time   Purpose
beambunchclass_kick2t_beambunch          32%                     Interpolate the space-charge fields E and B in the lab frame to individual particles + external fields.
beamlineelemclass_getfldt_beamlineelem   29%                     Get external field.
ptclmgerclass_ptsmv2_ptclmger            10%                     Move particles from one processor to four neighboring processors.

Table 12. Profile of Top Subroutines in IMPACT-T on the Cray XT4.
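The grid doubling in Hockney's algorithm can be sketched in one dimension (a toy free-space convolution with an invented kernel, not IMPACT-T's actual solver): padding the charge density onto a doubled grid lets a periodic FFT convolution reproduce the open-boundary result exactly, since wrap-around contributions land only in the discarded half.

```python
import numpy as np

def open_boundary_convolution(rho, green):
    """Compute phi[i] = sum_j green[|i-j|] * rho[j] on an open domain
    of n points using FFTs on a doubled (2n) periodic grid."""
    n = len(rho)
    m = 2 * n                      # Hockney: double the grid
    rho_pad = np.zeros(m)
    rho_pad[:n] = rho              # original charge kept, zero elsewhere
    # Mirror the Green's function onto the doubled periodic grid so
    # that g_pad[(i - j) mod m] = green[|i - j|] for 0 <= i, j < n.
    g_pad = np.zeros(m)
    g_pad[:n] = green
    g_pad[n + 1:] = green[1:][::-1]
    phi = np.fft.ifft(np.fft.fft(rho_pad) * np.fft.fft(g_pad)).real
    return phi[:n]

# Check against the direct O(n^2) sum with a toy kernel G(k) = 1/(1+k).
n = 16
rho = np.random.default_rng(0).random(n)
green = 1.0 / (1.0 + np.arange(n))
direct = np.array([sum(green[abs(i - j)] * rho[j] for j in range(n))
                   for i in range(n)])
print(np.allclose(open_boundary_convolution(rho, green), direct))  # True
```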
Importance for NERSC-6

For NERSC-6 benchmarking, IMPACT-T offers the following characteristics: the object-oriented Fortran 90 coding style presents compiler optimization challenges; the FFT Poisson solver stresses collective communications with moderate message sizes; greater complexity than other PIC codes due to external fields and open boundary conditions; relatively moderate computational intensity; the fixed global problem size causes smaller message sizes and increasing importance of MPI latency at higher concurrencies.
Figure 16. MPI profile on 1024 cores of Franklin.

Figure 17. MPI call counts.

Figure 18. Message topology for IMPACT-T.
Performance

Although they are both PIC codes, performance for IMPACT-T (Table 13) is quite different from that of GTC, with IMPACT-T generally achieving a lower rate due to the increased importance of global communications and a considerably lower computational intensity.

         IBM Power5       IBM BG/P         QC Opteron       DC Opteron
         Bassi                             Cray XT4         Cray XT4
                                           Jaguar           Franklin
Cores    GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.
256      143      7%      34       4%      111      5%      130      10%
1024     n/a      n/a     174      5%      513      6%      638      12%

Table 13. Measured Aggregate Performance and Percent of Peak for IMPACT-T.
MAESTRO: Low Mach Number Stellar Hydrodynamics Code

MAESTRO is used for simulating astrophysical flows such as those leading up to ignition in Type Ia supernovae. It contains a new algorithm specifically designed to neglect acoustic waves in the modeling of low Mach number convective fluids while retaining compressibility effects due to the nuclear reactions and background stratification that take place in the centuries-long period prior to explosion. MAESTRO allows for large time steps, finite-amplitude density and temperature perturbations, and the hydrostatic evolution of the base state of the star. The basic discretization in MAESTRO combines a symmetric operator-split treatment of chemistry and transport with a density-weighted approximate projection method to impose the evolution constraint. The resulting integration proceeds on the time scale of the relatively slow advective transport. Faster diffusion and chemistry processes are treated implicitly in time. This integration scheme is embedded in an adaptive mesh refinement algorithm based on a hierarchical system of rectangular, non-overlapping grid patches at multiple levels with different resolution; however, in the NERSC-6 benchmark the grid does not adapt. A multigrid solver is used.
Parallelization

Parallelization in MAESTRO is via a 3-D domain decomposition in which data and work are apportioned using a coarse-grained distribution strategy to balance the load and minimize communication costs. Options for data distribution include a knapsack algorithm, with an additional step to balance communications, and Morton-ordering space-filling curves. Studies have shown that the time-to-solution for the low Mach number algorithm is roughly two orders of magnitude faster than for a more traditional compressible reacting flow solver. MAESTRO uses BoxLib, a foundation library of Fortran 90 modules from LBNL that facilitates the development of block-structured finite difference algorithms, with rich data structures for describing operations on data defined in regions of index space that are unions of non-intersecting rectangles. It is particularly useful in unsteady problems where the regions of interest may change in response to an evolving solution. The MAESTRO communication topology pattern (Figure 19) is quite unusual. Other communication characteristics are shown in Figures 20-22.
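Morton (Z-order) indexing, one of the distribution options mentioned above, can be sketched as follows (a generic bit-interleaving routine with invented box coordinates, not BoxLib's implementation):

```python
def morton3d(x, y, z, bits=10):
    """Interleave the bits of integer coordinates (x, y, z) into a
    Morton (Z-order) key. Sorting grid boxes by this key keeps
    spatially nearby boxes nearby in the ordering, which improves
    locality when boxes are distributed across processors."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (3 * b)
        key |= ((y >> b) & 1) << (3 * b + 1)
        key |= ((z >> b) & 1) << (3 * b + 2)
    return key

# Hypothetical box coordinates, ordered along the Z-order curve.
boxes = [(2, 0, 0), (1, 1, 1), (0, 0, 0), (0, 1, 0), (1, 0, 0)]
print(sorted(boxes, key=lambda b: morton3d(*b)))
# [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 1), (2, 0, 0)]
```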
Relationship to NERSC Workload

MAESTRO and the algorithms it represents are used in at least two areas at NERSC: supernova ignition and turbulent flame combustion studies, having SciDAC, INCITE, and base allocations.
Figure 19. Communication topology for MAESTRO from IPM.

Figure 20. MPI call counts for MAESTRO.

Figure 21. Distribution of MPI times on 512 cores of Franklin from IPM.

Figure 22. Distribution of MPI buffer sizes by time from IPM on Franklin.
Importance for NERSC-6

For NERSC-6 benchmarking, MAESTRO offers the following characteristics: the unusual communication topology should stress simple-topology interconnects and represent characteristics associated with irregular or refined grids; very low computational intensity stresses memory performance, especially latency; implicit solver technology stresses global communications; a wide range of message sizes from short to relatively moderate.
Performance

MAESTRO achieves a relatively low overall rate on the systems shown in Table 14. The computational intensity is very low, most likely as a result of the non-floating-point overhead required by the adaptive mesh framework. The data also suggest that MPI communications performance may be a bottleneck, since the efficiency decreases for the larger problem of this weak-scaling set.

         IBM Power5       IBM BG/P         QC Opteron       DC Opteron
         Bassi                             Cray XT4         Cray XT4
                                           Jaguar           Franklin
Cores    GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.
512      178      5%      52*      3%      230      5%      245      9%
2048     n/a      n/a     --       --      406      2%      437      4%

Table 14. Measured Aggregate Performance and Percent of Peak for MAESTRO.
MILC: MIMD Lattice Computation

The benchmark code MILC represents part of a set of codes written by the MIMD Lattice Computation (MILC) collaboration to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics. It performs simulations of four-dimensional SU(3) lattice gauge theory on MIMD parallel machines. Strong interactions are responsible for binding quarks into protons and neutrons and holding them all together in the atomic nucleus.

The MILC collaboration has produced application codes to study several different QCD research areas, only one of which, ks_dynamical simulation with conventional dynamical Kogut-Susskind quarks, is used here.

QCD discretizes space and evaluates field variables on the sites and links of a regular hypercube lattice in four-dimensional space-time. Each link between nearest neighbors in this lattice is associated with a three-dimensional SU(3) complex matrix for a given field.

QCD involves integrating an equation of motion for hundreds or thousands of time steps, which requires inverting a large, sparse matrix at each step of the integration. The sparse matrix problem is solved using a conjugate gradient method, but because the linear system is nearly singular, many CG iterations are required for convergence. Within a processor, the four-dimensional nature of the problem requires gathers from widely separated locations in memory. The matrix in the linear system being solved contains sets of complex three-dimensional "link" matrices, one per 4-D lattice link, but only links between odd sites and even sites are non-zero. The inversion by CG requires repeated three-dimensional complex matrix-vector multiplications, which reduce to dot products of three pairs of three-dimensional complex vectors. The code separates the real and imaginary parts, producing six dot products of pairs of six-dimensional real vectors. Each such dot product consists of five multiply-add operations and one multiply.
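The flop accounting above can be made concrete (a generic 3x3 complex matrix-vector product and a tally following the decomposition described in the text, not MILC's optimized code):

```python
def su3_matvec(m, v):
    """Multiply a 3x3 complex matrix by a 3-component complex vector,
    the kernel applied for every lattice link in the CG inversion."""
    return [m[i][0] * v[0] + m[i][1] * v[1] + m[i][2] * v[2]
            for i in range(3)]

# Each of the 3 output components is a dot product of two 3-component
# complex vectors; separating real and imaginary parts turns it into
# 2 dot products of six-dimensional real vectors, so 6 real dot
# products in all. Each costs 5 multiply-adds plus 1 multiply.
real_dots = 6
madds, muls = 5, 1
flops_per_matvec = real_dots * (2 * madds + muls)
print(flops_per_matvec)  # 66 flops per link matrix-vector product
```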
Relationship to NERSC Workload

MILC has widespread physics community use and a large allocation of resources on NERSC systems. Funded through High Energy Physics Theory, it supports research that addresses fundamental questions in high energy and nuclear physics and is directly related to major experimental programs in these fields.
Parallelization

The primary parallel programming model for MILC is a 4-D domain decomposition with each MPI process assigned an equal number of sub-lattices of contiguous sites. In a four-dimensional problem, each site has eight nearest neighbors. The code is organized so that message-passing routines are compartmentalized from what is built as a library of single-processor linear algebra routines. Many recent production MILC runs have used the RHMC algorithm, an improvement in the molecular dynamics evolution process for QCD that reduces computational cost dramatically.

All three NERSC-6 MILC tests are set up to actually do two runs each. This is to more accurately represent the work that the CG solve must do in actual QCD simulations. The CG solver will take insufficient iterations to converge if one starts with an ordered system, so we first do a short run with a few steps, a larger step size, and a loose convergence criterion. This lets the lattice evolve from totally ordered. In the NERSC-6 version of the code, this portion of the run is timed as "INIT_TIME," and it takes about two minutes on the NERSC Cray XT4. Then, starting with this "primed" lattice, we increase the accuracy for the CG solve, and the iteration count per solve goes up to a more representative value. This is the portion of the code that we time. Between 65,000 and 75,000 CG iterations are done (depending on problem size). The three NERSC-6 problems use weak scaling, and the only difference between the three is the size of the lattice; i.e., there are no differences in steps per trajectory or number of trajectories. In Table 15, the local lattice size is based on the target concurrency. Note that these sizes were chosen as a compromise between perceived science needs and a variety of practical considerations. Scaling studies with larger concurrencies can easily be done using MILC. Estimates of QCD runs in the timeframe of the NERSC-6 system include some of the lattices in the NERSC-6 benchmark but also up to about 96^3 x 192. Communication characteristics are presented in Figures 23-26.

Target Concurrency   Lattice Size (Global)   Lattice Size (Local)
256 cores            32x32x32x36             8x8x8x9
1024 cores           64x64x32x72             8x8x8x9
8192 cores           64x64x64x144            8x8x8x9

Table 15. NERSC-6 MILC lattice sizes.
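The weak-scaling construction of Table 15 can be checked with a one-line volume ratio (a sketch; the decomposition simply tiles the global 4-D lattice with equal local sub-lattices):

```python
from math import prod

def subdomains(global_lattice, local_lattice):
    """Number of MPI ranks implied by a 4-D lattice decomposition:
    the ratio of global to local lattice volumes."""
    return prod(global_lattice) // prod(local_lattice)

# 256-core benchmark: 32x32x32x36 global over 8x8x8x9 local tiles.
print(subdomains((32, 32, 32, 36), (8, 8, 8, 9)))   # 256
# 8192-core benchmark: 64x64x64x144 global, same local volume.
print(subdomains((64, 64, 64, 144), (8, 8, 8, 9)))  # 8192
```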
Figure 23. MPI call counts for MILC.

Figure 24. Communication topology for MILC from IPM on Franklin.

Figure 25. Distribution of communication times for MILC from IPM on 1024 cores of Franklin.

Figure 26. MPI message buffer size distribution based on time for MILC on Franklin from IPM.
Importance for NERSC-6

For NERSC-6 benchmarking, MILC offers the following characteristics: the CG solver with the small subgrid sizes used stresses the interconnect with small messages for both point-to-point and collective operations; extremely dependent on memory bandwidth and prefetching; high computational intensity.
Performance

Table 16 shows that MILC, too, achieves a relatively high percentage of peak on some systems. The dual-core Opteron results include some code tuned for that system, but quad-core Opteron code is not yet available. The increasing effect of communications at larger concurrencies is obvious from the Franklin and Jaguar results on this weak-scaled problem.

         IBM Power5       IBM BG/P         QC Opteron       DC Opteron
         Bassi                             Cray XT4         Cray XT4
                                           Jaguar           Franklin
Cores    GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.  GFLOPs   Effic.
256      488      25%     113      13%     203      9%      291      22%
1024     n/a      n/a     456      13%     513      6%      1101     21%
8192     n/a      n/a     n/a      n/a     3179     5%      5783     14%

Table 16. Measured Aggregate Performance and Percent of Peak for MILC.
PARATEC: Parallel Total Energy Code

The benchmark code PARATEC (PARAllel Total Energy Code) performs ab initio quantum-mechanical total energy calculations using pseudopotentials and a plane wave basis set. Total energy minimization of electrons is done with a self-consistent field (SCF) method. Force calculations are also done to relax the atoms into equilibrium positions. PARATEC uses an all-band (unconstrained) conjugate gradient (CG) approach to solve the Kohn-Sham equations of density functional theory (DFT) and obtain the ground-state electron wavefunctions. In solving the Kohn-Sham equations using a plane wave basis, part of the calculation is carried out in Fourier space, where the wavefunction for each electron is represented by a sphere of points, and the remainder is in real space. Specialized parallel three-dimensional FFTs are used to transform the wavefunctions between real and Fourier space, and a key optimization in PARATEC is to transform only the non-zero grid elements. The code can use optimized libraries for both basic linear algebra and fast Fourier transforms, but due to its global communication requirements, architectures with a poor balance between bisection bandwidth and computational rate will suffer performance degradation at higher concurrencies on PARATEC. Nevertheless, due to favorable scaling, high computational intensity, and other optimizations, the code generally achieves a high percentage of peak performance on both superscalar and vector systems.
In the benchmark problems supplied here, 20 conjugate gradient iterations are performed; however, a real run might typically do up to 60.
Relationship to NERSC Workload

A recent survey of NERSC ERCAP requests for materials science applications showed that DFT codes similar to PARATEC accounted for nearly 80% of all HPC cycles delivered to the Materials Science community. Supported by DOE BES, PARATEC is an excellent proxy for the application requirements of the entire Materials Science community. PARATEC simulations can also be used to predict nuclear magnetic resonance shifts. The overall goal is to simulate the synthesis and predict the properties of multi-component nanosystems.
Parallelization

In a plane-wave simulation, a grid of points represents each electron's wavefunction. Parallel decomposition can be over n(g), the number of grid points per electron (typically O(100,000) per electron); n(i), the number of electrons (typically O(800) per system simulated); or n(k), the number of sampling points (O(1-10)). PARATEC uses MPI and parallelizes over grid points, thereby achieving a fine-grain level of parallelism.

In Fourier space, each electron's wavefunction grid forms a sphere. Figure 27 depicts a visualization of the parallel data layout on three processors. Each processor holds several columns, which are lines along the z-axis of the FFT grid. Load balancing is important because much of the compute-intensive part of the calculation is carried out in Fourier space. To get good load balancing, the columns are first sorted in descending order of length and then assigned, one at a time, to the processor containing the fewest points. The real-space data layout of the wavefunctions is on a standard Cartesian grid, where each processor holds a contiguous part of the space arranged in columns, shown in Figure 27b.

Custom three-dimensional FFTs transform between these two data layouts. Data are kept arranged in columns as the three-dimensional FFT is performed, by taking one-dimensional FFTs along the z, y, and x directions with parallel data transposes between each set of one-dimensional FFTs. These transposes require global inter-processor communication and represent the most important impediment to high scalability. The communication topology for PARATEC is shown in Figure 30.

The FFT portion of the code scales approximately as n² log(n), and the dense linear algebra portion scales approximately as n³; therefore, the overall computation-to-communication ratio scales as n, where n is the number of atoms in the simulation. In general, the speed at which the FFT proceeds dominates the runtime. Table 19, below, compares the speed of the FFT and BLAS-dominated Projector portions of the code for various machines. The data show that PARATEC is an extremely useful benchmark for discriminating between machines based on both computation and communication speed. Communication characteristics are shown in Figures 28–30 and in Table 17.
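The transform scheme described above (successive one-dimensional FFTs separated by data transposes) can be sketched serially with numpy. In PARATEC each transpose is a global inter-processor exchange; in this minimal sketch it is simply an in-memory axis permutation, and the grid size is arbitrary.

```python
import numpy as np

def fft3d_by_1d(a):
    """Compute a 3-D FFT as three rounds of 1-D FFTs with transposes
    between rounds, mirroring PARATEC's column-oriented scheme.
    Serial sketch: in the parallel code each transpose is a global
    inter-processor data exchange."""
    # Round 1: 1-D FFTs along the z (last) axis of each column.
    a = np.fft.fft(a, axis=2)
    # Transpose so the y axis becomes contiguous, then transform it.
    a = a.transpose(0, 2, 1)          # axes now ordered (x, z, y)
    a = np.fft.fft(a, axis=2)
    # Transpose so the x axis becomes contiguous, then transform it.
    a = a.transpose(2, 1, 0)          # axes now ordered (y, z, x)
    a = np.fft.fft(a, axis=2)
    # Undo the permutations to return data in (x, y, z) order.
    return a.transpose(2, 0, 1)

rng = np.random.default_rng(0)
grid = rng.standard_normal((8, 8, 8))
# The decomposed transform agrees with the library 3-D FFT.
assert np.allclose(fft3d_by_1d(grid), np.fft.fftn(grid))
```

Because the 1-D transforms along different axes commute, the result matches a direct three-dimensional FFT; the cost of the scheme in the parallel code lies almost entirely in the transposes.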
Figure 27. A three-processor example of PARATEC's parallel data layout for the wavefunctions of each electron in (a) Fourier space and (b) real space.
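The column-assignment heuristic described in the Parallelization section (sort columns by length, longest first; give each to the least-loaded processor) can be sketched as follows. The column lengths here are invented for illustration, not taken from a real PARATEC run.

```python
import heapq

def assign_columns(column_lengths, nprocs):
    """Greedy load balancing: sort columns by length (longest first),
    then repeatedly give the next column to the processor that
    currently holds the fewest grid points."""
    heap = [(0, p) for p in range(nprocs)]   # (points held, processor id)
    heapq.heapify(heap)
    owner = {}
    for col, length in sorted(enumerate(column_lengths),
                              key=lambda kv: kv[1], reverse=True):
        points, p = heapq.heappop(heap)      # least-loaded processor
        owner[col] = p
        heapq.heappush(heap, (points + length, p))
    return owner

# Hypothetical column lengths for a small sphere of Fourier coefficients.
lengths = [9, 7, 7, 5, 5, 3, 3, 1]
owner = assign_columns(lengths, nprocs=3)
loads = [sum(lengths[c] for c, p in owner.items() if p == q) for q in range(3)]
print(loads)   # [15, 13, 12]
```

Even this simple longest-first heuristic keeps the per-processor point counts within a few points of each other, which is what matters for the Fourier-space compute phase.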
Figure 28. MPI call counts for PARATEC.

Figure 29. Distribution of MPI times for PARATEC on 256 cores of Franklin, from IPM.

Figure 30. Communication topology for PARATEC, from IPM.
Message Buffer Size    Count on 256 Cores    Count on 1024 Cores
Total Message Count    428,318               1,940,665
16 ≤ MsgSz < 256       --                    114,432
256 ≤ MsgSz < 4KB      20,337                1,799,211
4KB ≤ MsgSz < 64KB     403,917               4,611
64KB ≤ MsgSz < 1MB     1,256                 22,412
1MB ≤ MsgSz < 16MB     2,808                 --

Table 17. Important MPI Message Buffer Sizes for PARATEC (from IPM on Franklin)
Importance for NERSC-6

For NERSC-6 benchmarking, PARATEC offers the following characteristics:

- A hand-coded spectral solver with all-to-all data transpositions stresses global communications bandwidth.
- Messages are mostly point-to-point, and since in a single 3-D FFT the size of the data packets scales as the inverse of the square of the number of processors, messages are relatively short.
- The code can be very memory-capacity limited.
- A tuning parameter offers the opportunity to trade memory usage for interprocessor communication intensity.
- It requires quality implementations of system libraries (FFTW/ScaLAPACK/BLACS/BLAS).
- Use of BLAS3 and 1-D FFT libraries results in high cache reuse and a high percentage of per-processor peak performance.
- The fixed global problem size causes smaller message sizes and increasing importance of MPI latency at higher concurrencies.
- It is moderately sensitive to TLB performance.
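The inverse-square message-size scaling can be sketched numerically. The 512³ grid and 16-byte complex elements below are assumptions for illustration only (the real PARATEC grid depends on the input system), but the resulting payloads fall into the same IPM buckets that dominate Table 17 at each concurrency.

```python
# Illustrative estimate of per-message payload in an all-to-all FFT
# transpose: the n^3 grid is exchanged among P^2 processor pairs.
# The 512^3 grid and 16-byte complex element size are assumptions,
# not measured PARATEC parameters.
n = 512                      # grid points per dimension (assumed)
bytes_per_elem = 16          # one double-precision complex value

def transpose_msg_bytes(p):
    """Approximate bytes per message when P processors transpose n^3 points."""
    return n**3 * bytes_per_elem // p**2

print(transpose_msg_bytes(256))    # 32768 bytes: 4KB <= MsgSz < 64KB bucket
print(transpose_msg_bytes(1024))   # 2048 bytes: 256 <= MsgSz < 4KB bucket
```

Quadrupling the processor count shrinks each message by a factor of 16, which is why latency, rather than bandwidth, increasingly dominates at higher concurrencies.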
Performance

Tables 18 and 19 show performance data for PARATEC. The code calculates its own floating-point operation rate for two key components: the "Projector" portion of the code, which is dominated by BLAS3 routines, and the FFTs, which use tuned system libraries but depend on global communications. The comparison among systems, especially between the dual-core and quad-core Cray XT4 systems, is quite interesting.

        IBM Power5        IBM              QC Opteron        DC Opteron
        Bassi             BG/P             Cray XT4 Jaguar   Cray XT4 Franklin
Cores   GFLOPs   Effic.   GFLOPs   Effic.  GFLOPs   Effic.   GFLOPs   Effic.
256     30       7%       517      --      665      31%      729      55%
1024    n/a      n/a      710      20%     2044     24%      2234     42%

Table 18. Measured Aggregate Performance and Percent of Peak for PARATEC
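The efficiency column of Table 18 can be recomputed from the aggregate rates. The per-core peak values used below (5.2 GFLOP/s for Franklin's 2.6 GHz dual-core Opteron, 8.4 GFLOP/s for Jaguar's 2.1 GHz quad-core Opteron) are assumptions based on the processors' clock rates and flops per cycle, not figures taken from this report.

```python
# Recompute percent-of-peak from Table 18's 256-core aggregate rates.
# Per-core peak GFLOP/s values are assumptions (clock rate x flops/cycle),
# not numbers stated in the report.
systems = {
    # name: (measured aggregate GFLOP/s at 256 cores, peak GFLOP/s per core)
    "Franklin (2.6 GHz DC Opteron)": (729, 5.2),
    "Jaguar (2.1 GHz QC Opteron)":   (665, 8.4),
}

for name, (agg, peak_per_core) in systems.items():
    eff = agg / (256 * peak_per_core)
    print(f"{name}: {eff:.0%}")
# Franklin comes out near 55% and Jaguar near 31%, matching Table 18.
```

The same arithmetic applied to the 1024-core row reproduces the 42% and 24% entries, which is a useful sanity check on the assumed per-core peaks.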
Machine    FFT Rate        Projector Rate   Overall Rate
           (MFLOPS/core)   (MFLOPS/core)    (Agg. GFLOPS)
Franklin   213             1045             729
Jaguar     129             1581             665
BG/P       377             561              517
HLRB-II    694             993              890
Bassi      1028            1287             1213

Table 19. Performance Profile for PARATEC (256 cores)
Summary

Table 20 presents a summary of a few of the important characteristics of the codes. In addition to doing a good job of representing key science areas within the NERSC Office of Science workload, the codes provide a good range of architectural "stress points" such as computational intensity and MPI message intensity, type, and size. As such, the codes provide an effective way of differentiating between systems and will be useful for all three goals related to the NERSC-6 project. Finally, the codes provide a strong suite of scalable benchmarks, capable of being run at a wide range of concurrencies.

Code                             CAM     GAMESS   GTC    IMPACT-T   MAESTRO   MILC    PARATEC
CI*                              0.67    0.61     1.15   0.77       0.24      1.39    1.50
Cray XT4 % peak per core
  (largest case)                 13%     12%      24%    14%        5%        14%     44%
Cray XT4 % MPI, medium           29%     4%       9%     20%        1%        20%     27%
Cray XT4 % MPI, large            35%     6%       --     40%        --        23%     64%
Cray XT4 % MPI, extra large      n/a     n/a      n/a    n/a        n/a       30%     n/a
Cray XT4 avg msg size, medium    113KB   n/a      1MB    35KB       2KB       16KB    34KB

Table 20. Comparison of Computational Characteristics for NERSC-6 Benchmarks. *CI is the computational intensity, the ratio of the number of floating-point operations to the number of memory operations.
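The CI row of Table 20 comes from hardware-counter measurements, but the definition can be illustrated by counting operations in a simple kernel. The daxpy-like loop below uses idealized textbook counts (no cache effects), and is a hypothetical example rather than a kernel from any of the benchmark codes.

```python
# Idealized flop and memory-operation counts for y[i] = a*x[i] + y[i].
# Per iteration: 1 multiply + 1 add = 2 flops; 2 loads + 1 store = 3
# memory operations. These are textbook counts, not hardware-counter
# measurements like those behind Table 20.
n = 1000
flops = 2 * n
mem_ops = 3 * n
ci = flops / mem_ops
print(f"CI = {ci:.2f}")   # CI = 0.67, comparable to the lower-CI codes
```

Kernels dominated by BLAS3, by contrast, reuse each loaded operand many times and so reach CI values well above 1, which is part of why PARATEC (CI 1.50) sustains such a high fraction of peak.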
Bibliography (partial)

Adrian T. Wong, Leonid Oliker, William T. C. Kramer, Teresa L. Kaltz, and David H. Bailey, "Evaluating System Effectiveness in High Performance Computing Systems," Proceedings of SC2000, November 2000.

Adrian T. Wong, Leonid Oliker, William T. C. Kramer, Teresa L. Kaltz, and David H. Bailey, "System Utilization Benchmark on the Cray T3E and IBM SP," 5th Workshop on Job Scheduling Strategies for Parallel Processing, Cancun, Mexico, May 2000.

William Kramer and Clint Ryan, "Performance Variability on Highly Parallel Architectures," International Conference on Computational Science 2003, Melbourne, Australia and St. Petersburg, Russia, June 2-4, 2003.

Ingrid Y. Bucher, "The Computational Speed of Supercomputers," Proceedings of the 1983 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1983.

I. Y. Bucher and J. L. Martin, "Methodology for Characterizing a Scientific Workload," in CPEUG '82, pages 121-126, 1982.

Horst D. Simon, William T. C. Kramer, David H. Bailey, Michael J. Banda, E. Wes Bethel, Jonathon T. Carter, James M. Craw, William J. Fortney, John A. Hules, Nancy L. Meyer, Juan C. Meza, Esmond G. Ng, Lynn E. Rippe, William C. Saphir, Francesca Verdier, Howard A. Walter, and Katherine A. Yelick, "Science-Driven Computing: NERSC's Plan for 2006-2010," Lawrence Berkeley National Laboratory Report LBNL-57582, May 2005.

Horst Simon, William Kramer, William Saphir, John Shalf, David Bailey, Leonid Oliker, Michael Banda, C. William McCurdy, John Hules, Andrew Canning, Marc Day, Philip Colella, David Serafini, Michael Wehner, and Peter Nugent, "Science-Driven System Architecture: A New Process for Leadership Class Computing," J. Earth Sim., Vol. 2, January 2005.

Leonid Oliker, Andrew Canning, Jonathan Carter, Costin Iancu, Michael Lijewski, Shoaib Kamil, John Shalf, Hongzhang Shan, Erich Strohmaier, Stephane Ethier, and Tom Goodale, "Scientific Application Performance on Candidate PetaScale Platforms," International Parallel & Distributed Processing Symposium (IPDPS), March 24-30, 2007, Long Beach, CA.
Leonid Oliker, Andrew Canning, Jonathan Carter, John Shalf, David Skinner, Stephane Ethier, Rupak Biswas, Jahed Djomehri, and Rob Van der Wijngaart, "Performance Evaluation of the SX6 Vector Architecture for Scientific Computations," http://crd.lbl.gov/~oliker/drafts/SX6_eval.pdf

A. S. Almgren, J. B. Bell, and M. Zingale, "MAESTRO: A Low Mach Number Stellar Hydrodynamics Code," Journal of Physics: Conference Series 78 (2007) 012085.

J. B. Bell, A. J. Aspden, M. S. Day, and M. J. Lijewski, "Numerical simulation of low Mach number reacting flows," Journal of Physics: Conference Series 78 (2007) 012004.

Y. Ding, Z. Huang, C. Limborg-Deprey, and J. Qiang, "LCLS Beam Dynamics Studies with the 3-D Parallel Impact-T Code," Stanford Linear Accelerator Center Publication SLAC-PUB-12673, presented at the 22nd Particle Accelerator Conference, Albuquerque, New Mexico, U.S.A., June 25-29, 2007.

Ji Qiang, Robert D. Ryne, Salman Habib, and Viktor Decyk, "An Object-Oriented Parallel Particle-in-Cell Code for Beam Dynamics Simulation in Linear Accelerators," Journal of Computational Physics 163, 434-451 (2000).

S. C. Jardin, Editor, "DOE Greenbook: Needs and Directions in High Performance Computing for the Office of Science," Princeton Plasma Physics Laboratory Report PPPL-4090, June 2005. Available from http://www.nersc.gov/news/greenbook/.

William Kramer, John Shalf, and Erich Strohmaier, "The NERSC Sustained System Performance (SSP) Metric," Lawrence Berkeley National Laboratory Report LBNL-58868, http://repositories.cdlib.org/lbnl/LBNL-58868, 2005.

L. Oliker, J. Shalf, J. Carter, A. Canning, S. Kamil, M. Lijewski, and S. Ethier, "Performance Characteristics of Potential Petascale Scientific Applications," in Petascale Computing: Algorithms and Applications (David Bader, editor), Chapman & Hall/CRC Press, 2007.

Lin-Wang Wang, "A Survey of Codes and Algorithms used in NERSC Materials Science Allocations," Lawrence Berkeley National Laboratory Report LBNL-61051, 2005.

A. Mirin and P. Worley, "Extending Scalability of the Community Atmosphere Model," Journal of Physics: Conference Series 78 (2007) 012082.

Arnold Kritz, David Keyes, et al., "Fusion Simulation Project Workshop Report," available from http://www.ofes.fusion.doe.gov/programdocuments.shtml.

K. Yelick, "Programming in UPC (Unified Parallel C)," CS267 Lecture, Spring 2006, lecture notes available from http://upc.lbl.gov/publications/.
For more information on the benchmarks, see http://www.nersc.gov/projects/SDSA/software/?benchmark=NERSC5