
Salinas: A Scalable Software for High-Performance Structural and Solid Mechanics Simulations

Manoj Bhardwaj, Kendall Pierson, Garth Reese, Tim Walsh, David Day, Ken Alvin and James Peery

Sandia National Laboratories, Albuquerque, New Mexico 87185, U.S.A.

Charbel Farhat and Michel Lesoinne

Department of Aerospace Engineering Sciences and Center for Aerospace Structures, University of Colorado, Boulder, Colorado 80309-0429, U.S.A.

Abstract

We present Salinas, a scalable implicit software application for the finite element static and dynamic analysis of complex structural real-world systems. This relatively complete engineering software with more than 100,000 lines of C++ code and a long list of users sustains 292.5 Gflop/s on 2,940 ASCI Red processors, and 1.16 Tflop/s on 3,375 ASCI White processors.

1 Introduction

Part of the Accelerated Strategic Computing Initiative (ASCI) of the US Department of Energy is the development at Sandia of Salinas, a massively parallel implicit structural mechanics/dynamics software aimed at providing a scalable computational workhorse for extremely complex finite element (FE) stress, vibration, and transient dynamics models with tens or hundreds of millions of degrees of freedom (dofs). Such large-scale mathematical models require significant computational effort, but provide important information, including vibrational loads for components within larger systems (Fig. 1), design optimization (Fig. 2), frequency response information for guidance and space systems, and modal data necessary for active vibration control (Fig. 3).

As in the case of other ASCI projects, the success of Salinas hinges on its ability to deliver scalable performance results. However, unlike many other ASCI software,

0-7695-1524-X/02 $17.00 (c) 2002 IEEE

Preprint submitted to the 2002 Gordon Bell Award, 15 August 2002


Fig. 1. Vibrational load analysis of a multi-component system: integration of circuit boards in an electronic package (EP), and integration of this electronic package in a weapon system (WS). FE model sizes: 8,500,000 dofs (EP), 12,000,000 dofs (WS+EP). Machine: ASCI Red. Processors used: 3,000 processors.

Fig. 2. Structural optimization of the electronics package of a re-entry vehicle using Salinas and the DAKOTA [1] optimization toolkit. Objective function: maximization of the safety factor. Left: original design. Right: optimized design. FE model size: 1,000,000 dofs. Machine: ASCI Red. Processors used: 3,000 processors.


Fig. 3. Modal analysis of a lithographic system in support of the design of an active vibration controller needed for enabling image precision to a few nanometers. Shown is the 76th vibration mode, an internal component response. FE model size: 602,070 dofs. Machine: ASCI Red. Processors used: 50 processors.

Salinas is an implicit code and therefore primarily requires a scalable equation solver in order to meet its objectives. Because all ASCI machines are massively parallel computers, the definition of scalability adopted here is the ability to solve an $n$-times larger problem using an $n$-times larger number of processors $N_p$ in a nearly constant CPU time. Achieving this definition of scalability requires an equation solver which is (a) numerically scalable — that is, whose arithmetic complexity grows almost linearly with the problem size, and (b) amenable to a scalable parallel implementation — that is, which can exploit as large an $N_p$ as possible while incurring relatively small interprocessor communication costs. Such a stringent definition of scalability rules out sparse direct solvers because their arithmetic complexity is a nonlinear function of the problem size. On the other hand, several multilevel [2] iterative schemes such as multigrid algorithms [3, 4, 5] and domain decomposition (DD) methods with coarse auxiliary problems [6, 7, 8] can be characterized by a nearly linear arithmetic complexity, or an iteration count which grows only weakly with the size of the problem to be solved. Salinas selected the DD-based FETI-DP iterative solver [9, 10] because of its underlying structural mechanics concepts, its robustness and versatility, its provable numerical scalability [11, 12], and its established scalable performance on massively parallel processors.

Our submission focuses on an engineering software with more than 100,000 lines of code and a long list of users. Salinas contains several computational modules including those for forming and assembling the element stiffness, mass, and damping matrices, recovering the strain and stress fields, performing sensitivity analysis, solving generalized eigenvalue problems, and time-integrating the semi-discrete equations of motion by implicit schemes. Furthermore, our submission addresses total solution time, scalability, and overall CPU efficiency in addition to floating-point performance.

Salinas was developed with code portability in mind. By 1999, it had demonstrated scalability [13] on 1,000 processors of ASCI Red [36]. Currently, it runs routinely on ASCI Red and ASCI White [37], performs equally well on the CPLANT New Mexico [38] cluster of 1,500 Compaq XP1000 processors, and is supported on a variety of sequential and parallel workstations. On 2,940 processors of ASCI Red, Salinas sustains 292.5 Gflop/s — that is, 99.5 Mflop/s per ASCI Red processor, and an overall CPU efficiency of 30%. On 3,375 processors of ASCI White, Salinas sustains 1.16 Tflop/s — that is, 343.7 Mflop/s per processor, and an overall CPU efficiency of 23%. These performance numbers are well above what is commonly considered achievable by unstructured FE-based software.

To the best of our knowledge, Salinas is today the only FE software capable of computing a dozen eigenmodes for a million-dof FE structural model in less than 10 minutes. Given the pressing need for such computations, it is rapidly becoming the model for parallel FE analysis software in both academic and industrial structural mechanics/dynamics communities [14].

2 The Salinas software

Salinas is mostly written in C++ and uses 64-bit arithmetic. It combines a modern object-oriented FE software architecture, scalable computational algorithms, and existing high-performance numerical libraries. Extensive use of high-performance numerical building block algorithms such as the Basic Linear Algebra Subprograms (BLAS), the Linear Algebra Package (LAPACK), and the Message Passing Interface (MPI) has resulted in a software application which features not only scalability on thousands of processors, but also a solid per-processor performance and the desired portability.

2.1 Distributed architecture and data structures

The architecture and data structures of Salinas are based on the concept of DD. They rely on the availability of mesh partitioning software such as CHACO [15], TOP/DOMDEC [16], METIS [17], and JOSTLE [18]. The distributed data structures of Salinas are organized in two levels. The first level supports all local computations as well as all interprocessor communication between neighboring subdomains. The second level interfaces with those data structures of an iterative solver which support the global operations associated with the solution of coarse problems.
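As a rough illustration of this two-level organization, the following sketch (hypothetical type and member names, not Salinas' actual classes) shows the kind of information a first-level subdomain container and a second-level coarse-problem container might hold:

    #include <mpi.h>
    #include <vector>

    // First level: what a processor needs for its own subdomain and for
    // point-to-point exchanges with neighboring subdomains (hypothetical layout).
    struct Subdomain {
      std::vector<double> localStiffness;  // local stiffness matrix in a sparse format
      std::vector<int>    interfaceDofs;   // local indices of dofs shared with neighbors
      std::vector<int>    neighborRanks;   // MPI ranks owning adjacent subdomains
      std::vector<int>    cornerDofs;      // corner dofs feeding the FETI-DP coarse space
    };

    // Second level: the data the iterative solver needs for global (coarse) operations.
    struct CoarseProblem {
      std::vector<double> coarseMatrix;    // assembled coarse operator
      MPI_Comm            comm;            // communicator used for global reductions
    };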

2.2 Analysis capabilities

Salinas includes a full library of structural finite elements. It supports linear multiple point constraints (LMPCs) to provide modeling flexibility, geometrical nonlinearities to address large displacements and rotations, and limited structural nonlinearities to incorporate the effect of joints. It offers five different analysis capabilities: (a) static, (b) modal vibration, (c) implicit transient dynamics, (d) frequency response, and (e) sensitivity. The modal vibration capability is built around ARPACK's Arnoldi solver [19], and the transient dynamics capability is based on the implicit "generalized-α" time-integrator [20]. All analysis capabilities interface with the same FETI-DP module for solving the systems of equations they generate.
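For the transient dynamics capability, every implicit time step reduces to a linear solve of the same form handled by FETI-DP. In the textbook form of the generalized-α method [20] (shown here only for orientation; the paper does not spell out Salinas' exact implementation), the acceleration update solves

$\bigl[(1-\alpha_m)M + (1-\alpha_f)\gamma\,\Delta t\,C + (1-\alpha_f)\beta\,\Delta t^{2}K\bigr]\,a_{n+1} = F_{n+1-\alpha_f} - \alpha_m M a_n - C\bigl[(1-\alpha_f)\tilde v_{n+1} + \alpha_f v_n\bigr] - K\bigl[(1-\alpha_f)\tilde d_{n+1} + \alpha_f d_n\bigr]$

with the Newmark predictors $\tilde d_{n+1} = d_n + \Delta t\,v_n + \Delta t^{2}(\tfrac{1}{2}-\beta)a_n$ and $\tilde v_{n+1} = v_n + \Delta t(1-\gamma)a_n$; the bracketed operator on the left is the matrix handed to the equation solver.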

3 The FETI-DP solver

3.1 Background

Structural mechanics problems can be subdivided into second-order and fourth-order problems. Second-order problems are typically modeled by bar, plane stress/strain, and solid elements, and fourth-order problems by beam, plate, and shell elements. The condition number of a generalized symmetric stiffness matrix $K$ arising from the FE discretization of second-order problems grows asymptotically with the mesh size $h$ as

$\kappa(K) = O(h^{-2})$    (1)

and that of a generalized symmetric stiffness matrix arising from the FE discretization of a fourth-order problem grows asymptotically with $h$ as

$\kappa(K) = O(h^{-4})$    (2)

The above conditioning estimates explain why it was not until powerful and robust preconditioners recently became available that the method of Conjugate Gradients (CG) of Hestenes and Stiefel [21] made its debut in production commercial FE structural mechanics software. Material and discretization heterogeneities as well as bad element aspect ratios, all of which are common in real-world FE models, further worsen the above condition numbers.


Fig. 4. Mesh size $h$ and subdomain size $H$.

3.2 A scalable iterative solver

FETI (Finite Element Tearing and Interconnection) is the generic name for a suite of DD based iterative solvers with Lagrange multipliers designed with the condition number estimates (1) and (2) in mind. The first and simplest FETI algorithm, known as the one-level FETI method, was developed around 1989 [23, 24]. It can be described as a two-step Preconditioned Conjugate Gradient (PCG) algorithm where subdomain problems with Dirichlet boundary conditions are solved in the preconditioning step, and related subdomain problems with Neumann boundary conditions are solved in a second step. The one-level FETI method incorporates a relatively small-size auxiliary problem which is based on the rigid body modes of the floating subdomains. This coarse problem accelerates convergence by propagating the error globally during the PCG iterations.

For second-order elasticity problems, the condition number of the interface problem associated with the one-level FETI method equipped with the Dirichlet preconditioner [25] grows at most polylogarithmically with the number of elements per subdomain:

$\kappa = O\bigl(1 + \log^{2}(H/h)\bigr)$    (3)

Here, $H$ denotes the subdomain size (Fig. 4) and therefore $H/h$ is the number of elements along each side of a uniform subdomain. The condition number estimate (3) constitutes a major improvement over the conditioning result (1). More importantly, it establishes the numerical scalability of the FETI method with respect to the problem size, the number of subdomains, and the number of elements per subdomain.

For fourth-order plate and shell problems, preserving the quasi-optimal condition number estimate (3) requires enriching the coarse problem of the one-level FETI method by the subdomain corner modes [27, 28]. This enrichment transforms the original one-level FETI method into a genuine two-level algorithm known as the two-level FETI method [27, 28]. Both the one-level and two-level FETI methods have been extended to transient dynamics problems as described in [26].


Unfortunately, enriching the coarse problem of the one-level FETI method by the subdomain corner modes increases its computational complexity to a point where the overall scalability of FETI is diminished on a very large number of processors, say $N_p \geq 1{,}000$. For this reason, the basic principles governing the design of the two-level FETI method were recently revisited to construct a more efficient dual-primal FETI method [9, 10] known as the FETI-DP method. This most recent FETI method features the same quasi-optimal condition number estimate (3) for both second- and fourth-order problems, but employs a more economical coarse problem than the two-level FETI method. Mainly for this reason, FETI-DP was chosen to power Salinas.

3.3 A versatile iterative solver

Production codes such as Salinas require their solver to be sufficiently versatile to address, among others, problems with successive right-hand sides and/or LMPCs.

Systems with successive right-hand sides arise in many structural applications including static analysis for multiple loads, sensitivity, modal vibration, and implicit transient dynamics analyses. Krylov-based iterative solvers are ill-suited for these problems, unless they incorporate special techniques for avoiding restarting the iterations from scratch for each different right-hand side [29, 30, 31, 32]. To address this issue, Salinas' FETI-DP solver is equipped with the projection/re-orthogonalization procedure described in [30, 31]. This procedure accelerates convergence as illustrated below.

Fig. 5 reports the performance results obtained for FETI-DP equipped with the projection/re-orthogonalization technique proposed in [30, 31] and applied to the solution of the repeated systems arising from the computation of the first 50 eigenmodes of the optical shutter shown in Fig. 5(a). These results are for a FE structural model with 16 million dofs, and a simulation performed on 1,071 processors of ASCI White. As shown in Fig. 5(b), the number of FETI-DP iterations is equal to 75 for the first right-hand side, and drops to 26 for the 100th. Consequently, the CPU time for FETI-DP drops from 57.0 seconds for the first problem — 33.4 seconds of which correspond to a one-time preprocessing computation — to 10.7 seconds for the last one. The total FETI-DP CPU time for all 100 successive problems is equal to 1,223 seconds. Without the projection/re-orthogonalization-based acceleration procedure, the total CPU consumption by FETI-DP for all 100 successive problems is equal to 2,445 seconds. Hence, for this modal analysis, our acceleration scheme reduces the CPU time of FETI-DP by a factor equal to 2, independently of the parallelism of the computation.
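A minimal serial sketch of this kind of Krylov-direction recycling (a plain conjugate gradient illustration with hypothetical helper names, assuming a symmetric positive definite operator; the production procedure is the one of [30, 31] applied inside FETI-DP) is:

    #include <cmath>
    #include <cstddef>
    #include <functional>
    #include <vector>

    using Vec = std::vector<double>;
    using MatVec = std::function<Vec(const Vec&)>;

    static double dot(const Vec& a, const Vec& b) {
      double s = 0.0;
      for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
      return s;
    }
    static void axpy(double alpha, const Vec& x, Vec& y) {
      for (std::size_t i = 0; i < y.size(); ++i) y[i] += alpha * x[i];
    }

    // Solve K x = b by CG, seeding x with the projection of b onto previously stored
    // K-conjugate directions P (with Q = K*P), and K-orthogonalizing every new search
    // direction against them; new directions are appended for later right-hand sides.
    Vec cgWithRecycling(const MatVec& K, const Vec& b,
                        std::vector<Vec>& P, std::vector<Vec>& Q,
                        int maxIter = 200, double tol = 1e-8) {
      const std::size_t n = b.size();
      Vec x(n, 0.0);
      for (std::size_t i = 0; i < P.size(); ++i)           // x0 = sum_i (p_i,b)/(p_i,Kp_i) p_i
        axpy(dot(P[i], b) / dot(P[i], Q[i]), P[i], x);
      Vec r = b;                                            // r = b - K x0
      { Vec Kx = K(x); for (std::size_t i = 0; i < n; ++i) r[i] -= Kx[i]; }
      Vec p = r;
      double rtr = dot(r, r), bnorm = std::sqrt(dot(b, b));
      for (int it = 0; it < maxIter && std::sqrt(rtr) > tol * bnorm; ++it) {
        for (std::size_t i = 0; i < P.size(); ++i)          // full re-orthogonalization
          axpy(-dot(Q[i], p) / dot(P[i], Q[i]), P[i], p);
        Vec q = K(p);
        double alpha = rtr / dot(p, q);
        axpy(alpha, p, x);
        axpy(-alpha, q, r);
        double rtrNew = dot(r, r);
        P.push_back(p); Q.push_back(q);                     // keep for the next right-hand side
        double beta = rtrNew / rtr;
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rtr = rtrNew;
      }
      return x;
    }

Each new right-hand side starts from the projection of the solution onto the stored directions, which is what drives the iteration count down from 75 to 26 in Fig. 5(b).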

LMPCs are frequent in structural analysis because they ease the FE modeling of complex structures. They can be written in matrix form as $C u = c$, where $C$ is a rectangular matrix, $c$ is a vector, and $u$ is the solution vector of generalized displacements.


(a) Left: optical shutter to be embedded in a MEMS device. Right: von Mises stresses associated with an eigenmode.

(b) Iteration count and CPU time for each successive linear solve.

Fig. 5. Modal analysis of an optical shutter embedded in a MEMS device. Performance results of FETI-DP for the solution of the 100 linear solves arising during the computation of the first 50 eigenmodes. FE model size: 16,000,000 dofs. Machine: ASCI White. Processors used: 1,071 processors.

Solving systems of equations with LMPCs by a DD method is not an easy task because LMPCs can arbitrarily couple two or more subdomains. To address this important issue, a methodology was presented in [33] for generalizing numerically scalable DD-based iterative solvers to the solution of constrained FE systems of equations, without interfering with their local and global preconditioners. FETI-DP incorporates the key elements of this methodology, and therefore naturally handles LMPCs.

4 Optimization of solution time and scalability

The performance and scalability of Salinas are essentially those of its FETI-DP module applied to the solution of a problem of the form

$K u = f$    (4)

where $K$ arises from any of Salinas' analysis capabilities. Given a mesh partition, FETI-DP transforms the above global problem into an interface problem with Lagrange multiplier unknowns, and solves this problem by a PCG algorithm. The purpose of the Lagrange multipliers is to enforce the continuity of the generalized displacement field $u$ on the interior of the subdomain interfaces. Each iteration of the PCG algorithm incurs the solution of a set of sparse subdomain problems with Neumann boundary conditions, a related set of sparse subdomain problems with Dirichlet boundary conditions, and a sparse coarse problem of the form

$K^{*}_{cc}\, u_c = f_c$    (5)

In Salinas, the independent subdomain problems are solved concurrently by the sequential block sparse Cholesky algorithm described in [34]. This solver uses BLAS level 3 operations for factoring a matrix, and BLAS level 2 operations for performing forward and backward substitutions.
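The resulting factor-once, solve-many pattern can be illustrated with a dense LAPACK analogue (a sketch only; Salinas' subdomain solver is the block sparse Cholesky of [34], not LAPACK's dense routines):

    #include <vector>

    // Standard LAPACK Fortran interfaces for a dense Cholesky factorization and solve;
    // the BLAS level 3 work happens inside dpotrf, the triangular solves inside dpotrs.
    extern "C" {
      void dpotrf_(const char* uplo, const int* n, double* a, const int* lda, int* info);
      void dpotrs_(const char* uplo, const int* n, const int* nrhs,
                   const double* a, const int* lda, double* b, const int* ldb, int* info);
    }

    int main() {
      const int n = 3, nrhs = 1;
      // Small symmetric positive definite matrix, stored column-major.
      std::vector<double> A = {4, 1, 1,  1, 3, 0,  1, 0, 2};
      std::vector<double> b = {6, 4, 3};
      int info = 0;
      dpotrf_("L", &n, A.data(), &n, &info);                       // factor once
      dpotrs_("L", &n, &nrhs, A.data(), &n, b.data(), &n, &info);  // reuse the factor per right-hand side
      return info;
    }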

For solving the coarse problem (5), two approaches are available. In the first approach [13], the relatively small sparse matrix $K^{*}_{cc}$ is duplicated in each processor which factors it. In a preprocessing step, the inverse of this matrix, $K^{*\,-1}_{cc}$, is computed by embarrassingly parallel forward and backward substitutions and stored across all $N_p$ processors in the column-wise distributed format $[K^{*\,-1}_{cc}]_s$, $s = 1, \ldots, N_p$. Given that the solution of the coarse problem (5) can be written as

$u_c = K^{*\,-1}_{cc} f_c = \sum_{s=1}^{N_p} \bigl[K^{*\,-1}_{cc}\bigr]_s \bigl[f_c\bigr]_s$    (6)

it is computed in parallel using local matrix-vector products and a single global range communication. In the second approach, $K^{*}_{cc}$ is stored in a scattered column data structure across an optimal number of processors, and the coarse problem (5) is solved by a parallel sparse direct method which resorts to selective inversions for improving the parallel performance of forward and backward substitutions [39].

When $K^{*}_{cc}$ can be stored in each processor, the first approach maximizes the floating-point performance. For example, when Salinas' FETI-DP uses the first approach for solving the coarse problems, it sustains 150 Mflop/s per ASCI Red processor rather than the 99.5 Mflop/s announced in the introduction. However, when the size of $K^{*}_{cc}$ is greater than or equal to the average size of a local subdomain problem, the second approach scales better than the first one and minimizes the CPU time. For these reasons, Salinas selects the first approach when the size of the coarse problem is smaller than the average size of a local subdomain problem, and the second approach otherwise.
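A minimal MPI sketch of the first approach (hypothetical names; each processor owns a block of columns of the pre-computed inverse together with the matching slice of the coarse right-hand side, and one global reduction assembles the product in (6)):

    #include <mpi.h>
    #include <vector>

    // Each processor holds nLocalCols columns of the coarse inverse (column-major) and
    // the corresponding entries of f_c; the coarse solution is the sum over processors
    // of the local column-block products, assembled with a single MPI_Allreduce.
    std::vector<double> solveCoarse(const std::vector<double>& invKccCols, // nCoarse x nLocalCols
                                    const std::vector<double>& fcLocal,    // nLocalCols entries
                                    int nCoarse, MPI_Comm comm) {
      const int nLocalCols = static_cast<int>(fcLocal.size());
      std::vector<double> partial(nCoarse, 0.0), uc(nCoarse, 0.0);
      for (int j = 0; j < nLocalCols; ++j)            // local matrix-vector product
        for (int i = 0; i < nCoarse; ++i)
          partial[i] += invKccCols[j * nCoarse + i] * fcLocal[j];
      MPI_Allreduce(partial.data(), uc.data(), nCoarse, MPI_DOUBLE, MPI_SUM, comm);
      return uc;
    }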

5 Performance and scalability studies

5.1 Focus problems

To illustrate the scalability properties, parallel performance, and CPU efficiency of Salinas and FETI-DP, we consider the static analysis of two different structures. We remind the reader that because Salinas is an implicit code, its performance for static analysis is indicative of its performance for its other analysis capabilities.

The first structure, a cube, defines a model problem which has the virtue of simplifying the tedious tasks of mesh generation and partitioning during scalability studies. We partition this cube into $n_s^x \times n_s^y \times n_s^z$ subdomains, and discretize each subdomain by $n_e \times n_e \times n_e$ 8-noded hexahedral elements.
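For orientation, the size of the resulting global problem can be estimated directly from these parameters; a rough count, assuming three translational dofs per node and ignoring boundary constraints, is:

    #include <cstdio>

    // Approximate dof count for the cube: ns*ne hexahedral elements per direction,
    // 3 translational dofs per node, boundary constraints ignored.
    long long cubeDofs(int nsx, int nsy, int nsz, int ne) {
      long long nx = 1LL * nsx * ne + 1, ny = 1LL * nsy * ne + 1, nz = 1LL * nsz * ne + 1;
      return 3LL * nx * ny * nz;
    }

    int main() {
      // Hypothetical example: a 10 x 10 x 10 subdomain partition with 10^3 elements
      // per subdomain gives roughly 3.09 million dofs.
      std::printf("%lld dofs\n", cubeDofs(10, 10, 10, 10));
      return 0;
    }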

The second structure is the optical shutter introduced in Section 3.3. This real-world structure is far more complex than suggested by Fig. 5(a), as its intricate microscopic geometrical features are hard to visualize. It is composed of three disk layers that are separated by gaps, but connected by a set of flexible beams extending in the direction normal to all three disks (see Fig. 7). We construct two detailed (see Fig. 8) FE models of this real-world structure: model M1 with 16 million dofs, and model M2 with 110 million dofs. We use CHACO [15] to decompose the FE model M1 into 288, 535, 835, and 1,071 subdomains, and the FE model M2 into 3,783 subdomains.

We recognize that for the cube problem, load balance is ensured by the uniformity of each mesh partition, and note that for the optical shutter problem, load balance depends on the generated mesh partition. In all cases, we map one processor onto one subdomain.


Fig. 6. The cube problem: $E = 30.0 \times 10^{6}$ psi — $n_s^x \times n_s^y \times n_s^z$ subdomain partition and uniform $n_e \times n_e \times n_e$ subdomain discretization.

Fig. 7. Optical shutter with a 500 micron diameter and a three-layer construction.

5.2 Measuring performance

All CPU timings reported in the remainder of this paper are wall clock timings obtained using the C/C++ functions available in <time.h>.
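One common way of taking such wall-clock timings with <time.h> is sketched below (an illustration only; the exact calls used in Salinas are not specified here):

    #include <stdio.h>
    #include <time.h>

    int main(void) {
      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);   /* start of the timed stage */
      /* ... work to be timed ... */
      clock_gettime(CLOCK_MONOTONIC, &t1);   /* end of the timed stage */
      double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
      printf("wall clock: %f s\n", seconds);
      return 0;
    }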

On ASCI Red, we use the performance library of this machine, perfmon, to measure the floating-point performance of Salinas. The floating-point performance shown on ASCI Red in the subsequent figures is for the solver (FETI-DP) stage of Salinas only. However, the Salinas execution times measured on ASCI Red include all the stages from beginning to end except the input/output stage.

On ASCI White, we use the hpmcount utility [35] provided by IBM to measure the floating-point performance of Salinas. None of Salinas was excluded from the floating-point measurement on ASCI White. The Salinas execution times on ASCI White include the input/output stage.


Fig. 8. Meshing complexity of the optical shutter.

We measure the overall CPU efficiency as follows:

$\text{CPU Efficiency} = \dfrac{\text{total flop count} / \text{total solution time}}{N_p \times R_{peak}}$    (7)

where $R_{peak}$ is the peak processor floating-point rate and is equal to 333 Mflop/s for an ASCI Red processor, and 1,500 Mflop/s for an ASCI White processor.
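Plugging the per-processor rates reported in the introduction into (7) reproduces the quoted efficiencies:

    #include <cstdio>

    int main() {
      // Overall CPU efficiency = sustained per-processor rate / peak per-processor rate.
      std::printf("ASCI Red:   %.0f%%\n", 100.0 *  99.5 /  333.0);  // about 30%
      std::printf("ASCI White: %.0f%%\n", 100.0 * 343.7 / 1500.0);  // about 23%
      return 0;
    }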

5.3 Scalability and overall CPU efficiency on ASCI Red

For the cube problem, we fix the subdomain discretization $n_e$ and vary $n_s^x$, $n_s^y$ and $n_s^z$ to increase the size of the global problem from less than a million to more than 36 million dofs. We note that the parameters of this problem are such that in all cases, Salinas chooses the parallel sparse direct method described in [39] for solving the coarse problems.


Fig. 9. Static analysis of the cube on ASCI Red: scalability of FETI-DP and Salinas for a fixed-size subdomain and an increasing number of subdomains and processors.

Fig. 10. Static analysis of the cube on ASCI Red: overall CPU efficiency and aggregate floating-point rate achieved by Salinas.

Fig. 9 highlights the scalability of FETI-DP and Salinas on ASCI Red. More specifically, it shows that when the number of subdomains, and therefore the problem size, as well as $N_p$ are increased, the number of FETI-DP iterations remains relatively constant and the total CPU time consumed by FETI-DP increases only slightly. Fig. 10 shows that the floating-point rate achieved by Salinas increases linearly with $N_p$ and reaches 292.5 Gflop/s on 2,940 processors. This corresponds to a performance of 99.5 Mflop/s per processor. Consequently, Salinas delivers an overall CPU efficiency ranging between 34% on 64 processors and 30% on 2,940 processors.


Fig. 11. Static analysis of the cube on ASCI White: scalability of FETI-DP for a fixed-size subdomain and an increasing number of subdomains and processors.

5.4 Scalability and overall CPU efficiency on ASCI White

5.4.1 The cube problem

Here, we fix $n_e$ at a larger value than for the ASCI Red runs because each ASCI White processor has a larger memory than an ASCI Red processor, and vary $n_s^x$, $n_s^y$ and $n_s^z$ to increase the size of the global problem from 2 million to more than 100 million dofs. For this value of $n_e$, the size of the coarse problem is smaller than the size of the subdomain problem and therefore Salinas chooses the first method described in Section 4 for solving the coarse problems.

Fig. 11 highlights the scalability of FETI-DP and Salinas on ASCI White. It shows that when $N_p$ is increased with the size of the global problem, the iteration count and CPU time of FETI-DP remain relatively constant. Furthermore, Fig. 12 shows that the floating-point rate achieved by Salinas on ASCI White increases linearly with $N_p$ and reaches 1.16 Tflop/s on 3,375 processors. This corresponds to an average performance of 343.7 Mflop/s per processor. Consequently, Salinas delivers an overall CPU efficiency ranging between 28% on 64 processors and 24% on 3,375 processors.

5.4.2 The real-world optical shutter problem

To illustrate the parallel scalability of Salinas for a fixed-size global problem, we report in Fig. 13 the speed-up obtained when using the FE model M1 and increasing the number of ASCI White processors from 288 to 1,071. The reader can observe a linear speed-up for this range of processors which is commensurate with the size of the FE model M1.


Fig. 12. Static analysis of the cube on ASCI White: overall CPU efficiency and aggregate floating-point rate achieved by Salinas.

Fig. 13. Static analysis of the optical shutter on ASCI White: speed-up of Salinas for the FE model M1 with 16,000,000 dofs.

Next, we summarize in Table 1 the performance results obtained on ASCI White for the larger FE model M2. The floating-point rate and overall CPU efficiency achieved in this case are lower than those obtained for the cube problem because the nontrivial mesh partitions generated for the optical shutter are not well balanced. Balancing a mesh partition for a DD method in which the subdomain problems are solved by a sparse direct method requires balancing the sparsity patterns of the subdomain problems, which is difficult to achieve without increasing the cost of the partitioning process itself. Still, we point out the remarkable fact that Salinas performs the static analysis of a complex 110-million-dof FE model on 3,783 ASCI White processors in less than 7 minutes and sustains 745 Gflop/s during this process.


FE model    Size               N_p      Solution Time    Performance Rate    Overall CPU Efficiency
M2          110,000,000 dofs   3,783    418 seconds      745 Gflop/s         13%

Table 1. Static analysis of the optical shutter on 3,783 ASCI White processors: performance of Salinas for the FE model M2.

6 Conclusions

This submission focuses on Salinas, an engineering software with more than 100,000 lines of C++ code and a long list of users. In the context of unstructured applications, the 292.5 Gflop/s sustained by this software on 2,940 ASCI Red nodes using 1 CPU/node (2,940 processors) compares favorably with the most recent and impressive performances: 156 Gflop/s on 2,048 nodes of ASCI Red using 1 CPU/node (2,048 processors) demonstrated by a winning entry in 1999 [40], and 319 Gflop/s on 2,048 nodes of ASCI Red using 2 CPUs/node (4,096 processors) demonstrated by another winning entry also in 1999 [41]. Yet, the highlight of this submission is the sustained performance of 1.16 Tflop/s on 3,375 processors of ASCI White by a code that is capable of various structural analyses of real-world FE models with more than 100 million dofs, is supported on a wide variety of platforms, and is inspiring the development of parallel FE software throughout both academic and industrial structural mechanics communities.

Acknowledgments

Salinas code development has been ongoing for several years and would not have been possible if not for the help from the following people: Dan Segalman, John Red-Horse, Clay Fulcher, James Freymiller, Todd Simmermacher, Greg Tipton, Brian Driessen, Carlos Felippa, David Martinez, Michael McGlaun, Thomas Bickel, Padma Raghavan, and Esmond Ng. The authors would also like to thank the support staff of ASCI Red and ASCI White for their indispensable help. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the U.S. Department of Energy (DOE) under Contract No. DE-AC04-94AL85000. Charbel Farhat acknowledges partial support by Sandia under Contract No. BD-2435, and partial support by DOE under Award No. B347880/W-740-ENG-48. Michel Lesoinne acknowledges partial support by DOE under Award No. B347880/W-740-ENG-48.


References

[1] M. S. Eldred, A. A. Giunta, B. G. van Bloemen Waanders, S. F. Wojtkiewicz, W. E. Hart and M. P. Alleva. DAKOTA, a multilevel parallel object-oriented framework for design optimization, parameter estimation, uncertainty quantification, and sensitivity analysis. Version 3.0 reference manual. Sandia Technical Report SAND2001-3515, 2002.

[2] S. F. McCormick. Multilevel adaptive methods for partial differential equations. Frontiers in Applied Mathematics, SIAM, 1989.

[3] S. F. McCormick, ed. Multigrid methods. Frontiers in Applied Mathematics, SIAM, 1987.

[4] W. Briggs. A multigrid tutorial. SIAM, 1987.

[5] P. Vanek, J. Mandel and M. Brezina. Algebraic multigrid on unstructured meshes. Computing 56, 179-196 (1996).

[6] P. LeTallec. Domain-decomposition methods in computational mechanics. Computational Mechanics Advances 1, 121-220 (1994).

[7] C. Farhat and F. X. Roux. Implicit parallel processing in structural mechanics. Computational Mechanics Advances 2, 1-124 (1994).

[8] B. Smith, P. Bjorstad and W. Gropp. Domain decomposition, parallel multilevel methods for elliptic partial differential equations. Cambridge University Press, 1996.

[9] C. Farhat, M. Lesoinne and K. Pierson. A scalable dual-primal domain decomposition method. Numer. Lin. Alg. Appl. 7, 687-714 (2000).

[10] C. Farhat, M. Lesoinne, P. LeTallec, K. Pierson and D. Rixen. FETI-DP: a dual-primal unified FETI method - part I: a faster alternative to the two-level FETI method. Internat. J. Numer. Meths. Engrg. 50, 1523-1544 (2001).

[11] J. Mandel and R. Tezaur. On the convergence of a dual-primal substructuring method. Numer. Math. 88, 543-558 (2001).

[12] A. Klawonn and O. B. Widlund. FETI-DP methods for three-dimensional elliptic problems with heterogeneous coefficients. Technical report, Courant Institute of Mathematical Sciences, 2000.

[13] M. Bhardwaj, D. Day, C. Farhat, M. Lesoinne, K. Pierson and D. Rixen. Application of the FETI method to ASCI problems: scalability results on one-thousand processors and discussion of highly heterogeneous problems. Internat. J. Numer. Meths. Engrg. 47, 513-536 (2000).

[14] J. J. McGowan, G. E. Warren and R. A. Shaw. Whole ship models. The ONR Program Review, Arlington, April 15-18, 2002.

[15] B. Hendrickson and R. Leland. The Chaco User's Guide: Version 2.0. Sandia Tech. Report SAND94-2692, 1994.

[16] C. Farhat, S. Lanteri and H. D. Simon. TOP/DOMDEC, a software tool for mesh partitioning and parallel processing. Comput. Sys. Engrg. 6, 13-26 (1995).

[17] G. Karypis and V. Kumar. Parallel multilevel k-way partition scheme for irregular graphs. SIAM Review 41, 278-300 (1999).

[18] C. Walshaw and M. Cross. Parallel optimization algorithms for multilevel mesh partitioning. Parallel Comput. 26, 1635-1660 (2000).


[19] R. Lehoucq, D. C. Sorensen and C. Yang. ARPACK User's Guide: Solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods. SIAM, 1998.

[20] J. Chung and G. M. Hulbert. A time integration algorithm for structural dynamics with improved numerical dissipation: the generalized-α method. J. Appl. Mech. 60, 371 (1993).

[21] M. R. Hestenes and E. Stiefel. Method of conjugate gradients for solving linear systems. J. Res. Nat. Bur. Standards 49, 409-436 (1952).

[22] G. H. Golub and C. F. Van Loan. Matrix computations. The Johns Hopkins University Press, 1990.

[23] C. Farhat. A Lagrange multiplier based divide and conquer finite element algorithm. J. Comput. Sys. Engrg. 2, 149-156 (1991).

[24] C. Farhat and F. X. Roux. A method of finite element tearing and interconnecting and its parallel solution algorithm. Internat. J. Numer. Meths. Engrg. 32, 1205-1227 (1991).

[25] C. Farhat, J. Mandel and F. X. Roux. Optimal convergence properties of the FETI domain decomposition method. Comput. Meths. Appl. Mech. Engrg. 115, 367-388 (1994).

[26] C. Farhat, P. S. Chen and J. Mandel. A scalable Lagrange multiplier based domain decomposition method for implicit time-dependent problems. Internat. J. Numer. Meths. Engrg. 38, 3831-3858 (1995).

[27] C. Farhat and J. Mandel. The two-level FETI method for static and dynamic plate problems - Part I: an optimal iterative solver for biharmonic systems. Comput. Meths. Appl. Mech. Engrg. 155, 129-152 (1998).

[28] C. Farhat, P. S. Chen, J. Mandel and F. X. Roux. The two-level FETI method - Part II: extension to shell problems, parallel implementation and performance results. Comput. Meths. Appl. Mech. Engrg. 155, 153-180 (1998).

[29] Y. Saad. On the Lanczos method for solving symmetric linear systems with several right-hand sides. Math. Comp. 48, 651-662 (1987).

[30] C. Farhat and P. S. Chen. Tailoring domain decomposition methods for efficient parallel coarse grid solution and for systems with many right hand sides. Contemporary Mathematics 180, 401-406 (1994).

[31] C. Farhat, L. Crivelli and F. X. Roux. Extending substructure based iterative solvers to multiple load and repeated analyses. Comput. Meths. Appl. Mech. Engrg. 117, 195-209 (1994).

[32] P. Fischer. Projection techniques for iterative solution of Ax=b with successive right-hand sides. Comput. Meths. Appl. Mech. Engrg. 163, 193-204 (1998).

[33] C. Farhat, C. Lacour and D. Rixen. Incorporation of linear multipoint constraints in substructure based iterative solvers - Part I: a numerically scalable algorithm. Internat. J. Numer. Meths. Engrg. 43, 997-1016 (1998).

[34] E. G. Ng and B. W. Peyton. Block sparse Cholesky algorithms. SIAM J. Sci. Stat. Comput. 14, 1034-1056 (1993).

[35] Information on hpmcount, a performance monitor utility that reads hardware counters for IBM SP RS/6000 computers. http://www.alphaworks.ibm.com/tech/hpmtoolkit.


[36] ASCI Red Home Page. http://www.sandia.gov/ASCI/Red.

[37] ASCI White Home Page. http://www.llnl.gov/asci/platforms/white.

[38] ASCI CPLANT Home Page. http://www.cs.sandia.gov/cplant.

[39] P. Raghavan. Efficient parallel triangular solution with selective inversion. Parallel Processing Letters 8, 29-40 (1998).

[40] K. Anderson, W. Gropp, D. Kaushik, D. Keyes and B. Smith. Achieving high sustained performance in an unstructured mesh CFD application. Proceedings of SC99, Portland, OR, November 1999.

[41] H. M. Tufo and P. F. Fischer. Terascale spectral element algorithms and implementations. Proceedings of SC99, Portland, OR, November 1999.
