
Salinas: A Scalable Software for High-Performance Structural and Solid Mechanics Simulations

Manoj Bhardwaj, Kendall Pierson, Garth Reese,

Tim Walsh, David Day, Ken Alvin and James Peery

Charbel Farhat and Michel Lesoinne

Sandia National Laboratories, Albuquerque, New Mexico 87185, U.S.A.

Department of Aerospace Engineering Sciences and Center for Aerospace Structures, University of Colorado, Boulder, Colorado 80309-0429, U.S.A.

Abstract

We present Salinas, a scalable implicit software application for the finite element static and dynamic analysis of complex real-world structural systems. This relatively complete engineering software, with more than 100,000 lines of C++ code and a long list of users, sustains 292.5 Gflop/s on 2,940 ASCI Red processors, and 1.16 Tflop/s on 3,375 ASCI White processors.

1 Introduction

Part of the Accelerated Strategic Computing Initiative (ASCI) of the US Department of Energy is the development at Sandia of Salinas, a massively parallel implicit structural mechanics/dynamics software aimed at providing a scalable computational workhorse for extremely complex finite element (FE) stress, vibration, and transient dynamics models with tens to hundreds of millions of degrees of freedom (dofs). Such large-scale mathematical models require significant computational effort, but provide important information, including vibrational loads for components within larger systems (Fig. 1), design optimization (Fig. 2), frequency response information for guidance and space systems, and modal data necessary for active vibration control (Fig. 3).

As in the case of other ASCI projects, the success of Salinas hinges on its ability to deliver scalable performance results.

0-7695-1524-X/02 $17.00 (c) 2002 IEEE

Preprint submitted to the 2002 Gordon Bell Award, 15 August 2002

Fig. 1. Vibrational load analysis of a multi-component system: integration of circuit boards in an electronic package (EP), and integration of this electronic package in a weapon system (WS). FE model sizes: 8,500,000 dofs (EP), 12,000,000 dofs (WS+EP). Machine: ASCI Red. Processors used: 3,000 processors.

Fig. 2. Structural optimization of the electronics package of a re-entry vehicle using Salinas and the DAKOTA [1] optimization toolkit. Objective function: maximization of the safety factor. Left: original design. Right: optimized design. FE model size: 1,000,000 dofs. Machine: ASCI Red. Processors used: 3,000 processors.


Fig. 3. Modal analysis of a lithographic system in support of the design of an active vibration controller needed for enabling image precision to a few nanometers. Shown is the 76th vibration mode, an internal component response. FE model size: 602,070 dofs. Machine: ASCI Red. Processors used: 50 processors.

However, unlike many other ASCI software applications, Salinas is an implicit code and therefore primarily requires a scalable equation solver in order to meet its objectives. Because all ASCI machines are massively parallel computers, the definition of scalability adopted here is the ability to solve an n-times larger problem using an n-times larger number of processors in a nearly constant CPU time. Achieving this definition of scalability requires an equation solver which is (a) numerically scalable, that is, whose arithmetic complexity grows almost linearly with the problem size, and (b) amenable to a scalable parallel implementation, that is, which can exploit as large a number of processors as possible while incurring relatively small interprocessor communication costs. Such a stringent definition of scalability rules out sparse direct solvers because their arithmetic complexity is a nonlinear function of the problem size. On the other hand, several multilevel [2] iterative schemes such as multigrid algorithms [3, 4, 5] and domain decomposition (DD) methods with coarse auxiliary problems [6, 7, 8] can be characterized by a nearly linear arithmetic complexity, or an iteration count which grows only weakly with the size of the problem to be solved. Salinas selected the DD-based FETI-DP iterative solver [9, 10] because of its underlying structural mechanics concepts, its robustness and versatility, its provable numerical scalability [11, 12], and its established scalable performance on massively parallel processors.
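In symbols (the notation T(·,·) is ours, not the paper's): if T(N_dof, N_p) denotes the wall-clock time needed to solve a model with N_dof unknowns on N_p processors, the target is

\[
T\!\left(n\,N_{\mathrm{dof}},\; n\,N_{p}\right) \;\approx\; T\!\left(N_{\mathrm{dof}},\; N_{p}\right)
\]

for as large a scaling factor n as the machine allows.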

Our submission focuses on an engineering software with more than 100,000 lines of code and a long list of users. Salinas contains several computational modules, including those for forming and assembling the element stiffness, mass, and damping matrices, recovering the strain and stress fields, performing sensitivity analysis,


solving generalized eigenvalue problems, and time-integrating the semi-discrete equations of motion by implicit schemes. Furthermore, our submission addresses total solution time, scalability, and overall CPU efficiency in addition to floating-point performance.

Salinas was developed with code portability in mind. By 1999, it had demonstrated scalability [13] on 1,000 processors of ASCI Red [36]. Currently, it runs routinely on ASCI Red and ASCI White [37], performs equally well on the CPLANT New Mexico [38] cluster of 1,500 Compaq XP1000 processors, and is supported on a variety of sequential and parallel workstations. On 2,940 processors of ASCI Red, Salinas sustains 292.5 Gflop/s, that is, 99.5 Mflop/s per ASCI Red processor, and an overall CPU efficiency of 30%. On 3,375 processors of ASCI White, Salinas sustains 1.16 Tflop/s, that is, 343.7 Mflop/s per processor, and an overall CPU efficiency of 23%. These performance numbers are well above what is commonly considered achievable by unstructured FE-based software.

To the best of our knowledge, Salinas is today the only FE software capable of computing a dozen eigenmodes of a million-dof FE structural model in less than 10 minutes. Given the pressing need for such computations, it is rapidly becoming the model for parallel FE analysis software in both the academic and industrial structural mechanics/dynamics communities [14].

2 The Salinas software

Salinas is mostly written in C++ and uses 64-bit arithmetic. It combines a modern object-oriented FE software architecture, scalable computational algorithms, and existing high-performance numerical libraries. Extensive use of high-performance numerical building block algorithms such as the Basic Linear Algebra Subprograms (BLAS), the Linear Algebra Package (LAPACK), and the Message Passing Interface (MPI) has resulted in a software application which features not only scalability on thousands of processors, but also solid per-processor performance and the desired portability.

2.1 Distributed architecture and data structures

The architecture and data structures of Salinas are based on the concept of DD. They rely on the availability of mesh partitioning software such as CHACO [15], TOP/DOMDEC [16], METIS [17], and JOSTLE [18]. The distributed data structures of Salinas are organized in two levels. The first level supports all local computations as well as all interprocessor communication between neighboring subdomains. The second level interfaces with those data structures of an iterative solver


which support the global operations associated with the solution of coarse problems.
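As an illustration of this two-level organization, the sketch below (ours, not taken from the Salinas sources; it assumes a Fortran BLAS providing ddot_ and one subdomain per MPI rank, and it ignores the weighting of interface dofs shared by neighboring subdomains) shows the typical pattern such distributed data structures must support: a purely local computation per subdomain followed by a single global MPI collective.

// Minimal sketch of a distributed inner product in a DD code:
// each rank owns one subdomain's dof vector, computes a local
// contribution with BLAS, and a single collective combines the
// local results. The real data structures also weight dofs that
// are shared by several subdomains, which is omitted here.
#include <mpi.h>
#include <cstdio>
#include <vector>

// Fortran BLAS dot product (assumes an LP64 BLAS is linked in).
extern "C" double ddot_(const int* n, const double* x, const int* incx,
                        const double* y, const int* incy);

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Hypothetical local (subdomain) vectors.
  const int n_local = 1000;
  std::vector<double> u(n_local, 1.0), v(n_local, 2.0);

  // First level: purely local BLAS computation.
  const int inc = 1;
  double local_dot = ddot_(&n_local, u.data(), &inc, v.data(), &inc);

  // Second level: one global operation across all subdomains.
  double global_dot = 0.0;
  MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);

  if (rank == 0) std::printf("global dot product = %g\n", global_dot);
  MPI_Finalize();
  return 0;
}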

2.2 Analysis capabilities

Salinas includes a full library of structural finite elements. It supports linear multiple point constraints (LMPCs) to provide modeling flexibility, geometrical nonlinearities to address large displacements and rotations, and limited structural nonlinearities to incorporate the effect of joints. It offers five different analysis capabilities: (a) static, (b) modal vibration, (c) implicit transient dynamics, (d) frequency response, and (e) sensitivity. The modal vibration capability is built around ARPACK's Arnoldi solver [19], and the transient dynamics capability is based on the implicit generalized-α time-integrator [20]. All analysis capabilities interface with the same FETI-DP module for solving the systems of equations they generate.
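For reference, in standard notation (which the paper does not spell out), the modal vibration capability solves the generalized eigenvalue problem and the implicit transient capability integrates the semi-discrete equations of motion,

\[
K\,\phi = \omega^{2} M\,\phi ,
\qquad
M\,\ddot{u}(t) + C\,\dot{u}(t) + K\,u(t) = f(t) ,
\]

where M, C, and K denote the assembled mass, damping, and stiffness matrices; in both cases the repeated linear solves they generate are handed to the FETI-DP module.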

3 The FETI-DP solver

3.1 Background

Structural mechanics problems can be subdivided into second-order and fourth-order problems. Second-order problems are typically modeled by bar, plane stress/strain, and solid elements, and fourth-order problems by beam, plate, and shell elements. The condition number κ(K) of a generalized symmetric stiffness matrix K arising from the FE discretization of a second-order problem grows asymptotically with the mesh size h as

\[
\kappa(K) = O\!\left(h^{-2}\right) \tag{1}
\]

and that of a generalized symmetric stiffness matrix arising from the FE discretization of a fourth-order problem grows asymptotically with h as

\[
\kappa(K) = O\!\left(h^{-4}\right) \tag{2}
\]

The above conditioning estimates explain why it was not until powerful and robust preconditioners recently became available that the method of Conjugate Gradients (CG) of Hestenes and Stiefel [21] made its debut in production commercial FE structural mechanics software. Material and discretization heterogeneities, as well as bad element aspect ratios, all of which are common in real-world FE models, further worsen the above condition numbers.
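The connection between these condition numbers and solver cost is the standard CG error bound: after m iterations,

\[
\| x - x_m \|_{K} \;\le\; 2 \left( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \right)^{m} \| x - x_0 \|_{K} ,
\]

so the number of iterations needed to reach a fixed tolerance grows like the square root of the condition number, that is, like 1/h for second-order problems and like 1/h^2 for fourth-order problems, unless an effective preconditioner keeps the condition number under control.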


Fig. 4. Mesh size h and subdomain size H.

3.2 A scalable iterative solver

FETI (Finite Element Tearing and Interconnecting) is the generic name for a suite of DD-based iterative solvers with Lagrange multipliers designed with the condition number estimates (1) and (2) in mind. The first and simplest FETI algorithm, known as the one-level FETI method, was developed around 1989 [23, 24]. It can be described as a two-step Preconditioned Conjugate Gradient (PCG) algorithm in which subdomain problems with Dirichlet boundary conditions are solved in the preconditioning step, and related subdomain problems with Neumann boundary conditions are solved in a second step. The one-level FETI method incorporates a relatively small auxiliary problem based on the rigid body modes of the floating subdomains. This coarse problem accelerates convergence by propagating the error globally during the PCG iterations.
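The algebraic skeleton shared by all FETI variants is an ordinary PCG loop. The self-contained C++ sketch below (ours; the dense tridiagonal operator and Jacobi preconditioner are toy stand-ins, not the FETI interface operator) marks where the two families of subdomain solves enter: the operator application plays the role of the Neumann subdomain solves plus the coarse solve, and the preconditioning step plays the role of the Dirichlet subdomain solves.

// Generic preconditioned conjugate gradient loop with pluggable
// operator and preconditioner callbacks (toy problem for clarity).
#include <cmath>
#include <cstdio>
#include <functional>
#include <vector>

using Vec = std::vector<double>;
using Op  = std::function<Vec(const Vec&)>;

static double dot(const Vec& a, const Vec& b) {
  double s = 0.0;
  for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
  return s;
}

// Solve A x = b by PCG; applyA and applyPrec are supplied by the caller.
Vec pcg(const Op& applyA, const Op& applyPrec, const Vec& b,
        int max_it = 200, double tol = 1e-10) {
  Vec x(b.size(), 0.0);
  Vec r = b;                    // residual for the zero initial guess
  Vec z = applyPrec(r);         // "Dirichlet solve" role in FETI
  Vec p = z;
  double rz = dot(r, z);
  for (int k = 0; k < max_it && std::sqrt(dot(r, r)) > tol; ++k) {
    Vec Ap = applyA(p);         // "Neumann + coarse solve" role in FETI
    double alpha = rz / dot(p, Ap);
    for (size_t i = 0; i < x.size(); ++i) {
      x[i] += alpha * p[i];
      r[i] -= alpha * Ap[i];
    }
    z = applyPrec(r);
    double rz_new = dot(r, z);
    double beta = rz_new / rz;
    rz = rz_new;
    for (size_t i = 0; i < p.size(); ++i) p[i] = z[i] + beta * p[i];
  }
  return x;
}

int main() {
  // Tiny SPD tridiagonal test operator and a Jacobi preconditioner.
  const int n = 5;
  auto applyA = [n](const Vec& v) {
    Vec y(n, 0.0);
    for (int i = 0; i < n; ++i) {
      y[i] = 2.0 * v[i];
      if (i > 0)     y[i] -= v[i - 1];
      if (i < n - 1) y[i] -= v[i + 1];
    }
    return y;
  };
  auto applyPrec = [](const Vec& v) {  // diag(A) = 2 everywhere
    Vec y(v);
    for (double& yi : y) yi *= 0.5;
    return y;
  };
  Vec b(n, 1.0);
  Vec x = pcg(applyA, applyPrec, b);
  for (double xi : x) std::printf("%g ", xi);
  std::printf("\n");
  return 0;
}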

For second-order elasticity problems, the condition number of the interface problem associated with the one-level FETI method equipped with the Dirichlet preconditioner [25] grows at most polylogarithmically with the number of elements per subdomain:

\[
\kappa \;=\; O\!\left( \left( 1 + \log \frac{H}{h} \right)^{2} \right) \tag{3}
\]

Here, H denotes the subdomain size (Fig. 4), and therefore H/h is the number of elements along each side of a uniform subdomain. The condition number estimate (3) constitutes a major improvement over the conditioning result (1). More importantly, because H/h remains fixed when the problem size and the number of subdomains are increased proportionally, it establishes the numerical scalability of the FETI method with respect to the problem size, the number of subdomains, and the number of elements per subdomain.

For fourth-order plate and shell problems, preserving the quasi-optimal condition number estimate (3) requires enriching the coarse problem of the one-level FETI method with the subdomain corner modes [27, 28]. This enrichment transforms the original one-level FETI method into a genuine two-level algorithm known as the two-level FETI method [27, 28]. Both the one-level and two-level FETI methods have been extended to transient dynamics problems as described in [26].


Unfortunately, enriching the coarse problem of the one-level FETI method with the subdomain corner modes increases its computational complexity to a point where the overall scalability of FETI is diminished on a very large number of processors, say 1,000 or more. For this reason, the basic principles governing the design of the two-level FETI method were recently revisited to construct a more efficient dual-primal FETI method [9, 10] known as the FETI-DP method. This most recent FETI method features the same quasi-optimal condition number estimate (3) for both second- and fourth-order problems, but employs a more economical coarse problem than the two-level FETI method. Mainly for this reason, FETI-DP was chosen to power Salinas.

3.3 A versatile iterative solver

Production codes such as Salinas require their solver to be sufficiently versatile to address, among other things, problems with successive right-hand sides and/or LMPCs.

Systems with successive right-hand sides arise in many structural applications, including static analysis for multiple loads, sensitivity, modal vibration, and implicit transient dynamics analyses. Krylov-based iterative solvers are ill-suited for these problems unless they incorporate special techniques for avoiding restarting the iterations from scratch for each different right-hand side [29, 30, 31, 32]. To address this issue, Salinas' FETI-DP solver is equipped with the projection/re-orthogonalization procedure described in [30, 31]. This procedure accelerates convergence, as illustrated below.
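In outline, following [30, 31] (the matrix W below is our notation): the search directions w_1, ..., w_k generated while solving the earlier systems are stored and kept K-conjugate; a new system K x = b is then started from the projected initial guess

\[
x_0 \;=\; W \left( W^{T} K W \right)^{-1} W^{T} b ,
\qquad W = [\, w_1 \;\cdots\; w_k \,],
\]

and every new search direction is re-orthogonalized, in the K-inner product, against the columns of W, so that the Krylov information accumulated for previous right-hand sides is reused rather than rebuilt from scratch.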

Fig. 5 reports the performance results obtained for FETI-DP equipped with the projection/re-orthogonalization technique proposed in [30, 31] and applied to the solution of the repeated systems arising from the computation of the first 50 eigenmodes of the optical shutter shown in Fig. 5(a). These results are for an FE structural model with 16 million dofs, and a simulation performed on 1,071 processors of ASCI White. As shown in Fig. 5(b), the number of FETI-DP iterations is equal to 75 for the first right-hand side and drops to 26 for the 100th. Consequently, the CPU time for FETI-DP drops from 57.0 seconds for the first problem (33.4 seconds of which correspond to a one-time preprocessing computation) to 10.7 seconds for the last one. The total FETI-DP CPU time for all 100 successive problems is equal to 1,223 seconds. Without the projection/re-orthogonalization-based acceleration procedure, the total CPU consumption by FETI-DP for all 100 successive problems is equal to 2,445 seconds. Hence, for this modal analysis, our acceleration scheme reduces the CPU time of FETI-DP by a factor of 2, independently of the parallelism of the computation.

LMPCs are frequent in structural analysis because they ease the FE modeling of complex structures. They can be written in matrix form as C u = c, where C is


(a) Left: optical shutter to be embedded in a MEMS device. Right: von Mises stresses associated with an eigenmode.

(b) Iteration count and CPU time for each successive linear solve.

Fig. 5. Modal analysis of an optical shutter embedded in a MEMS device. Performance results of FETI-DP for the solution of the 100 linear solves arising during the computation of the first 50 eigenmodes. FE model size: 16,000,000 dofs. Machine: ASCI White. Processors used: 1,071 processors.

a rectangular matrix, c is a vector, and u is the solution vector of generalized displacements. Solving systems of equations with LMPCs by a DD method is not an easy task because LMPCs can arbitrarily couple two or more subdomains. To address this important issue, a methodology was presented in [33] for generalizing numerically scalable DD-based iterative solvers to the solution of constrained FE systems of equations, without interfering with their local and global preconditioners.


FETI-DP incorporates the key elements of this methodology, and therefore naturally handles LMPCs.
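For reference, one common way to state the constrained problem that such a solver must handle (with K and f denoting the stiffness matrix and load vector of the unconstrained problem and μ the Lagrange multipliers attached to the constraints; this explicit form is not written out in the paper) is the saddle-point system

\[
\begin{bmatrix} K & C^{T} \\ C & 0 \end{bmatrix}
\begin{bmatrix} u \\ \mu \end{bmatrix}
=
\begin{bmatrix} f \\ c \end{bmatrix} ,
\]

whose constraint blocks are precisely what may couple arbitrary subdomains and thereby complicate a DD solver.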

4 Optimization of solution time and scalability

The performance and scalability of Salinas are essentially those of its FETI-DP module applied to the solution of a problem of the form

\[
K u = f \tag{4}
\]

where K arises from any of Salinas' analysis capabilities. Given a mesh partition, FETI-DP transforms the above global problem into an interface problem with Lagrange multiplier unknowns, and solves this problem by a PCG algorithm. The purpose of the Lagrange multipliers is to enforce the continuity of the generalized displacement field u on the interior of the subdomain interfaces. Each iteration of the PCG algorithm incurs the solution of a set of sparse subdomain problems with Neumann boundary conditions, a related set of sparse subdomain problems with Dirichlet boundary conditions, and a sparse coarse problem of the form

\[
K^{*}_{cc}\, x_c \;=\; b_c \tag{5}
\]

In Salinas, the independent subdomain problems are solved concurrently by the sequential block sparse Cholesky algorithm described in [34]. This solver uses BLAS level 3 operations for factoring a matrix, and BLAS level 2 operations for performing the forward and backward substitutions.
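As a simplified illustration of the factor-once/solve-many pattern exploited here, the sketch below uses LAPACK's dense Cholesky routines (dpotrf/dpotrs) on a tiny SPD matrix; the production solver of [34] is block sparse rather than dense, but the division of labor is the same: a BLAS-3-rich factorization performed once, followed by cheap triangular substitutions reused for every subsequent right-hand side.

// Factor-once / solve-many pattern, sketched with LAPACK's dense
// Cholesky as a stand-in for the block sparse Cholesky of [34].
#include <cstdio>
#include <vector>

extern "C" {
void dpotrf_(const char* uplo, const int* n, double* a, const int* lda,
             int* info);
void dpotrs_(const char* uplo, const int* n, const int* nrhs, const double* a,
             const int* lda, double* b, const int* ldb, int* info);
}

int main() {
  // Small SPD matrix stored column-major (Fortran convention).
  const int n = 3, nrhs = 1;
  std::vector<double> A = {4, 1, 1,
                           1, 3, 0,
                           1, 0, 2};
  std::vector<double> b = {1, 2, 3};

  int info = 0;
  const char uplo = 'L';
  dpotrf_(&uplo, &n, A.data(), &n, &info);   // factor once
  if (info != 0) { std::printf("factorization failed\n"); return 1; }

  // Reuse the factor for each right-hand side encountered later:
  // one forward/backward triangular substitution pair per solve.
  dpotrs_(&uplo, &n, &nrhs, A.data(), &n, b.data(), &n, &info);
  std::printf("x = %g %g %g\n", b[0], b[1], b[2]);
  return 0;
}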

For solving the coarse problem (5), two approaches are available. In the first approach [13], the relatively small sparse matrix K*_cc is duplicated in each processor, which factors it. In a preprocessing step, the inverse of this matrix, K*_cc^{-1}, is computed by embarrassingly parallel forward and backward substitutions and stored across all N_p processors in the column-wise distributed format [K*_cc^{-1}]_j, j = 1, ..., N_p. Given that the solution of the coarse problem (5) can be written as

\[
x_c \;=\; K^{*\,-1}_{cc} b_c \;=\; \sum_{j=1}^{N_p} \left[ K^{*\,-1}_{cc} \right]_j \left[ b_c \right]_j \tag{6}
\]

it is computed in parallel using local matrix-vector products and a single global reduction. In the second approach, K*_cc is stored in a scattered column data structure across an optimally chosen number of processors, and the coarse problem (5) is solved by a parallel sparse direct method which resorts to selective


inversions for improving the parallel performance of the forward and backward substitutions [39].

When� 8:;: canbestoredin eachprocessor, thefirstapproachmaximizesthefloating-point performance.For example,whenSalinas’FETI-DP usesthe first approachfor solving the coarseproblems,it sustains150 Mflop/s per ASCI Redprocessorratherthan the 99.5 Mflop/s announcedin the introduction.However, when thesizeof � 8:W: is greateror equalto the averagesizeof a local subdomainproblem,thesecondapproachscalesbetterthanthefirst oneandminimizestheCPUtime.For thesereasons,Salinasselectsthefirst methodwhenthesizeof thecoarseprob-lem is smallerthantheaveragesizeof a local subdomainproblem,andthesecondmethodotherwise.

5 Performance and scalability studies

5.1 Focus problems

To illustrate the scalability properties, parallel performance, and CPU efficiency of Salinas and FETI-DP, we consider the static analysis of two different structures. We remind the reader that because Salinas is an implicit code, its performance for static analysis is indicative of its performance for its other analysis capabilities.

The first structure, a cube, defines a model problem which has the virtue of simplifying the tedious tasks of mesh generation and partitioning during scalability studies. We partition this cube into N_s^x × N_s^y × N_s^z subdomains, and discretize each subdomain by n_e × n_e × n_e 8-noded hexahedral elements.
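For orientation (our bookkeeping, assuming three translational dofs per node of the 8-noded hexahedra), the resulting global model size is approximately

\[
N_{\mathrm{dof}} \;=\; 3\,\bigl(N_s^{x} n_e + 1\bigr)\bigl(N_s^{y} n_e + 1\bigr)\bigl(N_s^{z} n_e + 1\bigr)
\;\approx\; 3\,N_s^{x} N_s^{y} N_s^{z}\, n_e^{3} ,
\]

so the global problem size grows in direct proportion to the number of subdomains when n_e is held fixed, which is exactly the weak-scaling scenario exercised in Section 5.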

The second structure is the optical shutter introduced in Section 3.3. This real-world structure is far more complex than suggested by Fig. 5(a), as its intricate microscopic geometrical features are hard to visualize. It is composed of three disk layers that are separated by gaps but connected by a set of flexible beams extending in the direction normal to all three disks (see Fig. 7). We construct two detailed (see Fig. 8) FE models of this real-world structure: model M1 with 16 million dofs, and model M2 with 110 million dofs. We use CHACO [15] to decompose the FE model M1 into 288, 535, 835, and 1,071 subdomains, and the FE model M2 into 3,783 subdomains.

We recognize that for the cube problem, load balance is ensured by the uniformity of each mesh partition, and note that for the optical shutter problem, load balance depends on the generated mesh partition. In all cases, we map one processor onto one subdomain.


Fig. 6. The cube problem: E = 30.0e+6 psi, ν = 0.3; N_s^x × N_s^y × N_s^z partition and uniform n_e × n_e × n_e subdomain discretization.

Fig. 7. Optical shutter with a 500 micron diameter and a three-layer construction.

5.2 Measuring performance

All CPU timings reported in the remainder of this paper are wall clock timings obtained using the C/C++ functions available in <time.h>.

On ASCI Red, we use the performance library of this machine, perfmon, to measure the floating-point performance of Salinas. The floating-point performance shown for ASCI Red in the subsequent figures is for the solver (FETI-DP) stage of Salinas only. However, the Salinas execution times measured on ASCI Red include all the stages from beginning to end except the input/output stage.

On ASCI White, we use the hpmcount utility [35] provided by IBM to measure the floating-point performance of Salinas. No part of Salinas was excluded from the floating-point measurement on ASCI White. The Salinas execution times on ASCI White include the input/output stage.


Fig. 8. Meshing complexity of the optical shutter.

We measure the overall CPU efficiency as follows:

\[
\text{CPU Efficiency} \;=\; \frac{\text{total flop count} \,/\, \text{total solution time}}{N_p \times R_{\text{peak}}} \tag{7}
\]

where R_peak is the peak processor floating-point rate, equal to 333 Mflop/s for an ASCI Red processor and 1,500 Mflop/s for an ASCI White processor.
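As a check of (7) against the figures quoted in the introduction,

\[
\frac{292.5 \times 10^{3}\ \text{Mflop/s}}{2{,}940 \times 333\ \text{Mflop/s}} \approx 0.30 ,
\qquad
\frac{1.16 \times 10^{6}\ \text{Mflop/s}}{3{,}375 \times 1{,}500\ \text{Mflop/s}} \approx 0.23 ,
\]

that is, 30% overall CPU efficiency on ASCI Red and 23% on ASCI White.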

5.3 Scalability and overall CPU efficiency on ASCI Red

For the cube problem, we fix the subdomain discretization n_e and vary N_s^x, N_s^y, and N_s^z to increase the size of the global problem from less than a million to more than 36 million dofs. We note that the parameters of this problem are such that, in all cases, Salinas chooses the parallel sparse direct method described in [39] for solving the coarse problems.


Fig. 9. Static analysis of the cube on ASCI Red: scalability of FETI-DP and Salinas for a fixed-size subdomain and an increasing number of subdomains and processors.

Fig. 10. Static analysis of the cube on ASCI Red: overall CPU efficiency and aggregate floating-point rate achieved by Salinas.

Fig. 9 highlights the scalability of FETI-DP and Salinas on ASCI Red. More specifically, it shows that when the number of subdomains, and therefore the problem size, as well as the number of processors N_p are increased, the number of FETI-DP iterations remains relatively constant and the total CPU time consumed by FETI-DP increases only slightly. Fig. 10 shows that the floating-point rate achieved by Salinas increases linearly with N_p and reaches 292.5 Gflop/s on 2,940 processors. This corresponds to a performance of 99.5 Mflop/s per processor. Consequently, Salinas delivers an overall CPU efficiency ranging between 34% on 64 processors and 30% on 2,940 processors.


Fig. 11. Static analysis of the cube on ASCI White: scalability of FETI-DP for a fixed-size subdomain and an increasing number of subdomains and processors.

5.4 Scalability and overall CPU efficiency on ASCI White

5.4.1 The cube problem

Here, we fix n_e at a larger value than on ASCI Red because each ASCI White processor has a larger memory than an ASCI Red processor, and vary N_s^x, N_s^y, and N_s^z to increase the size of the global problem from 2 million to more than 100 million dofs. For this choice of n_e, the size of the coarse problem is smaller than the size of a subdomain problem and therefore Salinas chooses the first method described in Section 4 for solving the coarse problems.

Fig. 11 highlights the scalability of FETI-DP and Salinas on ASCI White. It shows that when N_p is increased with the size of the global problem, the iteration count and CPU time of FETI-DP remain relatively constant. Furthermore, Fig. 12 shows that the floating-point rate achieved by Salinas on ASCI White increases linearly with N_p and reaches 1.16 Tflop/s on 3,375 processors. This corresponds to an average performance of 343.7 Mflop/s per processor. Consequently, Salinas delivers an overall CPU efficiency ranging between 28% on 64 processors and 24% on 3,375 processors.

5.4.2 The real-world optical shutter problem

To illustrate the parallel scalability of Salinas for a fixed-size global problem, we report in Fig. 13 the speed-up obtained when using the FE model M1 and increasing the number of ASCI White processors from 288 to 1,071. The reader can observe a linear speed-up for this range of processors, which is commensurate with the size of the FE model M1.


Fig. 12. Static analysis of the cube on ASCI White: overall CPU efficiency and aggregate floating-point rate achieved by Salinas.

Fig. 13. Static analysis of the optical shutter on ASCI White: speed-up of Salinas for the FE model M1 with 16,000,000 dofs.

Next, we summarize in Table 1 the performance results obtained on ASCI White for the larger FE model M2. The floating-point rate and overall CPU efficiency achieved in this case are lower than those obtained for the cube problem because the nontrivial mesh partitions generated for the optical shutter are not well balanced. Balancing a mesh partition for a DD method in which the subdomain problems are solved by a sparse direct method requires balancing the sparsity patterns of the subdomain problems, which is difficult to achieve without increasing the cost of the partitioning process itself. Still, we point out the remarkable fact that Salinas performs the static analysis of a complex 110 million-dof FE model on 3,375 ASCI White processors in less than 7 minutes and sustains 745 Gflop/s during this process.


FE model | Size | Number of subdomains | Solution time | Performance rate | Overall CPU efficiency
M2 | 110,000,000 dofs | 3,783 | 418 seconds | 745 Gflop/s | 13%

Table 1
Static analysis of the optical shutter on 3,783 ASCI White processors: performance of Salinas for the FE model M2.

6 Conclusions

This submission focuses on Salinas, an engineering software with more than 100,000 lines of C++ code and a long list of users. In the context of unstructured applications, the 292.5 Gflop/s sustained by this software on 2,940 ASCI Red nodes using 1 CPU/node (2,940 processors) compares favorably with the most recent and impressive performances: 156 Gflop/s on 2,048 nodes of ASCI Red using 1 CPU/node (2,048 processors) demonstrated by a winning entry in 1999 [40], and 319 Gflop/s on 2,048 nodes of ASCI Red using 2 CPUs/node (4,096 processors) demonstrated by another winning entry, also in 1999 [41]. Yet, the highlight of this submission is the sustained performance of 1.16 Tflop/s on 3,375 processors of ASCI White by a code that is capable of various structural analyses of real-world FE models with more than 100 million dofs, is supported on a wide variety of platforms, and is inspiring the development of parallel FE software throughout both the academic and industrial structural mechanics communities.

Acknowledgments

Salinas code development has been ongoing for several years and would not have been possible if not for the help of the following people: Dan Segalman, John Red-Horse, Clay Fulcher, James Freymiller, Todd Simmermacher, Greg Tipton, Brian Driessen, Carlos Felippa, David Martinez, Michael McGlaun, Thomas Bickel, Padma Raghavan, and Esmond Ng. The authors would also like to thank the support staff of ASCI Red and ASCI White for their indispensable help. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the U.S. Department of Energy (DOE) under Contract No. DE-AC04-94AL85000. Charbel Farhat acknowledges partial support by Sandia under Contract No. BD-2435, and partial support by DOE under Award No. B347880/W-740-ENG-48. Michel Lesoinne acknowledges partial support by DOE under Award No. B347880/W-740-ENG-48.


References

[1] M. S. Eldred, A. A. Giunta, B. G. van Bloemen Waanders, S. F. Wojtkiewicz, W. E. Hart and M. P. Alleva. DAKOTA, a multilevel parallel object-oriented framework for design optimization, parameter estimation, uncertainty quantification, and sensitivity analysis. Version 3.0 reference manual. Sandia Technical Report SAND2001-3515, 2002.

[2] S. F. McCormick. Multilevel adaptive methods for partial differential equations. Frontiers in Applied Mathematics, SIAM, 1989.

[3] S. F. McCormick, ed. Multigrid methods. Frontiers in Applied Mathematics, SIAM, 1987.

[4] W. Briggs. A multigrid tutorial. SIAM, 1987.

[5] P. Vanek, J. Mandel and M. Brezina. Algebraic multigrid on unstructured meshes. Computing 56, 179-196 (1996).

[6] P. LeTallec. Domain-decomposition methods in computational mechanics. Computational Mechanics Advances 1, 121-220 (1994).

[7] C. Farhat and F. X. Roux. Implicit parallel processing in structural mechanics. Computational Mechanics Advances 2, 1-124 (1994).

[8] B. Smith, P. Bjorstad and W. Gropp. Domain decomposition, parallel multilevel methods for elliptic partial differential equations. Cambridge University Press, 1996.

[9] C. Farhat, M. Lesoinne and K. Pierson. A scalable dual-primal domain decomposition method. Numer. Lin. Alg. Appl. 7, 687-714 (2000).

[10] C. Farhat, M. Lesoinne, P. LeTallec, K. Pierson and D. Rixen. FETI-DP: a dual-primal unified FETI method - part I: a faster alternative to the two-level FETI method. Internat. J. Numer. Meths. Engrg. 50, 1523-1544 (2001).

[11] J. Mandel and R. Tezaur. On the convergence of a dual-primal substructuring method. Numer. Math. 88, 543-558 (2001).

[12] A. Klawonn and O. B. Widlund. FETI-DP methods for three-dimensional elliptic problems with heterogeneous coefficients. Technical report, Courant Institute of Mathematical Sciences, 2000.

[13] M. Bhardwaj, D. Day, C. Farhat, M. Lesoinne, K. Pierson and D. Rixen. Application of the FETI method to ASCI problems: scalability results on one-thousand processors and discussion of highly heterogeneous problems. Internat. J. Numer. Meths. Engrg. 47, 513-536 (2000).

[14] J. J. McGowan, G. E. Warren and R. A. Shaw. Whole ship models. The ONR Program Review, Arlington, April 15-18, 2002.

[15] B. Hendrickson and R. Leland. The Chaco User's Guide: Version 2.0. Sandia Tech. Report SAND94-2692, 1994.

[16] C. Farhat, S. Lantéri and H. D. Simon. TOP/DOMDEC, a software tool for mesh partitioning and parallel processing. Comput. Sys. Engrg. 6, 13-26 (1995).

[17] G. Karypis and V. Kumar. Parallel multilevel k-way partition scheme for irregular graphs. SIAM Review 41, 278-300 (1999).

[18] C. Walshaw and M. Cross. Parallel optimization algorithms for multilevel mesh partitioning. Parallel Comput. 26, 1635-1660 (2000).


[19] R. Lehoucq, D. C. Sorensen and C. Yang. ARPACK Users' Guide: Solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods. SIAM, 1998.

[20] J. Chung and G. M. Hulbert. A time integration algorithm for structural dynamics with improved numerical dissipation: the generalized-α method. J. Appl. Mech. 60, 371 (1993).

[21] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. J. Res. Nat. Bur. Standards 49, 409-436 (1952).

[22] G. H. Golub and C. F. Van Loan. Matrix computations. The Johns Hopkins University Press, 1990.

[23] C. Farhat. A Lagrange multiplier based divide and conquer finite element algorithm. J. Comput. Sys. Engrg. 2, 149-156 (1991).

[24] C. Farhat and F. X. Roux. A method of finite element tearing and interconnecting and its parallel solution algorithm. Internat. J. Numer. Meths. Engrg. 32, 1205-1227 (1991).

[25] C. Farhat, J. Mandel and F. X. Roux. Optimal convergence properties of the FETI domain decomposition method. Comput. Meths. Appl. Mech. Engrg. 115, 367-388 (1994).

[26] C. Farhat, P. S. Chen and J. Mandel. A scalable Lagrange multiplier based domain decomposition method for implicit time-dependent problems. Internat. J. Numer. Meths. Engrg. 38, 3831-3858 (1995).

[27] C. Farhat and J. Mandel. The two-level FETI method for static and dynamic plate problems - Part I: an optimal iterative solver for biharmonic systems. Comput. Meths. Appl. Mech. Engrg. 155, 129-152 (1998).

[28] C. Farhat, P. S. Chen, J. Mandel and F. X. Roux. The two-level FETI method - Part II: extension to shell problems, parallel implementation and performance results. Comput. Meths. Appl. Mech. Engrg. 155, 153-180 (1998).

[29] Y. Saad. On the Lanczos method for solving symmetric linear systems with several right-hand sides. Math. Comp. 48, 651-662 (1987).

[30] C. Farhat and P. S. Chen. Tailoring Domain Decomposition Methods for Efficient Parallel Coarse Grid Solution and for Systems with Many Right Hand Sides. Contemporary Mathematics 180, 401-406 (1994).

[31] C. Farhat, L. Crivelli and F. X. Roux. Extending substructure based iterative solvers to multiple load and repeated analyses. Comput. Meths. Appl. Mech. Engrg. 117, 195-209 (1994).

[32] P. Fischer. Projection techniques for iterative solution of Ax=b with successive right-hand sides. Comput. Meths. Appl. Mech. Engrg. 163, 193-204 (1998).

[33] C. Farhat, C. Lacour and D. Rixen. Incorporation of linear multipoint constraints in substructure based iterative solvers - Part I: a numerically scalable algorithm. Internat. J. Numer. Meths. Engrg. 43, 997-1016 (1998).

[34] E. G. Ng and B. W. Peyton. Block sparse Cholesky algorithms. SIAM J. Sci. Stat. Comput. 14, 1034-1056 (1993).

[35] Information on hpmcount, a performance monitor utility that reads hardware counters for IBM SP RS/6000 computers. http://www.alphaworks.ibm.com/tech/hpmtoolkit.


[36] ASCI Red Home Page. http://www.sandia.gov/ASCI/Red.

[37] ASCI White Home Page. http://www.llnl.gov/asci/platforms/white.

[38] ASCI CPLANT Home Page. http://www.cs.sandia.gov/cplant.

[39] P. Raghavan. Efficient parallel triangular solution with selective inversion. Parallel Processing Letters 8, 29-40 (1998).

[40] K. Anderson, W. Gropp, D. Kaushik, D. Keyes and B. Smith. Achieving high sustained performance in an unstructured mesh CFD application. Proceedings of SC99, Portland, OR, November 1999.

[41] H. M. Tufo and P. F. Fischer. Terascale spectral element algorithms and implementations. Proceedings of SC99, Portland, OR, November 1999.

