Automatic Dynamic Load Balancing for a Crack Propagation Application

Gengbin Zheng†, Michael S. Breitenfeld‡, Hari Govind†, Philippe Geubelle‡, Laxmikant V. Kale†∗
†Department of Computer Science, University of Illinois at Urbana-Champaign
‡Department of Aerospace Engineering, University of Illinois at Urbana-Champaign
∗Corresponding Author: [email protected]

Abstract— Automatic, adaptive load balancing is essential for handling load imbalance that may occur during parallel finite element simulations involving mesh adaptivity, nonlinear material behavior and other localized effects. This paper demonstrates the successful application of a measurement-based dynamic load balancing concept to the finite element analysis of elasto-plastic wave propagation and dynamic fracture events. The simulations are performed with the aid of a parallel framework for unstructured meshes called ParFUM, which is based on Charm++ and Adaptive MPI (AMPI) and involves migratable user-level threads. The performance was analyzed using Projections, a performance analysis and post factum visualization tool. The bottlenecks to scalability are identified and eliminated using a variety of strategies, resulting in performance gains ranging from moderate to highly significant.

I. INTRODUCTION

Researchers in the field of structural mechanics have often turned to parallel finite element modeling to model physical phenomena with more detail, sophistication, and accuracy. While parallel computing can provide large amounts of computational power, developing parallel software requires substantial effort to leverage parallel computers efficiently.

Among the challenges associated with the parallelization of finite element codes, achieving load balance is the key to scaling a dynamic application to a large number of processors. This is especially true for dynamic structural mechanics codes, where simulations involve rapidly evolving geometry and physics, often resulting in a load imbalance between processors. As a result of this load imbalance, the application runs at the speed of the slowest processor, with deteriorated performance. The problem of load imbalance has triggered various research activities in load balancing techniques [1], [2], [3], [4]. Dynamic load balancing attempts to solve the load balance problem at run-time according to the most up-to-date load situation.

Dynamic load balancing is a challenging software design issue and generally creates a burden for the application developers. For example, a computational analyst working on computational fracture mechanics must include a mechanism to inform the decision-making module about the estimated CPU load and the communication structure. In addition, once load imbalance is detected and data migration is requested, a developer has to write complicated code for moving data across processors. The ideal load balancing framework should hide the details of load balancing so that the application developer can concentrate on modeling the physics of the problem.

In this paper, we present an automatic load balancing method and its application to wave propagation and dynamic crack propagation problems. The parallelization model used in this application is processor virtualization supported by migratable MPI threads. The application runs on a large number of MPI threads (exceeding the actual physical number of processors), which allows run-time load balancing by migrating MPI threads. The MPI run-time system automatically collects load information from the execution of the application. Based on this instrumented load data, the run-time module makes decisions on migrating MPI threads from heavily loaded processors to underloaded ones. This approach thus requires minimal effort from the application developer.

The remainder of the paper is organized as follows: Section II presents the finite element formulation used in this paper, with emphasis on the viscoplastic model and cohesive finite element scheme adopted here to model the dynamic propagation of a crack in a ductile medium. Section III describes the parallelization method and programming environment used to implement the structural mechanics application, while Section IV describes ParFUM, the high-level domain-specific library introduced to help developers with the parallelization aspects of the application. ParFUM is based on the CHARM++ load balancing framework summarized in Section V. Sections VI and VII respectively describe the performance of parallel adaptive finite element simulations of an elasto-plastic wave propagation problem and of a dynamic fracture event. Section VIII discusses related work in load balancing research. Finally, Section IX concludes with some future plans.

II. COHESIVE FINITE ELEMENT MODEL OF FRACTURE

To simulate the spontaneous initiation and propagation of a crack in a discretized domain, we use an explicit cohesive-volumetric finite element (CVFE) scheme [5], [6], [7]. As its name indicates, the scheme relies on a combination of volumetric elements, used to capture the constitutive response of the continuum medium, and of cohesive interfacial elements, used to model the failure process taking place in the vicinity of the advancing crack front. The CVFE concept is illustrated in Figure 1, which presents two 4-node tetrahedral volumetric elements tied together by a 6-node cohesive element shown in its deformed configuration, as the adjacent nodes are initially superposed and the cohesive element initially has no volume.

Fig. 1. Two 4-node tetrahedral volumetric elements linked by a 6-node cohesive element.

In the present study, the mechanical response of the cohesive elements is described by the bilinear traction-separation law illustrated in Figure 2 for the case of tensile (Mode I) failure. After an initial stiffening (rising) phase, the cohesive traction Tn reaches a maximum corresponding to the failure strength σmax of the material, followed by a downward phase that represents the progressive failure of the material. Once the critical value ∆nc of the displacement jump is reached, no more traction is exerted across the cohesive interface and a traction-free surface (i.e., a crack) is created in the discretized domain. The emphasis of the dynamic fracture study summarized hereafter is on the simulation of purely mode I failure, although cohesive models have also been proposed for the simulation of mixed-mode fracture events. Also illustrated in Figure 2 is an unloading and reloading path followed by the cohesive traction during an unloading event taking place while the material fails.
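To make the law concrete, the sketch below evaluates the bilinear law of Figure 2 at a single cohesive integration point. It is a minimal reading of the figure, not the paper's code: the parameter names (deltaPeak for the opening at peak traction, deltaC for the critical opening ∆nc) and the secant unloading/reloading rule are our assumptions.

```cpp
#include <algorithm>

// Bilinear mode-I envelope of Fig. 2: linear rise to the failure strength
// sigmaMax at opening deltaPeak, then linear softening to zero at the
// critical opening deltaC (the paper's Delta_nc).
double envelope(double delta, double deltaPeak, double deltaC, double sigmaMax) {
    if (delta <= 0.0)      return 0.0;
    if (delta < deltaPeak) return sigmaMax * delta / deltaPeak;
    if (delta < deltaC)    return sigmaMax * (deltaC - delta) / (deltaC - deltaPeak);
    return 0.0;  // delta >= deltaC: traction-free surface, i.e., a crack
}

// Traction with irreversible damage: deltaMax records the largest opening
// reached so far; unloading/reloading follows a secant back to the origin
// (our reading of the unloading path sketched in Fig. 2).
double cohesiveTraction(double delta, double &deltaMax,
                        double deltaPeak, double deltaC, double sigmaMax) {
    deltaMax = std::max(deltaMax, delta);
    if (deltaMax <= 0.0 || delta >= deltaMax)  // never loaded, or on envelope
        return envelope(delta, deltaPeak, deltaC, sigmaMax);
    double tAtMax = envelope(deltaMax, deltaPeak, deltaC, sigmaMax);
    return tAtMax * delta / deltaMax;          // secant unloading/reloading
}
```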

Fig. 2. Bilinear traction-separation law for mode I failure modeling.

The finite element formulation of the CVFE scheme is derived from the following form of the principle of virtual work:

$$\int_V \left(\rho\,\ddot{u}_i\,\delta u_i + S_{ij}\,\delta E_{ij}\right) dV = \int_{S_T} T^{ex}_i\,\delta u_i\,dS_T + \int_{S_c} T_i\,\delta\Delta_i\,dS_c, \tag{1}$$

where the left-hand side corresponds to the virtual work done by the inertial forces ($\rho\,\ddot{u}_i$) and the internal stresses ($S_{ij}$), and the right-hand side denotes the virtual work associated with the externally applied traction ($T^{ex}_i$) and cohesive traction ($T_i$) acting along their respective surfaces of application $S_T$ and $S_c$. In Equation (1), $\rho$ denotes the material density, $u_i$ and $E_{ij}$ are the displacement and strain fields, respectively, and $\Delta_i$ denotes the displacement jump across the cohesive surfaces. The implementation relies on an explicit time stepping scheme based on the central difference formulation [5]. A nonlinear kinematics description is used to capture the large deformation and rotation associated with the propagation of the crack. The strain measure used here is the Lagrangian strain tensor $E$.

To complete the CVFE scheme, we need to model the constitutive response of the material, i.e., to describe the response of the volumetric elements. In the present study, we use an explicit elasto-visco-plastic update scheme, which is compatible with the nonlinear kinematic description and relies on the multiplicative decomposition of the deformation gradient $F$ into elastic and plastic parts as

$$F = F^e F^p. \tag{2}$$

The update of the plastic component $F^p$ of the deformation gradient at the $(n+1)$th time step is obtained by

$$F^p_{n+1} = \exp\left[\sum_A \frac{\Delta\gamma}{\sqrt{2}\,\bar{\sigma}}\left(\sigma_A - \frac{I^\sigma_1}{3}\right) N_A \otimes N_A\right] \cdot F^p_n, \tag{3}$$

where $N_A$ ($A = 1, 2, 3$) denote the Lagrangian axes defined in the initial configuration, $\Delta\gamma$ is the discretized plastic strain increment, $I^\sigma_1$ is the first Cauchy stress invariant, and $\bar{\sigma} = \sqrt{(\sigma' : \sigma')/2}$ is the effective stress, with $\sigma'$ denoting the Cauchy stress deviator whose spectral decomposition is

$$\sigma' = \sum_A \left(\sigma_A - \frac{I^\sigma_1}{3}\right) N_A \otimes N_A. \tag{4}$$

The plastic strain increment is given by $\Delta\gamma = \Delta t\,\dot{\gamma}$, where the plastic strain rate is described in this study by the classical Perzyna two-parameter model

$$\dot{\gamma} = \eta \left(\frac{f(\bar{\sigma})}{\sigma_Y}\right)^n, \tag{5}$$

in which $n$ and $\eta$ are material constants, $\sigma_Y$ is the current yield stress, and $f(\bar{\sigma}) = (\bar{\sigma} - \sigma_Y)$ is the overstress. Strain hardening is captured by introducing a tangent modulus $E_t$ relating the increment of the yield stress, $\Delta\sigma_Y$, to the plastic strain increment $\Delta\gamma$. Finally, the linear relation

$$S = L\,E \tag{6}$$

between the second Piola-Kirchhoff stresses $S$ and the Lagrangian strains $E$ is used to describe the elastic response. Assuming material isotropy, the stiffness tensor $L$ is defined by the Young's modulus $E$ and Poisson's ratio $\nu$.

The main source of load imbalance comes from the very different computational costs associated with the elastic and visco-plastic constitutive updates. As long as the effective stress remains below a given level (chosen in this paper as 80% of the yield stress), only the elastic relation (6) is computed. Once this threshold is reached for the first time, the visco-plastic update is performed, which typically represents a doubling in the computational cost. As the crack propagates through the discretized domain, the load associated with each processor can become substantially heterogeneous, suggesting the need for the robust dynamic load balancing scheme described in Section V.
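The cost structure just described can be summarized in a sketch of the per-element update dispatch. The function and field names are ours; the 80% threshold and the roughly 2x cost ratio are the values quoted above, and treating the threshold as sticky ("once reached for the first time") is our reading of the text.

```cpp
struct ElemState { double effectiveStress, yieldStress; bool everYielded; };

void elasticUpdate(ElemState &e);       // Eq. (6): cheap linear update
void viscoPlasticUpdate(ElemState &e);  // Eqs. (2)-(5): roughly 2x the cost

// Hedged sketch of the dispatch that creates the load imbalance.
void updateElement(ElemState &e) {
    if (e.everYielded || e.effectiveStress >= 0.8 * e.yieldStress) {
        e.everYielded = true;   // once the threshold is reached, stay on the
        viscoPlasticUpdate(e);  // expensive visco-plastic path
    } else {
        elasticUpdate(e);       // elastic-only path for low-stress elements
    }
}
```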

III. PARALLELIZATION WITH AMPI

The parallel program for the simulation of fracture dynamics is written and parallelized using Adaptive MPI.

Adaptive MPI (AMPI) [8], [9] is an MPI implementation and extension based on the CHARM++ [10] programming model. CHARM++ is a parallel C++ programming language that embodies the concept of processor virtualization [11]. The idea of processor virtualization is that the programmer decomposes the computation, without regard to the physical number of processors available, into a large number of logical work units and data units, which are encapsulated in virtual processors (VPs) [11]. The programmer leaves the assignment of VPs to physical processors to the run-time system, which incorporates intelligent optimization strategies and automatic runtime adaptation. These virtual processors can be programmed using any programming paradigm: e.g., they can be organized as indexed collections of C++ objects that interact via asynchronous method invocations, as in CHARM++ [12]. Alternatively, they can be MPI virtual processors implemented as user-level, extremely lightweight threads (not to be confused with system-level threads or Pthreads) that interact with each other via messages, as in AMPI (illustrated in Figure 3).

Fig. 3. Implementation of AMPI virtual processors
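To make the model concrete, the hedged sketch below is an ordinary MPI program that runs unchanged under AMPI, where each "rank" becomes a migratable user-level thread. The launch line in the trailing comment uses AMPI's usual +vp option to request more VPs than physical processors; the exact command may differ by installation.

```cpp
// Minimal sketch: an unmodified MPI program becomes an AMPI program.
// Each MPI "rank" below is a migratable user-level thread (a VP).
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int vp, nvp;
    MPI_Comm_rank(MPI_COMM_WORLD, &vp);   // index of this virtual processor
    MPI_Comm_size(MPI_COMM_WORLD, &nvp);  // total VPs, not physical CPUs
    std::printf("VP %d of %d starting its mesh chunk\n", vp, nvp);
    // ... per-chunk computation and boundary exchange go here ...
    MPI_Finalize();
    return 0;
}
// Launched, e.g., with 160 VPs on 32 physical processors (as in Section VI):
//   charmrun +p32 ./pgm +vp160
```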

This idea of processor virtualization brings significant benefits to both parallel programming productivity and parallel performance [13]. It empowers the run-time system to incorporate intelligent optimization strategies and automatic runtime adaptation. The following is a list of the benefits we have demonstrated in many projects.

Automatic load balancing: AMPI threads (the virtual processors) are decoupled from real processors; thus, they are location independent and can migrate from processor to processor. Thread migration provides the basic mechanism for load balancing — if some of the physical processors become overloaded, the run-time system can migrate a few of their AMPI threads to underloaded physical processors. The AMPI run-time system provides transparent support for message forwarding after thread migration.

Adaptive overlapping of communication and computation: If one of the AMPI threads is blocked on a receive, another AMPI thread on the same physical processor can run. This largely eliminates the need for the programmer to manually specify static computation/communication overlapping, as is often required in MPI.

Optimized communication library support: Besides the communication optimizations inherited from CHARM++, AMPI supports asynchronous, or non-blocking, interfaces to collective communication operations. This allows the overlapping of time-consuming collective operations with other useful computation.

Better cache performance: A virtual processor handles a smaller set of data than a physical processor, so a virtual processor has better memory locality. This blocking effect is the same method many serial cache optimizations employ, and AMPI programs get this benefit automatically.

Flexibility to run on an arbitrary number of processors: Since more than one VP can be executed on one physical processor, AMPI is capable of running MPI programs on any arbitrary number of processors. This feature proves useful in the application development and debugging phases.

In many applications, we have demonstrated that processor virtualization does not incur much cost in parallel performance [13], due to the low scheduling overhead of user-level threads. In fact, it often improves cache performance significantly because of its blocking effect.

CHARM++ and AMPI have been used as mature parallelization tools and run-time systems for a variety of scalable real-world applications [14], [15], [16], [17]. To further enhance programmer productivity, we have developed domain-specific frameworks on top of CHARM++ and AMPI to automate the parallelization process, producing reusable libraries for parallel algorithms. ParFUM, the parallel framework for unstructured meshes on which this work is based, is described in the next section.

IV. PARALLEL FRAMEWORK FOR UNSTRUCTURED MESHES

A wide variety of applications involve explicit computations on unstructured grids. As described earlier, one of the key objectives of this work is to create a flexible framework to perform this type of simulation on parallel computing platforms. Parallel programming introduces several complications:

• Simply expressing a computation in parallel requires the use of either a specialized language such as HPF [18] or an additional library such as MPI.

• Parallel execution makes race conditions and nondeterministic execution possible. Some languages, such as HPF, have a simple lockstep control structure and are thus relatively immune to this problem, while in others, such as Pthreads, these problems are more common.

• Computation and communication must be overlapped to achieve optimal performance. However, few languages provide good support for this overlap, and even simple static schemes can be painfully difficult to implement.

• Load imbalance can severely restrict performance, especially for dynamic applications. Automatic or application-independent load-balancing capabilities are rare (Section VIII).

Our approach to managing the complexity of parallel programming is based on a simple division of labor. In this approach, parallel programming specialists in computer science provide a simple but efficient parallel framework for the computation, while application specialists provide the numerics and physics. The parallel framework described hereafter abstracts away the details of its parallel implementation.

Since the parallel framework is application independent, it can be reused across multiple projects. This reuse amortizes the effort and time spent developing the framework and makes it feasible to invest in sophisticated capabilities such as adaptive computation and communication overlap and automatic measurement-based load balancing. Overall, this approach has proven quite effective, leveraging skills in both computer science and engineering to solve problems neither could solve independently.

A. ParFUM Framework

This section describes our parallel framework (called ParFUM) for performing explicit computations on unstructured grids. The framework has been used for finite-element computations, solving partial differential equations, computational fluid dynamics, and other problems.

The basic abstraction provided is very simple — the computational domain consists of an irregular mesh of nodes and elements. The elements are divided into partitions or chunks, normally using the graph partitioning library METIS [19] or ParMETIS [20]. These chunks reside in AMPI migratable virtual processors, thereby taking advantage of run-time optimizations including dynamic load balancing. The chunks of mesh and the AMPI virtual processors are then distributed across the processors of the parallel machine. There is normally at least one chunk per processor, and often more. Nodes can be either private, adjacent to the elements of a single partition, or shared, adjacent to the elements of different partitions.

A ParFUM application has two main subroutines: the init and the driver. The init subroutine executes only on processor 0 and is used to read the input mesh and physical data and register them with the framework. The framework then partitions the mesh into as many regions as requested, each partition being a virtual processor. It then executes the driver routine on each virtual processor. This routine computes the solution over the local partition of the mesh.

The solution loop for most applications involves a calculation in which each node or element requires data from its neighboring entities. Thus, entities on the boundary of a partition need data from entities on other partitions. ParFUM provides a flexible and scalable approach to meet an application's communication requirements. ParFUM adds local read-only copies of remote entities to the partition boundary. These read-only copies are referred to as ghosts. A single collective call to ParFUM allows the user to update all ghost entities with data from the original copies on neighboring partitions. This lets application code have effortless access to data from neighboring entities on other partitions. Since the definition of "neighboring" can vary from one application to another, ParFUM provides a flexible mechanism for generating ghost layers. For example, an application might consider two tetrahedra that share a face as neighbors. In another application, tetrahedra that share edges might be considered neighbors. ParFUM users can specify the type of ghost layer required by defining the "neighboring" relationship in the init routine and adding multiple layers of ghosts according to the neighboring relationship for applications that require them. In addition, the definition of "neighboring" can vary for different layers. User-specified ghost layers are automatically added after partitioning the input mesh provided during the init routine. ParFUM also updates the connectivity and adjacency information of a partition's entities to reflect the additional layers of ghosts. Thus ParFUM satisfies the communication needs of a wide range of applications by allowing the user to add arbitrary ghost layers. After the communication for ghost layers, each local partition is nearly self-contained; a serial numerics routine can be run on the partition with only a minor modification to the boundary conditions.
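A structural sketch of this init/driver organization follows. The ParFUM_* calls and the small data types are illustrative stand-ins of ours, not the framework's actual API; only the division of labor matches the description above.

```cpp
#include <vector>

struct Element   { int nodes[4]; };                  // 4-node tetrahedron
struct Partition { std::vector<Element> elements;    // my chunk of the mesh
                   /* nodal data, ghost layers, ... */ };
struct Mesh      { /* global nodes, connectivity, physical data */ };

Mesh      readMeshFromDisk(const char *file);        // application-supplied
void      ParFUM_registerMesh(const Mesh &m);        // stand-in
Partition ParFUM_getLocalPartition();                // stand-in
void      ParFUM_updateGhosts(Partition &p);         // stand-in
void      computeTimeStep(Partition &p);             // serial numerics kernel

// init(): runs on processor 0 only; reads and registers the global mesh,
// which the framework then partitions (via METIS) into one chunk per VP.
void init() {
    Mesh m = readMeshFromDisk("specimen.mesh");
    ParFUM_registerMesh(m);
}

// driver(): runs on every virtual processor over its own partition.
void driver() {
    Partition p = ParFUM_getLocalPartition();
    for (int step = 0; step < 10000; ++step) {
        computeTimeStep(p);       // local element/node updates
        ParFUM_updateGhosts(p);   // collective: refresh read-only ghost copies
    }
}
```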

With the above design, the ParFUM framework enables straightforward conversion of serial codes into parallel applications. For example, in an explicit structural dynamics computation, each iteration of the time loop has the following structure:

1) Compute element strains based on nodal displacements.
2) Compute element stresses based on element strains.
3) Compute nodal forces based on element stresses.
4) Apply external boundary conditions.
5) Compute new nodal displacements based on Newtonian physics.

In a serial code, these operations apply over the entire mesh. However, since each operation is local, depending only on a node or element's immediate neighbors, we can partition the mesh and run the same code on each partition.

The only problem is ensuring that the boundary conditions of the different partitions match. The solution we choose is to duplicate the nodes along the boundary and then sum up the nodal forces during step 3, which amounts to this simple change:


1) Compute element strains based on nodal displacements.
2) Compute element stresses based on element strains.
3) Compute nodal forces based on element stresses.
4) Apply external and internal boundary conditions.
5) Compute new nodal displacements based on Newtonian physics.
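The sketch below shows where that one change lands in steps 3 and 4: each partition first accumulates its own partial nodal forces, then a single collective call sums the contributions held by every partition sharing a boundary node. The ParFUM_sumSharedNodes call and the helper kernels are illustrative stand-ins, not the framework's exact API.

```cpp
#include <vector>

struct Element   { int nodes[4]; };
struct Partition { std::vector<Element> elements; /* nodal data, ghosts */ };

// Stand-ins for the framework call and the serial kernels:
void ParFUM_sumSharedNodes(double *field, int dofPerNode);
void scatterElementForces(const Element &e, double *force);
void applyExternalBC(Partition &p, double *force);

// Steps 3-4 of the parallel loop on one partition (3 force dofs per node).
void nodalForceStep(Partition &p, double *force) {
    // Step 3: each element scatters its stress-derived forces to its own
    // nodes; duplicated boundary nodes hold only this partition's partial sum.
    for (const Element &e : p.elements)
        scatterElementForces(e, force);
    // Step 4, internal part: one collective call adds together the partial
    // sums of every partition sharing a boundary node...
    ParFUM_sumSharedNodes(force, 3);
    // ...then the external boundary conditions are applied as in serial code.
    applyExternalBC(p, force);
}
```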

For existing codes that have already been parallelized with MPI, the conversion to ParFUM is even faster, and it likewise gains features such as dynamic load balancing.

V. LOAD BALANCING FRAMEWORK

Many ParFUM applications involve simulations with dynamic geometry and use adaptive techniques to solve highly irregular problems. In these applications, load balancing is required to achieve the desired high performance on large parallel machines. It is especially essential for applications where the amount of computation on a mesh partition can increase significantly as the number of elements comprising the partition increases with refinement and/or cohesive element insertion. It is also useful in applications where the computational load for subsets of elements varies over the duration of the simulation.

ParFUM directly utilizes the load balancing framework of CHARM++ and AMPI [3], [21]. Load balancing involves four distinct steps: (1) load evaluation; (2) load balancing initiation, which determines when to start a new load balancing step; (3) load balancing decision making; and (4) task and data migration. These steps are automatic and require minimal effort from the developers.

The CHARM++ load balancing framework adopts a unique measurement-based strategy for load evaluation. This scheme is based on run-time instrumentation, which is feasible due to the principle of persistence that can be found in most physical simulations: the communication patterns between objects, as well as the computational load of each of them, tend to persist over time, even in the case of dynamic applications. This implies that the recent past behavior of a system can be used as a good predictor of the near future. The load instrumentation is fully automatic at runtime. During the execution of a ParFUM application, the run-time measures the computation time of each object and records the communication pattern in a load "database" on each processor.

The runtime then assesses the load database periodically and determines if load imbalance occurs. The load imbalance can be computed as

$$\sigma = \frac{L_{max}}{L_{avg}} - 1, \tag{7}$$

where $L_{max}$ is the maximum load across all processors, and $L_{avg}$ is the average load over all processors. Note that even when load imbalance occurs ($\sigma > 0$), it may not be profitable to start a new load balancing step due to the overhead of load balancing itself. In practice, a load imbalance threshold can be chosen based on the heuristic that the gain of the load balancing ($L_{max} - L_{avg}$) should be at least greater than the estimated cost of load balancing ($C_{lb}$). That is,

$$\sigma > \frac{C_{lb}}{L_{avg}}. \tag{8}$$
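In code, the trigger test of Eqs. (7)-(8) amounts to a few lines. The sketch below assumes per-processor loads measured in seconds and an externally estimated balancing cost C_lb; it is an illustration of the heuristic, not CHARM++'s implementation.

```cpp
#include <vector>
#include <algorithm>
#include <numeric>

// Returns true when the expected gain of balancing exceeds its cost.
bool shouldBalance(const std::vector<double> &load, double costLb) {
    double lmax = *std::max_element(load.begin(), load.end());
    double lavg = std::accumulate(load.begin(), load.end(), 0.0) / load.size();
    double sigma = lmax / lavg - 1.0;   // Eq. (7): imbalance measure
    return sigma > costLb / lavg;       // Eq. (8): gain (lmax-lavg) > cost
}
```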

When load balancing is triggered, the load balancing decision module uses the load database to compute a new assignment of virtual processors to physical processors and informs the run-time to execute the migration decision.

A. Run-time Support for Thread Migration

In ParFUM applications, load balancing is achieved by migrating AMPI threads that host mesh partitions from overloaded processors to underloaded ones. When an AMPI thread migrates between processors, it must move all the associated data, including its stack and heap-allocated data. The CHARM++ runtime supports both fully automated thread migration and flexible user-controlled migration of data via additional helper functions.

In fully automatic mode, the AMPI run-time system automatically transfers a thread's stack and heap data, which are allocated by a special memory allocator called isomalloc [9] in a manner similar to that of PM2 [2]. It is portable on most platforms except those where the mmap system call is unavailable. Isomalloc allocates data with a globally unique virtual address, reserving the same virtual space on all processors. With this mechanism, isomalloced data can be moved to a new processor without changing its address. This provides a clean way to move a thread's stack and heap data to a new machine automatically. In this case, migration is transparent to the user code.

Alternatively, users can write their own helper functions to pack and unpack heap data for migrating an AMPI thread. This is useful when application developers wish to have more control in reducing the data volume by using application-specific knowledge and/or by packing only variables that are live at the time of migration. The PUP (Pack/UnPack) library [22] was written to simplify this process and reduce the amount of code the developers have to write. The developers only need to write a single PUP routine to traverse the data structure, and this routine is used for both packing and unpacking.
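For illustration, here is a minimal PUP routine for a hypothetical mesh-chunk class. The field names are ours, but the `p | field` idiom and the single routine serving sizing, packing, and unpacking are how the PUP library is used.

```cpp
#include "pup.h"      // Charm++ PUP library
#include "pup_stl.h"  // PUP support for STL containers
#include <vector>

class Chunk {
    int nNodes, nElems;
    std::vector<double> displacement;  // per-node state (illustrative)
    std::vector<int>    connectivity;  // per-element node indices
public:
    // One traversal routine: the PUP::er decides whether it is currently
    // sizing the buffer, packing into it, or unpacking from it.
    void pup(PUP::er &p) {
        p | nNodes;
        p | nElems;
        p | displacement;
        p | connectivity;
    }
};
```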

B. Load Balancing Strategies

In the step that makes the load balancing decision, the CHARM++ run-time assigns AMPI threads to physical processors so as to minimize the maximum load (makespan) on the processors. This is known as the makespan minimization problem, which has been shown to be an NP-hard optimization problem [23]. However, many combinatorial algorithms have been developed that find a reasonably good approximate solution. The CHARM++ load balancing framework provides a spectrum of simple to sophisticated heuristic-based load balancing algorithms, some of which are described in more detail below:

• Greedy Strategy: This simple strategy organizes all the objects in decreasing order of their computation times. The algorithm repeatedly selects the heaviest unassigned object and assigns it to the least loaded processor (see the sketch after this list). This algorithm may lead to a large number of migrations; however, it works effectively in most cases.

• Refinement Strategy: The refinement strategy is an algorithm which improves the load balance by incrementally adjusting the existing object distribution, especially on highly loaded processors. The computational cost of this algorithm is low because only a subset of processors is examined. Further, this algorithm results in only a few objects being migrated, which makes it suitable for fine-tuning the load balance.

• Metis-based Strategy: This strategy uses the METIS graph partitioning library [24] to partition the object-communication graph. The objective of this strategy is to find a reasonable load balance while minimizing the communication among processors.
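A minimal sketch of the greedy strategy follows. The heap-based bookkeeping is an implementation choice of ours, not necessarily CHARM++'s; it assumes object ids run from 0 to n-1.

```cpp
#include <queue>
#include <vector>
#include <algorithm>

struct Obj  { int id; double load; };   // measured computation time
struct Proc { int id; double load; };   // accumulated assigned load

// Heaviest object first, always onto the currently least-loaded processor.
std::vector<int> greedyAssign(std::vector<Obj> objs, int nProcs) {
    std::sort(objs.begin(), objs.end(),
              [](const Obj &a, const Obj &b) { return a.load > b.load; });
    auto lighter = [](const Proc &a, const Proc &b) { return a.load > b.load; };
    std::priority_queue<Proc, std::vector<Proc>, decltype(lighter)> pq(lighter);
    for (int p = 0; p < nProcs; ++p) pq.push({p, 0.0});

    std::vector<int> assign(objs.size());   // object id -> processor id
    for (const Obj &o : objs) {
        Proc p = pq.top(); pq.pop();        // least-loaded processor so far
        assign[o.id] = p.id;
        p.load += o.load;
        pq.push(p);
    }
    return assign;
}
```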

The CHARM++ load balancing framework also allows a developer to implement his own load balancing strategies based on heuristics specific to the target application (as is done in the NAMD [14] molecular simulation code).

Load balancing can be done in either a centralized or a distributed approach, depending on how the load balancing decisions are made. In the centralized approach, one central processor makes the decisions globally. The load databases of all processors are collected on the central processor, which may incur high communication overhead and memory usage on that processor. In the distributed approach, load balancing decisions are made in a distributed fashion: load databases are only exchanged among neighboring processors. Due to the lack of global information and the aging of load data, it is inherently difficult for distributed load balancing to achieve as good a balance as quickly as the centralized approach. In this paper, we use a centralized load balancing strategy.

C. Agile Load Balancing

Applications with rapidly changing load require frequent load balancing, which demands fast load balancing with minimal overhead. Normal load balancing strategies in CHARM++ occur in synchronous mode, as shown in Figure 4. At load balancing time, the application on each processor stops after it finishes its designated iterations and hands over control to the load balancing framework to make load balancing decisions. The application can only resume when the load balancing step finishes and all AMPI threads have migrated to their destination processors. In practice, this "stop and go" load balancing scheme is simple to implement and has one important advantage — the AMPI thread migration happens under user control, so that a user can choose a convenient time for the thread migration to minimize the migrated data size. However, this scheme is not efficient due to the effect of the global barrier. It suffers from high overhead because the load balancing process on the central processor has to wait for the slowest processor to join load balancing, thus wasting CPU cycles on the other processors. This motivated us to develop an agile load balancing strategy that performs asynchronous load balancing, allowing the overlap of load balancing time and normal computation.

Fig. 4. Traditional synchronous load balancing. [Timeline diagram: Processors 0-2 pause at a global barrier for the load balancing strategy between the objects of iteration i and iteration i+1.]

The asynchronous load balancing scheme takes full advantage of CHARM++'s intelligent run-time support for concurrent compositionality [13], which allows dynamic overlapping of the execution of different composed modules in time and space. In the asynchronous scheme, the load balancing process occurs concurrently, in the background of normal computation. When it is time for load balancing, each processor sends its load database to the central processor and continues its normal computation without waiting for load balancing to start. When a migration decision has been computed in the background on the central processor, the AMPI threads are instructed to migrate to their new processors in the middle of their computation.

There are a few advantages of asynchronous load balancing over the synchronous scheme. First, eliminating the global barrier helps reduce the idle time on faster processors, which otherwise would have to wait for the slower processors to join the load balancing step. Second, it allows the overlap of load balancing decision making and computation in an application, which potentially helps improve overall performance. Finally, each thread has more flexible control over when to migrate to the designated processor. For example, a thread can choose to migrate when it is about to be idle, which potentially allows overlapping of its migration with the computation of other threads.

Asynchronous load balancing, however, imposes a significant challenge on thread migration in the AMPI run-time system. AMPI threads may migrate at any time, whenever they receive the migration notification. In practice, it is not trivial for an AMPI thread to migrate at any time due to the complex run-time state involved, for example when a thread is suspended in the middle of pending receives. In order to support any-time migration of AMPI threads, we extended the AMPI run-time to be able to transfer the complete runtime state associated with the AMPI threads, including the pending receive requests and buffered messages for future receives. With the help of the isomalloc stack and heap, AMPI threads can be migrated to a new processor transparently at any time, without the concern that the stack becomes invalid when the threads are resumed on a different processor. For AMPI threads with pending receives, incoming messages are redirected automatically to the destination processors by the run-time system.

In the next two sections, we present two case studies of simulations to demonstrate the effectiveness of our load balancing strategies.

VI. CASE STUDY 1: ELASTO-PLASTIC WAVE PROPAGATION

The first application is the quasi-one-dimensional elasto-plastic wave propagation problem depicted in Figure 5. It consists of a rectangular bar of length L = 10 m and cross-section A = 1 m². The bar is initially at rest and stress free. It is fixed at one end and subjected at the other end to an applied velocity V ramped linearly from 0 to 20 m/s over 0.16 ms and held constant thereafter. The material properties are chosen as follows: yield stress σ_Y = 480 MPa, stiffnesses E = 73 GPa and E_t = 7.3 GPa, exponent n = 0.5, fluidity η = 10⁻⁶/s, Poisson's ratio ν = 0.33 and density ρ = 2800 kg/m³.

The applied velocity generates a one-dimensional stress wave that propagates through the bar and reflects from the fixed end. At every wave reflection, the stress level in the bar increases as the end of the bar is continuously pulled at a velocity V. During the initial stage of the dynamic event, the material response is elastic as the first stress wave travels through the bar at the dilatational wave speed C_d = 6215 m/s with an amplitude

$$\sigma = \rho\,C_d\,V = 348\ \mathrm{MPa} < \sigma_Y. \tag{9}$$
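As a quick unit check of Eq. (9) with the material parameters listed above:

$$\sigma = 2800\ \mathrm{kg/m^3} \times 6215\ \mathrm{m/s} \times 20\ \mathrm{m/s} \approx 3.48 \times 10^{8}\ \mathrm{Pa} = 348\ \mathrm{MPa} < \sigma_Y = 480\ \mathrm{MPa}.$$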

After one reflection of the wave from the fixed end, the stress level in the bar exceeds the yield stress of the material and the material becomes plastic. A snapshot of the location of the elasto-plastic stress wave is shown in Figure 5. The computational overload associated with the plastic update routine (approximately a factor of two increase compared to the elastic case) leads to a significant dynamic load imbalance while the bar transforms from elastic to plastic. In these simulations, the plastic check and update subroutine is called when the equivalent stress level exceeds 80% of the yield stress.

The unstructured 400,000-element tetrahedral mesh that spans the bar is initially partitioned into chunks using METIS, and these chunks are then mapped to the processors. During the simulation, the processors advance in lockstep with frequent synchronizing communication required by the exchange of boundary conditions, which may lead to bad performance when load imbalance occurs.

The simulation was run on the Tungsten Xeon Linux cluster at the National Center for Supercomputing Applications (NCSA). This cluster is based on Dell PowerEdge 1750 servers, each with two Intel Xeon 3.2 GHz processors, running Red Hat Linux and connected by a Myrinet network. The test ran on 32 processors with 160 AMPI virtual processors. Figure 7 shows the results without load balancing in a Projections CPU utilization graph over a certain time interval. The figure was generated by Projections [25], a performance visualization and analysis tool associated with CHARM++ that supplies application-level visual and analytical performance feedback. This utilization graph shows how the overall utilization changes as the wave propagates through the bar. The total run time was 177 seconds for this run.

Fig. 5. Location of the traveling elasto-plastic wave at time C_d t/L = 1.3.

Fig. 6. Evolution of the number of plastic elements. [Plot: number of plastic elements (left axis) and time step's average execution time in seconds (right axis) vs. time step number.]

A separate interest, although not investigated further in this paper, is the period of initial load imbalance (as shown in Figure 7) caused by the quiet generation of subnormal numbers (floating-point numbers that are very close to zero) during the initial propagation of the elastic wave along the initially quiescent bar. This phenomenon is discussed by Lawlor et al. [26], who propose an approach to mitigate such performance effects caused by the inherent processor design. This paper is only concerned with the load imbalance associated with the transformation of the bar from elastic to plastic.

As indicated earlier, the load imbalance in this problem is highly transient, as elements at the wave front change from an elastic to a plastic state. In Figure 6, the effects of the plasticity calculations are clearly noticeable in terms of execution time, which ramps linearly from the condition of fully elastic to fully plastic, resulting in a doubling of the execution time. This leads to a load imbalance between the processors while the simulation is within this linearly ramping region. The load balancer here has to migrate these objects aggressively.


Fig. 7. CPU utilization graph without load balancing (Tungsten Xeon). [Plot: CPU utilization (%) vs. time (s), with annotations marking the initial load imbalance and the transient load imbalance.]

Fig. 8. CPU utilization graph with synchronous load balancing (Tungsten Xeon).

Even though we used a variety of methods and time frames, the problem was not considerably sped up by load balancing: the transition was too fast for the load balancer to significantly speed up the simulation, and the period of imbalance is a very small portion of the total run time. A performance improvement here therefore requires that the overhead and delays associated with invoking the load balancer be minimal. Nevertheless, we managed to speed up the simulation by 7 seconds, as shown in Figure 8. The time required for completion drops to 170 seconds, a 4 percent overall improvement from load balancing.

We repeated the same test on 32 processors of the SGI Altix (IA64) at NCSA with the same 160 AMPI virtual processors. Figure 9 shows the result without load balancing in the Projections utilization graph. The total execution time was 207 seconds, and a more severe effect of subnormal numbers on this machine was observed in the first hundred seconds of execution time.

In the second run, we ran the same test with the greedy load balancing scheme described in the previous section. The result is shown in Figure 10 in the same utilization graph.

Fig. 9. CPU utilization graph without load balancing (SGI Altix).

The load balancing is invoked around time interval 130 in the figure. After the load balancing, the CPU utilization is slightly improved and the total execution time is now around 198 seconds.

Fig. 10. CPU utilization graph with synchronous load balancing (SGI Altix).

Finally, we ran the same test with the same greedy algorithm in an asynchronous load balancing scheme (Section V-C). The asynchronous scheme avoids stalling the application for load balancing and overlaps the computation with the load balancing and migration. The result is shown in Figure 11 in a utilization graph. It can be seen that, after load balancing, the overall CPU utilization was further improved, and the total execution time is 187 seconds, a 20-second improvement.

VII. CASE STUDY 2: DYNAMIC FRACTURE

The second application involves a single-edge-notched fracture specimen of width W = 5 m, height H = 5 m, thickness T = 1 m and initial crack length a₀ = 1 m, having a weakened plane starting at the crack tip and extending along the crack plane to the opposite edge of the specimen. The material properties used in this simulation are σ_Y = 900 MPa, E = 210 GPa, E_t = 2.4 GPa, n = 0.5, η = 10⁻⁶/s, ν = 0.3, and ρ = 7850 kg/m³. A velocity ramped linearly from 0 to 1 m/s over 2.0 ms and held constant thereafter is applied along the top and bottom surfaces of the specimen.

Fig. 11. CPU utilization graph with asynchronous load balancing (SGI Altix).

A single layer of six-node cohesive elements is placed along the weakened interface, with the failure properties described by a critical crack-opening displacement ∆nc = 0.8 mm and a cohesive failure strength σmax = 95 MPa. The mesh consists of 91,292 cohesive elements along the interface plane and 4,198,134 linear-strain tetrahedral elements. As the stress waves emanating from the top and bottom edges of the specimen reach the fracture plane, a region of high stress concentration is created around the initial crack tip. In that region, the equivalent stress exceeds the yield stress of the material, leading to the creation of a plastic zone. As the stress level continues to build up in the vicinity of the crack front, the cohesive tractions along the fracture plane start to exceed the cohesive failure strength of the weakened plane and a crack starts to propagate rapidly along the fracture plane, surrounded by a plastic zone and leaving behind a plastic wake, as illustrated in Figure 12.

Fig. 12. Snapshot of the plastic zone surrounding the propagating planar crack at time C_d t/a₀ = ???. The iso-surfaces denote the extent of the region where the elements have exceeded the yield stress of the material.

This simulation was run on the Turing cluster at the University of Illinois at Urbana-Champaign. The cluster consists of 640 dual Apple G5 nodes connected with a Myrinet network. The simulation without load balancing took about 24 hours on 100 processors. The average processor utilization is shown in the bottom curve of Figure 13, with processor utilization on the Y axis and time on the X axis. It can be seen that around time interval 120, the CPU utilization dropped from around 85% to only about 42%. This is due to elastic elements transitioning into plastic elements around the crack tip, marking the beginning of load imbalance. In Figure 14, the number of plastic elements starts to increase dramatically as the crack starts to propagate along the interface. As more elastic elements turn plastic, the CPU utilization slowly increases. The load imbalance can also easily be seen in the CPU utilization graph over processors in Figure 15: while some processors reach a CPU utilization as high as about 90%, others average only about 50% over the whole execution.

Fig. 13. CPU utilization graph over time without vs. with load balancing for the fracture problem shown in Figure 12 (Turing Apple Cluster).

Fig. 14. Evolution of the number of plastic and broken cohesive elements. [Plot: number of broken cohesive elements and number of plastic elements vs. time step number.]

Fig. 15. CPU utilization across processors without load balancing (Turing Apple Cluster).

Fig. 16. CPU utilization across processors with load balancing (Turing Apple Cluster).

With the greedy load balancing strategy invoked every 500 time steps, the simulation finished in 18 hours, a saving of nearly 6 hours or 25% over the same simulation with no load balancing. This improvement is caused by the overall increase in processor utilization, which can be seen in the upper curve of Figure 13. The peaks correspond to the times when the load balancer is activated during the simulation. There is an immediate improvement in utilization when the load balancer is invoked; the performance then slowly deteriorates as more elements become plastic, and the next invocation balances the load all over again. Figure 16 further illustrates the improvement in load balance over Figure 15 in terms of CPU utilization across processors. It can be seen that a CPU utilization of around 85% is achieved on all processors with negligible load variance.

VIII. RELATED WORK

The goal of our work is a generic load balancing framework that optimizes the load balance of irregular and highly dynamic applications with an application-independent interface; therefore, we focus our discussion in this section on dynamic load balancing systems for parallel applications. In particular, we wish to distinguish our research using the following criteria:

• Supporting data migration. Migrating data has advantages over migrating "heavy-weight" processes, which adds complexity to the runtime system.

• Generality. Load balancing methods are designed to be application independent; they can be used for a wide variety of applications.

• Automatic load estimation. The load balancing framework does not rely on the application developer to provide application load information.

• Communication-aware load balancing. The framework takes communication into account explicitly rather than implicitly (e.g., through domain-specific knowledge). Communication patterns, including multicast and communication volume, are directly recorded in the load balancing database for the load balancing algorithms.

• Adaptivity to the execution environment. Background load and non-migratable load are taken into account.

Table I shows a comparison of the CHARM++ load balancing framework to several other software systems that support dynamic load balancing. DRAMA [27] is designed specifically to support finite element applications. This specialization enables DRAMA to provide application-"independent" load balancing using its built-in cost functions for this category of applications. Zoltan [28], [29] does not make assumptions about applications' data and is designed to be a general-purpose load balancing library. However, it relies on application developers to provide a cost function and a communication graph. A recent system, PREMA [30], [31], supports a very similar idea of migratable objects; however, its load balancing method primarily focuses on task scheduling, as in non-iterative applications. The Chombo [32] package, developed by Lawrence Berkeley National Lab, provides a set of tools, including load balancing, for implementing finite difference methods for the solution of partial differential equations on block-structured adaptively refined rectangular grids. It requires users to provide the computational workload, as a real number, for each box (a box being a partition of the mesh). CHARM++ provides the most comprehensive features for load balancing. It is applicable to most scientific and engineering applications that exhibit persistent computation (even dynamic ones). CHARM++ load balancing is also capable of adapting to changes in background load [33].

TABLE I
SOFTWARE SYSTEMS THAT SUPPORT DYNAMIC LOAD BALANCING

System Name | Data Migration | Generality | Explicit Comm. | Automatic Cost Est. | Adaptive
DRAMA       | Yes            | No         | No             | Yes                 | No
Zoltan      | Yes            | Yes        | No             | No                  | No
PREMA       | Yes            | No         | No             | No                  | No
Chombo      | Yes            | No         | No             | No                  | No
CHARM++     | Yes            | Yes        | Yes            | Yes                 | Yes

IX. CONCLUSION

Dynamic and adaptive parallel load balancing is indispensable for handling load imbalance that may arise during a parallel simulation due to mesh adaptation, material nonlinearity, etc. This paper demonstrates the successful application of a measurement-based dynamic load balancing concept to a crack propagation problem that uses cohesive elements. There are myriad other problems where the same principle applies.

In the future, we plan to use the load balancing framework to enhance the performance of more complex ParFUM applications that were previously considered unsolvable in a reasonable amount of time. Example applications include adaptive insertion and activation of cohesive elements for dynamic fracture simulations, and mesh adaptation. We also plan to explore using the load balancing framework on very large parallel machines such as the 64K-processor Blue Gene/L.

ACKNOWLEDGEMENTS

This work was supported by the Center for the Simulation of Advanced Rockets under contract number B341494 by the U.S. Department of Energy.

REFERENCES

[1] Robert K. Brunner and Laxmikant V. Kale. Handling application-induced load imbalance using parallel objects. In Parallel and Distributed Computing for Symbolic and Irregular Applications, pages 167–181. World Scientific Publishing, 2000.

[2] Gabriel Antoniu, Luc Bouge, and Raymond Namyst. An efficient and transparent thread migration scheme in the PM2 runtime system. In Proc. 3rd Workshop on Runtime Systems for Parallel Programming (RTSPP), San Juan, Puerto Rico, Lecture Notes in Computer Science 1586, pages 496–510. Springer-Verlag, April 1999.

[3] Gengbin Zheng. Achieving High Performance on Extremely Large Parallel Machines: Performance Prediction and Load Balancing. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 2005.

[4] Amnon Barak, Shai Guday, and Richard G. Wheeler. The MOSIX Distributed Operating System. LNCS 672. Springer, 1993.

[5] X.-P. Xu and A. Needleman. Numerical simulation of fast crack growth in brittle solids. Journal of the Mechanics and Physics of Solids, 42:1397–1434, 1994.

[6] G. T. Camacho and M. Ortiz. Computational modeling of impact damage in brittle materials. Int. J. Solids Struct., 33:2899–2938, 1996.

[7] P. H. Geubelle and J. Baylor. Impact-induced delamination of composites: a 2D simulation. Composites B, 29:589–602, 1998.

[8] Chao Huang, Gengbin Zheng, Sameer Kumar, and Laxmikant V. Kale. Performance evaluation of Adaptive MPI. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2006, March 2006.

[9] Chao Huang, Orion Lawlor, and L. V. Kale. Adaptive MPI. In Proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing (LCPC 2003), LNCS 2958, pages 306–322, College Station, Texas, October 2003.

[10] L. V. Kale and Sanjeev Krishnan. Charm++: Parallel Programming with Message-Driven Objects. In Gregory V. Wilson and Paul Lu, editors, Parallel Programming using C++, pages 175–213. MIT Press, 1996.

[11] Laxmikant V. Kale. The virtualization model of parallel programming: Runtime optimizations and the state of art. In LACSI 2002, Albuquerque, October 2002.

[12] Orion Sky Lawlor and L. V. Kale. Supporting dynamic parallel object arrays. Concurrency and Computation: Practice and Experience, 15:371–393, 2003.

[13] Laxmikant V. Kale. Performance and productivity in parallel programming via processor virtualization. In Proc. of the First Intl. Workshop on Productivity and Performance in High-End Computing (at HPCA 10), Madrid, Spain, February 2004.

[14] James C. Phillips, Gengbin Zheng, Sameer Kumar, and Laxmikant V. Kale. NAMD: Biomolecular simulation on thousands of processors. In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, pages 1–18, Baltimore, MD, September 2002.

[15] Sameer Kumar, Chao Huang, Gheorghe Almasi, and Laxmikant V. Kale. Achieving strong scaling with NAMD on Blue Gene/L. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium 2006, April 2006.

[16] Ramkumar V. Vadali, Yan Shi, Sameer Kumar, L. V. Kale, Mark E. Tuckerman, and Glenn J. Martyna. Scalable fine-grained parallelization of plane-wave-based ab initio molecular dynamics for large supercomputers. Journal of Computational Chemistry, 25(16):2006–2022, Oct. 2004.

[17] Filippo Gioachin, Amit Sharma, Sayantan Chackravorty, Celso Mendes, Laxmikant V. Kale, and Thomas R. Quinn. Scalable cosmology simulations on parallel machines. In 7th International Meeting on High Performance Computing for Computational Science (VECPAR), July 2006.

[18] C.H. Koelbel, D.B. Loveman, R.S. Schreiber, G.L. Steele Jr., and M.E. Zosel. The High Performance Fortran Handbook. MIT Press, 1994.

[19] George Karypis and Vipin Kumar. Multilevel k-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed Computing, 48:96–129, 1998.

[20] George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput., 20(1):359–392, 1998.

[21] Milind Bhandarkar, L. V. Kale, Eric de Sturler, and Jay Hoeflinger. Object-Based Adaptive Load Balancing for MPI Programs. In Proceedings of the International Conference on Computational Science, San Francisco, CA, LNCS 2074, pages 108–117, May 2001.

[22] Rashmi Jyothi, Orion Sky Lawlor, and L. V. Kale. Debugging support for Charm++. In PADTAD Workshop for IPDPS 2004, page 294. IEEE Press, 2004.

[23] J. K. Lenstra, D. B. Shmoys, and E. Tardos. Approximation algorithms for scheduling unrelated parallel machines. Math. Program., 46(3):259–271, 1990.

[24] George Karypis and Vipin Kumar. Parallel multilevel k-way partitioning scheme for irregular graphs. In Supercomputing '96: Proceedings of the 1996 ACM/IEEE Conference on Supercomputing (CDROM), page 35, 1996.

[25] Laxmikant V. Kale, Gengbin Zheng, Chee Wai Lee, and Sameer Kumar. Scaling applications to massively parallel machines using Projections performance analysis tool. In Future Generation Computer Systems Special Issue on: Large-Scale System Performance Modeling and Analysis, volume 22, pages 347–358, February 2006.

[26] Orion Lawlor, Hari Govind, Isaac Dooley, Michael Breitenfeld, and Laxmikant Kale. Performance degradation in the presence of subnormal floating-point values. In Proceedings of the International Workshop on Operating System Interference in High Performance Applications, September 2005.

[27] A. Basermann, J. Clinckemaillie, T. Coupez, J. Fingberg, H. Digonnet, R. Ducloux, J.-M. Gratien, U. Hartmann, G. Lonsdale, B. Maerten, D. Roose, and C. Walshaw. Dynamic load balancing of finite element applications with the DRAMA library. In Applied Math. Modeling, volume 25, pages 83–98, 2000.

[28] K. Devine, B. Hendrickson, E. Boman, M. St. John, and C. Vaughan. Design of dynamic load-balancing tools for parallel applications. In Proc. Intl. Conf. Supercomputing, May 2000.

[29] Karen D. Devine, Erik G. Boman, Robert T. Heaphy, Bruce A. Hendrickson, James D. Teresco, Jamal Faik, Joseph E. Flaherty, and Luis G. Gervasio. New challenges in dynamic load balancing. Appl. Numer. Math., 52(2–3):133–152, 2005.

[30] Kevin Barker, Andrey Chernikov, Nikos Chrisochoides, and Keshav Pingali. A Load Balancing Framework for Adaptive and Asynchronous Applications. In IEEE Transactions on Parallel and Distributed Systems, volume 15, pages 183–192, 2003.

[31] Kevin J. Barker and Nikos P. Chrisochoides. An Evaluation of a Framework for the Dynamic Load Balancing of Highly Adaptive and Irregular Parallel Applications. In Proceedings of SC 2003, Phoenix, AZ, 2003.

[32] Chombo Software Package for AMR Applications. http://seesar.lbl.gov/anag/chombo/.

[33] Robert K. Brunner and Laxmikant V. Kale. Adapting to load on workstation clusters. In The Seventh Symposium on the Frontiers of Massively Parallel Computation, pages 106–112. IEEE Computer Society Press, February 1999.