
Hybrid Runtime Management of Space-Time Heterogeneity for Parallel Structured Adaptive Applications

Xiaolin Li, Member, IEEE, and Manish Parashar, Senior Member, IEEE

Abstract—Structured adaptive mesh refinement (SAMR) techniques provide an effective means for dynamically concentrating computational effort and resources on appropriate regions of the application domain. However, due to their dynamism and space-time heterogeneity, scalable parallel implementation of SAMR applications remains a challenge. This paper investigates hybrid runtime management strategies and presents an adaptive hierarchical multipartitioner (AHMP) framework. AHMP dynamically applies multiple partitioners to different regions of the domain, in a hierarchical manner, to match the local requirements of the regions. Key components of the AHMP framework include a segmentation-based clustering algorithm (SBC) that can efficiently identify regions in the domain with relatively homogeneous partitioning requirements, mechanisms for characterizing the partitioning requirements of these regions, and a runtime system for selecting, configuring, and applying the most appropriate partitioner to each region. Further, to address dynamic resource situations for long-running applications, AHMP provides a hybrid partitioning strategy (HPS) that involves application-level pipelining, trading space for time when resources are sufficiently large and underutilized, and an application-level out-of-core strategy (ALOC), trading time for space when resources are scarce in order to enhance the survivability of applications. The AHMP framework has been implemented and experimentally evaluated on up to 1,280 processors of the IBM SP4 cluster at the San Diego Supercomputer Center.

Index Terms—Parallel computing, structured adaptive mesh refinement, dynamic load balancing, hierarchical multipartitioner, high-performance computing.


1 INTRODUCTION

SIMULATIONS of complex physical phenomena, modeled by systems of partial differential equations (PDEs), are playing an increasingly important role in science and engineering. Furthermore, dynamically adaptive techniques, such as the dynamic structured adaptive mesh refinement (SAMR) technique [1], [2], are emerging as attractive formulations of these simulations. Compared to numerical techniques based on static uniform discretization, SAMR can yield highly advantageous cost/accuracy ratios by concentrating computational effort and resources on appropriate regions adaptively at runtime. SAMR is based on block-structured refinements overlaid on a structured coarse grid and provides an alternative to the general, unstructured AMR approach [3], [4]. SAMR techniques have been used to solve complex systems of PDEs that exhibit localized features in varied domains, including computational fluid dynamics, numerical relativity, combustion simulations, subsurface modeling, and oil reservoir simulation [5], [6], [7].

Large-scale parallel implementations of SAMR-based applications have the potential to accurately model complex physical phenomena and provide dramatic insights. However, while there have been some large-scale implementations [8], [9], [10], [11], [12], [13], these implementations are typically based on application-specific customizations, and general scalable implementations of SAMR applications continue to present significant challenges. This is primarily due to the dynamism and space-time heterogeneity exhibited by these applications. SAMR-based applications are inherently dynamic because the physical phenomena being modeled and the corresponding adaptive computational domain change as the simulation evolves. Further, adaptation naturally leads to a computational domain that is spatially heterogeneous, i.e., different regions in the computational domain and different levels of refinement have different computational and communication requirements. Finally, the SAMR algorithm periodically regrids the computational domain, causing regions of refinement to be created, deleted, or moved to match the physics being modeled, i.e., it exhibits temporal heterogeneity.

The dynamism and heterogeneity of SAMR applications have traditionally been addressed using dynamic partitioning and load-balancing algorithms, e.g., the mechanisms presented in [10] and [13], which partition and load-balance the SAMR domain when it changes. More recently, it was observed in [14] that, for parallel SAMR applications, the appropriate choice and configuration of the partitioning/load-balancing algorithm depends on the application, its runtime state, and its execution context. This led to the

1202 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 18, NO. 9, SEPTEMBER 2007

X. Li is with the Department of Computer Science, 219 MSCS, Oklahoma State University, Stillwater, OK 74078. E-mail: [email protected].

M. Parashar is with the Department of Electrical and Computer Engineering, Rutgers University, 94 Brett Road, Piscataway, NJ 08854. E-mail: [email protected].

Manuscript received 9 Aug. 2005; revised 7 Feb. 2006; accepted 28 July 2006; published online 9 Jan. 2007. Recommended for acceptance by T. Davis. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPDS-0363-0805. Digital Object Identifier no. 10.1109/TPDS.2007.1038.

1045-9219/07/$25.00 © 2007 IEEE Published by the IEEE Computer Society



development of metapartitioners [14], [15], which select and configure partitioners (from a pool of partitioners) at runtime to match the application's current requirements. However, due to the spatial heterogeneity of the SAMR domain, computation and communication requirements can vary significantly across the domain, and, as a result, using a single partitioner for the entire domain can lead to decompositions that are locally inefficient. This is especially true for large-scale simulations that run on over a thousand processors.

The objective of the research presented in this paper is to address this issue. Specifically, this paper builds on our earlier research on metapartitioning [14], adaptive hierarchical partitioning [16], and adaptive clustering [17], and investigates hybrid runtime management strategies and an adaptive hierarchical multipartitioner (AHMP) framework. AHMP dynamically applies multiple partitioners to different regions of the domain, in a hierarchical manner, to match local requirements. This paper first presents a segmentation-based clustering algorithm (SBC) that can efficiently identify regions in the domain (called clusters) at runtime that have relatively homogeneous requirements. The partitioning requirements of these cluster regions are determined, and the most appropriate partitioner from the set of available partitioners is selected, configured, and applied to each cluster.

Further, this paper also presents two strategies for coping with different resource situations in the case of long-running applications: 1) the hybrid partitioning strategy (HPS), which involves application-level pipelining, trading space for time when resources are sufficiently large and underutilized, and 2) the application-level out-of-core strategy (ALOC), which trades time for space when resources are scarce in order to improve the performance and enhance the survivability of applications. The AHMP framework and its components are implemented and experimentally evaluated using the RM3D application on up to 1,280 processors of the IBM SP4 cluster at the San Diego Supercomputer Center.

The rest of the paper is organized as follows: Section 2 presents an overview of the SAMR technique and describes the computation and communication behaviors of its parallel implementation. Section 3 reviews related work. Section 4 describes the AHMP framework and hybrid runtime management strategies for parallel SAMR applications. Section 5 presents an experimental evaluation of the framework using SAMR application kernels. Section 6 presents a conclusion.

2 PROBLEM DESCRIPTION

2.1 Structured Adaptive Mesh Refinement

Structured Adaptive Mesh Refinement (SAMR) formulations for adaptive solutions to PDE systems track regions in the computational domain with high solution errors that require additional resolution and dynamically overlay finer grids over these regions. SAMR methods start with a base coarse grid of minimum acceptable resolution that covers the entire computational domain. As the solution progresses, regions in the domain requiring additional resolution are tagged, and finer grids are overlaid on these tagged regions of the coarse grid. Refinement proceeds recursively, so that regions on the finer grid requiring more resolution are similarly tagged and even finer grids are overlaid on these regions. The resulting SAMR grid structure is a dynamic adaptive grid hierarchy, as shown in Fig. 1.
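The recursive overlay described above can be sketched as a simple tree of grid patches. This is an illustrative data structure, not the implementation used by the authors; the class, method names, and bounding boxes are invented for the example.

```python
# Illustrative sketch of a dynamic SAMR grid hierarchy: each grid patch may
# overlay finer grids on tagged subregions, and refinement recurses.

class Grid:
    def __init__(self, level, bounds):
        self.level = level          # refinement level (0 = coarsest)
        self.bounds = bounds        # bounding box of this patch
        self.children = []          # finer grids overlaid on tagged regions

    def refine(self, bounds):
        """Overlay a finer grid on a tagged region of this patch."""
        child = Grid(self.level + 1, bounds)
        self.children.append(child)
        return child

    def depth(self):
        """Number of refinement levels in the hierarchy rooted here."""
        return 1 + max((c.depth() for c in self.children), default=0)

base = Grid(0, ((0, 0), (64, 64)))       # base coarse grid covering the domain
fine = base.refine(((8, 8), (24, 24)))   # tagged region overlaid at level 1
fine.refine(((10, 10), (16, 16)))        # deeper refinement at level 2
print(base.depth())                      # → 3
```

Regridding would correspond to rebuilding the `children` lists as tagged regions appear, move, or disappear.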

2.2 Computation and Communication Requirements of Parallel SAMR

Parallel implementations of SAMR applications typically partition the adaptive grid hierarchy across available processors, and each processor operates on its local portions of this domain in parallel. The overall performance of parallel SAMR applications is thus limited by the ability to partition the underlying grid hierarchies at runtime to expose all inherent parallelism, minimize communication and synchronization overheads, and balance load.

Communication overheads of parallel SAMR applications primarily consist of four components:

1. interlevel communications, defined between component grids at different levels of the grid hierarchy, which consist of prolongations (coarse-to-fine transfer and interpolation) and restrictions (fine-to-coarse transfer and interpolation),

2. intralevel communications, required to update the grid elements along the boundaries of local portions of a distributed component grid, which consist of near-neighbor exchanges,

3. synchronization cost, incurred when the load is not balanced among processors, and

4. data migration cost, incurred between two successive regridding and remapping steps.

Partitioning schemes for SAMR grid hierarchies can be classified as patch-based, domain-based, and hybrid [14]. A patch, associated with one refinement level, is a rectangular region that is created by clustering adjacent computational grids/cells at that level [18]. In patch-based schemes, partitioning decisions are made independently for each patch at each level. Domain-based schemes partition the physical domain and result in partitions or subdomains that contain computational grids/cells at multiple refinement levels. Hybrid schemes generally follow two steps. The first step uses domain-based schemes to create partitions, which


Fig. 1. Adaptive grid hierarchy for 2D Berger-Oliger SAMR [1]. (a) shows a 2D physical domain with localized multiple refinement levels. (b) represents the refined domain as a grid hierarchy.



are mapped to a group of processors. The second step uses patch-based or combined schemes to further distribute the partition within its processor group. Domain-based partitioning schemes have been studied in [13], [14], [19]. Three hybrid schemes are presented in [13]. In general, pure patch-based schemes outperform domain-based schemes when balancing the workload is the only consideration. However, patch-based schemes incur considerable overhead for communications between refinement levels, e.g., prolongation and restriction operations. Typically, boundary information is much smaller than the information in the entire patch. Further, if the patch-based method completely ignores locality, it might cause a severe communication bottleneck between refinement levels. In contrast, domain-based schemes distribute subdomains that contain all refinement levels to processors and, hence, eliminate the interlevel communications. However, for applications with strongly localized and deeply refined regions, domain-based schemes are inadequate to balance workloads well. More detailed descriptions and comparisons of these partitioning schemes can be found in [13], [14], [18]. This paper mainly focuses on domain-based schemes and also presents a new hybrid scheme.

The timing diagram in Fig. 2 (note that the figure is not to scale) illustrates the operation of the SAMR algorithm using a three-level grid hierarchy. The figure depicts a typical parallel SAMR application using a domain-based partitioning scheme. For simplicity, the computation and communication behaviors of only two processors, P1 and P2, are shown. The communication overheads are illustrated in the magnified portion of the time line. This figure illustrates the exact computation and communication patterns for a parallel SAMR implementation. The timing diagram shows that there is one time step on the coarsest level (level 0) of the grid hierarchy, followed by two time steps on the first refinement level and four time steps on the second level, before the second time step on level 0 can start. Further, the computation and communication steps for each refinement level are interleaved. This behavior makes partitioning the dynamic SAMR grid hierarchy to both balance load and minimize communication overheads a significant challenge.
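The interleaved stepping pattern just described can be sketched as a recursion over refinement levels. This is a minimal illustration assuming a refinement factor of 2 between levels; the function and variable names are invented, and the real computation and communication steps are reduced to a counter.

```python
# Sketch of Berger-Oliger-style interleaved time stepping: each step at a
# level drives `refine_factor` substeps on the next finer level, so one
# coarse step yields 2 steps on level 1 and 4 on level 2.

def advance(level, max_level, refine_factor, step_counts):
    """Advance one step at `level`, then recurse into the finer level."""
    step_counts[level] += 1       # compute + intralevel exchange would go here
    if level < max_level:
        for _ in range(refine_factor):
            advance(level + 1, max_level, refine_factor, step_counts)
    # restriction (fine-to-coarse transfer) would happen here

counts = [0, 0, 0]
advance(0, max_level=2, refine_factor=2, step_counts=counts)
print(counts)  # → [1, 2, 4]
```

The `[1, 2, 4]` step counts match the pattern in the timing diagram: the second coarse step cannot start until all finer-level substeps complete.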

2.3 Spatial and Temporal Heterogeneity of SAMR Applications

The space-time heterogeneity of SAMR applications is illustrated in Fig. 3 using the 3D compressible turbulence simulation kernel solving the Richtmyer-Meshkov (RM3D) instability [5]. The figure shows a selection of snapshots of the RM3D adaptive grid hierarchy, as well as a plot of its load dynamics at different regrid steps. Since the adaptive grid hierarchy remains unchanged between two regrid steps, the workload dynamics and other features of SAMR applications are measured with respect to regrid steps. The workload in this figure represents the computational/storage requirement, which is computed based on the number of grid points in the grid hierarchy. Application


Fig. 2. Timing diagram for parallel SAMR.

Fig. 3. Spatial and temporal heterogeneity and load dynamics for a 3D Richtmyer-Meshkov simulation using SAMR.



variables are typically defined at these grid points and are updated at each iteration of the simulation; consequently, the computational/storage requirements are proportional to the number of grid points. The snapshots in this figure clearly demonstrate the dynamics and space-time heterogeneity of SAMR applications: different subregions in the computational domain have different computational and communication requirements, and regions of refinement are created, deleted, relocated, and grow/shrink at runtime.

3 RELATED WORK

Parallel SAMR implementations presented in [10], [12], [13] use dynamic partitioning and load-balancing algorithms. These approaches view the system as a flat pool of processors. They are based on a global knowledge of the state of the adaptive grid hierarchy and partition the grid hierarchy across the set of processors. Global synchronization and communication are required to maintain this global knowledge and can lead to significant overheads on large systems. Furthermore, these approaches do not exploit the hierarchical nature of the grid structure and the distribution of communications and synchronization in this structure.

Dynamic load balancing schemes proposed in [20] involve two phases: the moving-grid phase and the splitting-grid phase. The first phase is intended to move load from overloaded processors to underloaded processors. The second phase is triggered when direct grid movement cannot balance the load among processors. These schemes improve performance by focusing on load balance and are suited for coarse-grained load balancing that does not consider the locality of patches. Dynamic load balancing schemes have been further extended to support distributed SAMR applications [21] using two phases: global load balancing and local load balancing. The load balancing in these schemes does not explicitly address the spatial and temporal heterogeneity exhibited by SAMR applications. In the SAMRAI library [10], [18], after the construction of computational patches, patches are assigned to processors using a greedy bin-packing procedure. SAMRAI uses the Morton space-filling curve technique to maintain spatial locality for patch distribution. To further enhance the scalability of SAMR applications using the SAMRAI framework, Wissink et al. proposed a recursive binary box tree algorithm to improve communication schedule construction [18]. A simplified point-clustering algorithm based on the Berger-Rigoutsos algorithm [22] has also been presented. The reduction of runtime complexity using these algorithms substantially improves the scalability of parallel SAMR applications on up to 1,024 processors. As mentioned above, the SAMRAI framework uses patch-based partitioning schemes, which result in good load balance but might cause considerable interlevel communication and synchronization overheads. SAMR applications have been scaled on up to 6,420 processors using the FLASH and PARAMESH packages [8], [12]. This scalability is achieved using large domain dimensions, ranging from 128 × 128 × 2,560 to 1,024 × 1,024 × 20,480. Further, a Morton space-filling curve technique has been applied to maintain spatial locality in PARAMESH. However, the space-time heterogeneity issues are not addressed explicitly.

A recent paper [23] presents a hierarchical partitioning and dynamic load balancing scheme using the Zoltan toolkit [3]. The proposed scheme first uses the multilevel graph partitioner in ParMetis [11] for across-node partitioning in order to minimize communication across nodes. It then applies the recursive inertial bisection (RIB) method within each node. The approach was evaluated in the paper on small systems with eight processors. The characterization of SAMR applications presented in [14] is based on the entire physical domain. The research in this paper goes a step further by considering the characteristics of individual subregions. The concept of natural regions is presented in [24]. Two kinds of natural regions are defined: unrefined/homogeneous and refined/complex. The proposed framework then uses a bilevel domain-based partitioning scheme to partition the refined subregions. This approach is one of the first attempts to apply multiple partitioners concurrently to the SAMR domain. However, it restricts itself to applying only two partitioning schemes, one to the refined region and the other to the unrefined region.

4 HYBRID RUNTIME MANAGEMENT STRATEGIES

4.1 Adaptive Hierarchical Multipartitioner Framework

Fig. 4 shows the basic operation of the AHMP framework. A critical task is to dynamically and efficiently identify regions that have similar requirements, called clusters. A cluster is a region of connected component grids that has relatively homogeneous computation and communication requirements. The input to AHMP is the structure of the current grid hierarchy (an example is illustrated in Fig. 1), represented as a list of regions, which defines the runtime state of the SAMR application. AHMP operation consists of the following steps: First, a clustering algorithm is used to identify cluster hierarchies. Second, each cluster is characterized and its partitioning requirements are identified. Available resources are also partitioned into corresponding


Fig. 4. A flowchart for the AHMP framework.



resource groups based on the relative requirements of the clusters. A resource group is a set of possibly physically proximal processors that are assigned a coarse-grained partition of the computational workload. A resource group is dynamically constructed based on the load assignment and distribution. Third, these requirements are used to select and configure an appropriate partitioner for each cluster. The partitioner is selected from a partitioner repository using selection policies. Finally, each cluster is partitioned and mapped to processors in its corresponding resource group. The strategy is triggered locally when the application state changes. State changes are determined using the load-imbalance metric described below. Partitioning proceeds hierarchically and incrementally. The identification and isolation of clusters uses a segmentation-based clustering (SBC) scheme. Partitioning schemes in the partitioner repository include the Greedy Partitioning Algorithm (GPA), the Level-based Partitioning Algorithm (LPA), and others presented in Section 4.3. Partitioner selection policies consider cluster partitioning requirements, communication/computation requirements, scattered adaptation, and activity dynamics [14]. This paper specifically focuses on developing partitioning policies based on cluster requirements defined in terms of refinement homogeneity, which is defined in Section 5.1.

AHMP extends our previous work on the Hierarchical Partitioning Algorithm (HPA) [16], which hierarchically applies a single partitioner, reducing global communication overheads and enabling incremental repartitioning and rescheduling. AHMP additionally addresses spatial heterogeneity by applying the most appropriate partitioner to each cluster based on its characteristics and requirements. As a result, multiple partitioners may concurrently operate on different subregions of the computational domain.

The load imbalance factor (LIF) metric is used as the criterion for triggering repartitioning and rescheduling within a local resource group, and is defined as follows:

\[ \mathrm{LIF}_A = \frac{\max_{i=1}^{A_n} T_i - \min_{i=1}^{A_n} T_i}{\sum_{i=1}^{A_n} T_i / A_n}, \tag{1} \]

where A_n is the total number of processors in resource group A, and T_i is the estimated relative execution time between two consecutive regrid steps for processor i, which is proportional to its load. In the numerator of the right-hand side of the above equation, we use the difference between the maximum and minimum execution times to better reflect the impact of synchronization overheads. The local load imbalance threshold is θ_A. When LIF_A > θ_A, repartitioning is triggered inside the local group. Note that the imbalance factor can be recursively calculated for larger groups as well.
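Eq. (1) can be computed directly from the per-processor time estimates. The following is a minimal sketch; the T_i values and the threshold are hypothetical, chosen only to illustrate the trigger.

```python
# Load imbalance factor per Eq. (1):
# LIF = (max T_i - min T_i) / (sum T_i / n) for one resource group.

def load_imbalance_factor(times):
    """times: estimated execution times T_i for each processor in the group."""
    n = len(times)
    mean = sum(times) / n
    return (max(times) - min(times)) / mean

def needs_repartition(times, threshold):
    """Trigger repartitioning within the group when LIF exceeds the threshold."""
    return load_imbalance_factor(times) > threshold

times = [9.0, 10.0, 12.0, 13.0]   # hypothetical T_i values; mean is 11.0
print(round(load_imbalance_factor(times), 3))  # → 0.364, i.e., (13 - 9) / 11
print(needs_repartition(times, 0.2))           # → True
```

Using the max-min spread in the numerator, rather than a deviation from the mean, makes the metric sensitive to the slowest straggler, which is what determines synchronization delay.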

4.2 Clustering Algorithms for Cluster Region Identification

The objective of clustering is to identify well-structured subregions in the SAMR grid hierarchy, called clusters. As defined above, a cluster is a region of connected component grids with relatively homogeneous partitioning requirements. This section describes the segmentation-based clustering (SBC) algorithm, which is based on space-filling curves (SFCs) [25]. The algorithm is motivated by the locality-preserving property of SFCs and the localized nature of physical features in SAMR applications. Further, SFCs are widely used for domain-based partitioning of SAMR applications [13], [24], [25], [26]. Note that clusters are similar in concept to the natural regions proposed in [24]. However, unlike natural regions, clusters are not restricted to strict geometric shapes, but are more flexible and take advantage of the locality-preserving property of SFCs.
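The locality-preserving property that SBC relies on can be illustrated with the Morton (Z-order) curve, one common SFC. The helper below is the standard bit-interleaving construction, shown only as an illustration; it is not code from the AHMP framework, and the cell coordinates are invented.

```python
# Morton (Z-order) index for 2D cells: interleaving the bits of x and y
# gives an ordering in which spatially nearby cells tend to get nearby
# indices, which is why SFC ordering groups related subregions together.

def morton2d(x, y, bits=8):
    """Interleave the low `bits` bits of x and y into a Z-order index."""
    index = 0
    for i in range(bits):
        index |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        index |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return index

cells = [(2, 0), (1, 1), (0, 0), (0, 1), (1, 0)]
print(sorted(cells, key=lambda c: morton2d(*c)))
# The 2x2 block (0,0)..(1,1) occupies indices 0..3, ahead of (2,0) at index 4.
```

Sorting subregions by such an index produces the SFC-ordered list that SBC scans.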

Typical SAMR applications exhibit localized features and thus result in localized refinements. Moreover, the refinement levels and the resulting adaptive grid hierarchy reflect the application's runtime state. Therefore, clustering subregions with similar refinement levels is desirable.

The segmentation-based clustering algorithm is based on ideas from image segmentation [27]. The algorithm first defines the load density factor (LDF) as follows:

\[ \mathrm{LDF}(rlev) = \frac{\text{associated workload of patches with levels} \geq rlev \text{ on the subregion}}{\text{volume of the subregion at } rlev}, \]

where rlev denotes the refinement level and the volume is that of the subregion of interest in a 3D domain (it is replaced by area and length in the case of 2D and 1D domains, respectively).
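A minimal sketch of the LDF computation follows, under simplifying assumptions: patches are axis-aligned 3D boxes already clipped to the subregion of interest, and each patch's workload is taken as proportional to its number of grid points (here, its volume). All names and values are illustrative.

```python
# Sketch of LDF(rlev): workload of patches at refinement levels >= rlev,
# divided by the volume of the subregion at rlev. Patches are assumed to
# be pre-clipped to the subregion, so no box intersection is performed.

def box_volume(lo, hi):
    return (hi[0] - lo[0]) * (hi[1] - lo[1]) * (hi[2] - lo[2])

def load_density(patches, subregion, rlev):
    """patches: list of (lo, hi, level) boxes clipped to `subregion`."""
    lo, hi = subregion
    load = sum(box_volume(p_lo, p_hi)
               for p_lo, p_hi, level in patches if level >= rlev)
    return load / box_volume(lo, hi)

patches = [((0, 0, 0), (4, 4, 4), 1),    # hypothetical level-1 patch, 64 points
           ((0, 0, 0), (2, 2, 2), 2)]    # hypothetical level-2 patch, 8 points
print(load_density(patches, ((0, 0, 0), (8, 8, 8)), 1))  # → 0.140625 (72/512)
```

A real implementation would weight each patch by its actual grid-point count at its refinement level rather than by raw volume.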

The SBC algorithm is illustrated in Fig. 5. SBC aims to cluster subregions with similar load density together to form cluster regions. The algorithm first smooths out subregions that are smaller than a predefined threshold, referred to as the template size. The template size is determined by the stencil size of the finite difference method and the granularity constraint, maintaining appropriate computation/communication ratios to maximize performance and minimize communication overheads. A subregion is defined by a bounding box with lower-bound and upper-bound coordinates and the strides/steps along each dimension. The subregion list input to the SBC algorithm is created by applying the SFC indexing mechanism to the entire domain, which consists of patches at different refinement levels. SBC follows the SFC index to extract subregions (defined by rectangular bounding boxes) from the subregion list until the

1206 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 18, NO. 9, SEPTEMBER 2007

Fig. 5. Segmentation-based clustering algorithm.



size of the accumulated subregion set is greater than the template size. It then calculates the load density for this set of subregions and computes a histogram of its load density. SBC continues to scan through the entire subregion list and repeats the above process, calculating the load density and computing histograms. Based on the histogram of the load density obtained, it then finds a clustering threshold T. A simplified intermeans thresholding algorithm by iterative selection [27], [28] is used, as shown below.

The goal of the thresholding algorithm is to partition the SFC-indexed subregion list into two classes, C0 and C1 (which may not necessarily be two clusters, as shown in Fig. 7), using an "optimal" threshold T with respect to LDF, so that the LDF of every subregion in C0 is $\leq T$ and the LDF of every subregion in C1 is $> T$. Let $\mu_0$ and $\mu_1$ be the mean LDF of C0 and C1, respectively. Initially, a threshold T is selected, for example, the mean of the entire list as a starting point. Then, for the two classes generated based on T, the class means $\mu_0'$ and $\mu_1'$ are calculated, and a new threshold is computed as $T' = (\mu_0' + \mu_1')/2$. This process is repeated until the value of T converges.
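The iterative-selection procedure above can be sketched as follows. This is a generic intermeans implementation consistent with the description, not the authors' code; the convergence tolerance `tol` is an assumed parameter:

```python
def intermeans_threshold(ldf_values, tol=1e-6):
    """Intermeans thresholding by iterative selection over LDF values.

    Start from the overall mean; split values into C0 (<= T) and C1 (> T),
    move T to the midpoint of the two class means, and repeat until T
    converges.
    """
    t = sum(ldf_values) / len(ldf_values)        # initial guess: global mean
    while True:
        c0 = [v for v in ldf_values if v <= t]
        c1 = [v for v in ldf_values if v > t]
        if not c0 or not c1:                     # degenerate split; keep T
            return t
        t_new = (sum(c0) / len(c0) + sum(c1) / len(c1)) / 2
        if abs(t_new - t) < tol:
            return t_new
        t = t_new

# LDF values like the Fig. 7 example (densities of 1 and 9):
print(intermeans_threshold([1, 1, 1, 9, 9]))  # 5.0, i.e., between 1 and 9
```

For a bimodal density distribution such as the one in Fig. 7, the iteration converges in one or two steps to a threshold between the two modes.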

Using the threshold obtained, subregions are further partitioned into several cluster regions. As a result, a hierarchical structure of cluster regions is created by recursively calling the SBC algorithm for finer refinement levels. The maximum number of clusters created can be adjusted to the number of processors available. Note that this algorithm has similarities to the point clustering algorithms proposed by Berger and Rigoutsos in [22]. However, the SBC scheme differs from that scheme in two respects. Unlike the Berger-Rigoutsos scheme, which creates fine-grained clusters, the SBC scheme targets coarser granularity clusters. SBC also takes advantage of the locality-preserving properties of SFCs to potentially reduce data movement costs between consecutive repartitioning phases.

The SBC algorithm is illustrated using a simple 2D example in Fig. 6, in which SBC results in three clusters, shaded in the figure. Fig. 7 shows the load density distribution and histogram for an SFC-indexed subdomain list. For this example, the SBC algorithm creates three clusters, defined by the regions separated by the vertical lines in the figure on the left. The template size in this example is two boxes on the base level. The right figure shows a histogram of the load density. For efficiency and simplicity, this histogram is used to identify the appropriate threshold. For this example, the threshold is identified as lying between 1 and 9, using the intermeans thresholding algorithm. While there are many more sophisticated approaches for identifying good thresholds for segmentation and edge detection in image processing [27], [29], this approach is sufficient for our purpose. Note that a predefined minimum size for a cluster region is assumed. In this example, the subregion with index 14 in Fig. 6 does not form a cluster, as its size is less than the template size. It is instead clustered with another subregion in its proximity.
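The scan-accumulate-cut loop of SBC can be sketched as below. This is a simplified one-level sketch under stated assumptions: subregions are hypothetical `(size, workload)` pairs already in SFC order, and the threshold is taken as given rather than derived from the histogram:

```python
def sbc_clusters(subregions, template_size, threshold):
    """Group SFC-ordered subregions into cluster regions (simplified SBC).

    Accumulate subregions along the SFC until the template size is reached,
    classify the accumulated set by its load density, and merge consecutive
    sets on the same side of the threshold into one cluster region.
    Returns a list of (is_dense, subregion_indices) pairs.
    """
    clusters = []
    batch, batch_size, batch_load = [], 0, 0.0
    for i, (size, load) in enumerate(subregions):
        batch.append(i)
        batch_size += size
        batch_load += load
        if batch_size >= template_size:
            dense = (batch_load / batch_size) > threshold
            if clusters and clusters[-1][0] == dense:
                clusters[-1][1].extend(batch)    # extend the open cluster
            else:
                clusters.append((dense, batch))  # density changed: new cluster
            batch, batch_size, batch_load = [], 0, 0.0
    if batch:                                    # leftover smaller than the
        if clusters:                             # template size joins its
            clusters[-1][1].extend(batch)        # neighboring cluster
        else:
            clusters.append((False, batch))
    return clusters

# Six unit-size subregions in SFC order, densities 1,1,9,9,1,1 (cf. Fig. 7):
clusters = sbc_clusters([(1, 1), (1, 1), (1, 9), (1, 9), (1, 1), (1, 1)],
                        template_size=2, threshold=5.0)
print(len(clusters))  # 3 cluster regions, as in the example above
```

The leftover-handling branch mirrors the treatment of subregion 14 in Fig. 6: a set smaller than the template size is absorbed into a neighboring cluster rather than forming its own.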

4.3 Partitioning Schemes and Partitioner Selection

For completeness, several partitioning algorithms within the GrACE package [30] are briefly described. The greedy partitioning algorithm (GPA) [13] is the default partitioning algorithm in GrACE. First, GPA partitions the entire domain into subdomains such that each subdomain keeps all refinement levels as a single composite grid unit. Thus, all interlevel communications are local to a subdomain and the interlevel communication time is eliminated. Second, GPA rapidly scans this list only once, attempting to distribute load equally among all processors. It helps reduce partitioning costs and works quite well for relatively homogeneous computational domains.

For applications with localized features and deep grid hierarchies, GPA can result in load imbalances and, hence, lead to synchronization overheads at higher levels of refinement. To overcome this problem, the level-based partitioning algorithm (LPA) [16] attempts to simultaneously balance load and minimize synchronization cost. LPA essentially aims to balance the workload at each refinement level among all processors, in addition to balancing the

LI AND PARASHAR: HYBRID RUNTIME MANAGEMENT OF SPACE-TIME HETEROGENEITY FOR PARALLEL STRUCTURED ADAPTIVE... 1207

Fig. 6. Clustering results for the SBC scheme.

Fig. 7. Load density distribution and histogram for SBC.



overall load. To further improve runtime performance, the hierarchical partitioning algorithm (HPA) enables the load distribution to reflect the state of the adaptive grid hierarchy and exploits it to reduce synchronization requirements, improve load balance, and enable concurrent communications and incremental redistribution [31]. HPA partitions the computational domain into subdomains and assigns them to hierarchical processor groups. Other partitioners in the partitioner repository include the bin-packing partitioner (BPA), the geometric multilevel + sequence partitioner (G-MISP+SP), and the p-way binary dissection partitioner (pBD-ISP) [14], [16].

The characterization of clusters is based on their computation and communication requirements, runtime states, and the refinement homogeneity defined in Section 5.1. An octant approach is proposed in [14] to classify the runtime states of a SAMR application with respect to 1) the adaptation pattern (scattered or localized), 2) whether runtime is dominated by computations or communications, and 3) the activity dynamics in the solution. A metapartitioner is then proposed to enable the selection/mapping of partitioners according to the current state of an application in the octant. The mapping is based on an experimental characterization of partitioning techniques and application states using five partitioning schemes and seven applications. The evaluation of partitioner quality is based on a five-component metric comprising load imbalance, communication requirements, data migration, partitioning-introduced overheads, and partitioning time. In addition to these characterization and selection policies, we also consider refinement homogeneity. The overall goal of these new policies is to obtain better load balance for less refined clusters, and to reduce communication and synchronization costs for highly refined clusters. For example, the policy dictates that the GPA and G-MISP+SP partitioning algorithms be used for clusters with refinement homogeneity greater than some threshold, and that the LPA and pBD-ISP partitioning algorithms be used for clusters with refinement homogeneity below the threshold.

4.4 Hybrid Partitioning Strategy (HPS)

The parallelism that can be exploited by domain-based partitioning schemes is typically limited by granularity constraints. In cases where a very narrow region has deep refinements, domain-based partitioning schemes will inevitably result in significant load imbalances, especially when using a large-scale system. For example, assume that the predefined minimum dimension of a 3D block/grid on the base level is four grid points. In this case, the minimum workload (minimum partition granularity $\gamma$) of a composite grid unit with three refinement levels is $4^3 + 2 \times 8^3 + 2^2 \times 16^3 = 17{,}472$, i.e., the granularity constraint is $\gamma \geq 17{,}472$ units. Such a composite block can result in significant load imbalance if domain-based partitioning is used exclusively. To expose more parallelism in these cases, a patch-based partitioning approach must be used. HPS combines domain-based and patch-based schemes. To reduce communication overheads, the application of HPS is restricted to a cluster region that is allocated to a single resource group. Further, HPS is only applied when certain conditions are met: 1) resources are sufficient, 2) resources are underutilized, and 3) the gain from using HPS outweighs the extra communication cost incurred. For simplicity, the communication and computation process of HPS is illustrated using three processors in a resource group in Fig. 8. The cluster has two refinement levels in this example.
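The granularity figure in the example above can be verified by a short calculation. The assumptions are those stated in the text: a refinement factor of 2, a minimum base-level block of four grid points per dimension in 3D, and a level-$l$ patch being updated $2^l$ times per base-level time step:

```python
# Minimum workload of a composite grid unit with three refinement levels.
# Each finer level doubles the resolution in every dimension and is
# updated twice as often as the level below it.
base = 4 ** 3               # level-0 block: 4x4x4 = 64 cell updates
level1 = 2 * 8 ** 3         # 8x8x8 patch, updated twice per base step
level2 = 2 ** 2 * 16 ** 3   # 16x16x16 patch, updated four times per base step
print(base + level1 + level2)  # 17472
```

Since this entire composite unit must stay on one processor under purely domain-based partitioning, no partition can be smaller than 17,472 load units, which is what motivates the patch-based component of HPS.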

HPS has two variants: one is pure hybrid partitioning with pipelining (pure HPS) and the other is hybrid partitioning with redundant computation and pipelining (HPS with redundancy). HPS splits the smallest domain-based partitions into patches at different refinement levels, partitions the finer patches into n partitions, and assigns each partition to a different processor in the group of n processors. When this process extends to many refinement levels, it is analogous to pipelining, where the operation at each refinement level represents a pipeline stage. Since the smallest load unit on the base level (level 0) cannot be further partitioned, the pure HPS scheme maps the level-0 patch to a single processor, while the HPS-with-redundancy scheme redundantly computes the level-0 patch on all processors in the group. Although pure HPS saves redundant computation, it needs interlevel communication from the level-0 patch to the other patches, which can be expensive. In contrast, HPS with redundancy trades computing resources for lower interlevel communication overhead. To avoid significant overhead, HPS schemes are applied only within a small resource group, e.g., an SMP node with eight processors.

To specify the criteria for choosing HPS, the resource sufficiency factor (RSF) is defined by

$$\mathrm{RSF} = \frac{N_{rg}}{L_q / L_\gamma}, \qquad (2)$$

where $L_q$ denotes the total load of a cluster region, $N_{rg}$ denotes the total number of processors in a resource group, and $L_\gamma$, the granularity, denotes the load on the smallest base-level subregion with the maximum refinement level. In the case of deep refinements, $L_\gamma$ can be quite large. When RSF exceeds a given threshold and resources are underutilized, HPS can be applied to exploit additional parallelism. The threshold is determined statistically through sensitivity analysis over a set of applications.
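The RSF trigger can be sketched as below, assuming the reconstructed form $\mathrm{RSF} = N_{rg}/(L_q/L_\gamma)$, i.e., processors per granularity-sized work unit; the threshold value used in the example is illustrative, not the one selected in the paper:

```python
def should_apply_hps(n_rg, cluster_load, granularity, threshold):
    """Decide whether to trigger HPS for a resource group (a sketch).

    RSF compares the number of processors in the group (n_rg) against the
    number of indivisible, granularity-sized work units in the cluster
    (cluster_load / granularity).  When RSF exceeds the threshold, purely
    domain-based partitioning cannot keep all processors busy, so HPS is
    applied to expose additional, patch-level parallelism.
    """
    rsf = n_rg / (cluster_load / granularity)
    return rsf > threshold

# 8 processors, a cluster of two granularity-sized composite units:
# RSF = 8 / 2 = 4, so with an illustrative threshold of 1, HPS triggers.
print(should_apply_hps(8, 2 * 17472, 17472, threshold=1.0))  # True
```

Conversely, a group with fewer processors than work units (RSF below the threshold) stays with domain-based partitioning, since each processor already has at least one composite unit to compute.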

The basic operations of HPS consist of pairing two refinement levels, duplicating the computation on the coarser patch, and partitioning the finer patch across the resource group. The operation of HPS is as follows: 1) AHMP generates a set of clusters and assigns each cluster to a resource group. 2) Within each resource group, the algorithm checks whether or not to trigger HPS; if the criteria defined above are met, AHMP selects HPS for the cluster and the corresponding resource group. 3) HPS splits the cluster into patches at different refinement levels, assigns the patches to individual processors within the resource group, and coordinates the communication and computation as illustrated in Fig. 8. Note that HPS can be applied recursively to patches of deeper refinement hierarchies.

4.5 Application-Level Out-of-Core (ALOC) Strategy

When available physical memory is not sufficient for the application, one option is to rely on the virtual memory mechanism of the operating system (OS). The OS handles page faults and replaces less frequently used pages with the required pages from disk. The OS, however, has little knowledge of the characteristics of an application and its memory access pattern. Consequently, it performs many unnecessary swap-in and swap-out operations, which are very expensive. Data rates from disks are approximately two orders of magnitude lower than those from memory




[32]. In many systems, the OS sets a default maximum limit on physical and virtual memory allocation. When an application uses up this quota, it cannot proceed and crashes. Experiments show that system performance degrades during excessive memory allocation due to high page fault rates causing memory thrashing [33], [34], [35]. As a result, the amount of allocated memory and the memory usage pattern play a critical role in overall system performance.

To address these issues, an application-level out-of-core (ALOC) scheme is designed that exploits the application's memory access patterns and explicitly keeps the working set of application patches in memory while swapping out other patches.

The ALOC mechanism proactively manages application-level pages, i.e., the computational domain patches. It attempts not only to improve performance, but also to enhance survivability when available memory is insufficient. For instance, as shown in Fig. 3, the RM3D application requires four times more memory at its peak than on average, while the peak lasts for less than 10 percent of the total execution time.

As illustrated in Fig. 9, the ALOC scheme incrementally partitions the local grid hierarchy into temporal virtual computational units (T-VCUs) according to refinement levels and runtime iterations. In the figure, the notation $T\text{-}VCU^{a}_{b,c}$ denotes the temporal VCU, where $a$ denotes the time step at the base level, $b$ denotes the current refinement level, and $c$ is the time step at the current level. To avoid undesired page swapping, ALOC releases the memory allocated by lower-level patches and explicitly swaps them out to disk. The ALOC mechanism is triggered when the ratio between the amount of memory allocated and the size of the physical memory rises above a predefined threshold, or when the number of page faults rises above a threshold. The appropriate thresholds are selected based on experiments. Specifically, each process periodically monitors its memory usage and the page faults incurred. Memory usage information is obtained by tracking memory allocations and deallocations, while page fault information is obtained using the getrusage() system call.
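The two trigger conditions can be sketched as follows. The threshold values and the `allocated_bytes` bookkeeping are assumptions for illustration; only the use of getrusage() for page-fault counts comes from the text (shown here via Python's `resource` module on Unix):

```python
import resource


def memory_pressure(allocated_bytes, physical_bytes,
                    mem_ratio_threshold, fault_threshold, last_faults):
    """Check the two ALOC trigger conditions (a sketch, thresholds assumed).

    Returns (trigger, current_major_faults).  Major page faults are read
    via getrusage(); allocated_bytes would come from the application's own
    tracking of allocations and deallocations.
    """
    usage = resource.getrusage(resource.RUSAGE_SELF)
    faults = usage.ru_majflt                       # major faults so far
    over_memory = allocated_bytes / physical_bytes > mem_ratio_threshold
    faulting = (faults - last_faults) > fault_threshold
    return over_memory or faulting, faults

# 9 GB allocated on a 10 GB node with a 0.8 ratio threshold: ALOC triggers.
trigger, faults = memory_pressure(9 * 2**30, 10 * 2**30, 0.8, 1000, 0)
print(trigger)  # True
```

In the actual scheme, a positive trigger would cause the process to release lower-level patches and write them to disk explicitly, rather than letting the OS pager choose victims.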

5 EXPERIMENTAL EVALUATION

5.1 Evaluating the Effectiveness of the SBC Clustering Algorithm

To aid the evaluation of the effectiveness of the SBC clustering scheme, a clustering quality metric is defined. The metric consists of two components: 1) the static quality and 2) the dynamic quality of the cluster regions generated. The static quality of a cluster is measured in terms of its refinement homogeneity and the efficiency of the clustering algorithm. The dynamic quality of a cluster is measured in terms of its communication costs (intralevel, interlevel, and data migration). These criteria are defined as follows:

1. Refinement Homogeneity. This measures the quality of the structure of a cluster. Let $|R^{total}_i(l)|$ denote the total workload of a subregion or a cluster at refinement level $l$, which is composed of $|R^{ref}_i(l)|$, the workload of refined regions, and $|R^{unref}_i(l)|$, the workload of unrefined regions at refinement level $l$. Refinement homogeneity is recursively defined between two refinement levels as follows:

$$H_i(l) = \frac{|R^{ref}_i(l)|}{|R^{total}_i(l)|}, \qquad (3)$$

$$H_{all}(l) = \frac{1}{n} \sum_{i=1}^{n} H_i(l), \quad \text{if } |R^{ref}_i(l)| \neq 0, \qquad (4)$$

where $n$ is the total number of subregions that have refinement level $l+1$. A goal of AHMP is to maximize the refinement homogeneity of a cluster, as partitioners work well on relatively homogeneous regions.

2. Communication Cost. This measures the communication overheads of a cluster and includes interlevel communication, intralevel communication, synchronization cost, and data migration cost, as described in Section 2.2. A goal of AHMP is to minimize the communication overheads of a cluster.

Fig. 8. Hybrid partitioning strategy. (a) Domain-based strategy when resource is excessive. (b) HPS. (c) HPS with redundancy.

3. Clustering Cost. This measures the cost of the clustering algorithm itself. As mentioned above, SAMR applications require regular repartitioning and rebalancing, and as a result, the clustering cost becomes important. A goal of AHMP is to minimize the clustering cost.
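The refinement homogeneity metric of (3)-(4) can be sketched directly. The dictionary-based input format is an assumption made for the example; per (4), only subregions with nonzero refined workload at level $l$ (i.e., those containing level-$(l+1)$ refinements) enter the average:

```python
def refinement_homogeneity(subregions, level):
    """Compute H_all(l) from per-subregion H_i(l), per (3)-(4) (a sketch).

    `subregions` is a hypothetical list of dicts mapping each level l to
    the refined and total workloads |R_ref(l)| and |R_total(l)| of that
    subregion.  Subregions with no refined workload at `level` are skipped.
    """
    h = [s["refined"][level] / s["total"][level]
         for s in subregions if s["refined"][level] != 0]
    return sum(h) / len(h) if h else 0.0

regions = [{"refined": {0: 30}, "total": {0: 100}},
           {"refined": {0: 0},  "total": {0: 100}},   # unrefined: excluded
           {"refined": {0: 90}, "total": {0: 100}}]
print(refinement_homogeneity(regions, 0))  # (0.3 + 0.9) / 2 = 0.6
```

A value near 1 means almost all of the cluster's workload at that level is refined further, which is the regime in which the lower-level partitioners perform best.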

Partitioning algorithms typically work well on highly homogeneous grid structures and can generate scalable partitions with the desired load balance. Hence, it is important to have a quantitative measure of this homogeneity. Intuitively, the refinement homogeneity metric attempts to isolate refined clusters that are potentially heterogeneous and difficult to partition. In contrast, unrefined or completely refined clusters are homogeneous at that refinement level.

The effectiveness of SBC-based clustering is evaluated using the metrics defined above. The evaluation compares the refinement homogeneity of six SAMR application kernels with and without clustering. These application kernels span multiple domains, including computational fluid dynamics (compressible turbulence: RM2D and RM3D; supersonic flows: ENO2D), oil reservoir simulations (oil-water flow: BL2D and BL3D), and the transport equation (TP2D). The applications are summarized in Table 1.

Fig. 10 shows the refinement homogeneity for the TP2D application with four refinement levels without any clustering. The refinement homogeneity is smooth for level 0 and very dynamic and irregular for levels 1, 2, and 3.


Fig. 9. Application-level out-of-core strategy.

TABLE 1. SAMR Application Kernels



The average refinement homogeneity of the six SAMR applications without clustering is presented in Table 2. The refinement homogeneity is calculated for the entire domain and averaged over 100 regridding steps. The table shows that the refinement homogeneity $H(l)$ increases as the refinement level $l$ increases. This observation reflects the physical properties of SAMR applications well, i.e., refined regions tend to be further refined. Moreover, these applications typically exhibit intensive activity in narrow regions. Typical ranges of $H(l)$ are: $H(0) \in [0.02, 0.22]$, $H(1) \in [0.26, 0.68]$, $H(2) \in [0.59, 0.83]$, and $H(3) \in [0.66, 0.9]$. Several outliers require some explanation. In the case of the BL2D application, the average $H(2) = 0.4$. However, the individual values of $H(2)$ are in the range $[0.6, 0.9]$, with many scattered zeros. Since the refinement homogeneity on level 3 and above is typically over 0.6, and refined subregions at higher refinement levels tend to be more scattered, the clustering scheme focuses its efforts on clustering levels 0, 1, and 2. Furthermore, based on these statistics, we select the thresholds for switching between different lower-level partitioners as 0.4, 0.6, and 0.8 for refinement levels 0, 1, and 2, respectively.

Fig. 11 and Table 3 demonstrate the improvements in refinement homogeneity obtained using the SBC algorithm. Fig. 11 shows the effects of using SBC on level 0 for the Transport2D application. The original homogeneity $H(0)$ is in the range $[0, 0.15]$, while the improved homogeneity using SBC is in the range $[0.5, 0.8]$.

The effects of clustering using SBC for the six SAMR applications are presented in Table 3. In the table, the gain is defined as the ratio of the improved homogeneity to the original homogeneity at each level. The gains for TP2D, ENO2D, BL3D, and BL2D on level 0 are quite large. The gains for the RM3D and RM2D applications are smaller because these applications already exhibit high refinement homogeneity starting from level 0, as shown in Table 2.

These results demonstrate the effectiveness of the clustering scheme. Moreover, clustering increases the effectiveness of partitioners and improves overall performance, as shown in the next section.

Communication Costs. The evaluation of communication cost uses a trace-driven simulation. The simulations are conducted as follows: First, a trace of the refinement behavior of the application at each regrid step is obtained by running the application on a single processor. Second, this trace is fed into the partitioners to produce a new partitioned trace for multiple processors. Third, the partitioned trace is fed into the SAMR simulator, developed at Rutgers University [37], to obtain runtime performance measurements on multiple processors. Fig. 12 shows the total communication cost for the RM3D application on 64 processors for the GPA and AHMP (using SBC) schemes. The figure shows that the overall communication cost is lower for SBC+AHMP. However, in the interval between regrid steps 60 and 100, SBC+AHMP exhibits higher communication costs. This is because the application is highly dynamic, with scattered refinements, during this period. The snapshot at regrid


Fig. 10. Refinement homogeneity for the Transport2D application kernel (four levels of refinement).

TABLE 2. Average Refinement Homogeneity $H(l)$ for Six SAMR Applications

Fig. 11. Homogeneity improvements using SBC for TP2D.

TABLE 3. Homogeneity Improvements Using SBC for Six SAMR Applications

Fig. 12. Maximum total communication for RM3D on 64 processors.



step 96 in Fig. 3 illustrates the scattered refinements. This in turn causes significant cluster movement during reclustering. Note that the existing simulator does not measure synchronization costs and thus does not reflect the full performance benefits that can be achieved using AHMP. The actual performance gains due to AHMP are seen in Fig. 15, based on actual runs.

Clustering Costs. The cost of the SBC clustering algorithm is experimentally evaluated using the six different SAMR application kernels on a Beowulf cluster (Frea) at Rutgers University. The cluster consists of 64 processors, each with a 1.7 GHz Pentium IV CPU. The costs are plotted in Fig. 13. As seen in this figure, the overall clustering time on average is less than 0.01 seconds. Note that the computational time between successive repartitioning/rescheduling phases is typically on the order of tens of seconds and, as a result, the clustering costs are not significant.

5.2 Performance Evaluation

This section presents the experimental evaluation of AHMP. The overall performance benefit is evaluated on DataStar, the IBM SP4 cluster at the San Diego Supercomputer Center. DataStar has 176 (8-way) P655+ nodes (SP4). Each node has eight 1.5 GHz processors and 16 GB of memory, and the CPU peak performance is 6.0 GFlops. The evaluation uses the RM3D application kernel with a base grid of size 256 × 64 × 64, up to three refinement levels, and 1,000 base-level time steps. The number of processors used was between 64 and 1,280.

Impact of Load Imbalance Threshold and Resource Group Size. As mentioned in Section 3, the load imbalance threshold is used to trigger repartitioning and redistribution within a resource group, where min group size is the minimum number of processors allowed in a resource group when the resource is partitioned hierarchically. This threshold plays an important role because it affects the frequency of redistribution and, hence, the overall performance. The impact of this threshold for different resource group sizes for the RM3D application on 128 processors is plotted in Fig. 14. When the threshold increases from 5 percent to around 20-30 percent, the execution time decreases. On the other hand, when it increases from 30 percent to 60 percent, the execution time increases significantly. Smaller threshold values result in more frequent repartitioning within a resource group, while larger thresholds may lead to increased load imbalance. The best performance is obtained for a threshold of 20 percent and a min group size of 4. Due to the increased load imbalance, larger group sizes do not enhance performance. The overall performance evaluation below uses a threshold of 20 percent and a min group size of 4.

Overall Performance. The overall execution time is plotted in Fig. 15. The figure plots execution times for GPA, LPA, and AHMP. The plot shows that SBC+AHMP delivers the best performance. Compared to GPA, the performance improvement is between 30 percent and 42 percent. These improvements can be attributed to the following factors: 1) AHMP takes advantage of the strengths of different partitioning schemes, matching them to the requirements of each cluster; 2) the SBC scheme creates well-structured clusters that reduce the communication traffic between clusters; and 3) AHMP enables incremental repartitioning/redistribution and concurrent communication between resource groups, which extends the advantages of HPA [16].

Impact of Hybrid Partitioning. To show the impact of the hybrid partitioning strategy, we conduct the experiment using RM3D with a smaller domain, 128 × 32 × 32. All the other parameters are the same as in the previous experiment. As shown in Fig. 16, due to the smaller computational domain, without HPS, the overall performance degrades when the application is deployed on a cluster with more than 256 processors. The main reason is that, without HPS, the granularity constraint and the increasing communication overheads overshadow the increased computing resources. In contrast, with HPS, AHMP can further scale up to 512 processors, with performance gains of up to 40 percent compared to the scheme without HPS. Note that the maximum performance gain (40 percent) is achieved when using 512 processors, where the scheme without HPS exhibits degraded performance.


Fig. 13. Clustering costs for the six SAMR application kernels.

Fig. 14. Impact of load imbalance threshold for RM3D on 128 processors.

Fig. 15. Overall performance for RM3D.



Impact of Out-of-Core. The ALOC scheme has been implemented using the HDF5 library [38], which is widely used to store scientific data. The effect of the out-of-core scheme is evaluated using RM3D on the Frea Beowulf cluster. The configuration of RM3D consists of a base grid of size 128 × 32 × 32, four refinement levels, and 100 base-level time steps (with a total of 99 regridding steps). The number of processors used is 64. Without ALOC, it took about 13,507 seconds to complete 63 regridding steps, at which point the application crashed. With ALOC, the application successfully completed the execution of all 99 regridding steps. The execution time for the same 63 regridding steps was 9,573 seconds, which includes 938 seconds for explicit out-of-core I/O operations. Fig. 17 shows the page fault distribution and the execution time for experiments using the non-ALOC and ALOC schemes. As seen in the figure, without ALOC, the application incurs significant page faults. With ALOC, the number of page faults is reduced. As a result, the ALOC scheme improves performance and enhances survivability.

6 CONCLUSION

This paper presented hybrid runtime management strategies and the adaptive hierarchical multipartitioner (AHMP) framework to address the space-time heterogeneity of dynamic SAMR applications. A segmentation-based clustering algorithm (SBC) was used to identify cluster regions with relatively homogeneous partitioning requirements in the adaptive computational domain. The partitioning requirements of each cluster were identified and used to select the most appropriate partitioning algorithm for the cluster. Further, this paper presented the hybrid partitioning strategy (HPS), which uses application-level pipelining and improves system scalability by trading space for time, and the

application-level out-of-core scheme (ALOC), which addresses insufficient memory resources and improves the overall performance and survivability of SAMR applications. Overall, the AHMP framework and its components have been implemented and experimentally evaluated on up to 1,280 processors. The experimental evaluation demonstrated both the effectiveness of the clustering and the performance improvements achieved using AHMP. Future work will consider adaptive tuning of control parameters and will extend the proposed strategies to support workload heterogeneity in multiphysics applications.

APPENDIX

GLOSSARY

. AHMP: Adaptive Hierarchical Multipartitioner Strategy

. ALOC: Application-Level Out-of-Core Strategy

. BPA: Bin-packing Partitioning Algorithm

. G-MISP+SP: Geometric Multilevel Inverse SFC Partitioning + Sequence Partitioning

. GPA: Greedy Partitioning Algorithm

. LDF: Load Density Factor

. HPA: Hierarchical Partitioning Algorithm

. HPS: Hybrid Partitioning Strategy

. LIF: Load Imbalance Factor

. LPA: Level-based Partitioning Algorithm

. pBD-ISP: p-way Binary Dissection-Inverse SFC Partitioning

. RSF: Resource Sufficiency Factor

. SAMR: Structured Adaptive Mesh Refinement Technique

. SBC: Segmentation-Based Clustering Algorithm

. SFC: Space-Filling Curve Technique

ACKNOWLEDGMENTS

The authors would like to thank Sumir Chandra and Johan Steensland for many insightful research discussions. The authors would also like to thank the editors and referees for their suggestions, which have helped improve the quality and presentation of this paper. The research presented in this paper is supported in part by the US National Science Foundation via grant numbers ACI 9984357, EIA 0103674, EIA 0120934, ANI 0335244, CNS 0305495, CNS 0426354, and IIS 0430826, and by the US Department of Energy via grant number DE-FG02-06ER54857.


Fig. 17. Number of page faults: Non-ALOC versus ALOC.

Fig. 16. Experimental results: AHMP with and without HPS.



REFERENCES

[1] M. Berger and J. Oliger, "Adaptive Mesh Refinement for Hyperbolic Partial Differential Equations," J. Computational Physics, vol. 53, pp. 484-512, 1984.

[2] M. Berger and P. Colella, "Local Adaptive Mesh Refinement for Shock Hydrodynamics," J. Computational Physics, vol. 82, pp. 64-84, 1989.

[3] K. Devine, E. Boman, R. Heaphy, B. Hendrickson, and C. Vaughan, "Zoltan Data Management Services for Parallel Dynamic Applications," Computing in Science and Eng., vol. 4, no. 2, pp. 90-97, Mar.-Apr. 2002.

[4] S. Das, D. Harvey, and R. Biswas, "Parallel Processing of Adaptive Meshes with Load Balancing," IEEE Trans. Parallel and Distributed Systems, vol. 12, no. 12, pp. 1269-1280, Dec. 2001.

[5] J. Cummings, M. Aivazis, R. Samtaney, R. Radovitzky, S. Mauch, and D. Meiron, "A Virtual Test Facility for the Simulation of Dynamic Response in Materials," J. Supercomputing, vol. 23, pp. 39-50, 2002.

[6] S. Hawley and M. Choptuik, "Boson Stars Driven to the Brink of Black Hole Formation," Physical Rev. D, vol. 62, no. 10, p. 104024, 2000.

[7] J. Ray, H.N. Najm, R.B. Milne, K.D. Devine, and S. Kempka, "Triple Flame Structure and Dynamics at the Stabilization Point of an Unsteady Lifted Jet Diffusion Flame," Proc. Combustion Inst., vol. 25, no. 1, pp. 219-226, 2000.

[8] A. Calder, H. Tufo, J. Turan, M. Zingale, G. Henry, B. Curtis, L. Dursi, B. Fryxell, P. MacNeice, K. Olson, P. Ricker, R. Rosner, and F. Timmes, "High Performance Reactive Fluid Flow Simulations Using Adaptive Mesh Refinement on Thousands of Processors," Proc. ACM/IEEE Conf. Supercomputing, 2000.

[9] L.V. Kale, "Charm," http://charm.cs.uiuc.edu/research/charm/, 2007.

[10] R.D. Hornung and S.R. Kohn, "Managing Application Complexity in the SAMRAI Object-Oriented Framework," Concurrency and Computation: Practice and Experience, vol. 14, no. 5, pp. 347-368, 2002.

[11] G. Karypis and V. Kumar, "Parmetis," http://www.users.cs.umn.edu/~karypis/metis/parmetis/index.html, 2003.

[12] P. MacNeice, "Paramesh," http://esdcd.gsfc.nasa.gov/ESS/macneice/paramesh/paramesh.html, 2007.

[13] M. Parashar and J. Browne, "On Partitioning Dynamic Adaptive Grid Hierarchies," Proc. 29th Ann. Hawaii Int'l Conf. System Sciences, pp. 604-613, 1996.

[14] J. Steensland, S. Chandra, and M. Parashar, "An Application-Centric Characterization of Domain-Based SFC Partitioners for Parallel SAMR," IEEE Trans. Parallel and Distributed Systems, vol. 13, no. 12, pp. 1275-1289, Dec. 2002.

[15] P.E. Crandall and M.J. Quinn, "A Partitioning Advisory System for Networked Data-Parallel Programming," Concurrency: Practice and Experience, vol. 7, no. 5, pp. 479-495, 1995.

[16] X. Li and M. Parashar, "Dynamic Load Partitioning Strategies for Managing Data of Space and Time Heterogeneity in Parallel SAMR Applications," Proc. Ninth Int'l Euro-Par Conf. (Euro-Par '03), vol. 2790, pp. 181-188, 2003.

[17] X. Li and M. Parashar, "Using Clustering to Address the Heterogeneity and Dynamism in Parallel SAMR Application," Proc. 12th Ann. IEEE Int'l Conf. High Performance Computing (HiPC '05), 2005.

[18] A. Wissink, D. Hysom, and R. Hornung, "Enhancing Scalability of Parallel Structured AMR Calculations," Proc. 17th ACM Int'l Conf. Supercomputing (ICS '03), pp. 336-347, 2003.

[19] M. Thune, "Partitioning Strategies for Composite Grids," Parallel Algorithms and Applications, vol. 11, pp. 325-348, 1997.

[20] Z. Lan, V. Taylor, and G. Bryan, "Dynamic Load Balancing for Adaptive Mesh Refinement Applications: Improvements and Sensitivity Analysis," Proc. 13th IASTED Int'l Conf. Parallel and Distributed Computing Systems (PDCS '01), 2001.

[21] Z. Lan, V. Taylor, and G. Bryan, "Dynamic Load Balancing of SAMR Applications on Distributed Systems," J. Scientific Programming, vol. 10, no. 4, pp. 319-328, 2002.

[22] M. Berger and I. Rigoutsos, "An Algorithm for Point Clustering and Grid Generation," IEEE Trans. Systems, Man, and Cybernetics, vol. 21, no. 5, pp. 1278-1286, 1991.

[23] J.D. Teresco, J. Faik, and J.E. Flaherty, "Hierarchical Partitioning and Dynamic Load Balancing for Scientific Computation," Technical Report CS-04-04, Dept. of Computer Science, Williams College, 2004; also in Proc. Workshop Applied Parallel Computing (PARA '04), 2004.

[24] J. Steensland, "Efficient Partitioning of Structured Dynamic Grid Hierarchies," PhD dissertation, Uppsala Univ., 2002.

[25] H. Sagan, Space Filling Curves. Springer-Verlag, 1994.

[26] J. Pilkington and S. Baden, "Dynamic Partitioning of Non-Uniform Structured Workloads with Spacefilling Curves," IEEE Trans. Parallel and Distributed Systems, vol. 7, no. 3, Mar. 1996.

[27] R.C. Gonzalez and R.E. Woods, Digital Image Processing, second ed. Prentice Hall, 2002.

[28] T. Ridler and S. Calvard, "Picture Thresholding Using an Iterative Selection Method," IEEE Trans. Systems, Man, and Cybernetics, vol. 8, pp. 630-632, 1978.

[29] N. Otsu, "A Threshold Selection Method from Gray-Level Histograms," IEEE Trans. Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62-66, 1979.

[30] M. Parashar, "Grace," http://www.caip.rutgers.edu/~parashar/TASSL/, 2007.

[31] X. Li and M. Parashar, "Hierarchical Partitioning Techniques for Structured Adaptive Mesh Refinement Applications," J. Supercomputing, vol. 28, no. 3, pp. 265-278, 2004.

[32] J.L. Hennessy, D.A. Patterson, and D. Goldberg, Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 2002.

[33] M.S. Potnuru, "Automatic Out-of-Core Execution Support for CHARM++," master's thesis, Univ. of Illinois at Urbana-Champaign, 2003.

[34] N. Saboo and L.V. Kale, "Improving Paging Performance with Object Prefetching," Proc. Int'l Conf. High Performance Computing (HiPC '01), 2001.

[35] J. Tang, B. Fang, M. Hu, and H. Zhang, "Developing a User-Level Middleware for Out-of-Core Computation on Grids," Proc. IEEE Int'l Symp. Cluster Computing and the Grid, pp. 686-690, 2004.

[36] "IPARS," http://www.cpge.utexas.edu/new_generation/, 2007.

[37] S. Chandra and M. Parashar, "A Simulation Framework for Evaluating the Runtime Characteristics of Structured Adaptive Mesh Refinement Applications," Technical Report TR-275, Center for Advanced Information Processing, Rutgers Univ., Sept. 2004.

[38] "HDF5," http://hdf.ncsa.uiuc.edu/HDF5/, 2007.

Xiaolin Li received the BE degree from Qingdao University, China, the ME degree from Zhejiang University, China, and the PhD degrees from the National University of Singapore and Rutgers University. He is an assistant professor of computer science at Oklahoma State University. His research interests include distributed systems, sensor networks, and software engineering. He directs the Scalable Software Systems Laboratory (http://www.cs.okstate.edu/~xiaolin/s3lab). He is a member of the IEEE.

Manish Parashar received the BE degree in electronics and telecommunications from Bombay University, India, and the MS and PhD degrees in computer engineering from Syracuse University. He is a professor of electrical and computer engineering at Rutgers University, where he is also the director of the Applied Software Systems Laboratory. He has received the Rutgers Board of Trustees Award for Excellence in Research (2004-2005), the NSF CAREER Award (1999), and the Enrico Fermi Scholarship from Argonne National Laboratory (1996). His research interests include autonomic computing, parallel and distributed computing (including peer-to-peer and Grid computing), scientific computing, and software engineering. He is a senior member of the IEEE, a member of the executive committee of the IEEE Computer Society Technical Committee on Parallel Processing (TCPP), part of the IEEE Computer Society Distinguished Visitor Program (2004-2006), and a member of the ACM. He is a cofounder of the IEEE International Conference on Autonomic Computing (ICAC) and serves on the editorial boards of several journals and on the steering and program committees of several international workshops and conferences. For more information, please visit http://www.caip.rutgers.edu/~parashar/.

1214 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 18, NO. 9, SEPTEMBER 2007
