
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2008; 20:903–940
Published online 1 October 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cpe.1214

Methodology for modelling SPMD hybrid parallel computation

L. M. Liebrock1,∗,† and S. P. Goudy2

1New Mexico Institute of Mining and Technology, Computer Science Department, Socorro, NM 87801, U.S.A.
2Sandia National Laboratories, Albuquerque, NM 87185, U.S.A.

SUMMARY

This research defines and analyzes a methodology for deriving a performance model for SPMD hybrid parallel applications. Hybrid parallelism combines shared memory and message passing computing models.

This work extends the current practice of application performance modelling by developing a methodology for hybrid applications based on the following procedures.

• Creation of a model based on complexity analysis of an application code and its data structures.

• Enhancement of a static complexity model by dynamic factors to capture execution time phenomena, such as memory hierarchy effects.

• Quantitative analysis of model characteristics and the effects of perturbations in measured parameters.

These research results are presented in the context of a hybrid parallel implementation of a sparse linear algebra kernel. A model for this kernel is derived and analyzed using the methodology. Application of the model on two large parallel computing platforms provides case studies for the methodology. Operating system issues, machine balance factor, and memory hierarchy effects on model accuracy are examined. Copyright © 2007 John Wiley & Sons, Ltd.

Received 30 May 2006; Revised 4 November 2006; Accepted 11 March 2007

KEY WORDS: hybrid parallelism; performance analysis; performance modelling

∗Correspondence to: L. M. Liebrock, New Mexico Institute of Mining and Technology, Computer Science Department, 801 Leroy Place, Socorro, NM 87801, U.S.A.

†E-mail: [email protected]

Contract/grant sponsor: Publishing Arts Research Council; contract/grant number: 98-1846389
Contract/grant sponsor: Sandia National Laboratories; contract/grant number: DE-AC04-94-AL85000

Copyright © 2007 John Wiley & Sons, Ltd.


904 L. M. LIEBROCK AND S. P. GOUDY

1. INTRODUCTION

Within the high-performance computing community, much attention is directed at performance modelling and evaluation of future computer architectures‡. One goal of performance modelling is prediction of the way a particular computing system will behave in the presence of a typical workload. Here we address single-program multiple-data (SPMD) performance for hybrid architectures. In most cases, a workload can be decomposed into its constituent applications and models of runtime performance can be produced for each component.

In keeping with the evaluation of future platforms, analysis of parallel application performance ‘at scale’ is an important research area at Sandia National Laboratories. A thorough understanding of the features that make an application run efficiently on many processors will help application developers as they design algorithms. Further, a determination of the ways that platform properties enhance or degrade application scalability may provide useful information for the computer design process.

This work focuses on a method for development of performance models for hybrid parallel applications, which combine message-passing and shared memory communication. The Message-Passing Interface (MPI) library and OpenMP threads directives provide the infrastructure for such applications [1,2].

Clusters with multiprocessor nodes exist in many forms, from massively parallel supercomputers such as the Intel Teraflops (familiarly known as Tflops) to workstations linked on a network. In between are moderately parallel clusters, having 64–256 compute nodes, such as the Vplant visualization cluster at Sandia National Laboratories. Performance modelling of algorithms for these computing platforms is a subject of current research at Sandia, where evaluation of alternative architectures is important.

Parallel computers can be classified on the basis of the tightness with which the processors are coupled in their ability to communicate with each other. In a symmetric multiprocessor system, intercommunication occurs via an internal memory subsystem. In a cluster, intercommunication occurs across a network. This can be a high-speed interconnect as in the Intel Teraflops or a relatively slow connection such as Ethernet. Some massively parallel computers, for example, the ‘Q Machine’ at Los Alamos National Laboratory (LANL), employ a hybrid architecture in which the machine is a cluster of symmetric multiprocessors.

Within a symmetric multiprocessor, the most frequently used software paradigm for interprocess communication is shared memory. For a system of computers with distributed memory§, all data sharing takes place via message passing. In a clustered SMP system, a hybrid or mixed-mode parallel programming paradigm combines message passing with shared memory for data exchange. Typically, hybrid parallel applications use shared memory on a node or SMP, but use message passing between nodes or SMPs.
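As a concrete illustration of how MPI tasks can be placed on a cluster of SMP nodes, consider a minimal sketch. The function names, node counts, and mapping rules below are ours, purely for illustration: a block mapping places consecutive ranks on the same node (so logical neighbours can share memory), while a cyclic mapping scatters neighbouring ranks across nodes.

```python
# Illustrative sketch (not from the paper): two ways to map MPI ranks
# onto the nodes of a cluster of SMPs.

def block_mapping(num_ranks, ranks_per_node):
    """Consecutive ranks share a node: rank r -> node r // ranks_per_node."""
    return [r // ranks_per_node for r in range(num_ranks)]

def cyclic_mapping(num_ranks, num_nodes):
    """Ranks are dealt out round-robin: rank r -> node r % num_nodes."""
    return [r % num_nodes for r in range(num_ranks)]

# 8 ranks on 4 two-processor nodes:
print(block_mapping(8, 2))   # neighbouring ranks land on the same node
print(cyclic_mapping(8, 4))  # neighbouring ranks land on different nodes
```

Which mapping performs better depends on whether logically adjacent ranks communicate most with each other, a point this paper returns to when discussing MPI task mapping.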

‡For a description of one such program at Lawrence Berkeley National Laboratory, see HPCwire #108895 LBNL REVAMPS HPC PERFORMANCE MEASUREMENTS, 3 December 2004.
§Distributed memory machines can be programmed according to a global address space model; however, that paradigm is not considered.


Some researchers have shown improved timing results with hybrid techniques compared with pure message-passing programming [3,4]. Many others report little or no benefit from hybrid parallel versions of their codes [5–14]. A version of semicoarsening multigrid (SMG) that was developed for this project falls into the latter category, as does a different version of SMG developed at Lawrence Livermore National Laboratory [15].

Hybrid parallelism typically pays off when there are many things to do that must share the same context and there are multiple contexts in which such computations must be performed. In this case, having processors share memory on a node increases the memory available to support larger problems and allows multiple processes (or threads) to work on different pieces of the problem independently using the context provided by the shared memory. For problems where this large shared context is not necessary, a purely distributed approach is often more effective. For problems where all the processes must share the same complete context, a purely shared memory approach is usually more effective. The difference between paradigms becomes even more clear when the extra programmer effort needed for hybrid parallelization is taken into account.

In evaluating the performance of an SPMD application, the question arose: could application behavior be predicted by an analytic model of a hybrid parallel algorithm? In trying to answer that question for a specific SPMD application, after several attempts to produce a model that captured measured application behavior on different platforms, two more questions emerged. How does one build a ‘good’ model? How does one rate the ‘goodness’ of a model? The focus of this work was to answer the first of these questions and to provide a foundation for answering the second.

Parameterized analytic models combine application and architecture characteristics in a formula that can be evaluated quickly. Depending on the size and complexity of application code, an analytic model can be developed in a short time. Such models have been shown to have a predictive capability [16]. Possibly more important than performance prediction is elucidation of the interactions, for good or ill, between application and architecture.

Performance prediction methodology includes a number of methods and procedures. In the context of a single scientific application running on a cluster of symmetric multiprocessors, this work describes a collection of procedures for creating an analytic model of the application. Moreover, the semi-empirical methodology presented herein evaluates the scalability and predictive capability of the model. This research expands the current practice for application performance modelling on present and future computer platforms in the following ways.

• Methodology for creation of a parameterized analytic model: Such a model starts with complexity analysis. There are many reasons to use an analytic model for performance prediction. A complexity model can be evaluated more quickly than systems that arise from a Petri net or queuing model. An analytic model is more easily developed and validated than a simulated computing platform. An analytic model may be used even when no hardware is available.

• Enhancement of the complexity model with dynamic factors: Analysis of complexity is a static technique. Analytic models with statically determined parameters will be inadequate for prediction of some application behavior. Statically determined parameters come from vendor numbers or microbenchmarks. It is possible to introduce a dynamic component in a model by adding measured values. Microbenchmarks can add dynamic factors if they show


the impact on performance based on changes in data (via size or value). This enables the model to capture, without explicit terms, memory hierarchy and other resource contention effects.

• Quantitative analysis of the model: In order to have confidence in the predictions of a model, one can apply well-known system evaluation techniques to analyze a model [17]. Mathematical sensitivity analyses and empirical sensitivity calculations are used to address the following issues: model response to perturbations in parameters, impact on the model of using averages for parameter values, and effectiveness of the model when scaling the problem domain size or number of processors. This is especially important when no platform exists for model validation. Computational and statistical methods for evaluating the fidelity of a model are suggested.
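An empirical sensitivity calculation of the kind listed above can be sketched as follows. The toy model and all parameter values here are ours, purely for illustration: one parameter is perturbed by a small fraction and the relative change in the predicted time is reported.

```python
# Illustrative sketch (not the paper's model): empirical sensitivity of a
# toy prediction T = n*f + c*(alpha + m*beta) to one of its parameters.

def predict(p):
    # n floating-point operations at f seconds each, plus c messages of
    # m bytes each with latency alpha and per-byte cost beta.
    return p["n"] * p["f"] + p["c"] * (p["alpha"] + p["m"] * p["beta"])

def sensitivity(params, name, eps=0.01):
    """Relative change in the prediction per relative change in one parameter."""
    base = predict(params)
    bumped = dict(params, **{name: params[name] * (1 + eps)})
    return (predict(bumped) - base) / (base * eps)

p = {"n": 1e6, "f": 1e-9, "c": 100, "alpha": 1e-5, "m": 8000, "beta": 1e-9}
# For a model linear in a parameter, the sensitivity equals that
# parameter's share of the total predicted time.
print(sensitivity(p, "f"))
```

A sensitivity near 1.0 means the prediction tracks that parameter almost linearly, so a measurement error in it propagates directly into the prediction; a sensitivity near 0 means the parameter could safely be replaced by an average.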

In this work, the methodology is used with applications that are regular in the following senses. The applications follow the SPMD programming model. Looping constructs have iteration counts that depend on domain size or some other measure of work. The global domain is decomposed as evenly as possible across the processors. Either there are well-defined phases of communication and computation or attempts to overlap these phases are apparent. Even with these constraints, the methodology can be used with a large class of scientific and engineering applications. The methodology has been applied to an explicit physics simulation code; the detailed analysis of that application will not be discussed here. However, examples are drawn from two- and three-dimensional iterative solvers for discretized partial differential equations. For the implicit solvers as well as the explicit code, the data structures and domain decompositions introduce tacit assumptions about logical nearest neighbors into the models. While these assumptions do not necessarily limit the scope of the methodology, they do affect performance predictions.

2. RELATED WORK

The techniques for generation of performance predictions fall into broad categories. There are parameterized analytic models, which can be generated automatically from source or object code. Some analytic models are derived from complexity analysis and system characterization. There are procedures based on measurement of run times. Execution tracing tools¶ provide input for simulation, with hardware system parameters as a second form of input. Finally, there are solution methodologies for complex systems, such as Petri nets or task graphs in combination with queuing models.

2.1. A similar model development technique

This research is most closely aligned with that of the performance modelling team at LANL. Although the two methods were developed independently, there is considerable similarity between

¶Dimemas is one such commercial tool, available from the European Center for Parallelism of Barcelona. See http://www.cepba.upc.edu/dimemas for details.


them. This section discusses the relationship between the LANL approach and this semi-empirical methodology.

Kerbyson et al. [18] have described their method of producing a performance model to encapsulate an application’s crucial performance and scaling characteristics. Kerbyson and his colleagues separate application and mapping parameters from system parameters in the model; some parameters are measured and some are specified. The LANL group formulates a complexity model from analysis of code structure, operation counts, and key data structures.

Their method defines computation time from measurement of single-processor execution time

on a problem size commensurate with parallel subdomains of the weakly scaled problem. An alternative is to measure time for a kernel and use that time with a count of kernel executions to estimate computation time [19]. They model communication with a piecewise linear representation in different message size ranges. It is important for their technique that communication bandwidth be measured in a setting that closely mimics the actual application. Frequency and type of message-passing operations are obtained by execution traces or profiles. Network and memory contention for a multiprocessor node are measured by using different problem configurations within an SMP node. This consists of running an application on more than one processor per node and comparing that performance to the application using the same data set on only one processor per node.

Up to this point the description of the Kerbyson method applies equally well to this semi-empirical methodology. However, the approach presented here handles hybrid parallelism explicitly and accounts for MPI task mapping, whether block or cyclic, to multiprocessor nodes. In some algorithms, the accuracy of the model is closely tied to differences in latency and bandwidth that can be observed for different MPI task mappings. This work also goes beyond the LANL work to include quantitative analysis of model characteristics, such as scalability and sensitivity to variation in parameters.
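A piecewise linear communication representation of the kind described above can be sketched as follows. The break points, latencies, and bandwidths are invented for illustration; in practice each (latency, bandwidth) pair would be fitted to measurements taken in a setting that mimics the application.

```python
# Illustrative sketch: piecewise linear communication time model.
# In each message-size range, time = latency + size / bandwidth.
# All break points and values below are hypothetical.

RANGES = [  # (max_bytes, latency_s, bandwidth_bytes_per_s)
    (1024,          5e-6, 2e8),    # short messages
    (65536,         2e-5, 8e8),    # mid-range (e.g. eager protocol)
    (float("inf"), 8e-5, 1.2e9),  # long messages (e.g. rendezvous)
]

def comm_time(message_bytes):
    for max_bytes, latency, bandwidth in RANGES:
        if message_bytes <= max_bytes:
            return latency + message_bytes / bandwidth
    raise ValueError("unreachable: last range is unbounded")

print(comm_time(512))      # latency-dominated
print(comm_time(1 << 20))  # bandwidth-dominated
```

The range boundaries often correspond to protocol switches in the MPI implementation, which is one reason a single latency/bandwidth pair fits poorly across all message sizes.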

2.2. Models

Adhianto and Chapman [20] present a model based on combining static analysis and runtime benchmark feedback for measurement of communication and multithreading. Static application signatures are obtained using the OpenUH compiler. System profiles for communication latency and overhead are obtained using Sphinx and PerfSuite. This approach eliminates the execution of the application to speed up modelling and interacts with users to define unknown variables, e.g. the number of processes and threads. This is useful when you have a machine, but do not have the application implemented on the machine. This approach and our concurrently developed methodology are quite similar. However, it does not take into account the dynamics related to computation that we handle.

Brehm et al. [21] developed a tool for analytic model evaluation with a focus on assisted

parallelization. Therefore, they emphasize determination of the appropriate level of detail for the application and for the computer system. They compare model predictions with runtime measurements using percentage of absolute error as the metric for ‘goodness’. The relation between complexity and accuracy of models is explored by creation of algebraic representations of the application and the computing system. Validation against execution times serves to rate the accuracy of the different abstractions.

Boyd et al. [22] obtain a hierarchy of performance bounds starting at the hardware. Their system is called MACS: M(achine) is the peak floating-point performance; A(pplication) accounts


for workload essential operations not covered by floating-point operations; C(ompiler) generated workload captures performance degradation due to additional operations needed to support the calculation; S(chedule) measures degradation due to scheduling constraints for resources both internal and external to the processor.

Sreekantaswamy et al. develop an analytic model for system performance in which tree-structured programming techniques are used [23]. Here tree structured means that a processor receives a parallel task and elects to work on it or to pass it on, possibly after splitting it. The computation classes they model are divide-and-conquer and processor farms.

Brehm is strongly motivated to keep his modelling process simple so that application developers will use it. His consequent focus on the level of detail that will give an adequate characterization of application and architecture is a point of strong similarity with this semi-empirical work. He raises very interesting questions and while his answers are not crisp, he gives a clear account of his experimental technique.

The MACS bounds on performance are essential for understanding the capabilities of a given processor. Boyd’s characterization of performance occurs at a much lower level than that of Kerbyson or Brehm. Sreekantaswamy’s characterization of system performance is derived from general features of application types.

2.3. Measurement

Once an analytic model has been formulated, judicious measurement can be used to obtain model parameters or validate choices of model parameters. One can also measure system response in order to rate the sensitivity of the model to parameter variations. While an analytic model captures programmer intent, measurement captures system response to that intent.

Characterization of platform behavior can use measurement at the application level. In the references cited below, timing data are inspected to determine model sensitivity to parameters. Hoisie et al. and Kerbyson et al. examine sensitivity of application performance to machine parameters [16,19]. Worley et al. relate sensitivity of application performance to communication protocol [24,25]. Crovella et al. measure and account for all sources of parallel overhead [26]. The authors model overhead by fitting trial functions to timing data. Their model includes goodness-of-fit values for the trial functions. Grama et al. also account for parallel overhead [27]. They propose an isoefficiency metric that allows an analyst to measure performance on small numbers of processors and predict performance on large numbers of processors. Isoefficiency is determined by the amount that data size per process must increase as the number of processes increases to maintain the same value for parallel efficiency.

Marin et al. describe their toolkit for semiautomatic measurement of static and dynamic characteristics of applications [28]. They produce architecture-neutral models for sequential code. This work seems to have some points of similarity with [22,29], which examine architecture-neutral characteristics of shared-memory multiprocessors and parallel computers. Dongarra et al. focus on measurement for performance enhancement [30]. They are developing tools for instrumentation of parallel codes. Numerous researchers take measurements for system characterization [31–34].

In the analyses of sensitivity to system parameters cited above, the authors determine sensitivity by data inspection of model predictions. By contrast, this semi-empirical approach uses quantitative analysis of sensitivity.
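The isoefficiency idea mentioned above can be sketched with a toy model: summing n numbers on p processors with parallel time n/p + 2 log2(p). The model, costs, and target efficiency are ours, purely for illustration; the point is that n must grow (here roughly like p log p) to hold efficiency constant.

```python
# Illustrative sketch of isoefficiency: how large must the problem be so
# that parallel efficiency stays at a target as processors are added?
# Toy cost model, not from the paper: T_par = n/p + 2*log2(p).

import math

def efficiency(n, p):
    t_serial = n                      # one operation per element
    t_parallel = n / p + 2 * math.log2(p)  # local work + reduction tree
    return t_serial / (p * t_parallel)

def required_n(p, target=0.8):
    """Smallest power-of-two n keeping efficiency >= target (doubling search)."""
    n = 1
    while efficiency(n, p) < target:
        n *= 2
    return n

# Problem size per process must grow as p grows:
for p in (4, 16, 64):
    print(p, required_n(p), required_n(p) // p)
```

A slowly growing isoefficiency function indicates a highly scalable algorithm; here the required n grows faster than p, so the per-process data size must keep increasing.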


2.4. Methodology

The research described in this subsection has little in common with our semi-empirical methodology. Historically, ‘methodology’ has been used to denote solution of a queuing model or evaluation of a Petri net or a task graph. Various simulation techniques are also referred to as models and/or methodologies. They are considered here as each claims to predict or improve performance of applications.

The PMaC framework has been developed at the San Diego Supercomputer Center for application performance simulation [35]. A model of serial performance, between communication events, is obtained by convolution of the application memory trace with the machine memory hierarchy profile. Parallel performance estimation uses the serial performance model and communication traces from a commercial package in a post-processing tool. A key feature of the PMaC convolver is the emphasis on memory behavior as the primary indicator of performance. Instruction count and complexity analysis frequently play only a minor role in a PMaC prediction.

Developments in parallel skeletons and parallel design pattern libraries have been made to reduce the development time and improve the performance of new parallel applications. For example, MacDonald et al. provide performance tuning opportunities at lower layers in the CO2P3S parallel programming system [36]. No references have been found for any such systems that support hybrid parallel computing and performance analysis or tuning.

Grove and Coddington built a simulator, the Performance Evaluating Virtual Parallel Machine, that evaluates performance directives placed in the application source code [37]. These directives are customizable, based on the particular hardware to be evaluated. van Gemund created a symbolic language, PAMELA, for simulating application performance on shared memory and distributed memory machines [38]. In his prediction methodology, one describes the application as an imperative program using the symbolic modelling language. The computing system is described by a machine specification formalism. PAMELA models are compiled for efficiency of evaluation.

Many researchers have combined task graph methods or Petri nets with queuing models [39–43]. Recently, there has been a revival of interest in symbolic simulation [44] and deterministic task graphs [45]. It should be noted, however, that some authors believe that such techniques are impractical to apply to large-scale applications running on thousands of processors [25,46].
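A convolution of an application memory profile with a machine memory hierarchy profile, in the style attributed to PMaC above, can be sketched as follows. All hit fractions and per-level costs are invented for illustration; PMaC itself derives such profiles from measured traces.

```python
# Illustrative sketch (hypothetical numbers): estimate serial time between
# communication events by combining per-basic-block memory reference counts
# with per-level access costs of the target machine.

MACHINE = {"L1": 1e-9, "L2": 5e-9, "MEM": 8e-8}  # seconds per reference

# Per basic block: total references and the fraction satisfied at each level.
BLOCKS = [
    {"refs": 1_000_000, "hits": {"L1": 0.95, "L2": 0.04, "MEM": 0.01}},
    {"refs": 200_000,   "hits": {"L1": 0.60, "L2": 0.30, "MEM": 0.10}},
]

def serial_time(blocks, machine):
    total = 0.0
    for b in blocks:
        # Expected cost per reference in this block, given where it hits.
        per_ref = sum(frac * machine[lvl] for lvl, frac in b["hits"].items())
        total += b["refs"] * per_ref
    return total

print(serial_time(BLOCKS, MACHINE))
```

Note that operation counts never appear: in this style of prediction, memory behavior dominates and the instruction mix plays only a minor role, consistent with the description above.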

2.5. Remarks on predictive model research

The most important concepts from the research on application performance prediction are simple in theory. First, a model must include dynamic information to supplement static analysis. Second, it is necessary to account for all overhead. In practice, these concepts are not simple to implement effectively. The question of how much dynamic information to include is a point of differentiation between modellers. Overhead includes more than communication cost. Although the primary focus of this work is hybrid parallel computation, we must also consider sequential overheads, as they significantly impact performance. An interesting issue is how much detail suffices to describe application behavior.

The term semi-empirical will be used to distinguish the approach taken in this research from that of other authors. One important distinction is the use of quantitative analysis of the effects of system


parameter choices on the estimates of performance produced by a model. Other novel aspects of this research are its setting in hybrid parallelism and its investigation of mapping strategies for clusters of symmetric multiprocessors.

3. METHODOLOGY FOR MODELLING HYBRID PARALLEL PERFORMANCE

In this semi-empirical methodology, creation of an analytic model for a hybrid parallel application begins with complexity analysis. Operation counts are combined with parameters that describe the computing platform and the way the application domain is mapped onto the processors. Some system parameters may be specified by reference to vendor documentation, while others may need to be measured.

The process for derivation of a model entails the following steps.

1. Count computation, communication, and threaded operations to obtain complexity expressions.
2. Define application and mapping parameters that affect execution time.
3. Define system parameters that must be quantified.
4. Write the analytic model in terms of the application and system parameters.
5. Enhance the static model with dynamic information, where necessary.
6. Perform quantitative analysis of the parameterized model.
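Steps 1–4 can be sketched for a hypothetical two-dimensional five-point stencil kernel. The operation counts, parameter names, and values below are ours, purely illustrative, and are not the model derived later in this paper.

```python
# Illustrative sketch of steps 1-4 for a hypothetical 2D five-point stencil:
# step 1 gives the counts, step 2 the application/mapping parameters
# (nx, ny, px, py), step 3 the system parameters (flop_time, latency,
# byte_time), and step 4 assembles them into an analytic model.

def model_time(nx, ny, px, py, flop_time, latency, byte_time, word_bytes=8):
    lx, ly = nx // px, ny // py          # local subdomain (even decomposition)
    flops = 5 * lx * ly                  # five-point stencil update per sweep
    t_comp = flops * flop_time
    # Halo exchange with up to four neighbours, one message per face:
    t_comm = 4 * latency + 2 * (lx + ly) * word_bytes * byte_time
    return t_comp + t_comm               # assumes no comp/comm overlap

t = model_time(nx=4096, ny=4096, px=4, py=4,
               flop_time=1e-9, latency=1e-5, byte_time=1e-9)
print(t)
```

Steps 5 and 6 would then replace the fixed flop_time with a measured, data-size-dependent value and analyze the model's sensitivity to each parameter.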

Modelling hybrid parallelism differs from modelling pure message passing on distributed memory computers in that there are additional parameters needed to quantify thread creation, scheduling, and synchronization overheads. Memory subsystem effects include contention for access at all levels of the hierarchy and possible cache thrashing due to ‘false sharing’. Memory access behaviour can strongly affect floating-point performance as well as interaction with the communication network. Quantification of thread parameters is one point of distinction between this work and other analytic modelling efforts [18,21].

An analytic model for an application kernel can be produced from examination of algorithmic complexity and data structures in a code with regular structure. That is, one can write an analytic expression for the code behavior if the code has well-defined regions of computation and communication or if overlaps of these regions are readily discernible. Parameters for application and system attributes give this expression a predictive capability. Here, a parametrized model is presented for an application with loops that have well-defined iteration limits.

3.1. Steps in complexity analysis of a hybrid parallel application

The process for derivation of a static model with parameters is composed of the following steps.

• Count computation and communication operations to obtain complexity expressions. For complex codes, a profiler can be used to determine the counts.

• Define application and mapping parameters that affect execution time. These model parameters include such items as simulation domain size and decomposition, as well as problem-specific indicators that specify execution paths.


• Define system parameters that must be quantified. These parameters‖ include floating-point performance, latency and bandwidth of the communication network, and thread creation and scheduling overhead∗∗.

• Develop the analytic model in terms of the application and system parameters. Characterize the system parameters, whether specified or measured. If measurements are necessary, obtain or create benchmarks to measure system parameters.

3.1.1. Count computation and communication operations

This semi-empirical methodology derives a cost function for an application. First, count the number of each type of computation operation in each code block. Many analytic models of scientific applications use floating-point arithmetic counts as a measure of single-processor activity [18,21,47]. Some other operation types that influence performance are integer arithmetic, memory access operations, and intrinsic functions. Also, count the number of times the block is executed in terms of the data size parameters. This implies that loops have iteration limits and that branches have determinable conditions. Profiling tools, such as SpeedShop from SGI, can be used to determine such counts.

To deal with hybrid applications, we must account for both processes/communication and threads/memory sharing.

Therefore, next, count the number of each type of communication event. Point-to-point communication and collectives have different complexity expressions in terms of the number of processors and thus must be accounted for separately. Further, collective complexity depends both on the communication network topology [27] and on the communication library implementation. For example, some early implementations of the MPI library did not take advantage of the most efficient tree algorithms for collective operations. Determine the message size for each communication, either by code examination or from a communication trace. Discover, if possible, the degree of computation/communication overlap. Even though the code may attempt to use asynchronous communications, the hardware may not be capable of simultaneous communication and computation [48]. If there is no overlap, then the complexity expressions for communication and computation can be summed. Otherwise, expressions for these phases require maximums as well as sums or in some cases more complex combinations.

Lastly, account for the thread overheads. These are typically quantifiable by inspection of the source code. Upon entry to an OpenMP parallel region, a pool of threads must be created or activated; upon exit from the region these threads must be destroyed or deactivated. Count the number of parallel regions and the number of thread scheduling events within the regions. The type of scheduling requested can affect the thread overhead; for example, there is more overhead associated with fine-grained dynamic OpenMP parallelism than with coarse-grained static parallelism. Implicit and explicit barriers must also be accounted for in the thread overhead terms.

‖Response of the memory subsystem can affect all these parameters and must be incorporated into the model. This is addressed in Section 3.4.4.

∗∗Models for multithreaded architectures are beyond the scope of this work. Thread creation and scheduling overhead are dependent on operating system features. Other types of thread overhead have a basis in hardware characteristics.

Copyright © 2007 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2008; 20:903–940. DOI: 10.1002/cpe


3.1.2. Define application and mapping parameters

The runtime of an application is clearly related to the size of key data structures. The shape of these structures and the pattern of access are also relevant.

While the number of compute nodes available is a platform parameter, the number that are actually used in a calculation is a data mapping parameter. The number of MPI tasks and the number of OpenMP threads within a task, when combined with the global data size and shape, yield a data decomposition and mapping onto the processors. This data layout or mapping is reflected in the model by the local data sizes and shapes for a process or thread.

The data decomposition and mapping is an application parameter that is typically determined based on the surface-to-volume ratio of the subdomain for the application. To generalize this idea, the modeller or application developer needs to consider the computation-to-communication ratio. The domain decomposition selected should balance the optimization of computational speed (e.g. using long vectors for performance) with the requirements for communication.

3.1.3. Define system parameters

A model must contain system-dependent parameters in order to capture performance on different platforms. Clock speed is a fixed quantity for each processor in a parallel system. The number of communication links per compute node is a fixed quantity on each system. Some parameters have peak values that are specified by the hardware vendor: bandwidth and latency of the communication fabric, time for standard arithmetic operations, and access times for the memory hierarchy levels. Measurements of the values attained by an application can be quite different. Which value to use in a model is a matter of judgment in the context of the reasons for the development of the model.

Vendor numbers really provide an upper bound on the performance for a machine and should therefore only be used when there is no way to get any realistic numbers. In this case, the model is typically being developed to assess the capabilities of proposed architecture designs. As soon as an emulator or real hardware is available, measurements using microbenchmarks are a necessity for the case where the application has not been ported. In this case, the model is typically being developed to assess the tradeoffs of purchasing competing architectures. If the application has been ported, then there is no substitute for using measured quantities from the application of interest. Here the model is often used to assess possible improvements for the application implementation.

Other quantities that must, in general, be measured are timing data for intrinsic or system library functions and the overheads for threads (creation, scheduling, and barriers). This list is not exhaustive; however, it forms a basic set of necessary quantities to be included in an analytic model.

3.1.4. Develop analytic model with parameters

Having assembled the complexity expressions and the list of application, mapping, and system parameters, next write the parameterized analytic model. The combination of the formulas describing the complexity of the computation and communication activities, together with the parameters that specify system performance characteristics, makes up the model.


Consider a simple generic hybrid performance model of the form

T = E(N, Θ)/(Pρ) + τ_C + aτ_S + b(α + βn) + cγ log P    (1)

where N is a measure of the problem size, P is the number of processes, ρ is the number of threads per process, Θ represents system parameters such as floating-point performance, τ_C is the thread creation time, τ_S is the thread scheduling/synchronization overhead, a is the number of times threads must be synchronized or scheduled, α is the startup overhead for communication, β is the overhead per byte of communication, n is the number of bytes of data to transfer, b is the number of communications that must take place, γ log P is the cost of a collective communication††, and c is the number of collective communications to perform.

Given these values, E represents the computational requirements/performance for the application with the specified data set size on a processor with characteristics corresponding to Θ, and T is the expected run time. The term E(N, Θ)/(Pρ) represents the impossible-to-reach goal of perfect utilization of processes and threads with no overhead. The additional terms bring the model back to reality by adding first thread overheads, τ_C + aτ_S, then direct communication overheads, b(α + βn)‡‡, and finally collective communication, cγ log P. Although this generic model does not directly represent issues such as overlapping communication and computation, it provides a template that needs to be specialized for each application and machine pair.

To adapt this general model to a particular system, it is necessary to set specific values for the platform and the application. The modeller must do the following.

• Characterize system parameters as having specified or measured values.
• Consider whether effective rates for arithmetic operations and routines are needed.
• Decide whether communication parameters depend on the logical connectivity§§ of processes.
• Determine how to handle resource contention effects.
• Factor in overlapping computation and communication, as well as the impact of the domain decomposition.
• Assign platform parameters.
• Predict application performance by assigning values to application, system, and mapping parameters.
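To make the roles of the parameters concrete, the generic model of Equation (1) can be evaluated directly. The sketch below is illustrative only; the function name and any parameter values used with it are hypothetical, not measurements from either platform discussed in this paper.

```python
from math import log2

def hybrid_time(E, P, rho, tau_C, tau_S, a, alpha, beta, n, b, gamma, c):
    """Evaluate the generic hybrid model of Equation (1):
    T = E/(P*rho) + tau_C + a*tau_S + b*(alpha + beta*n) + c*gamma*log2(P).
    E is the single-processor work E(N, Theta), already reduced to seconds."""
    ideal = E / (P * rho)              # perfect process/thread utilization
    threads = tau_C + a * tau_S        # thread creation plus scheduling events
    p2p = b * (alpha + beta * n)       # point-to-point communication
    coll = c * gamma * log2(P)         # tree-based collective communication
    return ideal + threads + p2p + coll
```

With all overhead terms set to zero the model reduces to the ideal time E/(Pρ); every additional term can only increase T, mirroring the discussion above.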

3.1.5. Measurements using tools

Some tools exist that can assist with data collection, analysis, and understanding of performance. For data collection consider the use of profilers, trace collectors, and trace analyzers, if they are available for the machine of interest. Measurement of hardware statistics can be achieved using PAPI¶¶ or VTune for single-CPU performance‖‖.

Performance visualization tools such as ParaDyn∗∗∗ and ParaVer††† can provide a basis for the analysis of performance. Vampir, designed for cluster performance analysis, has evolved into Intel Trace Collector and Analyser‡‡‡.

The Tuning and Analysis Utilities [49] (TAU) parallel performance framework provides support for performance observation and analysis. Roth and Miller [50] have developed tools to effectively search for performance bottlenecks on systems with up to 1024 processes. Their work also provides visualization to consolidate information in larger problems (up to 30 000 nodes). Adhianto and Chapman [20] present a tool framework that uses benchmarks to derive a system profile and a compiler to determine an application profile.

Although such tools help to understand performance in some cases, they are often not available or do not provide a sufficiently detailed understanding of application behaviour. This limitation leads to development of benchmarks that provide more specific and/or detailed information for a particular application.

††The term for collective communication must reflect the type of implementation supported on the system—here we use a binary tree-based implementation, which is typically a best case.

‡‡The communication overhead would really be a sum of terms, since the model needs to take into account the different combinations of b and n.

§§Consider point-to-point communication between nearest neighbors in a logical process grid. Message latency and inverse bandwidth may differ depending on whether these neighbors share a node or not.

3.1.6. Measurements using benchmarks

Not all system parameters that appear in the model have well-defined values; the modeller must quantify those that do not have specific values by using benchmarks. There are many benchmarks available for download from the Internet. At the web site for Oak Ridge National Laboratory, one can get the Low Level Architectural Characterization Benchmark Package§§§ for parallel computers. This package has components for measuring floating-point performance, cache behavior, and communication network parameters. STREAM, from the University of Virginia¶¶¶, is a well-known benchmark for measuring memory performance. Functionality and performance tests for MPI library functions can be obtained from the MCS division at Argonne National Laboratory‖‖‖.

Despite efforts of benchmark creators to make their products portable and easy to use, coverage of all platforms is difficult to achieve. Low-level machine details can be hard to determine; configuration of benchmarks for a particular machine often depends on these details. Publicly available benchmarks may not match the memory access patterns or the communication patterns of the application closely enough. For example, ‘ping pong’ tests that measure communication latency between two processors are ubiquitous. Using the results of this benchmark in a model for an application that uses some other communication pattern∗∗∗∗ can be misleading.
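The linear cost model t(M) = α + βM that underlies such ping pong tests can be recovered from measurements. The two-point version below is a minimal sketch with an illustrative function name; a least-squares fit over many message sizes is preferable in practice.

```python
def fit_latency_bandwidth(m1, t1, m2, t2):
    """Recover alpha (startup latency) and beta (per-word transfer time)
    from two ping pong measurements: message sizes m1, m2 with one-way
    times t1, t2, assuming the linear model t(M) = alpha + beta*M."""
    beta = (t2 - t1) / (m2 - m1)   # slope: per-word cost
    alpha = t1 - beta * m1         # intercept: startup latency
    return alpha, beta
```

Applied to the Vplant figures reported in Table I (λ(M) of 20.1 μs at M = 128 and 26.5 μs at M = 256), the fit returns the tabulated α = 13.7 μs.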

¶¶http://icl.cs.utk.edu/papi.
‖‖http://intel.com.
∗∗∗http://pardyn.org.
†††http://cepba.upc.es.
‡‡‡http://intel.com.
§§§http://icl.cs.utk.edu/projects/llcbench/.
¶¶¶http://www.cs.virginia.edu/stream/.
‖‖‖http://www-unix.mcs.anl.gov/mpi/.
∗∗∗∗Simultaneous exchange of data on processor boundaries is a pattern that introduces contention for network resources that is not present in the simple two-processor ping pong test.


If it is not possible to use existing benchmark tools, here are some general guidelines for creation of customized benchmarks. Mimic the application setting as closely as possible. The memory access patterns that support arithmetic operations induce an effective operation rate that can be very different from the processor speed [22]. For example, if a sequence of multiplications occurs in the context of a multidimensional array, then a sequence of repeated multiplications of the same two numbers will not be an accurate benchmark. Extract small, frequently executed kernels from the application for timing to characterize system parameters.

When measuring operations that take very little time, clock resolution is an issue. In some cases, the event of interest is shorter than the period of the available timer. The most common approach to timing very short duration events is to repeat the event many times between timer calls. Care must be taken, as the caching issues above apply here as well. The techniques for measuring thread overheads, discussed in [31], are generally applicable, as are the benchmarks in [22].
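The repetition technique for sub-resolution events can be sketched as follows; the function name is illustrative, and a real harness would also subtract loop overhead and repeat the whole measurement to gauge variance.

```python
import time

def time_short_event(event, reps=100000):
    """Estimate the duration of an event shorter than the timer period by
    repeating it between timer calls. One warm-up call is made first so
    that first-touch cache effects do not bias the estimate, per the
    caching caveat above. Returns estimated seconds per event."""
    event()                         # warm-up: populate caches
    start = time.perf_counter()
    for _ in range(reps):
        event()
    return (time.perf_counter() - start) / reps
```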

3.2. Steps for addition of measured values to complexity analysis

The last step for creation of a static model is to write the analytic model in terms of the application and system parameters. Characterization of system parameters can be done using benchmarks; however, values of many system parameters are sensitive to the context in which the operations appear. Thus, it may be necessary to define parameters based on information that is available only at run time.

Enhance the static model with dynamic information: It is generally acknowledged that memory hierarchy effects can strongly alter application performance [29,35]. Incorporation of measurements for single-node run times at different problem domain sizes can capture system-specific memory behaviour. A similar technique can be used to model communication variations that can arise from processes that communicate with each other being either local to a node or remote. Computation variations can be captured by varying the data used in the computation rather than the size.

The starting point for determining which parameters must be modelled using dynamic information for each architecture is a comprehensive set of microbenchmarks that are used with varying data sizes and values. These experiments determine which parameters have significant variation over the range of sizes and values used. Such parameters should be dynamically modelled.

Starting with a static complexity model, general parameters that affect execution time are defined.

Some of these parameters are the sizes and layout of key data structures. The number of MPI processes and the number of OpenMP threads per process, together with domain sizes, can be used to generate a data-to-processor mapping in the model. Examples of system parameters that can be sensitive to context are effective rates for floating-point arithmetic, the amount of actual parallelism, and thread scheduling overhead. To keep the model from becoming overly complex, timing data for kernels or library functions can be used directly in the model.

Develop a dynamic timing estimation for a computation phase: Begin by obtaining timing results for single-processor runs†††† at various data sizes. Fit a curve to this data. Next, use a profiling tool to measure relative timing information for routine and function calls at the ‘top level’ of the computation phase under study. Use this to refine the single-processor model based on profile counts of calls.

Develop an estimate for each communication phase: Use a profiling tool or a communication trace to time and count communication events for various data sizes and layouts. Fit a surface to this data.

Finally, write an analytic expression for parallel performance that includes the fitted parameters: If the application is composed of alternating computation and communication phases, determine if any of these phases overlap. If not, then the total time is the sum of the times for each phase. Otherwise, the total time requires a more complicated expression involving sums and maximums.

††††These single CPU runs must be set up to execute quickly, while giving a sufficient measure of the relevant computation, since many runs will likely be required.
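The curve-fitting step can be as simple as an ordinary least-squares line through the single-processor timings; real applications may need a higher-order or piecewise fit to capture cache-size break points. This closed-form sketch, with an illustrative name, avoids any fitting library:

```python
def fit_line(sizes, times):
    """Least-squares fit of T(N) = a + b*N to timing data gathered from
    single-processor runs at several data sizes N. Returns (a, b)."""
    k = len(sizes)
    sx, sy = sum(sizes), sum(times)
    sxx = sum(x * x for x in sizes)
    sxy = sum(x * y for x, y in zip(sizes, times))
    b = (k * sxy - sx * sy) / (k * sxx - sx * sx)  # slope
    a = (sy - b * sx) / k                          # intercept
    return a, b
```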

3.3. Case study (RELAX2)

As a ‘testbed’ for evaluation of the model methodology, a good choice is SMG, a numerical technique for solution of elliptic partial differential equations in two and three dimensions. The algorithm contains both coarse-grained and fine-grained parallel execution paths. These attributes are most evident in the relaxation kernel of SMG, making this kernel a good basis for experimentation in a clustered SMP environment. All experiments described here use block Gauss–Seidel relaxation as implemented in two-dimensional SMG. Gauss–Seidel relaxation can be easily separated into two phases; one phase requires the solution of a block of tridiagonal systems‡‡‡‡. For this case study, measured execution times are compared with model predictions as one data point in validating the model methodology.

3.3.1. Basis of the RELAX2 model

Block Gauss–Seidel relaxation proceeds in two phases: creation of a system of tridiagonal equations from the plane equation defined for SMG and the subsequent line solves. The plane equation is represented in stencil format. A stencil specifies how the solution value at a domain grid point depends on values of nearby points. The blocks of the two-dimensional grid are the even numbered lines, called ‘red’ lines, and the odd numbered lines are ‘black’ lines. The plane equation is decomposed into systems of tridiagonal equations (or line equations). Due to the compactness of the stencil, solution of the lines can proceed in parallel. Single-line equations or blocks of line equations can be passed to the tridiagonal solver.

A complexity model for relaxation has two parts: the setup and the solution of the tridiagonal systems that are derived from the plane equation. The model is refined to specify the computation and communication times for each of these parts. Models are developed for Wang’s partition method and a block multithreaded variant of Wang’s method.

Implementation of relaxation follows the SPMD model. A logical P_I × P_J processor mesh is assumed to contain the data for the decomposed two-dimensional domain. Each processor in the mesh executes the same set of instructions on its portion of the data. The relaxation pseudocode in Figure 1 does not contain explicit communication calls. Where data are indexed outside loop limits, the actual implementation must provide data from a remote processor by using MPI communication functions.

‡‡‡‡For this task, the SMG implementation uses Wang’s partition method.

Figure 1. Pseudocode for 2D relaxation.

3.3.2. Setup of line equations

In order to simplify the computational model for relaxation, assume a square discretized domain of N × N points, where N = 2^n. Assume that P_I and P_J are also powers of two and that P_I and P_J need not be equal. The finest grid is decomposed so that each processor receives a subdomain of size N_I × N_J, where N_I = N/P_I and N_J = N/P_J. The multigrid levels are processed sequentially; parallelization occurs within each level. Dependence on multigrid level manifests only in the value of N_J^lev = 2^(1−lev) N_J that is passed to RELAX2. To simplify the rest of the discussion, an abbreviated form N̂_J will be used to denote N_J^lev on an arbitrary level.

Referring to the pseudocode in Figure 1, the time for an invocation of RELAX2 can be separated into non-overlapping phases, so that T_RELAX2 = T_setup + T_solve. Calculations on any black line are independent of calculations on any other black line; this is due to the compact stencil. Red line computations are similarly independent. Therefore, within a block of either color, computation can proceed in parallel. The time for setting up the equations for the blocks of red and black lines can be separated into computation time, T_setup^calc, and communication time, T_setup^comm. Communication events are present implicitly in the pseudocode as references outside the array bounds of the solution variable u.
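The level-dependent size N_J^lev = 2^(1−lev) N_J can be computed directly; this one-line helper (a hypothetical name) assumes the finest level is numbered lev = 1 and that N_J is a power of two, as above.

```python
def njlev(lev, NJ):
    """Local J-extent passed to RELAX2 on multigrid level lev (finest = 1):
    N_J^lev = 2**(1 - lev) * N_J, which halves with each coarser level."""
    return NJ // 2 ** (lev - 1)
```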


The time for an invocation of the tridiagonal solver, T_solve, can also be split into computation and communication times. This will be done in Section 3.3.3. Assume τ is the average time for a floating-point operation; then

T_setup^calc = 2(12τ N_I (N̂_J/2))    (2)

Assume α is the average latency of communication between hosts and β is the average time for transfer of one data word; then

T_setup^comm = 2(T_EW + T_NS) = 2(2(α + β N̂_J) + 2(α + β N_I))    (3)

T_EW is the time required for exchange of u-values on east and west boundaries of the subdomain contained within a processor. T_NS is the time for exchange of u-values on north and south boundaries. These boundary exchanges occur during equation setup for red and black lines.

Combining equations, the time for line equation setup with an arbitrary data size N_I × N̂_J is modelled as

T_setup = (12τ N_I N̂_J) + 4(α + β N̂_J) + 4(α + β N_I)    (4)

In practice, domain decomposition and stencil data structures engender directional dependence for α and β. Derivation of the model for tridiagonal solves appears in the next section. Addition of parameters for thread overhead will complete the parameterization of the relaxation model, in Section 3.3.4.

3.3.3. Line solvers

In this section, the time complexity of Wang’s partition method is derived. Operation counts are presented first for solution of a single-line equation. Then the model for a block-structured version of Wang’s algorithm is developed. Finally, the models for threaded versions of the setup and solve phases are combined into the full model for the relaxation kernel.

Assume that P processing elements (or PEs), logically numbered 0 . . . P − 1, are available to solve a tridiagonal system of order NP. Before invocation of Wang’s algorithm, distribute N equations to each of the P processors. The partition method [51] then proceeds by these steps:

1. Upper-triangularize diagonal blocks.
2. Eliminate superdiagonals of diagonal blocks.
3. Eliminate non-zero elements of superdiagonal blocks.
4. Eliminate columns below the main diagonal.
5. Eliminate columns above the main diagonal.
6. Solve the diagonal system.

Figure 2 shows pseudocode for the forward elimination phase of the Wang algorithm. Message transfers are present implicitly as references outside the array bounds. Step (4a) requires that logical PE i_p receive values of cc(N), aa(N), and u(N) from logical PE i_p − 1 in order to calculate new values for cc(N) and u(N) in the local block for PE i_p. This data dependency causes the serialization of messages across the logical processor array. Parallel processing is resumed in step (4b). A similar serialization of message traffic occurs for the backward elimination phase. These communication patterns limit the scalability of Wang’s method.


Figure 2. Pseudocode for a portion of Wang’s partition algorithm.

The Wang partition algorithm is readily parallelized for distributed memory computers. From Figure 2, solution of a tridiagonal system by Wang’s method, for P_I > 1, is modelled as

T_W = 21τ N_I + (α + 4β) + (P_I − 1)(8τ + 2(α + 4β)) ≈ 21τ N_I + P_I (8τ + 2(α + 4β))    (5)

where τ represents the average time of a floating-point operation, and α, β represent communication latency and bandwidth parameters. The presence of the term P_I indicates that two communication sweeps and a small amount of computation must be serialized.

There are 2 · N̂_J/2 line solves in each invocation of RELAX2, giving the total time for solution of the tridiagonal line equations,

T_solve = N̂_J T_W = N̂_J (21τ N_I + (P_I − 1)(8τ + 2(α + 4β)))    (6)

This expression for T_solve suggests a potential optimization. Observe that communication for Wang’s single-line method occurs 2(P_I − 1)N̂_J times in an invocation of RELAX2. If a block structure with delayed communication is used, the number of communication events can be dramatically reduced. The factor N̂_J can be ‘pulled inside’ the Wang method, so that the time for the Wang block method is

T_WB = 21τ N_I N̂_J + (P_I − 1)(8τ N̂_J + 2λ(4N̂_J)),  where λ(n) denotes α + βn    (7)

Now there are fewer, possibly more costly, serialized communication events. The amount of parallel computation stays the same, but the computation that occurs in the serialized portion of the algorithm increases from 8τ to 8τ N̂_J.
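The reduction in serialized message count that motivates the block method can be made explicit. Per the text, single-line Wang communicates 2(P_I − 1)N̂_J times per RELAX2 invocation, while the blocked variant communicates 2(P_I − 1) times; the helper below (an illustrative name) simply tabulates both counts.

```python
def wang_comm_events(P_I, NJ_hat, blocked=False):
    """Serialized communication events per RELAX2 invocation:
    2*(P_I - 1)*NJ_hat for single-line Wang (forward plus backward
    elimination for every line), versus 2*(P_I - 1) when the NJ_hat
    lines are blocked and communication is delayed."""
    events = 2 * (P_I - 1)
    return events if blocked else events * NJ_hat
```

For P_I = 8 and N̂_J = 256, blocking cuts the event count from 3584 to 14, while each surviving message grows from 4 words to 4N̂_J words, per Equations (5) and (7).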

3.3.4. Hybrid model for relaxation

The block-structured Wang method, as well as line equation setup, is suitable for further parallelization using OpenMP. The strategy is to assign subblocks of line solves to the threads. Each time RELAX2 is invoked, an OpenMP parallel region is entered and threads are created. These threads are destroyed on exit from the parallel region. Communication inside a parallel region is done by only one of the threads. Logical P_I × P_J process meshes are assumed. Each process can use up to ρ threads.

The model for block hybrid Wang is expanded to include terms for the thread scheduling overhead, τ_S, and the number of threads, ρ:

T_WBT = 5τ_S + (21τ N_I)N̂_J/ρ + (P_I − 1)(8τ N̂_J + 2λ(4N̂_J))    (8)


Addition of the solver term, T_WBT, to Equation (4) and application of threads to relaxation setup yield the hybrid model for relaxation, where τ_C is the thread creation time:

T_RELAX2 = τ_C + 4τ_S + (12τ N_I)N̂_J/ρ + 4λ(N̂_J) + 4λ(N_I) + T_WBT    (9)

Complete treatment of scalability requires a model for RELAX2 in its multilevel setting. The full-relaxation model is related to this single-level model in the following ways. Total calculation on all coarse grids is approximately equal to the calculation on the finest grid. Parallel overhead, including thread scheduling, on the coarse grids is roughly n = log2(N) times the parallel overhead on the finest grid.

Given its focus on model methodology, this research found interesting points in comparison of the multilevel model with measurements. Consequently, analysis of the model is given for various data sizes N_I × N_J.
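Equations (8) and (9) combine into an executable form of the single-level hybrid model. The sketch below transcribes the equations as written; the function names are illustrative, and all parameter values (τ, τ_C, τ_S, α, β) must be supplied from measurements such as those in Tables I and II.

```python
def lam(alpha, beta, n):
    """Message cost lambda(n) = alpha + beta*n, as defined with Equation (7)."""
    return alpha + beta * n

def t_wbt(tau, tau_S, NI, NJ_hat, P_I, rho, alpha, beta):
    """Threaded block Wang solver time, Equation (8)."""
    return (5 * tau_S + (21 * tau * NI) * NJ_hat / rho
            + (P_I - 1) * (8 * tau * NJ_hat + 2 * lam(alpha, beta, 4 * NJ_hat)))

def t_relax2(tau, tau_C, tau_S, NI, NJ_hat, P_I, rho, alpha, beta):
    """Hybrid relaxation time, Equation (9): thread overheads, threaded
    setup, boundary exchanges, and the solver term T_WBT."""
    return (tau_C + 4 * tau_S + (12 * tau * NI) * NJ_hat / rho
            + 4 * lam(alpha, beta, NJ_hat) + 4 * lam(alpha, beta, NI)
            + t_wbt(tau, tau_S, NI, NJ_hat, P_I, rho, alpha, beta))
```

With one process and one thread and all overheads zero, only the 12-operation setup and 21-operation solve terms remain, which is a quick sanity check on the transcription.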

3.4. Experiments and results

The same model problem is used for all experiments in this paper. The domain consists of the unit square Ω = [0, 1] × [0, 1], with a square subdomain [0.5, 1] × [0.5, 1]. Different partial differential equations are defined in each region depicted in Figure 3, yielding a system with discontinuous coefficients. In Regions 1 and 2, respectively, the equations are

∂²u/∂x² + ∂²u/∂y² = 0  and  ∂²u/∂x² + 100.0 ∂²u/∂y² = 0    (10)

Dirichlet boundary conditions u = 0 are defined.

The square domain of the model problem can be mapped onto the processors of a cluster of symmetric multiprocessors in a number of ways.

Figure 3. Model problem domain showing regions with different governing equations.


3.4.1. Decomposition strategies

The timing studies use two-dimensional Cartesian domains with evenly spaced discretization grids. These fine grids are divided as evenly as possible into rectangular subdomains that are allocated to MPI tasks on a cluster of SMPs. The assignment of coarse grid points to MPI tasks is made on the basis of the owner of the fine-grid points from which the coarse subgrids are defined. This decomposition leads to load imbalance, in the sense that some processors§§§§ have no work to do on some coarse levels.

Consider mapping a square problem domain to MPI tasks. Assume that the parallel platform is a cluster of four nodes, with each node having two processors. Across the top of Figure 4, three possible decompositions of a square domain are shown. Below each decomposition is a possible distribution to MPI tasks in the cluster. Data are labelled by the MPI rank of the process that owns the data. Arrows indicate the transfer of information at processor boundaries. The areas enclosed by dashed lines represent data from the domain decomposition that is assigned to each MPI task in a parallel job. The boxes that enclose MPI task data represent the compute node that is executing the individual tasks comprising the parallel job. The decomposition shown in Figure 4(a) can be operated on by four MPI tasks running on four processors or by four MPI tasks, each using two threads. In the former case, one CPU in each of the four nodes is idle; in the latter, all processors are busy.

Figure 4. Possible mappings of the simulation domain to MPI tasks for a 4-node, 2-way SMP cluster.

§§§§Several techniques for dealing with ‘sleeping processors’ have been investigated during the history of multigrid. Ignoring the sleepers is an easy and acceptable solution [47].


For fixed numbers of grid points, the simulation domain is evenly divided among the compute nodes. For scaled speedup studies, the problem domain is sized so that each compute node has the same number of fine-grid points. Within a node, these points are divided evenly among the MPI tasks or the OpenMP threads. Performance of domain-to-processor mappings is compared with predictions from the models. Results for pure message passing and hybrid paradigms are presented.

3.4.2. Computer systems

The platforms chosen for this case study, the Intel Teraflops and the Vplant visualization cluster at Sandia National Laboratories, are superficially similar. Both have dual-processor nodes connected by high-bandwidth, low-latency communication networks.

The Intel Teraflops is a massively parallel supercomputer composed of Intel Pentium processors connected by a proprietary communication backplane. The operating system on the compute nodes is a microkernel (Cougar). Parallel jobs are loaded into memory on a portion of the compute mesh that is allocated only for that job. Nodes can be requested by a batch queuing system or by direct launch in the interactive partition. Tests on the Intel Teraflops were run in the interactive partition as the only interactive user on the system.

The Vplant system is a cluster of Xeon processors connected by a Myrinet network. The compute nodes run the Linux operating system. Nodes are available only through the portable batch system (PBS) and are scheduled using the Maui Scheduler. There are two processor partitions available for general computation; one of these has slower processors than the other. For this set of studies, heterogeneous collections of Vplant nodes are not used. For the sake of brevity, only the timings from the slower nodes are reported, since these are more numerous. Tests were run on Vplant during a period of relatively low usage; frequently a sequence of timing studies was the only active job. It was expected that differences between the platforms would manifest in performance of RELAX2: a microkernel (Cougar), as opposed to a full operating system (Linux), should give less variability in run times.

3.4.3. Measurement of model parameters

This section describes the measurement of three groups of model parameters and explains their use in two versions of the model for block Gauss–Seidel relaxation. The first model that was developed for this research uses average values for floating-point cost and one set of average values for the communication parameters. We present the motivation for changing from this static definition of parameters to a more dynamic approach. The second model includes runtime information, in the form of a function that accounts for variations in floating-point cost that are due to data size. This version also uses three sets of communication parameter values that are based on point-to-point transfer patterns in the setup and solve phases of RELAX2. Parameter values for thread creation and scheduling are common to both model versions.

Latency and bandwidth were measured, using a ping-pong test, on the systems where the timing studies were done. The values so obtained were not appropriate for all communication patterns in RELAX2, so halo-exchange tests were also run. Communication times for a few representative message sizes are shown in Table I. The data are a composite of all communication data that were obtained from microbenchmarks. The first model, referred to as static, used this derived average parameter set. Closer examination of the microbenchmark output indicated the importance

Copyright © 2007 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2008; 20:903–940. DOI: 10.1002/cpe


Table I. Transfer times for a few MPI message sizes (µs), modelled as τ(M) = α + βM.

             Intel Tflops (α = 29.1)     Vplant (α = 13.7)
  M          τ(M)        βM              τ(M)        βM
  128         43.1       14.0            20.1         6.4
  256         57.2       28.1            26.5        12.8
  512         85.3       56.2            39.3        25.6
  1024       142        113              64.9        51.2
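One way the composite timings of Table I can be reduced to the linear model τ(M) = α + βM is an ordinary least-squares fit; the paper does not state its fitting procedure, so this is a sketch. The data points below are the Vplant column of Table I (message size in 4-byte words, time in µs); the fit recovers the tabulated α = 13.7 µs.

```python
# Least-squares fit of the linear transfer-time model tau(M) = alpha + beta*M
# to ping-pong/halo-exchange measurements. The four points are the Vplant
# column of Table I; a full benchmark sweep would be handled the same way.

def fit_latency_bandwidth(sizes, times):
    """Return (alpha, beta) minimizing sum((alpha + beta*M - t)^2)."""
    n = len(sizes)
    mean_m = sum(sizes) / n
    mean_t = sum(times) / n
    cov = sum((m - mean_m) * (t - mean_t) for m, t in zip(sizes, times))
    var = sum((m - mean_m) ** 2 for m in sizes)
    beta = cov / var
    alpha = mean_t - beta * mean_m
    return alpha, beta

sizes = [128, 256, 512, 1024]        # message size M in 4-byte words
times = [20.1, 26.5, 39.3, 64.9]     # measured transfer time in microseconds

alpha, beta = fit_latency_bandwidth(sizes, times)
print(f"alpha = {alpha:.1f} us, beta = {beta:.4f} us/word")
# alpha = 13.7 us, beta = 0.0500 us/word
```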

Table II. Thread creation and scheduling overhead (µs).

                 δC        δS
  Intel Tflops   0.003     8.003
  Vplant         0.007     6.035

of the directional differences in the communication parameters. The second model, denoted by dynamic, incorporates these variations, as explained below.

The term 'directional' requires explanation. The platforms used for these experiments have different conventions for the assignment of MPI tasks; one uses a cyclic distribution and the other uses a block distribution. The effects of the different conventions are closely tied to data layout. For example, in the model for Vplant, the communication time for north/south transfers should be larger due to contention on the single communication link as it is accessed by two MPI tasks. Also, about half the east/west transfers in the serialized message sections of the Wang solver should proceed at memory-to-memory speed instead of network speed. In Section 3.4.5, modifications to the communication costs, for messages that do not leave a node and for those that contend for a single communication link, are introduced into the model. The expectation that this more accurate reflection of communication events makes the model a better predictor of performance is examined in Section 4.1.3.

Measurement of the overheads associated with threads was done with a small set of benchmarks

based on the methods described in [31]. Both of the testing platforms have two-processor nodes. The thread scheduling times for two threads are shown in Table II. The extremely low values for thread creation suggested that a thread pool is created only once, at process startup time. On Teraflops, this is the case: a second process is used as a 'thread' that is activated when parallel regions are entered. Linux 'threads' are implemented as lightweight processes and scheduled by the process scheduler.

The model parameters τ^S_avg and τ^T_avg denote the average time for floating-point operations in the setup and solve phases of relaxation, respectively. Table III presents these parameters for Intel Teraflops and the Vplant cluster. The values in the table show the effects of memory subsystem contention when both processors on a node are used. Measurements of the basic operations (addition, multiplication, division) were made in the context of two-dimensional arrays. These measured performance values were weighted by operation counts to produce the averages.
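The weighted averaging just described can be sketched as an operation-count-weighted mean of the measured basic-operation times. The per-phase operation counts below are hypothetical placeholders (the actual counts are derived from the RELAX2 source, which is not reproduced here); for illustration, an equal add/mult mix with no divisions happens to reproduce the Tflops 1 ppn setup average of 34.4 ns in Table III.

```python
# Operation-count-weighted average of measured floating-point times, as used
# to form tau_avg from tau_add, tau_mult, tau_div. The operation counts here
# are HYPOTHETICAL placeholders, not the paper's actual RELAX2 counts.

def weighted_fp_average(times, counts):
    """times, counts: dicts keyed by operation name."""
    total_ops = sum(counts.values())
    return sum(times[op] * counts[op] for op in counts) / total_ops

tflops_1ppn = {"add": 33.6, "mult": 35.2, "div": 116.0}   # ns, from Table III
setup_counts = {"add": 6, "mult": 6, "div": 0}            # hypothetical mix

print(weighted_fp_average(tflops_1ppn, setup_counts))     # ~34.4 ns
```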


Table III. Average times in ns for floating-point operations using one process per node (1 ppn) and two processes per node (2 ppn).

                  τ_add    τ_mult    τ_div    τ^S_avg    τ^T_avg
  Tflops, 1 ppn    33.6     35.2      116      34.4       50.3
  Tflops, 2 ppn    54.0     56.6      174      55.3       81.5
  Vplant, 1 ppn    10.2     10.2      23.3     10.2       12.1
  Vplant, 2 ppn    19.1     18.5      24.3     18.8       20.4

Table IV. Effective floating-point operation times (ns), tabulated as a function of data size N × N, for Tflops. Separate values for combinations of operations in the setup and solve phases of RELAX2 are given.

            1 ppn                  2 ppn
  N         τ^S_eff   τ^T_eff     τ^S_eff   τ^T_eff
  16        26.22     46.48       84.73     112.89
  32        19.84     45.50       46.36      72.29
  48        17.94     45.21       36.89      76.92
  64        17.99     46.37       38.39      81.66
  96        21.38     60.05       39.58      80.70
  128       24.06     65.68       39.72      80.38
  192       24.45     63.94       39.04      79.88
  256       25.39     67.35       39.15      84.64
  384       30.01     75.69       41.85      87.48
  512       30.72     81.22       42.18      93.69
  768       31.53     88.01       42.30      96.39
  1024      32.15     91.55       43.39      99.18

The dynamic model uses functions for effective floating-point performance, τ^S_eff for tridiagonal system setup and τ^T_eff for the line solves. These functions are constructed by measuring single-node performance of RELAX2 on N × N data sets, where N ranges from 16 to 1024. Timings for the setup and solve phases of relaxation are gathered from instrumentation inserted in the source code. The measurements are made for one process per node (1 ppn) and for two processes or two threads per node (2 ppn). Tables of these values, as functions of data size, are input to the model. For an example of effective floating-point measurements, see Table IV. This method gives better agreement between model and measured execution times for the different data sizes than the averaging of fundamental floating-point operations. Evidence for this claim is presented in the next section.

3.4.4. Experimental results

Briefly, the dynamic model experiments were conducted as follows. Logical processor grids, of size P_I × P_J in powers of two up to 128, were used for these experiments. The processor grids


correspond to square arrays of two-way nodes, from 1 to 64. For strong scaling studies, a fixed global domain, of size N × N, was divided evenly among the compute nodes. The fixed-domain scaling decompositions followed the schemes laid out in Figure 4. For weak scaling studies, the problem domain was discretized so that each node received a square subdomain of size N × N. Within a node, these points were divided evenly among the MPI tasks or the OpenMP threads.

The results of the aforementioned investigation were presented at a conference on Linux clusters [52]. The timing measurements were made for RELAX2 in the context of a complete SMG V-cycle. Likewise, the times predicted by the model were for multiple-level execution of relaxation. In general, the multilevel relaxation model¶¶¶¶ is a better predictor of performance for Intel Tflops than for Vplant.

Examination of the N = 256 data showed that the agreement between model and measurement for this data set is not quite as good as for other experiments on Tflops. Accordingly, this data set, which was part of the weak scaling sequence, was chosen for more detailed analysis. The data size on the finest grid level was selected for comparison of the predictions of the static and dynamic models with the measured performance of relaxation on this grid level. In the context of an investigation of hybrid parallelism, where the goal was to determine the best way to use a cluster of dual-processor nodes, the weak scaling study assigned a constant amount of work to each compute node. In particular, for this data set, the global data size was adjusted for different processor counts so that each node received a grid of size 256 × 256.

Within a node, the processors were allocated to the work in four ways that are designated by

single characters as follows.

s: One processor was left idle, while the other executed one MPI task that operated on the 256 × 256 grid.

t: One MPI task was allocated, with a thread invoked to assist with work in the loops. While the thread was active, each processor operated on subdomains of size 256 × 128. During phases of thread inactivity, the MPI task worked on the local 256 × 256 grid.

v: Each processor executed one MPI task that operated on 256 × 128 grid points.

h: Each processor executed one MPI task that received a subdomain of 128 × 256 points.
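The four configurations can be summarized by the local subdomain shape each processor works on; a small sketch, with shapes taken directly from the descriptions above:

```python
# Local subdomain shape (I-extent, J-extent) for each of the four nodal
# configurations (s, t, v, h) described in the text, given nI x nJ points
# per node.

def local_shape(config, nI=256, nJ=256):
    if config == "s":              # one MPI task, one processor idle
        return (nI, nJ)
    if config == "t":              # one MPI task plus a helper thread
        return (nI, nJ // 2)       # per-processor work while thread is active
    if config == "v":              # two MPI tasks, domain split in J
        return (nI, nJ // 2)
    if config == "h":              # two MPI tasks, domain split in I
        return (nI // 2, nJ)
    raise ValueError(config)

for c in "stvh":
    print(c, local_shape(c))
# s (256, 256) / t (256, 128) / v (256, 128) / h (128, 256)
```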

Table V compares the output of the static and dynamic models with execution times for different numbers of Tflops compute nodes. It is clear that the dynamic model version produces predictions that are closer to measured run times. Despite the significant improvement when effective values are used instead of averages for the floating-point parameters in the model, the model is still not totally successful.

A graphical presentation of the comparison of the dynamic and static models with measured execution time for RELAX2 may be helpful. Define a metric for the relative success of a model, for a particular data size and node configuration, as

S_model = (T_model − T_mexec) / T_mexec    (11)

¶¶¶¶Recall that the model stated in Equation (4) is for relaxation in an arbitrary single level of a V-cycle.


Table V. Comparison of models with measured run times: execution times for the four nodal configurations on Tflops, compared with predictions of two models for 1, 4, 16, and 64 compute nodes. Local data size is 256 × 256 grid points.

  Configuration   Nodes   Tmexec    Tdynamic   Tstatic
  s                  1     78.18     77.35      58.28
                     4    111.8     113.5       83.59
                    16    115.8     114.1       84.03
                    64    118.0     115.2       84.90
  t                  1     48.42     51.59      29.22
                     4     66.6      74.66      42.59
                    16     71.33     75.29      43.02
                    64     73.84     76.53      43.90
  v                  1     43.53     38.94      29.75
                     4     56.05     57.00      42.27
                    16     58.13     57.37      42.54
                    64     59.24     58.10      43.10
  h                  1     60.72     55.23      41.90
                     4     68.84     55.97      42.78
                    16     72.19     57.06      43.65
                    64     77.43     59.23      45.39

Figure 5. Relative success of predictions for RELAX2, static vs dynamic models, local data size 256 × 256 on Tflops: (a) static and (b) dynamic.

where T_model is the execution time predicted by the model and T_mexec is the measured execution time. Figure 5 illustrates the relative success of the two model versions for data size 256 × 256. This figure represents computational experiments run on 64 nodes, for a total of 128 processors.
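The metric of Equation (11) is a signed relative error, negative when the model underpredicts. As a sketch of its use, the 'v' rows of Table V give:

```python
# Relative-success metric of Equation (11):
#   S_model = (T_model - T_mexec) / T_mexec
# applied to the 'v' configuration rows of Table V (Tflops, 256x256 per node).

def relative_success(t_model, t_mexec):
    return (t_model - t_mexec) / t_mexec

nodes  = [1, 4, 16, 64]
t_exec = [43.53, 56.05, 58.13, 59.24]   # measured
t_dyn  = [38.94, 57.00, 57.37, 58.10]   # dynamic model
t_stat = [29.75, 42.27, 42.54, 43.10]   # static model

for n, tm, td, ts in zip(nodes, t_exec, t_dyn, t_stat):
    print(f"{n:3d} nodes: S_dynamic = {relative_success(td, tm):+.3f}, "
          f"S_static = {relative_success(ts, tm):+.3f}")
```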


Timings on larger configurations were not collected due to constraints on system availability and utilization.

Close examination of the timing results revealed that the solve phase of relaxation showed large variation in per-processor run times. The assumption that this was due to aberrations in communication time was unfounded. In reality, there was a load imbalance induced by data values. While each processor had exactly the same number of floating-point operations in the stages of the Wang partition method, some processors had zeros as operands for these instructions. The processors on the boundary of the simulation domain, where zero values arose due to boundary conditions, took less time in their computation phases than the other processors. While the load imbalance caused by zeros in calculations may not be significant, calculations that involve denormalized numbers may well be noticeable. For this reason, modellers should examine cases of unexpected load imbalance in computations with specially handled data values.

The differences in relative model success for the two testing platforms are partially explained

by the difference in MPI task allocation. Suppose that a set of eight MPI tasks is assigned to four compute nodes. In block allocation, on Vplant, contiguously numbered tasks are assigned to a node, so that tasks 0 and 1 are together, as are 2 and 3, 4 and 5, and 6 and 7. Cyclic distribution of tasks, on Tflops, on the other hand, collocates 0 and 4, 1 and 5, 2 and 6, and 3 and 7. The allocation of MPI tasks to nodes can have a marked effect on communication performance.

Where communication is predominantly between logical nearest neighbors, block allocation can result in some message transfers via internal memory, while others contend for access to the communication network. Each type of transfer needs its own parameters for latency and bandwidth. The dependence of the model on tacit assumptions about the assignment of data to MPI tasks requires explicit expression. Cyclic task allocation combined with logical nearest-neighbor communication causes most messages to contend for the network interface component of the node. Model parameters for communication latency and bandwidth should reflect this contention.
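The two allocation conventions can be made concrete with a small rank-to-node map; with it one can check which logical nearest-neighbor transfers stay inside a node. A sketch, assuming the simple contiguous and modulo maps described in the text:

```python
# Node assignment for the two MPI task-allocation conventions: block (Vplant)
# packs consecutive ranks onto a node; cyclic (Tflops) strides ranks across
# nodes. Nearest-neighbor transfers that stay on a node run memory to memory;
# the rest cross (and contend for) the network interface.

def node_of(rank, ntasks, nnodes, scheme):
    per_node = ntasks // nnodes
    if scheme == "block":
        return rank // per_node
    if scheme == "cyclic":
        return rank % nnodes
    raise ValueError(scheme)

def intranode(r1, r2, ntasks, nnodes, scheme):
    return node_of(r1, ntasks, nnodes, scheme) == node_of(r2, ntasks, nnodes, scheme)

# Eight tasks on four two-way nodes, as in the example in the text.
for scheme in ("block", "cyclic"):
    pairs = [(r, r + 1) for r in range(7)]          # logical nearest neighbors
    local = [p for p in pairs if intranode(*p, 8, 4, scheme)]
    print(scheme, "intranode neighbor pairs:", local)
# block  -> [(0, 1), (2, 3), (4, 5), (6, 7)]
# cyclic -> []   (0 is collocated with 4, not with 1)
```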

3.4.5. Effects of task allocation

The next set of equations extends the model to account for MPI task allocation effects. For simplicity of presentation, these equations are developed for an interior node in a P_I × P_J processor grid, where P_I and P_J are greater than two. A weak scaling study, in the sense of the hybrid parallel experiments described in the previous section, is assumed.

T_RELAX2 = T_setup + T_solve    (12)

T_setup = T^calc_setup + T^comm_setup    (13)

T^calc_setup = δ_C + 4δ_S + 12τ N_I ⌈N̂_J⌉    (14)

T^comm_setup = 4(α_I + β_I N̂_J) + 4(α_J + β_J N_I)    (15)


T_solve = T^calc_solve + T^comm_solve    (16)

T^calc_solve = 5δ_S + 21τ N_I ⌈N̂_J⌉ + (P_I − 1)(8τ N̂_J)    (17)

T^comm_solve = 2[(P_I/2)(α_M + β_M · 3N̂_J) + (P_I/2 − 1)(α_N + β_N · 3N̂_J)]    (18)

The model of relaxation in Equation (4) is less complex than the model expressed in Equations (12)–(18). The directional components in Equation (15) allow the model to account for differences in the degree of contention between the block and cyclic task allocations. The separation of parameters in Equation (18) distinguishes internode from intranode communications for the block allocation. These parameters are equal for cyclic allocation on Tflops; however, the degree of network interface contention differs between the setup and solution communication phases. Figures 6 and 7 depict the general cases for nodes that are interior to the processor mesh as applied to the decomposition of the simulation domain.

Terms that adjust communication parameters based on data layout with respect to MPI task allocation make the model harder to understand and to apply. For example, in the cyclic allocation on Tflops, processors P and P′ communicate with each other. The relation between the two, in logical task space, is P′ = P + (P_total/2). It is possible that logical tasks P and P′ communicate intranode, in some cases.

Figure 6. In the block allocation of MPI tasks on Vplant, roughly half the communications in one direction are memory to memory. In the other direction, there is more contention for the network interface component: (a) traffic patterns for 'h' decomposition and (b) traffic patterns for 'v' decomposition.


Figure 7. In general, for the cyclic allocation of MPI tasks on Tflops, communications with logical nearest neighbors are internode: (a) traffic patterns for 'h' decomposition and (b) traffic patterns for 'v' decomposition.

The original idea for the model was to keep it as simple as possible. It is therefore reasonable to ask whether the additional complication adds sufficient value. Quantification of the effects of model enhancements, without recourse to a trial-and-error approach, would be helpful. Examples of issues that one might be concerned about are: (1) the impact on the model of using averages for parameter values; (2) the effects of the accuracy and variation of measured parameters; (3) the effectiveness of the model when scaling the problem domain size or number of processors beyond the range in which the model was validated. In the next section these issues are addressed by adapting techniques for performance measurement to evaluate the performance of a model.

4. MODEL ANALYSIS

Fidelity, adaptability, and scalability are desirable features for a performance model. This section proposes a number of approaches for evaluating the fitness of such a model with respect to its intended function. This research is at a preliminary stage.


Given this semi-empirical construction of the model, it seems natural to ask whether the model has enough fidelity to resolve the predictions for the cases of interest. Some model parameters are measured quantities, which have some degree of uncertainty.

Consider which parameters have the potential to produce the largest variations in the model's output values when the parameter value is perturbed. Sensitivity analysis attempts to address this question. The weak scaling study for RELAX2 provides the setting for the discussion of model sensitivity to parameter perturbations.

An adaptable model separates application parameters from system parameters. Adapting such a model to another system is accomplished by substituting a new set of system parameters. Scalability means that the model retains its integrity as the application data size or the number of processors increases. A model analysis should include not only a careful derivation of complexity expressions but also a study of the precision and accuracy of the measured parameters in the model. Section 4.2 suggests approaches to verify these qualities in a model.

4.1. Sensitivity analysis

Mathematical sensitivity analysis treats the model equations in a rigorous mathematical way. This technique gives general results and allows one to make claims of a theoretical nature. For a model that represents an algorithm running on a particular architecture, a more direct approach may give some useful results. Empirical sensitivity analysis is straightforward: vary the input and observe variations in the output. The ratio of increments in output to increments in input is the ratio of interest.

The model for a relaxation kernel of an iterative linear solver was presented in detail earlier. Here some of the motivation is restated to derive a simplified model to use as a specific example of sensitivity analysis. Assume a weak nodal scaling study in which the total data size on a dual-processor compute node is kept constant as the number of nodes is increased. The local node data can be further subdivided based on the way the processors within a node are applied to the problem.

If the data size assigned to a node is n_I × n_J, then the local MPI task and thread decompositions get data of size M_I × M_J. The single-character designators for the local data distribution schemes‖‖‖‖ and their associated data sizes are: s: M_I = n_I, M_J = n_J; h: M_I = n_I/2, M_J = n_J; v: M_I = n_I, M_J = n_J/2.

In the interest of simplicity, the discussion of sensitivity omits the thread option, t. The global data size is N_I × N_J = (M_I P_I) × (M_J P_J) for processor grid P_I × P_J. Models for the execution time of the RELAX2 setup and solve phases are

T_setup = 12τ M_I (M_J/2) + 2(α + β M_J) + 2(α + β M_I)    (19)

T_solve = κτ M_I (M_J/2) + (P_I − 1)(8τ M_J + 2(α + β · 4M_J))    (20)

where κ = 13 if P_I = 1, and κ = 21 otherwise.

The quantities that can vary in this representation of the model are of two types. The system parameters τ, β, and α are small measured quantities and have a random component. The application

‖‖‖‖ A diagram of the possible data distributions appears in Figure 4 and a description of the designators is given in Section 3.4.4.


and mapping parameters are operation counts and data sizes. These parameters are controlled; that is, they are specified rather than measured.

Some quantities that can affect performance predictions can be measured or estimated. For example, the number of network links per node can be used to approximate contention for the network interface. Alternatively, memory contention factors can be obtained from measurement. For the moment, differences between effective floating-point rates in the two computational phases of relaxation are obscured. Likewise, the directional differences and contention effects of the communication parameters do not appear.

4.1.1. Measured parameters

This semi-empirical modelling methodology often measures parameters that describe platform characteristics. The quantities τ, β, and α represent events of very short duration. It is impossible to measure these events directly because they are typically smaller than the resolution of the timers supplied by system libraries. An accepted engineering practice∗∗∗∗∗ is to measure a number of iterations of the event and report the average. The number of iterations should be chosen so that the overhead of calling the timing function is small compared with the duration of the iterated event. Because the reported time is an average, repetitions of the benchmark can be used to derive a mean and standard deviation for the event duration. According to [53], this practice can be justified on theoretical grounds, provided that a series of independent measurements of the iterated event is made.

Analysis of measurement precision has a well-known body of methods and techniques [17]. The relevant point is that the measured parameters in a semi-empirical model can be said to have tolerances, for example, τ ± Δτ. The tolerances may come from the mean and standard deviation of a parameter or from confidence intervals at a certain level of confidence. These ranges of parameter values can be used in perturbation analysis of a model.
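The iterate-and-average practice described above can be sketched as follows: time a long loop of the short event, divide by the iteration count, and repeat the whole benchmark to obtain a mean and standard deviation for the event duration. The event here is an arbitrary Python expression, purely for illustration.

```python
# Iterate-and-average timing of an event too short to measure directly.
# One loop amortizes the timer-call overhead over many repetitions of the
# event; repeating the loop yields a mean and standard deviation.
import time
from statistics import mean, stdev

def measure(event, iterations=50_000, repetitions=20):
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        for _ in range(iterations):
            event()
        elapsed = time.perf_counter() - start
        samples.append(elapsed / iterations)   # average duration of one event
    return mean(samples), stdev(samples)

# Illustrative event: a floating-point multiply-add in pure Python.
x = 1.000001
mu, sigma = measure(lambda: x * x + x)
print(f"event duration = {mu * 1e9:.1f} ns +/- {sigma * 1e9:.1f} ns")
```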

4.1.2. Controlled parameters

Application and mapping parameters do not vary in a statistical sense. They depend on system capacity, such as the amount of available memory on a compute node, or on the problem definition. For the weak scaling study of relaxation, the controlled variables are local subdomain size and aspect ratio, together with processor grid size and aspect ratio. The paradigm for MPI task allocation is included here, because some systems permit user control. It is also possible to reorder MPI communicators to get a more effective allocation than the system assigns.

Experiments and benchmarks showed that the measured parameters depend on the values of the controlled parameters. Effective floating-point performance depends on the size and aspect ratio of the data arrays. Communication latency and effective bandwidth depend on message size. Communication latency also depends on data layout in memory.

Sensitivity analysis can indicate which parameters have the greatest potential to degrade the quality of model predictions. For example, if the computation time is far greater than the communication

∗∗∗∗∗Two uses of this technique are [17,31].


time, then perturbations in communication parameters will have less effect on the total time than the floating-point parameter. In this case, the accuracy and precision of the floating-point cost should receive more careful attention. The tolerances on a parameter are an indicator of the precision with which it was measured. Perturbation of the parameter value in the model may be a way to investigate the accuracy of the measured parameter. More analysis and research needs to be done on this issue.

4.1.3. Case study: parameter sensitivity in RELAX2

Sensitivity analysis can be done on the complete RELAX2 model; however, a simple example that isolates a portion of it can serve to illustrate the method. The setup phase is matrix–vector multiplication, which has been studied thoroughly [7,54]. Consider only the second computational phase, solution of a block of tridiagonal equations. Recall that a motivation for the weak scaling study was the effective use of resources in a compute node.

Where A_p is the parallel overhead incurred by p processors and T_1 is the time taken by one processor, define a metric for ineffectiveness

R_p = A_p / T_1    (21)

Parallel efficiency, T_1/T_p = T_1/(T_1 + A_p) = 1/(1 + R_p), increases as R_p decreases.

4.1.3.1. Mathematical sensitivity. First, we consider mathematical sensitivity for parameters in the model for block tridiagonal solves. The following assumptions are made in order to simplify the exposition. The model is specific to the weak scaling study described in Section 3.4.4. Therefore, the data size on each of the compute nodes is the same as that on the single processor. Further, assume that there is one MPI process per node†††††, so that the local data are square, M × M. With p processors allocated as a P × Q grid, the model in Equation (20) becomes

T_solve = 13τ M²/2 + σ(8τ M²/2) + (P − 1)(8τ M + 2(α + β · 4M))    (22)

where σ = 0 if P = 1 and σ = 1 if P > 1. For one processor, the modelled execution time is

T_{1×1} = 13τ M²/2    (23)

Assuming P > 1, the parallel overhead for P × Q processors is

A_{P×Q} = 8τ M²/2 + (P − 1)(8τ M + 2(α + β · 4M))    (24)

Notice the O(M²) computational component that appears in the expression for parallel overhead. This is one reason why efficiency drops sharply between P = 1 and P = 2.

Taking partial derivatives with respect to the platform-specific parameters that represent floating-point operation cost, τ, communication bandwidth, β, and latency, α, we obtain expressions for

††††† This is the s local data distribution scheme, as described in Section 4. Analysis for the other data mappings within a node would be similar.


model sensitivity to these parameters. As a basis for comparison, the modelled execution time for one processor depends only on τ, with sensitivity

∂T_{1×1}/∂τ = 13M²/2    (25)

Parallel overhead depends on τ, β, and α, with sensitivity to each parameter independently:

∂A_{P×Q}/∂τ = 8M²/2 + (P − 1)(8M)    (26)

∂A_{P×Q}/∂β = (P − 1)(2 · 4M)    (27)

∂A_{P×Q}/∂α = (P − 1)(2)    (28)
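The analytic sensitivities in Equations (26)–(28) can be checked numerically: because the overhead model of Equation (24) is affine in each parameter, central finite differences recover the partial derivatives essentially exactly. The parameter values below are the Tflops averages from Table VI.

```python
# Numerical check of Equations (26)-(28): central finite differences of the
# parallel-overhead model of Equation (24) recover the analytic partial
# derivatives with respect to tau, beta, and alpha.

def overhead(tau, beta, alpha, P, M):
    return 8 * tau * M * M / 2 + (P - 1) * (8 * tau * M + 2 * (alpha + beta * 4 * M))

def central_diff(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

P, M = 16, 128
tau, beta, alpha = 50.4e-9, 22.2e-9, 23.0e-6   # Tflops averages (Table VI)

d_tau   = central_diff(lambda t: overhead(t, beta, alpha, P, M), tau)
d_beta  = central_diff(lambda b: overhead(tau, b, alpha, P, M), beta)
d_alpha = central_diff(lambda a: overhead(tau, beta, a, P, M), alpha)

assert abs(d_tau - (4 * M * M + 8 * M * (P - 1))) < 1e-3   # Eq. (26)
assert abs(d_beta - (P - 1) * 8 * M) < 1e-3                # Eq. (27)
assert abs(d_alpha - 2 * (P - 1)) < 1e-6                   # Eq. (28)
print("finite differences match the analytic partial derivatives")
```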

Based on the above equations, some general observations are possible. First, if P and M are comparable, then the model is nearly as sensitive to changes in β as to changes in τ. P and M might have the same order of magnitude on an extreme-scale parallel computer. This situation could also arise if the available memory on a compute node does not allow the local data size to be large compared with the number of available compute nodes. Next, the model is relatively insensitive to changes in α unless M is quite small. Lastly, in the ranges of P and M that were used for the weak scaling studies on Tflops and Vplant, the model should be relatively insensitive to changes in β, except for the smallest local data size, 256 × 256.

Consequently, it appears that the most important parameter to estimate well is the floating-point cost. The effects of MPI task allocation on performance are much less important to capture, at least in the context of a weak scaling study. The incorporation of these effects into the model, as outlined in Section 3.4.5, may produce little change in the output from the model. Let us turn now to empirical sensitivity as another way to gain insight into model fitness.

4.1.3.2. Empirical sensitivity. An empirical study of sensitivity for the entire parameter space would be computationally intensive. Examination of the model in Equation (20) indicates that the sensitivity of the floating-point cost is O(M_I M_J), while that of the communication parameters is O(P_I M_J). For convenience, that model is restated here in a slightly different form:

T_solve = 13τ M_I (M_J/2) + σ(8τ M_I (M_J/2)) + (P_I − 1)(8τ M_J + 2(α + β · 4M_J))    (29)

where σ = 0 if P_I = 1 and σ = 1 if P_I > 1. The influence of the two types of parameters on the output value of the model should be most nearly comparable when M_I and P_I are as close as possible. For the weak scaling study, this situation occurs for the 'h' decomposition with local node data size 256 × 256 on 64 nodes. In this configuration‡‡‡‡‡, M_I = 128 and P_I = 16.

‡‡‡‡‡This ‘h’ configuration is one for which the Tflops model shows lack of fidelity.


Table VI. Average values for floating-point cost and communication parameters from the static model for the block tridiagonal solves in RELAX2. Units are nanoseconds per floating-point operation (τ), nanoseconds per 4-byte word (β), and microseconds for communication latency (α).

            τ        β        α
  Tflops    50.4     22.2     23.0
  Vplant    11.5     37.9      6.9

Table VII. Values of the relative sensitivity metric for Tflops, in various 'h' configurations. For data 128 × 256 per MPI task, T_{1×1} = 10.73. S_rel = T_{α,β}/(T_{1×1} + T_τ).

  P_I    P_J    T_τ      T_{α,β}    S_rel
   2      1     6.70     0.091      0.005
   4      2     6.91     0.274      0.016
   8      4     7.32     0.640      0.035
  16      8     8.15     1.370      0.073

The mathematical sensitivity analysis in Section 4.1.3.1 does not account for the relative sizes of the model parameters. This potential weakness is illustrated by Table VI, which lists the average parameter values that were used in the first (static) version of the RELAX2 model. Because τ and β are expressed in the same units, it is easy to see that the model in Equation (29) is more sensitive to τ for the data size and processor grid specified above. For a message of 256 four-byte words, the influence of β on the model output value is of the same order of magnitude as that of α. For small messages, the latency and inverse bandwidth parameters can have comparable influence on model output.

Let us define a metric for the relative sensitivity of the model for T_solve to the floating-point cost and communication parameters,

S_rel = T_{α,β}/(T_{1×1} + T_τ)    (30)

the ratio of the model terms that depend on communication parameters to the model terms that depend on the floating-point cost parameter. The components of S_rel are

T_{α,β} = (P_I − 1) · 2(α + β · 4M_J)    (31)

T_{1×1} = 13τ M_I M_J/2    (32)

and

T_τ = 8τ M_I M_J/2 + 8τ M_J (P_I − 1)    (33)
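The Table VII entries can be reproduced from the static Tflops parameters of Table VI. Note the (P_I − 1) multiplicity on the latency/bandwidth term T_{α,β}; it is required to reproduce the tabulated values.

```python
# Reproduction of the Table VII relative-sensitivity entries for Tflops from
# the static parameters of Table VI.

tau, beta, alpha = 50.4e-9, 22.2e-9, 23.0e-6   # s/op, s/word, s (Table VI)
MI, MJ = 128, 256                               # local data per MPI task

def s_rel(PI):
    t_1x1       = 13 * tau * MI * MJ / 2                       # Eq. (32)
    t_tau       = 8 * tau * MI * MJ / 2 + 8 * tau * MJ * (PI - 1)  # Eq. (33)
    t_alphabeta = (PI - 1) * 2 * (alpha + beta * 4 * MJ)       # comm terms
    return t_alphabeta / (t_1x1 + t_tau)                       # Eq. (30)

for PI in (2, 4, 8, 16):
    print(f"PI = {PI:2d}: S_rel = {s_rel(PI):.3f}")
# matches Table VII: 0.005, 0.016, 0.035, 0.073
```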

Table VII lists values of the relative sensitivity metric for Tflops for various processor counts with data size 128 × 256 per process. For these configurations, the influence of the floating-point cost on the model is clearly dominant. The parameter τ should be determined to a satisfactory level of


Table VIII. Values of the relative sensitivity metric for Vplant, for processor configuration 16 × 8 with various local data sizes. S_rel = T_{α,β}/(T_{1×1} + T_τ).

  M_I     M_J     T_{1×1}    T_τ      T_{α,β}    S_rel
  128     256       2.44      1.85    1.449      0.34
  256     512       9.76      6.71    2.614      0.16
  512     1024     39.0      25.4     4.942      0.077
  1024    2048    156.0      98.9     9.599      0.038

precision before adjustments to the communication parameters are tried. For larger numbers of processors, the communication parameters can be expected to become more important.

Table VIII illustrates the effect that data size can have on the relative sensitivity metric. Relative sensitivity is given for Vplant parameters in the model for T_solve with a fixed 16 × 8 processor grid, for data sizes starting in the range of the weak scaling study. The communication parameters, as should be expected, become less influential on model output as the data size increases. In the range of the weak scaling study, the influence of the communication parameters should not be ignored; however, the floating-point terms still dominate. For Vplant, τ should probably be adjusted first.

Observe that the first line of Table VIII corresponds to the last line of Table VII. The relative sensitivity metric for this data size and processor grid is much larger for Vplant than for Tflops. One way to account for the difference in performance between Tflops and Vplant for 16 × 8 processors, with local data size 128 × 256, is to compare the ratios of data transfer rate to floating-point speed for the two platforms.

Define the balance factor as the ratio of network bandwidth to processor speed. In model parameter terms, the balance factor is τ/β. A higher balance factor indicates that the network is 'fast' relative to the rate at which instructions are processed. This condition should have a positive effect on the scalability of an application. That is, as more data can be transferred in the time it takes to do a computation, the scalability of parallelism becomes greater.

Using manufacturer specifications, one can calculate an approximate balance factor, bf. For Intel

Teraflops,

bf = (400 megabytes/s) / (333 megaflops) = 1.2 (34)

while for Vplant,

bf = (500 megabytes/s) / (2000 megaflops) = 0.25 (35)
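The two manufacturer-specification calculations above reduce to a one-line ratio:

```python
# Balance factor from manufacturer specifications (eqs. 34-35):
# bf = network bandwidth / floating-point rate.

def balance_factor(bandwidth_mb_s, rate_mflops):
    return bandwidth_mb_s / rate_mflops

bf_tflops = balance_factor(400, 333)    # Intel Teraflops
bf_vplant = balance_factor(500, 2000)   # Vplant

print(round(bf_tflops, 1))  # 1.2
print(bf_vplant)            # 0.25
```

The nearly five-fold difference in bf is the quantitative basis for expecting Vplant to be more sensitive to communication parameters than Tflops.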

A parameterized analytic model should capture the effect of the balance factor on scalability. Recall the ‘ineffectiveness’ metric,

RP×Q = AP×Q / T1×1 = (Tγ + Tα,β) / T1×1 (36)


A smaller value of this metric indicates better parallel efficiency§§§§§. For data size 128 × 256 and processor grid 16 × 8, the ineffectiveness ratio for Tflops is 0.89, while the ratio is 1.35 for Vplant. In practice, Tflops is the more scalable of the two systems for this specific problem.
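Equation (36) can be checked against the Vplant figures in the first row of Table VIII (data size 128 × 256, grid 16 × 8):

```python
# The 'ineffectiveness' metric of eq. (36), evaluated with the Vplant
# component values from the first row of Table VIII.

def ineffectiveness(T_gamma, T_ab, T_11):
    return (T_gamma + T_ab) / T_11   # eq. (36)

R_vplant = ineffectiveness(T_gamma=1.85, T_ab=1.449, T_11=2.44)
print(round(R_vplant, 2))            # 1.35, the Vplant ratio quoted in the text

# Footnoted relation: parallel efficiency = 1 / (1 + R)
eff = 1 / (1 + R_vplant)
```

The recovered value 1.35 matches the ratio quoted for Vplant, confirming that the ineffectiveness metric is just the overhead terms normalized by the sequential time.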

4.1.3.3. Observations about sensitivity. Sensitivity analysis was carried out on the weak scaling model for the relaxation kernel from SMG. Both mathematical and empirical techniques indicated that floating-point cost has greater influence on model output, as compared with the influence of the communication parameters. Within the context of the weak scaling study, empirical sensitivity analysis examined the data and processor configuration where the model sensitivity to communication parameters was expected to be greatest.

On both testing platforms, the floating-point terms are dominant in the model. On Vplant the influence of communication parameters is much stronger than on Tflops. The difference in balance factor between the two machines is one reason for the observed difference in sensitivity.

4.2. Adaptability, scalability, and fidelity

The ultimate test of adaptability is fidelity on a platform different from the one for which the model was developed. Substitute system parameters for an untested platform into the model, make predictions, and see how closely those predictions match reality. This program was followed in the scaling studies: Tflops was used to define a model for relaxation, system parameters from Vplant were ‘plugged in’, and the scaling test results on Vplant were compared with predictions. Much of the effort in this research went into discovering reasons why Tflops is predictable, while Vplant is less so.

A test for model scalability is to extrapolate beyond the ranges where timing studies were done. The model should maintain fidelity as PI × PJ grows or as the memory footprint MI × MJ changes. Another test for the model is to interpolate processor counts between those in the scaling study. The nodal scaling studies were run for a few data sizes only, and benchmarks for the model parameters matched those sizes. Testing the model predictions at unusual data sizes would, no doubt, provide more research opportunities.

The scaling study timing data provides a basis for comparison of the model with reality. Therefore, the quality of that data should be carefully evaluated. At a minimum, the evaluation should include confidence intervals for the measured run times. For example, comparison of the run times for the local data distribution schemes, described in Section 4, would be useful. Perhaps those alternative schemes do not show statistically significant differences. In that case, the model should not be expected to distinguish the alternatives. A further evaluation would address whether the differences between model predictions and measured times are statistically significant.
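The evaluation suggested above can be sketched with a standard t-based confidence interval for repeated run-time measurements. The timings in the example are invented for illustration; they are not data from the scaling study.

```python
# A minimal sketch of a t-based confidence interval for measured run times.
import statistics

def confidence_interval(times, t_crit):
    """Return (low, high) for the mean run time.

    t_crit is the two-sided Student-t critical value for the chosen
    confidence level and len(times) - 1 degrees of freedom.
    """
    mean = statistics.mean(times)
    sem = statistics.stdev(times) / len(times) ** 0.5  # standard error
    half = t_crit * sem
    return mean - half, mean + half

# Five hypothetical runs; t_crit = 2.776 gives 95% confidence at 4 d.o.f.
lo, hi = confidence_interval([10.2, 10.5, 9.9, 10.4, 10.1], t_crit=2.776)
# If the intervals for two alternative data distribution schemes overlap,
# the schemes are not statistically distinguishable, and the model should
# not be expected to distinguish them either.
```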

4.3. Unanswered questions

In the context of multidimensional arrays, the most common data structures in the applicationsdiscussed in this work, measured parameters in the model depend on controlled parameters. Effective

§§§§§ Parallel efficiency = T1×1/TP×Q = 1/(1 + RP×Q).


floating-point performance depends on data size and the aspect ratio of the arrays. Without explicit reference to the memory hierarchy in the model, this semi-empirical approach relies on data fitting to obtain effective floating-point rates for the relaxation model. Further investigation is needed to discover an alternative that does not make the model unwieldy.

Communication latency depends on the size of the message. While asymptotic bandwidth is a hardware parameter, the message size must be quite large, or small messages quite numerous, in order to saturate the network capacity [34]. Smaller message sizes attain an effective bandwidth that can depend on a hardware protocol being different for short and long messages [55]. Empirical sensitivity of the model to these factors could be determined by experiment.

Contention for communication resources, whether on the network or in memory, seems to have a stochastic character. To what extent is this true? Is it possible or desirable to replace the linear model for point-to-point communication? To what degree can the uncertainty of network contention be quantified? A stochastic analysis of the communication mechanism should probably be undertaken to address this question.

The relaxation kernel provided a non-trivial modelling exercise. One reason for the performance modelling effort at Sandia National Laboratories is the prediction of the behavior of important applications on future architectures. Adaptation of this semi-empirical model methodology to a significantly different platform will provide further edification.
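The message-size dependence of effective bandwidth noted above can be sketched with the simple linear (latency plus per-byte) point-to-point cost model used throughout this work; the α and β values below are hypothetical.

```python
# Effective bandwidth under a linear point-to-point cost model.
# alpha (latency, seconds) and beta (seconds per byte) are hypothetical.

def effective_bandwidth(msg_bytes, alpha=20e-6, beta=1e-9):
    """Achieved bandwidth (bytes/s) for a single message of msg_bytes bytes."""
    t = alpha + beta * msg_bytes      # linear transfer-time model
    return msg_bytes / t

# Small messages are latency-bound and reach only a fraction of the
# asymptotic bandwidth 1/beta; large messages approach it.
bw_small = effective_bandwidth(1_000)
bw_large = effective_bandwidth(100_000_000)
assert bw_small < 0.1 / 1e-9          # well under 10% of asymptotic rate
assert bw_large > 0.8 / 1e-9          # within 20% of asymptotic rate
```

A protocol switch between short and long messages, as in [55], would replace the single (α, β) pair with size-dependent parameters.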

5. CONCLUSIONS

This work describes a methodology for modelling hybrid parallel application performance on clusters of symmetric multiprocessors. To provide a specific setting for the modelling effort, MPI was chosen as the explicit communication between processes and OpenMP as the implicit communication between threads. In addition to guidelines for model creation, the methodology includes techniques for estimating the variability of a particular system and application. In summary, the technique is composed of these steps.

1. Count computation and communication operations.
2. Define application and mapping parameters.
3. Define system parameters for the platform of interest.
4. Write the analytic model in terms of the application and system parameters.
5. Enhance the static model with dynamic information.
6. Perform quantitative analysis of the parameterized semi-empirical model.

The methodology can be stated simply; however, depending on the platform, a good application model may be difficult to create. Where models appear to be inadequate, a list of situations that may be the cause was given.

The modelling exercise for a sparse linear algebra kernel on Teraflops and Vplant illustrates some interesting points. A simple model that uses a few constant parameters seems to be a poor predictor of performance on these systems. Benchmarks that measure system parameters must mimic fairly closely the setting of the application that is being modelled. Care must be taken in the model, not only with communication patterns, but also with data layout and memory access patterns.


We have addressed the adequacy of a performance model for a parallel application in terms of three important characteristics: adaptability, scalability, and fidelity. Further, we note that sensitivity analysis can direct attention to the parameter(s) of greatest influence.

The subject of this research is endlessly fascinating to the authors. Opportunities for discovery, challenges to ingenuity, and surprises abound. Despite the apparent reluctance of some platforms to behave in a reasonable and predictable manner, the work required to produce a predictive model of an application is not wasted. One will always learn something noteworthy—about the system or the application or the hapless modeller.

ACKNOWLEDGEMENTS

This work was partially supported by Sandia National Laboratories, Albuquerque, NM 87185 and Livermore, CA 94550. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under Contract DE-AC04-94-AL85000.

REFERENCES

1. Gropp W, Lusk E, Skjellum A. Using MPI: Portable Parallel Programming with the Message Passing Interface. MIT Press: Cambridge, MA, 1994.

2. OpenMP Forum. OpenMP Fortran application program interface. OpenMP Architecture Review Board. http://www.openmp.org [December 2004].

3. He Y, Ding HQ. MPI and OpenMP paradigms on cluster of SMP architectures: the vacancy tracking algorithm for multi-dimensional array transposition. Proceedings of the IEEE/ACM SC2002 Conference, Baltimore, MD, November 2002.

4. Tafti DK, Wang G. Application of embedded parallelism to large scale computations of complex industrial flows. Proceedings of the ASME Fluids Engineering Division, Anaheim, CA, ASME-IMECE, November 1998.

5. Bush IJ, Noble CJ, Allan RJ. Mixed OpenMP and MPI for parallel Fortran applications. European Workshop on OpenMP 2000, Edinburgh, U.K., 2000.

6. Cappello F, Etiemble D. MPI versus MPI+OpenMP on the IBM SP for the NAS parallel benchmarks. Supercomputing 2000, Dallas, TX, 2000.

7. Chow E, Hysom D. Assessing performance of hybrid MPI/OpenMP programs on SMP clusters. Technical Report UCRL-JC-143957, Lawrence Livermore National Laboratory, 2001.

8. Etiemble D. Mixed-mode programming on clusters of multiprocessors. http://www.eecg.toronto.edu/∼de/Pa-06.pdf [January 2003].

9. Henty DS. Performance of hybrid message-passing and shared-memory parallelism for discrete element modelling. Supercomputing, Dallas, TX, 2000.

10. Krawezik G, Cappello F. Performance of MPI and three OpenMP programming styles on shared memory multiprocessors. SPAA’03 Fifteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, San Diego, CA, June 2003.

11. Mavriplis DJ. Parallel performance investigations of an unstructured mesh Navier–Stokes solver. Technical Report 2000-13, ICASE, Hampton, VA, 2000.

12. Pannala S, D’Azvedo E, Syamlal M, O’Brien T. Hybrid (OpenMP and MPI) parallelization of MFIX: A multiphase CFD code for modelling fluidized beds. SAC2003, Melbourne, FL. ACM: New York, March 2003.

13. Smith L, Bull M. Development of mixed-mode MPI/OpenMP applications. Scientific Programming 2001; 9:83–98.

14. Viet TQ, Yoshinaga T, Abderazek BA, Sowa M. A hybrid MPI-OpenMP solution for a linear system on a cluster of SMPs. SACSIS2003 Symposium on Advanced Computing Systems and Infrastructures, Tokyo, Japan, June 2003.

15. Falgout RD, Jones JE. Multigrid on massively parallel architectures. Technical Report UCRL-JC-133948, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, May 1999.

16. Kerbyson DJ, Wasserman HJ, Hoisie A. Exploring advanced architectures using performance prediction. Proceedings of the International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems, IWIA’02. IEEE Computer Society: Silver Spring, MD, 2002.

17. Lilja DJ. Measuring Computer Performance: A Practitioner’s Guide. Cambridge University Press: Cambridge, 2000.


18. Kerbyson DJ, Alme HJ, Hoisie A, Petrini F, Wasserman HJ, Gittings M. Predictive performance and scalability modeling of a large-scale application. SC2001 Proceedings, Denver, CO. ACM: New York, 2001.

19. Hoisie A, Lubeck O, Wasserman H. Performance and scalability analysis of teraflop-scale parallel architectures using multidimensional wavefront applications. International Journal of High Performance Computing Applications 2000; 14(4):330–346.

20. Adhianto L, Chapman B. Performance modeling of communication and computation in hybrid MPI and OpenMP applications. Proceedings of 12th International Conference on Parallel and Distributed Systems, ICPADS’06, Minneapolis, MN, Association for Computing Machinery, 2006.

21. Brehm J, Worley PH, Madhukar M. Performance modeling for SPMD message-passing programs. Technical Report ORNL/TM-13254, Oak Ridge National Laboratory, June 1996.

22. Boyd EL, Abandah G, Lee H-H, Davidson ES. Modeling computation and communication performance of parallel scientific applications: A case study of the IBM SP2. Technical Report CSE-TR-236-95, University of Michigan, MI, 1995.

23. Sreekantaswamy HV, Chanson S, Wagner A. Performance prediction modeling of multicomputers. Proceedings of the 12th International Conference on Distributed Computing Systems. IEEE Computer Society Press: Silver Spring, MD, 1992; 278–285.

24. Worley PH. Impact of communication protocol on performance. Technical Report ORNL/TM-13682, Oak Ridge National Laboratory, February 1999.

25. Worley PH, Robinson AC, Mackay DR, Barragy EJ. A study of application sensitivity to variation in message passing latency and bandwidth. Technical Report ORNL/TM-13250, Oak Ridge National Laboratory, June 1996.

26. Crovella ME, LeBlanc TJ. Parallel performance prediction using lost cycles analysis. Proceedings of Supercomputing ’94. IEEE/ACM: New York, 600–610.

27. Grama AY, Gupta A, Kumar V. Isoefficiency: Measuring scalability of parallel algorithms and architectures. IEEE Parallel and Distributed Technology 1993; 1(3):12–21.

28. Marin G, Mellor-Crummey J. Cross-architecture performance predictions for scientific applications using parameterized models. SIGMETRICS/Performance’04. ACM: New York, NY, June 2004.

29. Abandah GA, Davidson ES. Configuration independent analysis for characterizing shared-memory applications. Technical Report CSE-TR-357-98, Advanced Computer Architecture Laboratory, EECS Department, University of Michigan, MI, September 1997.

30. Dongarra J, Malony AD, Moore S, Mucci P, Shende S. Performance instrumentation and measurement for terascale systems. http://citeseer.ist.psu.edu/663429.html [Fall 2004].

31. Bull JM. Measuring synchronisation and scheduling overheads in OpenMP. First European Workshop on OpenMP, Lund, Sweden, 1999.

32. Mucci PJ, London KS. Low level architectural characterization benchmarks for parallel computers. Technical Report 394, Computer Science Department, University of Tennessee, July 1998.

33. Prabhakar A, Getov V. Performance evaluation of hybrid parallel programming paradigms. Performance Analysis and Grid Computing, Getov V, Gerndt M, Hoisie A, Malony A, Miller B (eds.). Kluwer Academic Publishers: Norwell, MA, 2004; 57–76.

34. Rabenseifner R. Hybrid parallel programming: performance problems and chances. Proceedings of 45th Cray Users Group Conference, Columbus, OH, May 2003.

35. Carrington L, Wolter N, Snavely A. A framework for application performance prediction to enable scalability understanding. Scaling to New Heights Workshop, Pittsburgh, PA, May 2002.

36. MacDonald S, Szafron D, Schaeffer J, Bromling S. Generating Parallel Program Frameworks from Parallel Design Patterns (Lecture Notes in Computer Science, vol. 1900). Springer: Berlin, 2001; 95–104.

37. Grove DA, Coddington PD. A performance modeling system for message-passing parallel programs. Technical Report DHPC-105, Adelaide University, Adelaide, Australia, 2001.

38. van Gemund AJC. Performance prediction of parallel processing systems: The PAMELA methodology. Proceedings of ACM International Conference on Supercomputing, Tokyo, Japan. ACM: New York, July 1993.

39. Adve VS. Analyzing the behavior and performance of parallel programs. PhD Dissertation, University of Wisconsin, WI, 1993.

40. Kapelnikov A, Muntz RR, Ercegovac MD. A methodology for performance analysis of parallel computations with looping constructs. Journal of Parallel and Distributed Computing 1992; 14:105–120.

41. Mak VW, Lundstrom SF. Predicting performance of parallel computations. IEEE Transactions on Parallel and Distributed Systems 1990; 1(3):257–269.

42. Mierendorff H, Schwamborn H, Tazza M. Performance modelling of grid problems—A case study on the SUPRENUM system. Parallel Computing 1994; 20:1527–1546.

43. Simon J, Wierum J-M. Performance prediction of benchmark programs for massively parallel architectures. Proceedings of the Tenth International Conference on High Performance Computer Systems, HPCS’96, Ottawa, Canada, June 1996.


44. van Gemund AJC. Symbolic performance modeling of parallel systems. IEEE Transactions on Parallel and Distributed Systems 2003; 14(2):154–165.

45. Adve VS, Vernon MK. Parallel program performance prediction using deterministic task graph analysis. ACM Transactions on Computer Systems 2004; 22(1):94–136.

46. Yan Y, Zhang X, Song Y. An effective and practical performance prediction model for parallel computing on nondedicated heterogeneous NOW. Journal of Parallel and Distributed Computing 1997; 41(2):63–80.

47. Brown PN, Falgout RD, Jones JE. Semicoarsening multigrid on distributed memory machines. SIAM Journal on Scientific Computing 2000; 21(5):1823–1834.

48. Lawry W, Wilson C, Maccabe AB, Brightwell R. COMB: A portable benchmark suite for assessing MPI overlap. IEEE International Conference on Cluster Computing, Chicago, IL, September 2002.

49. Shende S, Malony A. The TAU parallel performance system. The International Journal of High Performance Computing Applications 2006; 20(2):287–311.

50. Roth P, Miller B. On-line automated performance diagnosis on thousands of processes. Proceedings of 2006 ACM SIGPLAN Symposium on Principles of Parallel Programming, PPOPP’06. Association for Computing Machinery: New York, NY, 2006.

51. Wang HH. A parallel method for tridiagonal equations. ACM Transactions on Mathematical Software 1981; 7(2):170–183.

52. Goudy S, Liebrock L, Schaffer S. Performance analysis of a hybrid parallel linear algebra kernel. The 5th International Conference on Linux Clusters, Austin, TX. Linux Cluster Institute: Urbana, IL, May 2004.

53. Purdom PW Jr, Brown CA. The Analysis of Algorithms. Holt, Rinehart & Winston: New York, 1985.

54. Jin G, Mellor-Crummey J. Experiences tuning SMG98—A semicoarsening multigrid benchmark based on the hypre library. ICS’02, New York City, NY, June 2002.

55. Alexandrov A, Ionescu MF, Schauser KE, Scheiman C. LogGP: Incorporating long messages into the LogP model for parallel computation. Journal of Parallel and Distributed Computing 1997; 44(1):71–79.
