FAST AND ACCURATE SIMULATION ENVIRONMENT (FASE) FOR
HIGH-PERFORMANCE COMPUTING SYSTEMS AND APPLICATIONS
By
ERIC M. GROBELNY
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2008
© 2008 Eric M. Grobelny
To my family and friends
ACKNOWLEDGMENTS
I would like to thank, first and foremost, my parents, brother, and sister for their love
and guidance. I would also like to express my utmost gratitude to my advisor, Dr. Alan
George, for supporting me through graduate school and teaching me the necessary skills
to become a researcher and innovator. Another crucial person who taught me much about
research in computer engineering is Dr. Jeff Vetter. Furthermore, I wish to express my
appreciation to my sponsors (the Department of Defense, Honeywell, and the University
of Florida) for their financial aid. Without it I would be in extreme debt. Finally, I would
like to thank Mr. Robert Henuber for planting the seed that inspired and motivated me to
become a doctor of philosophy in computer engineering.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 BACKGROUND AND RELATED RESEARCH . . . . . . . . . . . . . . . . . . 17
3 FAST AND ACCURATE SIMULATION ENVIRONMENT (PHASE 1) . . . . . 24
3.1 Application Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
  3.1.1 Application Characterization . . . . . . . . . . . . . . . . . . . . 26
  3.1.2 Stimulus Development . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Simulation Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
  3.2.1 Component Design . . . . . . . . . . . . . . . . . . . . . . . . . 33
  3.2.2 System Development . . . . . . . . . . . . . . . . . . . . . . . . 37
  3.2.3 System Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
  3.3.1 Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . 39
  3.3.2 System Validation . . . . . . . . . . . . . . . . . . . . . . . . . . 41
  3.3.3 Case Study: Sweep3D . . . . . . . . . . . . . . . . . . . . . . . . 44
    3.3.3.1 Experiment 1: Accuracy . . . . . . . . . . . . . . . . . . 46
    3.3.3.2 Experiment 2: Speed . . . . . . . . . . . . . . . . . . . . 47
    3.3.3.3 Experiment 3: Virtual system prototyping . . . . . . . . . 49
3.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4 PERFORMANCE AND AVAILABILITY PREDICTIONS OF VIRTUALLY
PROTOTYPED SYSTEMS FOR SPACE-BASED APPLICATIONS (PHASE 2) . . 57
4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
  4.1.1 Project Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 57
  4.1.2 DM System Architecture . . . . . . . . . . . . . . . . . . . . . . 59
  4.1.3 DM Middleware Architecture . . . . . . . . . . . . . . . . . . . . 61
4.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
  4.2.1 Physical Prototype . . . . . . . . . . . . . . . . . . . . . . . . . 64
  4.2.2 Markov-Reward Modeling . . . . . . . . . . . . . . . . . . . . . 65
    4.2.2.1 Data node model . . . . . . . . . . . . . . . . . . . . . . 66
    4.2.2.2 System model . . . . . . . . . . . . . . . . . . . . . . . . 69
  4.2.3 Discrete-Event Simulation Modeling . . . . . . . . . . . . . . . . 70
  4.2.4 Fault Model Library . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
  4.3.1 Model Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . 76
    4.3.1.1 Component model calibration and validation . . . . . . . . 76
    4.3.1.2 System performability model . . . . . . . . . . . . . . . . 77
  4.3.2 Case Study: Fast Fourier Transform . . . . . . . . . . . . . . . . 78
  4.3.3 Case Study: Synthetic Aperture Radar . . . . . . . . . . . . . . . 82
    4.3.3.1 Amenability study . . . . . . . . . . . . . . . . . . . . . . 86
    4.3.3.2 In-depth application analysis . . . . . . . . . . . . . . . . 87
    4.3.3.3 Flight system . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5 HYBRID SIMULATIONS TO IMPROVE THE ANALYSIS TIME OF
DATA-INTENSIVE APPLICATIONS (PHASE 3) . . . . . . . . . . . . . . . . . 99
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2 Background and Related Research . . . . . . . . . . . . . . . . . . . . 102
5.3 Hybrid Simulation Approach . . . . . . . . . . . . . . . . . . . . . . . 106
  5.3.1 Function-Level Training . . . . . . . . . . . . . . . . . . . . . . . 108
  5.3.2 Analytical Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 111
  5.3.3 Micro-Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.4 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
  5.4.1 Simulation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 118
  5.4.2 Performance Modeling . . . . . . . . . . . . . . . . . . . . . . . 121
  5.4.3 Contention Modeling . . . . . . . . . . . . . . . . . . . . . . . . 125
  5.4.4 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
APPENDIX
A EXPERIMENTAL AND SIMULATIVE SETUP . . . . . . . . . . . . . . . . . . 137
A.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
A.2 Simulation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
LIST OF TABLES
Table page
3-1 The FASE component library . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3-2 Experimental versus simulation execution times for matrix multiply . . . . . . . 43
3-3 Ratio of simulation to experimental wall-clock execution time . . . . . . . . . . 45
3-4 Compute node specifications for each cluster in heterogeneous system . . . . . . 46
3-5 Experimental versus simulation errors for Sweep3D . . . . . . . . . . . . . . . . 48
4-1 The DM middleware components . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4-2 Data node model states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4-3 Failure and recovery rates of the node model . . . . . . . . . . . . . . . . . . . . 68
4-4 Summary of DM component models . . . . . . . . . . . . . . . . . . . . . . . . . 71
4-5 Summary of fault models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4-6 Baseline system parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4-7 The FFT Algorithmic variations and system enhancements . . . . . . . . . . . . 81
4-8 Checkpoint options explored using patch-based SAR application . . . . . . . . . 90
4-9 Architectural enhancements explored for flight system . . . . . . . . . . . . . . . 92
5-1 Summary of relevant simulation models . . . . . . . . . . . . . . . . . . . . . . . 119
5-2 Key system parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5-3 Hybrid source model parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5-4 Dataset sizes for each HSI data transaction . . . . . . . . . . . . . . . . . . . . . 129
5-5 Simulation times for various HSI image sizes . . . . . . . . . . . . . . . . . . . . 130
A-1 Computation systems at the HCS Lab at UF . . . . . . . . . . . . . . . . . . . . 137
LIST OF FIGURES
Figure page
3-1 High-level data-flow diagram of FASE framework . . . . . . . . . . . . . . . . . 25
3-2 The FASE application characterization process . . . . . . . . . . . . . . . . . . . 27
3-3 InfiniBand model latency validation . . . . . . . . . . . . . . . . . . . . . . . . . 40
3-4 InfiniBand model throughput validation . . . . . . . . . . . . . . . . . . . . . . 41
3-5 The TCP/IP/Ethernet model latency validation . . . . . . . . . . . . . . . . . . 41
3-6 The TCP/IP/Ethernet model throughput validation . . . . . . . . . . . . . . . 41
3-7 The SCI model latency validation . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3-8 The SCI model throughput validation . . . . . . . . . . . . . . . . . . . . . . . . 42
3-9 Sweep3D algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3-10 Experimental versus simulative execution times for Sweep3D . . . . . . . . . . . 47
3-11 Ratios of simulation to experimental wall-clock completion time for varying
system and dataset sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3-12 Execution times for Sweep3D running on various system configurations . . . . . 50
3-13 Maximum speedups for Sweep3D running on various network configurations . . 51
3-14 Speedups for Sweep3D running on 8192-node InfiniBand system . . . . . . . . . 53
4-1 System hardware architecture of the dependable multiprocessor . . . . . . . . . 60
4-2 System software architecture of the dependable multiprocessor . . . . . . . . . . 61
4-3 Logical diagram and photograph of DM testbed . . . . . . . . . . . . . . . . . . 65
4-4 Markov-reward data node model . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4-5 Markov-reward system model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4-6 The DM node models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4-7 The DM flight system model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4-8 Example fault-enabled system . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4-9 Throughput validations for network and MDS subsystem models . . . . . . . . . 76
4-10 Markov versus simulation DM system performability comparison . . . . . . . . . 78
4-11 Dataflow diagram of parallel 2D FFT . . . . . . . . . . . . . . . . . . . . . . . . 79
4-12 Execution time per image for baseline and enhanced systems . . . . . . . . . . . 80
4-13 Parallel 2D FFT execution times per image for various performance-enhancing
techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4-14 Distributed 2D FFT execution times per image for various performance-enhancing
techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4-15 SAR dataflow with optional checkpoint stages and patched data decomposition . 84
4-16 Amenability results via Markov model for patch-based SAR application . . . . . 86
4-17 System performability percentages and throughputs for patch-based SAR . . . . 88
4-18 System performability and throughput for 8192-element patch-based SAR
executing on various system sizes . . . . . . . . . . . . . . . . . . . . . . 91
4-19 Speedups of architectural enhancements for patch-based SAR . . . . . . . . . . 93
4-20 System performability and throughput of 20-node DM flight system executing
patch-based SAR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5-1 High-level example systems employing hybrid modeling . . . . . . . . . . . . . . 107
5-2 High-level diagram of hybrid simulation approach . . . . . . . . . . . . . . . . . 108
5-3 Function-level training procedure . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5-4 Example three-flow micro-simulation . . . . . . . . . . . . . . . . . . . . . . . . 117
5-5 PingPong accuracy results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5-6 MDSTest accuracy results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5-7 PingPong speedup results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5-8 MDSTest speedup results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5-9 MDSTest accuracy and speedup results using hybrid modeling approach . . . . 127
5-10 The HSI data decomposition and dataflow diagram . . . . . . . . . . . . . . . . 129
5-11 The HSI accuracy and speedup results for two hybrid configurations . . . . . . . 131
A-1 The MLD development environment . . . . . . . . . . . . . . . . . . . . . . . . 138
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
FAST AND ACCURATE SIMULATION ENVIRONMENT (FASE) FOR
HIGH-PERFORMANCE COMPUTING SYSTEMS AND APPLICATIONS
By
Eric M. Grobelny
August 2008
Chair: Alan George
Major: Electrical and Computer Engineering
As computing systems become more complex in terms of their architecture,
interconnect, and heterogeneity, configuring and using these machines optimally
becomes a major challenge. To reduce the penalties caused by poorly configured systems,
simulation is often used to predict the performance of key applications to be executed on
the new systems. Simulation provides the capability to observe component and system
characteristics (e.g., performance and power) in order to make vital design decisions.
However, simulating high-fidelity models can be very time consuming and even prohibitive
when evaluating large-scale systems.
The Fast and Accurate Simulation Environment (FASE) framework seeks to support
large-scale system simulation by using high-fidelity models to capture the behavior of
only the performance-critical components, while employing abstraction techniques to
capture the effects of components that have little impact on system performance. To achieve this
balance of accuracy and simulation speed, FASE provides a methodology and associated
toolset to evaluate numerous architectural options. This approach allows users to make
system design decisions based on quantifiable demands of their key applications rather
than manual analysis, which can be error-prone and impractical for large systems.
The framework accomplishes this evaluation through a novel approach of combining
discrete-event simulation with an application characterization scheme in order to remove
unnecessary details while focusing on components critical to the performance of the
application. In addition, FASE is extended to support in-depth availability analyses
and quick evaluations of data-intensive applications. In this document, we present the
methodology and techniques behind FASE and include several case studies that validate
the constructed systems using various applications and interconnects. The studies show that
FASE produces results with acceptable accuracy (i.e., maximum error of 23.3% and under
6% in most cases) when predicting the performance of complex applications executing on
HPC systems. Furthermore, when using FASE to analyze data-intensive applications, the
framework achieves over 1500× speedup with less than 1% error when compared to the
traditional, function-level modeling approach.
CHAPTER 1
INTRODUCTION
Substantial capital resources are invested annually to expand the computational
capacity and improve the quality of tools scientists have at their disposal to solve
grand-challenge problems in physics, life sciences and other disciplines. Typically each
new large-scale, high-performance computing (HPC) system deployed at a national lab,
industry research center, or other site exemplifies the latest in technology and frequently
outperforms its predecessors as measured by the execution of generic benchmark suites.
While a supercomputer’s raw computational potential can be readily predicted and
demonstrated for these generic benchmarks, an application scientist’s ability to harness the
new system’s potential for their specific application is not guaranteed. Few applications
stress systems in exactly the same manner, especially at large system sizes, and
therefore predicting how best to allocate limited funds to build an optimally configured
supercomputer is challenging. Lacking quantitative data and a clear methodology to
understand and explore the design space, developers typically rely on intuition, or
instead simply use manual analysis to identify the best available option for each system
component, which may often lead to inefficiencies in the system architecture. A more
structured methodology is required to provide developers the means to perform an
accurate cost-benefit analysis to ensure resources are efficiently allocated. The Fast and
Accurate Simulation Environment (FASE) has been developed to address this critical
need.
FASE is a comprehensive methodology and associated toolset for performance
prediction and system-design exploration that encompasses a means to characterize
applications, design and build virtual prototypes of candidate systems, and study the
performance of applications in a quick, accurate, and cost-effective manner. FASE
approaches the grand challenge that is performance prediction of HPC applications on
candidate systems by splitting the problem into two domains: the application and the
simulation domains. Though some interdependencies must exist between the two realms,
this split isolates the work conducted in either domain so that application analysis data
and system models can be reused with little effort when exploring other applications
or system designs. Unlike other performance prediction environments, FASE provides
the unique capability of virtually prototyping candidate systems via a graphical user
interface. This feature not only provides substantial time and cost savings as compared
to developing an experimental prototype, but also captures structural dependencies
(e.g., network contention) within the computational subsystems allowing users to explore
decomposition and load balancing options. Furthermore, virtual prototyping can help
forecast the technological advances in specific components that would be most critical
to improving the performance of select applications. More importantly, cross-over points
in key metrics, such as network latency, can be identified by quantitatively assessing
where to apply Amdahl's law for a particular application and system pair. To
ensure all options are examined, analysis can also include the reuse potential of currently
deployed systems in order to determine if upgrading or otherwise augmenting those
existing systems will provide a better return on investment compared to building an
entirely new system. Another unique feature of FASE is its iterative design and analysis
procedures in which results from one or more initial runs are used to refine the application
characterization process as well as dictate the fidelity of the component models employed
in candidate architectures. Iterations during these stages can result in highly targeted
performance data that drives simulated systems optimized for speed and accuracy. To
accommodate optimized system models, the framework supports a combination of
analytical and simulation model types. This combination allows users to effectively adjust
the focal points of the simulations to the components with the greatest impact on system
performance through the use of the simulation models, while still accounting for the less
influential components by using faster analytical models. As a result, the complexity
of the overall system design is reduced, thus decreasing simulation time. In summary,
FASE allows designers to evaluate the performance of a specific set of applications while
exploring a rich set of system design options available for deployment via an upgradeable,
modular component library. The list below details the main contributions of the FASE
framework.
1. A systematic, iterative methodology that describes the various options available at each step of the design and analysis process and illuminates the implications and issues with each option.

2. The FASE toolset that provides an application characterization tool (supporting MPI-based C and Fortran applications) to collect performance data and a graphical, object-oriented (using C++) simulation environment to virtually prototype and evaluate candidate systems.

3. A pre-built model library that contains a variety of HPC architectural components facilitating rapid prototyping and evaluation of systems with varying degrees of detail at all the key subsystems.
This study consists of three phases of research. The first phase focuses on designing
and developing a robust and comprehensive methodology and toolkit that users can
employ to quickly and accurately predict the performance of interesting HPC systems and
applications. More specifically, the work conducted in the first phase provides a detailed
procedure to characterize an application, design and build component and system models,
and analyze the application’s performance on various system configurations via simulation.
A basic toolkit, consisting of an application analysis tool, a simulator, and pre-built
simulation models of key components and sample systems, facilitates the prediction and
analysis procedure. After creating the foundation of FASE, we perform case studies using
a matrix multiply benchmark and a real scientific application in order to validate the
speed and accuracy of FASE.
The second phase of the study explores the use of the FASE framework to perform
an in-depth performability analysis of space-based systems. This work consists of the
design and development of component and system models to represent the candidate
space system as well as a simulation-based fault injection framework to conduct scalability
and performability analyses of different system configurations and application variations.
The scalability analysis employs the 2D Fast Fourier Transform (2D FFT) as the key
benchmark kernel due to its computational relevance in many space-based applications.
After identifying the strengths and weaknesses of the base architecture, we conduct a
performability study of the system under different environmental conditions. The study
uses the Synthetic Aperture Radar (SAR) application, which incorporates the
2D FFT kernel in order to perform image processing. The study reports insight on
key architectural and algorithmic options that provide performance and availability
enhancements for future space systems.
Phase three expands the foundation of FASE by incorporating hybrid simulation
techniques to address scalability issues when analyzing data-intensive applications. The
research conducted within this phase focuses on the design and development of techniques
that combine the strengths of the function-level and analytical modeling approaches
in order to reduce simulation times in applications that process very large datasets.
The proposed approach is validated using simple benchmark programs executing on an
emerging space-based platform while its capabilities are demonstrated by analyzing a
data-intensive, remote-sensing application called hyperspectral imaging (HSI).
The remainder of this document consists of the technical details of the study. Chapter
2 provides background information on the basic concepts involved in the performance
prediction process through simulation. The chapter also presents brief overviews of
previous research that shares similar methods and goals with FASE. Appendix
A describes the various facilities and tools used to conduct the research. Chapter 3
details the FASE framework while Chapter 4 applies and extends the framework to
perform scalability and performability analyses on a space-based system for a thorough
architectural and algorithmic evaluation. Chapter 5 extends FASE by enhancing pre-built
models with the mechanisms needed to support hybrid simulations for the
analysis of data-intensive applications. Finally, Chapter 6 summarizes the work presented
in this document.
CHAPTER 2
BACKGROUND AND RELATED RESEARCH
Building large-scale systems to determine how an application will perform is simply
not a viable option due to time and cost constraints. Therefore, other methods of
precursory investigation are needed, and several different types of modeling techniques
exist to aid in this process. Analytical [1], [2] and statistical [3], [4] modeling are two such
options and both methods involve a representation of the system using mathematical
equations or Markov models in order to gain insight on how a particular system will
perform based on certain parameters. These models can become very complex, especially
when considering the large number of configuration parameters of large-scale HPC systems,
and can still be inaccurate due to the overall complexity of the systems. In addition,
it is difficult for analytical modeling to address the higher-order transient features of
an application, such as network contention, and over-simplifications are often employed
to make the equations solvable [5]. Computer simulation is an alternative that brings
accuracy and flexibility to this challenge. Real hardware can be modeled at any degree
of fidelity required by the user and dictated by the application, allowing the system to
be tailored to the application and vice versa. A simulation-based approach provides the
user with the flexibility to model important components very accurately while sacrificing
the accuracy of less vital components by modeling them at a lower fidelity. In addition to
these benefits, computer simulation supports the scaling of specific component parameters,
allowing for the modeling of next-generation technologies which may not be currently
available for experimental study. Analysis based on such models can also provide concrete
evidence that may influence the future road maps of system component manufacturers.
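To make the Markov-model option above concrete, consider the simplest possible case, a generic illustration rather than a model taken from this study: a single component alternates between an up state and a down state with constant failure rate \lambda and repair rate \mu. The steady-state balance equation \lambda P_{up} = \mu P_{down}, together with P_{up} + P_{down} = 1, gives the component's availability

    A = P_{up} = \frac{\mu}{\lambda + \mu} = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}},

and the rapid growth in states and balance equations as components and failure modes are added is precisely why such models become unwieldy for large-scale systems.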
Classically, there are two types of computer simulation environments:
execution-driven and trace-driven. Execution-driven simulations often use a program’s
machine code as the input to the simulation models and also have near clock-cycle fidelity,
producing accurate results at the cost of slow simulation speeds [6], [7]. Though very
useful for detailed studies of small-scale systems, execution-driven simulations tend to
become impractical with regards to time when used to simulate large, complex systems.
Trace-driven simulations employ higher-level representations of an application’s execution
to drive the simulations [8], [9]. These representations are generally captured using
an instrumented version of the application under study running on a given hardware
configuration. In essence, the trace-driven simulations require extra time during the
“characterization” stage to thoroughly understand and capture relevant information about
the application. This additional time spent during characterization is typically amortized
during the simulation stage by having the input available for numerous simulated system
configurations. The accuracy of trace-driven simulations not only depends on the fidelity
of the models, but also on the detail of the information obtained corresponding to the
application. A non-traditional simulation type is model-driven. Model-driven simulations
use formal models developed within the simulation environment in order to emulate the
behavior of an application. Essentially, these models produce output data that stimulates
the components of the simulated system as if the real application were running. Though
the development of the models can be very time consuming, depending on the complexity
of the modeled application, once the model is developed it can be used within any system
without any extra work.
In order to perform a trace-driven simulation, a representation of the application
is necessary. Traces and profiles are two types of application representations that can
be used to portray the behavior of a program [10]. Traces gather a chronology of events,
such as computation or communication, as they occur in time, allowing a user to observe what
a program is doing during a specific time period. Because traces are dependent on the
execution time of a program, long-running programs can produce extremely large trace
logs. These large logs can be impractical or even impossible to store depending on how
much detail is recorded by the trace program. Profiles, by contrast, do not record events
in time, but rather tally the events in order to provide useful statistics to the end user.
The overhead incurred from the execution of extra code used to collect the profile or
trace data can be quite similar, but is ultimately dependent on the level of detail profiled
or traced as well as the application under study. While trace generation may impose
penalties associated with the creation of large trace files (depending on the frequency
of disk access), it has the advantage that very little additional processing is needed, since
no intermediate results are calculated. By contrast, profiling typically requires very little file
access, but may demand the frequent calculation of intermediate results for the application
profile. In essence, a trace provides raw data describing the execution of an application
and a profile outputs a processed form of this data. Both profiles and traces can be useful
tools in trace-driven simulation, depending on the type of simulation models used for the
study and the amount of information that is desired. Further discussion on how trace and
profile tools are leveraged within the FASE framework is provided in Chapter 3.
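As a rough illustration of the distinction (the field names here are hypothetical, not drawn from any particular tool), a trace is essentially a time-ordered sequence of records, while a profile keeps only running tallies:

    // One time-stamped record in a chronological trace.
    struct TraceEvent {
        double timestamp;                          // when the event began (s)
        enum Kind { COMPUTE, SEND, RECV } kind;    // what happened
        int    src, dst;                           // ranks involved (comm. only)
        long   bytes;                              // transfer size (comm. only)
        double duration;                           // how long it took (s)
    };

    // A profile discards the chronology and accumulates statistics instead.
    struct Profile {
        long   numSends = 0, numRecvs = 0, bytesSent = 0;
        double totalComputeTime = 0.0;             // tallied, not time-stamped
    };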
Traces and profiles can be collected for both sequential and parallel programs, though
this study focuses on the latter. Various parallel programming models exist in order to
facilitate computation across multi-process, multi-node systems. These models include
shared address, message passing, and data parallel [11]. Due to its proliferation within
the scientific community, this study focuses on the message passing paradigm. More
specifically, we consider the Message Passing Interface (MPI) as the de facto standard for
inter-process communication via message passing [12], [13]. The MPI standard defines a
number of functions that allow application developers to pass information from a source
process to one or more destination processes. The destination processes can be local to
the sending process or can be located on a remote node requiring transactions across some
interconnect. More details on the MPI standard and its implementation can be found in
[12].
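A minimal sketch of the message-passing style that MPI supports (standard MPI calls; the payload value is arbitrary): rank 0 sends one value to rank 1 over the default communicator.

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // which process am I?
        double payload = 3.14;
        if (rank == 0) {
            // Source process: send one double to rank 1 with message tag 0.
            MPI_Send(&payload, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            // Destination process: blocking receive from rank 0.
            MPI_Recv(&payload, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            std::printf("rank 1 received %f\n", payload);
        }
        MPI_Finalize();
        return 0;
    }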
Predicting the performance of an arbitrary application executing on a specific system
architecture has been a long-sought goal for application and system designers alike.
Throughout the years, researchers have designed a number of prediction frameworks that
employ varying techniques and methodologies in order to produce accurate and meaningful
results. The approach taken by each framework allows it to overcome certain obstacles or
focus on specific application or system types while inevitably sacrificing some key feature
such as accuracy, speed, or scalability. The following paragraphs briefly describe existing
prediction environments similar to FASE and the methodologies and general limitations of
each.
The Integrated Simulation Environment (ISE) [14], a precursor to FASE, employs
hardware-in-the-loop (HWIL) simulation in which actual MPI programs are executed
using real hardware coupled with high-fidelity simulation models. The computation time
of each simulated node is calculated via a process spawned on a “process machine,” while
network models calculate the delays for the communication events. Though the ISE shows
reasonable simulation time and accuracy, the scalability of the framework is limited due
to the large number of processes needed to represent large-scale systems. The POEMS
project [15] adheres to an approach similar to HWIL simulation in that systems can
be evaluated based on results from actual hardware and simulation models as well as
performance data collected from actual software and analytical models. The simulation
environment used by POEMS integrates SimpleScalar with the COMPASS simulator [16]
enabling direct execution-driven, parallel simulation of MPI programs. These high-fidelity
simulators can provide very accurate and insightful results though the simulations can
require large amounts of time especially when dealing with very large systems.
While ISE and POEMS use execution-driven simulations, other projects employ
a model-driven approach. CHAOS [17] creates application emulators that reproduce
the computation, communication, and I/O characteristics of the program under study.
The emulators are then used to pass their information to a suite of simulators that, in
turn, determine the performance of the modeled application on the target simulated
system. The PACE [18] framework defines a custom language, called CHIP3S, that is
used to generate a model describing a parallel program’s flow control and workload.
Similarly, Performance Prophet [19] uses the Unified Modeling Language (UML) to model
parallel and distributed applications. Both PACE and Performance Prophet use these
models to drive systems composed of both analytical and simulation models in order to
balance speed and accuracy. Due to the inherent difficulty of automatically creating an
accurate application model, all three projects require intimate knowledge of the programs
they represent. As a result, the source code must be analyzed and profiled to collect the
necessary information to reconstruct the program’s execution.
Work conducted by the Performance Evaluation Research Center (PERC) focuses
on performance prediction of parallel systems under the assumption that an application’s
performance on a single processor is memory bound and the interconnect dictates the
scalability of the program under study [20], [21]. The framework uses a three-step process:
1. Collect memory performance parameters of the considered machine
2. Collect memory access patterns independent of the underlying architecture on which
the application is executed
3. Algebraically convolve the results from steps one and two, then feed the results
to the DIMEMAS [22] network simulator developed at the European Center for
Parallelism of Barcelona
The researchers of PERC have reported accurate performance predictions using a
wide range of applications. However, collecting memory access patterns can be a very
time consuming task that results in large amounts of data for each application considered.
Also, the DIMEMAS network simulator is relatively simplistic and has the potential for
large inaccuracies when analyzing communication-intensive applications with potential
contention issues.
Performance prediction is the main theme of FASE and the frameworks mentioned
above. However, performance analysis is an integral technique used to characterize
the applications under study prior to simulation. The accuracy and detail of this
characterization data can greatly influence the accuracy and speed of the simulation
frameworks that use it. While a great number of performance analysis tools exist
for various purposes [23], SvPablo [24], Paradyn [25], and TAU [26] are the tools
most applicable to the FASE simulation framework. SvPablo may be used to analyze
the performance of MPI and OpenMP parallel programs, and allows for interactive
instrumentation and correlates performance data with the original source code. Paradyn
also is compatible with MPI programs and offers the advantage that no modifications
to the source code are needed due to dynamic instrumentation at the binary level.
Paradyn’s focus is on the explicit identification of performance bottlenecks. TAU is an
MPI performance analysis tool that can provide performance data on a thread-level
basis, and provides the user with a choice of three different instrumentations. Many
performance analysis tools including SvPablo and TAU are based on or support the
popular Performance Application Programming Interface (PAPI), which is also supported
by the Sequoia tool used by FASE. Each of these analysis tools can be incorporated into
the pre-simulation stage of FASE in conjunction with or as a replacement for Sequoia to
provide additional information on an application’s behavior.
FASE combines many of these general techniques and methods in order to provide
a robust, flexible performance prediction framework. In addition, FASE features an
environment that allows users to build virtual representations of candidate architectures.
These virtual prototypes capture structural dependencies such as network congestion
and workload distribution that can greatly impact an application’s performance. Many
of the frameworks described in this section use simplistic communication models that
have difficulties capturing such issues. Furthermore, FASE employs a systematic,
iterative methodology that produces highly modular application characterization
data and component models. The framework illustrates the relationships between the
application performance information and the architectural models such that features
and mechanisms within the models can be identified and altered to improve prediction
accuracy and simulation speed. The other frameworks propose using models of various
fidelities in order to speed up the simulations; however, they do not explicitly describe the
decision-making process of choosing or designing a model’s fidelity. Finally, FASE provides
a fully extensible pre-built model library with components ranging from the application
layer down to the hardware layer. Unlike many of the frameworks described above, FASE
includes a number of detailed middleware models that can have significant impacts on
the performance of the overall system. Many of the pre-built models are highly tunable,
thus allowing a single, generic model to represent many different implementations based
on the parameter values set by the user. The combination of these features makes FASE a
powerful, flexible environment for rapidly evaluating applications executing on a variety of
candidate systems.
CHAPTER 3
FAST AND ACCURATE SIMULATION ENVIRONMENT (PHASE 1)
The Fast and Accurate Simulation Environment (FASE) is a framework designed
to facilitate system design according to the needs of one or more key applications.
FASE provides a methodology and corresponding toolkit to evaluate the performance
of virtually-prototyped systems to determine, in a timely and cost-effective manner, the
ideal system configuration for a specific set of applications. In order to promote quick
and modular predictions, the FASE framework is broken into two primary domains:
the application domain and the simulation domain. In the application domain, FASE employs
various tools and techniques to characterize the behavior of an application in order
to create accurate representations of its execution. The information gathered is used
to not only identify and understand the characteristics of the various phases inherent
to the application, but also to generate the stimulus data to drive the simulation
models. The characterization data can be collected using one or more tools depending
on the application, the capabilities of the employed tool(s), and the simulation
models used during simulation. Once the data is collected, it can be used in numerous
simulations without any modifications thus facilitating the exploration of various system
configurations. More details on the application domain are provided in Section 3.1.
The simulation domain incorporates the design, development, and analysis of the
virtually-prototyped systems to be studied. In this domain, component models are
designed and validated in order to create systems that incorporate new or emerging
technologies. To ease system development, FASE provides a library of pre-constructed
models tailored to accommodate the design of HPC environments. Once a system has
been constructed, characterization data from any number of applications can be used
as stimulus to the simulation, thus allowing rapid analyses of the system under varying
workloads. More information on the various aspects of the simulation domain is provided in
Section 3.2.
Figure 3-1 illustrates a high-level representation of the process associated with the
FASE framework. The dark gray blocks represent steps in the application domain while
the simulation steps are denoted by white blocks. Notice that a user can work in both
domains concurrently. Also note that the framework incorporates multiple feedback
paths that allow the user to follow an iterative process by which insight is gained through
application characterization and simulation, and used to refine the models and application
analysis data employed for future iterations. Section 3.2.3 contains further details on how
an iterative methodology may be employed in FASE.
Figure 3-1. High-level data-flow diagram of FASE framework
3.1 Application Domain
The application domain is a critical part of the FASE framework. In this domain,
important information is gathered that provides insight on the behaviors of an application
during its execution. The main goal within the application realm is to gather enough
information about an application so that systems in the simulation environment are
stimulated as if they were really running the code. As such, this domain is decomposed
into two main stages: 1) application characterization and 2) stimulus development. The
application characterization stage employs analysis tools to collect pertinent performance
data that illustrates how the application exercises the computational subsystems. The
data that can be collected includes communication information, computation information,
memory accesses, and disk I/O. This data can then be used directly or processed and
analyzed during the stimulus development stage. In the stimulus development stage, raw
data gathered during characterization is used to provide valid input to the simulation
models such that the simulated system’s components are exercised as if the real program
were executing on it. More details on both stages, as well as the various options available in
each, are provided in the sections that follow.
3.1.1 Application Characterization
Application characterization is a vital step in FASE that enables accurate predictions
on the performance of an application executing on a target system. The goal of
characterization is to identify and track the performance-critical attributes that dictate an
application’s performance based on both application and target-system parameters. FASE
provides a framework in which users can analyze their applications using existing systems
(also known as instrumentation platforms) in order to prepare for simulation. The basic
methodology by which the user analyzes the application is shown in Figure 3-2. The tools
employed in each iteration initially depend on the user’s experience with the application
and can then change based on the results from the previous iteration of analysis. The
selected tools should be capable of capturing the inherent qualities of the application while
minimizing the collection of information resulting from dependencies on the underlying
architecture of the instrumentation platform. Perturbation (i.e., the additional overhead
imposed on the system due to instrumentation) should also be considered to ensure data
accuracy. Though characterization is part of the application domain, the simulation
models should also be considered during tool selection. For example, if the processor
model to be used in simulation supports only instruction-based information, then the
analysis tool(s) selected should provide at least that information for that particular model.
Multiple tools can be used based on the details they provide, but output data must be
converted to a common format in order to drive the simulation models.
Figure 3-2. The FASE application characterization process
FASE incorporates a feedback loop (Loop 1 in Figure 3-1) at the characterization
stage such that multiple characterizations may be performed to first understand the
main bottleneck of the program (e.g., processor, network, memory, disk I/O, etc.),
and then focus on the collection of information that characterizes the main bottleneck
while abstracting the components of lesser impact. This data can then be fed into the
simulation environment and analyzed until the system exposes a different component as
the bottleneck. If desired, the characterization data and models can then be switched
or adjusted to incorporate the appropriate information and components to capture
the nature of the new bottleneck. In fact, the performance data can provide all the
necessary information for applications with an arbitrary bottleneck, while the simulation
models incorporate abstraction by only using the data that corresponds to capabilities
supported in their designs. Further details on the simulation phase and how application
characterization influences design decisions are described in the next section.
The initial deployment of FASE employs a single analysis tool called the Sequoia
Toolkit developed at Oak Ridge National Laboratory. Sequoia is a trace-based tool
that supports the analysis of C and FORTRAN applications using the MPI parallel
programming model, the current de facto standard for large-scale message-passing
platforms [28]. Instrumentation is conducted during link-time by using the profiling
MPI (PMPI) wrapper functions. PMPI is defined in the MPI standard and provides an
easy interface for profiling tools to analyze MPI programs [12]. Therefore, a Sequoia user
must only rebuild his code by linking the Sequoia library, simplifying the data collection
process. Though not required, Sequoia also supports additional functions that can be
manually inserted into the application to start or stop data collection as well as denote
various phases within the code to facilitate analysis.
The Sequoia Toolkit explicitly supports the logging of communication and
computation events. A communication event in Sequoia is defined as any MPI function
encountered during the execution of the code. The tool collects relevant information such
as source, destination, and transfer size for all key MPI communication functions and also
logs important non-communication functions (e.g., MPI Topology, MPI Comm create).
Collecting communication events at the MPI level inherently isolates the characterization
data from the underlying network of the instrumentation platform thus allowing the
data to be used on a variety of simulated systems that employ different interconnect
technologies. Network topology dependencies are also removed during characterization
since network transfers between machines are captured as high-level semantics representing
process-to-process communications rather than architecturally-dependent characteristics
such as latency and bandwidth.
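The PMPI mechanism makes this kind of logging straightforward. The sketch below is illustrative only (Sequoia's actual implementation and log format are not shown here): it intercepts MPI_Send at link time, records the source, destination, and transfer size, and forwards the call to the real implementation through the PMPI_ entry point defined by the MPI standard.

    #include <mpi.h>
    #include <cstdio>

    // Overrides the standard MPI_Send; the linker resolves application
    // calls here instead of the MPI library's version.
    extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype type,
                            int dest, int tag, MPI_Comm comm) {
        double t0 = MPI_Wtime();
        int rc = PMPI_Send(buf, count, type, dest, tag, comm);  // real send
        double t1 = MPI_Wtime();

        int src, typeSize;
        PMPI_Comm_rank(comm, &src);
        PMPI_Type_size(type, &typeSize);
        // Log the high-level semantics of the transfer, independent of the
        // underlying network, as described above (format is illustrative).
        std::fprintf(stderr, "SEND src=%d dst=%d bytes=%ld dt=%g\n",
                     src, dest, (long)count * typeSize, t1 - t0);
        return rc;
    }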
Computation events occur between communication events. Sequoia supports two
mechanisms that measure computation statistics during an application’s execution: timing
functions and the Performance Application Programming Interface (PAPI). PAPI is
an API from the University of Tennessee that provides access to hardware counters on
a variety of platforms that can be used to track a wide range of low-level performance
statistics such as number of instructions issued, L1 cache misses, and data TLB misses
[29]. Every logged event, both computation and communication, includes both
wall-clock and CPU-time measurements. With PAPI enabled, computation events include
additional performance information on clock cycles, instruction count, and the number of
loads, stores, and floating point operations executed. Sequoia does not explicitly support
the collection of I/O data. However, rough estimates can be calculated by comparing
wall-clock and CPU times for each computation event.
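A brief sketch of what such PAPI-based measurement looks like (generic PAPI usage, not Sequoia's code; event availability varies by platform), including the wall-clock minus CPU-time subtraction that yields the rough I/O estimate mentioned above:

    #include <papi.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            std::exit(1);                         // counters unavailable
        int evset = PAPI_NULL;
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_TOT_INS);      // instructions completed
        PAPI_add_event(evset, PAPI_FP_OPS);       // floating-point operations

        long long wall0 = PAPI_get_real_usec(), cpu0 = PAPI_get_virt_usec();
        PAPI_start(evset);

        volatile double x = 0.0;                  // stand-in computation event
        for (int i = 0; i < 1000000; ++i) x += 0.5;

        long long counts[2];
        PAPI_stop(evset, counts);
        long long wall = PAPI_get_real_usec() - wall0;
        long long cpu  = PAPI_get_virt_usec() - cpu0;
        // Wall-clock time not accounted for by the CPU is a rough proxy
        // for time spent blocked on I/O.
        std::printf("ins=%lld fp=%lld wall=%lldus cpu=%lldus io~=%lldus\n",
                    counts[0], counts[1], wall, cpu, wall - cpu);
        return 0;
    }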
The characterization stage does suffer from one inherent problem that must be
addressed in order to provide an environment to predict the performance of large-scale
systems. This issue arises due to the restrictions placed on the user by the physical
testbed used to collect characterization data. For example, to collect accurate information
about an application running on a 1024-node system, a 1024-node testbed must be
available to the user. Of course, not all users have access to systems with thousands of nodes
for evaluation purposes, though they may be interested in observing the execution of
their applications on these larger systems. In order to overcome this limitation, FASE
incorporates two techniques – the process method and the extrapolation method. The
process method allows each physical node in the instrumentation platform to run multiple
processes and thus gather characterization information for multiple simulated nodes.
The downside to this approach is resource contention at shared components that can
lead to inaccurate representations of the application’s execution. Also, a single node can
only support a limited number of processes. This limitation is encountered when OS
or middleware restrictions are met or when the node becomes so bogged down that the
application cannot finish within a reasonable amount of time. Initial tests show that
the process method produces communication events identical to those found using the
traditional approach. However, computation events can suffer from large inaccuracies due
to memory and cache contention issues especially in tests using many processes per node
with larger datasets. The experiments conducted in Section 3.3 employ traces collected
using the traditional approach; however, research is currently underway to remedy the
inaccuracies of the process method in order to facilitate large-scale system evaluation.
The extrapolation method observes trends in the application’s behavior while
changing system size and dataset size, and then formulates a rough model of the
application based on the findings. The model describes the communication, computation,
and other behaviors of the application using a high-level language. The language can then
be read by an extrapolation program to produce traces for an arbitrary system size and
application dataset size. Details on extrapolating communication patterns of large-scale
scientific applications can be found in [30]. This approach supports accurate generation
of traces and does not suffer from the limitations of the process method, though it can
be quite difficult to determine the trends of an application, especially when dealing with
applications that behave dynamically based on measured values. Although many more
issues can arise when using the extrapolation-based approach, this topic is out of the scope
of this research.
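To illustrate the idea only (the pattern and output format below are hypothetical, not the language of [30]): once an application is known to follow, say, a nearest-neighbor ring exchange whose message size scales with the dataset, synthetic trace records can be emitted for any system size N without ever running the program on N nodes.

    #include <cstdio>

    int main() {
        const int  N = 1024;                        // target system size
        const long datasetBytes = 1L << 30;         // target dataset size
        const long msgBytes = datasetBytes / N;     // per-neighbor transfer

        // Emit synthetic records in the same format a measured trace uses.
        for (int iter = 0; iter < 2; ++iter)
            for (int rank = 0; rank < N; ++rank)
                std::printf("SEND src=%d dst=%d bytes=%ld\n",
                            rank, (rank + 1) % N, msgBytes);
        return 0;
    }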
3.1.2 Stimulus Development
After an application has been characterized, the information collected is used to
develop stimulus data used as input to the simulation stage. FASE supports three
methods of providing input to simulation models. These methods include a trace-based
approach, a model-driven approach, and a hybrid approach. The exact method employed
is left to the user though the selection should depend on the type of application under
study, the amount of effort that can be afforded, and the amount of knowledge gathered
on the internals of the application. Details on each method are provided below.
The trace-based approach is the quickest and most automated method available to
the FASE user. The method can use either raw or processed performance data collected
during the characterization stage according to the type of information required by the
simulation models. However, the trace-based approach does place some restrictions on
the user. First, the simulation environment must have a trace reader that is capable of
translating the performance information into data structures native to the simulation
environment. This restriction requires a common format to which all performance data
must conform. Therefore, if multiple tools are employed to gather characterization
data, their outputs must be merged and modified to some common format type that is
supported by the trace reader. The second issue is that trace data must be collected for
each system and dataset size under consideration. As system and dataset sizes increase,
the trace data from complex applications could potentially require extremely large
amounts of storage space, and thus care must be taken to keep trace files manageable.
In the current version of FASE, this limitation can be alleviated by collecting data for
only certain regions of code or a limited number of iterations through the use of specific
instrumentation constructs supported by Sequoia.
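A minimal sketch of the trace-reader role described above (the on-disk record format shown is hypothetical, not Sequoia's): each text record is translated into a native event structure, and unsupported record types are simply skipped, which is one way abstraction enters the models.

    #include <cstdio>
    #include <cstring>
    #include <vector>

    struct SimEvent { char kind[8]; int src, dst; long bytes; double secs; };

    std::vector<SimEvent> readTrace(const char* path) {
        std::vector<SimEvent> events;
        std::FILE* f = std::fopen(path, "r");
        if (!f) return events;
        char line[256];
        while (std::fgets(line, sizeof line, f)) {
            SimEvent e{};
            if (std::sscanf(line, "SEND src=%d dst=%d bytes=%ld",
                            &e.src, &e.dst, &e.bytes) == 3)
                std::strcpy(e.kind, "SEND");
            else if (std::sscanf(line, "COMP secs=%lf", &e.secs) == 1)
                std::strcpy(e.kind, "COMP");
            else
                continue;   // record type not supported by the models
            events.push_back(e);
        }
        std::fclose(f);
        return events;
    }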
The model-driven approach requires much more manual effort by the user than the
trace-based approach. This method uses a formal model of the application’s behavior
based on either a thorough analysis of characterization data collected while varying
system and dataset size or through source code analysis. The developed models have the
capability of reproducing the behaviors of complex, adaptive applications that cannot be
captured using the trace-based approach. In general, this approach begins by identifying
key application parameters that affect its performance. The next step is to ascertain
the parameters having the greatest impact on performance and then determining the
various component models the application will exercise during execution. Once these
steps are complete, the actual model is developed such that it executes the correct
computation, communication, and other events based on the behaviors discovered during
characterization. The actual type of model employed in this approach is not limited by the
FASE framework. Markov chains, stochastic and analytical models, and explicit simulative
models are a few model types that can be used within FASE as long as they can interface
with the simulation environment.
The last approach supported by FASE is the hybrid approach. In this approach,
a mix of trace and model-driven stimulus is used in order to combine the accuracy
and ease-of-use of trace-based simulations with the flexibility and dynamism of the
model-driven approach. In this method, the application is characterized at a very high
level to identify structured and dynamic areas of code. The structured areas are used to
generate trace data while small-scale formal models are employed to represent the dynamic
areas. This mixture of techniques decreases the amount of trace data needed, reduces the
amount of effort required to formulate formal models, and maintains relatively accurate
representations of the application’s behavior.
The initial deployment of FASE uses the trace-based approach as its primary
stimulus. The pre-built FASE model library consists of a Sequoia trace reader that
translates Sequoia data into the necessary data structure in the simulation environment.
Though both model-driven and hybrid approaches are defined within the FASE
framework, this phase of research focuses on simulations conducted using only the
trace-based approach.
3.2 Simulation Domain
The simulation domain consists of three stages: 1) component design, 2) system
development, and 3) system analysis. The first of the three stages involves the creation
of the necessary components used to build the systems under study. This stage can be
particularly time-consuming depending on the complexity of the component as well as
its level of fidelity; however, it is a one-time penalty that must be paid to gain the benefits
of simulation. The initial release of FASE includes several pre-built models of common
components (detailed in Section 3.2.1) to aid users in this process and more will be added
32
in the future. The next step in the simulation domain is the development of the candidate
systems. The process of constructing virtual systems typically requires less time than
component design although construction time normally increases with system size and
complexity. Similar to stage one, the overhead of building a system must be paid only once,
though numerous applications and configurations can then be analyzed using the system.
Finally, the third stage allows us to reap the benefits of the FASE framework and process.
System analysis uses the components and systems constructed in stages one and two,
and the application stimulus data from the application domain, in order to predict the
performance of an application on a configured system. Since many variations of systems
are likely to be analyzed, this stage is assumed to be the most time-sensitive. In the
following subsections, we discuss each of the three simulation stages in more detail.
3.2.1 Component Design
Each component that is of interest to the user must first be designed and developed
in the simulation environment of choice. The resulting models must not only represent the
behavior of the components they portray, but also correspond to the level of detail provided
by the characterization tools and the abstraction level to which the application lends itself. In
most cases, certain parts of a component will be abstracted, while other parts that are
known to affect performance will be modeled in more detail. The decision of where to add
fidelity to components and where to abstract to save development and simulation time
should be based on trends and application attributes discovered during characterization.
The components designed should incorporate a variety of key parameters that dictate
their behavior and performance. An important step in the component design is tweaking
these parameters to accurately portray the target hardware. The actual values supplied
to the models should be based on empirical data collected using similar hardware or on
the predicted performance if the components are future technologies. For example, the
network and middleware models shown in Table 3-1 were validated according to real
hardware in the High-performance Computing and Simulation (HCS) Research Lab at the
University of Florida. The experimental setup and results for these validation tests are
presented in Section 3.3.
The FASE development environment uses a graphical, discrete-event simulation
tool called Mission Level Designer (MLD) from MLDesign Technologies [27]. Its core
components, called primitives, perform basic functions such as arithmetic and data
flow control. The behavior of each primitive is implemented in object-oriented C++ to
promote modularity. More complex functions are commonly realized by connecting
multiple primitives such that data is manipulated as it flows from one primitive to
another. Alternatively, users may write customized primitives
to provide the equivalent functionality. MLD was selected as the simulation tool for
FASE for three main reasons. First, it is a full-featured tool that supports component and
system design as well as the capabilities to simulate the developed systems all through
a GUI. Second, MLD supports various design features that facilitate quick design times
even for very complex systems. Finally, the authors have much experience using the tool
and many models have been created outside of FASE that can be imported with little or
no modifications. Although FASE currently uses MLD as its simulation environment of
choice, it may be adapted to support additional simulation environments in the future.
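For readers unfamiliar with this style of tool, the following sketch approximates the primitive-and-connection programming model with generic discrete-event scaffolding; it is not MLD's actual API, which is proprietary to the tool.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// Generic discrete-event scheduler (illustrative only; not the MLD API).
struct ScheduledAction {
    double time;
    std::function<void()> action;
    bool operator<(const ScheduledAction& o) const { return time > o.time; } // min-heap
};

struct Scheduler {
    double now = 0.0;
    std::priority_queue<ScheduledAction> pending;
    void at(double t, std::function<void()> a) { pending.push({t, std::move(a)}); }
    void run() {
        while (!pending.empty()) {
            ScheduledAction next = pending.top();
            pending.pop();
            now = next.time;
            next.action();
        }
    }
};

// A "primitive" in this sketch is a block whose output port connects to the
// input of the next primitive, so complex behavior emerges from chains of
// simple blocks, as described above for MLD.
struct DelayPrimitive {
    Scheduler& sim;
    double delay_s;
    std::function<void(uint64_t)> output;  // connection to the next primitive
    void input(uint64_t bytes) {
        sim.at(sim.now + delay_s, [this, bytes] { if (output) output(bytes); });
    }
};
```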
A wide range of pre-constructed models populate the initial FASE library in order
to provide a starting point for users. Each model was designed and developed based on
the hardware or software it represents, using technical details provided by corresponding
standards and other literature. The fidelity of each model in the pre-built library
corresponds to the current HPC focus of the initial deployment of FASE as well as the
capabilities of Sequoia. As a result, the library incorporates high-fidelity network and
communication middleware models to capture scalability characteristics while providing
lower-fidelity models for components such as the CPU, memory, and disk. Table 3-1
highlights the more important component-level models currently populating the pre-built
FASE library. It is worth noting that a variety of components not listed in Table
3-1 can be developed using those that are listed. For example, an explicit multicore CPU
model does not currently exist in the FASE library. However, by combining two CPU
models, a shared memory model, and two trace files, one can analyze the performance of
an application running on a multicore machine with little effort.
The network models listed in Table 3-1 share similar characteristics. Each model
receives high-level data structures that define the various parameters required to create
and output one or more network transactions between multiple nodes. Each network
model also has numerous user-definable parameters such as link rate, maximum data size,
and buffer size that dictate the performance of communication events. Furthermore, the
models include a number of parameters that define the capabilities of the subsystems that
supply the network interfaces with the necessary data to be transferred. For example,
the InfiniBand model incorporates the parameters LocalInterconnectLatency and
LocalInterconnectBW to define the latency and bandwidth of the interconnect between host
memory and the InfiniBand host channel adapter (HCA). These parameters are used to
calculate the performance penalties incurred from transferring data from memory to the
HCA. These calculations effectively abstract away the complex behaviors of the underlying
transfer mechanisms while still accounting for their performance impacts. The middleware
models in Table 3-1 provide the performance-critical capabilities of the protocol each
represents. The TCP model is a single, generic model with a variety of parameters that
enable a user to configure it as a specific implementation. Similarly, the MPI model also
incorporates many parameters so that particular implementations can be represented
using a single model. The MPI layer is modeled using two layers such that the general,
high-level functionality of the MPI protocol forms the network-independent layer while the
second layer employs interface models that translate MPI data into network-specific data
structures. This layered approach allows a common interface to be used in all systems
featuring MPI while providing “plug-and-play” capabilities to support MPI transfers over
various interconnects.
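For instance, the staging delay governed by these two parameters might be computed along the following lines; this is a simplified sketch, and the actual model also accounts for segmentation and buffering effects.

```cpp
#include <cstdint>

// Simplified cost function for staging a message from host memory to the
// HCA: one fixed latency plus a bandwidth-limited term. This abstracts the
// underlying transfer mechanism while preserving its performance impact.
// Units: seconds, bytes, and bytes per second.
double localInterconnectDelay(uint64_t bytes,
                              double LocalInterconnectLatency,
                              double LocalInterconnectBW) {
    return LocalInterconnectLatency + static_cast<double>(bytes) / LocalInterconnectBW;
}
```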
Table 3-1. The FASE component library

| Class | Type | Model name | Fidelity | Description |
|---|---|---|---|---|
| Networks | InfiniBand | Host Channel Adapter (HCA) | High | Conducts IB protocol processing on incoming and outgoing packets for IB compute nodes |
| Networks | InfiniBand | Switch | High | Device supporting cut-through and store-and-forward routing using crossbar backplane |
| Networks | InfiniBand | Channel Interface | Medium | Dynamic buffering mechanism |
| Networks | Ethernet | Network Interface Card (NIC) | Medium | Conducts Ethernet protocol processing on incoming and outgoing frames for Ethernet compute nodes |
| Networks | Ethernet | Switch | High | Device supporting cut-through and store-and-forward routing using crossbar or bus backplane |
| Networks | SCI | Link Controller | High | Conducts SCI protocol processing on incoming and outgoing packets for SCI compute nodes |
| Middleware | IP | IP Interface | Low | Handles IP address resolution |
| Middleware | TCP | TCP Connection | High | Provides reliable sockets between two devices |
| Middleware | TCP | TCP Manager | High | Manages TCP connections to ensure that the correct socket receives its corresponding segments |
| Middleware | MPI | MPICH2 | High | Provides MPI interface using TCP as the transport layer |
| Middleware | MPI | MVAPICH | High | Provides MPI interface for InfiniBand |
| Middleware | MPI | MP-MPICH | High | Provides MPI interface for SCI |
| Processors | Generic Processor | Generic Processor | Low | Supports timing information to model computation |
| Operating Systems | Generic OS | Generic OS | Low | Supports some memory management capabilities of the OS |
| Memories | Generic Memory | Generic Memory | Low | Models read and write accesses based on data size |
| Disks | Generic Disk | Generic Disk | Low | Models read and write accesses based on data size |
| Exotics | Reconfigurable Device | Reconfigurable Device | Medium | Models a specialized coprocessor (e.g., FPGA) that computes application kernels |
3.2.2 System Development
After component development, systems must be created to analyze potential
configurations. The systems developed should correspond to the demands of the
applications as discovered via characterization. FASE provides the capability to
not only change the system size and the components in the system, but also tweak
component parameters such as the network’s latency and bandwidth, middleware delays,
and processing capabilities. This feature allows the user to scrutinize the effects of
configuration changes ranging from minor system upgrades to complete system redesign
using exotic hardware. Scalability issues in this stage are dependent on the simulation
environment rather than the application. Timely and efficient development of a massively
parallel system in a given simulation environment can quickly become an issue as system
sizes scale to very high levels, and the creation of systems with thousands of nodes can
become an almost unwieldy task. Since FASE is focused on rapid analysis of arbitrary
systems, it must address this issue. FASE supports the creation of large systems in
several ways. First, the MLD simulation tool provides hierarchical, XML-based designs
such that a single module can encapsulate multiple compute nodes, yet the simulator still
maintains the fidelity of the underlying components and includes their effects in the
analysis. Systems are created using a graphical interface that is automatically translated
into XML code describing system-level details such as the number and type of each
component and how they are interconnected. In addition, MLD supports dynamic
instances: if a model is created according to certain guidelines, a single block can
represent a user-specified number of identical components. Finally, a more advanced
method of large-scale system
creation can use hierarchical simulations (rather than hierarchical models within a single
simulation), where small-scale systems are built and analyzed with the intent of producing
intermediate traces to stimulate a higher-level simulation in which multiple nodes from
each small-scale system then act as a single node in the large-scale system. This method
has the potential to not only speed up development times for these systems, but also to
reduce simulation runtimes. While this technique has not yet been employed, we plan to
examine its potential benefits in future research to improve the scalability of the FASE
simulation environment.
3.2.3 System Analysis
After the application has been thoroughly analyzed and the components and initial
systems have been developed, the user can begin analyzing the performance of various
applications executing on the target systems. The stimulus data from the application
domain is used as input to the simulation models in order to induce network traffic and
processing. One powerful feature of FASE is its ability to carry out multiple simulation
runs using different system configurations, all based on the same set of stimulus data.
Therefore, the additional time spent in application characterization allows the system
analysis to proceed much more quickly.
During simulation, statistics can be gathered on numerous aspects of the system
such as application runtime, network bandwidth, average network latency, and average
processing time. In addition to the “profiling”-type data that is collected, it may also
be desired to collect “traces” from the simulation. These traces differ from the stimulus
traces in that they are more architecture-dependent and less generic, but they provide
at least one common function: giving the user further insight into the performance of
the application under study. The breadth and depth of performance statistics and results
collected during simulation will determine the level of insight available post-simulation,
and the results collected should be tailored towards the needs of the user. However, the
type and fidelity of results collected may also negatively impact simulation runtime, so the
user should be careful not to collect an excessive amount of unnecessary result data. The
simulation environment may also be tailored to output results in a common format, such
that they may be viewed in a performance visualization tool such as Jumpshot [31].
In some cases, results obtained during system analysis can lead to additional insight
in terms of the application bottleneck, requiring re-characterization of the application as
shown in feedback loop 4 in Figure 3-1. In this situation, the steps from the application
domain are repeated, followed by possible additional component and system designs, and
repeated system analysis. Traces collected during simulation may also potentially be used
to drive future simulations in this iterative process to solve an optimization problem and
determine the ideal system configuration.
3.3 Results and Analysis
This section presents results and analysis for experiments conducted using FASE.
In each of the following subsections, we introduce the experimental setup used to collect
stimulus data (i.e., traces) and experimental numbers against which the simulation results
are compared. The first subsection presents the calibration procedures followed to validate
the three main network models in the current library – InfiniBand, TCP over IP over
Ethernet, and the direct-connect network based on the Scalable Coherent Interface (SCI)
protocol [32]. In each case, the appropriate MPI middleware layer is also incorporated on
each interconnect in the modeling environment. Section 3.3.2 presents a simple scalability
study for a matrix multiply benchmark using the aforementioned interconnects. Finally,
Section 3.3.3 showcases the features and capabilities of FASE through a comprehensive
scalability analysis of the Sweep3D application from the ASCI Blue Benchmark suite.
3.3.1 Model Validation
In order to test and validate the FASE construction set, the MLD models were first
calibrated to accurately represent some prevalent systems available in our lab. Validation
of the network and middleware models was conducted using a testbed comprised of 16
dual-processor 1.4 GHz Opteron nodes, each having 1 GB of main memory and running
the 64-bit CentOS Linux variant with kernel version 2.6.9-22. The nodes employed the
Voltaire HCA 400 attached to the Voltaire ISR 9024 switch for 10 Gbps InfiniBand
connectivity, while Gigabit Ethernet was provided using integrated Broadcom BCM5404C
LAN controllers connected via a Force10 S50 switch. The direct-connect network model
was calibrated to represent SCI hardware supplied by Dolphin Inc. A simple PingPong
MPI program that measures low-level network performance was used to calibrate the
models to best represent the 16-node cluster’s performance over a specific interconnect.
Three MPI middleware layers were modeled, including MVAPICH-0.9.5, MPICH2-1.0.5,
and MP-MPICH-1.3.0 for InfiniBand, TCP, and SCI, respectively.
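The calibration benchmark follows the standard PingPong pattern sketched below; the actual program sweeps a range of message sizes rather than the single fixed size shown here.

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

// Minimal MPI PingPong: rank 0 sends to rank 1 and waits for the echo;
// half the average round-trip time approximates the one-way latency.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const int bytes = 1024, reps = 1000;  // placeholder message size and count
    std::vector<char> buf(bytes);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; ++i) {
        if (rank == 0) {
            MPI_Send(buf.data(), bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf.data(), bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;
    if (rank == 0)
        std::printf("one-way latency: %g us\n", dt / (2.0 * reps) * 1e6);
    MPI_Finalize();
    return 0;
}
```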
Figures 3-3 through 3-8 show the experimentally gathered network performance values
of the InfiniBand, TCP/IP/Ethernet, and SCI testbeds compared to those produced by
the simulation models. The performance of each simulated configuration closely matches
that of the testbed; the average error between the experimental tests and simulative
models was 5% for InfiniBand, 3.6% for TCP/IP/Ethernet, and 2.7% for SCI. Throughput
calculated from the PingPong benchmark latencies for message sizes up to 32 MB shows
that the simulative bandwidths closely follow the measured bandwidths, with an average
error roughly equal to that found in the latency experiments. These results show the
component models are highly accurate when compared to the real systems, though dips
in the measured throughput are readily apparent at 4 MB and 256 KB for the InfiniBand
and TCP/IP/Ethernet networks, respectively. These decreases in bandwidth are due to
overheads incurred in the software layers as the software employs different mechanisms
to accommodate larger transfers; the corresponding models abstract these throttling
points with the goal of best matching the overall trend.
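Here and throughout this section, the reported error is presumably the relative deviation of the simulated value from the measured one, i.e.,

\[
\text{error} = \frac{|t_{\mathrm{sim}} - t_{\mathrm{exp}}|}{t_{\mathrm{exp}}} \times 100\%
\]

where t_exp and t_sim denote the experimentally measured and simulated values of the metric under comparison.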
Figure 3-3. InfiniBand model latency validation: (a) small message sizes, (b) large message sizes
Figure 3-4. InfiniBand model throughput validation
Figure 3-5. The TCP/IP/Ethernet model latency validation: (a) small message sizes, (b) large message sizes
Figure 3-6. The TCP/IP/Ethernet model throughput validation
3.3.2 System Validation
After validating the network and middleware models, we proceeded to examine the
accuracy and speed of FASE using a simple benchmark: matrix multiply. The selected
Figure 3-7. The SCI model latency validation: (a) small message sizes, (b) large message sizes
Figure 3-8. The SCI model throughput validation
implementation took a master/worker approach with the master node transmitting a
segment of matrix A and the entire matrix B to the corresponding workers in sequential
order and then receiving the results from each node in the same order. Measurements were
collected using the InfiniBand and Gigabit Ethernet models for system sizes of 2, 4, and 8
nodes and dataset sizes of 500×500, 1000×1000, 1500×1500, and 2000×2000. Each data
element is a 64-bit double-precision, floating-point value. The following paragraph outlines
the procedure taken to conduct the analysis of the matrix multiply using the FASE
methodology and corresponding toolkit. First, the matrix multiply code was instrumented
by linking the Sequoia library and the resulting binary was executed for each combination
of system and dataset size. Trace files for each combination were automatically generated
by the Sequoia instrumentation code during execution and served as the stimulus to the
simulation models. The component models from the FASE library were used to create
six systems, one for each combination of system size (3) and network (2) analyzed. This
step was conducted while collecting characterization data from the matrix multiply. After
the trace files were collected and the systems built, each system was simulated using the
corresponding trace files for each system size. It should be noted that for this particular
experiment, the Sequoia traces were collected using the testbed’s InfiniBand network,
though simulations were run using both InfiniBand and Gigabit Ethernet networks in
order to show the portability of the traces and highlight the flexibility of FASE. Finally,
experimental runtimes were measured for each system and dataset size running on both
networks in order to determine the errors associated with the simulated systems. Table
3-2 presents
the experimental and simulative results for the various networks.
Table 3-2. Experimental versus simulation execution times for matrix multiply

| System size | Data size | InfiniBand Exp. (sec) | InfiniBand Sim. (sec) | InfiniBand error | Gigabit Ethernet Exp. (sec) | Gigabit Ethernet Sim. (sec) | Gigabit Ethernet error |
|---|---|---|---|---|---|---|---|
| 2 | 500 | 3.32 | 3.33 | 0.21% | 3.42 | 3.38 | 1.24% |
| 2 | 1000 | 49.45 | 49.40 | 0.09% | 48.81 | 49.58 | 1.58% |
| 2 | 1500 | 187.40 | 187.03 | 0.20% | 187.50 | 187.43 | 0.04% |
| 2 | 2000 | 459.36 | 458.80 | 0.12% | 460.48 | 459.50 | 0.21% |
| 4 | 500 | 1.07 | 1.06 | 1.48% | 1.16 | 1.12 | 3.12% |
| 4 | 1000 | 16.69 | 16.54 | 0.93% | 16.63 | 16.76 | 0.79% |
| 4 | 1500 | 62.87 | 62.49 | 0.61% | 63.29 | 63.06 | 0.37% |
| 4 | 2000 | 153.82 | 153.19 | 0.41% | 154.43 | 154.02 | 0.26% |
| 8 | 500 | 0.52 | 0.46 | 12.70% | 0.62 | 0.58 | 7.60% |
| 8 | 1000 | 7.23 | 7.09 | 2.66% | 7.71 | 7.50 | 2.69% |
| 8 | 1500 | 27.51 | 26.93 | 2.12% | 28.24 | 27.94 | 1.06% |
| 8 | 2000 | 66.91 | 65.90 | 1.52% | 68.18 | 67.60 | 0.86% |
From Table 3-2, one can see that the simulations closely matched the experimental
execution times of the matrix multiply. The maximum error, 12.7%, occurred at smaller
dataset sizes with larger systems because the resulting shorter runtimes are more strongly
affected by timing deviations from various anomalies such as OS task management,
dynamic middleware techniques, etc. The effects of such anomalies are normally amortized
when analyzing the typical, long-running HPC application.
The simulation times for each run of the matrix multiply were also collected in order
to quantify the slowdown of using simulation versus real hardware thus highlighting the
“fast” portion of FASE. Table 3-3 shows that the ratios of simulation to experimental
wall-clock times are very low and in some cases (e.g., small systems with large data sizes),
the simulation actually completes faster than the hardware (represented by a ratio less
than one). The ratios less than one are directly related to the amount of characterization
information collected as well as the high level of abstraction of computation events in
the simulation models. In the case of the matrix multiply, computation was abstracted
through the use of timing information fed into a low-fidelity processor model, thus
accommodating short simulation times. As system size and problem size scale, more time
is spent triggering high-fidelity network models, thus slowing the simulations. However,
the wall-clock simulation times observed within FASE are orders of magnitude faster than
those of cycle-accurate simulators, where a 1000× or greater slowdown in wall-clock
execution time is common.
3.3.3 Case Study: Sweep3D
Now that the system models have been validated using the matrix multiply
benchmark, they can be used to predict the performance of any application executing
on them, provided the proper characterization and stimulus development steps have been
conducted. In order to display the full capabilities and features of FASE, a more
complex application was selected. The Sweep3D algorithm forms the foundation of a
real Accelerated Strategic Computing Initiative (ASCI) application and solves a 1-group,
time-independent, discrete ordinate 3D Cartesian geometry neutron transport problem
[33], [34]. As shown in Figure 3-9, each iteration of the algorithm involves two main steps.
The first step solves the streaming operator by “sweeping” each angle of the Cartesian
Table 3-3. Ratio of simulation to experimental wall-clock execution time

| System size | Data size | InfiniBand Exp. (sec) | InfiniBand Sim. (sec) | InfiniBand ratio | Gigabit Ethernet Exp. (sec) | Gigabit Ethernet Sim. (sec) | Gigabit Ethernet ratio |
|---|---|---|---|---|---|---|---|
| 2 | 500 | 3.32 | 5.58 | 1.68 | 3.42 | 6.21 | 1.82 |
| 2 | 1000 | 49.45 | 19.42 | 0.39 | 48.81 | 24.40 | 0.50 |
| 2 | 1500 | 187.40 | 42.71 | 0.23 | 187.50 | 54.60 | 0.29 |
| 2 | 2000 | 459.36 | 75.10 | 0.16 | 460.48 | 96.40 | 0.21 |
| 4 | 500 | 1.07 | 9.62 | 8.96 | 1.16 | 11.20 | 9.64 |
| 4 | 1000 | 16.69 | 33.44 | 2.00 | 16.63 | 44.10 | 2.65 |
| 4 | 1500 | 62.87 | 73.28 | 1.17 | 63.29 | 99.10 | 1.57 |
| 4 | 2000 | 153.82 | 131.05 | 0.85 | 154.43 | 174.00 | 1.13 |
| 8 | 500 | 0.52 | 17.92 | 34.22 | 0.62 | 22.60 | 36.17 |
| 8 | 1000 | 7.29 | 60.25 | 8.27 | 7.71 | 88.70 | 11.52 |
| 8 | 1500 | 27.51 | 129.06 | 4.69 | 28.24 | 195.00 | 6.92 |
| 8 | 2000 | 66.91 | 230.99 | 3.45 | 68.18 | 357.00 | 5.24 |
geometry using blocking, point-to-point communication functions while the second step
uses an iterative process to solve the scattering operator employing collective functions.
There are numerous input parameters that may be set, including the number of processing
elements in the X and Y dimensions of the logical system as well as the number of grid
points (i.e., double-precision, floating-point values) assigned to the XYZ dimensions of
the data cube. The default dataset sizes supplied with Sweep3D are 50×50×50 and
150×150×150, though the experiments in this section also explore an intermediate
dataset, 100×100×100. In this section, we present a set of experiments designed to
quantify the accuracy and speed of the FASE simulation environment with respect to the
Sweep3D application. The first experiment illustrates the accuracy of FASE by comparing
experimentally measured execution times of Sweep3D versus the times produced by
corresponding simulated systems. Experiment 2 analyzes the speed of simulations
conducted using FASE and demonstrates how the speed scales with system size. The
section concludes with a final experiment that showcases the full potential of the FASE
framework by providing a detailed simulative analysis of the Sweep3D application running
on systems with various sizes, interconnect technologies, topologies, and middleware
attributes.
Figure 3-9. Sweep3D algorithm
3.3.3.1 Experiment 1: Accuracy
The first set of experiments performed using Sweep3D was very similar to the experiments
from the last section. The main difference, however, is that the system under study for these
experiments leverages the abilities of FASE to simulate heterogeneous components. This
system is composed of a heterogeneous, 64-node Linux cluster that features four types of
computational nodes as listed in Table 3-4.
Table 3-4. Compute node specifications for each cluster in heterogeneous system

| Cluster | Node count | Processor | Memory | OS | Kernel |
|---|---|---|---|---|---|
| Cluster 1 | 10 | 1.4 GHz Opteron | 1 GB DDR333 | CentOS | 2.6.9-22 |
| Cluster 2 | 14 | 3.2 GHz Xeon with EM64T | 2 GB DDR333 | CentOS | 2.6.9-22 |
| Cluster 3 | 30 | 2.0 GHz Opteron | 1 GB DDR400 | CentOS | 2.6.9-22 |
| Cluster 4 | 10 | 2.4 GHz Xeon | 1 GB DDR266 | Redhat 9 | 2.4.20-8 |
All measurements and Sequoia traces were gathered over the Gigabit Ethernet
interconnect. Figure 3-10 shows a comparison of the execution times from the physical
and simulated systems while increasing the system and dataset sizes. Table 3-5 displays
Figure 3-10. Experimental versus simulative execution times for Sweep3D
the errors between experimental and simulative execution times. In this experiment, we
observed slightly higher error rates than the matrix multiply benchmark and network
validation tests, but this trend is to be expected considering the increased complexity
of the Sweep3D application and heterogeneous system used for the study. In all but five
cases, error rates were below 10%, with many cases showing errors around 1%. In the
cases with 10% error or greater, we can largely attribute the higher values to extraneous
data traffic and spurious OS activity, among other effects of non-dedicated resources.
Once again, the increased error rates occurred in cases where either dataset sizes were
small or system sizes were relatively large (or both). The maximum error observed is
23.28%, which is within an acceptable threshold for predicting the performance of
simulated systems, since real-world implementations of the hardware and software devices
will have a great effect on the actual performance of the final system.
3.3.3.2 Experiment 2: Speed
For each experiment, the testbed execution time of the application was compared
to the simulation time of the virtually-prototyped system. As seen in Figure 3-11, the
simulation time increases with both dataset size and system size. Increases in either
characteristic raise the amount of network traffic, thus causing more computation time to
Table 3-5. Experimental versus simulation errors for Sweep3D

| System size | Dataset 50×50×50 | Dataset 100×100×100 | Dataset 150×150×150 |
|---|---|---|---|
| 2 | 0.69% | 0.13% | 0.21% |
| 4 | 0.90% | 5.38% | 9.76% |
| 8 | 0.60% | 0.25% | 2.84% |
| 16 | 8.36% | 10.11% | 15.31% |
| 32 | 5.95% | 23.28% | 1.03% |
| 64 | 14.66% | 17.36% | 1.02% |
be spent processing interactions between the higher-fidelity models. In fact, the longest
simulation time, one hour, occurred for the 64-node system using the 150×150×150
dataset.
Figure 3-11. Ratios of simulation to experimental wall-clock completion time for varying system and dataset sizes
While a one-hour simulation time is within an acceptable tolerance, if we extrapolate
the timing results to even larger system and dataset sizes, we find that a system of
1024 nodes computing a 250×250×250 grid-point dataset will take approximately 70
hours. In order to cut the time when simulating the computation of very large datasets
on large-scale systems, the stimulus development and simulation techniques previously
described in this chapter can be employed. In this case, knowledge of the application’s
execution characteristics can help to speed up simulations. The Sweep3D application
performs twelve iterations of its core code, with each compute node having identical
communication blocks and very similar computation blocks per iteration. This similarity
across iterations allows us to use the performance data collected during a single iteration
to extrapolate the total time taken to execute all twelve iterations. This process can lead
to decreased simulation times with little effect on simulation accuracy. In fact, removing
all but a single iteration from the Sweep3D traces resulted in a simulation speedup of 9.6×
(a total of 7 minutes rather than 65 minutes) while impacting the accuracy of the model by
less than ±1%.
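The extrapolation itself is simple. Using hypothetical symbols, if T_1 denotes the simulated time of one core-code iteration and T_fixed the simulated time spent outside the iterated region, the full run is estimated as

\[
T_{\mathrm{total}} \approx T_{\mathrm{fixed}} + 12\,T_{1}
\]

which is why replaying a single iteration suffices once the per-iteration behavior is known to repeat.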
3.3.3.3 Experiment 3: Virtual system prototyping
Now that we have a fast and accurate baseline system for the Sweep3D application,
we can explore the effects of changing the system configuration. This experiment explores
the performance impact of increasing the number of nodes in the system by scaling the
processing power of each node. This scaling effectively extrapolates the performance
of the baseline 64-node system to represent the performance of Sweep3D executing on
systems up to 8192 nodes. The results provided in this experiment present the best-case
values of each system size since network issues such as switch congestion and multi-hop
transfers that arise from adding additional nodes to a system are not considered. Though
the actual network pressures of these systems are not fully represented, the results
do provide an upper bound performance of the Sweep3D application running on the
corresponding system. This upper bound can be used to quickly identify whether or not a
particular system is suitable to run a particular application thus facilitating the evaluation
of numerous configurations with the intent of pinpointing a small subset the original
candidate systems to simulate in more detail. Future work will incorporate the stimulus
development techniques discussed in Section 3.1.2 so that network contention is considered
to improve the accuracy of the performance predictions.
The various networks that we examine include standard Gigabit Ethernet (GigE), an
enhanced version of GigE, 10 GigE, InfiniBand, a 2D 8×8 direct-connect SCI network,
Figure 3-12. Execution times for Sweep3D running on various system configurations
and a 3D 4×4×4 SCI network. Each configuration included in this study attempts to shed
light on how changes in system size, network bandwidth, network latency, middleware
characteristics, and topology affect the overall performance of Sweep3D, each of which
is easily configurable via the FASE framework. The results from this experiment are
displayed in Figure 3-12, with the 64-node Gigabit Ethernet system used as the baseline.
Figure 3-12 shows the Gigabit Ethernet system experienced nearly linear speedups
until the system size reached 1024 nodes. At this point, the communication of Sweep3D
begins to dominate execution time. In general, the trends displayed in Figure 3-12 were
surprising, given that an initial timing analysis of the algorithm showed that half the
execution time was spent in communication blocks. Upon further analysis, we determined
that these trends are due to the “Late Sender” problem, where processes post their
MPI_Recvs before the matching MPI_Send is executed by the corresponding process,
causing the receiving process to become idle. In the case with 8192 nodes, the application
becomes network-bound, even with the Late Sender problem. Therefore, we must change
the focus to other network technologies to alleviate the communication bottleneck.
Figure 3-13. Maximum speedups for Sweep3D running on various network configurations
The first attempt to remedy the communication bottleneck for larger system sizes
employed optimized versions of TCP and MPI. Specifically, we increased key parameters
such as the TCP window size and the maximum transmission unit (effectively enabling the
use of jumbo frames) as well as reduced MPI overhead. This case, labeled Enhanced
GigE, showed little improvement in total execution time, leading us to conclude that the
bandwidth and latency of the network dictate the performance of communication events
rather than the middleware. The next tests conducted replaced the Gigabit Ethernet
interconnect with high-performance networks such as 10 GigE, InfiniBand and SCI. Figure
3-12 shows that in smaller systems, the interconnect has little effect on the performance
of Sweep3D. However, it becomes apparent that beyond 1024 nodes a faster interconnect
provides better speedups. For example, the 10 GigE and InfiniBand interconnects offer
speedups of 4.1× and 5.31×, respectively, over the baseline GigE system. Figure 3-13
illustrates the maximum speedups for the various high-performance networks that were
tested.
Not only do these tests analyze the effects of adding bandwidth and lowering latency,
but the SCI cases demonstrate the power of virtual system prototyping by exploring
the impact of mapping the Sweep3D algorithm to drastically different topologies,
specifically a 2D and 3D torus. At first glance, the Sweep3D algorithm seems to map
well to direct-connect network topologies based on its nearest neighbor communication
pattern; however, from Figure 3-13 it appears that this is not exactly the case. Further
analysis of the application and underlying architecture shows that three fundamental
characteristics of the SCI network hinder further performance improvements. First, the
SCI protocol uses small payload sizes of 128 bytes, which cannot effectively amortize
communication overhead. The second and third characteristics are the one-way packet
flow of a dimensional ring and the packet-forwarding or dimension-switching delays
that occur at each intermediate node while routing packets to the destination. Under
certain circumstances, as dictated by the Sweep3D algorithm, few packets experience
more than one forwarding delay because the target node is the next neighboring node
in the dimensional ring’s packet flow. This case provides an optimal mapping between the
algorithm and the network architecture, resulting in excellent performance. However,
this scenario is only one of four communication flows used by Sweep3D. The remaining
communications result in numerous transactions that must travel almost the entire
length of a dimensional ring to reach their destinations, resulting in many multi-hop
transfers, higher latencies, and lower overall speedups. These negative effects could easily
be lost when using low-fidelity, analytical network models due to their inability to capture
structural characteristics that can greatly impact overall system performance.
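The hop-count asymmetry behind this behavior is easy to see. The sketch below, which is illustrative and not the FASE SCI model itself, counts forwarding delays on a unidirectional N-node dimensional ring:

```cpp
// On a unidirectional N-node ring, a packet from src to dst incurs
// ((dst - src) mod N) forwarding/dimension-switching hops.
int ringHops(int src, int dst, int N) {
    return ((dst - src) % N + N) % N;
}

// Example: on an 8-node ring, ringHops(0, 1, 8) == 1 (the ideal mapping),
// while ringHops(1, 0, 8) == 7 (the packet travels almost the entire ring).
```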
Not only can FASE be used to analyze the effects of interchanging network
technologies, but it can also provide a highly detailed analysis of a specific technology. We
chose the InfiniBand network for the in-depth study since it showed the best performance
of the six configurations. The InfiniBand model, as well as the MPI middleware model,
has numerous user-definable parameters that can be changed to correspond to the current
and future versions of the technologies. These parameters provide a mechanism to
perform fine-grained analyses to squeeze as much performance as possible from a specific
technology while also providing valuable insight on any new bottlenecks that may arise
Figure 3-14. Speedups for Sweep3D running on 8192-node InfiniBand system
during future upgrades. From the results in Figure 3-14, one can see that even at 8192
nodes, network enhancements have little effect on the InfiniBand system’s performance.
The greatest speedup from changes in the communication layer comes from middleware
enhancements that achieve a 1.21× performance boost over the baseline. These results
indicate that further improvements of the InfiniBand system’s computation capabilities
are needed in order to show any significant speedups when tweaking the middleware and
network layers. The enhancement that increases network and processing performance
as well as decreases the overhead associated with the middleware (see Figure 3-14,
Middleware+Network+Processor enhancement) reinforces this claim by providing a 2.44×
speedup. This study provides but a brief summary of the modifications that can be made
to test the effectiveness of key technologies when analyzing an application using FASE.
3.4 Conclusions
The task of designing powerful yet cost-efficient HPC systems is becoming extremely
daunting due to not only the increasing complexity of individual computation and
I/O components but also the effective mapping of grand-challenge applications to the
underlying architecture. In this first phase of research, we presented a framework called
the Fast and Accurate Simulation Environment (FASE) that aids system engineers in
overcoming the challenges of designing systems that target specific applications through
the use of analysis tools in conjunction with discrete-event simulation. Within the
application domain, FASE provides a methodology to analyze and extract an application’s
performance-critical data which is then used to discover trends and limitations as well
as provide stimuli for simulation models for virtual prototyping. We also provided
background on various options for performance prediction of HPC systems through
modeling and simulation, and outlined the need for a solution that can provide fast
simulation times with accurate results. The FASE framework was then outlined and its
different components and features were described.
To showcase the capabilities of FASE, we gathered a variety of results showing the
performance of various system configurations. We first provided validation results for our
InfiniBand, TCP/IP over Ethernet, and SCI models, showing that the network models
that serve as the backbone of the case studies in this paper have been carefully tuned
to accurately match real hardware. We then showed the results of a matrix multiply
case study where we compared experimental to simulated execution times for a parallel
MPI-based matrix multiply benchmark. In most cases, errors in the models were less than
1%, with the maximum error of 12.7% occurring in a case with a small dataset size and
a large system size. These conditions result in short experimental runtimes of less than
one second where transient effects such as OS task management and page faults can cause
unpredictable deviations in application execution time. In terms of simulation speed, the
slowdowns observed by simulating the parallel matrix multiply were very low, and in some
cases, the simulation actually completed before the actual system finished executing the
code.
The final case study presented a series of experiments using the Sweep3D benchmark,
which is the main kernel of a real Accelerated Strategic Computing Initiative application.
We performed simulation and hardware experiments over a range of dataset sizes using
a Gigabit Ethernet-based system, and again found errors to be very low in most cases.
A maximum error of 23.28% was observed, which is considered acceptable when dealing
with predicting the performance of a complex application running on an HPC system.
Again, the cases with high errors correspond to non-typical HPC scenarios with larger
systems working on small datasets, resulting in conditions that amplify deviations in
experimentally measured runtimes. We also proposed and employed tactics to speed up
simulation by 10× while sacrificing less than 1% accuracy. With a fast and accurate
baseline system established for Sweep3D, we proceeded to use the FASE methodology
to predict the performance of the application for systems with various sizes and network
technologies. We found that the application was actually more processor-bound than
initially anticipated due to the MPI “Late Sender” problem, where a process posts
an MPI_Recv before the corresponding MPI_Send is executed, causing the receiving
process to become idle. However, as the systems increased in size, Sweep3D did become
network-bound, with 10 Gigabit Ethernet and InfiniBand both providing significant
performance improvements over the Gigabit Ethernet baseline mainly due to their
increased bandwidth. The analysis of Sweep3D concluded with an in-depth look at
its execution on an InfiniBand system while varying fine-grained parameters such as
network bandwidth, packet size, and middleware overhead. These modifications provided
minimal performance improvements because the algorithm’s bottleneck changed from
communication to computation when network backpressure was reduced by choosing an
improved network technology (i.e., InfiniBand).
The work conducted during this phase of research produced a flexible and
comprehensive framework for performance modeling and prediction. This framework
provides a generalized methodology for application characterization, design and
development of component and system models, and analysis of applications running on the
virtual systems under consideration. The work also produced a set of tools and a model
library to facilitate performance prediction. The case studies validated the usefulness
of FASE by displaying both fast and accurate results when comparing the observed
experimental and simulative values. The studies also illustrated the capabilities of FASE
for analyzing the effects of architectural variations in order to improve the scalability of
applications. The contributions and accomplishments of this work have been compiled into
a manuscript and published in Simulation: Transactions of The Society for Modeling and
Simulation International [35].
CHAPTER 4
PERFORMANCE AND AVAILABILITY PREDICTIONS OF VIRTUALLY PROTOTYPED SYSTEMS FOR SPACE-BASED APPLICATIONS (PHASE 2)
This chapter presents two detailed case studies of the FASE framework applied to
analyze the effects of various configuration and algorithmic changes to a space-based
system. The first case study looks at performance and scalability issues of the NASA
Dependable Multiprocessor (DM) executing a key application kernel, the Fast Fourier
Transform. The second evaluates the performance and availability of the Synthetic
Aperture Radar (SAR) application running on the DM system in a faulty environment
such as space. The following sections provide the details and results of these case studies
as well as a novel analysis approach to accurately predict system performability. The first
section presents background information on the motivations and initial design of the space
system. The next section supplies details on the approach taken to design and develop
the necessary models to virtually explore the performance, scalability, and availability
trade-offs of the candidate system. It also describes the unique analysis approach used to
study SAR executing on the DM system. Finally, the experiments and results from both
case studies are presented followed by the conclusions drawn from the analyses.
4.1 Background
4.1.1 Project Overview
NASA and other space agencies have had a long and relatively productive history
of space exploration as exemplified by recent rover missions to Mars. Traditionally,
space exploration missions have essentially been remote-control platforms with all major
decisions made by operators located in control centers on Earth. The onboard computers
in these remote systems have contained minimal functionality, partially in order to satisfy
design size and power constraints, but also to reduce complexity and therefore minimize
the cost of developing components that can endure the harsh environment of space. Hence,
these traditional space computers have been capable of doing little more than executing
small sets of real-time spacecraft control procedures, with little or no processing features
remaining for instrument data processing. This approach has proven to be an effective
means of meeting tight budget constraints because most missions to date have generated
a manageable volume of data that can be compressed and post-processed by ground
stations.
However, as outlined in NASA’s latest strategic plan and other sources, the demand
for onboard processing is predicted to increase substantially due to several factors [36].
As the capabilities of instruments on exploration platforms increase in terms of the
number, type and quality of images produced in a given time period, additional processing
capability will be required to cope with limited downlink bandwidth and line-of-sight
challenges. Substantial bandwidth savings can be achieved by performing preprocessing
and, if possible, knowledge extraction on raw data in situ. Beyond simple data collection,
the ability for space probes to autonomously self-manage will be a critical feature to
successfully execute planned space-exploration missions. Autonomous spacecraft have
the potential to substantially increase their return on investment through opportunistic
explorations conducted outside the Earth-bound operator control loop. To achieve this
goal, the required processing capability becomes even more demanding when decisions
must be made quickly for applications with real-time deadlines. However, providing
the required level of onboard processing capability for such advanced features and
simultaneously meeting tight budget requirements is a challenging problem that must
be addressed.
In response, NASA has initiated several projects to develop technologies that address
the onboard processing gap. One such program, NASA’s New Millennium Program
(NMP), provides a venue to test emergent technology for space. The Dependable
Multiprocessor (DM) is one of the four experiments on the upcoming NMP Space
Technology 8 (ST8) mission, to be launched in 2009, and the experiment seeks to
deploy Commercial-Off-The-Shelf (COTS) technology to boost onboard processing
performance per watt [37]. The DM system combines COTS processors and networking
components (e.g., Ethernet) with a novel and robust middleware system that provides a
means to customize application deployment and recovery features, and thereby maximize
system efficiency while maintaining the required level of reliability by adapting to the
harsh environment of space. In addition, the DM system middleware provides a parallel
processing environment comparable to that found in high-performance COTS clusters of
which application scientists are familiar. By adopting a standard development strategy
and runtime environment, the additional expense and time loss associated with porting
applications from the laboratory to the spacecraft payload can be significantly reduced.
4.1.2 DM System Architecture
Building upon the strengths of past research efforts [38], [39], [40], the DM
system provides a cost-effective, standard processing platform with a seamless
transition from ground-based computational clusters to space systems. By providing
development and runtime environments familiar to earth and space science application
developers, project development time, risk and cost can be substantially reduced.
The DM hardware architecture (see Figure 4-1) follows an integrated-payload concept
whereby components can be incrementally added to a standard system infrastructure
inexpensively [41]. The DM platform is composed of a collection of COTS data
processors (augmented with runtime-reconfigurable COTS FPGAs) interconnected
by redundant COTS packet-switched networks such as Ethernet or RapidIO [42]. To
guard against unrecoverable component failures, COTS components can be deployed
with redundancy, and the choice of whether redundant components are used as cold
or hot spares is mission-specific. The scalable nature of non-blocking switches provides
distinct performance advantages over traditional bus-based architectures and also allows
network-level redundancy to be added on a per-component basis. Additional peripherals or
custom modules may be added to the network to extend the system’s capability; however,
these peripherals are outside of the scope of the base architecture.
Figure 4-1. System hardware architecture of the dependable multiprocessor
Future versions of the DM system may be deployed with a full complement of COTS
components but, in order to reduce project risk for the DM experiment, components
that provide critical control functionality are radiation-hardened in the baseline
system configuration. The DM is controlled by one or more System Controllers, each
a radiation-hardened single-board computer, which monitor and maintain the health of the
system. Also, the system controller is responsible for interacting with the main controller
for the entire spacecraft. Although system controllers are highly reliable components, they
can be deployed in a redundant fashion for highly critical or long-term missions with cold
or hot sparing. A radiation-hardened Mass Data Store (MDS) with onboard data handling
and processing capabilities provides a common interface for sensors, downlink systems
and other peripherals to attach to the DM system. Furthermore, the MDS provides a
globally accessible and secure location for storing checkpoints, I/O and other system data.
The primary dataflow in the system is from instrument to Mass Data Store, through
the cluster, back to the Mass Data Store, and finally to the ground via the spacecraft’s
Communication Subsystem. Because the MDS is a highly reliable component, it will likely
have an adequate level of reliability for most missions and therefore need not be replicated.
However, redundant spares or a fully distributed memory approach may be required for
some missions. In fact, results from an investigation of the system performance suggest
that a monolithic and centralized MDS may limit the scalability of certain applications;
these results are presented in Section 4.3.
4.1.3 DM Middleware Architecture
The DM middleware has been designed with the resource-limited environment typical
of embedded space systems in mind and yet is meant to scale up to hundreds of data
processors per the goals for future generations of the technology. A top-level overview of
the DM software architecture is illustrated in Figure 4-2. A key feature of this architecture
is the integration of generic job management and software fault-tolerant techniques
implemented in the middleware framework. The DM middleware is independent of
and transparent to both the specific mission application and the underlying platform.
This transparency is achieved for mission applications through well-defined, high-level,
Application Programming Interfaces (APIs) and policy definitions, and at the platform
layer through abstract interfaces and library calls that isolate the middleware from the
underlying platform. This method of isolation and encapsulation makes the middleware
services portable to new platforms.
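An abstract platform interface of this kind might look as follows; this is a hypothetical illustration of the isolation idea, not the actual DM middleware API.

```cpp
#include <cstddef>

// Hypothetical platform-abstraction layer: the middleware codes against
// this interface, and porting to a new platform means supplying a concrete
// implementation of these calls (names and signatures are illustrative).
class PlatformServices {
public:
    virtual ~PlatformServices() = default;
    virtual void sendMessage(int node, const void* buf, std::size_t len) = 0;
    virtual int  spawnProcess(const char* path) = 0;  // returns a process handle
    virtual bool isNodeAlive(int node) = 0;           // liveliness query
};
```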
Figure 4-2. System software architecture of the dependable multiprocessor
To achieve a standard runtime environment to which science application designers
are accustomed, a commodity operating system such as a Linux variant forms the basis
for the software platform on each system node including the control processor and mass
data store (i.e., the Hardened Processor seen in Figure 4-2). Providing a COTS runtime
system allows space scientists to develop their applications on inexpensive ground-based
clusters and transfer their applications to the flight system with minimal effort. Such an
easy path to flight deployment will reduce project costs and development time, ultimately
leading to more science missions deployed over a given period of time. Table 4-1 provides
descriptions of the other DM middleware components.
Table 4-1. The DM middleware components

| Component | Description |
|---|---|
| High-Availability Middleware (HAM) | Provides a standard communication interface between all software components, including user applications. Guarantees in-order delivery of all messages and supports seamless switching between redundant networks. |
| Fault-Tolerance Manager (FTM) | Central fault recovery agent for the DM system. Monitors status of software agents and reliable messaging middleware. Updates JM tables upon resource changes affecting application scheduling. |
| Job Manager (JM) | Centralized component that schedules jobs, allocates resources, dispatches processes, and directs application recovery. |
| Job Manager Agents (JMA) | Distributed software agents that fork the execution of jobs and manage required runtime job information on the local host. |
| Fault-tolerant Embedded Message Passing Interface (FEMPI) | Application-independent, fault-tolerant message passing interface adhering to the MPI standards. Provides a subset of the MPI API and supports various fault recovery modes. |
| MDS Server | Services all data operations between applications and mass memory. |
4.2 Approach
The FASE framework presented in Chapter 3 provides an ideal environment for
exploring the design options involved with system configuration of the DM system. The
models in the pre-built library were designed so that components could be configured for
embedded or traditional HPC systems through simple parameter tweaks representing the
capacity or capability of a specific resource. Therefore, various design trade-offs can be
explored using a variety of hardware and software models in order to analyze their effects
on system performance and scalability.
In order to apply the FASE framework to study COTS components in space, the
original design was extended to support fault injection capabilities. These additional
features allow users to explore not only performance-oriented issues, but also those dealing
with fault tolerance and availability. In order to facilitate the use of these new capabilities,
we introduce a novel approach to predict the performability, a metric that
combines performance and availability to describe degradable systems [43], of COTS-based
payload processing systems. This approach analyzes systems in three complementary
domains: 1) physical prototype, 2) Markov-reward model, and 3) discrete-event simulation
model. Techniques from each domain represent cornerstones in the analysis process
though each has its strengths and weaknesses. Physical prototypes offer validity in
measured values but provide limited scalability and adaptability. Markov-reward models
allow for quick performability measurements for specific failure and recovery rates, but
are not suitable for modeling complex systems due to the high dimensionality required
for high-fidelity models. Finally, simulation provides a free-form environment
to evaluate systems with arbitrary configurations and workloads, but often suffers
from increased development time and lengthy analyses. By intelligently leveraging the
strengths of each domain, a quick and precise analysis of various system configurations
and applications can be achieved that includes a variety of arbitrary workloads and
fault-injection campaigns. The process begins with the evaluation of the prototype system,
where real-world performance values such as network latency and component recovery
times are measured and used to calibrate the Markov-reward and discrete-event simulation
models that would otherwise lack validity. Next, quick performability evaluations of
the system’s fault-tolerant software architecture are conducted using Markov modeling
techniques to identify efficient designs and workloads. Thus, the Markov models trim an
otherwise large design space to eliminate the time spent analyzing poor designs. The final
step uses pre-built or customized simulation models to analyze architectural enhancements
and dependencies within the selected systems and applications at a level of detail that
cannot be achieved in the previous domains. The resulting methodology allows candidate
systems to be thoroughly and accurately analyzed for both performance and availability
thus allowing designers to compare alternate fault-tolerant architectures for aerospace
applications.
We apply the three-stage methodology to analyze and quantify the performance and
fault-tolerant characteristics of the DM management software and proposed flight system.
The following subsections provide more details on the modeling efforts involved with this
work.
4.2.1 Physical Prototype
The first stage of the analysis approach involves the development and testing of a
prototype system that represents a scaled-down version of the proposed system. The
prototype DM system was designed and developed to mirror, when possible, and emulate,
when necessary, the features of a typical satellite system. As shown in Figure 4-3, the
prototype hardware consists of a collection of COTS Single-Board Computers (SBCs)
running a Linux-based operating system, interconnected with redundant Gigabit Ethernet
networks. One SBC is augmented with an FPGA coprocessor, and a reset controller and
power supply are incorporated for power-off resets. Six SBCs are used to mirror four data
processor boards and emulate the functionality of the two radiation-hardened control
and MDS nodes. Each SBC is composed of a 1 GHz PowerPC processor, 1 GB of main
memory and dual Gigabit Ethernet NICs. A Linux workstation emulates the role of the
Spacecraft Command and Control Processor, which is responsible for communication
with and external control of the DM system but is outside the scope of this paper.
The MPI middleware layer used on the testbed is FEMPI 1.0, a custom, fault-tolerant
implementation of a selected subset of the MPI standard [44]. GoAhead’s SelfReliant 4.1 is
used as the high-availability middleware, which provides network communication, liveliness
information, and network failover. Finally, the MDS storage device was emulated via a
5400 RPM hard drive.
The prototype is used to measure the achievable performance of the system executing
microbenchmarks that exercise its network and MDS subsystems. It is also employed
to gather the response times necessary to detect failed components within the system.
Further details on how the prototype is used to validate the DM models are presented in
Section 4.3.

Figure 4-3. Logical diagram and photograph of DM testbed: (a) logical diagram, (b) photograph
4.2.2 Markov-Reward Modeling
After the prototype system has been developed and performance and other key
metrics have been measured, the analysis transitions to the Markov-reward modeling
domain. Within this domain, quick evaluations are conducted to explore various fault and
recovery rates in order to identify the workloads and system configurations that are most
interesting for further study. In these studies, steady-state performability (SSP) is the
common fitness metric used to describe the performance of degradable multiprocessor
computer systems [45]. The SSP allows users to predict the mean computational
performance of the system, taking into account both short- and long-term effects that
could otherwise skew experimental measurements of system performance.
A typical method used to estimate the SSP involves using Markov-reward models
(MRMs), constructs based on continuous-time Markov chains (CTMC). MRMs combine
state probabilities, obtained from steady-state analyses of CTMCs, and reward rates based
on computational performance to calculate the SSP of a system. Formally, the SSP is
defined as the expected asymptotic reward or
SSP = \sum_{i \in S} \pi_i r_i \qquad (4–1)

where S represents the set of all possible states that the given system can occupy, \pi_i
denotes the steady-state probability of the system occupying state i, and r_i stands for the
reward rate of the ith state [46].
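To make Equation 4–1 concrete, the SSP computation reduces to a dot product between the steady-state probability vector and the reward vector. The following minimal C++ sketch (the probabilities and rewards are illustrative values, not outputs of the DM models) shows the calculation; C++ is also the language in which the simulation models described later in this chapter were written.

```cpp
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    // Illustrative steady-state probabilities (must sum to 1) and
    // reward rates for a hypothetical five-state degradable system.
    std::vector<double> pi = {0.90, 0.04, 0.03, 0.02, 0.01};
    std::vector<double> r  = {1.0,  0.0,  0.0,  0.0,  0.0};

    // SSP = sum over all states i of pi_i * r_i (Equation 4-1).
    double ssp = std::inner_product(pi.begin(), pi.end(), r.begin(), 0.0);
    std::printf("SSP = %.4f\n", ssp);  // prints 0.9000 for these values
    return 0;
}
```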
4.2.2.1 Data node model
The data node model focuses on calculating the performability of the node under
faulty conditions. To simplify the node model we assumed that jobs are continuously
scheduled on the node by the JM. Also, the delay between the completion of one job and
start of another is considered to be negligible since the run time of each job is significantly
larger than the scheduling time. These assumptions allow the model to be realized as a
six-state CTMC as shown in Figure 4-4. Each state corresponds to a particular condition
of the data processing node where the three primary components, the application, JMA,
and system (i.e., HAM and operating system), are either operational or not.
Figure 4-4. Markov-reward data node model
The S_APP state occurs when the application is executing on the node and all other
services are running correctly. This state is the only node configuration with a non-zero
reward rate (equal to 1), which makes the SSP equivalent to the availability for this
model. In order for the model to transition out of the S_APP state, an SEE-related error
must occur, causing a hang or crash of the application, JMA, or system, governed by the
fault rates λ_AF, λ_JF, and λ_SF, respectively. Since the recovery policy (node reset) for
node-wide errors and HAM errors is identical, those failure rates are combined to simplify
the model. Each failure rate is proportional to the reciprocal of the independent variable
MTBF_NODE (mean time between faults for the node), which corresponds to the SEE rate
experienced by the node. Because the majority of SEEs are expected to impact the CPU,
each of the aforementioned fault rates is obtained by scaling the node fault rate by the
CPU utilization of the given software component (CPU%_APP, CPU%_JMA, CPU%_SYS).
The CPU utilizations are defined by the following equation:

\mathrm{CPU\%}_{APP} + \mathrm{CPU\%}_{JMA} + \mathrm{CPU\%}_{SYS} = 100\% \qquad (4–2)
The S_DET state denotes a detection delay after an error has occurred in the
application, causing it to abort or crash. This delay is associated with the heartbeat
interval between the running application and the JMA. The S_JMA state denotes a
configuration in which no application is running but the rest of the system is functioning
properly. To transition to the S_REC state, the JMA must start an application with rate
µ_FR, which is inversely proportional to the time required by the system to start the
process. The S_REC state symbolizes the application recovering from a crash. The rate
µ_RC, at which the system can transition back to S_APP, depends on the checkpointing
interval of the application as well as the size of the checkpoint and the transfer time
from the MDS. When the JMA fails, all running applications are terminated and the
model enters the S_SYS state. Upon entering the S_SYS state, the HAM immediately
attempts to start a JMA with a start rate of µ_FR. If the operating system or HAM fails
before the JMA starts up, the node model switches to the S_DOWN state and the node
is rebooted. The reboot rate µ_RB dictates the time required for the system to cycle
power to the node, start the operating system and HAM, and reconnect to the system.
Tables 4-2 and 4-3 summarize the states and parameters incorporated into the node
model, respectively.
Table 4-2. Data node model states

Symbol  | Running components    | Description
S_APP   | SYS, JMA, Application | System is functioning correctly.
S_DET   | SYS, JMA              | Application has crashed or hung.
S_JMA   | SYS, JMA              | The JMA is ready to start or restart the application.
S_REC   | SYS, JMA, Application | The application is recovering from the crash.
S_SYS   | SYS                   | The JMA has crashed or hung (the application is automatically killed).
S_DOWN  | None                  | The system has crashed and requires a reboot.
Table 4-3. Failure and recovery rates of the node model

Symbol     | Rate [1/s] or value  | Description                          | Type
MTBF_NODE  | variable             | Mean time between faults for a node. | Input
λ_AF       | CPU%_APP / MTBF_NODE | Application fault rate.              | Derived
λ_JF       | CPU%_JMA / MTBF_NODE | JMA fault rate.                      | Derived
λ_SF       | CPU%_SYS / MTBF_NODE | System fault rate.                   | Derived
µ_RB       | 0.0333               | System reboot rate (HAM and OS).     | Measured
µ_RC       | 0.069061             | Application recovery rate.           | Measured
µ_FR       | 14.27                | System fork rate.                    | Measured
µ_DT       | 0.8333               | Failed application detection rate.   | Measured
CPU%_APP   | 70%                  | Portion of CPU used by application.  | Estimated
CPU%_JMA   | 5%                   | Portion of CPU used by JMA.          | Estimated
CPU%_SYS   | 25%                  | Portion of CPU used by OS and HAM.   | Estimated
The rates specified in Table 4-3 are divided into three categories: derived, measured,
and estimated. The derived rates are calculated from the SEE rate using the equations
above, while the measured rates were obtained by experimental measurements on the DM
prototype system. The CPU utilization values chosen reflect the estimated workload of
typical applications running on the DM system.
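To illustrate how the node model can be evaluated numerically, the following C++ sketch builds the six-state generator matrix from the rates in Table 4-3 and solves for the steady-state probabilities by uniformization. The arc set is inferred from the prose description of Figure 4-4 (the figure itself is not reproduced here) and should be treated as an assumption; SHARPE, introduced below, performs the equivalent analysis in practice.

```cpp
#include <array>
#include <cmath>
#include <cstdio>

// Steady-state solution of the six-state data node CTMC by
// uniformization: P = I + Q/Lambda is a stochastic matrix with the
// same stationary vector as Q, so repeated multiplication converges
// to pi, from which SSP = pi[S_APP] (the only state with reward 1).
int main() {
    enum { APP, DET, JMA, REC, SYS, DOWN, N };

    // Rates from Table 4-3 with MTBF_NODE = 28800 s (three faults/day).
    const double mtbf = 28800.0;
    const double lAF = 0.70 / mtbf, lJF = 0.05 / mtbf, lSF = 0.25 / mtbf;
    const double muDT = 0.8333, muFR = 14.27, muRC = 0.069061, muRB = 0.0333;

    double Q[N][N] = {};
    // Arc set inferred from the description of Figure 4-4 (an assumption).
    Q[APP][DET]  = lAF;   // application crashes or hangs
    Q[APP][SYS]  = lJF;   // JMA fails, application killed
    Q[APP][DOWN] = lSF;   // OS/HAM fails, node reset required
    Q[DET][JMA]  = muDT;  // JMA detects the failed application
    Q[JMA][REC]  = muFR;  // JMA forks the application
    Q[REC][APP]  = muRC;  // checkpoint restored, application running
    Q[SYS][JMA]  = muFR;  // HAM restarts the JMA
    Q[SYS][DOWN] = lSF;   // OS/HAM fails before the JMA comes up
    Q[DOWN][SYS] = muRB;  // power-cycle restores OS and HAM

    double lambda = 0.0;  // uniformization constant >= max exit rate
    for (int i = 0; i < N; ++i) {
        double out = 0.0;
        for (int j = 0; j < N; ++j) if (j != i) out += Q[i][j];
        Q[i][i] = -out;
        lambda = std::fmax(lambda, out);
    }
    lambda *= 1.1;

    std::array<double, N> pi{};
    pi[APP] = 1.0;
    for (long it = 0; it < 5000000; ++it) {  // pi <- pi * (I + Q/lambda)
        std::array<double, N> next{};
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                next[j] += pi[i] * ((i == j ? 1.0 : 0.0) + Q[i][j] / lambda);
        pi = next;
    }
    std::printf("SSP = P(S_APP) ~= %.6f\n", pi[APP]);  // roughly 0.999 here
    return 0;
}
```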
4.2.2.2 System model
The goal of the Markov model representing the DM system is to approximate the
SSP of the system with an arbitrary number of nodes. For this model, we assume each
node executes completely independent workloads and, as a result, the model represents a
best-case approximation and sets an upper bound on the SSP of the system. The fact that
the radiation-hardened control and MDS nodes in the DM system are not susceptible to
SEEs further simplifies the model. To predict the SSP of such a system, we developed a
Markov-reward model with N+1 states, where N is the number of compute nodes in the
cluster (see Figure 4-5).
Figure 4-5. Markov-reward system model
Each state in the system model denotes the number of nodes that are currently in the
S_APP state at a given time. Most commonly, the reward rate associated with each state is
simply set to the number of nodes in the S_APP state. However, in systems such as those
with hot spares or those that incur overhead penalties, each node's reward rate can be
modified accordingly. The node failure rate λ_ND, defined in Equation 4–3, is equivalent to
the aggregate rate of all transitions out of the S_APP state, while the recovery rate of a node,
µ_ND, is the rate at which the node model can transition back to the S_APP state. Equation
4–4 provides a formal definition of this recovery rate.
\lambda_{ND} = \lambda_{AF} + \lambda_{JF} + \lambda_{SF} = \frac{1}{\mathrm{MTBF}_{NODE}} \qquad (4–3)
\mu_{ND} = \frac{P(S_{APP})\,\lambda_{ND}}{1 - P(S_{APP})} \qquad (4–4)
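Equation 4–4 can be read as a flow-balance condition: collapsing the node model into an "up" aggregate (S_APP, occupied with probability P(S_APP)) and a "down" aggregate (all remaining states), the steady-state flow out of the up aggregate must equal the flow back into it:

P(S_{APP})\,\lambda_{ND} = \left(1 - P(S_{APP})\right)\mu_{ND} \;\Longrightarrow\; \mu_{ND} = \frac{P(S_{APP})\,\lambda_{ND}}{1 - P(S_{APP})}

so that the two-state abstraction reproduces the availability computed by the full six-state node model.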
To evaluate the data node and system models, we use the SHARPE tool, which
is commonly used to analyze Markov chains, Petri nets, and hierarchical models for
availability, reliability, and dependability calculations. The tool is actively developed at
Duke University [47].
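Because the nodes are assumed to fail and recover independently, one natural reading of Figure 4-5 is a machine-repair birth-death chain in which state k moves to k−1 at rate k·λ_ND and to k+1 at rate (N−k)·µ_ND. Under this assumption (an inference, not confirmed by the figure itself), the stationary distribution is binomial and the SSP has the closed form sketched below.

```cpp
#include <cstdio>

// Closed-form SSP of the N+1-state system model, assuming the
// independent-repair birth-death structure described in the text:
// each node is up with probability p = mu_ND / (lambda_ND + mu_ND),
// and with reward = number of nodes in S_APP, SSP = N * p.
int main() {
    const int    N        = 32;
    const double lambdaND = 1.0 / 28800.0;                   // Equation 4-3
    const double pApp     = 0.999;            // P(S_APP) from the node model
    const double muND     = pApp * lambdaND / (1.0 - pApp);  // Equation 4-4

    const double p   = muND / (lambdaND + muND);  // algebraically equals pApp
    const double ssp = N * p;
    std::printf("per-node availability = %.6f, SSP = %.4f nodes\n", p, ssp);
    return 0;
}
```

Note that substituting Equation 4–4 into p collapses it back to P(S_APP), confirming that this best-case system model simply scales the node availability by N.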
In order to coarsely evaluate the performance of the DM system, we developed a
hierarchical Markov-reward model that allows for rapid evaluation of the potential
computational rates achievable for a range of applications under varying fault conditions.
Unfortunately, such a basic model lacks the fidelity and precision to explore the effects
of network, CPU, MDS, or scheduling performance, which in conjunction with fault
conditions can significantly affect the SSP. The quality of the SSP obtained from the
Markov-reward model is further evaluated and compared to the simulative model in
Section 4.3.
4.2.3 Discrete-Event Simulation Modeling
The final step in the three-stage analysis involves in-depth evaluations of the virtually
prototyped system using discrete-event simulation models. In the simulation domain,
analyses of application and architectural configurations are conducted with the intent to
identify the settings that produce the highest performance and availability.
Based on the node and system architectures from Sections 4.1.2 and 4.1.3,
discrete-event simulation models of key components were designed and developed. Each
model adheres to the FASE methodology of balancing speed and accuracy, and some
models actually extend or enhance the pre-existing models in the FASE library. The
components listed in Table 4-4 were formally modeled to not only capture the correct
functionality of the corresponding technology but also to incorporate their impacts on
system performance and fault tolerance. From these core component models, node and
system models were developed. Figure 4-6 illustrates the middleware models that compose
a data processing node, system control node, and MDS node. Finally, the virtual flight
system was developed using the final design architecture (see Figure 4-7).
Table 4-4. Summary of DM component models

Component                    | Library | Description
Fault Tolerance Manager      | DM      | Detects component failures, notifies Job Manager, and takes necessary recovery procedures.
Job Manager                  | DM      | Schedules and manages jobs. Handles task restarts based on available resources.
Job Manager Agent            | DM      | Starts and monitors applications on data node. Notifies Job Manager when application failure detected.
High-Availability Middleware | DM      | Provides reliable communication between nodes in system. Monitors JM, JMA, and other nodes for failures. Notifies FTM of failures.
MDS Server                   | DM      | Handles data access requests to the mass data store.
TCP Layer                    | FASE    | Provides TCP protocol for reliable communication between nodes.
IP Layer                     | FASE    | Provides IP protocol for all network transfers.
Ethernet NIC                 | FASE    | Provides Ethernet protocol for all network transfers. Supports multiple ports.
Ethernet Switch              | FASE    | Provides Ethernet connectivity between nodes. Supports variety of backplane and routing options.
4.2.4 Fault Model Library
The models described in the previous section capture the functionality and
performance-based characteristics as seen in their real-world counterparts. However,
they do not include fault detection and recovery mechanisms needed to function properly
when exposed to a fault. As a result, a fault model library was designed to integrate key
features that enhance the models to react appropriately under various fault campaigns.
In addition, the fault model library provides the necessary components to generate and
inject faults into an arbitrary system. The models in the library were specifically designed
so that new and pre-existing models could be “fault-enabled” with few additions or
modifications. The fault models were also designed to create a fault hierarchy such that
a single, high-level component could be affected by a fault and the mechanisms would
automatically propagate the fault to all lower-level entities. This hierarchical design
not only captures the area of influence of a particular fault type, but it also provides an
infrastructure to define interdependencies between various components. Table 4-5 lists the
fault model components accompanied by brief descriptions of their functionality.

Figure 4-6. The DM node models: (a) compute node, (b) control node, (c) MDS node

Figure 4-7. The DM flight system model
Table 4-5. Summary of fault models

Component        | Description
Fault Generator  | Controls when faults are generated. Generation times based on random distributions or user defined.
Fault Controller | Injects faults into system. Selects target component randomly or based on user-defined susceptibility matrix. Monitors the status of all fault managers and modules in the system.
Fault Manager    | Propagates injected faults from the fault controller to all lower-level faulty devices. Aids in the recovery process of managed components when necessary.
Fault Module     | Provides high-level fault mechanisms such as detection and recovery to integrate into new and pre-existing models. Inherits Fault Base mechanisms and data structures.
Fault Base       | Provides low-level fault data structures and mechanisms to integrate into new and pre-existing models. These data structures and mechanisms deal with scheduling events, managing faulty events, and managing memory (e.g., preventing memory leaks).
Figure 4-8 illustrates an example of a fault-enabled system based on the DM
architecture. The figure shows the various hardware and software components that
can be affected by faults as well as some of the fault models that manage the injection,
detection, and reaction to faults in the system. Fault Modules have been integrated into
each of the hardware and software components in the system and Fault Managers are used
to manage groups of modules. The actual groupings of faulty components are based on
either the physical proximity of the components in a device or the management system
that controls the liveliness of the component. Another factor to consider when creating
these groups is how each component reacts to specific faults. For example, the Data Node
Fault Manager in Figure 4-8 manages how faults are injected at the node level such that if
the corresponding data node becomes faulty, it will pass the fault to the lower-level fault
managers within the NIC, Middleware, and Applications blocks so that they can disable
their corresponding components (e.g., Port, JMA, HAM, or SAR).
Figure 4-8. Example fault-enabled system
Faults are injected into the system by the Fault Control Block which is composed
of the Fault Generator and Fault Controller. The Fault Generator creates time-based
fault events as dictated by either a random distribution or the user. The Fault Controller
receives events from the Fault Generator and provides the injection capabilities necessary
to stimulate the virtual system with failures. The Fault Controller targets specific
components in the system according to a susceptibility matrix that defines the probability
each listed component will experience a fault. The actual percentage values supplied
within the susceptibility matrix are user-defined and both the utilization and physical
size of each component should be considered when setting the values. Once the Fault
Controller determines its target component, it injects the fault into the system via
Fault Managers. The Fault Managers relay the fault to the target component so the
target can react according to a particular policy defined by the user. The Fault Module
is incorporated into each fault-enabled component and provides virtual detection and
recovery functions that can be redefined to allow the user to configure the device to take
the necessary actions as dictated by it fault-tolerant policies.
The models within the fault library were programmed primarily using the
object-oriented C++ language so that other systems and models can be easily retrofitted
with fault injection capabilities. Also, the models were designed with extensibility in mind
to support a wide range of detection and recovery methods for many types of faults. All
the components described in the previous section have been retrofitted with the necessary
fault models and each has been configured to react to and recover from faults as dictated
by the policies set up for the DM system. Details on these policies can be found in [48].
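The retrofit pattern described above is essentially classic C++ inheritance: Fault Base supplies low-level bookkeeping, Fault Module exposes virtual detection and recovery hooks, and each component model overrides the hooks with its own policy. The sketch below illustrates the idea with hypothetical class and method names; it is not the FASE library's actual interface.

```cpp
#include <cstdio>

class FaultBase {                      // low-level fault state and bookkeeping
protected:
    bool faulty_ = false;
};

class FaultModule : public FaultBase { // high-level hooks, redefined per model
public:
    virtual ~FaultModule() = default;
    void inject() { faulty_ = true; onDetect(); } // called via a Fault Manager
    virtual void onDetect()  = 0;
    virtual void onRecover() = 0;
};

class JmaModel : public FaultModule {  // a "fault-enabled" component model
    void onDetect() override {
        std::printf("JMA fault detected; running application killed\n");
        onRecover();
    }
    void onRecover() override {
        std::printf("HAM restarts the JMA\n");
        faulty_ = false;
    }
};

int main() {
    JmaModel jma;
    jma.inject();   // a Fault Manager relaying an injected fault
    return 0;
}
```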
4.3 Results and Analysis
This section describes the methods used to analyze and identify performance and
availability issues in the DM system architecture. The section presents experiments
used to validate the Markov and simulation models through the use of experimentally
gathered measurements from the prototype system. These measurements are used to
calibrate the models and validation results are presented where applicable. After the
models are calibrated, a scalability study of an important application kernel, the 2D
FFT, is presented. The goal of this investigation is to find bottlenecks that exist in the
proposed design and explore the effects of changing key system features (regarding both
architectural and algorithmic variations) to better map the application to the underlying
architecture. The scalability study is followed by an in-depth performability analysis of
the DM system using the SAR application in order to evaluate the trade-offs between
performance, scalability, and availability. The section concludes with an evaluation of
the proposed 20-node flight system incorporating the optimal configurations to maximize
performability.
4.3.1 Model Calibration
4.3.1.1 Component model calibration and validation
Model validation is a critical step in any modeling effort that attempts to provide
accurate results comparable to those produced by real systems. Validation of complex
systems such as the DM system is difficult to accomplish; therefore, we overcome this
challenge by decomposing the system and validating its subsystems: the network
subsystem and the MDS subsystem. The network subsystem encompasses all software and
hardware layers employed during a data transfer which correspond to the HAM, TCP,
IP, and Ethernet models. In order to validate this subsystem, a simple PingPong MPI
program that measures low-level network performance between two nodes was executed
on the prototype system described in Section 4.2.1 and the results were used to calibrate
the models to best represent the testbed’s network and middleware performance. Figure
4-9a illustrates the experimentally gathered throughput values as compared to those
produced by the simulated system. The figure shows the simulation model closely matches
the performance measured on the testbed with a mean relative error of 1.27%. A similar
mean relative error was observed when comparing the experimental and simulative latency
measurements across the studied message sizes.
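The mean relative error quoted here and in the remaining validations is simply the average of |simulated − measured| / measured over the studied message sizes. A small C++ sketch of the metric follows; the sample values are illustrative, not testbed data.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Average of |simulated - measured| / measured over all data points.
double meanRelativeError(const std::vector<double>& measured,
                         const std::vector<double>& simulated) {
    double sum = 0.0;
    for (std::size_t i = 0; i < measured.size(); ++i)
        sum += std::fabs(simulated[i] - measured[i]) / measured[i];
    return sum / measured.size();
}

int main() {
    std::vector<double> measured  = {11.2, 33.5, 48.1, 55.9};  // MB/s
    std::vector<double> simulated = {11.4, 33.1, 48.8, 56.3};
    std::printf("mean relative error = %.2f%%\n",
                100.0 * meanRelativeError(measured, simulated));
    return 0;
}
```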
Figure 4-9. Throughput validations for network and MDS subsystem models: (a) network subsystem, (b) MDS subsystem
Once the network subsystem was validated, the performance of another key
subsystem, the MDS, was calibrated according to experimental measurements. Again,
a simple benchmark was developed that transfers data of varying sizes to and from the
MDS node. The validation results are shown in Figure 4-9b, with the MDS subsystem
model producing mean relative errors of 1.58% for writes and 2.03% for reads.
From the validation process and documentation, the DM system’s main component
parameters were calibrated in order to most accurately represent the testbed system. The
values for key parameters are listed in Table 4-6, and this configuration corresponds to
the baseline system used in the following experiments.
Table 4-6. Baseline system parameters
Parameter Name Value
Processor power 1200 MIPS, 600 MFLOPS
MPI maximum throughput 57 MB/s
MPI message latency 13.6 ms
HAM buffer size 2000000 bytes
Network bandwidth Non-blocking 1000 Mb/s
Network switch latency 5 µs
MDS bandwidth (write/read) 60/40 MB/s
MDS latency (write/read) 300/500 µs
MDS open file overhead 8 ms
4.3.1.2 System performability model
In addition to component calibration, the simulation system models were validated
with regard to their near-idealistic performability as compared to those produced by the
Markov-reward models presented in Section 4.2.2. For this experiment, a serial version of
an LU decomposition kernel was scheduled on each node in the tested systems. Each LU
job processed matrices of 1000×1000 elements with each element being an 8-byte double.
The experiment varied the system size from four to thirty-two nodes by powers of two
while exploring ten MTBF_NODE values. The minimum fault rate expected for each DM
data node was estimated to be three faults per day, which corresponds to the maximum
MTBF_NODE value analyzed (28800 seconds = 8 hours) and a relatively hospitable
environment. The remaining rates were selected to analyze the system's performability in
harsher conditions, and the results from this study are shown in Figure 4-10.
Figure 4-10. Markov versus simulation DM system performability comparison: (a) performability, (b) error
Figure 4-10a shows a comparison between the performability numbers collected for
the Markov and simulation models while Figure 4-10b illustrates the relative errors of
the simulation results as compared to those produced by the Markov model. One can
see that for larger MTBF_NODE values, the two analysis techniques yield nearly identical
results. However, deviations between the approaches become apparent when analyzing
systems exposed to more faults (i.e., small MTBF_NODE values). These deviations result
from the varying levels of detail captured by each modeling approach. In this instance, the
simulation model captures extra performance penalties such as network and scheduling
delays that affect the performance of the system. In high fault conditions, these penalties
begin to accumulate due to numerous job restarts thus negatively impacting the system’s
overall performability. In addition, the deviations at the smaller MTBFNODE values
increase with the system size due to scheduling overhead as well as resource contention in
the network and MDS node not modeled in the Markov-reward models.
4.3.2 Case Study: Fast Fourier Transform
The experiments conducted for this portion of the study explore the performance and
scalability of the 2D FFT kernel executing on the DM system. A fault-tolerant, parallel
2D FFT serves as the baseline algorithm, which distributes an image evenly over N
processing nodes and performs a logical transpose of the data via a corner turn. A single
iteration of the FFT, illustrated in Figure 4-11, includes several stages of computation,
inter-processor communication (i.e., corner turn), and several MDS accesses (i.e., image
read and write and checkpoint operations).
Figure 4-11. Dataflow diagram of parallel 2D FFT
The results of the baseline simulation (see Figure 4-12) show that the performance
of the FFT slightly worsens as the number of data nodes increases. In order to pinpoint
the cause of the performance decrease of the FFT, the processor, network, and MDS
characteristics were greatly enhanced (i.e., up to 1000-fold). The results in Figure 4-12
show that enhancing the processor and network has little effect on the performance of the
FFT, while MDS improvements greatly decrease execution time and enhance scalability.
The FFT application's performance is so directly tied to MDS performance because of
the high number of accesses to the MDS, the large MDS access latencies, and the
serialization of accesses to the MDS.
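A first-order cost model makes the dominance of the MDS plain: every access pays a fixed open-file overhead plus latency before any bytes move, and accesses from different nodes serialize at the single MDS node. The sketch below uses the Table 4-6 read parameters; the access count and transfer size are illustrative assumptions.

```cpp
#include <cstdio>

int main() {
    // MDS read parameters from Table 4-6.
    const double openOverhead = 8e-3;    // 8 ms open-file overhead per access
    const double readLatency  = 500e-6;  // 500 us read latency
    const double readBW       = 40e6;    // 40 MB/s read bandwidth

    // Illustrative workload: 16 nodes, 5 accesses each, 1 MB image split N ways.
    const int    nodes           = 16;
    const int    accessesPerNode = 5;
    const double bytesPerAccess  = 1e6 / nodes;

    const double perAccess = openOverhead + readLatency + bytesPerAccess / readBW;
    const double total     = nodes * accessesPerNode * perAccess;  // serialized
    std::printf("fixed cost per access: %.2f ms, transfer: %.2f ms\n",
                (openOverhead + readLatency) * 1e3,
                bytesPerAccess / readBW * 1e3);
    std::printf("serialized MDS time per image: %.1f ms\n", total * 1e3);
    return 0;
}
```

Under these assumptions, the fixed per-access cost (about 8.5 ms) dwarfs the roughly 1.6 ms transfer time, which is why reducing the number of accesses pays off far more than faster processors or networks.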
After the MDS was verified as the bottleneck for the 2D FFT, several options were
explored in order to mitigate the negative effects of the central memory. The options
included algorithmic variations, enhancing the performance of the MDS, and combinations
of these techniques. Table 4-7 lists the different variations.

Figure 4-12. Execution time per image for baseline and enhanced systems

Each technique offers performance enhancements over the baseline algorithm (i.e.,
P-FFT). Figure 4-13 shows that the parallel FFT with distributed checkpointing and
distributed data provides the best speedup (up to 740×) over the baseline because it
eliminates all MDS accesses. Individually, the distributed checkpointing and distributed
data techniques result in only a minimal performance increase since the time taken to
access the MDS still dominates the total execution time. MDS performance enhancements
reduce the execution time of the parallel FFT by a factor of 5. Switching the FFT
algorithm (see Figure 4-14) to the distributed version achieves a 2.5× speedup over the
baseline, which can then be further increased to 14× and 100× by employing MDS
improvements and distributed data, respectively. It is noteworthy that the distributed
FFT is well suited for larger system sizes since the number of MDS accesses remains
constant as system size increases.
Results for the parallel 2D FFT (Figure 4-13) magnify the effects of the MDS on the
system’s performance. Though the parallel FFT’s general trend shows worse performance
as system size scales, the top four lines show numerous anomalies where the performance
of the FFT actually improves as the number of nodes in the system increases. These
anomalies arise from the total number of MDS accesses needed to compute a single image
for the entire system. For example, a dip in execution time occurs in the baseline parallel
FFT algorithm when moving from 18 to 19 nodes. The total number of MDS accesses of
the parallel FFT using 18 nodes is 90 while the number of accesses decreases to 76 for
the 19-node case. Since the MDS is the system's bottleneck, the execution time of the
algorithm benefits from the reduction of MDS accesses. Only in the parallel FFT with
distributed data and distributed checkpointing option do the "zig-zags" disappear, since
no data transfers occur between the nodes and the MDS. The distributed FFT (see
Figure 4-14) also does not show any performance anomalies due to the nature of the
algorithm. That is, the number of MDS accesses remains constant per image since only
one node is responsible for computing that image.

Table 4-7. The FFT algorithmic variations and system enhancements

Algorithm / Technique | Description | Label
Parallel FFT | Baseline parallel 2D FFT. | P-FFT (Baseline)
Parallel FFT with distributed checkpointing | Parallel 2D FFT with "nearest neighbor" checkpointing: data node i saves checkpoint data to data node (i + 1) mod N, where i is a unique integer (0 ≤ i ≤ N − 1) and N is the number of tasks in a specific job. | P-FFT-DCP
Parallel FFT with distributed data | Parallel 2D FFT with each node collecting a portion of an image for processing, thus eliminating the data retrieval and data save stages. | P-FFT-DD
Parallel FFT with distributed checkpointing and distributed data | Combination of both distribution techniques described above. | P-FFT-DCP-DD
Parallel FFT with MDS enhancements | Parallel 2D FFT using a performance-enhanced MDS. The MDS bandwidth is improved 100-fold and the access latency is reduced by a factor of 50. | P-FFT-MDSe
Distributed FFT | A variation of the 2D FFT that has each node process an entire image rather than a part of the image. | D-FFT
Distributed FFT with distributed data | Distributed 2D FFT algorithm with each node collecting an entire image to process. | D-FFT-DD
Distributed FFT with MDS enhancements | Distributed 2D FFT algorithm using a performance-enhanced MDS. | D-FFT-MDSe
The results in Figures 4-13 and 4-14 correspond to 1 MB images, so we conducted
additional simulations to analyze the effects of larger image sizes. Our results showed
that, for larger images, the algorithms and enhancements reversed the trend for the
parallel FFT. That is, the execution times improved as the system size grew, though the
improvements were very minimal. Also, the sporadic performance jumps were amortized
because the total number of MDS accesses was large compared to the variance in the
number of accesses. The distributed FFT with distributed data was the only option that
showed a large improvement because more processing could occur when data was more
readily available to the processors. The results demonstrate that a realistic application
can be effectively executed by the DM system if the mass memory subsystem is improved
to allow for parallel memory accesses and distributed checkpoints.
Figure 4-13. Parallel 2D FFT execution times per image for various performance-enhancing techniques

Figure 4-14. Distributed 2D FFT execution times per image for various performance-enhancing techniques
4.3.3 Case Study: Synthetic Aperture Radar
For the next case study, we evaluate the performance and availability of a more
complex application, SAR. Synthetic Aperture Radar (SAR) is a high-resolution,
broad-area imaging process used for reconnaissance, surveillance, targeting, navigation,
and other operations requiring highly detailed, terrain-structural information [49]. It uses
a two-dimensional, space-variant convolution that can be decomposed into two domains
of processing – range and azimuth. In order to correctly transition between the range and
azimuth domains, the data must be reordered via a transpose operation [50]. This case
study analyzes a fault-tolerant version of SAR that incorporates an optional checkpointing
stage (striped block in Figure 4-15a) to save and recover rollback points in the event of a
failed job. Figure 4-15a illustrates the data flow of the fault-tolerant SAR application.
There are various implementations of the SAR application each differing based on
the data decomposition across participating processing nodes and, thus, the amount of
communication and computation conducted by each node [51]. For this study, we consider
the patch-based approach which splits each SAR image into “patches” along the azimuth
dimension and distributes each patch to an available compute node. This patched version
does not require communication between participating nodes although each node must
fetch and process additional data to ensure correct results. Figure 4-15b illustrates the
data decomposition of the patch-based implementation.

Figure 4-15. SAR dataflow with optional checkpoint stages and patched data decomposition: (a) dataflow diagram, (b) data decomposition
The baseline SAR application used throughout this study processes
28000×5616-element images, which is the approximate size of the images collected by
the European Remote-Sensing (ERS) satellites [52]. Each element is stored within the
mass data store as a complex pair of 8-bit integers (2-bytes total), the typical format used
for raw SAR data. When the data is imported by the SAR application, each element is
expanded to a complex pair of 32-bit floating-point numbers (8-bytes per pair) in order to
improve precision and reduce the potential for round-off errors, and the range dimension
is padded to 8192 elements to increase the efficiency of the FFT calculations. When SAR
is complete, the padded elements in the range dimension are removed from the processed
image and the remaining elements are converted to complex pairs of 16-bit short integers
(4-bytes per pair) thus reducing the amount of storage needed to store the data back to
the MDS. The patch data size is P×5616 elements, where P is the patch size, and the
overhead data size is 1296×5616 elements per patch. For each simulation run, the DM
system is observed over ten, 100-minute orbits and the radiation-hardened control and
MDS nodes are assumed to experience no failures.
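The element sizes and padding above imply substantial per-patch data volumes. The back-of-the-envelope C++ calculation below derives them for an 8192-element patch; the figures are illustrative arithmetic from the stated formats, not simulator outputs, and the stored size assumes the overlap region is discarded along with the padding.

```cpp
#include <cstdio>

int main() {
    const long long P = 8192;        // patch size (azimuth elements)
    const long long overlap = 1296;  // per-patch overhead (azimuth elements)
    const long long range = 5616;    // range elements per image
    const long long padded = 8192;   // range dimension padded for the FFTs

    const long long fetched = (P + overlap) * range * 2;   // 8-bit complex pairs
    const long long working = (P + overlap) * padded * 8;  // 32-bit float pairs
    const long long stored  = P * range * 4;               // 16-bit short pairs

    std::printf("fetched from MDS: %.1f MB\n", fetched / 1e6);  // ~106.6 MB
    std::printf("working set     : %.1f MB\n", working / 1e6);  // ~621.8 MB
    std::printf("stored to MDS   : %.1f MB\n", stored / 1e6);   // ~184.0 MB
    return 0;
}
```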
For the fault-injection experiments, the fault control block (described in Section 4.2.4)
creates and inserts faults into the system using an exponential distribution with a mean,
MTBF_SYSTEM, defined as:

\mathrm{MTBF}_{SYSTEM} = \frac{\mathrm{MTBF}_{NODE}}{N} \qquad (4–5)
where MTBF_NODE is the mean time between faults per node and N is the number of
nodes in the system. The MTBF_NODE rates considered in the fault experiments are
identical to those investigated in Section 4.3.1.2 and represent radiation conditions ranging
from minimal to extreme. Faults are injected into a particular node based on a uniform
distribution and the specific component to target on the selected node is dictated by the
following percentages: SAR Application = 70%, HAM/System = 25%, and JMA = 5%.
These percentage values were estimated based on the anticipated behavior of the SAR
application executing on a DM data node. Also, faults can be injected into recovering
components thus restarting the recovery process.
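The fault-generation procedure just described is straightforward to sketch: interarrival times are exponential with mean MTBF_SYSTEM from Equation 4–5, the victim node is chosen uniformly, and the victim component follows the 70/25/5 split. The standalone C++ sketch below illustrates the logic (it is not the FASE fault control block itself).

```cpp
#include <cstdio>
#include <random>

int main() {
    const double mtbfNode = 3600.0;  // one of the studied MTBF_NODE values [s]
    const int    N        = 20;      // number of data nodes
    std::mt19937 rng(7);

    // Exponential interarrival with mean MTBF_NODE / N (Equation 4-5);
    // the distribution takes the rate, i.e., the reciprocal of the mean.
    std::exponential_distribution<double> gap(N / mtbfNode);
    std::uniform_int_distribution<int>    node(0, N - 1);
    std::discrete_distribution<int> comp({0.70, 0.25, 0.05});  // App/HAM/JMA

    const char* names[] = {"SAR application", "HAM/System", "JMA"};
    double t = 0.0;
    for (int i = 0; i < 5; ++i) {  // first five injected faults
        t += gap(rng);
        std::printf("t = %8.1f s: fault on node %2d, component: %s\n",
                    t, node(rng), names[comp(rng)]);
    }
    return 0;
}
```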
The following sections describe the techniques and capabilities within the two
modeling domains of the three-stage analysis as they are applied to evaluate the
performance and availability of SAR executing on the DM system. Section 4.3.3.1
presents the amenability study of SAR using the Markov-reward model to determine
if the application maps well to the DM architecture. After the study, we enter the
discrete-event simulation domain in order to explore the various application and
architectural options available to improve performance and availability. Section 4.3.3.2
reports the performability and system throughput of the patch-based SAR application
while considering various patch sizes. It then evaluates the different checkpoint storage
options with the intent of improving the fault tolerance of the application. Finally, we
investigate numerous architectural enhancements to support efficient computing on a
twenty-node system and conclude with a final analysis of the proposed flight system.
4.3.3.1 Amenability study
This section begins with a preliminary analysis of the SAR application using
the Markov-reward model to determine whether the workload characteristics of this
particular application are appropriate for the DM system. The study presents best-case
performability numbers of the patch-based SAR application employing 2048-, 4096-, and
8192-element patches executing on systems using 4, 8, 16, and 32 data nodes. The fault
injection rates explored are identical to those used in Section 4.3.1.2.
Figure 4-16. Amenability results via Markov model for patch-based SAR application
From Figure 4-16, we can see that the patched SAR shows promising performability
numbers for each system size and patch size while the system experiences relatively benign
conditions. However, as the fault rate increases, we observe decreasing performability in
all cases. Furthermore, larger patch sizes have negative effects on the performability of
the system due to the increased amount of time required to complete the processing of
each SAR job. In fact, the Markov model reported a difference of 32.9% in performability
between the 2048-element patch and the 8192-element patch at the highest fault rate.
From these results, we observe that the patch-based SAR application produces good
performability numbers for the fault rates targeted for the DM system thus making it a
good candidate for further investigation using the discrete-event simulation approach.
4.3.3.2 In-depth application analysis
For the next step in the case study, we transition into the final stage of our proposed
methodology – the discrete-event simulation domain. Within this stage, we have the
capability to analyze many interesting application and system options while exposing
each configuration to various fault conditions. This section focuses on the various options
available for the patch-based SAR application. Similar to the amenability study, we
explore the impact of patch size on the system’s performability while considering patches
with 2048, 4096, and 8192 elements in the azimuth dimension. The objective of this study
is to determine which patch size achieves the best performance.
Figure 4-17 illustrates the results collected from the simulations. The left column
of charts shows the performability percentages of each patch size, and the right column
displays the corresponding throughputs while varying system size and MTBF_NODE.
From the figures in the left column, we see that the 2048-element patch size has the
highest performability in all system sizes and fault rates due to the application’s short
execution time. As the patch size grows, the execution time for each SAR job lengthens
thus increasing the probability of a fault interrupting the completion of a job. These
results concur with those reported by the Markov-reward model in the previous section.
However, the simulations produced lower performability percentages than those observed
using the Markov model. In fact, a maximum reduction of 43.0% in the performability of
larger systems experiencing high fault rates was reported by the discrete-event simulation
models. Another key observation is that the performability of each system decreases as
the system size enlarges thus indicating a potential bottleneck within the system. Our
Markov-reward model fails to show this trend due to its inability to capture architectural
dependencies.

Figure 4-17. System performability percentages and throughputs for patch-based SAR: (a, b) 2048-element, (c, d) 4096-element, and (e, f) 8192-element patches (performability left, throughput right)
Despite observing decreased performability when using larger patches, the system
throughputs for these jobs can be much higher than the 2048-element case. In fact, the
figures in the right column show improved throughputs up to 1.86× and 2.29× for the
4096- and 8192-element cases, respectively, in low-fault environments. However, the
2048-element case does outperform the 8192-element patch size when the MTBF_NODE
rate is less than 200 seconds. In these configurations, the mean times between faults are
shorter than the average execution times of the applications using larger patch sizes, thus
increasing the probability of a fault causing a SAR job to fail. Finally, for each patch size
and fault rate,
the throughput reaches its peak value in systems with eight data nodes. Although the
exact reason for this peak in performance is not fully clear from this particular study, we
hypothesize that the centralized MDS is the main cause. The study conducted in the next
section verifies this hypothesis by observing the effects of scaling the performance of key
system components.
For this study, we are most concerned with light to moderate fault rates (i.e.,
MTBF_NODE values between 28800 and 3600 seconds) where the performability
percentages for all patch and system sizes are very similar. Therefore, we focus on the
8192-element patch configuration for further analysis of SAR and the DM system in order
to maximize the system’s throughput.
Now that SAR has been configured properly with regard to performance and
availability, we can evaluate various checkpointing options with the goal of improving
application fault tolerance. Table 4-8 provides a list of the checkpoint options along
with brief descriptions of each. For this evaluation, we observe the performability and
throughput of four system sizes – 4, 8, 16, and 32 data nodes – while incorporating the
optional checkpoint stage for the SAR application (see Figure 4-15a). The objective of this
study is to observe the impact on performance and availability of checkpointing within the
SAR application. Furthermore, we wish to measure and compare the overheads associated
with storing the checkpoint data to a reliable, centralized location (MDS node) versus
checkpointing to unreliable, distributed data nodes.
Table 4-8. Checkpoint options explored using patch-based SAR application

Checkpoint option       | Description                                                  | Label
No checkpointing        | No checkpointing conducted.                                  | NoCP
MDS checkpointing       | Checkpoint data is stored on the MDS node.                   | MDSCP
Data node checkpointing | Checkpoint data is stored on the nearest-neighbor data node. | DataCP
Figure 4-18 shows the performability and throughput observed for each checkpoint
option while varying system size and MTBF_NODE. The performability is reported
by the solid lines in the charts while the throughput is represented by dotted lines.
When comparing the results from all four figures, we see that as system size increases,
performability decreases for all MTBF_NODE rates while throughput increases. The
MDSCP checkpoint option reports both the lowest performability and throughput for
all cases. Again, the likely cause of this poor performance is the increased pressure
placed on the centralized MDS node. In the smaller system sizes, the DataCP option
reports an approximate 11% drop in performability. Also, using the DataCP option
lowers throughputs by 33.5% and 27.5% for systems consisting of two and four data
nodes, respectively. However, this option does show slight benefits in the 16- and 32-node
systems. For both system sizes, nearly equivalent performability percentages were reported
and a maximum speedup of 1.08× in throughput was measured when compared to
the NoCP case. Another important observation from this study is that performability
does not always translate into raw performance (i.e., throughput). Figure 4-18b shows
nearly equivalent performability percentages between the three checkpoint options at
low fault rates. Conversely, the throughput of the NoCP option is 2.5× and 1.3× greater
than the MDSCP and DataCP options, respectively. This observation suggests that
while performability is a useful metric to evaluate the overall utilization of a degradable
system in a faulty environment, it does not represent the true performance of a specific
application since it does not differentiate between meaningful computation and the
processing conducted due to extra mechanisms such as checkpointing.
Figure 4-18. System performability and throughput for 8192-element patch-based SAR executing on various system sizes: (a) 4, (b) 8, (c) 16, and (d) 32 data nodes
The results attained in this study suggest that the patch-based SAR application
is not well suited for checkpointing. Although the DataCP option achieved improved
throughput over the NoCP option in large system sizes, the improvement was minimal
and the additional network transactions and demands on the data nodes could easily
negate any gains if multiple jobs were executed on each node. In the final study of the
flight system, we focus on the 8192-element patched SAR application with checkpointing
disabled.
4.3.3.3 Flight system
In this section, we explore the performability of the patch-based SAR application with
8192-element patches executing on the DM flight system composed of twenty data nodes
[53]. From the previous section, we have shown that the performance of the DM system
beyond eight nodes suffers due to an unidentified bottleneck within the system. In this
study, we enhance and modify the system architecture to identify and remedy this
bottleneck so that twenty data nodes processing SAR can be supported efficiently. Table 4-9
lists the various architectural enhancements investigated. The objective of this study is to
modify the current DM system design to support twenty data nodes and to improve the
performability and throughput of the system with realistic upgrades and augmentations.
Table 4-9. Architectural enhancements explored for flight system

Enhancement        | Description                                                                                                                  | Label
Processor          | Increases processing power of floating-point unit by 2× and increases throughput and decreases latency of middleware by 2×. | Proc
MDS storage device | Increases bandwidth and decreases latency of MDS storage device by 2×.                                                      | MDSe
MDS nodes          | Incorporates N MDS nodes within system.                                                                                     | N×MDS
Before the enhancements were studied, we measured the performability and
throughput of a 20-node system using default settings that served as the baseline
configuration. To identify the system bottleneck, we target two main components of
the system – the data node processor and the MDS storage device. By accelerating
the data processor by 2×, we assume that the floating-point unit and middleware layer
attain equivalent boosts in performance. Therefore, this enhancement improves the
performance of both floating-point computations and network transfers. Figure 4-19a
shows that upgrading the processor provides no performance gains for large MTBF_NODE
values; however, a 1.58× speedup is attained for high fault rates due to reduced execution
times as compared to the baseline. When we improve the performance of the MDS
storage device, we observe speedups ranging from 1.7× to 2.1× over the baseline. These
speedups suggest that the MDS is the main bottleneck of the system for the SAR
application. To further substantiate this claim, we augment the current system design
with additional MDS nodes in order to reduce contention. Figure 4-19a illustrates that the
DM system employing one extra MDS node doubles the performance observed from the
baseline system for low fault rates and more than triples it in extreme faulty conditions.
Employing three MDS nodes in the 20-node DM system further improves the performance
of the system by 2.5× in light and moderate faulty conditions and 4.4× in high fault
rates. An interesting observation from Figure 4-19 is that the speedup reported for each
enhancement increases with the fault rate. This increase is caused by the reduction in
execution time of the SAR application thus allowing the job to complete more frequently
without experiencing a failure.
Figure 4-19. Speedups of architectural enhancements for patch-based SAR: (a) component enhancements, (b) combination enhancements
Next, we investigate the impact of combining the individual component enhancements
in order to maximize system performance. From Figure 4-19b one can see that significant
speedups are achieved from combining an upgraded processor and MDS storage device
with a system using multiple MDS nodes. The system using two MDS nodes showed
speedups ranging from 4× in light fault conditions to 12.8× in high fault conditions, while
the system incorporating three MDS nodes was found to have a maximum speedup of
17.5× in high fault conditions and 5.1× in the environments in which data nodes
experience three faults per day.
From the results in the architectural study, the MDS was deemed the main bottleneck
of the system. In order to remedy this contention point for the 20-node flight system,
we propose using three MDS nodes with enhanced storage devices. We also include
upgraded data-node processors in order to accelerate data computations and middleware
processing. Using this upgraded design, we conduct the final performability study of the
SAR application. Figure 4-20 provides the results.
Figure 4-20. System performability and throughput of 20-node DM flight system executing patch-based SAR
The results in Figure 4-20 show that the proposed flight system performs well in
most radiation conditions. In fact, the performability of the system is predicted to exceed
99.5% in light to moderate fault conditions (i.e., MTBF_NODE > 7200 seconds). The
minimum performability of the system executing the patch-based SAR application is
54.0%. The throughputs achieved by the flight system were greatly improved even at the
highest fault rates. The minimum throughput was measured to be 316 images per orbit
and the maximum is 587 images per orbit. Assuming 100% of the system’s data nodes
are dedicated to processing SAR images, the throughput of the proposed flight system
dictates that it should be able to support a sustainable input rate of 29.35 MB/sec of raw
SAR data from the sensors. However, the expected input rate currently used by the ERS
satellites was calculated to be approximately 11.90 MB/sec. This difference translates
into a situation where only 9 data nodes are required to compute SAR jobs while the
remaining nodes are free to perform other compute jobs, conduct test diagnostics, or
simply remain idle to conserve power. This final observation tells us that the proposed
DM flight system architecture executing the patch-based SAR application is more than
suitable to handle the large demands for computation seen in the ERS satellites.
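The sustainable input rate quoted above can be reproduced from the raw image size and the orbit period (assuming the rates are quoted in binary megabytes): a raw 28000×5616-element image at 2 bytes per element occupies about 314.5 × 10^6 bytes, so 587 images per 100-minute (6000 s) orbit give

\frac{587 \times 28000 \times 5616 \times 2\ \text{bytes}}{6000\ \text{s} \times 2^{20}\ \text{bytes/MB}} \approx 29.35\ \text{MB/sec}

in agreement with the figure derived from the simulation results.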
4.4 Conclusions
This phase of research presented an approach that combines analysis techniques
using small-scale prototype systems, Markov-reward models, and discrete-event simulation
models in order to quickly and accurately evaluate the performance and availability
(i.e., performability) of aerospace systems and applications. The combination of these
techniques allowed us to calibrate component models using experimental measurements,
quickly pinpoint workloads and fault rates supported by the management software via
Markov-reward models, and thoroughly investigate specific applications executing on
virtually prototyped systems through discrete-event simulation. Details of each analysis
technique and the extensions and enhancements to the FASE framework were outlined.
Next, models were calibrated to reflect the performance of the physical testbed system
in order to ensure accurate results. Finally, two case studies reported performance,
scalability, and availability predictions of the 2D FFT kernel and SAR application
executing on the NASA Dependable Multiprocessor to showcase the capabilities of the
presented approach.
For model calibration, we used a small-scale prototype system consisting of four
data nodes, one MDS node, and one control node to collect performance measurements.
These measured values were then used to calibrate the Markov and simulation models
and the MDS and network subsystem models were validated using simple benchmarks to
confirm the accuracy of the simulation models. A performability validation experiment
was then conducted using an LU decomposition kernel to compare results between the
Markov-reward models and simulation models while each was subjected to various fault
rates. The results showed differences between the modeling approaches of less than
1% for low fault rates though larger errors were observed for higher fault rates due to
shortcomings (e.g., modeling architectural dependencies) of the Markov model.
After validation, the DM system models were used to evaluate the performance and
scalability of the system executing the 2D FFT kernel. This study exposed the centralized
MDS as a potential performance bottleneck in jobs that frequently access the MDS.
Various techniques were explored to mitigate the MDS bottleneck including distributed
checkpointing, distributing interconnections between sensors and data processors (i.e.,
distributed data), algorithm variations, and improving the performance of the MDS.
The study showed that eliminating extraneous MDS accesses was the best option though
enhancing the MDS memory was also a good option for increasing performance. Regarding
scalability, changing the algorithm from a parallel to a distributed approach and including
distributed checkpointing provides the best performance improvement of all the options
analyzed. For large image sizes (i.e., 64MB), the distributed FFT with distributed data
was the only option that showed a large improvement because more processing could occur
when data was more readily available for the processors.
The second case study used the patch-based SAR application to study its
performance and availability while running on the DM system. The Markov model
was initially employed to quickly determine that the performability of the application
using various patch sizes was suitable for further evaluation on the DM system. Once this
preliminary analysis was successful, we continued our three-stage analysis of SAR using
the discrete-event simulation models. Simulation results were similar to those produced by
the Markov-reward model but with one key difference. The performability values predicted
by the simulation model were much lower than those produced by the Markov model
when considering high fault rates and large systems. For example, the simulation-based
performability of a 32-node system experiencing an MTBF_NODE rate of 60 seconds was
found to be 6.1% compared to the 49.1% reported by the Markov model. Again, the
Markov model did not capture the architectural dependencies of the SAR application
on the MDS node and thus the results overlooked its impact on the performability of
the larger systems. The simulative model was also used to gauge the overall system
throughput while executing SAR. The application produced its highest throughput (116
images per orbit) when executing on an 8-node system; however, in some cases, systems
beyond this size reported reductions in throughput due to contention at the MDS.
Finally, checkpointing was employed in an attempt to improve the fault tolerance of
the SAR application. Two storage options were explored such that checkpoint data was
either stored on the MDS node or neighboring data node. The results showed that for all
system sizes and fault rates, checkpointing to the MDS node reduced performability and
system throughput. Checkpointing to a neighboring data node was found to have slight
benefits in larger system sizes; however, the gains were much too small and checkpointing
was deemed unnecessary for the patch-based SAR application executing on the DM
system. During the study, certain system configurations resulted in each checkpoint
option producing similar performability values yet drastically different throughputs. This
observation suggested that while performability is a useful metric to gauge the robustness
of a degradable system under varying fault environments, it does not accurately represent
the efficiency for which an application executes on a system. Thus, the results from this
study reinforce the need to couple Markov and discrete-event simulation modeling for
comprehensive analyses of aerospace systems and applications.
After the SAR application was configured for optimal performability and throughput,
we explored architectural enhancements in order to identify and alleviate the bottlenecks
within the system. The results found that the MDS was the main throttling point of
the system due to the cumulative effects of each data node accessing and transmitting
data during various phases of SAR. To alleviate this contention point, we enhanced the
capabilities of the MDS storage device which allowed the system to nearly double its
throughput. We also observed similar speedups in system throughput by incorporating
additional MDS nodes into the system. The DM system using two MDS nodes received
a 2.0× boost in throughput while a design using three MDS nodes achieved a 2.5×
improvement under moderate fault conditions. As the fault rate increased, greater
speedups were observed for each enhancement over the baseline case. Shortened execution
times reduced the effects of the higher fault rates thus increasing the overall efficiency and
throughput of the system. The case study concluded with a performability evaluation of
the final flight system that incorporated twenty enhanced data nodes and three enhanced
MDS nodes. The proposed flight system was exposed to various fault rates and its
maximum throughput was observed to be approximately 587 images per orbit in relatively
light fault conditions and 316 images per orbit in the worst conditions studied. The
DM’s performability was observed to be over 99.5% when considering light to moderate
radiation conditions (i.e., less than one fault every two hours per data node).
The work conducted during this phase of research produced a novel, 3-stage analysis
process for predicting the performance and availability of high-performance, embedded
systems and applications. The FASE framework was extended by incorporating a fault
model library. This additional library allows FASE users to inject faults into arbitrary
systems in order to conduct in-depth availability and performability analyses. The case
studies demonstrated the capabilities of the enhanced framework as applied to the DM
system. The contributions and accomplishments of this work have been compiled into
two manuscripts. The first manuscript presents the simulation work involved with the
2D FFT study and was published in [48]. The second paper introduces the three-stage
analysis approach for predicting the performance and availability of radiation-susceptible
systems and applications. This paper was submitted to ACM Transactions on Embedded
Computing Systems [54].
CHAPTER 5
HYBRID SIMULATIONS TO IMPROVE THE ANALYSIS TIME OF DATA-INTENSIVE APPLICATIONS (PHASE 3)
The FASE framework presented in Chapter 3 laid the foundation for fast and
accurate performance predictions of arbitrary applications executing on a wide range
of systems. The information presented encompasses strategies to address the general
issues of characterizing applications, designing and developing components and systems,
and analyzing the performance of the virtual systems for the applications under
study. However, in practice, the techniques and tools developed succumb to lengthy
simulation times when studying data-intensive applications. These long analysis times are
exacerbated when considering large-scale, parallel systems, to the point at which simulation
becomes prohibitive.
This section presents the work conducted for the third and final phase of the
dissertation. This research focuses on reducing the time needed to simulate data-intensive
systems and applications via a novel, hybrid simulation approach. The section starts by
discussing the motivations followed by the presentation of background information and
related research. A detailed description of the hybrid modeling approach illustrates the
techniques employed to enhance and extend the FASE framework to speed up simulations
evaluating data-intensive applications. Experiments and results follow, validating the
success of the proposed technique. This section also presents a case study that
illustrates the accuracy and speed of the hybrid approach as it is applied to the DM
system described in Chapter 4. Finally, conclusions are drawn and key insight is offered.
5.1 Introduction
As scientists discover new methods for exploration and discovery of both earth- and
space-bound phenomena, the amount of data collected by the accompanying equipment
can become staggering. With this increase in data, new applications are needed to
analyze and interpret the information in order to identify the areas of interest. As a
result, key areas of science such as the geosciences, remote sensing, and systems biology
are employing applications that process typical datasets in the multi-gigabyte range
and beyond. For example, NASA satellite systems performing remote sensing tasks
generate 50 gigabytes of images per hour while multiple terabytes of data on the human
genetic code were collected for the Human Genome project [55]. In order to efficiently
process these large quantities of data, new systems must be designed that push the
limits of processing and I/O technologies. However, designing the most effective system
for a specific application set is a nontrivial task due to the vast number of technologies
and techniques available to designers. Simulation is often used to facilitate the design
process with discrete-event simulation being a particularly effective method to verify and
analyze specific characteristics of an architecture or protocol before deployment. This
capability allows developers to save vast amounts of time and money by circumventing the
development and evaluation of premature prototypes. It also expands the design space to
include architectures that use both new and emerging technologies thus allowing system
designers to select the best configuration for a particular set of applications.
Although discrete-event simulation provides many benefits, one major shortcoming
is the time required to analyze complex systems executing data-intensive applications.
Generally speaking, the time required to analyze a system using discrete-event simulation
is highly dependent on the number of discrete events generated by the model. In
high-fidelity simulation models, these large datasets are typically split into a considerable
number of fragments with each fragment processed individually in order to mimic the
behavior of the data transaction. This fragmentation generates a proportional number of
discrete events that must be scheduled and processed throughout the simulated system,
thus dramatically lengthening simulation time. Simulation times are exacerbated by the
fact that the majority of the candidate systems employ numerous distributed components
working in parallel to complete a given task. The combination of large, complex systems
processing sizeable datasets can cause simulations to run for days, if not longer. Such
lengthy analyses are often prohibitive due to unacceptable increases in design and
development times. Thus, methods for improving the efficiency and speed of discrete-event
simulations, while sacrificing as little accuracy as possible, are essential if simulation is to
remain an effective tool for prototyping and evaluating architectures that execute
data-intensive applications.
In this chapter, we present a novel approach to hybrid simulation modeling that aims
to reduce simulation time for data-intensive applications while retaining a high degree of
accuracy. Our approach combines the accuracy of function-level models with the speed
of analytical models and “micro-simulations” to achieve fast and accurate results. The
modeling procedure uses a technique called function-level training to collect performance
measurements from the simulated system. These measurements essentially take a snapshot
of the current state of the system and are used to calibrate the analytical models. The
calibrated analytical model is then employed to calculate the time required to complete
the current transaction assuming that the state of the model remains relatively unchanged
throughout its execution. Micro-Simulations also use measurements gathered during the
training period, though they are employed to account for device and contention delays at
components actively participating in the data transaction. The method operates with each
hybrid transaction beginning in the function-level training procedure in which functional
models are employed to collect statistics that characterize the current status of the system.
When the training period is complete, the analytical model calibrates itself using the
collected statistics, calculates the time required to complete the current transaction, and
schedules the transmission of a final data structure using the function-level model. Finally,
this last data structure traverses the system invoking micro-simulations at each component
it encounters until it reaches the destination device model. With this approach, the
redundant processing that typically occurs during large data transfers is replaced by a
single calculation (i.e., analytical model) and a fixed number of micro-simulations that
collectively approximate the ultimate outcome of these repetitious computations. The
resulting hybrid methodology supports the timely evaluation and analysis of a class of
applications that has traditionally stressed discrete-event simulators.
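To make this flow concrete, the sketch below traces a single hybrid transaction through the three mechanisms just described. It is illustrative Python, not part of the FASE implementation; the class and method names are hypothetical, and the delay calculation anticipates the formula introduced later as Equation 5-1.

    import math

    FRAG_MAX = 1024  # assumed maximum fragment size in bytes

    class HybridSource:
        """Hypothetical source model illustrating the hybrid transaction flow."""

        def __init__(self, output_rate):
            self.output_rate = output_rate   # fragments/second, as measured
            self.clock = 0.0                 # local simulation time

        def send_fragment(self, size):
            # Stand-in for a discrete-event fragment transmission.
            self.clock += 1.0 / self.output_rate

        def run_transaction(self, total_bytes, F):
            sent = 0
            for _ in range(F):               # 1) function-level training
                self.send_fragment(FRAG_MAX)
                sent += FRAG_MAX
            remaining = total_bytes - sent   # the head fragment would be sent
                                             # here, starting the downstream
                                             # micro-simulations
            # 2) analytical model (Equation 5-1, Section 5.3.2)
            delay = (math.ceil(remaining / FRAG_MAX) - 1) / self.output_rate
            self.clock += delay              # no discrete events while delayed
            return self.clock                # 3) tail fragment sent at this time

    print(HybridSource(output_rate=0.5).run_transaction(10 * 1024, F=4))  # 18.0 s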
5.2 Background and Related Research
Within the past decade, technology has advanced in leaps and bounds to support
both the acquisition and storage of large quantities of data relating to a wide range
of fields including finance, commerce, scientific computing, and national security. Due
to the overwhelming quantities of data collected, the importance of processing and
understanding the information has led to a new field of study within computer science
called knowledge discovery in databases (KDD). KDD, as defined in [56], is “. . . the
non-trivial process of identifying valid, novel, potentially useful, and ultimately
understandable patterns in data.” With the emergence of KDD, many new research
initiatives have commenced to develop and optimize algorithms and applications that
analyze data residing in large data repositories while other projects aim to design powerful
computational systems to effectively sift through the large datasets to discover useful
information. The Data-Intensive Computing Initiative (DICI) at Pacific Northwest
National Laboratory is one such research project that is investigating scalable solutions
in both software and hardware to support timely, effective analyses of large quantities of
data [57]. However, in order to analyze the capabilities of new hardware designs, the cost
of building prototype solutions can be prohibitive. As a result, tools such as discrete-event
simulators are imperative for performing extensive, yet cost-effective evaluations of a number
of design options.
Traditional, high-fidelity modeling approaches dictate that the behavior of a modeled
entity should be mimicked identically to ensure correct and accurate results. As a result,
much simulation time is used to fragment each transaction into smaller chunks that
are transferred and processed by the corresponding functional models. This modeling
approach is very accurate, but its scalability suffers due to large simulation times as the
dataset grows in size. Numerous solutions have been investigated in order to remedy this
scalability problem through both general and specific methods that attempt to speed
up simulations. One methodology, called staged simulation, targets the reduction of
discrete events in wireless network simulations. Staged simulation uses various techniques,
such as function caching and incremental computations, to decrease the amount of
redundant processing conducted thus supporting faster simulations of larger systems [58].
According to case studies, the method achieves a 30× improvement in simulation speed for
a 1,500-node system; however, many of the proposed techniques are uniquely designed for
wireless network simulations.
A general approach that has been investigated to improve simulation time is parallel
discrete-event simulation (PDES). PDES frameworks attempt to reduce simulation
time by distributing the workload among multiple processors [59]. PDES frameworks
typically fall into one of two categories: conservative or optimistic. In conservative PDES
environments, each participating node processes an event only when all pending events
anywhere in the simulation that may affect the considered event have been completed
[60]. By using a conservative approach, processors may remain idle for significant periods
of time waiting on other processors to signal that they can proceed with their next
scheduled event. CPSim [61] and Parsec [62] are two tools that support conservative
PDES. Conversely, optimistic PDES avoids idle cycles by allowing events to be processed
regardless of their affinities with concurrent or previous events residing in other nodes.
To support this capability and maintain correctness, detection and rollback mechanisms
are incorporated into the design [63]. The detection mechanism introduces overhead
due to the additional processing required to identify dependencies between events while
rollbacks delay simulations by repeating computations optimistically completed. Examples
of tools supporting optimistic PDES include SPaDES [64], WARPED [65], and ModSim
[66]. While parallel simulators have been proven effective for certain applications, they
often provide very poor parallel efficiency due to the high levels of interdependencies and
synchronizations inherent in discrete-event simulations. As a result, we look to alternative
methods for improving the efficiency of simulating data-intensive applications.
Fluid-based modeling provides a particularly successful approach to greatly reduce
the time needed to analyze large-scale systems processing sizeable datasets. Fluid-based
models attempt to abstract an entire data transaction as a single fluid representation,
called a flow. Typically, analytical models are used to define a flow’s behavior as well
as the interactions between multiple flows contending for a shared resource or channel.
Conversely, the behavior of a function-level transaction is captured by the collective
behavior of each individual fragment that composes the transaction as dictated by the
state of the system. By using a fluid-based model, an extremely large data transaction can
be modeled as a continuous event, without generating the large number of discrete events
that would otherwise slow the simulation. In cases where the simulation time is dominated
by the modeling of these large data transactions, such a technique has the potential to
significantly reduce simulation time. For these reasons, the use of fluid-based models
has become popular for applications such as large-scale network simulations [67].
Unfortunately, coarser-grained models are typically less accurate than function-level
models [68], and at times less efficient due to ripple effects caused by the interaction
of competing flows [69]. The ripple effect, discussed in [70], refers to the phenomenon
where the rate change of a single flow causes rate changes to other flows that propagate
throughout the system. In simulations handling a large number of flows, these ripples can
create enough extraneous events to significantly decrease the gains in simulation efficiency
obtained from using fluid-based models.
Several projects have proposed designs of hybrid simulation environments that
combine functional-level and fluid-based, analytical models for accurate and efficient
network simulations [71],[72],[73]. For each of these simulators, users typically decide
whether a network transaction is modeled as multiple, function-level fragments or as
a single, fluid-based flow. While the frameworks provide the flexibility of supporting
both modeling approaches, they suffer from a few shortcomings. First, fluid streams are
modeled solely using analytical models that consider traffic rates and buffer capacities.
Thus, the accuracy of the fluid transaction is based entirely on the analytical model’s
ability to approximate contention and fluctuating network behavior without the use of
detailed, functional-level models. Furthermore, each simulation environment is designed to
analyze only the network-specific characteristics of a system. Therefore, the frameworks
are unable to correctly model the complete behavior of a data-intensive application
and system. Finally, the capabilities of each simulator are demonstrated by focusing
on a single data stream in the presence of randomly generated background traffic. The
demonstrations suggest that difficulties could arise when attempting to model real traffic
generated by the entire system in order to mimic the behavior of an application.
Other research projects have looked beyond fluid-based modeling to reduce the
complexity of simulating data-intensive applications. A framework that combines
application emulators with a set of simulation models for dealing with large-scale, parallel
applications is described in [74]. The application emulators dynamically feed stimuli to
the simulator in the form of epochs, which represent groups of events and their high-level
data dependencies processed by a processor model. While the coarse-grained processor
models and the application emulators abstract away much of the work performed for a
given application, the framework still relies on high-fidelity bus and disk models, which
can hinder the simulation during very large data transactions.
Since the FASE framework places an emphasis on accurate and detailed modeling of
data transactions, it stands to benefit greatly from the hybrid modeling methodology. For
this work, the discrete-event models developed in Chapters 3 and 4 have been extended to
support the hybrid modeling methodologies proposed in this chapter. The remainder of
this chapter presents the hybrid-modeling extensions incorporated into FASE and results
from case studies that showcase the new capabilities of the extended framework.
5.3 Hybrid Simulation Approach
In this section, we introduce a hybrid simulation approach that can be applied to
a number of component types to produce fast and accurate simulation results. Before
providing the specific details on the methodology, we first define some basic terminology
used throughout the chapter as well as a few basic modeling concepts. In this chapter, we
refer to a generic data operation as a transaction, while a fragment and a flow correspond
to function-level and fluid-based modeling, respectively. Figure 5-1a illustrates a generic
hybrid system model consisting of three model types: 1) source, 2) path component, and 3)
sink. Each model type consists of both functional and fluid-based models and participates
in the modeling of the transactions within the system. The components labeled as sources
begin new data transactions based on operations created by the corresponding component
or received from an upper-layer component. The path components are intermediate
models that receive both fragments and flows and propagate each data type to the next
component in the path. The sink models signify the destinations for the transactions
and ultimately receive the actual data sent by the corresponding sources for further
processing. Also, logical channels connect the origin source, path components, and the
sink model participating in each data transaction. In order to relate these generic terms to
a real-world system, we illustrate a system (see Figure 5-1b) employing three transactions.
Transaction T1 corresponds to a remote disk write in which server A (source) transmits
data to a shared, remote disk (sink) using two shared switches (path components).
Transaction T2 is a data transfer in which the workstation receives a message from Server
B, and Transaction T3 represents a write access to the remote disk by server B. These
transactions are used throughout this section to exemplify the techniques employed in the
proposed approach.
Our hybrid modeling approach incorporates three main steps in order to support
quick yet accurate results: 1) function-level training, 2) analytical modeling, and 3)
micro-simulation. As illustrated in Figure 5-2, the first step of a hybrid simulation
(a) Generic system (b) Real-world system
Figure 5-1. High-level example systems employing hybrid modeling
(i.e., function-level training) employs the functional models at all hybrid model types
participating in the transaction to collect performance measurements that characterize
the current state of the system. These measurements are then used in the analytical
modeling stage to configure the corresponding model and calculate the length of time
the source is busy processing the current transaction. The third step, micro-simulation,
uses the metrics collected by path components and sinks during function-level training
to compute delays experienced by each flow due to internal mechanisms and contention.
Data flows through our hybrid systems as follows. First, the source model receives the
data and transmits F fragments using its function-level model.
As these fragments traverse the system, each model collects statistics on the behavior of
the transaction. After F measurements have been made by the source, one more fragment,
called the head fragment, is transmitted and the statistics gathered by the source are fed
to the analytical model. The analytical model calculates a time based on the statistics and
effectively delays the component for the computed time. While delayed, the device can
still respond to remote requests from other components though no further discrete events
are created by the current data transaction. After the calculated time has elapsed, the
source sends one final fragment, deemed the tail fragment, using its function-level models.
This fragment can be delayed at each path-component and sink model according to the
corresponding micro-simulations that occur at each model type. The data transaction
is complete when the tail fragment reaches the sink model. The following subsections
present in-depth details on the three steps employed in our hybrid simulation approach
and discuss the interdependencies between each stage.
Figure 5-2. High-level diagram of hybrid simulation approach
5.3.1 Function-Level Training
Data-intensive applications typically perform numerous operations that manipulate
large quantities of data. In order to effectively model these operations, we consider them
as a two-stage process. The first stage, called the transitional period, represents the period
in which the system begins a new operation. During the transitional period, the system
experiences an increase in data traversing between components while these devices take
the necessary actions to adjust to the new influx of data. The most common impacts
observed in each affected component are buffer growth and increased contention. Once
the system entities have adjusted to the changes, the operation reaches its steady-state
period. In this stage, the system is effectively in equilibrium and the output rate of
each component is highly predictable when considering identical inputs. Our hybrid
approach uses these stages by applying the best suited models for each in order to ensure
accuracy while providing opportunities to reduce simulation time. Specifically, we employ
function-level models during the transitional period and then switch to the fluid-based
models when a steady state has been reached.
The use of high-fidelity models during the transitional period allows us to accurately
capture fine-grained events that have the potential to greatly impact the system’s
overall performance. However, once the operation has reached its steady-state period,
an analytical, fluid-based model can be employed. By using the analytical model,
numerous redundant computations can be replaced by a single calculation thus saving
a significant amount of time proportional to the size of the data corresponding to the
current transaction. One difficulty with this approach is configuring the analytical model
to correctly represent the system’s current state. To overcome this challenge, our hybrid
simulation approach collects statistics on specific attributes of the system as experienced
by the corresponding component while using the function-level models. By collecting these
measurements, we can effectively “train” our analytical model to capture the behavior of
the component based on the state of the system. This process of collecting performance
metrics and other statistics during the transitional period is called function-level training.
Function-level training is conducted at the beginning of each data transaction to
ensure the accuracy of each modeled transaction. As shown in Figure 5-3, the process
begins with the source model transmitting the first F fragments of the data transaction
using its functional model. While these fragments traverse the system, the source keeps
track of key statistics such as the departure rate of fragments as dictated by the system.
Meanwhile, the path-component and sink models monitor the stream of fragments received
from each source in order to calculate internal statistics used within the micro-simulation
stage. Once F measurements have been collected by the source, it transmits a final
fragment, called the head fragment, that contains information, such as the transaction’s
remaining data size, that is used during micro-simulations within the path-component
and sink models. This event also marks the time when the source switches from its
function-level implementation to its fluid-based design. Sections 5.3.2 and 5.3.3 provide
in-depth details on the fluid-based, analytical model and micro-simulations, respectively.
Figure 5-3. Function-level training procedure
The source model can be configured to switch between its function-level and
fluid-based, analytical models according to two user-definable parameters. The first
parameter allows users to specify the number of measurements, F, that should be taken
using the function-level model before switching to the fluid model. The second defines a
specific data size under which the source is forced to use only its functional model. The
first parameter is ideally set to a value that supports the use of the function-level model
during the entire transitional period of a transaction while also capturing the steady-state
statistics for the various component models. If the training period is too short, the
component models could potentially capture metrics that misrepresent the steady-state
conditions for the given transaction. However, if the training period is too long, simulation
time is wasted due to the scheduling and processing of superfluous discrete events. The
second parameter allows the user to control the minimum data size that can use the
hybrid approach. This parameter is important for two reasons. First, each transaction
incurs overhead when employing the hybrid approach due to the extra computation
required at each component to calculate and log statistics. As a result, this overhead has
a greater impact on smaller transactions and can potentially increase simulation time as
compared to functional modeling. More importantly, the hybrid approach was designed
for large transactions, and therefore, can suffer from inaccuracies when considering smaller
transactions. The following section provides details on the issues regarding the accuracy
of the analytical model.
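A minimal sketch of this mode selection is shown below, assuming hypothetical parameter names train_count (the F training measurements) and min_hybrid_bytes (the minimum data size for hybrid operation):

    def choose_mode(transaction_bytes, train_count, min_hybrid_bytes, frag_max):
        # Small transactions never leave the function-level model (parameter 2).
        if transaction_bytes < min_hybrid_bytes:
            return "function-level"
        # A transaction must outlast its training period to benefit (parameter 1).
        if transaction_bytes <= train_count * frag_max:
            return "function-level"
        return "hybrid"  # train for train_count fragments, then switch to fluid

    print(choose_mode(10 * 1024, train_count=4, min_hybrid_bytes=0, frag_max=1024))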
In order to illustrate function-level training, we examine the three transactions
introduced in the previous section and displayed in Figure 5-1b. Consider data sizes
of 10 KB, 8 KB, and 10 KB for Transactions T1, T2, and T3, respectively. Also, the
maximum fragment size is 1 KB and all sources’ training periods are configured to
take four measurements (i.e., F = 4). In this example, each source transmits 4 KB of
data during its training period. The source for T1 (i.e., Server A) determines it can
send 0.5 maximum-sized fragments per second while Server B can transmit 0.25 and
1.0 maximum-sized fragments per second for Transactions T2 and T3, respectively.
Furthermore, the path-component and sink models calculate similar arrival rates for the
transactions during the function-level training periods. After the four measurements are
collected, each source transmits a head fragment containing the necessary information to
be used by the path-component and sink models during their micro-simulations. Also,
the sources use the calculated fragment output rates to calibrate their analytical models
to determine the times the sources are busy processing their current transactions. The
following sections continue this example by outlining how the analytical models and
micro-simulations are used to model the behavior of the three transactions traversing the
system.
5.3.2 Analytical Modeling
After the training period completes, the resulting performance metrics collected at
the source model are used by a fluid-based, analytical model (see Figure 5-2) to calculate
the time in which the current transaction is expected to complete. A timer is set within
the source model based on the calculated time. When the timer expires, the source model
outputs one last fragment, the tail fragment, which signifies the end of the transaction.
The tail fragment is used by the path-component and sink models to calculate final delays
incurred by each flow. These delays are computed using micro-simulations, which are
described in more detail in the following section.
In order to ensure accuracy within the analytical model, three requirements must
be satisfied. First, the model must be capable of capturing the steady-state behavior
of the component that it represents. This requirement can be fulfilled by employing
validated derivations that are either custom-built or defined in literature. Second, the
model should use parameters that correspond to the current state of the system. We
address this requirement through function-level training as described in the previous
section. Finally, since the analytical model is invoked only one time to calculate the time
required to complete each transaction, it inherently assumes the state of the system does
not change within the calculated time period. However, complex systems typically have
numerous transactions interacting and causing changes within the system. As a result, the
last constraint requires a means to recalibrate the analytical model when a system update
potentially affects the behavior of the modeled component. Instead of fully satisfying this
requirement, we allow the analytical model to suffer from any inaccuracies that may occur
due to system updates. However, we introduce a mechanism described in the subsequent
section to compensate for these inaccuracies.
The speedup attained using the analytical model depends on the complexity of the
mechanisms replaced and the number of discrete events created using the equivalent
function-level model. The greater the model’s complexity and the larger the number
of discrete events created by the functional model, the more speedup is achieved using
the hybrid approach. Furthermore, the complexity of the path-component and sink
models also comes into play since discrete events created at the source normally lead to
nontrivial computations at lower-layer components as well as the potential to create more
events that further increase simulation times. Finally, the reduction of discrete events
enabled by analytical models has the potential to greatly trim the memory requirements
of the simulation. As a result, the employment of the hybrid approach allows users to
more efficiently analyze larger systems or attain even greater speedups especially when
replacing function-level systems that require the use of disk swap space. The combination
of all these factors imbues the analytical model with the potential to greatly reduce the
simulation times observed in pure function-level simulations.
During each function-level training period for the three transactions considered in our
example system, 4 KB of data was sent. Also, Server A and Server B calculated fragment
output rates of 0.5, 0.25, and 1.0 fragments per second for Transactions T1, T2, and T3,
respectively. For this example, we assume both sources use the simple analytical model
expressed as
\[
\mathrm{SourceDelay} = \frac{\left\lceil T / \mathrm{Frag}_{\max} \right\rceil - 1}{O}
\tag{5-1}
\]
where O is the fragment output rate, T is the remaining size of the transaction, and
Fragmax is the maximum fragment size. Using this analytical model, the sources delay
for 10, 12, and 5 seconds before sending the tail fragments for Transactions T1, T2, and
T3, respectively. After delaying for the specified amount of time, the corresponding source
transmits the current transaction’s tail fragment, signifying the end of the transfer. The
next section details the role of the micro-simulation technique within the example system.
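The delays above follow directly from Equation 5-1. As a quick check, a few lines of Python (the helper name is hypothetical) reproduce the 10, 12, and 5 second delays:

    import math

    def source_delay(remaining_bytes, frag_max, output_rate):
        # Equation 5-1: SourceDelay = (ceil(T / Frag_max) - 1) / O
        return (math.ceil(remaining_bytes / frag_max) - 1) / output_rate

    KB = 1024  # 1 KB fragments; 4 KB of each transaction was sent during training
    print(source_delay(6 * KB, KB, 0.5))   # T1: 10 KB - 4 KB remaining -> 10.0 s
    print(source_delay(4 * KB, KB, 0.25))  # T2:  8 KB - 4 KB remaining -> 12.0 s
    print(source_delay(6 * KB, KB, 1.0))   # T3: 10 KB - 4 KB remaining ->  5.0 s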
5.3.3 Micro-Simulation
Thus far, our hybrid simulation approach has addressed source-based delays as
configured according to a steady-state view of the system. However, the state of a system
frequently changes resulting in potential impacts that propagate back to the source
models due to feedback mechanisms. There are two options available to handle this
situation. First, the feedback mechanisms can be incorporated into the fluid models such
that changes in the system cause source models to recalibrate their analytical equations
according to the new state of the system. However, this approach can greatly suffer from
an effect similar to the “ripple effect” reported in [70]. That is, changes can continuously
cause source models to recalibrate, thus reverting the hybrid simulation approach into
nothing more than a function-level approach with additional overhead.
The second alternative takes advantage of a simple observation: the system changes
most likely to have a substantial effect on a transaction's performance are the additions
or removals of other transactions competing for the same resource.
By using this observation, the second option allows the source to complete as originally
scheduled while the path-component and sink models account for delays caused by
the additional transactions creating contention within the system. This approach,
deemed micro-simulation, uses FIFO queues and the performance statistics collected
during function-level training to calculate device and contention delays incurred by
each transaction. Micro-Simulation effectively reduces the complexity of the modeled
component into a simple queuing system that approximates the delays experienced by
each flow. In order to simplify the system, the device’s behavior is characterized by a
single parameter, service rate, while each flow is represented with three parameters:
1) start time, 2) arrival rate, and 3) number of fragments. The service rate parameter
specifies the rate at which the device can complete a virtual fragment. This parameter
is typically calculated based on the component’s performance attributes such as latency
and throughput as well as the fragment size. The start time specifies the time at which
a flow first arrives at the component, the arrival rate denotes the rate at which a virtual
fragment arrives at the device, and the number of fragments parameter indicates the
number of virtual fragments represented in a flow. The arrival rate for each flow is
calculated at the path-component and sink models during the flow’s function-level training
while the start time and number of fragments are defined in the head fragment. Due to
queuing complexities, computer simulation is used to calculate the delays for each flow in
a path-component or sink model. Micro-Simulations are conducted only when a new flow
begins (as signified by a head fragment) or an existing flow completes (as indicated by
a tail fragment). For both events, the micro-simulation’s state is updated to the current
simulation time; however, predictions of a flow’s completion time are only made when
its tail fragment is received before the micro-simulation has progressed to a state in which
the flow has completed. Since the addition of new transactions cannot be forecasted, yet
will likely affect the delays experienced by existing flows, speculative calculations of the
completion times can lead to wasted computation. As a result, micro-simulations perform
the minimum amount of computations necessary to calculate the delays experienced by a
flow, thus minimizing the simulation time required by this technique.
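The sketch below captures the essence of such a micro-simulation for a single shared device: a FIFO queue of virtual fragments driven by the device's service rate and each flow's start time, arrival rate, and fragment count. It is a simplified, self-contained illustration with hypothetical names; the actual micro-simulations advance lazily, only up to the current simulation time, rather than running every flow to completion as done here.

    import heapq

    def micro_simulate(flows, service_rate):
        # flows: list of (start_time, arrival_rate, num_fragments) per flow.
        # The device serves one virtual fragment every 1/service_rate seconds.
        arrivals = []
        for fid, (start, rate, count) in enumerate(flows):
            for k in range(count):
                heapq.heappush(arrivals, (start + k / rate, fid))
        device_free = 0.0
        completion = {}
        while arrivals:
            t, fid = heapq.heappop(arrivals)          # FIFO by arrival time
            device_free = max(device_free, t) + 1.0 / service_rate
            completion[fid] = device_free             # time last fragment finished
        return completion

    # Three contending flows sharing one device (cf. the scenario of Figure 5-4).
    print(micro_simulate([(0.0, 0.5, 3), (2.0, 0.25, 2), (3.0, 1.0, 4)],
                         service_rate=1.0))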
Micro-Simulation improves simulation time for two reasons. First, by reducing the
system into a queuing problem, micro-simulation abstracts away the complexities involved
with modeling large transactions employing potentially complicated components. The
queuing system can be quickly processed using computer simulation, thus decreasing the
computation time required to explicitly simulate the system. Second, micro-simulations
do not create discrete events and, thus, do not suffer from scheduling or other processing
delays external to the model that have the potential to drastically increase the overall
simulation time.
Figure 5-4 provides the micro-simulation that occurs at Switch B illustrated in Figure
5-1b. For this example, we assume that Switch B uses a bus-based backplane to route
fragments, and therefore, its micro-simulations model any contention between transactions
sharing this resource. Also, Flow #1, Flow #2, and Flow #3 in Figure 5-4 correspond
to Transactions T1, T2, and T3 in Figure 5-1b, respectively. The characterization
parameters of the switch component and three flows are provided at the top, the state
of the micro-simulation’s queue with respect to time is displayed in the middle, and the
micro-simulation updates and key events that trigger them are shown at the bottom. The
micro-simulation’s queue state illustrates the order and number of virtual fragments from
each flow residing in the queue at a given point in time.
In order to illustrate the three flow types that can occur in our hybrid simulations, we
assume that the transactions’ start times are interleaved according to the values displayed
in Figure 5-4 and each tail fragment sent by the corresponding source experiences a two
second routing delay before it is received by Switch B. Flow #1 illustrates the case in
which the component receives the flow’s tail fragment and then calculates the flow’s
completion time to be the current simulation time. In this case, the tail fragment is
immediately processed by the component since no contention delays occurred. Flow
#2 represents a flow that encounters source-based delays. That is, the component
receives the flow’s tail fragment after the flow’s completion time as calculated by the
micro-simulation. Similar to Flow #1, the tail fragment is processed immediately in
this case. Finally, Flow #3 demonstrates the process followed when the tail fragment
is received by the component but the micro-simulation determines that the flow is not
complete. In this case, the device schedules the processing of the tail fragment to the
corresponding flow’s predicted completion time. Note that the micro-simulation’s state
never progresses beyond the current simulation time. However, predictions are made
to calculate the completion time of Flow #3 when the tail fragment arrives before the
micro-simulation’s state has progressed to the point in which the flow is determined to be
complete.
While the micro-simulation technique eliminates potential slowdowns due to the
ripple effect phenomenon, it does not compensate for all source-based inaccuracies that
may occur when modeling a transaction. Consider the following scenario. Transaction
A begins its training period and causes contention at a shared resource already in use
by Transaction B. Assuming the shared resource can handle only one transaction at a
time without causing delays, the sources of both transactions observe slower rates at
which their data traverses the system. Transaction A finishes its training period just as
Figure 5-4. Example three-flow micro-simulation
Transaction B completes, thus configuring its analytical source model to represent the
component’s behavior in the presence of contention. However, the contention no longer
exists since Transaction B completed. The resulting scenario cannot be resolved through
micro-simulations and requires a feedback mechanism that can notify the source of the
change in the system so it can retrain. However, the feedback loop leads back to the ripple
effect that micro-simulations are designed to avoid. Section 5.5 discusses proposed future
work that investigates techniques that incorporate a feedback mechanism to remedy these
inaccuracies while mitigating the ripple effect.
Throughout this study, we consider only FIFO queues, although our design can be
easily extended to support priority-based queues. Furthermore, we assume infinite queue
capacities in order to simplify the modeling effort. However, the necessary mechanisms are
in place to support finite queue sizes. Finally, under this hybrid approach, the source cannot
proceed with the next transaction until the tail fragment has been fully processed. As
a result, mechanisms that may allow transactions to complete early (e.g., non-blocking
operations) should be disabled or carefully controlled.
5.4 Results and Analysis
In this section we discuss the results of the analyses conducted to show the
capabilities of our hybrid simulation approach as it is applied to the NASA Dependable
Multiprocessor (see Chapter 4 for details). The first section presents the simulation setup
followed by validation results using low-level benchmarks to verify the accuracy and
showcase the speed of the hybrid models. Next, we evaluate the speed and accuracy of the
hybrid models when considering contention at shared resources. Contention stresses the
proposed hybrid methodology due to the lack of feedback loops that adjust the analytical
models within the source components. After the speed and accuracy of our approach have
been showcased, we apply the technique to analyze the performance of the DM system
executing a data-intensive, hyperspectral imaging (HSI) application.
5.4.1 Simulation Setup
The key simulation models employed for the following experiments are listed in
Table 5-1. Similar to the models developed in the previous chapters, these models
were also created using MLD. Each model captures the functionality and behavior of
the corresponding technology while adhering to the FASE methodology of balancing
speed and accuracy. From these core component models, node and system models were
developed. Finally, Table 5-2 lists the key system parameters configured to best match the
performance of the prototype DM system.
The DM system incorporates two main subsystems that can benefit from hybrid
simulation models: the network and the mass data store. Both subsystems use the HAM for
inter-node communication, which in turn employs TCP/IP as its primary communication
protocol over a Gigabit Ethernet link. Therefore, we retrofitted the HAM model in the
DM model library and the TCP, IP, and Ethernet models found in the pre-built FASE
library with the appropriate hybrid simulation mechanisms. The MDS Server model was
also modified to incorporate hybrid mechanisms to model large data accesses to the MDS
device.
Table 5-1. Summary of relevant simulation models

  Library  Component                      Description
  DM       High-Availability Middleware   Provides reliable communication between nodes in the system.
           MDS Server                     Handles data access requests to the mass data store.
  FASE     TCP Layer                      Provides the TCP protocol for reliable communication between nodes.
           IP Layer                       Provides the IP protocol for all network transfers.
           Ethernet NIC                   Provides the Ethernet protocol for all network transfers. Supports multiple ports.
           Ethernet Switch                Provides Ethernet connectivity between nodes. Supports a variety of backplane and routing options.
Table 5-2. Key system parameters
Parameter name Value
Processor power 1200 MIPS, 600 MFLOPS
MPI maximum throughput 57 MB/s
MPI message latency 13.6 ms
HAM buffer size 2000000 bytes
Network bandwidth Non-blocking 1000 Mb/s
Network switch latency 5 µs
MDS bandwidth (write/read) 60/40 MB/s
MDS latency (write/read) 300/500 µs
MDS open file overhead 8 ms
The HAM model acts as a source model for most data transfers between nodes.
The model receives data transactions from the application layer and fragments
each transaction into messages with a maximum size of 14000 bytes when using its
function-level implementation. During the training period, the round-trip time of each
message is calculated in order to calibrate the analytical model used in the fluid-based
implementation. The HAM model supports non-blocking communications via buffering
techniques.
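As a sketch of this calibration step (illustrative only; the internals of the MLD model are not reproduced here, and the averaging is an assumption), the round-trip times measured during training can be reduced to the message output rate that feeds the source's analytical model:

    HAM_MSG_MAX = 14000  # bytes, the HAM's maximum message size

    def calibrate_output_rate(round_trip_times):
        # Messages per second sustained during function-level training.
        avg_rtt = sum(round_trip_times) / len(round_trip_times)
        return 1.0 / avg_rtt

    # Ten training messages (F = 10 for the HAM, per Table 5-3):
    print(calibrate_output_rate([0.00025] * 10))  # -> 4000 messages/s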
The TCP, IP, Ethernet NIC, and Ethernet switch models are retrofitted with hybrid
mechanisms. Each of these components acts as a path component and therefore collects
statistics on the flows traversing through them in order to calculate delays incurred at the
device via micro-simulations. The TCP model also acts as a source model since it has the
capability to fragment the messages received from the HAM into TCP segments with a
maximum size of 1460 bytes. The TCP model is outfitted with an analytical model that
uses the congestion window, maximum window size, and acknowledgement rate to calculate
the time needed to transmit N bytes of data. Currently, our fluid model does not account
for TCP segments dropped in transit; however, the model can be extended to collect the
necessary metrics during training and the analytical model modified to account for the
effects of dropped segments.
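For intuition, a window-based transfer-time estimate of the kind described above can be sketched as follows. This is a deliberately simplified illustration under assumed steady-state conditions (fixed effective window, no segment loss, hypothetical names), not the exact analytical model used in the TCP source:

    TCP_SEG_MAX = 1460  # bytes, the TCP model's maximum segment size

    def tcp_transfer_time(n_bytes, window_bytes, ack_interval):
        # With an effective window of window_bytes acknowledged every
        # ack_interval seconds, steady-state throughput is roughly
        # window_bytes / ack_interval bytes per second.
        throughput = window_bytes / ack_interval
        return n_bytes / throughput

    # 1 MB transfer with a 64 KB window acknowledged every millisecond:
    print(tcp_transfer_time(2**20, 64 * 2**10, 0.001))  # -> 0.016 s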
Finally, the MDS model is also retrofitted with hybrid simulation mechanisms to
provide sequential accesses for submitted disk I/O jobs. The MDS model represents
a sink model and therefore uses micro-simulations to calculate delays incurred while
accessing the mass storage device. More specifically, the queue service rate of the MDS’s
micro-simulations is configured according to the bandwidth and latency of the storage
device while the arrival rates of each flow are calculated based on the performance metrics
gathered during each transaction’s training period. Due to the features of the MDS
server, read accesses retain exclusive ownership of the MDS storage device throughout the
duration of the access while write accesses can be interleaved in 14000-byte messages, the
maximum message size of the HAM.
For this study, the HAM and TCP source models are configured as shown in Table
5-3. The training values were chosen to best balance the accuracy and speed of the
hybrid simulations used to analyze the DM system. The following sections show that
these values provide sufficient training periods to produce fast yet accurate results from
each transaction. The minimum hybrid message size for each source model is initially
set to zero in order to evaluate the hybrid methodology for all message sizes. It should
be noted that each simulation presents a unique case for which these parameters should
be calibrated to best accommodate the characteristics of the systems and applications of
interest.
Table 5-3. Hybrid source model parameters

  Model  Parameter name                        Initial value
  HAM    Function-level training measurements  10
         Minimum hybrid message size           0 MB
  TCP    Function-level training measurements  100
         Minimum hybrid message size           0 MB
All simulations are conducted on dedicated compute nodes with a nominal number
of processes executing in the background to minimize the noise experienced during the
experiments. Each node is configured with a quad-core, 2.4 GHz Xeon processor with 2
GB of main memory running a 64-bit Linux variant using kernel 2.6.9-55.
5.4.2 Performance Modeling
The first study conducted uses two simple benchmarks, PingPong and MDSTest, to
show the accuracy and speed of the hybrid simulation approach under ideal conditions
(i.e., no resource contention and minimal system state changes). The PingPong benchmark
transfers data between two data nodes while the MDSTest benchmark writes and reads
data to and from the MDS node. For both programs, the data transfers range from one
byte to four gigabytes. The small-scale DM testbed system is used to collect experimental
measurements against which results from both the function-level and hybrid models are
compared. Consequently, the results not only showcase the capabilities of our modeling
approach, but also validate the DM model’s accuracy.
From Figure 5-5 and Figure 5-6, one can observe that both modeling approaches
closely reproduce the throughputs observed in the testbed system. For the PingPong
benchmark, we calculate nearly identical mean relative errors of 1.24% for the functional
and hybrid models when compared to the experimental measurements. For this particular
benchmark, the error found between the two modeling approaches was negligible. The
MDS benchmark tests produced similar results with mean relative errors of 1.50% for
writes and 3.01% for reads when using the hybrid approach as compared to experimentally
collected measurements. Again, the MDSTest benchmark showed negligible errors between
the two modeling approaches.
Figure 5-5. PingPong accuracy results
Speedup results for the PingPong benchmark are shown in Figure 5-7. From the
figure, we see that employing the hybrid simulation approach in systems executing
applications that transfer large data sizes can greatly improve simulation times. In fact,
we observe an order of magnitude speedup for datasets as small as 4 MB, while 4 GB
data transfers achieve a speedup of nearly 1850×. The large speedups observed at the
larger data sizes are directly proportional to the number of discrete events eliminated
through the use of the hybrid simulation approach. For example, a 4 GB function-level
transfer using TCP requires the creation, scheduling, and processing of approximately 2.9
million discrete events while the currently configured hybrid approach uses no more than
1000 discrete events to simulate the same transaction. By simply dividing the number of
Figure 5-6. MDSTest accuracy results
discrete events generated using both approaches, we calculate an ideal speedup of around
2940× thus verifying such large gains using our method.
The MDSTest speedup results (see Figure 5-8) show similar trends to those observed
in the PingPong benchmark. Over 10× speedups are achieved at 4 MB datasets and
maximum speedups of over 1700× and 1800× are observed at 4 GB datasets for writes and
reads, respectively. The MDSTest showed larger speedups for read operations due to a
slightly smaller amount of computation required to conduct micro-simulations at the MDS
since reads are sequentially executed with exclusive access to the device.
Both benchmarks show speedups that begin to level off past the 1 GB message size.
This behavior is due to a 1 GB message size limitation placed on data transactions that
use the TCP model in order to avoid problems associated with its 32-bit variables that
maintain its current state for mechanisms such as windowing and acknowledgements. It
should be noted that these restrictions are only placed on component models that have
inherent limitations that can potentially cause problems when considering very large
data sizes. Also, both benchmarks show increasing slowdowns between 1 KB and 256 KB
Figure 5-7. PingPong speedup results
Figure 5-8. MDSTest speedup results
messages, which signify additional overhead when using the proposed hybrid simulation
approach. This added overhead stems from gathering statistics on the
increasing number of fragments used in each transaction. However, once a message size of
512 KB is reached, our hybrid approach overcomes this logging penalty and shows positive
speedup.
The benchmarks used in this study represent best-case scenarios for using the hybrid
simulation technique. This quick study shows that our hybrid approach can provide
substantial reductions in simulation time while having little impact on the accuracy of the
results. However, neither benchmark experienced conditions that could potentially cause
inaccuracies in the hybrid models and thus represent the best-case numbers achievable
by the technique as it is currently configured. More specifically, neither benchmark
causes system state changes while previously configured transactions are executing. The
following section investigates the accuracy and speedup attainable by the hybrid method
when the system is exposed to contention and thus numerous state updates.
5.4.3 Contention Modeling
Now that we have shown the accuracy and speedups achieved using our hybrid
models under relatively ideal conditions (i.e., little to no external effects influencing
a transaction), we investigate the impacts of using the technique under more extreme
conditions. In this study, we introduce contention into the system to observe how the
hybrid models react with regards to simulation speed and accuracy. For this test, we
use the MDSTest benchmark while increasing the number of nodes that simultaneously
access the MDS from two to thirty-two. This scenario creates contention at two shared
resources within the system – the output port in the Ethernet switch attaching the MDS
node to the network and the MDS storage device. For this benchmark, each node involved
first writes N bytes of data to the MDS, where N ranges from 1 B to 128 MB, and then
reads the data back with a synchronization point between each operation to maximize the
amount of contention and thus the stress on the hybrid simulation approach.
Figure 5-9 illustrates the relative errors and speedups observed when comparing the
hybrid write and read access times to the results obtained via the function-level models.
Data sizes less than 64 KB are not displayed since the hybrid models use only their
function-level implementations, consequently producing identical results between the two
simulation approaches. From Figure 5-9a, one can see that the hybrid write accesses show
relatively large deviations in accuracy at smaller data sizes. In fact, a maximum error
of just over 46% was calculated for a 256 KB dataset regardless of the number of flows
tested. Furthermore, as the data size increases, the general trend observed is a decrease in
error. These observations suggest that although the fluid-based models do not adequately
represent the behavior of the source models at smaller data sizes, they are much more
accurate at larger datasets. Meanwhile, for larger flow counts, we observe a minor increase
in error when transitioning from 1 MB to 2 MB data transactions. This increase is likely
due to abstractions made within the hybrid HAM model with respect to its buffer size
(recall from Table 5-2 that the HAM’s buffer size was set to 2 MB) and the non-blocking
functionality provided by this buffer. Figure 5-9b shows that the hybrid read accesses
perform nearly identically to the function-level operations with a maximum observed error
of 0.26% occurring at 64 KB. From the figure, we also observe that the number of nodes
participating in the read portion of the MDSTest benchmark has a minimal effect on the
accuracy of the hybrid simulation approach due to the serialization of accesses conducted
by the MDS Server.
Figure 5-9c and Figure 5-9d show the speedups achieved for the hybrid write and
read accesses over the various message sizes and flow counts. From the figures, one can
see that both operations show significant gains in speedup as the data size increases. The
maximum speedups, 595× and 760×, for the write and read operations, respectively, are
observed at 128 MB. A minimum speedup of 0.75× (i.e., slowdown of 25%) is observed
for both writes and reads, which represents the overhead associated with the extra
computation required to calculate and log statistics when using the hybrid approach.
(a) Write errors (b) Read errors
(c) Write speedups (d) Read speedups
Figure 5-9. MDSTest accuracy and speedup results using hybrid modeling approach
While large errors are observed in the hybrid models when processing smaller write
operations, we must remember that the proposed approach is designed to model large
data transactions. As a result, we identify a cross-over point below which data transactions
within the DM system use only the function-level models rather than the hybrid models in
order to remedy the large errors at small data sizes while still achieving speedups for large
data sizes. To pinpoint this cross-over point, we calculate the speedup-to-error ratios
for the write accesses (see Figure 5-9a and Figure 5-9c) at each data size and select the
value that sustains a ratio greater than one for all flow counts. When this ratio is greater
than one, the speedup is larger than the error thus suggesting that the benefit of the
technique outweighs its inaccuracies. It should be noted that while this ratio is useful to
quickly identify a cross-over point, it can provide values that result in potentially large
inaccuracies in the cases where both the speedup and error values are large. For this study
we find that the 4 MB data size produced ratios of one or greater for all flow counts.
However, we select 8 MB as the minimum message size to use the hybrid models since
we desire single-digit errors for all flow counts as well. In the next section, we explore the
impact of configuring the HAM and TCP models to use this value as its minimum hybrid
message size on a real data-intensive application.
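Expressed as code, the cross-over selection is a small search over the measured curves. The sketch below captures only the ratio rule described above (the single-digit-error refinement is applied by inspection); the arrays are placeholders rather than the values plotted in Figure 5-9:

    def find_crossover(sizes, speedups, errors):
        # sizes: candidate data sizes in increasing order.
        # speedups[size][flows] and errors[size][flows] (error in percent).
        # Return the first size whose speedup-to-error ratio exceeds one
        # for every flow count tested.
        for size in sizes:
            if all(speedups[size][f] > errors[size][f]
                   for f in speedups[size]):
                return size
        return None

    # Placeholder data for two flow counts (2 and 32 nodes):
    speedups = {"2MB": {2: 1.5, 32: 0.9}, "4MB": {2: 4.0, 32: 2.5}}
    errors   = {"2MB": {2: 9.0, 32: 20.0}, "4MB": {2: 3.5, 32: 2.0}}
    print(find_crossover(["2MB", "4MB"], speedups, errors))  # -> "4MB"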
5.4.4 Case Study
Hyperspectral imaging is a technique that combines conventional imaging and
spectroscopy to identify and classify various objects within a 3D image. HSI is used in
applications that include mapping, reconnaissance and surveillance, and environmental
monitoring. Similar to other remote-sensing techniques, HSI typically deals with large
amounts of data that in some applications must be processed in real-time to provide
immediate assessment of potentially threatening scenarios. In this study, we apply our
hybrid simulation approach to an HSI application based on the algorithm presented in [75]
in order to showcase its capabilities when analyzing a real scientific application executing
on the DM system.
Figure 5-10 illustrates the dataflow diagram of the HSI application. Each
participating node acquires a slab of the image, calculates the autocorrelation sample
matrix (ACSM), and transmits the results to a single root node. The root node processes
the data collected from each node in the weight calculation stage and broadcasts C
classification constraints to each node. The nodes then classify the original image data
based on the constraints and save the resulting data to construct an output image. Table
5-4 displays the number of 4-byte elements transferred in each stage in terms of pixels
per row/column (N), spectral bands (L), number of processors (P), and number of
classification constraints (C).
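For a concrete image, the entries of Table 5-4 evaluate directly. The helper below is hypothetical; it uses the 4-byte element size noted above, and the value of C is chosen arbitrarily for illustration:

    ELEMENT_BYTES = 4  # each element is a 4-byte value

    def hsi_transfer_bytes(N, L, P, C):
        # Per-stage dataset sizes from Table 5-4, converted to bytes.
        return {
            "Get Data":  N * N * L * ELEMENT_BYTES // P,   # slab per node
            "Reduce":    L * L * ELEMENT_BYTES,            # ACSM reduction
            "Broadcast": C * L * ELEMENT_BYTES,            # constraints to nodes
            "Save Data": N * N * C * ELEMENT_BYTES // P,   # classified slab
        }

    # 1024x1024 pixels, 64 bands, 20 data nodes, C = 8 constraints (assumed):
    print(hsi_transfer_bytes(N=1024, L=64, P=20, C=8))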
For this case study, we explore the accuracy and speed of the hybrid version of
the DM flight system composed of twenty data nodes [53] as compared to its standard,
Figure 5-10. The HSI data decomposition and dataflow diagram
Table 5-4. Dataset sizes for each HSI data transaction

  Transaction   Dataset size (elements)
  Get Data      (N² × L) / P
  Reduce        L²
  Broadcast     C × L
  Save Data     (N² × C) / P
function-level counterpart. The study analyzes a total of ten image sizes (listed in Table
5-5). The first five datasets represent the images processed using current and emerging
implementations of the HSI application. The last five image sizes represent datasets that
may be analyzed in future versions of HSI and showcase the capabilities of the hybrid
technique when dealing with very large datasets. The horizontal rule in Table 5-5 denotes
the boundary beyond which extrapolation is used to approximate the simulation times
required by the standard, function-level approach to complete a single HSI iteration.
Extrapolation is employed as a means to quickly estimate the full functional simulation
time rather than occupy resources for such long periods of time (a sketch of one possible
fit follows Table 5-5). The study also considers two configurations of the hybrid
HAM and TCP models. The configuration labeled Hybrid-0MB uses the default values
listed in Table 5-3 while the Hybrid-8MB configuration sets the minimum hybrid message
size for both the HAM and TCP models to 8 MB. Recall from Section 5.4.3 that this
data size was determined to balance the hybrid method's speed against its accuracy for
smaller data transactions.
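Conceptually, the minimum hybrid message size acts as a dispatch threshold inside the communication models: transactions below it follow the accurate function-level path, while larger ones use the fast hybrid path. The sketch below illustrates this idea only; the names are ours and do not reflect the actual model library interfaces.

    MIN_HYBRID_BYTES = 8 * 2**20  # 8 MB, from the cross-over analysis in Section 5.4.3

    def simulate_transaction(size_bytes, function_level_model, hybrid_model):
        """Route one data transaction to the appropriate modeling path.

        function_level_model: detailed, event-per-fragment simulation (accurate)
        hybrid_model: calibrated analytical estimate of completion time (fast)
        """
        if size_bytes < MIN_HYBRID_BYTES:
            return function_level_model(size_bytes)  # accuracy matters most here
        return hybrid_model(size_bytes)              # speed matters most here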
Table 5-5. Simulation times for various HSI image sizes

    Image dimensions (elements)   Raw data size   Functional   Hybrid-0MB   Hybrid-8MB
    1024×1024×64                  256 MB          2.18 min     7.56 s       39.40 s
    1024×1024×128                 512 MB          3.92 min     8.08 s       40.25 s
    1024×1024×256                 1 GB            7.37 min     9.55 s       42.48 s
    1024×1024×512                 2 GB            14.31 min    15.34 s      50.88 s
    1024×1024×1024                4 GB            28.33 min    20.91 s      80.33 s
    ------------------------------------------------------------------------------
    2048×2048×1024                16 GB           1.88 hours   23.44 s      50.17 s
    4096×4096×1024                64 GB           7.51 hours   34.07 s      1.03 min
    8192×8192×1024                256 GB          1.25 days    1.25 min     1.74 min
    16384×16384×1024              1 TB            5.01 days    4.08 min     4.66 min
    32768×32768×1024              4 TB            20.03 days   15.53 min    16.29 min
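As one illustrative possibility (the sketch below is an assumption, not the exact procedure used to produce the table), the functional times beyond the rule can be projected with a least-squares power-law fit in log space to the five measured points.

    import math

    def fit_power_law(points):
        """Least-squares fit of t = a * size**b, linear in log-log space."""
        xs = [math.log(s) for s, _ in points]
        ys = [math.log(t) for _, t in points]
        n = len(points)
        mx, my = sum(xs) / n, sum(ys) / n
        b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
        a = math.exp(my - b * mx)
        return a, b

    # Measured functional times from Table 5-5: (raw data size in GB, minutes)
    measured = [(0.25, 2.18), (0.5, 3.92), (1, 7.37), (2, 14.31), (4, 28.33)]
    a, b = fit_power_law(measured)
    print(f"projected at 16 GB: {a * 16 ** b:.0f} min")  # ~100 min; the table lists 1.88 hours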
Table 5-5 shows simulation times required to complete a single iteration of HSI
processing the corresponding image while Figure 5-11 displays the error and speedup of
the two hybrid configurations versus the standard, function-level approach. The results
show that both hybrid configurations provide very accurate results as well as improved
simulation times for all image sizes. In fact, the maximum errors observed in the Hybrid-
0MB and Hybrid-8MB setups are 0.77% and 0.0032%, respectively. Similar to the previous
studies, we find that speedup increases with dataset size, though it begins to level off as
data sizes become very large. Maximum speedups of 1858× and 1771× are observed in the
Hybrid-0MB and Hybrid-8MB configurations, respectively. Note that Hybrid-8MB reports
less speedup for all data sizes due to the increased amount of fragmentation occurring
for small to medium data transactions. However, the accuracy of this configuration for
smaller image sizes is significantly improved, though even the Hybrid-0MB configuration
remains quite accurate.
Figure 5-11. The HSI accuracy and speedup results for two hybrid configurations: (a) error, (b) speedup
5.5 Conclusions
As data-intensive applications become increasingly prevalent, more efficient systems
must be designed to accommodate their special demands. In order to facilitate the design
of these systems, discrete-event simulation is often used to virtually prototype candidate
systems. However, the already lengthy analysis times of simulating complex systems are
further exacerbated when evaluating data-intensive applications due to the sheer volume
of data created, processed, and scheduled by the simulation environment. In this chapter, we
presented a novel approach for hybrid simulation to speed the analysis of applications
processing large datasets while retaining a high degree of accuracy. Our approach
featured two techniques, function-level training and micro-simulations, to calibrate
analytical models that depict the long-term, steady-state behaviors of the corresponding
components and account for changes in the system’s performance without the use of
feedback mechanisms. Details on our hybrid simulation approach were outlined and the
various implications of each technique used were discussed.
To showcase the capabilities of the proposed approach, we applied the techniques
to the NASA Dependable Multiprocessor. First, we observed the accuracy and speedup
achieved by the DM system models using the proposed techniques as compared to a pure
functional model while employing two low-level benchmarks. The PingPong benchmark
reported a mean relative error of 1.24% when using the hybrid simulation approach
while the MDSTest benchmark showed 1.64% and 3.01% errors for writes and reads,
respectively. Furthermore, our approach showed speedups up to 1850× in the PingPong
benchmark and over 1700× in the MDSTest. These large speedups were a result of the
drastic reduction of discrete events processed by the hybrid approach. However, the
outcomes observed from the initial tests represented best-case numbers. As a result, we
analyzed the hybrid DM system model while executing the MDSTest benchmark on two
to thirty-two nodes. This scenario caused contention at the MDS node, thereby exercising
the capabilities of our proposed methodology under more stressful conditions.
The results from this study showed errors up to 46%, though these larger errors occurred
for smaller data sizes. As the transaction size increased, the errors decreased to more
reasonable percentages and eventually to values less than 1%. Maximum speedups of the
hybrid approach were observed to be 595× and 790× for writes and reads, respectively, at
the maximum message size observed (i.e., 128 MB). While the proposed hybrid simulation
approach reported large errors in this study, they were observed at smaller transaction
sizes that displayed smaller speedups when employing the hybrid models. As a result,
we identified a cross-over point at 8 MB that supported the use of the function-level
models to ensure accurate results for small data sizes while transitioning to the hybrid
models for larger data sizes in order to support the speedy simulation of large
transactions.
Once our approach was validated and its potential demonstrated using low-level
benchmarks, we evaluated its accuracy and speed using a hyperspectral imaging (HSI)
application executing on the DM flight system. By analyzing various image sizes using
both the standard function-level and proposed hybrid simulations, we found that our
approach produced a maximum error of 0.77% while displaying a maximum speedup of
290×. The trends observed from the study showed larger errors for smaller datasets due
to inaccuracies in the analytical model while the observed speedup increased with larger
datasets. The analysis concluded by analyzing much larger datasets using the hybrid
simulation approach with extrapolations used to estimate the amount of time required
by the function-level models. The results showed a projected, maximum speedup of
1858×. Speedups of this magnitude suggest that our hybrid approach has the potential to
complete month-long simulations in mere minutes.
The work conducted during this final phase of research produced a hybrid
simulation approach that employed two novel techniques, function-level training and
micro-simulations, to reduce the analysis times of simulations of data-intensive
applications. These hybrid simulation mechanisms were incorporated into the FASE
framework and numerous pre-existing models within the FASE and DM model libraries
were retrofitted to accommodate these speedy features. Case studies demonstrated
the capabilities of the proposed approach as applied to the DM system executing a
data-intensive, hyperspectral imaging application. The contributions and accomplishments
of this work have been compiled into a manuscript that was submitted to ACM Transactions
on Modeling and Computer Simulation [76].
CHAPTER 6
CONCLUSIONS
This document presented three phases of research that show the wide range of topics
addressed by the FASE framework. The first phase analyzed the various aspects
involved with the design and development of a performance prediction infrastructure
using application characterization and discrete-event simulation in order to balance speed
and accuracy. The work laid out a generalized methodology to predict the performance
of applications executing on virtually prototyped systems. The methodology was then
realized through the use of an application characterization tool called Sequoia, which
traced MPI function calls and measured computation time between communication
events, and a pre-built model library created in MLDesigner, a hierarchical, discrete-event
simulation tool. Case studies were then conducted to observe simulation accuracy and
speed when compared to experimental measurements. The results showed prediction
errors within an acceptable threshold (within 25%) and simulation times no more
than three orders of magnitude slower than experimental processing times. After
validation, the potential of FASE was showcased with an in-depth study of the Sweep3D
algorithm executing on virtual systems composed of various network types, middleware
implementations, processing capabilities, and other degrees of freedom in the systems’
hardware and software configurations.
The framework developed in the first phase of this research created an ideal
environment to evaluate high-performance, embedded space systems. Consequently,
a NASA-sponsored project was used to explore the flexibility of FASE and extend
the framework to support not only scalability studies of the proposed space system,
but also availability studies. The work conducted in phase two expanded the FASE
pre-built models to include reliable middleware technologies that monitor system health,
schedule and deploy jobs, and react and recover from faults. A fault model library
was also developed to inject faults into the system in order to perform availability
studies. After the necessary toolkit was in place, a three-stage analysis procedure was
formulated for performance and availability evaluations. This approach allows users to
calibrate component models using experimental measurements, quickly identify workloads
and fault rates supported by the management software via Markov-reward models, and
thoroughly investigate specific applications executing on virtually prototyped systems
through discrete-event simulation. The novel analysis methodology and simulation
models were then applied to explore the scalability of the 2D FFT kernel executing on
the DM system. The scalability study revealed that the prime bottleneck of the system
was the centralized memory, and algorithmic and architectural variations were analyzed to
alleviate the problem. After the scalability analysis, the SAR application was used to
study the performability of the virtual flight system consisting of twenty data nodes.
The results showed good system throughput (i.e., between 300 and 600 images per orbit)
and performability (i.e., over 99.5% in low radiation environments and 54.0% in extreme
conditions) when the system was enhanced and augmented with improved data processors
and MDS storage devices as well as extra MDS nodes to mitigate the contention point
discovered in the 2D FFT case study and further substantiated in the SAR study.
The final phase of this research considered extensions to the existing FASE framework
in order to overcome scalability issues with simulation time when analyzing data-intensive
applications. To overcome these issues, we proposed a novel hybrid simulation approach
that employs two unique techniques, function-level training and micro-simulations, to
reduce the amount of time required to simulate systems executing applications with
large datasets. The proposed approach combines the accuracy of function-level models,
via function-level training, with the speed of analytical models and micro-simulations
in order to quickly and accurately approximate the time needed to complete a data
transaction. This combination drastically reduces the number of discrete events processed
and scheduled by the simulator, thus resulting in simulation speedups over the pure
function-level approach. The approach was then applied to the DM system executing
low-level benchmarks and a hyperspectral imaging application. The low-level benchmarks
showed relatively accurate results (errors under 7%) and order-of-magnitude speedups when
considering data transactions as small as 8 MB. Furthermore, the accuracy and speedup
of the hybrid simulation approach improved as the transaction size increased. In fact, the
simulations reported minimum errors of less than 1% and speedups over 1700× for the
low-level benchmarks and 1500× for the HSI application.
APPENDIX A
EXPERIMENTAL AND SIMULATIVE SETUP
This research project incorporates two realms to explore various aspects and proposed
techniques for fast and accurate performance prediction. These realms are experimental
and simulative. The experimental realm deals with physical hardware and software used
to construct a compute system to collect “real-world” values, which are compared to
the results gathered in the second realm, simulation. The simulation realm provides an
environment to explore computational systems unavailable to the researcher due to limited
funds, non-existent components, or future generations of components. More details on the
interactions between these realms as applied to performance modeling and prediction are
discussed in the following sections.
A.1 Experimental Setup
The work conducted during the course of this study employs equipment from the
High-performance Computing and Simulation (HCS) Lab at the University of Florida.
The HCS lab consists of nine compute clusters, each with different processor, interconnect,
main-memory, hard-disk, and software resources. Table A-1 lists the subset of clusters
used for this study along with their resource types, capabilities, and capacities.
Table A-1. Computation systems at the HCS Lab at UF

    Cluster name   CPU type   CPU speed   CPU count   Node count   Memory        Special features
    Alpha          Xeon       3.2 GHz     128         32           2 GB DDR667   EM64T, quad-core
    Delta          Xeon       3.2 GHz     16          16           2 GB DDR333   EM64T, PCI-Express
    Mu             Opteron    2.0 GHz     32          32           1 GB DDR400   PCI-Express, QsNetII
    Lambda         Opteron    1.4 GHz     32          16           1 GB DDR333   10 Gb/s InfiniBand
    Kappa          Xeon       2.4 GHz     70          35           1 GB DDR266
A.2 Simulation Setup
The modeling tool employed for this project was Mission-Level Designer (MLD)
developed by MLDesign Technologies, Inc. [27]. MLD is a block-oriented, discrete-event
simulation environment that supports modular and hierarchical designs. At its core,
MLD uses primitives: blocks of C++ code that each provide a specific function such as
arithmetic, dataflow switching, or data queuing. Larger modules and systems are constructed by
connecting two or more primitives and/or other modules via a graphical interface. In order
to further facilitate user design, MLD supplies numerous libraries with pre-built primitives
and modules. Figure A-1 shows the development environment of MLD.
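To illustrate the block-oriented, discrete-event style that MLD embodies, the following minimal sketch (our own Python illustration, not MLD's actual C++ API) shows toy primitives exchanging timestamped events through a central event queue.

    import heapq
    import itertools

    class Simulator:
        def __init__(self):
            self._queue = []               # entries: (time, seq, handler, payload)
            self._seq = itertools.count()  # tie-breaker for equal timestamps
            self.now = 0.0

        def schedule(self, delay, handler, payload=None):
            heapq.heappush(self._queue,
                           (self.now + delay, next(self._seq), handler, payload))

        def run(self):
            while self._queue:
                self.now, _, handler, payload = heapq.heappop(self._queue)
                handler(payload)

    # Two toy "primitives" wired together: a source feeding a delay block.
    sim = Simulator()

    def delay_block(pkt):
        print(f"t={sim.now:.1f}: delivered {pkt}")

    def source(_):
        sim.schedule(2.5, delay_block, "packet")  # model a 2.5-unit link latency

    sim.schedule(0.0, source)
    sim.run()  # prints: t=2.5: delivered packet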
Figure A-1. The MLD development environment
REFERENCES
[1] O. Lubeck, Y. Luo, H. Wasserman, and F. Bassetti, “An Empirical Hierarchical Memory Model Based on Hardware Performance Counters,” Proc. Int’l Conf. Parallel and Distributed Processing Techniques and Applications, Las Vegas, NV, July 13-16, 1998.
[2] D. Kerbyson, H. Wasserman, and A. Hoisie, “Exploring Advanced Architectures Using Performance Prediction,” Proc. Int’l Workshop on Innovative Architecture, Kohala Coast, Big Island, HI, Jan. 10-11, 2002.
[3] M. Salsburg, “A Statistical Approach to Computer Performance Modeling,” ACM SIGMETRICS Performance Evaluation Review, vol. 15, no. 1, pp. 155-162, May 1987.
[4] E. Strohmaier, “Statistical Performance Modeling: Case Study of the NPB 2.1 Results,” Proc. Third Int’l Euro-Par Conf. Parallel Processing, Passau, Germany, Aug. 26-29, 1997.
[5] R. Jain, The Art of Computer Systems Performance Analysis. John Wiley and Sons, 1991.
[6] A. Sampogna, D. Kaeli, D. Green, M. Silva, and C. Sniezek, “Performance Modeling Using Object-Oriented Execution-Driven Simulation,” Proc. 29th Simulation Symp., New Orleans, LA, Apr. 8-11, 1996.
[7] S. Dwarkadas, J. Jump, and J. Sinclair, “Execution-Driven Simulation of Multiprocessors: Address and Timing Analysis,” ACM Trans. Modeling and Computer Simulation, vol. 4, no. 4, pp. 314-338, Oct. 1994.
[8] R. Uhlig and T. Mudge, “Trace-driven Memory Simulation: A Survey,” ACM Computing Surveys, vol. 29, no. 2, pp. 128-170, June 1997.
[9] J. Flanagan, B. Nelson, J. Archibald, and G. Thompson, “The Inaccuracy of Trace-Driven Simulation Using Incomplete Multiprogramming Trace Data,” Proc. Fourth Int’l Workshop Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, San Jose, CA, Feb. 1-3, 1996.
[10] S. Moore, F. Wolf, J. Dongarra, S. Shende, P. Teller, and B. Mohr, “A Scalable Approach to MPI Application Analysis,” Proc. 12th European PVM/MPI Users’ Group Meeting, Sorrento, Italy, Sept. 18-21, 2005.
[11] D. Culler, J. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, 1998.
[12] MPI Forum, “MPI: A Message-Passing Interface Standard,” University of Tennessee, Version 1.1, June 1995.
[13] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, “A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard,” Parallel Computing, vol. 22, no. 6, pp. 789-828, Sept. 1996.
[14] A. George, R. Fogarty, J. Markwell, and M. Miars, “An Integrated Simulation Environment for Parallel and Distributed System Prototyping,” Simulation, vol. 72, no. 5, pp. 283-294, May 1999.
[15] E. Deelman, A. Dube, A. Hoisie, Y. Luo, R. Oliver, D. Sundaram-Stukel, H. Wasserman, V. Adve, R. Bagrodia, J. Browne, E. Houstis, O. Lubeck, J. Rice, P. Teller, and M. Vernon, “POEMS: End-to-End Performance Design of Large Parallel Adaptive Computational Systems,” IEEE Trans. Software Engineering, vol. 26, no. 11, pp. 1027-1048, Nov. 2000.
[16] R. Bagrodia, E. Deelman, S. Docy, and T. Phan, “Performance Prediction of Large Parallel Applications Using Parallel Simulations,” Proc. Seventh ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, pp. 151-161, Atlanta, GA, May 1999.
[17] M. Uysal, T. Kurc, A. Sussman, and J. Saltz, “A Performance Prediction Framework for Data Intensive Applications on Large Scale Parallel Machines,” Technical Report CS-TR-3918 and UMIACS-TR-98-39, University of Maryland, Department of Computer Science and UMIACS, July 1998.
[18] J. Cao, D. Kerbyson, E. Papaefstathiou, and G. Nudd, “Performance Modeling of Parallel and Distributed Computing Using PACE,” Proc. 19th IEEE Int’l Performance, Computing, and Communications Conf., pp. 485-492, Phoenix, AZ, Feb. 20-22, 2000.
[19] S. Pllana and T. Fahringer, “Performance Prophet: A Performance Modeling and Prediction Tool for Parallel and Distributed Programs,” Proc. Int’l Conf. Parallel Processing, Oslo, Norway, June 14-17, 2005.
[20] A. Snavely, L. Carrington, and N. Wolter, “A Framework for Performance Modeling and Prediction,” Proc. 15th Supercomputing Conf., Baltimore, MD, Nov. 16-22, 2002.
[21] D. Bailey and A. Snavely, “Performance Modeling: Understanding the Present and Predicting the Future,” Proc. Euro-Par Conf., Lisbon, Portugal, Aug. 30-Sept. 2, 2005.
[22] R. Badia, J. Labarta, J. Gimenez, and F. Escale, “DIMEMAS: Predicting MPI Applications Behavior in Grid Environments,” Proc. Workshop on Grid Applications and Programming Tools, Seattle, WA, June 25, 2003.
[23] S. Moore, D. Cronk, K. London, and J. Dongarra, “Review of Performance Analysis Tools for MPI Parallel Programs,” Technical Report, University of Tennessee, Computer Science Department, 1998.
[24] L. DeRose and D. Reed, “SvPablo: A Multi-Language Architecture-Independent Performance Analysis System,” Proc. Int’l Conf. Parallel Processing, Fukushima, Japan, Sept. 1999.
[25] B. Miller, M. Callaghan, J. Cargille, J. Hollingsworth, R. Irvin, K. Karavanic, K. Kunchithapadam, and T. Newhall, “The Paradyn Parallel Performance Measurement Tool,” IEEE Computer, vol. 28, no. 11, pp. 37-46, Nov. 1995.
[26] S. Shende and A. Malony, “The TAU Parallel Performance System,” Int’l Journal of High Performance Computing Applications, vol. 20, no. 2, pp. 287-331, Summer 2006.
[27] G. Schorcht, I. Troxel, K. Farhangian, P. Unger, D. Zinn, C. Mick, A. George, and H. Salzwedel, “System-Level Simulation Modeling with MLDesigner,” Proc. 11th Int’l Symp. Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pp. 207-212, Orlando, FL, Oct. 12-15, 2003.
[28] J. Vetter, N. Bhatia, E. Grobelny, and P. Roth, “Capturing Petascale Application Characteristics with the Sequoia Toolkit,” Proc. Parallel Computing, Malaga, Spain, Sept. 13-16, 2005.
[29] S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci, “A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters,” Proc. 13th Supercomputing Conf., Dallas, TX, Nov. 4-10, 2000.
[30] E. Grobelny and J. Vetter, “Extrapolating Communication Patterns of Large-scale Scientific Applications,” Technical Report, University of Florida and Oak Ridge National Laboratory, 2006.
[31] O. Zaki, E. Lusk, W. Gropp, and D. Swider, “Toward Scalable Performance Visualization with Jumpshot,” The Int’l Journal of High Performance Computing Applications, vol. 13, no. 2, pp. 277-288, Fall 1999.
[32] D. Gustavson and Q. Li, “The Scalable Coherent Interface (SCI),” IEEE Communications Magazine, vol. 34, no. 8, pp. 52-63, Aug. 1996.
[33] K. Koch, R. Baker, and R. Alcouffe, “Solution of the First-Order Form of the 3-D Discrete Ordinates Equation on a Massively Parallel Processor,” Trans. of the American Nuclear Society, vol. 65, no. 198, 1992.
[34] J. Vetter and A. Yoo, “An Empirical Performance Evaluation of Scalable Scientific Applications,” Proc. 15th Supercomputing Conf., Baltimore, MD, Nov. 16-22, 2002.
[35] E. Grobelny, D. Bueno, I. Troxel, A. George, and J. Vetter, “FASE: A Framework for Scalable Performance Prediction of HPC Systems and Applications,” Simulation: Transactions of The Society for Modeling and Simulation International, vol. 83, no. 10, pp. 721-745, Oct. 2007.
[36] M. Griffin, “NASA 2006 Strategic Plan,” National Aeronautics and Space Administration, NP-2006-02-423-HQ, Washington DC, Feb. 2006.
[37] J. Ramos, J. Samson, D. Lupia, I. Troxel, R. Subramaniyan, A. Jacobs, J. Greco, G. Cieslewski, J. Curreri, M. Fischer, E. Grobelny, A. George, V. Aggarwal, M. Patel, and R. Some, “High-Performance, Dependable Multiprocessor,” Proc. IEEE/AIAA Aerospace Conf., Big Sky, MT, Mar. 4-11, 2006.
[38] D. Dechant, “The Advanced Onboard Signal Processor (AOSP),” Advances in VLSI and Computer Systems, vol. 2, no. 2, pp. 69-78, Oct. 1990.
[39] M. Iacoponi and D. Vail, “The Fault Tolerance Approach of the Advanced Architecture On-Board Processor,” Proc. Symp. Fault-Tolerant Computing, Chicago, IL, June 21-23, 1989.
[40] F. Chen, L. Craymer, J. Deifik, A. Fogel, D. Katz, A. Silliman Jr., R. Some, S. Upchurch, and K. Whisnant, “Demonstration of the Remote Exploration and Experimentation (REE) Fault-Tolerant Parallel-Processing Supercomputer for Spacecraft Onboard Scientific Data Processing,” Proc. Int’l Conf. Dependable Systems and Networks, New York, NY, June 25-28, 2000.
[41] E. Prado, P. Prewitt, and E. Ille, “A Standard Approach to Spaceborne Payload Data Processing,” Proc. IEEE Aerospace Conf., Big Sky, MT, March 10-17, 2001.
[42] S. Fuller, RapidIO: The Embedded System Interconnect. John Wiley & Sons, 2005.
[43] J. Meyer, “On Evaluating the Performability of Degradable Computing Systems,” IEEE Trans. Computers, vol. C-29, no. 8, pp. 720-731, Aug. 1980.
[44] R. Subramaniyan, V. Aggarwal, A. Jacobs, and A. George, “FEMPI: A Lightweight Fault-tolerant MPI for Embedded Cluster Systems,” Proc. Int’l Conf. Embedded Systems and Applications, Las Vegas, NV, June 26-29, 2006.
[45] R. Smith, K. Trivedi, and A. Ramesh, “Performability Analysis: Measures, an Algorithm and a Case Study,” IEEE Trans. Computers, vol. 37, no. 4, pp. 406-417, Apr. 1988.
[46] B. Haverkort, R. Marie, G. Rubino, and K. Trivedi (editors), Performability Modeling: Techniques and Tools. Wiley, 2001.
[47] C. Hirel, R. Sahner, X. Zang, and K. Trivedi, “Reliability and Performability Modeling Using SHARPE 2000,” Proc. Int’l Conf. Computer Performance Evaluation: Modeling Techniques and Tools, Schaumburg, IL, Mar. 27-31, 2000.
[48] I. Troxel, E. Grobelny, and A. George, “System Management Services for High-Performance In-situ Aerospace Computing,” AIAA Journal of Aerospace Computing, Information, and Communication, vol. 4, no. 2, pp. 636-656, Feb. 2007.
[49] D. Bueno, C. Conger, A. Leko, I. Troxel, and A. George, “RapidIO-based Space System Architectures for Synthetic Aperture Radar and Ground Moving Target Indicator,” High-Performance Embedded Computing Workshop, MIT Lincoln Lab, Lexington, MA, Sept. 20-22, 2005.
[50] P. Meisl, M. Ito, and I. Cumming, “Parallel Synthetic Aperture Radar Processing on Workstation Networks,” Proc. 10th Int’l Parallel Processing Symp., pp. 716-723, Honolulu, HI, Apr. 15-19, 1996.
[51] C. Miller, D. Payne, T. Phung, H. Siegel, and R. Williams, “Parallel Processing of Spaceborne Imaging Radar Data,” Proc. Eighth Supercomputing Conf., San Diego, CA, Dec. 4-8, 1995.
[52] D. Sandwell, “SAR Image Formation: ERS SAR Processor Coded in MATLAB,” http://www.geo.uzh.ch/rsl/research/SARLab/GMTILiterature/PDF/San02d.pdf, 2002.
[53] J. Samson, G. Gardner, D. Lupia, M. Patel, P. Davis, V. Aggarwal, A. George, Z. Kalbarcyzk, and R. Some, “Technology Validation: NMP ST8 Dependable Multiprocessor Project II,” Proc. IEEE Aerospace Conf., Big Sky, MT, Mar. 3-10, 2007.
[54] E. Grobelny, G. Cieslewski, I. Troxel, and A. George, “Predicting the Performance of Radiation-Susceptible Aerospace Computing Systems and Applications,” submitted to ACM Trans. Embedded Computing Systems.
[55] M. Cannataro, D. Talia, and P. Srimani, “Parallel Data Intensive Computing in Scientific and Commercial Applications,” Parallel Computing, vol. 28, no. 5, pp. 673-704, May 2002.
[56] U. Fayyad, “Data Mining and Knowledge Discovery: Making Sense Out Of Data,” IEEE Expert: Intelligent Systems and Their Applications, vol. 11, no. 5, pp. 20-25, Oct. 1996.
[57] Data-Intensive Computing Initiative (DICI), Pacific Northwest National Laboratory, http://dicomputing.pnl.gov/.
[58] K. Walsh and E. Sirer, “Staged Simulation: A General Technique For Improving Simulation Scale and Performance,” ACM Trans. Modeling and Computer Simulation, vol. 14, no. 2, pp. 170-195, Apr. 2004.
[59] R. Fujimoto, “Parallel Simulation: Parallel and Distributed Simulation Systems,” Proc. Winter Simulation Conf., pp. 147-157, Arlington, VA, Dec. 9-12, 2001.
[60] D. Nicol, “Principles of Conservative Parallel Simulation,” Proc. Winter Simulation Conf., pp. 128-135, Coronado, CA, Dec. 8-11, 1996.
[61] B. Groselj, “CPSim: A Tool for Creating Scalable Discrete-Event Simulations,” Proc. Winter Simulation Conf., pp. 579-583, Arlington, VA, Dec. 3-6, 1995.
[62] R. Bagrodia, R. Meyer, M. Takai, Y. Chen, X. Zeng, J. Martin, B. Park, and H. Song, “Parsec: A Parallel Simulation Environment for Complex Systems,” IEEE Computer, vol. 31, no. 10, pp. 77-85, Oct. 1998.
[63] R. Fujimoto, “Optimistic Approaches to Parallel Discrete-event Simulation,” Trans. of the Society for Computer Simulation International, vol. 7, no. 2, pp. 153-191, June 1990.
[64] Y. Teo, S. Tay, and S. Kong, “SPaDES: An Environment for Structured Parallel Simulation,” Technical Report TR20/96, Department of Information Systems and Computer Science, National University of Singapore, Singapore, Oct. 1996.
[65] D. Martin, T. McBrayer, and P. Wilsey, “WARPED: A Time Warp Simulation Kernel for Analysis and Application Development,” Proc. 29th Hawaii Int’l Conf. System Sciences Volume 1: Software Technology and Architecture, pp. 383-386, Jan. 3-6, 1996.
[66] J. West and A. Mullarney, “ModSim: A Language for Distributed Simulation,” Proc. SCS Multiconf. Distributed Simulation, pp. 155-159, San Diego, CA, Feb. 3-5, 1988.
[67] Y. Liu, F. Presti, V. Misra, D. Towsley, and Y. Gu, “Fluid Models and Solutions for Large-scale IP Networks,” Proc. ACM SIGMETRICS Int’l Conf. Measurement and Modeling Computer Systems, pp. 91-101, San Diego, CA, June 10-14, 2003.
[68] A. Yan and W. Gong, “Time-driven Fluid Simulation for High Speed Networks,” IEEE Trans. Information Theory, vol. 45, no. 5, pp. 1588-1599, June 1999.
[69] B. Liu, Y. Gao, J. Kurose, D. Towsley, and W. Gong, “Fluid Simulation of Large-scale Networks: Issues and Tradeoffs,” Proc. Int’l Conf. Parallel and Distributed Processing Techniques and Applications, pp. 2136-2142, Las Vegas, NV, June 28-July 1, 1999.
[70] G. Kesidis, A. Singh, D. Cheung, and W. Kwok, “Feasibility of Fluid-Driven Simulation for ATM Network,” Proc. IEEE Global Telecommunications Conf., pp. 2013-2017, London, England, Nov. 18-22, 1996.
[71] B. Melamed and S. Pan, “HNS: A Streamlined Hybrid Network Simulator,” ACM Trans. Modeling and Computer Simulation, vol. 14, no. 3, pp. 251-277, July 2004.
[72] G. Riley, T. Jafaar, and R. Fujimoto, “Integrated Fluid and Packet Network Simulations,” Proc. IEEE Int’l Symp. Modeling, Analysis and Simulation of Computer and Telecommunications Systems, pp. 511-518, Fort Worth, TX, Oct. 11-16, 2002.
[73] C. Kiddle, R. Simmonds, C. Williamson, and B. Unger, “Hybrid Packet/Fluid Flow Network Simulation,” Proc. 17th Workshop Parallel and Distributed Simulation, pp. 143-152, San Diego, CA, June 10-13, 2003.
[74] M. Uysal, T. Kurc, A. Sussman, and J. Saltz, “A Performance Prediction Framework for Data Intensive Applications on Large-Scale Parallel Machines,” Proc. Fourth Int’l Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers, pp. 243-258, Pittsburgh, PA, May 28-30, 1998.
[75] C. Chang, H. Ren, and S. Chiang, “Real-time Processing Algorithms for Target Detection and Classification in Hyperspectral Imagery,” IEEE Trans. Geoscience and Remote Sensing, vol. 39, no. 4, pp. 760-768, Apr. 2001.
[76] E. Grobelny, C. Reardon, and A. George, “A Hybrid Simulation Approach to Reduce Analysis Time of Data-Intensive Applications,” submitted to ACM Trans. Modeling and Computer Simulation.
BIOGRAPHICAL SKETCH
Eric Grobelny began attending the University of Florida in Fall 1998 and received
his B.S. in 2002 and M.E. in 2004. He conducted research at the High-performance
Computing and Simulation (HCS) Laboratory under the supervision of Dr. Alan George
for six years focusing on performance analysis and prediction for high-performance
computing systems and applications. His other interests include high-performance
embedded computing for aerospace systems and applications, simulation-based fault
injection, and high-performance interconnect technologies. As a member of the HCS lab,
he also worked on numerous side projects including developing an MPI communication
layer for satellite systems, investigating performance enhancements for low-locality
applications, and exploring techniques and best practices for disaster recovery and
mission assurance in dynamic, high-performance environments. Eric has accepted a job at
Honeywell Space Systems in Clearwater, Florida.