FAST AND ACCURATE SIMULATION ENVIRONMENT (FASE) FOR
HIGH-PERFORMANCE COMPUTING SYSTEMS AND APPLICATIONS
By
ERIC M. GROBELNY
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2008
© 2008 Eric M. Grobelny
To my family and friends
ACKNOWLEDGMENTS
I would like to thank, first and foremost, my parents, brother, and sister for their love
and guidance. I would also like to express my utmost gratitude to my advisor, Dr. Alan
George, for supporting me through graduate school and teaching me the necessary skills
to become a researcher and innovator. Another crucial person who taught me much about
research in computer engineering is Dr. Jeff Vetter. Furthermore, I wish to express my
appreciation to my sponsors (the Department of Defense, Honeywell, and the University
of Florida) for their financial aid. Without it I would be in extreme debt. Finally, I would
like to thank Mr. Robert Henuber for planting the seed that inspired and motivated me to
become a doctor of philosophy in computer engineering.
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 BACKGROUND AND RELATED RESEARCH . . . . . . . . . . . . . . . . . . 17
3 FAST AND ACCURATE SIMULATION ENVIRONMENT (PHASE 1) . . . . . 24
3.1 Application Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
  3.1.1 Application Characterization . . . . . . . . . . . . . . . . . . . . 26
  3.1.2 Stimulus Development . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Simulation Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
  3.2.1 Component Design . . . . . . . . . . . . . . . . . . . . . . . . . 33
  3.2.2 System Development . . . . . . . . . . . . . . . . . . . . . . . . 37
  3.2.3 System Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
  3.3.1 Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . 39
  3.3.2 System Validation . . . . . . . . . . . . . . . . . . . . . . . . . . 41
  3.3.3 Case Study: Sweep3D . . . . . . . . . . . . . . . . . . . . . . . . 44
    3.3.3.1 Experiment 1: Accuracy . . . . . . . . . . . . . . . . . . 46
    3.3.3.2 Experiment 2: Speed . . . . . . . . . . . . . . . . . . . . 47
    3.3.3.3 Experiment 3: Virtual system prototyping . . . . . . . . . 49
3.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4 PERFORMANCE AND AVAILABILITY PREDICTIONS OF VIRTUALLY
PROTOTYPED SYSTEMS FOR SPACE-BASED APPLICATIONS (PHASE 2) . . 57
4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
  4.1.1 Project Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 57
  4.1.2 DM System Architecture . . . . . . . . . . . . . . . . . . . . . . 59
  4.1.3 DM Middleware Architecture . . . . . . . . . . . . . . . . . . . . 61
4.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
  4.2.1 Physical Prototype . . . . . . . . . . . . . . . . . . . . . . . . . 64
  4.2.2 Markov-Reward Modeling . . . . . . . . . . . . . . . . . . . . . 65
    4.2.2.1 Data node model . . . . . . . . . . . . . . . . . . . . . . 66
    4.2.2.2 System model . . . . . . . . . . . . . . . . . . . . . . . . 69
  4.2.3 Discrete-Event Simulation Modeling . . . . . . . . . . . . . . . . 70
  4.2.4 Fault Model Library . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
  4.3.1 Model Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . 76
    4.3.1.1 Component model calibration and validation . . . . . . . . 76
    4.3.1.2 System performability model . . . . . . . . . . . . . . . . 77
  4.3.2 Case Study: Fast Fourier Transform . . . . . . . . . . . . . . . . 78
  4.3.3 Case Study: Synthetic Aperture Radar . . . . . . . . . . . . . . . 82
    4.3.3.1 Amenability study . . . . . . . . . . . . . . . . . . . . . . 86
    4.3.3.2 In-depth application analysis . . . . . . . . . . . . . . . . 87
    4.3.3.3 Flight system . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5 HYBRID SIMULATIONS TO IMPROVE THE ANALYSIS TIME OF
DATA-INTENSIVE APPLICATIONS (PHASE 3) . . . . . . . . . . . . . . . . . 99
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2 Background and Related Research . . . . . . . . . . . . . . . . . . . . 102
5.3 Hybrid Simulation Approach . . . . . . . . . . . . . . . . . . . . . . . 106
  5.3.1 Function-Level Training . . . . . . . . . . . . . . . . . . . . . . . 108
  5.3.2 Analytical Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 111
  5.3.3 Micro-Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.4 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
  5.4.1 Simulation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . 118
  5.4.2 Performance Modeling . . . . . . . . . . . . . . . . . . . . . . . 121
  5.4.3 Contention Modeling . . . . . . . . . . . . . . . . . . . . . . . . 125
  5.4.4 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
APPENDIX
A EXPERIMENTAL AND SIMULATIVE SETUP . . . . . . . . . . . . . . . . . . 137
A.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
A.2 Simulation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
LIST OF TABLES
Table page
3-1 The FASE component library . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3-2 Experimental versus simulation execution times for matrix multiply . . . . . . . 43
3-3 Ratio of simulation to experimental wall-clock execution time . . . . . . . . . . 45
3-4 Compute node specifications for each cluster in heterogeneous system . . . . . . 46
3-5 Experimental versus simulation errors for Sweep3D . . . . . . . . . . . . . . . . 48
4-1 The DM middleware components . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4-2 Data node model states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4-3 Failure and recovery rates of the node model . . . . . . . . . . . . . . . . . . . . 68
4-4 Summary of DM component models . . . . . . . . . . . . . . . . . . . . . . . . . 71
4-5 Summary of fault models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4-6 Baseline system parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4-7 The FFT Algorithmic variations and system enhancements . . . . . . . . . . . . 81
4-8 Checkpoint options explored using patch-based SAR application . . . . . . . . . 90
4-9 Architectural enhancements explored for flight system . . . . . . . . . . . . . . . 92
5-1 Summary of relevant simulation models . . . . . . . . . . . . . . . . . . . . . . . 119
5-2 Key system parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5-3 Hybrid source model parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5-4 Dataset sizes for each HSI data transaction . . . . . . . . . . . . . . . . . . . . . 129
5-5 Simulation times for various HSI image sizes . . . . . . . . . . . . . . . . . . . . 130
A-1 Computation systems at the HCS Lab at UF . . . . . . . . . . . . . . . . . . . . 137
LIST OF FIGURES
Figure page
3-1 High-level data-flow diagram of FASE framework . . . . . . . . . . . . . . . . . 25
3-2 The FASE application characterization process . . . . . . . . . . . . . . . . . . . 27
3-3 InfiniBand model latency validation . . . . . . . . . . . . . . . . . . . . . . . . . 40
3-4 InfiniBand model throughput validation . . . . . . . . . . . . . . . . . . . . . . 41
3-5 The TCP/IP/Ethernet model latency validation . . . . . . . . . . . . . . . . . . 41
3-6 The TCP/IP/Ethernet model throughput validation . . . . . . . . . . . . . . . 41
3-7 The SCI model latency validation . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3-8 The SCI model throughput validation . . . . . . . . . . . . . . . . . . . . . . . . 42
3-9 Sweep3D algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3-10 Experimental versus simulative execution times for Sweep3D . . . . . . . . . . . 47
3-11 Ratios of simulation to experimental wall-clock completion time for varying
system and dataset sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3-12 Execution times for Sweep3D running on various system configurations . . . . . 50
3-13 Maximum speedups for Sweep3D running on various network configurations . . 51
3-14 Speedups for Sweep3D running on 8192-node InfiniBand system . . . . . . . . . 53
4-1 System hardware architecture of the dependable multiprocessor . . . . . . . . . 60
4-2 System software architecture of the dependable multiprocessor . . . . . . . . . . 61
4-3 Logical diagram and photograph of DM testbed . . . . . . . . . . . . . . . . . . 65
4-4 Markov-reward data node model . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4-5 Markov-reward system model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4-6 The DM node models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4-7 The DM flight system model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4-8 Example fault-enabled system . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4-9 Throughput validations for network and MDS subsystem models . . . . . . . . . 76
4-10 Markov versus simulation DM system performability comparison . . . . . . . . . 78
4-11 Dataflow diagram of parallel 2D FFT . . . . . . . . . . . . . . . . . . . . . . . . 79
4-12 Execution time per image for baseline and enhanced systems . . . . . . . . . . . 80
4-13 Parallel 2D FFT execution times per image for various performance-enhancing
techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4-14 Distributed 2D FFT execution times per image for various performance-enhancing
techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4-15 SAR dataflow with optional checkpoint stages and patched data decomposition . 84
4-16 Amenability results via Markov model for patch-based SAR application . . . . . 86
4-17 System performability percentages and throughputs for patch-based SAR . . . . 88
4-18 System performability and throughput for 8192-element patch-based SAR
executing on various system sizes . . . . . . . . . . . . . . . . . . . . . . 91
4-19 Speedups of architectural enhancements for patch-based SAR . . . . . . . . . . 93
4-20 System performability and throughput of 20-node DM flight system executing
patch-based SAR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5-1 High-level example systems employing hybrid modeling . . . . . . . . . . . . . . 107
5-2 High-level diagram of hybrid simulation approach . . . . . . . . . . . . . . . . . 108
5-3 Function-level training procedure . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5-4 Example three-flow micro-simulation . . . . . . . . . . . . . . . . . . . . . . . . 117
5-5 PingPong accuracy results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5-6 MDSTest accuracy results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5-7 PingPong speedup results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5-8 MDSTest speedup results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5-9 MDSTest accuracy and speedup results using hybrid modeling approach . . . . 127
5-10 The HSI data decomposition and dataflow diagram . . . . . . . . . . . . . . . . 129
5-11 The HSI accuracy and speedup results for two hybrid configurations . . . . . . . 131
A-1 The MLD development environment . . . . . . . . . . . . . . . . . . . . . . . . 138
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
FAST AND ACCURATE SIMULATION ENVIRONMENT (FASE) FOR
HIGH-PERFORMANCE COMPUTING SYSTEMS AND APPLICATIONS
By
Eric M. Grobelny
August 2008
Chair: Alan George
Major: Electrical and Computer Engineering
As computing systems become more complex in terms of their architecture,
interconnect, and heterogeneity, configuring and using these machines optimally
becomes a major challenge. To reduce the penalties caused by poorly configured systems,
simulation is often used to predict the performance of key applications to be executed on
the new systems. Simulation provides the capability to observe component and system
characteristics (e.g., performance and power) in order to make vital design decisions.
However, simulating high-fidelity models can be very time consuming and even prohibitive
when evaluating large-scale systems.
The Fast and Accurate Simulation Environment (FASE) framework seeks to support
large-scale system simulation by using high-fidelity models to capture the behavior of
only the performance-critical components, while employing abstraction techniques to
capture the effects of components that have little impact on system performance. To achieve this
balance of accuracy and simulation speed, FASE provides a methodology and associated
toolset to evaluate numerous architectural options. This approach allows users to make
system design decisions based on quantifiable demands of their key applications rather
than manual analysis, which can be error-prone and impractical for large systems.
The framework accomplishes this evaluation through a novel approach of combining
discrete-event simulation with an application characterization scheme in order to remove
unnecessary details while focusing on components critical to the performance of the
application. In addition, FASE is extended to support in-depth availability analyses
and quick evaluations of data-intensive applications. In this document, we present the
methodology and techniques behind FASE and include several case studies that validate
the constructed systems using various applications and interconnects. The studies show that
FASE produces results with acceptable accuracy (i.e., maximum error of 23.3% and under
6% in most cases) when predicting the performance of complex applications executing on
HPC systems. Furthermore, when using FASE to analyze data-intensive applications, the
framework achieves over 1500× speedup with less than 1% error when compared to the
traditional, function-level modeling approach.
CHAPTER 1
INTRODUCTION
Substantial capital resources are invested annually to expand the computational
capacity and improve the quality of tools scientists have at their disposal to solve
grand-challenge problems in physics, life sciences and other disciplines. Typically each
new large-scale, high-performance computing (HPC) system deployed at a national lab,
industry research center, or other site exemplifies the latest in technology and frequently
outperforms its predecessors as measured by the execution of generic benchmark suites.
While a supercomputer’s raw computational potential can be readily predicted and
demonstrated for these generic benchmarks, an application scientist’s ability to harness the
new system’s potential for their specific application is not guaranteed. Few applications
stress systems in exactly the same manner, especially at large system sizes, and
therefore predicting how best to allocate limited funds to build an optimally configured
supercomputer is challenging. Lacking quantitative data and a clear methodology to
understand and explore the design space, developers typically rely on intuition, or
instead simply use manual analysis to identify the best available option for each system
component, which may often lead to inefficiencies in the system architecture. A more
structured methodology is required to provide developers the means to perform an
accurate cost-benefit analysis to ensure resources are efficiently allocated. The Fast and
Accurate Simulation Environment (FASE) has been developed to address this critical
need.
FASE is a comprehensive methodology and associated toolset for performance
prediction and system-design exploration that encompasses a means to characterize
applications, design and build virtual prototypes of candidate systems, and study the
performance of applications in a quick, accurate, and cost-effective manner. FASE
approaches the grand challenge that is performance prediction of HPC applications on
candidate systems by splitting the problem into two domains: the application and the
simulation domains. Though some interdependencies must exist between the two realms,
this split isolates the work conducted in either domain so that application analysis data
and system models can be reused with little effort when exploring other applications
or system designs. Unlike other performance prediction environments, FASE provides
the unique capability of virtually prototyping candidate systems via a graphical user
interface. This feature not only provides substantial time and cost savings as compared
to developing an experimental prototype, but also captures structural dependencies
(e.g., network contention) within the computational subsystems allowing users to explore
decomposition and load balancing options. Furthermore, virtual prototyping can help
forecast the technological advances in specific components that would be most critical
to improving the performance of select applications. More importantly, cross-over points
in key metrics, such as network latency, can be identified by quantitatively assessing
where to apply Amdahl's law for a particular application and system pair. To
ensure all options are examined, analysis can also include the reuse potential of currently
deployed systems in order to determine if upgrading or otherwise augmenting those
existing systems will provide a better return on investment compared to building an
entirely new system. Another unique feature of FASE is its iterative design and analysis
procedures in which results from one or more initial runs are used to refine the application
characterization process as well as dictate the fidelity of the component models employed
in candidate architectures. Iterations during these stages can result in highly targeted
performance data that drives simulated systems optimized for speed and accuracy. To
accommodate optimized system models, the framework supports a combination of
analytical and simulation model types. This combination allows users to effectively adjust
the focal points of the simulations to the components with the greatest impact on system
performance through the use of the simulation models, while still accounting for the less
influential components by using faster analytical models. As a result, the complexity
of the overall system design is reduced, thus decreasing simulation time. In summary,
FASE allows designers to evaluate the performance of a specific set of applications while
exploring a rich set of system design options available for deployment via an upgradeable,
modular component library. The list below details the main contributions of the FASE
framework.
1. A systematic, iterative methodology that describes the various options available at each step of the design and analysis process and illuminates the implications and issues with each option.

2. The FASE toolset that provides an application characterization tool (supporting MPI-based C and Fortran applications) to collect performance data and a graphical, object-oriented (using C++) simulation environment to virtually prototype and evaluate candidate systems.

3. A pre-built model library that contains a variety of HPC architectural components facilitating rapid prototyping and evaluation of systems with varying degrees of detail at all the key subsystems.
This study consists of three phases of research. The first phase focuses on designing
and developing a robust and comprehensive methodology and toolkit that users can
employ to quickly and accurately predict the performance of interesting HPC systems and
applications. More specifically, the work conducted in the first phase provides a detailed
procedure to characterize an application, design and build component and system models,
and analyze the application’s performance on various system configurations via simulation.
A basic toolkit, consisting of an application analysis tool, a simulator, and pre-built
simulation models of key components and sample systems, facilitates the prediction and
analysis procedure. After creating the foundation of FASE, we perform case studies using
a matrix multiply benchmark and a real scientific application in order to validate the
speed and accuracy of FASE.
The second phase of the study explores the use of the FASE framework to perform
an in-depth performability analysis of space-based systems. This work consists of the
design and development of component and system models to represent the candidate
space system as well as a simulation-based fault injection framework to conduct scalability
and performability analyses of different system configurations and application variations.
The scalability analysis employs the 2D Fast Fourier Transform (2D FFT) as the key
benchmark kernel due to its computational relevance in many space-based applications.
After identifying the strengths and weaknesses of the base architecture, we conduct a
performability study of the system under different environmental conditions. The study
uses the Synthetic Aperture Radar (SAR) application, which incorporates the
2D FFT kernel in order to perform image processing. The study reports insight on
key architectural and algorithmic options that provide performance and availability
enhancements for future space systems.
Phase three expands the foundation of FASE by incorporating hybrid simulation
techniques to address scalability issues when analyzing data-intensive applications. The
research conducted within this phase focuses on the design and development of techniques
that combine the strengths of the function-level and analytical modeling approaches
in order to reduce simulation times in applications that process very large datasets.
The proposed approach is validated using simple benchmark programs executing on an
emerging space-based platform while its capabilities are demonstrated by analyzing a
data-intensive, remote-sensing application called hyperspectral imaging (HSI).
The remainder of this document consists of the technical details of the study. Chapter
2 provides background information on the basic concepts involved in the performance
prediction process through simulation. The chapter also presents brief overviews of
previous research that shares similar methods and goals with FASE. Appendix
A describes the various facilities and tools used to conduct the research. Chapter 3
details the FASE framework while Chapter 4 applies and extends the framework to
perform scalability and performability analyses on a space-based system for a thorough
architectural and algorithmic evaluation. Chapter 5 extends FASE by enhancing pre-built
models with the mechanisms needed to support hybrid simulations for the
analysis of data-intensive applications. Finally, Chapter 6 summarizes the work presented
in this document.
CHAPTER 2
BACKGROUND AND RELATED RESEARCH
Building large-scale systems to determine how an application will perform is simply
not a viable option due to time and cost constraints. Therefore, other methods of
precursory investigation are needed, and several different types of modeling techniques
exist to aid in this process. Analytical [1], [2] and statistical [3], [4] modeling are two such
options and both methods involve a representation of the system using mathematical
equations or Markov models in order to gain insight on how a particular system will
perform based on certain parameters. These models can become very complex, especially
when considering the large number of configuration parameters of large-scale HPC systems,
and can still be inaccurate due to the overall complexity of the systems. In addition,
it is difficult for analytical modeling to address the higher-order transient features of
an application, such as network contention, and over-simplifications are often employed
to make the equations solvable [5]. Computer simulation is an alternative that brings
accuracy and flexibility to this challenge. Real hardware can be modeled at any degree
of fidelity required by the user and dictated by the application, allowing the system to
be tailored to the application and vice versa. A simulation-based approach provides the
user with the flexibility to model important components very accurately while sacrificing
the accuracy of less vital components by modeling them at a lower fidelity. In addition to
these benefits, computer simulation supports the scaling of specific component parameters,
allowing for the modeling of next-generation technologies which may not be currently
available for experimental study. Analysis based on such models can also provide concrete
evidence that may influence the future road maps of system component manufacturers.
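To make the Markov-model option above concrete, consider the simplest possible case, a generic illustration rather than a model taken from this study: a single component alternates between an up state and a down state with constant failure rate \lambda and repair rate \mu. The steady-state balance equation \lambda P_{up} = \mu P_{down}, together with P_{up} + P_{down} = 1, gives the component's availability

    A = P_{up} = \frac{\mu}{\lambda + \mu} = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}},

and the rapid growth in states and balance equations as components and failure modes are added is precisely why such models become unwieldy for large-scale systems.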
Classically, there are two types of computer simulation environments:
execution-driven and trace-driven. Execution-driven simulations often use a program’s
machine code as the input to the simulation models and also have near clock-cycle fidelity,
producing accurate results at the cost of slow simulation speeds [6], [7]. Though very
useful for detailed studies of small-scale systems, execution-driven simulations tend to
become impractical with regards to time when used to simulate large, complex systems.
Trace-driven simulations employ higher-level representations of an application’s execution
to drive the simulations [8], [9]. These representations are generally captured using
an instrumented version of the application under study running on a given hardware
configuration. In essence, the trace-driven simulations require extra time during the
“characterization” stage to thoroughly understand and capture relevant information about
the application. This additional time spent during characterization is typically amortized
during the simulation stage by having the input available for numerous simulated system
configurations. The accuracy of trace-driven simulations not only depends on the fidelity
of the models, but also on the detail of the information obtained corresponding to the
application. A non-traditional simulation type is model-driven. Model-driven simulations
use formal models developed within the simulation environment in order to emulate the
behavior of an application. Essentially, these models produce output data that stimulates
the components of the simulated system as if the real application were running. Though
the development of the models can be very time consuming, depending on the complexity
of the modeled application, once the model is developed it can be used within any system
without any extra work.
In order to perform a trace-driven simulation, a representation of the application
is necessary. Traces and profiles are two types of application representations that can
be used to portray the behavior of a program [10]. Traces gather a chronology of events,
such as computation or communication, as they occur in time, allowing a user to observe what
a program is doing during a specific time period. Because traces are dependent on the
execution time of a program, long-running programs can produce extremely large trace
logs. These large logs can be impractical or even impossible to store depending on how
much detail is recorded by the trace program. Profiles, by contrast, do not record events
in time, but rather tally the events in order to provide useful statistics to the end user.
The overhead incurred from the execution of extra code used to collect the profile or
trace data can be quite similar, but is ultimately dependent on the level of detail profiled
or traced as well as the application under study. While trace generation may impose
penalties associated with the creation of large trace files (depending on the frequency
of disk access), it has the advantage that very little additional processing is needed, since
no intermediate results are calculated. By contrast, profiling typically requires very little file
access, but may demand the frequent calculation of intermediate results for the application
profile. In essence, a trace provides raw data describing the execution of an application
and a profile outputs a processed form of this data. Both profiles and traces can be useful
tools in trace-driven simulation, depending on the type of simulation models used for the
study and the amount of information that is desired. Further discussion on how trace and
profile tools are leveraged within the FASE framework is provided in Chapter 3.
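As a rough illustration of the distinction (the field names here are hypothetical, not drawn from any particular tool), a trace is essentially a time-ordered sequence of records, while a profile keeps only running tallies:

    // One time-stamped record in a chronological trace.
    struct TraceEvent {
        double timestamp;                          // when the event began (s)
        enum Kind { COMPUTE, SEND, RECV } kind;    // what happened
        int    src, dst;                           // ranks involved (comm. only)
        long   bytes;                              // transfer size (comm. only)
        double duration;                           // how long it took (s)
    };

    // A profile discards the chronology and accumulates statistics instead.
    struct Profile {
        long   numSends = 0, numRecvs = 0, bytesSent = 0;
        double totalComputeTime = 0.0;             // tallied, not time-stamped
    };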
Traces and profiles can be collected for both sequential and parallel programs, though
this study focuses on the latter. Various parallel programming models exist in order to
facilitate computation across multi-process, multi-node systems. These models include
shared address, message passing, and data parallel [11]. Due to its proliferation within
the scientific community, this study focuses on the message passing paradigm. More
specifically, we consider the Message Passing Interface (MPI) as the de facto standard for
inter-process communication via message passing [12], [13]. The MPI standard defines a
number of functions that allow application developers to pass information from a source
process to one or more destination processes. The destination processes can be local to
the sending process or can be located on a remote node requiring transactions across some
interconnect. More details on the MPI standard and its implementation can be found in
[12].
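A minimal sketch of the message-passing style that MPI supports (standard MPI calls; the payload value is arbitrary): rank 0 sends one value to rank 1 over the default communicator.

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // which process am I?
        double payload = 3.14;
        if (rank == 0) {
            // Source process: send one double to rank 1 with message tag 0.
            MPI_Send(&payload, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            // Destination process: blocking receive from rank 0.
            MPI_Recv(&payload, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            std::printf("rank 1 received %f\n", payload);
        }
        MPI_Finalize();
        return 0;
    }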
Predicting the performance of an arbitrary application executing on a specific system
architecture has been a long-sought goal for application and system designers alike.
Throughout the years, researchers have designed a number of prediction frameworks that
employ varying techniques and methodologies in order to produce accurate and meaningful
results. The approach taken by each framework allows it to overcome certain obstacles or
focus on specific application or system types while inevitably sacrificing some key feature
such as accuracy, speed, or scalability. The following paragraphs briefly describe existing
prediction environments similar to FASE and the methodologies and general limitations of
each.
The Integrated Simulation Environment (ISE) [14], a precursor to FASE, employs
hardware-in-the-loop (HWIL) simulation in which actual MPI programs are executed
using real hardware coupled with high-fidelity simulation models. The computation time
of each simulated node is calculated via a process spawned on a “process machine,” while
network models calculate the delays for the communication events. Though the ISE shows
reasonable simulation time and accuracy, the scalability of the framework is limited due
to the large number of processes needed to represent large-scale systems. The POEMS
project [15] adheres to an approach similar to HWIL simulation in that systems can
be evaluated based on results from actual hardware and simulation models as well as
performance data collected from actual software and analytical models. The simulation
environment used by POEMS integrates SimpleScalar with the COMPASS simulator [16]
enabling direct execution-driven, parallel simulation of MPI programs. These high-fidelity
simulators can provide very accurate and insightful results though the simulations can
require large amounts of time especially when dealing with very large systems.
While ISE and POEMS use execution-driven simulations, other projects employ
a model-driven approach. CHAOS [17] creates application emulators that reproduce
the computation, communication, and I/O characteristics of the program under study.
The emulators are then used to pass their information to a suite of simulators that, in
turn, determine the performance of the modeled application on the target simulated
system. The PACE [18] framework defines a custom language, called CHIP3S, that is
used to generate a model describing a parallel program’s flow control and workload.
Similarly, Performance Prophet [19] uses the Unified Modeling Language (UML) to model
parallel and distributed applications. Both PACE and Performance Prophet use these
models to drive systems composed of both analytical and simulation models in order to
balance speed and accuracy. Due to the inherent difficulty of automatically creating an
accurate application model, all three projects require intimate knowledge of the programs
they represent. As a result, the source code must be analyzed and profiled to collect the
necessary information to reconstruct the program’s execution.
Work conducted by the Performance Evaluation Research Center (PERC) focuses
on performance prediction of parallel systems under the assumption that an application’s
performance on a single processor is memory bound and the interconnect dictates the
scalability of the program under study [20], [21]. The framework uses a three-step process:
1. Collect memory performance parameters of the considered machine
2. Collect memory access patterns independent of the underlying architecture on which
the application is executed
3. Algebraically convolve the results from steps one and two, then feed the results
to the DIMEMAS [22] network simulator developed at the European Center for
Parallelism of Barcelona
The researchers of PERC have reported accurate performance predictions using a
wide range of applications. However, collecting memory access patterns can be a very
time consuming task that results in large amounts of data for each application considered.
Also, the DIMEMAS network simulator is relatively simplistic and has the potential for
large inaccuracies when analyzing communication-intensive applications with potential
contention issues.
Performance prediction is the main theme of FASE and the frameworks mentioned
above. However, performance analysis is an integral technique used to characterize
the applications under study prior to simulation. The accuracy and detail of this
characterization data can greatly influence the accuracy and speed of the simulation
frameworks that use it. While a great number of performance analysis tools exist
for various purposes [23], SvPablo [24], Paradyn [25], and TAU [26] are the tools
most applicable to the FASE simulation framework. SvPablo may be used to analyze
the performance of MPI and OpenMP parallel programs, and allows for interactive
instrumentation and correlates performance data with the original source code. Paradyn
also is compatible with MPI programs and offers the advantage that no modifications
to the source code are needed due to dynamic instrumentation at the binary level.
Paradyn’s focus is on the explicit identification of performance bottlenecks. TAU is an
MPI performance analysis tool that can provide performance data on a thread-level
basis, and provides the user with a choice of three different instrumentations. Many
performance analysis tools including SvPablo and TAU are based on or support the
popular Performance Application Programming Interface (PAPI), which is also supported
by the Sequoia tool used by FASE. Each of these analysis tools can be incorporated into
the pre-simulation stage of FASE in conjunction with or as a replacement for Sequoia to
provide additional information on an application’s behavior.
FASE combines many of these general techniques and methods in order to provide
a robust, flexible performance prediction framework. In addition, FASE features an
environment that allows users to build virtual representations of candidate architectures.
These virtual prototypes capture structural dependencies such as network congestion
and workload distribution that can greatly impact an application’s performance. Many
of the frameworks described in this section use simplistic communication models that
have difficulties capturing such issues. Furthermore, FASE employs a systematic,
iterative methodology that produces highly modular application characterization
data and component models. The framework illustrates the relationships between the
application performance information and the architectural models such that features
and mechanisms within the models can be identified and altered to improve prediction
accuracy and simulation speed. The other frameworks propose using models of various
fidelities in order to speed up the simulations; however, they do not explicitly describe the
decision-making process of choosing or designing a model’s fidelity. Finally, FASE provides
a fully extensible pre-built model library with components ranging from the application
layer down to the hardware layer. Unlike many of the frameworks described above, FASE
includes a number of detailed middleware models that can have significant impacts on
the performance of the overall system. Many of the pre-built models are highly tunable,
thus allowing a single, generic model to represent many different implementations based
on the parameter values set by the user. The combination of these features makes FASE a
powerful, flexible environment for rapidly evaluating applications executing on a variety of
candidate systems.
CHAPTER 3
FAST AND ACCURATE SIMULATION ENVIRONMENT (PHASE 1)
The Fast and Accurate Simulation Environment (FASE) is a framework designed
to facilitate system design according to the needs of one or more key applications.
FASE provides a methodology and corresponding toolkit to evaluate the performance
of virtually-prototyped systems to determine, in a timely and cost-effective manner, the
ideal system configuration for a specific set of applications. In order to promote quick
and modular predictions, the FASE framework is broken into two primary domains:
the application domain and the simulation domain. In the application domain, FASE employs
various tools and techniques to characterize the behavior of an application in order
to create accurate representations of its execution. The information gathered is used
to not only identify and understand the characteristics of the various phases inherent
to the application, but also to generate the stimulus data to drive the simulation
models. The characterization data can be collected using one or more tools depending
on the application, the capabilities of the employed tool(s), and the simulation
models used during simulation. Once the data is collected, it can be used in numerous
simulations without any modifications thus facilitating the exploration of various system
configurations. More details on the application domain are provided in Section 3.1.
The simulation domain incorporates the design, development, and analysis of the
virtually-prototyped systems to be studied. In this domain, component models are
designed and validated in order to create systems that incorporate new or emerging
technologies. To ease system development, FASE provides a library of pre-constructed
models tailored to accommodate the design of HPC environments. Once a system has
been constructed, characterization data from any number of applications can be used
as stimulus to the simulation, thus allowing rapid analyses of the system under varying
workloads. More information on the various aspects of the simulation domain is provided in
Section 3.2.
Figure 3-1 illustrates a high-level representation of the process associated with the
FASE framework. The dark gray blocks represent steps in the application domain while
the simulation steps are denoted by white blocks. Notice that a user can work in both
domains concurrently. Also note that the framework incorporates multiple feedback
paths that allow the user to follow an iterative process by which insight is gained through
application characterization and simulation, and used to refine the models and application
analysis data employed for future iterations. Section 3.2.3 contains further details on how
an iterative methodology may be employed in FASE.
Figure 3-1. High-level data-flow diagram of FASE framework
3.1 Application Domain
The application domain is a critical part of the FASE framework. In this domain,
important information is gathered that provides insight on the behaviors of an application
during its execution. The main goal within the application realm is to gather enough
information about an application so that systems in the simulation environment are
stimulated as if they were really running the code. As such, this domain is decomposed
into two main stages: 1) application characterization and 2) stimulus development. The
application characterization stage employs analysis tools to collect pertinent performance
data that illustrates how the application exercises the computational subsystems. The
data that can be collected includes communication information, computation information,
memory accesses, and disk I/O. This data can then be used directly or processed and
analyzed during the stimulus development stage. In the stimulus development stage, raw
data gathered during characterization is used to provide valid input to the simulation
models such that the simulated system’s components are exercised as if the real program
were executing on it. More details on both stages, as well as the various options available in
each, are provided in the sections that follow.
3.1.1 Application Characterization
Application characterization is a vital step in FASE that enables accurate predictions
on the performance of an application executing on a target system. The goal of
characterization is to identify and track the performance-critical attributes that dictate an
application’s performance based on both application and target-system parameters. FASE
provides a framework in which users can analyze their applications using existing systems
(also known as instrumentation platforms) in order to prepare for simulation. The basic
methodology by which the user analyzes the application is shown in Figure 3-2. The tools
employed in each iteration initially depend on the user’s experience with the application
and can then change based on the results from the previous iteration of analysis. The
selected tools should be capable of capturing the inherent qualities of the application while
minimizing the collection of information resulting from dependencies on the underlying
architecture of the instrumentation platform. Perturbation (i.e., the additional overhead
imposed on the system due to instrumentation) should also be considered to ensure data
accuracy. Though characterization is part of the application domain, the simulation
models should also be considered during tool selection. For example, if the processor
model to be used in simulation supports only instruction-based information, then the
analysis tool(s) selected should provide at least that information for that particular model.
Multiple tools can be used based on the details they provide, but output data must be
converted to a common format in order to drive the simulation models.
Figure 3-2. The FASE application characterization process
FASE incorporates a feedback loop (Loop 1 in Figure 3-1) at the characterization
stage such that multiple characterizations may be performed to first understand the
main bottleneck of the program (e.g., processor, network, memory, disk I/O, etc.),
and then focus on the collection of information that characterizes the main bottleneck
while abstracting the components of lesser impact. This data can then be fed into the
simulation environment and analyzed until the system exposes a different component as
the bottleneck. If desired, the characterization data and models can then be switched
or adjusted to incorporate the appropriate information and components to capture
the nature of the new bottleneck. In fact, the performance data can provide all the
necessary information for applications with an arbitrary bottleneck, while the simulation
models incorporate abstraction by only using the data that corresponds to capabilities
supported in their designs. Further details on the simulation phase and how application
characterization influences design decisions are described in the next section.
The initial deployment of FASE employs a single analysis tool called the Sequoia
Toolkit developed at Oak Ridge National Laboratory. Sequoia is a trace-based tool
that supports the analysis of C and FORTRAN applications using the MPI parallel
programming model, the current de facto standard for large-scale message-passing
platforms [28]. Instrumentation is conducted during link-time by using the profiling
MPI (PMPI) wrapper functions. PMPI is defined in the MPI standard and provides an
easy interface for profiling tools to analyze MPI programs [12]. Therefore, a Sequoia user
must only rebuild his code by linking the Sequoia library, simplifying the data collection
process. Though not required, Sequoia also supports additional functions that can be
manually inserted into the application to start or stop data collection as well as denote
various phases within the code to facilitate analysis.
The Sequoia Toolkit explicitly supports the logging of communication and
computation events. A communication event in Sequoia is defined as any MPI function
encountered during the execution of the code. The tool collects relevant information such
as source, destination, and transfer size for all key MPI communication functions and also
logs important non-communication functions (e.g., MPI Topology, MPI Comm create).
Collecting communication events at the MPI level inherently isolates the characterization
data from the underlying network of the instrumentation platform thus allowing the
data to be used on a variety of simulated systems that employ different interconnect
technologies. Network topology dependencies are also removed during characterization
since network transfers between machines are captured as high-level semantics representing
process-to-process communications rather than architecturally-dependent characteristics
such as latency and bandwidth.
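The PMPI mechanism makes this kind of logging straightforward. The sketch below is illustrative only (Sequoia's actual implementation and log format are not shown here): it intercepts MPI_Send at link time, records the source, destination, and transfer size, and forwards the call to the real implementation through the PMPI_ entry point defined by the MPI standard.

    #include <mpi.h>
    #include <cstdio>

    // Overrides the standard MPI_Send; the linker resolves application
    // calls here instead of the MPI library's version.
    extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype type,
                            int dest, int tag, MPI_Comm comm) {
        double t0 = MPI_Wtime();
        int rc = PMPI_Send(buf, count, type, dest, tag, comm);  // real send
        double t1 = MPI_Wtime();

        int src, typeSize;
        PMPI_Comm_rank(comm, &src);
        PMPI_Type_size(type, &typeSize);
        // Log the high-level semantics of the transfer, independent of the
        // underlying network, as described above (format is illustrative).
        std::fprintf(stderr, "SEND src=%d dst=%d bytes=%ld dt=%g\n",
                     src, dest, (long)count * typeSize, t1 - t0);
        return rc;
    }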
Computation events occur between communication events. Sequoia supports two
mechanisms that measure computation statistics during an application’s execution: timing
functions and the Performance Application Programming Interface (PAPI). PAPI is
an API from the University of Tennessee that provides access to hardware counters on
a variety of platforms that can be used to track a wide range of low-level performance
statistics such as number of instructions issued, L1 cache misses, and data TLB misses
[29]. Every logged event, both computation and communication, includes both
wall-clock and CPU-time measurements. With PAPI enabled, computation events include
additional performance information on clock cycles, instruction count, and the number of
loads, stores, and floating point operations executed. Sequoia does not explicitly support
the collection of I/O data. However, rough estimates can be calculated by comparing
wall-clock and CPU times for each computation event.
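A brief sketch of what such PAPI-based measurement looks like (generic PAPI usage, not Sequoia's code; event availability varies by platform), including the wall-clock minus CPU-time subtraction that yields the rough I/O estimate mentioned above:

    #include <papi.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            std::exit(1);                         // counters unavailable
        int evset = PAPI_NULL;
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_TOT_INS);      // instructions completed
        PAPI_add_event(evset, PAPI_FP_OPS);       // floating-point operations

        long long wall0 = PAPI_get_real_usec(), cpu0 = PAPI_get_virt_usec();
        PAPI_start(evset);

        volatile double x = 0.0;                  // stand-in computation event
        for (int i = 0; i < 1000000; ++i) x += 0.5;

        long long counts[2];
        PAPI_stop(evset, counts);
        long long wall = PAPI_get_real_usec() - wall0;
        long long cpu  = PAPI_get_virt_usec() - cpu0;
        // Wall-clock time not accounted for by the CPU is a rough proxy
        // for time spent blocked on I/O.
        std::printf("ins=%lld fp=%lld wall=%lldus cpu=%lldus io~=%lldus\n",
                    counts[0], counts[1], wall, cpu, wall - cpu);
        return 0;
    }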
The characterization stage does suffer from one inherent problem that must be
addressed in order to provide an environment to predict the performance of large-scale
systems. This issue arises due to the restrictions placed on the user by the physical
testbed used to collect characterization data. For example, to collect accurate information
about an application running on a 1024-node system, a 1024-node testbed must be
available to the user. Of course, not all users have access to systems with thousands of nodes
for evaluation purposes, though they may be interested in observing the execution of
their applications on these larger systems. In order to overcome this limitation, FASE
incorporates two techniques – the process method and the extrapolation method. The
process method allows each physical node in the instrumentation platform to run multiple
processes and thus gather characterization information for multiple simulated nodes.
The downside to this approach is resource contention at shared components that can
lead to inaccurate representations of the application’s execution. Also, a single node can
only support a limited number of processes. This limitation is encountered when OS
or middleware restrictions are met or when the node becomes so bogged down that the
application cannot finish within a reasonable amount of time. Initial tests show that
the process method produces communication events identical to those found using the
traditional approach. However, computation events can suffer from large inaccuracies due
to memory and cache contention issues especially in tests using many processes per node
with larger datasets. The experiments conducted in Section 3.3 employ traces collected
using the traditional approach; however, research is currently underway to remedy the
inaccuracies of the process method in order to facilitate large-scale system evaluation.
The extrapolation method observes trends in the application’s behavior while
changing system size and dataset size, and then formulates a rough model of the
application based on the findings. The model describes the communication, computation,
and other behaviors of the application using a high-level language. The language can then
be read by an extrapolation program to produce traces for an arbitrary system size and
application dataset size. Details on extrapolating communication patterns of large-scale
scientific applications can be found in [30]. This approach supports accurate generation
of traces and does not suffer from the limitations of the process method, though it can
be quite difficult to determine the trends of an application, especially when dealing with
applications that behave dynamically based on measured values. Although many more
issues can arise when using the extrapolation-based approach, this topic is out of the scope
of this research.
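To illustrate the idea only (the pattern and output format below are hypothetical, not the language of [30]): once an application is known to follow, say, a nearest-neighbor ring exchange whose message size scales with the dataset, synthetic trace records can be emitted for any system size N without ever running the program on N nodes.

    #include <cstdio>

    int main() {
        const int  N = 1024;                        // target system size
        const long datasetBytes = 1L << 30;         // target dataset size
        const long msgBytes = datasetBytes / N;     // per-neighbor transfer

        // Emit synthetic records in the same format a measured trace uses.
        for (int iter = 0; iter < 2; ++iter)
            for (int rank = 0; rank < N; ++rank)
                std::printf("SEND src=%d dst=%d bytes=%ld\n",
                            rank, (rank + 1) % N, msgBytes);
        return 0;
    }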
3.1.2 Stimulus Development
After an application has been characterized, the information collected is used to
develop stimulus data used as input to the simulation stage. FASE supports three
methods of providing input to simulation models. These methods include a trace-based
approach, a model-driven approach, and a hybrid approach. The exact method employed
is left to the user though the selection should depend on the type of application under
study, the amount of effort that can be afforded, and the amount of knowledge gathered
on the internals of the application. Details on each method are provided below.
The trace-based approach is the quickest and most automated method available to
the FASE user. The method can use either raw or processed performance data collected
during the characterization stage according to the type of information required by the
simulation models. However, the trace-based approach does place some restrictions on
the user. First, the simulation environment must have a trace reader that is capable of
translating the performance information into data structures native to the simulation
environment. This restriction requires a common format to which all performance data
must conform. Therefore, if multiple tools are employed to gather characterization
data, their outputs must be merged and modified to some common format type that is
supported by the trace reader. The second issue is that trace data must be collected for
each system and dataset size under consideration. As system and dataset sizes increase,
the trace data from complex applications could potentially require extremely large
amounts of storage space, and thus care must be taken to keep trace files manageable.
In the current version of FASE, this limitation can be alleviated by collecting data for
only certain regions of code or a limited number of iterations through the use of specific
instrumentation constructs supported by Sequoia.
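A minimal sketch of the trace-reader role described above (the on-disk record format shown is hypothetical, not Sequoia's): each text record is translated into a native event structure, and unsupported record types are simply skipped, which is one way abstraction enters the models.

    #include <cstdio>
    #include <cstring>
    #include <vector>

    struct SimEvent { char kind[8]; int src, dst; long bytes; double secs; };

    std::vector<SimEvent> readTrace(const char* path) {
        std::vector<SimEvent> events;
        std::FILE* f = std::fopen(path, "r");
        if (!f) return events;
        char line[256];
        while (std::fgets(line, sizeof line, f)) {
            SimEvent e{};
            if (std::sscanf(line, "SEND src=%d dst=%d bytes=%ld",
                            &e.src, &e.dst, &e.bytes) == 3)
                std::strcpy(e.kind, "SEND");
            else if (std::sscanf(line, "COMP secs=%lf", &e.secs) == 1)
                std::strcpy(e.kind, "COMP");
            else
                continue;   // record type not supported by the models
            events.push_back(e);
        }
        std::fclose(f);
        return events;
    }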
The model-driven approach requires much more manual effort by the user than the
trace-based approach. This method uses a formal model of the application’s behavior
based on either a thorough analysis of characterization data collected while varying
system and dataset size or through source code analysis. The developed models have the
capability of reproducing the behaviors of complex, adaptive applications that cannot be
captured using the trace-based approach. In general, this approach begins by identifying
key application parameters that affect its performance. The next step is to ascertain
the parameters having the greatest impact on performance and then determining the
various component models the application will exercise during execution. Once these
steps are complete, the actual model is developed such that it executes the correct
computation, communication, and other events based on the behaviors discovered during
characterization. The actual type of model employed in this approach is not limited by the
FASE framework. Markov chains, stochastic and analytical models, and explicit simulative
models are a few model types that can be used within FASE as long as they can interface
with the simulation environment.
The last approach supported by FASE is the hybrid approach. In this approach,
a mix of trace and model-driven stimulus is used in order to combine the accuracy
and ease-of-use of trace-based simulations with the flexibility and dynamism of the
model-driven approach. In this method, the application is characterized at a very high
level to identify structured and dynamic areas of code. The structured areas are used to
generate trace data while small-scale formal models are employed to represent the dynamic
areas. This mixture of techniques decreases the amount of trace data needed, reduces the
amount of effort required to formulate formal models, and maintains relatively accurate
representations of the application’s behavior.
The initial deployment of FASE uses the trace-based approach as its primary
stimulus. The pre-built FASE model library consists of a Sequoia trace reader that
translates Sequoia data into the necessary data structure in the simulation environment.
Though both model-driven and hybrid approaches are defined within the FASE
framework, this phase of research focuses on simulations conducted using only the
trace-based approach.
3.2 Simulation Domain
The simulation domain consists of three stages: 1) component design, 2) system
development, and 3) system analysis. The first of the three stages involves the creation
of the necessary components used to build the systems under study. This stage can be
particularly time-consuming depending on the complexity of the component as well as
its level of fidelity; however, it is a one-time penalty that must be paid to gain the benefits
of simulation. The initial release of FASE includes several pre-built models of common
components (detailed in Section 3.2.1) to aid users in this process and more will be added
32
in the future. The next step in the simulation domain is the development of the candidate
systems. The process of constructing virtual systems typically requires less time than
component design although construction time normally increases with system size and
complexity. Similar to stage one, the overhead of building a system must be paid only once,
though numerous applications and configurations can then be analyzed using the system.
Finally, the third stage allows us to reap the benefits of the FASE framework and process.
System analysis uses the components and systems constructed in stages one and two,
and the application stimulus data from the application domain, in order to predict the
performance of an application on a configured system. Since many variations of systems
are likely to be analyzed, this stage is assumed to be the most time-sensitive. In the
following subsections, we discuss each of the three simulation stages in more detail.
3.2.1 Component Design
Each component that is of interest to the user must first be designed and developed
in the simulation environment of choice. The resulting models must not only represent the
behavior of the components they portray, but also correspond to the level of detail provided
by the characterization tools and the abstraction level to which the application lends itself. In
most cases, certain parts of a component will be abstracted, while other parts that are
known to affect performance will be modeled in more detail. The decision of where to add
fidelity to components and where to abstract to save development and simulation time
should be based on trends and application attributes discovered during characterization.
The components designed should incorporate a variety of key parameters that dictate
their behavior and performance. An important step in the component design is tweaking
these parameters to accurately portray the target hardware. The actual values supplied
to the models should be based on empirical data collected using similar hardware or on
the predicted performance if the components are future technologies. For example, the
network and middleware models shown in Table 3-1 were validated according to real
hardware in the High-performance Computing and Simulation (HCS) Research Lab at the
University of Florida. The experimental setup and results for these validation tests are
presented in Section 3.3.
The FASE development environment uses a graphical, discrete-event simulation
tool called Mission Level Designer (MLD) from MLDesign Technologies [27]. Its core
components, called primitives, perform basic functions such as arithmetic and data
flow control. The behavior of each primitive is implemented in object-oriented C++ to
promote modularity. More complex functions are commonly realized by connecting
multiple primitives such that data is manipulated as it flows from one primitive to
another. Alternatively, users may write customized primitives
to provide the equivalent functionality. MLD was selected as the simulation tool for
FASE for three main reasons. First, it is a full-featured tool that supports component and
system design as well as the capabilities to simulate the developed systems all through
a GUI. Second, MLD supports various design features that facilitate quick design times
even for very complex systems. Finally, the authors have much experience using the tool
and many models have been created outside of FASE that can be imported with little or
no modifications. Although FASE currently uses MLD as its simulation environment of
choice, it may be adapted to support additional simulation environments in the future.
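For readers unfamiliar with this style of tool, the following sketch approximates the primitive-and-connection programming model with generic discrete-event scaffolding; it is not MLD's actual API, which is proprietary to the tool.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// Generic discrete-event scheduler (illustrative only; not the MLD API).
struct ScheduledAction {
    double time;
    std::function<void()> action;
    bool operator<(const ScheduledAction& o) const { return time > o.time; } // min-heap
};

struct Scheduler {
    double now = 0.0;
    std::priority_queue<ScheduledAction> pending;
    void at(double t, std::function<void()> a) { pending.push({t, std::move(a)}); }
    void run() {
        while (!pending.empty()) {
            ScheduledAction next = pending.top();
            pending.pop();
            now = next.time;
            next.action();
        }
    }
};

// A "primitive" in this sketch is a block whose output port connects to the
// input of the next primitive, so complex behavior emerges from chains of
// simple blocks, as described above for MLD.
struct DelayPrimitive {
    Scheduler& sim;
    double delay_s;
    std::function<void(uint64_t)> output;  // connection to the next primitive
    void input(uint64_t bytes) {
        sim.at(sim.now + delay_s, [this, bytes] { if (output) output(bytes); });
    }
};
```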
A wide range of pre-constructed models populate the initial FASE library in order
to provide a starting point for users. Each model was designed and developed based on
the hardware or software it represents, using technical details provided by corresponding
standards and other literature. The fidelity of each model in the pre-built library
corresponds to the current HPC focus of the initial deployment of FASE as well as the
capabilities of Sequoia. As a result, the library incorporates high-fidelity network and
communication middleware models to capture scalability characteristics while providing
lower-fidelity models for components such as the CPU, memory, and disk. Table 3-1
highlights the more important component-level models currently populating the pre-built
FASE library. It is worth noting that a variety of components not listed in Table
3-1 can be developed using those that are listed. For example, an explicit multicore CPU
model does not currently exist in the FASE library. However, by combining two CPU
models, a shared memory model, and two trace files, one can analyze the performance of
an application running on a multicore machine with little effort.
The network models listed in Table 3-1 share similar characteristics. Each model
receives high-level data structures that define the various parameters required to create
and output one or more network transactions between multiple nodes. Each network
model also has numerous user-definable parameters such as link rate, maximum data size,
and buffer size that dictate the performance of communication events. Furthermore, the
models include a number of parameters that define the capabilities of the subsystems that
supply the network interfaces with the necessary data to be transferred. For example,
the InfiniBand model incorporates the parameters LocalInterconnectLatency and
LocalInterconnectBW to define the latency and bandwidth of the interconnect between host
memory and the InfiniBand host channel adapter (HCA). These parameters are used to
calculate the performance penalties incurred from transferring data from memory to the
HCA. These calculations effectively abstract away the complex behaviors of the underlying
transfer mechanisms while still accounting for their performance impacts. The middleware
models in Table 3-1 provide the performance-critical capabilities of the protocol each
represents. The TCP model is a single, generic model with a variety of parameters that
enable a user to configure it as a specific implementation. Similarly, the MPI model also
incorporates many parameters so that particular implementations can be represented
using a single model. The MPI layer is modeled using two layers such that the general,
high-level functionality of the MPI protocol forms the network-independent layer while the
second layer employs interface models that translate MPI data into network-specific data
structures. This layered approach allows a common interface to be used in all systems
featuring MPI while providing “plug-and-play” capabilities to support MPI transfers over
various interconnects.
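For instance, the staging delay governed by these two parameters might be computed along the following lines; this is a simplified sketch, and the actual model also accounts for segmentation and buffering effects.

```cpp
#include <cstdint>

// Simplified cost function for staging a message from host memory to the
// HCA: one fixed latency plus a bandwidth-limited term. This abstracts the
// underlying transfer mechanism while preserving its performance impact.
// Units: seconds, bytes, and bytes per second.
double localInterconnectDelay(uint64_t bytes,
                              double LocalInterconnectLatency,
                              double LocalInterconnectBW) {
    return LocalInterconnectLatency + static_cast<double>(bytes) / LocalInterconnectBW;
}
```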
Table 3-1. The FASE component library

| Class | Type | Model name | Fidelity | Description |
|---|---|---|---|---|
| Networks | InfiniBand | Host Channel Adapter (HCA) | High | Conducts IB protocol processing on incoming and outgoing packets for IB compute nodes |
| Networks | InfiniBand | Switch | High | Device supporting cut-through and store-and-forward routing using crossbar backplane |
| Networks | InfiniBand | Channel Interface | Medium | Dynamic buffering mechanism |
| Networks | Ethernet | Network Interface Card (NIC) | Medium | Conducts Ethernet protocol processing on incoming and outgoing frames for Ethernet compute nodes |
| Networks | Ethernet | Switch | High | Device supporting cut-through and store-and-forward routing using crossbar or bus backplane |
| Networks | SCI | Link Controller | High | Conducts SCI protocol processing on incoming and outgoing packets for SCI compute nodes |
| Middleware | IP | IP Interface | Low | Handles IP address resolution |
| Middleware | TCP | TCP Connection | High | Provides reliable sockets between two devices |
| Middleware | TCP | TCP Manager | High | Manages TCP connections to ensure that the correct socket receives its corresponding segments |
| Middleware | MPI | MPICH2 | High | Provides MPI interface using TCP as the transport layer |
| Middleware | MPI | MVAPICH | High | Provides MPI interface for InfiniBand |
| Middleware | MPI | MP-MPICH | High | Provides MPI interface for SCI |
| Processors | Generic Processor | Generic Processor | Low | Supports timing information to model computation |
| Operating Systems | Generic OS | Generic OS | Low | Supports some memory management capabilities of the OS |
| Memories | Generic Memory | Generic Memory | Low | Models read and write accesses based on data size |
| Disks | Generic Disk | Generic Disk | Low | Models read and write accesses based on data size |
| Exotics | Reconfigurable Device | Reconfigurable Device | Medium | Models a specialized coprocessor (e.g., FPGA) that computes application kernels |
3.2.2 System Development
After component development, systems must be created to analyze potential
configurations. The systems developed should correspond to the demands of the
applications as discovered via characterization. FASE provides the capability to
not only change the system size and the components in the system, but also tweak
component parameters such as the network’s latency and bandwidth, middleware delays,
and processing capabilities. This feature allows the user to scrutinize the effects of
configuration changes ranging from minor system upgrades to complete system redesign
using exotic hardware. Scalability issues in this stage are dependent on the simulation
environment rather than the application. Timely and efficient development of a massively
parallel system in a given simulation environment can quickly become an issue as system
sizes scale to very high levels, and the creation of systems with thousands of nodes can
become an almost unwieldy task. Since FASE is focused on rapid analysis of arbitrary
systems, it must address this issue. FASE supports the creation of large systems in
several ways. First, the MLD simulation tool provides hierarchical, XML-based designs
such that a single module can encapsulate multiple compute nodes, yet the simulator still
maintains the fidelity of the underlying components and includes their effects in the
analysis. Systems are created using a graphical interface that is automatically translated
into XML code describing system-level details such as the number and type of each
component and how they are interconnected. In addition, MLD supports dynamic
instances: if a model is created according to certain guidelines, a single block can
represent a user-specified number of identical components. Finally, a more advanced
method of large-scale system
creation can use hierarchical simulations (rather than hierarchical models within a single
simulation), where small-scale systems are built and analyzed with the intent of producing
intermediate traces to stimulate a higher-level simulation in which multiple nodes from
each small-scale system then act as a single node in the large-scale system. This method
has the potential to not only speed up development times for these systems, but also to
reduce simulation runtimes. While this technique has not yet been employed, we plan to
examine its potential benefits in future research to improve the scalability of the FASE
simulation environment.
3.2.3 System Analysis
After the application has been thoroughly analyzed and the components and initial
systems have been developed, the user can begin analyzing the performance of various
applications executing on the target systems. The stimulus data from the application
domain is used as input to the simulation models in order to induce network traffic and
processing. One powerful feature of FASE is its ability to carry out multiple simulation
runs using different system configurations, all based on the same set of stimulus data.
Therefore, the additional time spent in application characterization allows the system
analysis to proceed much more quickly.
During simulation, statistics can be gathered on numerous aspects of the system
such as application runtime, network bandwidth, average network latency, and average
processing time. In addition to the “profiling”-type data that is collected, it may also
be desired to collect “traces” from the simulation. These traces differ from the stimulus
traces in that they are more architecture-dependent and less generic, but they provide
at least one common function: giving the user further insight into the performance of
the application under study. The breadth and depth of performance statistics and results
collected during simulation will determine the level of insight available post-simulation,
and the results collected should be tailored towards the needs of the user. However, the
type and fidelity of results collected may also negatively impact simulation runtime, so the
user should be careful not to collect an excessive amount of unnecessary result data. The
simulation environment may also be tailored to output results in a common format, such
that they may be viewed in a performance visualization tool such as Jumpshot [31].
In some cases, results obtained during system analysis can lead to additional insight
in terms of the application bottleneck, requiring re-characterization of the application as
shown in feedback loop 4 in Figure 3-1. In this situation, the steps from the application
domain are repeated, followed by possible additional component and system designs, and
repeated system analysis. Traces collected during simulation may also potentially be used
to drive future simulations in this iterative process to solve an optimization problem and
determine the ideal system configuration.
3.3 Results and Analysis
This section presents results and analysis for experiments conducted using FASE.
In each of the following subsections, we introduce the experimental setup used to collect
stimulus data (i.e., traces) and experimental numbers against which the simulation results
are compared. The first subsection presents the calibration procedures followed to validate
the three main network models in the current library – InfiniBand, TCP over IP over
Ethernet, and the direct-connect network based on the Scalable Coherent Interface (SCI)
protocol [32]. In each case, the appropriate MPI middleware layer is also incorporated on
each interconnect in the modeling environment. Section 3.3.2 presents a simple scalability
study for a matrix multiply benchmark using the aforementioned interconnects. Finally,
Section 3.3.3 showcases the features and capabilities of FASE through a comprehensive
scalability analysis of the Sweep3D application from the ASCI Blue Benchmark suite.
3.3.1 Model Validation
In order to test and validate the FASE construction set, the MLD models were first
calibrated to accurately represent some prevalent systems available in our lab. Validation
of the network and middleware models was conducted using a testbed comprised of 16
dual-processor 1.4 GHz Opteron nodes, each having 1 GB of main memory and running
the 64-bit CentOS Linux variant with kernel version 2.6.9-22. The nodes employed the
Voltaire HCA 400 attached to the Voltaire ISR 9024 switch for 10 Gbps InfiniBand
connectivity, while Gigabit Ethernet was provided using integrated Broadcom BCM5404C
LAN controllers connected via a Force10 S50 switch. The direct-connect network model
was calibrated to represent SCI hardware supplied by Dolphin Inc. A simple PingPong
MPI program that measures low-level network performance was used to calibrate the
models to best represent the 16-node cluster’s performance over a specific interconnect.
Three MPI middleware layers were modeled, including MVAPICH-0.9.5, MPICH2-1.0.5,
and MP-MPICH-1.3.0 for InfiniBand, TCP, and SCI, respectively.
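The calibration benchmark follows the standard PingPong pattern sketched below; the actual program sweeps a range of message sizes rather than the single fixed size shown here.

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

// Minimal MPI PingPong: rank 0 sends to rank 1 and waits for the echo;
// half the average round-trip time approximates the one-way latency.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const int bytes = 1024, reps = 1000;  // placeholder message size and count
    std::vector<char> buf(bytes);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; ++i) {
        if (rank == 0) {
            MPI_Send(buf.data(), bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf.data(), bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;
    if (rank == 0)
        std::printf("one-way latency: %g us\n", dt / (2.0 * reps) * 1e6);
    MPI_Finalize();
    return 0;
}
```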
Figures 3-3 through 3-8 show the experimentally gathered network performance values
of the InfiniBand, TCP/IP/Ethernet, and SCI testbeds compared to those produced by
the simulation models. The performance of each simulated configuration closely matches
that of the testbed; the average error between the experimental tests and simulative
models was 5% for InfiniBand, 3.6% for TCP/IP/Ethernet, and 2.7% for SCI. Throughput
calculated from the PingPong benchmark latencies for message sizes up to 32 MB shows
that the simulative bandwidths closely follow the measured bandwidths, with an average
error roughly equal to that found in the latency experiments. These results show the
component models are highly accurate when compared to the real systems, though dips
in the measured throughput are readily apparent at 4 MB and 256 KB for the InfiniBand
and TCP/IP/Ethernet networks, respectively. These decreases in bandwidth are due to
overheads incurred in the software layers as the software employs different mechanisms
to accommodate larger transfers; the corresponding models abstract these throttling
points with the goal of best matching the overall trend.
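Here and throughout this section, the reported error is presumably the relative deviation of the simulated value from the measured one, i.e.,

\[
\text{error} = \frac{|t_{\mathrm{sim}} - t_{\mathrm{exp}}|}{t_{\mathrm{exp}}} \times 100\%
\]

where t_exp and t_sim denote the experimentally measured and simulated values of the metric under comparison.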
Figure 3-3. InfiniBand model latency validation: (a) small message sizes, (b) large message sizes
Figure 3-4. InfiniBand model throughput validation
Figure 3-5. The TCP/IP/Ethernet model latency validation: (a) small message sizes, (b) large message sizes
Figure 3-6. The TCP/IP/Ethernet model throughput validation
3.3.2 System Validation
After validating the network and middleware models, we proceeded to examine the
accuracy and speed of FASE using a simple benchmark: matrix multiply. The selected
Figure 3-7. The SCI model latency validation: (a) small message sizes, (b) large message sizes
Figure 3-8. The SCI model throughput validation
implementation took a master/worker approach with the master node transmitting a
segment of matrix A and the entire matrix B to the corresponding workers in sequential
order and then receiving the results from each node in the same order. Measurements were
collected using the InfiniBand and Gigabit Ethernet models for system sizes of 2, 4, and 8
nodes and dataset sizes of 500×500, 1000×1000, 1500×1500, and 2000×2000. Each data
element is a 64-bit double-precision, floating-point value. The following paragraph outlines
the procedure taken to conduct the analysis of the matrix multiply using the FASE
methodology and corresponding toolkit. First, the matrix multiply code was instrumented
by linking the Sequoia library and the resulting binary was executed for each combination
of system and dataset size. Trace files for each combination were automatically generated
by the Sequoia instrumentation code during execution and served as the stimulus to the
simulation models. The component models from the FASE library were used to create
six systems, one for each combination of system size (3) and network (2) analyzed. This
step was conducted while collecting characterization data from the matrix multiply. After
the trace files were collected and the systems built, each system was simulated using the
corresponding trace files for each system size. It should be noted that for this particular
experiment, the Sequoia traces were collected using the testbed’s InfiniBand network,
though simulations were run using both InfiniBand and Gigabit Ethernet networks in
order to show the portability of the traces and highlight the flexibility of FASE. Finally,
experimental runtimes were measured for each system and dataset size running on both
networks in order to determine the errors associated with the simulated systems. Table
3-2 presents
the experimental and simulative results for the various networks.
Table 3-2. Experimental versus simulation execution times for matrix multiply

| System size | Data size | InfiniBand Exp. (sec) | InfiniBand Sim. (sec) | InfiniBand error | Gigabit Ethernet Exp. (sec) | Gigabit Ethernet Sim. (sec) | Gigabit Ethernet error |
|---|---|---|---|---|---|---|---|
| 2 | 500 | 3.32 | 3.33 | 0.21% | 3.42 | 3.38 | 1.24% |
| 2 | 1000 | 49.45 | 49.40 | 0.09% | 48.81 | 49.58 | 1.58% |
| 2 | 1500 | 187.40 | 187.03 | 0.20% | 187.50 | 187.43 | 0.04% |
| 2 | 2000 | 459.36 | 458.80 | 0.12% | 460.48 | 459.50 | 0.21% |
| 4 | 500 | 1.07 | 1.06 | 1.48% | 1.16 | 1.12 | 3.12% |
| 4 | 1000 | 16.69 | 16.54 | 0.93% | 16.63 | 16.76 | 0.79% |
| 4 | 1500 | 62.87 | 62.49 | 0.61% | 63.29 | 63.06 | 0.37% |
| 4 | 2000 | 153.82 | 153.19 | 0.41% | 154.43 | 154.02 | 0.26% |
| 8 | 500 | 0.52 | 0.46 | 12.70% | 0.62 | 0.58 | 7.60% |
| 8 | 1000 | 7.23 | 7.09 | 2.66% | 7.71 | 7.50 | 2.69% |
| 8 | 1500 | 27.51 | 26.93 | 2.12% | 28.24 | 27.94 | 1.06% |
| 8 | 2000 | 66.91 | 65.90 | 1.52% | 68.18 | 67.60 | 0.86% |
From Table 3-2, one can see that the simulations closely matched the experimental
execution times of the matrix multiply. The maximum error, 12.7%, occurred at smaller
dataset sizes with larger systems because the resulting shorter runtimes are more strongly
affected by timing deviations from various anomalies such as OS task management,
dynamic middleware techniques, etc. The effects of such anomalies are normally amortized
when analyzing the typical, long-running HPC application.
The simulation times for each run of the matrix multiply were also collected in order
to quantify the slowdown of using simulation versus real hardware thus highlighting the
“fast” portion of FASE. Table 3-3 shows that the ratios of simulation to experimental
wall-clock times are very low and in some cases (e.g., small systems with large data sizes),
the simulation actually completes faster than the hardware (represented by a ratio less
than one). The ratios less than one are directly related to the amount of characterization
information collected as well as the high level of abstraction of computation events in
the simulation models. In the case of the matrix multiply, computation was abstracted
through the use of timing information fed into a low-fidelity processor model, thus
accommodating short simulation times. As system size and problem size scale, more time
is spent triggering high-fidelity network models, thus slowing the simulations. However,
the wall-clock simulation times observed within FASE are orders of magnitude faster than
those of cycle-accurate simulators, where a 1000× or greater slowdown in wall-clock
execution time is common.
3.3.3 Case Study: Sweep3D
Now that the system models have been validated using the matrix multiply
benchmark, they can be used to predict the performance of any application executing
on them, provided the proper characterization and stimulus development steps have been
conducted. In order to display the full capabilities and features of FASE, a more
complex application was selected. The Sweep3D algorithm forms the foundation of a
real Accelerated Strategic Computing Initiative (ASCI) application and solves a 1-group,
time-independent, discrete ordinate 3D Cartesian geometry neutron transport problem
[33], [34]. As shown in Figure 3-9, each iteration of the algorithm involves two main steps.
The first step solves the streaming operator by “sweeping” each angle of the Cartesian
Table 3-3. Ratio of simulation to experimental wall-clock execution time

| System size | Data size | InfiniBand Exp. (sec) | InfiniBand Sim. (sec) | InfiniBand ratio | Gigabit Ethernet Exp. (sec) | Gigabit Ethernet Sim. (sec) | Gigabit Ethernet ratio |
|---|---|---|---|---|---|---|---|
| 2 | 500 | 3.32 | 5.58 | 1.68 | 3.42 | 6.21 | 1.82 |
| 2 | 1000 | 49.45 | 19.42 | 0.39 | 48.81 | 24.40 | 0.50 |
| 2 | 1500 | 187.40 | 42.71 | 0.23 | 187.50 | 54.60 | 0.29 |
| 2 | 2000 | 459.36 | 75.10 | 0.16 | 460.48 | 96.40 | 0.21 |
| 4 | 500 | 1.07 | 9.62 | 8.96 | 1.16 | 11.20 | 9.64 |
| 4 | 1000 | 16.69 | 33.44 | 2.00 | 16.63 | 44.10 | 2.65 |
| 4 | 1500 | 62.87 | 73.28 | 1.17 | 63.29 | 99.10 | 1.57 |
| 4 | 2000 | 153.82 | 131.05 | 0.85 | 154.43 | 174.00 | 1.13 |
| 8 | 500 | 0.52 | 17.92 | 34.22 | 0.62 | 22.60 | 36.17 |
| 8 | 1000 | 7.29 | 60.25 | 8.27 | 7.71 | 88.70 | 11.52 |
| 8 | 1500 | 27.51 | 129.06 | 4.69 | 28.24 | 195.00 | 6.92 |
| 8 | 2000 | 66.91 | 230.99 | 3.45 | 68.18 | 357.00 | 5.24 |
geometry using blocking, point-to-point communication functions while the second step
uses an iterative process to solve the scattering operator employing collective functions.
There are numerous input parameters that may be set, including the number of processing
elements in the X and Y dimensions of the logical system as well as the number of grid
points (i.e., double-precision, floating-point values) assigned to the XYZ dimensions of
the data cube. The default dataset sizes supplied with Sweep3D are 50×50×50 and
150×150×150, though the experiments in this section also explore an intermediate
dataset, 100×100×100. In this section, we present a set of experiments designed to
quantify the accuracy and speed of the FASE simulation environment with respect to the
Sweep3D application. The first experiment illustrates the accuracy of FASE by comparing
experimentally measured execution times of Sweep3D versus the times produced by
corresponding simulated systems. Experiment 2 analyzes the speed of simulations
conducted using FASE and demonstrates how the speed scales with system size. The
section concludes with a final experiment that showcases the full potential of the FASE
framework by providing a detailed simulative analysis of the Sweep3D application running
on systems with various sizes, interconnect technologies, topologies, and middleware
attributes.
Figure 3-9. Sweep3D algorithm
3.3.3.1 Experiment 1: Accuracy
The first set of experiments performed using Sweep3D was very similar to the experiments
from the last section. The main difference, however, is that the system under study for these
experiments leverages the abilities of FASE to simulate heterogeneous components. This
system is composed of a heterogeneous, 64-node Linux cluster that features four types of
computational nodes as listed in Table 3-4.
Table 3-4. Compute node specifications for each cluster in heterogeneous system

| Cluster | Node count | Processor | Memory | OS | Kernel |
|---|---|---|---|---|---|
| Cluster 1 | 10 | 1.4 GHz Opteron | 1 GB DDR333 | CentOS | 2.6.9-22 |
| Cluster 2 | 14 | 3.2 GHz Xeon with EM64T | 2 GB DDR333 | CentOS | 2.6.9-22 |
| Cluster 3 | 30 | 2.0 GHz Opteron | 1 GB DDR400 | CentOS | 2.6.9-22 |
| Cluster 4 | 10 | 2.4 GHz Xeon | 1 GB DDR266 | Redhat 9 | 2.4.20-8 |
All measurements and Sequoia traces were gathered over the Gigabit Ethernet
interconnect. Figure 3-10 shows a comparison of the execution times from the physical
and simulated systems while increasing the system and dataset sizes. Table 3-5 displays
Figure 3-10. Experimental versus simulative execution times for Sweep3D
the errors between experimental and simulative execution times. In this experiment, we
observed slightly higher error rates than the matrix multiply benchmark and network
validation tests, but this trend is to be expected considering the increased complexity
of the Sweep3D application and heterogeneous system used for the study. In all but five
cases, error rates were below 10%, with many cases showing errors around 1%. In the
cases with 10% error or greater, we can largely attribute the higher values to extraneous
data traffic and spurious OS activity, among other effects of non-dedicated resources.
Once again, the increased error rates occurred in cases where either dataset sizes were
small or system sizes were relatively large (or both). The maximum error observed is
23.28%, which is within an acceptable threshold for predicting the performance of
simulated systems, since real-world implementations of the hardware and software devices
will have a great effect on the actual performance of the final system.
3.3.3.2 Experiment 2: Speed
For each experiment, the testbed execution time of the application was compared
to the simulation time of the virtually-prototyped system. As seen in Figure 3-11, the
simulation time increases with both dataset size and system size. Increases in either
characteristic raise the amount of network traffic, thus causing more computation time to
Table 3-5. Experimental versus simulation errors for Sweep3D

| System size | Dataset 50×50×50 | Dataset 100×100×100 | Dataset 150×150×150 |
|---|---|---|---|
| 2 | 0.69% | 0.13% | 0.21% |
| 4 | 0.90% | 5.38% | 9.76% |
| 8 | 0.60% | 0.25% | 2.84% |
| 16 | 8.36% | 10.11% | 15.31% |
| 32 | 5.95% | 23.28% | 1.03% |
| 64 | 14.66% | 17.36% | 1.02% |
be spent processing interactions between the higher-fidelity models. In fact, the longest
simulation time, one hour, occurred for the 64-node system using the 150×150×150
dataset.
Figure 3-11. Ratios of simulation to experimental wall-clock completion time for varying system and dataset sizes
While a one-hour simulation time is within an acceptable tolerance, if we extrapolate
the timing results to even larger system and dataset sizes, we find that a system of
1024 nodes computing a 250×250×250 grid-point dataset will take approximately 70
hours. In order to cut the time when simulating the computation of very large datasets
on large-scale systems, the stimulus development and simulation techniques previously
described in this chapter can be employed. In this case, knowledge of the application’s
execution characteristics can help to speed up simulations. The Sweep3D application
performs twelve iterations of its core code, with each compute node having identical
communication blocks and very similar computation blocks per iteration. This similarity
across iterations allows us to use the performance data collected during a single iteration
to extrapolate the total time taken to execute all twelve iterations. This process can lead
to decreased simulation times with little effect on simulation accuracy. In fact, removing
all but a single iteration from the Sweep3D traces resulted in a simulation speedup of 9.6×
(a total of 7 minutes rather than 65 minutes) while impacting the accuracy of the model by
less than ±1%.
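The extrapolation itself is simple. Using hypothetical symbols, if T_1 denotes the simulated time of one core-code iteration and T_fixed the simulated time spent outside the iterated region, the full run is estimated as

\[
T_{\mathrm{total}} \approx T_{\mathrm{fixed}} + 12\,T_{1}
\]

which is why replaying a single iteration suffices once the per-iteration behavior is known to repeat.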
3.3.3.3 Experiment 3: Virtual system prototyping
Now that we have a fast and accurate baseline system for the Sweep3D application,
we can explore the effects of changing the system configuration. This experiment explores
the performance impact of increasing the number of nodes in the system by scaling the
processing power of each node. This scaling effectively extrapolates the performance
of the baseline 64-node system to represent the performance of Sweep3D executing on
systems up to 8192 nodes. The results provided in this experiment present the best-case
values of each system size since network issues such as switch congestion and multi-hop
transfers that arise from adding additional nodes to a system are not considered. Though
the actual network pressures of these systems are not fully represented, the results
do provide an upper bound performance of the Sweep3D application running on the
corresponding system. This upper bound can be used to quickly identify whether or not a
particular system is suitable to run a particular application thus facilitating the evaluation
of numerous configurations with the intent of pinpointing a small subset the original
candidate systems to simulate in more detail. Future work will incorporate the stimulus
development techniques discussed in Section 3.1.2 so that network contention is considered
to improve the accuracy of the performance predictions.
The various networks that we examine include standard Gigabit Ethernet (GigE), an
enhanced version of GigE, 10 GigE, InfiniBand, a 2D 8×8 direct-connect SCI network,
Figure 3-12. Execution times for Sweep3D running on various system configurations
and a 3D 4×4×4 SCI network. Each configuration included in this study attempts to shed
light on how changes in system size, network bandwidth, network latency, middleware
characteristics, and topology affect the overall performance of Sweep3D, each of which
is easily configurable via the FASE framework. The results from this experiment are
displayed in Figure 3-12, with the 64-node Gigabit Ethernet system used as the baseline.
Figure 3-12 shows the Gigabit Ethernet system experienced nearly linear speedups
until the system size reached 1024 nodes. At this point, the communication of Sweep3D
begins to dominate execution time. In general, the trends displayed in Figure 3-12 were
surprising, given that an initial timing analysis of the algorithm showed that half the
execution time was spent in communication blocks. Upon further analysis, we determined
that these trends are due to the “Late Sender” problem, where processes post their
MPI_Recvs before the matching MPI_Send is executed by the corresponding process,
causing the receiving process to become idle. In the case with 8192 nodes, the application
becomes network-bound, even with the Late Sender problem. Therefore, we must change
the focus to other network technologies to alleviate the communication bottleneck.
Figure 3-13. Maximum speedups for Sweep3D running on various network configurations
The first attempt to remedy the communication bottleneck for larger system sizes
employed optimized versions of TCP and MPI. Specifically, we increased key parameters
such as the TCP window size and the maximum transmission unit (effectively enabling the
use of jumbo frames) as well as reduced MPI overhead. This case, labeled Enhanced
GigE, showed little improvement in total execution time, leading us to conclude that the
bandwidth and latency of the network dictate the performance of communication events
rather than the middleware. The next tests conducted replaced the Gigabit Ethernet
interconnect with high-performance networks such as 10 GigE, InfiniBand and SCI. Figure
3-12 shows that in smaller systems, the interconnect has little effect on the performance
of Sweep3D. However, it becomes apparent that beyond 1024 nodes a faster interconnect
provides better speedups. For example, the 10 GigE and InfiniBand interconnects offer
speedups of 4.1× and 5.31×, respectively, over the baseline GigE system. Figure 3-13
illustrates the maximum speedups for the various high-performance networks that were
tested.
Not only do these tests analyze the effects of adding bandwidth and lowering latency,
but the SCI cases demonstrate the power of virtual system prototyping by exploring
the impact of mapping the Sweep3D algorithm to drastically different topologies,
specifically a 2D and 3D torus. At first glance, the Sweep3D algorithm seems to map
well to direct-connect network topologies based on its nearest neighbor communication
pattern; however, from Figure 3-13 it appears that this is not exactly the case. Further
analysis of the application and underlying architecture shows that three fundamental
characteristics of the SCI network hinder further performance improvements. First, the
SCI protocol uses small payload sizes of 128 bytes, which cannot effectively amortize
communication overhead. The second and third characteristics are the one-way packet
flow of a dimensional ring and the packet-forwarding or dimension-switching delays
that occur at each intermediate node while routing packets to the destination. Under
certain circumstances, as dictated by the Sweep3D algorithm, few packets experience
more than one forwarding delay because the target node is the next neighboring node
in the dimensional ring’s packet flow. This case provides an optimal mapping between the
algorithm and the network architecture, resulting in excellent performance. However,
this scenario is only one of four communication flows used by Sweep3D. The remaining
communications result in numerous transactions that must travel almost the entire
length of a dimensional ring to reach their destinations, resulting in many multi-hop
transfers, higher latencies, and lower overall speedups. These negative effects could easily
be lost when using low-fidelity, analytical network models due to their inability to capture
structural characteristics that can greatly impact overall system performance.
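The hop-count asymmetry behind this behavior is easy to see. The sketch below, which is illustrative and not the FASE SCI model itself, counts forwarding delays on a unidirectional N-node dimensional ring:

```cpp
// On a unidirectional N-node ring, a packet from src to dst incurs
// ((dst - src) mod N) forwarding/dimension-switching hops.
int ringHops(int src, int dst, int N) {
    return ((dst - src) % N + N) % N;
}

// Example: on an 8-node ring, ringHops(0, 1, 8) == 1 (the ideal mapping),
// while ringHops(1, 0, 8) == 7 (the packet travels almost the entire ring).
```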
Not only can FASE be used to analyze the effects of interchanging network
technologies, but it can also provide a highly detailed analysis of a specific technology. We
chose the InfiniBand network for the in-depth study since it showed the best performance
of the six configurations. The InfiniBand model, as well as the MPI middleware model,
has numerous user-definable parameters that can be changed to correspond to the current
and future versions of the technologies. These parameters provide a mechanism to
perform fine-grained analyses to squeeze as much performance as possible from a specific
technology while also providing valuable insight on any new bottlenecks that may arise
Figure 3-14. Speedups for Sweep3D running on 8192-node InfiniBand system
during future upgrades. From the results in Figure 3-14, one can see that even at 8192
nodes, network enhancements have little effect on the InfiniBand system’s performance.
The greatest speedup from changes in the communication layer comes from middleware
enhancements that achieve a 1.21× performance boost over the baseline. These results
indicate that further improvements of the InfiniBand system’s computation capabilities
are needed in order to show any significant speedups when tweaking the middleware and
network layers. The enhancement that increases network and processing performance
as well as decreases the overhead associated with the middleware (see Figure 3-14,
Middleware+Network+Processor enhancement) reinforces this claim by providing a 2.44×
speedup. This study provides but a brief summary of the modifications that can be made
to test the effectiveness of key technologies when analyzing an application using FASE.
3.4 Conclusions
The task of designing powerful yet cost-efficient HPC systems is becoming extremely
daunting due to not only the increasing complexity of individual computation and
I/O components but also the effective mapping of grand-challenge applications to the
underlying architecture. In this first phase of research, we presented a framework called
the Fast and Accurate Simulation Environment (FASE) that aids system engineers in
overcoming the challenges of designing systems that target specific applications through
the use of analysis tools in conjunction with discrete-event simulation. Within the
application domain, FASE provides a methodology to analyze and extract an application’s
performance-critical data which is then used to discover trends and limitations as well
as provide stimuli for simulation models for virtual prototyping. We also provided
background on various options for performance prediction of HPC systems through
modeling and simulation, and outlined the need for a solution that can provide fast
simulation times with accurate results. The FASE framework was then outlined and its
different components and features were described.
To showcase the capabilities of FASE, we gathered a variety of results showing the
performance of various system configurations. We first provided validation results for our
InfiniBand, TCP/IP over Ethernet, and SCI models, showing that the network models
that serve as the backbone of the case studies in this paper have been carefully tuned
to accurately match real hardware. We then showed the results of a matrix multiply
case study where we compared experimental to simulated execution times for a parallel
MPI-based matrix multiply benchmark. In most cases, errors in the models were less than
1%, with the maximum error of 12.7% occurring in a case with a small dataset size and
a large system size. These conditions result in short experimental runtimes of less than
one second where transient effects such as OS task management and page faults can cause
unpredictable deviations in application execution time. In terms of simulation speed, the
slowdowns observed by simulating the parallel matrix multiply were very low, and in some
cases, the simulation actually completed before the actual system finished executing the
code.
The final case study presented a series of experiments using the Sweep3D benchmark,
which is the main kernel of a real Accelerated Strategic Computing Initiative application.
We performed simulation and hardware experiments over a range of dataset sizes using
a Gigabit Ethernet-based system, and again found errors to be very low in most cases.
A maximum error of 23.28% was observed, which is considered acceptable when dealing
with predicting the performance of a complex application running on an HPC system.
Again, the cases with high errors correspond to non-typical HPC scenarios with larger
systems working on small datasets, resulting in conditions that amplify deviations in
experimentally measured runtimes. We also proposed and employed tactics to speed up
simulation by 10× while sacrificing less than 1% accuracy. With a fast and accurate
baseline system established for Sweep3D, we proceeded to use the FASE methodology
to predict the performance of the application for systems with various sizes and network
technologies. We found that the application was actually more processor-bound than
initially anticipated due to the MPI “Late Sender” problem, where a process posts
an MPI_Recv before the corresponding MPI_Send is executed, causing the receiving
process to become idle. However, as the systems increased in size, Sweep3D did become
network-bound, with 10 Gigabit Ethernet and InfiniBand both providing significant
performance improvements over the Gigabit Ethernet baseline mainly due to their
increased bandwidth. The analysis of Sweep3D concluded with an in-depth look at
its execution on an InfiniBand system while varying fine-grained parameters such as
network bandwidth, packet size, and middleware overhead. These modifications provided
minimal performance improvements because the algorithm’s bottleneck changed from
communication to computation when network backpressure was reduced by choosing an
improved network technology (i.e., InfiniBand).
The work conducted during this phase of research produced a flexible and
comprehensive framework for performance modeling and prediction. This framework
provides a generalized methodology for application characterization, design and
development of component and system models, and analysis of applications running on the
virtual systems under consideration. The work also produced a set of tools and a model
library to facilitate performance prediction. The case studies validated the usefulness
of FASE by displaying both fast and accurate results when comparing the observed
experimental and simulative values. The studies also illustrated the capabilities of FASE
for analyzing the effects of architectural variations in order to improve the scalability of
applications. The contributions and accomplishments of this work have been compiled into
a manuscript and published in Simulation: Transactions of The Society for Modeling and
Simulation International [35].
CHAPTER 4
PERFORMANCE AND AVAILABILITY PREDICTIONS OF VIRTUALLY PROTOTYPED SYSTEMS FOR SPACE-BASED APPLICATIONS (PHASE 2)
This chapter presents two detailed case studies of the FASE framework applied to
analyze the effects of various configuration and algorithmic changes to a space-based
system. The first case study looks at performance and scalability issues of the NASA
Dependable Multiprocessor (DM) executing a key application kernel, the Fast Fourier
Transform. The second evaluates the performance and availability of the Synthetic
Aperture Radar (SAR) application running on the DM system in a faulty environment
such as space. The following sections provide the details and results of these case studies
as well as a novel analysis approach to accurately predict system performability. The first
section presents background information on the motivations and initial design of the space
system. The next section supplies details on the approach taken to design and develop
the necessary models to virtually explore the performance, scalability, and availability
trade-offs of the candidate system. It also describes the unique analysis approach used to
study SAR executing on the DM system. Finally, the experiments and results from both
case studies are presented followed by the conclusions drawn from the analyses.
4.1 Background
4.1.1 Project Overview
NASA and other space agencies have had a long and relatively productive history
of space exploration as exemplified by recent rover missions to Mars. Traditionally,
space exploration missions have essentially been remote-control platforms with all major
decisions made by operators located in control centers on Earth. The onboard computers
in these remote systems have contained minimal functionality, partially in order to satisfy
design size and power constraints, but also to reduce complexity and therefore minimize
the cost of developing components that can endure the harsh environment of space. Hence,
these traditional space computers have been capable of doing little more than executing
small sets of real-time spacecraft control procedures, with little or no processing features
remaining for instrument data processing. This approach has proven to be an effective
means of meeting tight budget constraints because most missions to date have generated
a manageable volume of data that can be compressed and post-processed by ground
stations.
However, as outlined in NASA’s latest strategic plan and other sources, the demand
for onboard processing is predicted to increase substantially due to several factors [36].
As the capabilities of instruments on exploration platforms increase in terms of the
number, type and quality of images produced in a given time period, additional processing
capability will be required to cope with limited downlink bandwidth and line-of-sight
challenges. Substantial bandwidth savings can be achieved by performing preprocessing
and, if possible, knowledge extraction on raw data in situ. Beyond simple data collection,
the ability for space probes to autonomously self-manage will be a critical feature to
successfully execute planned space-exploration missions. Autonomous spacecraft have
the potential to substantially increase their return on investment through opportunistic
explorations conducted outside the Earth-bound operator control loop. To achieve this
goal, the required processing capability becomes even more demanding when decisions
must be made quickly for applications with real-time deadlines. However, providing
the required level of onboard processing capability for such advanced features and
simultaneously meeting tight budget requirements is a challenging problem that must
be addressed.
In response, NASA has initiated several projects to develop technologies that address
the onboard processing gap. One such program, NASA’s New Millennium Program
(NMP), provides a venue to test emergent technology for space. The Dependable
Multiprocessor (DM) is one of the four experiments on the upcoming NMP Space
Technology 8 (ST8) mission, to be launched in 2009, and the experiment seeks to
deploy Commercial-Off-The-Shelf (COTS) technology to boost onboard processing
performance per watt [37]. The DM system combines COTS processors and networking
components (e.g., Ethernet) with a novel and robust middleware system that provides a
means to customize application deployment and recovery features, and thereby maximize
system efficiency while maintaining the required level of reliability by adapting to the
harsh environment of space. In addition, the DM system middleware provides a parallel
processing environment comparable to that found in high-performance COTS clusters of
which application scientists are familiar. By adopting a standard development strategy
and runtime environment, the additional expense and time loss associated with porting
applications from the laboratory to the spacecraft payload can be significantly reduced.
4.1.2 DM System Architecture
Building upon the strengths of past research efforts [38], [39], [40], the DM
system provides a cost-effective, standard processing platform with a seamless
transition from ground-based computational clusters to space systems. By providing
development and runtime environments familiar to earth and space science application
developers, project development time, risk and cost can be substantially reduced.
The DM hardware architecture (see Figure 4-1) follows an integrated-payload concept
whereby components can be incrementally added to a standard system infrastructure
inexpensively [41]. The DM platform is composed of a collection of COTS data
processors (augmented with runtime-reconfigurable COTS FPGAs) interconnected
by redundant COTS packet-switched networks such as Ethernet or RapidIO [42]. To
guard against unrecoverable component failures, COTS components can be deployed
with redundancy, and the choice of whether redundant components are used as cold
or hot spares is mission-specific. The scalable nature of non-blocking switches provides
distinct performance advantages over traditional bus-based architectures and also allows
network-level redundancy to be added on a per-component basis. Additional peripherals or
custom modules may be added to the network to extend the system’s capability; however,
these peripherals are outside of the scope of the base architecture.
Figure 4-1. System hardware architecture of the dependable multiprocessor
Future versions of the DM system may be deployed with a full complement of COTS
components but, in order to reduce project risk for the DM experiment, components
that provide critical control functionality are radiation-hardened in the baseline
system configuration. The DM is controlled by one or more System Controllers, each
a radiation-hardened single-board computer, which monitor and maintain the health of the
system. Also, the system controller is responsible for interacting with the main controller
for the entire spacecraft. Although system controllers are highly reliable components, they
can be deployed in a redundant fashion for highly critical or long-term missions with cold
or hot sparing. A radiation-hardened Mass Data Store (MDS) with onboard data handling
and processing capabilities provides a common interface for sensors, downlink systems
and other peripherals to attach to the DM system. Furthermore, the MDS provides a
globally accessible and secure location for storing checkpoints, I/O and other system data.
The primary dataflow in the system is from instrument to Mass Data Store, through
the cluster, back to the Mass Data Store, and finally to the ground via the spacecraft’s
Communication Subsystem. Because the MDS is a highly reliable component, it will likely
have an adequate level of reliability for most missions and therefore need not be replicated.
However, redundant spares or a fully distributed memory approach may be required for
some missions. In fact, results from an investigation of the system performance suggest
that a monolithic and centralized MDS may limit the scalability of certain applications;
these results are presented in Section 4.3.
4.1.3 DM Middleware Architecture
The DM middleware has been designed with the resource-limited environment typical
of embedded space systems in mind and yet is meant to scale up to hundreds of data
processors per the goals for future generations of the technology. A top-level overview of
the DM software architecture is illustrated in Figure 4-2. A key feature of this architecture
is the integration of generic job management and software fault-tolerant techniques
implemented in the middleware framework. The DM middleware is independent of
and transparent to both the specific mission application and the underlying platform.
This transparency is achieved for mission applications through well-defined, high-level,
Application Programming Interfaces (APIs) and policy definitions, and at the platform
layer through abstract interfaces and library calls that isolate the middleware from the
underlying platform. This method of isolation and encapsulation makes the middleware
services portable to new platforms.
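An abstract platform interface of this kind might look as follows; this is a hypothetical illustration of the isolation idea, not the actual DM middleware API.

```cpp
#include <cstddef>

// Hypothetical platform-abstraction layer: the middleware codes against
// this interface, and porting to a new platform means supplying a concrete
// implementation of these calls (names and signatures are illustrative).
class PlatformServices {
public:
    virtual ~PlatformServices() = default;
    virtual void sendMessage(int node, const void* buf, std::size_t len) = 0;
    virtual int  spawnProcess(const char* path) = 0;  // returns a process handle
    virtual bool isNodeAlive(int node) = 0;           // liveliness query
};
```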
Figure 4-2. System software architecture of the dependable multiprocessor
To achieve a standard runtime environment to which science application designers
are accustomed, a commodity operating system such as a Linux variant forms the basis
for the software platform on each system node including the control processor and mass
data store (i.e., the Hardened Processor seen in Figure 4-2). Providing a COTS runtime
system allows space scientists to develop their applications on inexpensive ground-based
clusters and transfer their applications to the flight system with minimal effort. Such an
easy path to flight deployment will reduce project costs and development time, ultimately
leading to more science missions deployed over a given period of time. Table 4-1 provides
descriptions of the other DM middleware components.
Table 4-1. The DM middleware components

| Component | Description |
|---|---|
| High-Availability Middleware (HAM) | Provides a standard communication interface between all software components, including user applications. Guarantees in-order delivery of all messages and supports seamless switching between redundant networks. |
| Fault-Tolerance Manager (FTM) | Central fault recovery agent for the DM system. Monitors status of software agents and reliable messaging middleware. Updates JM tables upon resource changes affecting application scheduling. |
| Job Manager (JM) | Centralized component that schedules jobs, allocates resources, dispatches processes, and directs application recovery. |
| Job Manager Agents (JMA) | Distributed software agents that fork the execution of jobs and manage required runtime job information on the local host. |
| Fault-tolerant Embedded Message Passing Interface (FEMPI) | Application-independent, fault-tolerant message passing interface adhering to the MPI standards. Provides a subset of the MPI API and supports various fault recovery modes. |
| MDS Server | Services all data operations between applications and mass memory. |
4.2 Approach
The FASE framework presented in Chapter 3 provides an ideal environment for
exploring the design options involved with system configuration of the DM system. The
models in the pre-built library were designed so that components could be configured for
embedded or traditional HPC systems through simple parameter tweaks representing the
capacity or capability of a specific resource. Therefore, various design trade-offs can be
explored using a variety of hardware and software models in order to analyze their effects
on system performance and scalability.
In order to apply the FASE framework to study COTS components in space, the
original design was extended to support fault injection capabilities. These additional
features allow users to explore not only performance-oriented issues, but also those dealing
with fault tolerance and availability. In order to facilitate the use of these new capabilities,
we introduce a novel approach to predict the performability, a metric that
combines performance and availability to describe degradable systems [43], of COTS-based
payload processing systems. This approach analyzes systems in three complementary
domains: 1) physical prototype, 2) Markov-reward model, and 3) discrete-event simulation
model. Techniques from each domain represent cornerstones in the analysis process
though each has its strengths and weaknesses. Physical prototypes offer validity in
measured values but provide limited scalability and adaptability. Markov-reward models
allow for quick performability measurements for specific failure and recovery rates, but
are not suitable for modeling complex systems due to the high dimensionality required
for high-fidelity models. Finally, simulation provides a free-form environment
to evaluate systems with arbitrary configurations and workloads, but often suffers
from increased development time and lengthy analyses. By intelligently leveraging the
strengths of each domain, a quick and precise analysis of various system configurations
and applications can be achieved that includes a variety of arbitrary workloads and
fault-injection campaigns. The process begins with the evaluation of the prototype system,
where real-world performance values such as network latency and component recovery
times are measured and used to calibrate the Markov-reward and discrete-event simulation
models that would otherwise lack validity. Next, quick performability evaluations of
the system’s fault-tolerant software architecture are conducted using Markov modeling
techniques to identify efficient designs and workloads. Thus, the Markov models trim an
otherwise large design space to eliminate the time spent analyzing poor designs. The final
step uses pre-built or customized simulation models to analyze architectural enhancements
and dependencies within the selected systems and applications at a level of detail that
cannot be achieved in the previous domains. The resulting methodology allows candidate
systems to be thoroughly and accurately analyzed for both performance and availability
thus allowing designers to compare alternate fault-tolerant architectures for aerospace
applications.
We apply the three-stage methodology to analyze and quantify the performance and
fault-tolerant characteristics of the DM management software and proposed flight system.
The following subsections provide more details on the modeling efforts involved with this
work.
4.2.1 Physical Prototype
The first stage of the analysis approach involves the development and testing of a
prototype system that represents a scaled-down version of the proposed system. The
prototype DM system was designed and developed to mirror, when possible, and emulate,
when necessary, the features of a typical satellite system. As shown in Figure 4-3, the
prototype hardware consists of a collection of COTS Single-Board Computers (SBCs)
running a Linux-based operating system, interconnected with redundant Gigabit Ethernet
networks. One SBC is augmented with an FPGA coprocessor, and a reset controller and
power supply are incorporated for power-off resets. Six SBCs are used to mirror four data
processor boards and emulate the functionality of the two radiation-hardened control
and MDS nodes. Each SBC is composed of a 1 GHz PowerPC processor, 1 GB of main
memory and dual Gigabit Ethernet NICs. A Linux workstation emulates the role of the
Spacecraft Command and Control Processor, which is responsible for communication
with and external control of the DM system but is outside the scope of this paper.
The MPI middleware layer used on the testbed is FEMPI 1.0, a custom, fault-tolerant
implementation of a selected subset of the MPI standard [44]. GoAhead’s SelfReliant 4.1 is
used as the high-availability middleware, which provides network communication, liveliness
information, and network failover. Finally, the MDS storage device was emulated via a
5400 RPM hard drive.
The prototype is used to measure the achievable performance of the system executing
microbenchmarks that exercise its network and MDS subsystems. It is also employed
to gather the response times necessary to detect failed components within the system.
Further details on how the prototype is used to validate the DM models are presented in
Section 4.3.

Figure 4-3. Logical diagram and photograph of DM testbed: (a) logical diagram, (b) photograph
4.2.2 Markov-Reward Modeling
After the prototype system has been developed and performance and other key
metrics have been measured, the analysis transitions to the Markov-reward modeling
domain. Within this domain, quick evaluations are conducted to explore various fault and
recovery rates in order to identify the workloads and system configurations that are most
interesting for further study. In these studies, steady-state performability (SSP) is the
common fitness metric used to describe the performance of degradable multiprocessor
computer systems [45]. The SSP allows users to predict the mean computational
performance of the system, taking into account both short- and long-term effects that
could otherwise skew experimental measurements of system performance.
A typical method used to estimate the SSP involves using Markov-reward models
(MRMs), constructs based on continuous-time Markov chains (CTMC). MRMs combine
state probabilities, obtained from steady-state analyses of CTMCs, and reward rates based
on computational performance to calculate the SSP of a system. Formally, the SSP is
defined as the expected asymptotic reward or
SSP = \sum_{i \in S} \pi_i r_i \qquad (4–1)

where S represents the set of all possible states that the given system can occupy, \pi_i
denotes the steady-state probability of the system occupying state i, and r_i stands for the
reward rate of the ith state [46].
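To make Equation 4–1 concrete, the SSP computation reduces to a dot product between the steady-state probability vector and the reward vector. The following minimal C++ sketch (the probabilities and rewards are illustrative values, not outputs of the DM models) shows the calculation; C++ is also the language in which the simulation models described later in this chapter were written.

```cpp
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    // Illustrative steady-state probabilities (must sum to 1) and
    // reward rates for a hypothetical five-state degradable system.
    std::vector<double> pi = {0.90, 0.04, 0.03, 0.02, 0.01};
    std::vector<double> r  = {1.0,  0.0,  0.0,  0.0,  0.0};

    // SSP = sum over all states i of pi_i * r_i (Equation 4-1).
    double ssp = std::inner_product(pi.begin(), pi.end(), r.begin(), 0.0);
    std::printf("SSP = %.4f\n", ssp);  // prints 0.9000 for these values
    return 0;
}
```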
4.2.2.1 Data node model
The data node model focuses on calculating the performability of the node under
faulty conditions. To simplify the node model we assumed that jobs are continuously
scheduled on the node by the JM. Also, the delay between the completion of one job and
start of another is considered to be negligible since the run time of each job is significantly
larger than the scheduling time. These assumptions allow the model to be realized as a
six-state CTMC as shown in Figure 4-4. Each state corresponds to a particular condition
of the data processing node where the three primary components, the application, JMA,
and system (i.e., HAM and operating system), are either operational or not.
Figure 4-4. Markov-reward data node model
The S_APP state occurs when the application is executing on the node and all other
services are running correctly. This state is the only node configuration with a non-zero
reward rate (equal to 1), which makes the SSP equivalent to the availability for this
model. In order for the model to transition out of the S_APP state, an SEE-related error
must occur, causing a hang or crash of the application, JMA, or system, governed by the
fault rates λ_AF, λ_JF, and λ_SF, respectively. Since the recovery policy (node reset) for
node-wide errors and HAM errors is identical, those failure rates are combined to simplify
the model. Each failure rate is proportional to the reciprocal of the independent variable
MTBF_NODE (mean time between faults for the node), which corresponds to the SEE rate
experienced by the node. Because the majority of SEEs are expected to impact the CPU,
each of the aforementioned fault rates is obtained by scaling the node fault rate by the
CPU utilization of the given software component (CPU%_APP, CPU%_JMA, CPU%_SYS).
The CPU utilizations are defined by the following equation:

\mathrm{CPU\%}_{APP} + \mathrm{CPU\%}_{JMA} + \mathrm{CPU\%}_{SYS} = 100\% \qquad (4–2)
The S_DET state denotes a detection delay after an error has occurred in the
application, causing it to abort or crash. This delay is associated with the heartbeat
interval between the running application and the JMA. The S_JMA state denotes a
configuration in which no application is running but the rest of the system is functioning
properly. To transition to the S_REC state, the JMA must start an application with rate
µ_FR, which is inversely proportional to the time required by the system to start the
process. The S_REC state symbolizes the application recovering from a crash. The rate
µ_RC, at which the system can transition back to S_APP, depends on the checkpointing
interval of the application as well as the size of the checkpoint and the transfer time
from the MDS. When the JMA fails, all running applications are terminated and the
model enters the S_SYS state. Upon entering the S_SYS state, the HAM immediately
attempts to start a JMA with a start rate of µ_FR. If the operating system or HAM fails
before the JMA starts up, the node model switches to the S_DOWN state and the node
is rebooted. The reboot rate µ_RB dictates the time required for the system to cycle
power to the node, start the operating system and HAM, and reconnect to the system.
Tables 4-2 and 4-3 summarize the states and parameters incorporated into the node
model, respectively.
Table 4-2. Data node model states

Symbol  | Running components    | Description
S_APP   | SYS, JMA, Application | System is functioning correctly.
S_DET   | SYS, JMA              | Application has crashed or hung.
S_JMA   | SYS, JMA              | The JMA is ready to start or restart the application.
S_REC   | SYS, JMA, Application | The application is recovering from the crash.
S_SYS   | SYS                   | The JMA has crashed or hung (the application is automatically killed).
S_DOWN  | None                  | The system has crashed and requires a reboot.
Table 4-3. Failure and recovery rates of the node model

Symbol     | Rate [1/s] or value  | Description                          | Type
MTBF_NODE  | variable             | Mean time between faults for a node. | Input
λ_AF       | CPU%_APP / MTBF_NODE | Application fault rate.              | Derived
λ_JF       | CPU%_JMA / MTBF_NODE | JMA fault rate.                      | Derived
λ_SF       | CPU%_SYS / MTBF_NODE | System fault rate.                   | Derived
µ_RB       | 0.0333               | System reboot rate (HAM and OS).     | Measured
µ_RC       | 0.069061             | Application recovery rate.           | Measured
µ_FR       | 14.27                | System fork rate.                    | Measured
µ_DT       | 0.8333               | Failed application detection rate.   | Measured
CPU%_APP   | 70%                  | Portion of CPU used by application.  | Estimated
CPU%_JMA   | 5%                   | Portion of CPU used by JMA.          | Estimated
CPU%_SYS   | 25%                  | Portion of CPU used by OS and HAM.   | Estimated
The rates specified in Table 4-3 are divided into three categories: derived, measured,
and estimated. The derived rates are calculated from the SEE rate using the equations
above, while the measured rates were obtained by experimental measurements on the DM
prototype system. The CPU utilization values chosen reflect the estimated workload of
typical applications running on the DM system.
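To illustrate how the node model can be evaluated numerically, the following C++ sketch builds the six-state generator matrix from the rates in Table 4-3 and solves for the steady-state probabilities by uniformization. The arc set is inferred from the prose description of Figure 4-4 (the figure itself is not reproduced here) and should be treated as an assumption; SHARPE, introduced below, performs the equivalent analysis in practice.

```cpp
#include <array>
#include <cmath>
#include <cstdio>

// Steady-state solution of the six-state data node CTMC by
// uniformization: P = I + Q/Lambda is a stochastic matrix with the
// same stationary vector as Q, so repeated multiplication converges
// to pi, from which SSP = pi[S_APP] (the only state with reward 1).
int main() {
    enum { APP, DET, JMA, REC, SYS, DOWN, N };

    // Rates from Table 4-3 with MTBF_NODE = 28800 s (three faults/day).
    const double mtbf = 28800.0;
    const double lAF = 0.70 / mtbf, lJF = 0.05 / mtbf, lSF = 0.25 / mtbf;
    const double muDT = 0.8333, muFR = 14.27, muRC = 0.069061, muRB = 0.0333;

    double Q[N][N] = {};
    // Arc set inferred from the description of Figure 4-4 (an assumption).
    Q[APP][DET]  = lAF;   // application crashes or hangs
    Q[APP][SYS]  = lJF;   // JMA fails, application killed
    Q[APP][DOWN] = lSF;   // OS/HAM fails, node reset required
    Q[DET][JMA]  = muDT;  // JMA detects the failed application
    Q[JMA][REC]  = muFR;  // JMA forks the application
    Q[REC][APP]  = muRC;  // checkpoint restored, application running
    Q[SYS][JMA]  = muFR;  // HAM restarts the JMA
    Q[SYS][DOWN] = lSF;   // OS/HAM fails before the JMA comes up
    Q[DOWN][SYS] = muRB;  // power-cycle restores OS and HAM

    double lambda = 0.0;  // uniformization constant >= max exit rate
    for (int i = 0; i < N; ++i) {
        double out = 0.0;
        for (int j = 0; j < N; ++j) if (j != i) out += Q[i][j];
        Q[i][i] = -out;
        lambda = std::fmax(lambda, out);
    }
    lambda *= 1.1;

    std::array<double, N> pi{};
    pi[APP] = 1.0;
    for (long it = 0; it < 5000000; ++it) {  // pi <- pi * (I + Q/lambda)
        std::array<double, N> next{};
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                next[j] += pi[i] * ((i == j ? 1.0 : 0.0) + Q[i][j] / lambda);
        pi = next;
    }
    std::printf("SSP = P(S_APP) ~= %.6f\n", pi[APP]);  // roughly 0.999 here
    return 0;
}
```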
4.2.2.2 System model
The goal of the Markov model representing the DM system is to approximate the
SSP of the system with an arbitrary number of nodes. For this model, we assume each
node executes completely independent workloads and, as a result, the model represents a
best-case approximation and sets an upper bound on the SSP of the system. The fact that
the radiation-hardened control and MDS nodes in the DM system are not susceptible to
SEEs further simplifies the model. To predict the SSP of such a system, we developed a
Markov-reward model with N+1 states, where N is the number of compute nodes in the
cluster (see Figure 4-5).
Figure 4-5. Markov-reward system model
Each state in the system model denotes the number of nodes that are currently in the
S_APP state at a given time. Most commonly, the reward rate associated with each state is
simply set to the number of nodes in the S_APP state. However, in systems such as those
with hot spares or those that incur overhead penalties, each node's reward rate can be
modified accordingly. The node failure rate λ_ND, defined in Equation 4–3, is equivalent to
the aggregate rate of all transitions out of the S_APP state, while the recovery rate of a node,
µ_ND, is the rate at which the node model can transition back to the S_APP state. Equation
4–4 provides a formal definition of this recovery rate.
\lambda_{ND} = \lambda_{AF} + \lambda_{JF} + \lambda_{SF} = \frac{1}{\mathrm{MTBF}_{NODE}} \qquad (4–3)
\mu_{ND} = \frac{P(S_{APP})\,\lambda_{ND}}{1 - P(S_{APP})} \qquad (4–4)
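Equation 4–4 can be read as a flow-balance condition: collapsing the node model into an "up" aggregate (S_APP, occupied with probability P(S_APP)) and a "down" aggregate (all remaining states), the steady-state flow out of the up aggregate must equal the flow back into it:

P(S_{APP})\,\lambda_{ND} = \left(1 - P(S_{APP})\right)\mu_{ND} \;\Longrightarrow\; \mu_{ND} = \frac{P(S_{APP})\,\lambda_{ND}}{1 - P(S_{APP})}

so that the two-state abstraction reproduces the availability computed by the full six-state node model.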
To evaluate the data node and system models, we use the SHARPE tool, which
is commonly used to analyze Markov chains, Petri nets, and hierarchical models for
availability, reliability, and dependability calculations. The tool is actively developed at
Duke University [47].
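Because the nodes are assumed to fail and recover independently, one natural reading of Figure 4-5 is a machine-repair birth-death chain in which state k moves to k−1 at rate k·λ_ND and to k+1 at rate (N−k)·µ_ND. Under this assumption (an inference, not confirmed by the figure itself), the stationary distribution is binomial and the SSP has the closed form sketched below.

```cpp
#include <cstdio>

// Closed-form SSP of the N+1-state system model, assuming the
// independent-repair birth-death structure described in the text:
// each node is up with probability p = mu_ND / (lambda_ND + mu_ND),
// and with reward = number of nodes in S_APP, SSP = N * p.
int main() {
    const int    N        = 32;
    const double lambdaND = 1.0 / 28800.0;                   // Equation 4-3
    const double pApp     = 0.999;            // P(S_APP) from the node model
    const double muND     = pApp * lambdaND / (1.0 - pApp);  // Equation 4-4

    const double p   = muND / (lambdaND + muND);  // algebraically equals pApp
    const double ssp = N * p;
    std::printf("per-node availability = %.6f, SSP = %.4f nodes\n", p, ssp);
    return 0;
}
```

Note that substituting Equation 4–4 into p collapses it back to P(S_APP), confirming that this best-case system model simply scales the node availability by N.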
In order to coarsely evaluate the performance of the DM system, we developed a
hierarchical Markov-reward model that allows for rapid evaluation of the potential
computational rates achievable for a range of applications under varying fault conditions.
Unfortunately, such a basic model lacks the fidelity and precision to explore the effects
of network, CPU, MDS, or scheduling performance, which in conjunction with fault
conditions can significantly affect the SSP. The quality of the SSP obtained from the
Markov-reward model is further evaluated and compared to the simulative model in
Section 4.3.
4.2.3 Discrete-Event Simulation Modeling
The final step in the three-stage analysis involves in-depth evaluations of the virtually
prototyped system using discrete-event simulation models. In the simulation domain,
analyses of application and architectural configurations are conducted with the intent to
identify the settings that produce the highest performance and availability.
Based on the node and system architectures from Sections 4.1.2 and 4.1.3,
discrete-event simulation models of key components were designed and developed. Each
model adheres to the FASE methodology of balancing speed and accuracy, and some
models actually extend or enhance the pre-existing models in the FASE library. The
components listed in Table 4-4 were formally modeled to not only capture the correct
functionality of the corresponding technology but also to incorporate their impacts on
system performance and fault tolerance. From these core component models, node and
system models were developed. Figure 4-6 illustrates the middleware models that compose
a data processing node, system control node, and MDS node. Finally, the virtual flight
system was developed using the final design architecture (see Figure 4-7).
Table 4-4. Summary of DM component models

Component                    | Library | Description
Fault Tolerance Manager      | DM      | Detects component failures, notifies Job Manager, and takes necessary recovery procedures.
Job Manager                  | DM      | Schedules and manages jobs. Handles task restarts based on available resources.
Job Manager Agent            | DM      | Starts and monitors applications on data node. Notifies Job Manager when application failure detected.
High-Availability Middleware | DM      | Provides reliable communication between nodes in system. Monitors JM, JMA, and other nodes for failures. Notifies FTM of failures.
MDS Server                   | DM      | Handles data access requests to the mass data store.
TCP Layer                    | FASE    | Provides TCP protocol for reliable communication between nodes.
IP Layer                     | FASE    | Provides IP protocol for all network transfers.
Ethernet NIC                 | FASE    | Provides Ethernet protocol for all network transfers. Supports multiple ports.
Ethernet Switch              | FASE    | Provides Ethernet connectivity between nodes. Supports variety of backplane and routing options.
4.2.4 Fault Model Library
The models described in the previous section capture the functionality and
performance-based characteristics as seen in their real-world counterparts. However,
they do not include fault detection and recovery mechanisms needed to function properly
when exposed to a fault. As a result, a fault model library was designed to integrate key
features that enhance the models to react appropriately under various fault campaigns.
In addition, the fault model library provides the necessary components to generate and
inject faults into an arbitrary system. The models in the library were specifically designed
so that new and pre-existing models could be “fault-enabled” with few additions or
modifications. The fault models were also designed to create a fault hierarchy such that
a single, high-level component could be affected by a fault and the mechanisms would
automatically propagate the fault to all lower-level entities. This hierarchical design
not only captures the area of influence of a particular fault type, but it also provides an
infrastructure to define interdependencies between various components. Table 4-5 lists the
fault model components accompanied by brief descriptions of their functionality.

Figure 4-6. The DM node models: (a) compute node, (b) control node, (c) MDS node

Figure 4-7. The DM flight system model
Table 4-5. Summary of fault models

Component        | Description
Fault Generator  | Controls when faults are generated. Generation times based on random distributions or user defined.
Fault Controller | Injects faults into system. Selects target component randomly or based on user-defined susceptibility matrix. Monitors the status of all fault managers and modules in the system.
Fault Manager    | Propagates injected faults from the fault controller to all lower-level faulty devices. Aids in the recovery process of managed components when necessary.
Fault Module     | Provides high-level fault mechanisms such as detection and recovery to integrate into new and pre-existing models. Inherits Fault Base mechanisms and data structures.
Fault Base       | Provides low-level fault data structures and mechanisms to integrate into new and pre-existing models. These data structures and mechanisms deal with scheduling events, managing faulty events, and managing memory (e.g., preventing memory leaks).
Figure 4-8 illustrates an example of a fault-enabled system based on the DM
architecture. The figure shows the various hardware and software components that
can be affected by faults as well as some of the fault models that manage the injection,
detection, and reaction to faults in the system. Fault Modules have been integrated into
each of the hardware and software components in the system and Fault Managers are used
to manage groups of modules. The actual groupings of faulty components are based on
either the physical proximity of the components in a device or the management system
that controls the liveliness of the component. Another factor to consider when creating
these groups is how each component reacts to specific faults. For example, the Data Node
Fault Manager in Figure 4-8 manages how faults are injected at the node level such that if
the corresponding data node becomes faulty, it will pass the fault to the lower-level fault
managers within the NIC, Middleware, and Applications blocks so that they can disable
their corresponding components (e.g., Port, JMA, HAM, or SAR).
Figure 4-8. Example fault-enabled system
Faults are injected into the system by the Fault Control Block which is composed
of the Fault Generator and Fault Controller. The Fault Generator creates time-based
fault events as dictated by either a random distribution or the user. The Fault Controller
receives events from the Fault Generator and provides the injection capabilities necessary
to stimulate the virtual system with failures. The Fault Controller targets specific
components in the system according to a susceptibility matrix that defines the probability
each listed component will experience a fault. The actual percentage values supplied
within the susceptibility matrix are user-defined and both the utilization and physical
size of each component should be considered when setting the values. Once the Fault
Controller determines its target component, it injects the fault into the system via
Fault Managers. The Fault Managers relay the fault to the target component so the
target can react according to a particular policy defined by the user. The Fault Module
is incorporated into each fault-enabled component and provides virtual detection and
recovery functions that can be redefined to allow the user to configure the device to take
the necessary actions as dictated by it fault-tolerant policies.
The models within the fault library were programmed primarily using the
object-oriented C++ language so that other systems and models can be easily retrofitted
with fault injection capabilities. Also, the models were designed with extensibility in mind
to support a wide range of detection and recovery methods for many types of faults. All
the components described in the previous section have been retrofitted with the necessary
fault models and each has been configured to react to and recover from faults as dictated
by the policies set up for the DM system. Details on these policies can be found in [48].
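The retrofit pattern described above is essentially classic C++ inheritance: Fault Base supplies low-level bookkeeping, Fault Module exposes virtual detection and recovery hooks, and each component model overrides the hooks with its own policy. The sketch below illustrates the idea with hypothetical class and method names; it is not the FASE library's actual interface.

```cpp
#include <cstdio>

class FaultBase {                      // low-level fault state and bookkeeping
protected:
    bool faulty_ = false;
};

class FaultModule : public FaultBase { // high-level hooks, redefined per model
public:
    virtual ~FaultModule() = default;
    void inject() { faulty_ = true; onDetect(); } // called via a Fault Manager
    virtual void onDetect()  = 0;
    virtual void onRecover() = 0;
};

class JmaModel : public FaultModule {  // a "fault-enabled" component model
    void onDetect() override {
        std::printf("JMA fault detected; running application killed\n");
        onRecover();
    }
    void onRecover() override {
        std::printf("HAM restarts the JMA\n");
        faulty_ = false;
    }
};

int main() {
    JmaModel jma;
    jma.inject();   // a Fault Manager relaying an injected fault
    return 0;
}
```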
4.3 Results and Analysis
This section describes the methods used to analyze and identify performance and
availability issues in the DM system architecture. The section presents experiments
used to validate the Markov and simulation models through the use of experimentally
gathered measurements from the prototype system. These measurements are used to
calibrate the models and validation results are presented where applicable. After the
models are calibrated, a scalability study of an important application kernel, the 2D
FFT, is presented. The goal of this investigation is to find bottlenecks that exist in the
proposed design and explore the effects of changing key system features (regarding both
architectural and algorithmic variations) to better map the application to the underlying
architecture. The scalability study is followed by an in-depth performability analysis of
the DM system using the SAR application in order to evaluate the trade-offs between
performance, scalability, and availability. The section concludes with an evaluation of
the proposed 20-node flight system incorporating the optimal configurations to maximize
performability.
4.3.1 Model Calibration
4.3.1.1 Component model calibration and validation
Model validation is a critical step in any modeling effort that attempts to provide
accurate results comparable to those produced by real systems. Validation of complex
systems such as the DM system is difficult to accomplish; therefore, we overcome this
challenge by decomposing the system and validating its subsystems: the network
subsystem and the MDS subsystem. The network subsystem encompasses all software and
hardware layers employed during a data transfer which correspond to the HAM, TCP,
IP, and Ethernet models. In order to validate this subsystem, a simple PingPong MPI
program that measures low-level network performance between two nodes was executed
on the prototype system described in Section 4.2.1 and the results were used to calibrate
the models to best represent the testbed’s network and middleware performance. Figure
4-9a illustrates the experimentally gathered throughput values as compared to those
produced by the simulated system. The figure shows the simulation model closely matches
the performance measured on the testbed with a mean relative error of 1.27%. A similar
mean relative error was observed when comparing the experimental and simulative latency
measurements across the studied message sizes.
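The mean relative error quoted here and in the remaining validations is simply the average of |simulated − measured| / measured over the studied message sizes. A small C++ sketch of the metric follows; the sample values are illustrative, not testbed data.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Average of |simulated - measured| / measured over all data points.
double meanRelativeError(const std::vector<double>& measured,
                         const std::vector<double>& simulated) {
    double sum = 0.0;
    for (std::size_t i = 0; i < measured.size(); ++i)
        sum += std::fabs(simulated[i] - measured[i]) / measured[i];
    return sum / measured.size();
}

int main() {
    std::vector<double> measured  = {11.2, 33.5, 48.1, 55.9};  // MB/s
    std::vector<double> simulated = {11.4, 33.1, 48.8, 56.3};
    std::printf("mean relative error = %.2f%%\n",
                100.0 * meanRelativeError(measured, simulated));
    return 0;
}
```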
Figure 4-9. Throughput validations for network and MDS subsystem models: (a) network subsystem, (b) MDS subsystem
Once the network subsystem was validated, the performance of another key
subsystem, the MDS, was calibrated according to experimental measurements. Again,
a simple benchmark was developed that transfers data of varying sizes to and from the
MDS node. The validation results are shown in Figure 4-9b, with the MDS subsystem
model producing mean relative errors of 1.58% for writes and 2.03% for reads.
From the validation process and documentation, the DM system’s main component
parameters were calibrated in order to most accurately represent the testbed system. The
values for key parameters are listed in Table 4-6, and this configuration corresponds to
the baseline system used in the following experiments.
Table 4-6. Baseline system parameters
Parameter Name Value
Processor power 1200 MIPS, 600 MFLOPS
MPI maximum throughput 57 MB/s
MPI message latency 13.6 ms
HAM buffer size 2000000 bytes
Network bandwidth Non-blocking 1000 Mb/s
Network switch latency 5 µs
MDS bandwidth (write/read) 60/40 MB/s
MDS latency (write/read) 300/500 µs
MDS open file overhead 8 ms
4.3.1.2 System performability model
In addition to component calibration, the simulation system models were validated
with regard to their near-idealistic performability as compared to those produced by the
Markov-reward models presented in Section 4.2.2. For this experiment, a serial version of
an LU decomposition kernel was scheduled on each node in the tested systems. Each LU
job processed matrices of 1000×1000 elements with each element being an 8-byte double.
The experiment varied the system size from four to thirty-two nodes by powers of two
while exploring ten MTBF_NODE values. The minimum fault rate expected for each DM
data node was estimated to be three faults per day, which corresponds to the maximum
MTBF_NODE value analyzed (28800 seconds = 8 hours) and a relatively hospitable
environment. The remaining rates were selected to analyze the system's performability in
harsher conditions, and the results from this study are shown in Figure 4-10.
Figure 4-10. Markov versus simulation DM system performability comparison: (a) performability, (b) error
Figure 4-10a shows a comparison between the performability numbers collected for
the Markov and simulation models while Figure 4-10b illustrates the relative errors of
the simulation results as compared to those produced by the Markov model. One can
see that for larger MTBF_NODE values, the two analysis techniques yield nearly identical
results. However, deviations between the approaches become apparent when analyzing
systems exposed to more faults (i.e., small MTBF_NODE values). These deviations result
from the varying levels of detail captured by each modeling approach. In this instance, the
simulation model captures extra performance penalties such as network and scheduling
delays that affect the performance of the system. In high fault conditions, these penalties
begin to accumulate due to numerous job restarts thus negatively impacting the system’s
overall performability. In addition, the deviations at the smaller MTBFNODE values
increase with the system size due to scheduling overhead as well as resource contention in
the network and MDS node not modeled in the Markov-reward models.
4.3.2 Case Study: Fast Fourier Transform
The experiments conducted for this portion of the study explore the performance and
scalability of the 2D FFT kernel executing on the DM system. A fault-tolerant, parallel
2D FFT serves as the baseline algorithm, which distributes an image evenly over N
processing nodes and performs a logical transpose of the data via a corner turn. A single
iteration of the FFT, illustrated in Figure 4-11, includes several stages of computation,
inter-processor communication (i.e., corner turn), and several MDS accesses (i.e., image
read and write and checkpoint operations).
Figure 4-11. Dataflow diagram of parallel 2D FFT
The results of the baseline simulation (see Figure 4-12) show that the performance
of the FFT slightly worsens as the number of data nodes increases. In order to pinpoint
the cause of the performance decrease of the FFT, the processor, network, and MDS
characteristics were greatly enhanced (i.e., up to 1000-fold). The results in Figure 4-12
show that enhancing the processor and network has little effect on the performance of the
FFT, while MDS improvements greatly decrease execution time and enhance scalability.
The FFT application's performance is so directly tied to MDS performance because of
the high number of accesses to the MDS, the large MDS access latencies, and the
serialization of accesses to the MDS.
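A first-order cost model makes the dominance of the MDS plain: every access pays a fixed open-file overhead plus latency before any bytes move, and accesses from different nodes serialize at the single MDS node. The sketch below uses the Table 4-6 read parameters; the access count and transfer size are illustrative assumptions.

```cpp
#include <cstdio>

int main() {
    // MDS read parameters from Table 4-6.
    const double openOverhead = 8e-3;    // 8 ms open-file overhead per access
    const double readLatency  = 500e-6;  // 500 us read latency
    const double readBW       = 40e6;    // 40 MB/s read bandwidth

    // Illustrative workload: 16 nodes, 5 accesses each, 1 MB image split N ways.
    const int    nodes           = 16;
    const int    accessesPerNode = 5;
    const double bytesPerAccess  = 1e6 / nodes;

    const double perAccess = openOverhead + readLatency + bytesPerAccess / readBW;
    const double total     = nodes * accessesPerNode * perAccess;  // serialized
    std::printf("fixed cost per access: %.2f ms, transfer: %.2f ms\n",
                (openOverhead + readLatency) * 1e3,
                bytesPerAccess / readBW * 1e3);
    std::printf("serialized MDS time per image: %.1f ms\n", total * 1e3);
    return 0;
}
```

Under these assumptions, the fixed per-access cost (about 8.5 ms) dwarfs the roughly 1.6 ms transfer time, which is why reducing the number of accesses pays off far more than faster processors or networks.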
After the MDS was verified as the bottleneck for the 2D FFT, several options were
explored in order to mitigate the negative effects of the central memory. The options
included algorithmic variations, enhancing the performance of the MDS, and combinations
of these techniques. Table 4-7 lists the different variations.

Figure 4-12. Execution time per image for baseline and enhanced systems

Each technique offers performance enhancements over the baseline algorithm (i.e.,
P-FFT). Figure 4-13 shows that the parallel FFT with distributed checkpointing and
distributed data provides the best speedup (up to 740×) over the baseline because it
eliminates all MDS accesses. Individually, the distributed checkpointing and distributed
data techniques result in only a minimal performance increase since the time taken to
access the MDS still dominates the total execution time. MDS performance enhancements
reduce the execution time of the parallel FFT by a factor of 5. Switching the FFT
algorithm (see Figure 4-14) to the distributed version achieves a 2.5× speedup over the
baseline, which can then be further increased to 14× and 100× by employing MDS
improvements and distributed data, respectively. It is noteworthy that the distributed
FFT is well suited for larger system sizes since the number of MDS accesses remains
constant as system size increases.
Results for the parallel 2D FFT (Figure 4-13) magnify the effects of the MDS on the
system’s performance. Though the parallel FFT’s general trend shows worse performance
as system size scales, the top four lines show numerous anomalies where the performance
of the FFT actually improves as the number of nodes in the system increases. These
anomalies arise from the total number of MDS accesses needed to compute a single image
for the entire system. For example, a dip in execution time occurs in the baseline parallel
FFT algorithm when moving from 18 to 19 nodes. The total number of MDS accesses of
the parallel FFT using 18 nodes is 90 while the number of accesses decreases to 76 for
the 19-node case. Since the MDS is the system's bottleneck, the execution time of the
algorithm benefits from the reduction of MDS accesses. Only in the parallel FFT with
distributed data and distributed checkpointing option do the "zig-zags" disappear, since
no data transfers occur between the nodes and the MDS. The distributed FFT (see
Figure 4-14) also does not show any performance anomalies due to the nature of the
algorithm. That is, the number of MDS accesses remains constant per image since only
one node is responsible for computing that image.

Table 4-7. The FFT algorithmic variations and system enhancements

Algorithm / Technique | Description | Label
Parallel FFT | Baseline parallel 2D FFT. | P-FFT (Baseline)
Parallel FFT with distributed checkpointing | Parallel 2D FFT with "nearest neighbor" checkpointing: data node i saves checkpoint data to data node (i + 1) mod N, where i is a unique integer (0 ≤ i ≤ N − 1) and N is the number of tasks in a specific job. | P-FFT-DCP
Parallel FFT with distributed data | Parallel 2D FFT with each node collecting a portion of an image for processing, thus eliminating the data retrieval and data save stages. | P-FFT-DD
Parallel FFT with distributed checkpointing and distributed data | Combination of both distribution techniques described above. | P-FFT-DCP-DD
Parallel FFT with MDS enhancements | Parallel 2D FFT using a performance-enhanced MDS. The MDS bandwidth is improved 100-fold and the access latency is reduced by a factor of 50. | P-FFT-MDSe
Distributed FFT | A variation of the 2D FFT that has each node process an entire image rather than a part of the image. | D-FFT
Distributed FFT with distributed data | Distributed 2D FFT algorithm with each node collecting an entire image to process. | D-FFT-DD
Distributed FFT with MDS enhancements | Distributed 2D FFT algorithm using a performance-enhanced MDS. | D-FFT-MDSe
The results in Figures 4-13 and 4-14 correspond to 1 MB images, so we conducted
additional simulations to analyze the effects of larger image sizes. Our results showed
that, for larger images, the algorithms and enhancements reversed the trend for the
parallel FFT. That is, the execution times improved as the system size grew, though the
improvements were very minimal. Also, the sporadic performance jumps were amortized
because the total number of MDS accesses was large compared to the variance in the
number of accesses. The distributed FFT with distributed data was the only option that
showed a large improvement because more processing could occur when data was more
readily available to the processors. The results demonstrate that a realistic application
can be effectively executed by the DM system if the mass memory subsystem is improved
to allow for parallel memory accesses and distributed checkpoints.
Figure 4-13. Parallel 2D FFT execution times per image for various performance-enhancing techniques

Figure 4-14. Distributed 2D FFT execution times per image for various performance-enhancing techniques
4.3.3 Case Study: Synthetic Aperture Radar
For the next case study, we evaluate the performance and availability of a more
complex application, SAR. Synthetic Aperture Radar (SAR) is a high-resolution,
broad-area imaging process used for reconnaissance, surveillance, targeting, navigation,
and other operations requiring highly detailed, terrain-structural information [49]. It uses
a two-dimensional, space-variant convolution that can be decomposed into two domains
of processing – range and azimuth. In order to correctly transition between the range and
azimuth domains, the data must be reordered via a transpose operation [50]. This case
study analyzes a fault-tolerant version of SAR that incorporates an optional checkpointing
stage (striped block in Figure 4-15a) to save and recover rollback points in the event of a
failed job. Figure 4-15a illustrates the data flow of the fault-tolerant SAR application.
There are various implementations of the SAR application each differing based on
the data decomposition across participating processing nodes and, thus, the amount of
communication and computation conducted by each node [51]. For this study, we consider
the patch-based approach which splits each SAR image into “patches” along the azimuth
dimension and distributes each patch to an available compute node. This patched version
does not require communication between participating nodes although each node must
fetch and process additional data to ensure correct results. Figure 4-15b illustrates the
data decomposition of the patch-based implementation.

Figure 4-15. SAR dataflow with optional checkpoint stages and patched data decomposition: (a) dataflow diagram, (b) data decomposition
The baseline SAR application used throughout this study processes
28000×5616-element images, which is the approximate size of the images collected by
the European Remote-Sensing (ERS) satellites [52]. Each element is stored within the
mass data store as a complex pair of 8-bit integers (2-bytes total), the typical format used
for raw SAR data. When the data is imported by the SAR application, each element is
expanded to a complex pair of 32-bit floating-point numbers (8-bytes per pair) in order to
improve precision and reduce the potential for round-off errors, and the range dimension
is padded to 8192 elements to increase the efficiency of the FFT calculations. When SAR
is complete, the padded elements in the range dimension are removed from the processed
image and the remaining elements are converted to complex pairs of 16-bit short integers
(4-bytes per pair) thus reducing the amount of storage needed to store the data back to
the MDS. The patch data size is P×5616 elements, where P is the patch size, and the
overhead data size is 1296×5616 elements per patch. For each simulation run, the DM
system is observed over ten, 100-minute orbits and the radiation-hardened control and
MDS nodes are assumed to experience no failures.
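The element sizes and padding above imply substantial per-patch data volumes. The back-of-the-envelope C++ calculation below derives them for an 8192-element patch; the figures are illustrative arithmetic from the stated formats, not simulator outputs, and the stored size assumes the overlap region is discarded along with the padding.

```cpp
#include <cstdio>

int main() {
    const long long P = 8192;        // patch size (azimuth elements)
    const long long overlap = 1296;  // per-patch overhead (azimuth elements)
    const long long range = 5616;    // range elements per image
    const long long padded = 8192;   // range dimension padded for the FFTs

    const long long fetched = (P + overlap) * range * 2;   // 8-bit complex pairs
    const long long working = (P + overlap) * padded * 8;  // 32-bit float pairs
    const long long stored  = P * range * 4;               // 16-bit short pairs

    std::printf("fetched from MDS: %.1f MB\n", fetched / 1e6);  // ~106.6 MB
    std::printf("working set     : %.1f MB\n", working / 1e6);  // ~621.8 MB
    std::printf("stored to MDS   : %.1f MB\n", stored / 1e6);   // ~184.0 MB
    return 0;
}
```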
For the fault-injection experiments, the fault control block (described in Section 4.2.4)
creates and inserts faults into the system using an exponential distribution with a mean,
MTBF_SYSTEM, defined as:

\mathrm{MTBF}_{SYSTEM} = \frac{\mathrm{MTBF}_{NODE}}{N} \qquad (4–5)
where MTBF_NODE is the mean time between faults per node and N is the number of
nodes in the system. The MTBF_NODE rates considered in the fault experiments are
identical to those investigated in Section 4.3.1.2 and represent radiation conditions ranging
from minimal to extreme. Faults are injected into a particular node based on a uniform
distribution and the specific component to target on the selected node is dictated by the
following percentages: SAR Application = 70%, HAM/System = 25%, and JMA = 5%.
These percentage values were estimated based on the anticipated behavior of the SAR
application executing on a DM data node. Also, faults can be injected into recovering
components thus restarting the recovery process.
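The fault-generation procedure just described is straightforward to sketch: interarrival times are exponential with mean MTBF_SYSTEM from Equation 4–5, the victim node is chosen uniformly, and the victim component follows the 70/25/5 split. The standalone C++ sketch below illustrates the logic (it is not the FASE fault control block itself).

```cpp
#include <cstdio>
#include <random>

int main() {
    const double mtbfNode = 3600.0;  // one of the studied MTBF_NODE values [s]
    const int    N        = 20;      // number of data nodes
    std::mt19937 rng(7);

    // Exponential interarrival with mean MTBF_NODE / N (Equation 4-5);
    // the distribution takes the rate, i.e., the reciprocal of the mean.
    std::exponential_distribution<double> gap(N / mtbfNode);
    std::uniform_int_distribution<int>    node(0, N - 1);
    std::discrete_distribution<int> comp({0.70, 0.25, 0.05});  // App/HAM/JMA

    const char* names[] = {"SAR application", "HAM/System", "JMA"};
    double t = 0.0;
    for (int i = 0; i < 5; ++i) {  // first five injected faults
        t += gap(rng);
        std::printf("t = %8.1f s: fault on node %2d, component: %s\n",
                    t, node(rng), names[comp(rng)]);
    }
    return 0;
}
```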
The following sections describe the techniques and capabilities within the two
modeling domains of the three-stage analysis as they are applied to evaluate the
performance and availability of SAR executing on the DM system. Section 4.3.3.1
presents the amenability study of SAR using the Markov-reward model to determine
if the application maps well to the DM architecture. After the study, we enter the
discrete-event simulation domain in order to explore the various application and
architectural options available to improve performance and availability. Section 4.3.3.2
reports the performability and system throughput of the patch-based SAR application
while considering various patch sizes. It then evaluates the different checkpoint storage
options with the intent of improving the fault tolerance of the application. Finally, we
investigate numerous architectural enhancements to support efficient computing on a
twenty-node system and conclude with a final analysis of the proposed flight system.
4.3.3.1 Amenability study
This section begins with a preliminary analysis of the SAR application using
the Markov-reward model to determine whether the workload characteristics of this
particular application are appropriate for the DM system. The study presents best-case
performability numbers of the patch-based SAR application employing 2048-, 4096-, and
8192-element patches executing on systems using 4, 8, 16, and 32 data nodes. The fault
injection rates explored are identical to those used in Section 4.3.1.2.
Figure 4-16. Amenability results via Markov model for patch-based SAR application
From Figure 4-16, we can see that the patched SAR shows promising performability
numbers for each system size and patch size while the system experiences relatively benign
conditions. However, as the fault rate increases, we observe decreasing performability in
all cases. Furthermore, larger patch sizes have negative effects on the performability of
the system due to the increased amount of time required to complete the processing of
each SAR job. In fact, the Markov model reported a difference of 32.9% in performability
between the 2048-element patch and the 8192-element patch at the highest fault rate.
From these results, we observe that the patch-based SAR application produces good
performability numbers for the fault rates targeted for the DM system thus making it a
good candidate for further investigation using the discrete-event simulation approach.
4.3.3.2 In-depth application analysis
For the next step in the case study, we transition into the final stage of our proposed
methodology – the discrete-event simulation domain. Within this stage, we have the
capability to analyze many interesting application and system options while exposing
each configuration to various fault conditions. This section focuses on the various options
available for the patch-based SAR application. Similar to the amenability study, we
explore the impact of patch size on the system’s performability while considering patches
with 2048, 4096, and 8192 elements in the azimuth dimension. The objective of this study
is to determine which patch size achieves the best performance.
Figure 4-17 illustrates the results collected from the simulations. The left column
of charts shows the performability percentages of each patch size, and the right column
displays the corresponding throughputs while varying system size and MTBF_NODE.
From the figures in the left column, we see that the 2048-element patch size has the
highest performability in all system sizes and fault rates due to the application’s short
execution time. As the patch size grows, the execution time for each SAR job lengthens
thus increasing the probability of a fault interrupting the completion of a job. These
results concur with those reported by the Markov-reward model in the previous section.
However, the simulations produced lower performability percentages than those observed
using the Markov model. In fact, a maximum reduction of 43.0% in the performability of
larger systems experiencing high fault rates was reported by the discrete-event simulation
models. Another key observation is that the performability of each system decreases as
the system size enlarges thus indicating a potential bottleneck within the system. Our
Markov-reward model fails to show this trend due to its inability to capture architectural
dependencies.

Figure 4-17. System performability percentages and throughputs for patch-based SAR: (a, b) 2048-element, (c, d) 4096-element, and (e, f) 8192-element patches (performability left, throughput right)
Despite observing decreased performability when using larger patches, the system
throughputs for these jobs can be much higher than the 2048-element case. In fact, the
figures in the right column show improved throughputs up to 1.86× and 2.29× for the
4096- and 8192-element cases, respectively, in low-fault environments. However, the
2048-element case does outperform the 8192-element patch size when the MTBF_NODE
rate is less than 200 seconds. In these configurations, the mean times between faults are
shorter than the average execution times of the applications using larger patch sizes, thus
increasing the probability of a fault causing a SAR job to fail. Finally, for each patch size
and fault rate,
the throughput reaches its peak value in systems with eight data nodes. Although the
exact reason for this peak in performance is not fully clear from this particular study, we
hypothesize that the centralized MDS is the main cause. The study conducted in the next
section verifies this hypothesis by observing the effects of scaling the performance of key
system components.
For this study, we are most concerned with light to moderate fault rates (i.e.,
MTBF_NODE values between 28800 and 3600 seconds) where the performability
percentages for all patch and system sizes are very similar. Therefore, we focus on the
8192-element patch configuration for further analysis of SAR and the DM system in order
to maximize the system’s throughput.
Now that SAR has been configured properly with regard to performance and
availability, we can evaluate various checkpointing options with the goal of improving
application fault tolerance. Table 4-8 provides a list of the checkpoint options along
with brief descriptions of each. For this evaluation, we observe the performability and
throughput of four system sizes – 4, 8, 16, and 32 data nodes – while incorporating the
optional checkpoint stage for the SAR application (see Figure 4-15a). The objective of this
study is to observe the impact on performance and availability of checkpointing within the
SAR application. Furthermore, we wish to measure and compare the overheads associated
with storing the checkpoint data to a reliable, centralized location (MDS node) versus
checkpointing to unreliable, distributed data nodes.
Table 4-8. Checkpoint options explored using patch-based SAR application

Checkpoint option       | Description                                                  | Label
No checkpointing        | No checkpointing conducted.                                  | NoCP
MDS checkpointing       | Checkpoint data is stored on the MDS node.                   | MDSCP
Data node checkpointing | Checkpoint data is stored on the nearest-neighbor data node. | DataCP
Figure 4-18 shows the performability and throughput observed for each checkpoint
option while varying system size and MTBF_NODE. The performability is reported
by the solid lines in the charts while the throughput is represented by dotted lines.
When comparing the results from all four figures, we see that as system size increases,
performability decreases for all MTBF_NODE rates while throughput increases. The
MDSCP checkpoint option reports both the lowest performability and throughput for
all cases. Again, the likely cause of this poor performance is the increased pressure
placed on the centralized MDS node. In the smaller system sizes, the DataCP option
reports an approximate 11% drop in performability. Also, using the DataCP option
lowers throughputs by 33.5% and 27.5% for systems consisting of two and four data
nodes, respectively. However, this option does show slight benefits in the 16- and 32-node
systems. For both system sizes, nearly equivalent performability percentages were reported
and a maximum speedup of 1.08× in throughput was measured when compared to
the NoCP case. Another important observation from this study is that performability
does not always translate into raw performance (i.e., throughput). Figure 4-18b shows
nearly equivalent performability percentages between the three checkpoint options at
low fault rates. Conversely, the throughput of the NoCP option is 2.5× and 1.3× greater
than the MDSCP and DataCP options, respectively. This observation suggests that
while performability is a useful metric to evaluate the overall utilization of a degradable
system in a faulty environment, it does not represent the true performance of a specific
application since it does not differentiate between meaningful computation and the
processing conducted due to extra mechanisms such as checkpointing.
Figure 4-18. System performability and throughput for 8192-element patch-based SAR executing on various system sizes: (a) 4, (b) 8, (c) 16, and (d) 32 data nodes
The results attained in this study suggest that the patch-based SAR application
is not well suited for checkpointing. Although the DataCP option achieved improved
throughput over the NoCP option in large system sizes, the improvement was minimal
and the additional network transactions and demands on the data nodes could easily
negate any gains if multiple jobs were executed on each node. In the final study of the
flight system, we focus on the 8192-element patched SAR application with checkpointing
disabled.
4.3.3.3 Flight system
In this section, we explore the performability of the patch-based SAR application with
8192-element patches executing on the DM flight system composed of twenty data nodes
[53]. From the previous section, we have shown that the performance of the DM system
beyond eight nodes suffers due to an unidentified bottleneck within the system. In this
study, we enhance and modify the system architecture to identify and remedy this
bottleneck so that twenty data nodes processing SAR can be supported efficiently. Table 4-9
lists the various architectural enhancements investigated. The objective of this study is to
modify the current DM system design to support twenty data nodes and to improve the
performability and throughput of the system with realistic upgrades and augmentations.
Table 4-9. Architectural enhancements explored for flight system

Enhancement        | Description                                                                                                                  | Label
Processor          | Increases processing power of floating-point unit by 2× and increases throughput and decreases latency of middleware by 2×. | Proc
MDS storage device | Increases bandwidth and decreases latency of MDS storage device by 2×.                                                      | MDSe
MDS nodes          | Incorporates N MDS nodes within system.                                                                                     | N×MDS
Before the enhancements were studied, we measured the performability and
throughput of a 20-node system using default settings that served as the baseline
configuration. To identify the system bottleneck, we target two main components of
the system – the data node processor and the MDS storage device. By accelerating
the data processor by 2×, we assume that the floating-point unit and middleware layer
attain equivalent boosts in performance. Therefore, this enhancement improves the
performance of both floating-point computations and network transfers. Figure 4-19a
shows that upgrading the processor provides no performance gains for large MTBF_NODE
values; however, a 1.58× speedup is attained for high fault rates due to reduced execution
times as compared to the baseline. When we improve the performance of the MDS
storage device, we observe speedups ranging from 1.7× to 2.1× over the baseline. These
speedups suggest that the MDS is the main bottleneck of the system for the SAR
application. To further substantiate this claim, we augment the current system design
with additional MDS nodes in order to reduce contention. Figure 4-19a illustrates that the
DM system employing one extra MDS node doubles the performance observed from the
baseline system for low fault rates and more than triples it in extreme faulty conditions.
Employing three MDS nodes in the 20-node DM system further improves the performance
of the system by 2.5× in light and moderate faulty conditions and 4.4× in high fault
rates. An interesting observation from Figure 4-19 is that the speedup reported for each
enhancement increases with the fault rate. This increase is caused by the reduction in
execution time of the SAR application thus allowing the job to complete more frequently
without experiencing a failure.
Figure 4-19. Speedups of architectural enhancements for patch-based SAR: (a) component enhancements, (b) combination enhancements
Next, we investigate the impact of combining the individual component enhancements
in order to maximize system performance. From Figure 4-19b one can see that significant
speedups are achieved from combining an upgraded processor and MDS storage device
with a system using multiple MDS nodes. The system using two MDS nodes showed
speedups ranging from 4× in light fault conditions to 12.8× in high fault conditions, while
the system incorporating three MDS nodes was found to have a maximum speedup of
17.5× in high fault conditions and 5.1× in the environments in which data nodes
experience three faults per day.
From the results in the architectural study, the MDS was deemed the main bottleneck
of the system. In order to remedy this contention point for the 20-node flight system,
we propose using three MDS nodes with enhanced storage devices. We also include
upgraded data-node processors in order to accelerate data computations and middleware
processing. Using this upgraded design, we conduct the final performability study of the
SAR application. Figure 4-20 provides the results.
Figure 4-20. System performability and throughput of 20-node DM flight system executing patch-based SAR
The results in Figure 4-20 show that the proposed flight system performs well in
most radiation conditions. In fact, the performability of the system is predicted to exceed
99.5% in light to moderate fault conditions (i.e., MTBF_NODE > 7200 seconds). The
minimum performability of the system executing the patch-based SAR application is
54.0%. The throughputs achieved by the flight system were greatly improved even at the
highest fault rates. The minimum throughput was measured to be 316 images per orbit
and the maximum is 587 images per orbit. Assuming 100% of the system’s data nodes
are dedicated to processing SAR images, the throughput of the proposed flight system
dictates that it should be able to support a sustainable input rate of 29.35 MB/sec of raw
SAR data from the sensors. However, the expected input rate currently used by the ERS
satellites was calculated to be approximately 11.90 MB/sec. This difference translates
into a situation where only 9 data nodes are required to compute SAR jobs while the
remaining nodes are free to perform other compute jobs, conduct test diagnostics, or
simply remain idle to conserve power. This final observation tells us that the proposed
DM flight system architecture executing the patch-based SAR application is more than
suitable to handle the large demands for computation seen in the ERS satellites.
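The sustainable input rate quoted above can be reproduced from the raw image size and the orbit period (assuming the rates are quoted in binary megabytes): a raw 28000×5616-element image at 2 bytes per element occupies about 314.5 × 10^6 bytes, so 587 images per 100-minute (6000 s) orbit give

\frac{587 \times 28000 \times 5616 \times 2\ \text{bytes}}{6000\ \text{s} \times 2^{20}\ \text{bytes/MB}} \approx 29.35\ \text{MB/sec}

in agreement with the figure derived from the simulation results.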
4.4 Conclusions
This phase of research presented an approach that combines analysis techniques
using small-scale prototype systems, Markov-reward models, and discrete-event simulation
models in order to quickly and accurately evaluate the performance and availability
(i.e., performability) of aerospace systems and applications. The combination of these
techniques allowed us to calibrate component models using experimental measurements,
quickly pinpoint workloads and fault rates supported by the management software via
Markov-reward models, and thoroughly investigate specific applications executing on
virtually prototyped systems through discrete-event simulation. Details of each analysis
technique and the extensions and enhancements to the FASE framework were outlined.
Next, models were calibrated to reflect the performance of the physical testbed system
in order to ensure accurate results. Finally, two case studies reported performance,
scalability, and availability predictions of the 2D FFT kernel and SAR application
executing on the NASA Dependable Multiprocessor to showcase the capabilities of the
presented approach.
For model calibration, we used a small-scale prototype system consisting of four
data nodes, one MDS node, and one control node to collect performance measurements.
These measured values were then used to calibrate the Markov and simulation models
and the MDS and network subsystem models were validated using simple benchmarks to
confirm the accuracy of the simulation models. A performability validation experiment
was then conducted using an LU decomposition kernel to compare results between the
Markov-reward models and simulation models while each was subjected to various fault
rates. The results showed differences between the modeling approaches of less than
1% for low fault rates though larger errors were observed for higher fault rates due to
shortcomings (e.g., modeling architectural dependencies) of the Markov model.
After validation, the DM system models were used to evaluate the performance and
scalability of the system executing the 2D FFT kernel. This study exposed the centralized
MDS as a potential performance bottleneck in jobs that frequently access the MDS.
Various techniques were explored to mitigate the MDS bottleneck including distributed
checkpointing, distributing interconnections between sensors and data processors (i.e.,
distributed data), algorithm variations, and improving the performance of the MDS.
The study showed that eliminating extraneous MDS accesses was the best option though
enhancing the MDS memory was also a good option for increasing performance. Regarding
scalability, changing the algorithm from a parallel to a distributed approach and including
distributed checkpointing provides the best performance improvement of all the options
analyzed. For large image sizes (i.e., 64MB), the distributed FFT with distributed data
was the only option that showed a large improvement because more processing could occur
when data was more readily available for the processors.
The second case study used the patch-based SAR application to study its
performance and availability while running on the DM system. The Markov model
was initially employed to quickly determine that the performability of the application
using various patch sizes was suitable for further evaluation on the DM system. Once this
preliminary analysis was successful, we continued our three-stage analysis of SAR using
the discrete-event simulation models. Simulation results were similar to those produced by
the Markov-reward model but with one key difference. The performability values predicted
by the simulation model were much lower than those produced by the Markov model
when considering high fault rates and large systems. For example, the simulation-based
performability of a 32-node system experiencing an MTBF_NODE rate of 60 seconds was
found to be 6.1% compared to the 49.1% reported by the Markov model. Again, the
Markov model did not capture the architectural dependencies of the SAR application
on the MDS node and thus the results overlooked its impact on the performability of
the larger systems. The simulative model was also used to gauge the overall system
throughput while executing SAR. The application produced its highest throughput (116
images per orbit) when executing on an 8-node system; however, in some cases, systems
beyond this size reported reductions in throughput due to contention at the MDS.
Finally, checkpointing was employed in an attempt to improve the fault tolerance of
the SAR application. Two storage options were explored such that checkpoint data was
either stored on the MDS node or neighboring data node. The results showed that for all
system sizes and fault rates, checkpointing to the MDS node reduced performability and
system throughput. Checkpointing to a neighboring data node was found to have slight
benefits in larger system sizes; however, the gains were much too small and checkpointing
was deemed unnecessary for the patch-based SAR application executing on the DM
system. During the study, certain system configurations resulted in each checkpoint
option producing similar performability values yet drastically different throughputs. This
observation suggested that while performability is a useful metric to gauge the robustness
of a degradable system under varying fault environments, it does not accurately represent
the efficiency for which an application executes on a system. Thus, the results from this
study reinforce the need to couple Markov and discrete-event simulation modeling for
comprehensive analyses of aerospace systems and applications.
After the SAR application was configured for optimal performability and throughput,
we explored architectural enhancements in order to identify and alleviate the bottlenecks
within the system. The results found that the MDS was the main throttling point of
the system due to the cumulative effects of each data node accessing and transmitting
data during various phases of SAR. To alleviate this contention point, we enhanced the
capabilities of the MDS storage device which allowed the system to nearly double its
throughput. We also observed similar speedups in system throughput by incorporating
additional MDS nodes into the system. The DM system using two MDS nodes received
a 2.0× boost in throughput while a design using three MDS nodes achieved a 2.5×
improvement under moderate fault conditions. As the fault rate increased, greater
speedups were observed for each enhancement over the baseline case. Shortened execution
times reduced the effects of the higher fault rates thus increasing the overall efficiency and
throughput of the system. The case study concluded with a performability evaluation of
the final flight system that incorporated twenty enhanced data nodes and three enhanced
MDS nodes. The proposed flight system was exposed to various fault rates and its
maximum throughput was observed to be approximately 587 images per orbit in relatively
light fault conditions and 316 images per orbit in the worst conditions studied. The
DM’s performability was observed to be over 99.5% when considering light to moderate
radiation conditions (i.e., less than one fault every two hours per data node).
The work conducted during this phase of research produced a novel, 3-stage analysis
process for predicting the performance and availability of high-performance, embedded
systems and applications. The FASE framework was extended by incorporating a fault
model library. This additional library allows FASE users to inject faults into arbitrary
systems in order to conduct in-depth availability and performability analyses. The case
studies demonstrated the capabilities of the enhanced framework as applied to the DM
system. The contributions and accomplishments of this work have been compiled into
two manuscripts. The first manuscript presents the simulation work involved with the
2D FFT study and was published in [48]. The second paper introduces the three-stage
analysis approach for predicting the performance and availability of radiation-susceptible
systems and applications. This paper was submitted to ACM Transactions on Embedded
Computing Systems [54].
CHAPTER 5
HYBRID SIMULATIONS TO IMPROVE THE ANALYSIS TIME OF DATA-INTENSIVE APPLICATIONS (PHASE 3)
The FASE framework presented in Chapter 3 laid the foundation for fast and
accurate performance predictions of arbitrary applications executing on a wide range
of systems. The information presented encompasses strategies to address the general
issues of characterizing applications, designing and developing components and systems,
and analyzing the performance of the virtual systems for the applications under
study. However, in practice, the techniques and tools developed succumb to lengthy
simulation times when studying data-intensive applications. These long analysis times are
exacerbated when considering large-scale, parallel systems, to the point at which simulation
becomes prohibitive.
This section presents the work conducted for the third and final phase of the
dissertation. This research focuses on reducing the time needed to simulate data-intensive
systems and applications via a novel, hybrid simulation approach. The section starts by
discussing the motivations followed by the presentation of background information and
related research. A detailed description of the hybrid modeling approach illustrates the
techniques employed to enhance and extend the FASE framework to speed up simulations
evaluating data-intensive applications. Experiments and results follow, validating the
success of the proposed technique. This section also presents a case study that
illustrates the accuracy and speed of the hybrid approach as it is applied to the DM
system described in Chapter 4. Finally, conclusions are drawn and key insight is offered.
5.1 Introduction
As scientists discover new methods for exploration and discovery of both earth- and
space-bound phenomena, the amount of data collected by the accompanying equipment
can become staggering. With this increase in data, new applications are needed to
analyze and interpret the information in order to identify the areas of interest. As a
result, key areas of science such as the geosciences, remote sensing, and systems biology
are employing applications that process typical datasets in the multi-gigabyte range
and beyond. For example, NASA satellite systems performing remote sensing tasks
generate 50 gigabytes of images per hour while multiple terabytes of data on the human
genetic code were collected for the Human Genome project [55]. In order to efficiently
process these large quantities of data, new systems must be designed that push the
limits of processing and I/O technologies. However, designing the most effective system
for a specific application set is a nontrivial task due to the vast number of technologies
and techniques available to designers. Simulation is often used to facilitate the design
process with discrete-event simulation being a particularly effective method to verify and
analyze specific characteristics of an architecture or protocol before deployment. This
capability allows developers to save vast amounts of time and money by circumventing the
development and evaluation of premature prototypes. It also expands the design space to
include architectures that use both new and emerging technologies thus allowing system
designers to select the best configuration for a particular set of applications.
Although discrete-event simulation provides many benefits, one major shortcoming
is the time required to analyze complex systems executing data-intensive applications.
Generally speaking, the time required to analyze a system using discrete-event simulation
is highly dependent on the number of discrete events generated by the model. In
high-fidelity simulation models, these large datasets are typically split into a considerable
number of fragments with each fragment processed individually in order to mimic the
behavior of the data transaction. This fragmentation generates a proportional number of
discrete events that must be scheduled and processed throughout the simulated system,
thus dramatically lengthening simulation time. Simulation times are exacerbated by the
fact that the majority of the candidate systems employ numerous distributed components
working in parallel to complete a given task. The combination of large, complex systems
processing sizeable datasets can cause simulations to run for days, if not longer. Such
lengthy analyses are often prohibitive due to unacceptable increases in design and
development times. Thus, methods for improving the efficiency and speed of discrete-event
simulations, while sacrificing as little accuracy as possible, are essential if simulation is to
remain an effective tool for prototyping and evaluating architectures that execute
data-intensive applications.
In this chapter, we present a novel approach to hybrid simulation modeling that aims
to reduce simulation time for data-intensive applications while retaining a high degree of
accuracy. Our approach combines the accuracy of function-level models with the speed
of analytical models and “micro-simulations” to achieve fast and accurate results. The
modeling procedure uses a technique called function-level training to collect performance
measurements from the simulated system. These measurements essentially take a snapshot
of the current state of the system and are used to calibrate the analytical models. The
calibrated analytical model is then employed to calculate the time required to complete
the current transaction assuming that the state of the model remains relatively unchanged
throughout its execution. Micro-Simulations also use measurements gathered during the
training period, though they are employed to account for device and contention delays at
components actively participating in the data transaction. The method operates with each
hybrid transaction beginning in the function-level training procedure in which functional
models are employed to collect statistics that characterize the current status of the system.
When the training period is complete, the analytical model calibrates itself using the
collected statistics, calculates the time required to complete the current transaction, and
schedules the transmission of a final data structure using the function-level model. Finally,
this last data structure traverses the system invoking micro-simulations at each component
it encounters until it reaches the destination device model. With this approach, the
redundant processing that typically occurs during large data transfers is replaced by a
single calculation (i.e., analytical model) and a fixed number of micro-simulations that
collectively approximate the ultimate outcome of these repetitious computations. The
resulting hybrid methodology supports the timely evaluation and analysis of a class of
applications that has traditionally stressed discrete-event simulators.
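To make this flow concrete, the sketch below traces a single hybrid transaction through the three mechanisms just described. It is illustrative Python, not part of the FASE implementation; the class and method names are hypothetical, and the delay calculation anticipates the formula introduced later as Equation 5-1.

    import math

    FRAG_MAX = 1024  # assumed maximum fragment size in bytes

    class HybridSource:
        """Hypothetical source model illustrating the hybrid transaction flow."""

        def __init__(self, output_rate):
            self.output_rate = output_rate   # fragments/second, as measured
            self.clock = 0.0                 # local simulation time

        def send_fragment(self, size):
            # Stand-in for a discrete-event fragment transmission.
            self.clock += 1.0 / self.output_rate

        def run_transaction(self, total_bytes, F):
            sent = 0
            for _ in range(F):               # 1) function-level training
                self.send_fragment(FRAG_MAX)
                sent += FRAG_MAX
            remaining = total_bytes - sent   # the head fragment would be sent
                                             # here, starting the downstream
                                             # micro-simulations
            # 2) analytical model (Equation 5-1, Section 5.3.2)
            delay = (math.ceil(remaining / FRAG_MAX) - 1) / self.output_rate
            self.clock += delay              # no discrete events while delayed
            return self.clock                # 3) tail fragment sent at this time

    print(HybridSource(output_rate=0.5).run_transaction(10 * 1024, F=4))  # 18.0 s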
5.2 Background and Related Research
Within the past decade, technology has advanced in leaps and bounds to support
both the acquisition and storage of large quantities of data relating to a wide range
of fields including finance, commerce, scientific computing, and national security. Due
to the overwhelming quantities of data collected, the importance of processing and
understanding the information has led to a new field of study within computer science
called knowledge discovery in databases (KDD). KDD, as defined in [56], is “. . . the
non-trivial process of identifying valid, novel, potentially useful, and ultimately
understandable patterns in data.” With the emergence of KDD, many new research
initiatives have commenced to develop and optimize algorithms and applications that
analyze data residing in large data repositories while other projects aim to design powerful
computational systems to effectively sift through the large datasets to discover useful
information. The Data-Intensive Computing Initiative (DICI) at Pacific Northwest
National Laboratory is one such research project that is investigating scalable solutions
in both software and hardware to support timely, effective analyses of large quantities of
data [57]. However, in order to analyze the capabilities of new hardware designs, the cost
of building prototype solutions can be prohibitive. As a result, tools such as discrete-event
simulators are imperative for performing extensive, yet cost-effective evaluations of a number
of design options.
Traditional, high-fidelity modeling approaches dictate that the behavior of a modeled
entity should be mimicked identically to ensure correct and accurate results. As a result,
much simulation time is used to fragment each transaction into smaller chunks that
are transferred and processed by the corresponding functional models. This modeling
approach is very accurate, but its scalability suffers due to large simulation times as the
dataset grows in size. Numerous solutions have been investigated in order to remedy this
scalability problem through both general and specific methods that attempt to speed
up simulations. One methodology, called staged simulation, targets the reduction of
discrete events in wireless network simulations. Staged simulation uses various techniques,
such as function caching and incremental computations, to decrease the amount of
redundant processing conducted thus supporting faster simulations of larger systems [58].
According to case studies, the method achieves a 30× improvement in simulation speed for
a 1,500-node system; however, many of the proposed techniques are uniquely designed for
wireless network simulations.
A general approach that has been investigated to improve simulation time is parallel
discrete-event simulation (PDES). PDES frameworks attempt to reduce simulation
time by distributing the workload among multiple processors [59]. PDES frameworks
typically fall into one of two categories: conservative or optimistic. In conservative PDES
environments, each participating node processes an event only when all pending events
anywhere in the simulation that may affect the considered event have been completed
[60]. By using a conservative approach, processors may remain idle for significant periods
of time waiting on other processors to signal that they can proceed with their next
scheduled event. CPSim [61] and Parsec [62] are two tools that support conservative
PDES. Conversely, optimistic PDES avoids idle cycles by allowing events to be processed
regardless of their affinities with concurrent or previous events residing in other nodes.
To support this capability and maintain correctness, detection and rollback mechanisms
are incorporated into the design [63]. The detection mechanism introduces overhead
due to the additional processing required to identify dependencies between events while
rollbacks delay simulations by repeating computations optimistically completed. Examples
of tools supporting optimistic PDES include SPaDES [64], WARPED [65], and ModSim
[66]. While parallel simulators have been proven effective for certain applications, they
often provide very poor parallel efficiency due to the high levels of interdependencies and
synchronizations inherent in discrete-event simulations. As a result, we look to alternative
methods for improving the efficiency of simulating data-intensive applications.
Fluid-based modeling provides a particularly successful approach to greatly reduce
the time needed to analyze large-scale systems processing sizeable datasets. Fluid-based
models attempt to abstract an entire data transaction as a single fluid representation,
called a flow. Typically, analytical models are used to define a flow’s behavior as well
as the interactions between multiple flows contending for a shared resource or channel.
Conversely, the behavior of a function-level transaction is captured by the collective
behavior of each individual fragment that composes the transaction as dictated by the
state of the system. By using a fluid-based model, an extremely large data transaction can
be modeled as a continuous event, without generating the large number of discrete events
that would otherwise slow the simulation. In cases where the simulation time is dominated
by the modeling of these large data transactions, such a technique has the potential to
significantly reduce simulation time. For these reasons, the use of fluid-based models
has become popular for applications such as large-scale network simulations [67].
Unfortunately, coarser-grained models are typically less accurate than function-level
models [68], and at times less efficient due to ripple effects caused by the interaction
of competing flows [69]. The ripple effect, discussed in [70], refers to the phenomenon
where the rate change of a single flow causes rate changes to other flows that propagate
throughout the system. In simulations handling a large number of flows, these ripples can
create enough extraneous events to significantly decrease the gains in simulation efficiency
obtained from using fluid-based models.
Several projects have proposed designs of hybrid simulation environments that
combine functional-level and fluid-based, analytical models for accurate and efficient
network simulations [71],[72],[73]. For each of these simulators, users typically decide
whether a network transaction is modeled as multiple, function-level fragments or as
a single, fluid-based flow. While the frameworks provide the flexibility of supporting
both modeling approaches, they suffer from a few shortcomings. First, fluid streams are
modeled solely using analytical models that consider traffic rates and buffer capacities.
Thus, the accuracy of the fluid transaction is based entirely on the analytical model’s
ability to approximate contention and fluctuating network behavior without the use of
detailed, functional-level models. Furthermore, each simulation environment is designed to
analyze only the network-specific characteristics of a system. Therefore, the frameworks
are unable to correctly model the complete behavior of a data-intensive application
and system. Finally, the capabilities of each simulator are demonstrated by focusing
on a single data stream in the presence of randomly generated background traffic. The
demonstrations suggest that difficulties could arise when attempting to model real traffic
generated by the entire system in order to mimic the behavior of an application.
Other research projects have looked beyond fluid-based modeling to reduce the
complexity of simulating data-intensive applications. A framework that combines
application emulators with a set of simulation models for dealing with large-scale, parallel
applications is described in [74]. The application emulators dynamically feed stimuli to
the simulator in the form of epochs, which represent groups of events and their high-level
data dependencies processed by a processor model. While the coarse-grained processor
models and the application emulators abstract away much of the work performed for a
given application, the framework still relies on high-fidelity bus and disk models, which
can hinder the simulation during very large data transactions.
Since the FASE framework places an emphasis on accurate and detailed modeling of
data transactions, it stands to benefit greatly from the hybrid modeling methodology. For
this work, the discrete-event models developed in Chapters 3 and 4 have been extended to
support the hybrid modeling methodologies proposed in this chapter. The remainder of
this chapter presents the hybrid-modeling extensions incorporated into FASE and results
from case studies that showcase the new capabilities of the extended framework.
5.3 Hybrid Simulation Approach
In this section, we introduce a hybrid simulation approach that can be applied to
a number of component types to produce fast and accurate simulation results. Before
providing the specific details on the methodology, we first define some basic terminology
used throughout the chapter as well as a few basic modeling concepts. In this chapter, we
refer to a generic data operation as a transaction, while a fragment and a flow correspond
to function-level and fluid-based modeling, respectively. Figure 5-1a illustrates a generic
hybrid system model consisting of three model types: 1) source, 2) path component, and 3)
sink. Each model type consists of both functional and fluid-based models and participates
in the modeling of the transactions within the system. The components labeled as sources
begin new data transactions based on operations created by the corresponding component
or received from an upper-layer component. The path components are intermediate
models that receive both fragments and flows and propagate each data type to the next
component in the path. The sink models signify the destinations for the transactions
and ultimately receive the actual data sent by the corresponding sources for further
processing. Also, logical channels connect the origin source, path components, and the
sink model participating in each data transaction. In order to relate these generic terms to
a real-world system, we illustrate a system (see Figure 5-1b) employing three transactions.
Transaction T1 corresponds to a remote disk write in which server A (source) transmits
data to a shared, remote disk (sink) using two shared switches (path components).
Transaction T2 is a data transfer in which the workstation receives a message from Server
B, and Transaction T3 represents a write access to the remote disk by server B. These
transactions are used throughout this section to exemplify the techniques employed in the
proposed approach.
Our hybrid modeling approach incorporates three main steps in order to support
quick yet accurate results: 1) function-level training, 2) analytical modeling, and 3)
micro-simulation. As illustrated in Figure 5-2, the first step of a hybrid simulation
(a) Generic system (b) Real-world system
Figure 5-1. High-level example systems employing hybrid modeling
(i.e., function-level training) employs the functional models at all hybrid model types
participating in the transaction to collect performance measurements that characterize
the current state of the system. These measurements are then used in the analytical
modeling stage to configure the corresponding model and calculate the length of time
the source is busy processing the current transaction. The third step, micro-simulation,
uses the metrics collected by path components and sinks during function-level training
to compute delays experienced by each flow due to internal mechanisms and contention.
Data flows through our hybrid systems as follows. First, the source model receives the
data and transmits F fragments using its function-level model.
As these fragments traverse the system, each model collects statistics on the behavior of
the transaction. After F measurements have been made by the source, one more fragment,
called the head fragment, is transmitted and the statistics gathered by the source are fed
to the analytical model. The analytical model calculates a time based on the statistics and
effectively delays the component for the computed time. While delayed, the device can
still respond to remote requests from other components though no further discrete events
are created by the current data transaction. After the calculated time has elapsed, the
source sends one final fragment, deemed the tail fragment, using its function-level models.
This fragment can be delayed at each path-component and sink model according to the
corresponding micro-simulations that occur at each model type. The data transaction
is complete when the tail fragment reaches the sink model. The following subsections
present in-depth details on the three steps employed in our hybrid simulation approach
and discuss the interdependencies between each stage.
Figure 5-2. High-level diagram of hybrid simulation approach
5.3.1 Function-Level Training
Data-intensive applications typically perform numerous operations that manipulate
large quantities of data. In order to effectively model these operations, we consider them
as a two-stage process. The first stage, called the transitional period, represents the period
in which the system begins a new operation. During the transitional period, the system
experiences an increase in data traversing between components while these devices take
the necessary actions to adjust to the new influx of data. The most common impacts
observed in each affected component are buffer growth and increased contention. Once
the system entities have adjusted to the changes, the operation reaches its steady-state
period. In this stage, the system is effectively in equilibrium and the output rate of
each component is highly predictable when considering identical inputs. Our hybrid
approach uses these stages by applying the best suited models for each in order to ensure
accuracy while providing opportunities to reduce simulation time. Specifically, we employ
function-level models during the transitional period and then switch to the fluid-based
models when a steady state has been reached.
The use of high-fidelity models during the transitional period allows us to accurately
capture fine-grained events that have the potential to greatly impact the system’s
overall performance. However, once the operation has reached its steady-state period,
an analytical, fluid-based model can be employed. By using the analytical model,
numerous redundant computations can be replaced by a single calculation thus saving
a significant amount of time proportional to the size of the data corresponding to the
current transaction. One difficulty with this approach is configuring the analytical model
to correctly represent the system’s current state. To overcome this challenge, our hybrid
simulation approach collects statistics on specific attributes of the system as experienced
by the corresponding component while using the function-level models. By collecting these
measurements, we can effectively “train” our analytical model to capture the behavior of
the component based on the state of the system. This process of collecting performance
metrics and other statistics during the transitional period is called function-level training.
Function-level training is conducted at the beginning of each data transaction to
ensure the accuracy of each modeled transaction. As shown in Figure 5-3, the process
begins with the source model transmitting the first F fragments of the data transaction
using its functional model. While these fragments traverse the system, the source keeps
track of key statistics such as the departure rate of fragments as dictated by the system.
Meanwhile, the path-component and sink models monitor the stream of fragments received
from each source in order to calculate internal statistics used within the micro-simulation
stage. Once F measurements have been collected by the source, it transmits a final
fragment, called the head fragment, that contains information, such as the transaction’s
remaining data size, that is used during micro-simulations within the path-component
and sink models. This event also marks the time when the source switches from its
function-level implementation to its fluid-based design. Sections 5.3.2 and 5.3.3 provide
in-depth details on the fluid-based, analytical model and micro-simulations, respectively.
Figure 5-3. Function-level training procedure
The source model can be configured to switch between its function-level and
fluid-based, analytical models according to two user-definable parameters. The first
parameter allows users to specify the number of measurements, F, that should be taken
using the function-level model before switching to the fluid model. The second defines a
specific data size under which the source is forced to use only its functional model. The
first parameter is ideally set to a value that supports the use of the function-level model
during the entire transitional period of a transaction while also capturing the steady-state
statistics for the various component models. If the training period is too short, the
component models could potentially capture metrics that misrepresent the steady-state
conditions for the given transaction. However, if the training period is too long, simulation
time is wasted due to the scheduling and processing of superfluous discrete events. The
second parameter allows the user to control the minimum data size that can use the
hybrid approach. This parameter is important for two reasons. First, each transaction
incurs overhead when employing the hybrid approach due to the extra computation
required at each component to calculate and log statistics. As a result, this overhead has
a greater impact on smaller transactions and can potentially increase simulation time as
compared to functional modeling. More importantly, the hybrid approach was designed
for large transactions, and therefore, can suffer from inaccuracies when considering smaller
transactions. The following section provides details on the issues regarding the accuracy
of the analytical model.
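A minimal sketch of this mode selection is shown below, assuming hypothetical parameter names train_count (the F training measurements) and min_hybrid_bytes (the minimum data size for hybrid operation):

    def choose_mode(transaction_bytes, train_count, min_hybrid_bytes, frag_max):
        # Small transactions never leave the function-level model (parameter 2).
        if transaction_bytes < min_hybrid_bytes:
            return "function-level"
        # A transaction must outlast its training period to benefit (parameter 1).
        if transaction_bytes <= train_count * frag_max:
            return "function-level"
        return "hybrid"  # train for train_count fragments, then switch to fluid

    print(choose_mode(10 * 1024, train_count=4, min_hybrid_bytes=0, frag_max=1024))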
In order to illustrate function-level training, we examine the three transactions
introduced in the previous section and displayed in Figure 5-1b. Consider data sizes
of 10 KB, 8 KB, and 10 KB for Transactions T1, T2, and T3, respectively. Also, the
maximum fragment size is 1 KB and all sources’ training periods are configured to
take four measurements (i.e., F = 4). In this example, each source transmits 4 KB of
data during its training period. The source for T1 (i.e., Server A) determines it can
send 0.5 maximum-sized fragments per second while Server B can transmit 0.25 and
1.0 maximum-sized fragments per second for Transactions T2 and T3, respectively.
Furthermore, the path-component and sink models calculate similar arrival rates for the
transactions during the function-level training periods. After the four measurements are
collected, each source transmits a head fragment containing the necessary information to
be used by the path-component and sink models during their micro-simulations. Also,
the sources use the calculated fragment output rates to calibrate their analytical models
to determine the times the sources are busy processing their current transactions. The
following sections continue this example by outlining how the analytical models and
micro-simulations are used to model the behavior of the three transactions traversing the
system.
5.3.2 Analytical Modeling
After the training period completes, the resulting performance metrics collected at
the source model are used by a fluid-based, analytical model (see Figure 5-2) to calculate
the time in which the current transaction is expected to complete. A timer is set within
the source model based on the calculated time. When the timer expires, the source model
outputs one last fragment, the tail fragment, which signifies the end of the transaction.
The tail fragment is used by the path-component and sink models to calculate final delays
incurred by each flow. These delays are computed using micro-simulations, which are
described in more detail in the following section.
In order to ensure accuracy within the analytical model, three requirements must
be satisfied. First, the model must be capable of capturing the steady-state behavior
of the component that it represents. This requirement can be fulfilled by employing
validated derivations that are either custom-built or defined in literature. Second, the
model should use parameters that correspond to the current state of the system. We
address this requirement through function-level training as described in the previous
section. Finally, since the analytical model is invoked only one time to calculate the time
required to complete each transaction, it inherently assumes the state of the system does
not change within the calculated time period. However, complex systems typically have
numerous transactions interacting and causing changes within the system. As a result, the
last constraint requires a means to recalibrate the analytical model when a system update
potentially affects the behavior of the modeled component. Instead of fully satisfying this
requirement, we allow the analytical model to suffer from any inaccuracies that may occur
due to system updates. However, we introduce a mechanism described in the subsequent
section to compensate for these inaccuracies.
The speedup attained using the analytical model depends on the complexity of the
mechanisms replaced and the number of discrete events created using the equivalent
function-level model. The greater the model’s complexity and the larger the number
of discrete events created by the functional model, the more speedup is achieved using
the hybrid approach. Furthermore, the complexity of the path-component and sink
models also comes into play since discrete events created at the source normally lead to
nontrivial computations at lower-layer components as well as the potential to create more
events that further increase simulation times. Finally, the reduction of discrete events
enabled by analytical models has the potential to greatly trim the memory requirements
of the simulation. As a result, the employment of the hybrid approach allows users to
more efficiently analyze larger systems or attain even greater speedups especially when
replacing function-level systems that require the use of disk swap space. The combination
of all these factors imbues the analytical model with the potential to greatly reduce the
simulation times observed in pure function-level simulations.
During each function-level training period for the three transactions considered in our
example system, 4 KB of data was sent. Also, Server A and Server B calculated fragment
output rates of 0.5, 0.25, and 1.0 fragments per second for Transactions T1, T2, and T3,
respectively. For this example, we assume both sources use the simple analytical model
expressed as
\[
\mathrm{SourceDelay} = \frac{\left\lceil T / \mathrm{Frag}_{\max} \right\rceil - 1}{O}
\tag{5-1}
\]
where O is the fragment output rate, T is the remaining size of the transaction, and
Fragmax is the maximum fragment size. Using this analytical model, the sources delay
for 10, 12, and 5 seconds before sending the tail fragments for Transactions T1, T2, and
T3, respectively. After delaying for the specified amount of time, the corresponding source
transmits the current transaction’s tail fragment, signifying the end of the transfer. The
next section details the role of the micro-simulation technique within the example system.
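The delays above follow directly from Equation 5-1. As a quick check, a few lines of Python (the helper name is hypothetical) reproduce the 10, 12, and 5 second delays:

    import math

    def source_delay(remaining_bytes, frag_max, output_rate):
        # Equation 5-1: SourceDelay = (ceil(T / Frag_max) - 1) / O
        return (math.ceil(remaining_bytes / frag_max) - 1) / output_rate

    KB = 1024  # 1 KB fragments; 4 KB of each transaction was sent during training
    print(source_delay(6 * KB, KB, 0.5))   # T1: 10 KB - 4 KB remaining -> 10.0 s
    print(source_delay(4 * KB, KB, 0.25))  # T2:  8 KB - 4 KB remaining -> 12.0 s
    print(source_delay(6 * KB, KB, 1.0))   # T3: 10 KB - 4 KB remaining ->  5.0 s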
5.3.3 Micro-Simulation
Thus far, our hybrid simulation approach has addressed source-based delays as
configured according to a steady-state view of the system. However, the state of a system
frequently changes resulting in potential impacts that propagate back to the source
models due to feedback mechanisms. There are two options available to handle this
situation. First, the feedback mechanisms can be incorporated into the fluid models such
that changes in the system cause source models to recalibrate their analytical equations
according to the new state of the system. However, this approach can greatly suffer from
an effect similar to the “ripple effect” reported in [70]. That is, changes can continuously
cause source models to recalibrate, thus reverting the hybrid simulation approach into
nothing more than a function-level approach with additional overhead.
The second alternative takes advantage of a simple observation: the system changes
most likely to have a substantial effect on a transaction's performance are the additions
or removals of other transactions competing for the same resource.
By using this observation, the second option allows the source to complete as originally
scheduled while the path-component and sink models account for delays caused by
the additional transactions creating contention within the system. This approach,
deemed micro-simulation, uses FIFO queues and the performance statistics collected
during function-level training to calculate device and contention delays incurred by
each transaction. Micro-Simulation effectively reduces the complexity of the modeled
component into a simple queuing system that approximates the delays experienced by
each flow. In order to simplify the system, the device’s behavior is characterized by a
single parameter, service rate, while each flow is represented with three parameters:
1) start time, 2) arrival rate, and 3) number of fragments. The service rate parameter
specifies the rate at which the device can complete a virtual fragment. This parameter
is typically calculated based on the component’s performance attributes such as latency
and throughput as well as the fragment size. The start time specifies the time at which
a flow first arrives at the component, the arrival rate denotes the rate at which a virtual
fragment arrives at the device, and the number of fragments parameter indicates the
number of virtual fragments represented in a flow. The arrival rate for each flow is
calculated at the path-component and sink models during the flow’s function-level training
while the start time and number of fragments are defined in the head fragment. Due to
queuing complexities, computer simulation is used to calculate the delays for each flow in
a path-component or sink model. Micro-Simulations are conducted only when a new flow
begins (as signified by a head fragment) or an existing flow completes (as indicated by
a tail fragment). For both events, the micro-simulation’s state is updated to the current
simulation time; however, predictions of a flow’s completion time are only made when
its tail fragment is received before the micro-simulation has progressed to a state in which
the flow has completed. Since the addition of new transactions cannot be forecasted, yet
will likely affect the delays experienced by existing flows, speculative calculations of the
completion times can lead to wasted computation. As a result, micro-simulations perform
the minimum amount of computations necessary to calculate the delays experienced by a
flow, thus minimizing the simulation time required by this technique.
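The sketch below captures the essence of such a micro-simulation for a single shared device: a FIFO queue of virtual fragments driven by the device's service rate and each flow's start time, arrival rate, and fragment count. It is a simplified, self-contained illustration with hypothetical names; the actual micro-simulations advance lazily, only up to the current simulation time, rather than running every flow to completion as done here.

    import heapq

    def micro_simulate(flows, service_rate):
        # flows: list of (start_time, arrival_rate, num_fragments) per flow.
        # The device serves one virtual fragment every 1/service_rate seconds.
        arrivals = []
        for fid, (start, rate, count) in enumerate(flows):
            for k in range(count):
                heapq.heappush(arrivals, (start + k / rate, fid))
        device_free = 0.0
        completion = {}
        while arrivals:
            t, fid = heapq.heappop(arrivals)          # FIFO by arrival time
            device_free = max(device_free, t) + 1.0 / service_rate
            completion[fid] = device_free             # time last fragment finished
        return completion

    # Three contending flows sharing one device (cf. the scenario of Figure 5-4).
    print(micro_simulate([(0.0, 0.5, 3), (2.0, 0.25, 2), (3.0, 1.0, 4)],
                         service_rate=1.0))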
Micro-Simulation improves simulation time for two reasons. First, by reducing the
system into a queuing problem, micro-simulation abstracts away the complexities involved
with modeling large transactions employing potentially complicated components. The
queuing system can be quickly processed using computer simulation, thus decreasing the
computation time required to explicitly simulate the system. Second, micro-simulations
do not create discrete events and, thus, do not suffer from scheduling or other processing
delays external to the model that have the potential to drastically increase the overall
simulation time.
Figure 5-4 provides the micro-simulation that occurs at Switch B illustrated in Figure
5-1b. For this example, we assume that Switch B uses a bus-based backplane to route
fragments, and therefore, its micro-simulations model any contention between transactions
sharing this resource. Also, Flow #1, Flow #2, and Flow #3 in Figure 5-4 correspond
to Transactions T1, T2, and T3 in Figure 5-1b, respectively. The characterization
parameters of the switch component and three flows are provided at the top, the state
of the micro-simulation’s queue with respect to time is displayed in the middle, and the
micro-simulation updates and key events that trigger them are shown at the bottom. The
micro-simulation’s queue state illustrates the order and number of virtual fragments from
each flow residing in the queue at a given point in time.
In order to illustrate the three flow types that can occur in our hybrid simulations, we
assume that the transactions’ start times are interleaved according to the values displayed
in Figure 5-4 and each tail fragment sent by the corresponding source experiences a two
second routing delay before it is received by Switch B. Flow #1 illustrates the case in
which the component receives the flow’s tail fragment and then calculates the flow’s
completion time to be the current simulation time. In this case, the tail fragment is
immediately processed by the component since no contention delays occurred. Flow
#2 represents a flow that encounters source-based delays. That is, the component
receives the flow’s tail fragment after the flow’s completion time as calculated by the
micro-simulation. Similar to Flow #1, the tail fragment is processed immediately in
this case. Finally, Flow #3 demonstrates the process followed when the tail fragment
is received by the component but the micro-simulation determines that the flow is not
complete. In this case, the device schedules the processing of the tail fragment to the
corresponding flow’s predicted completion time. Note that the micro-simulation’s state
never progresses beyond the current simulation time. However, predictions are made
to calculate the completion time of Flow #3 when the tail fragment arrives before the
micro-simulation’s state has progressed to the point in which the flow is determined to be
complete.
While the micro-simulation technique eliminates potential slowdowns due to the
ripple effect phenomenon, it does not compensate for all source-based inaccuracies that
may occur when modeling a transaction. Consider the following scenario. Transaction
A begins its training period and causes contention at a shared resource already in use
by Transaction B. Assuming the shared resource can handle only one transaction at a
time without causing delays, the sources of both transactions observe slower rates at
which their data traverses the system. Transaction A finishes its training period just as
Figure 5-4. Example three-flow micro-simulation
Transaction B completes, thus configuring its analytical source model to represent the
component’s behavior in the presence of contention. However, the contention no longer
exists since Transaction B completed. The resulting scenario cannot be resolved through
micro-simulations and requires a feedback mechanism that can notify the source of the
change in the system so it can retrain. However, the feedback loop leads back to the ripple
effect that micro-simulations are designed to avoid. Section 5.5 discusses proposed future
work that investigates techniques that incorporate a feedback mechanism to remedy these
inaccuracies while mitigating the ripple effect.
Throughout this study, we consider only FIFO queues, although our design can be
easily extended to support priority-based queues. Furthermore, we assume infinite queue
capacities in order to simplify the modeling effort. However, the necessary mechanisms are
in place to support finite queue sizes. Finally, under this hybrid approach, the source cannot
proceed with the next transaction until the tail fragment has been fully processed. As
a result, mechanisms that may allow transactions to complete early (e.g., non-blocking
operations) should be disabled or carefully controlled.
5.4 Results and Analysis
In this section we discuss the results of the analyses conducted to show the
capabilities of our hybrid simulation approach as it is applied to the NASA Dependable
Multiprocessor (see Chapter 4 for details). The first section presents the simulation setup
followed by validation results using low-level benchmarks to verify the accuracy and
showcase the speed of the hybrid models. Next, we evaluate the speed and accuracy of the
hybrid models when considering contention at shared resources. Contention stresses the
proposed hybrid methodology due to the lack of feedback loops that adjust the analytical
models within the source components. After the speed and accuracy of our approach have
been showcased, we apply the technique to analyze the performance of the DM system
executing a data-intensive, hyperspectral imaging (HSI) application.
5.4.1 Simulation Setup
The key simulation models employed for the following experiments are listed in
Table 5-1. Similar to the models developed in the previous chapters, these models
were also created using MLD. Each model captures the functionality and behavior of
the corresponding technology while adhering to the FASE methodology of balancing
speed and accuracy. From these core component models, node and system models were
developed. Finally, Table 5-2 lists the key system parameters configured to best match the
performance of the prototype DM system.
The DM system incorporates two main subsystems that can benefit from hybrid
simulation models: the network and the mass data store. Both subsystems use the HAM for
inter-node communication, which in turn employs TCP/IP as its primary communication
protocol over a Gigabit Ethernet link. Therefore, we retrofitted the HAM model in the
DM model library and the TCP, IP, and Ethernet models found in the pre-built FASE
library with the appropriate hybrid simulation mechanisms. The MDS Server model was
also modified to incorporate hybrid mechanisms to model large data accesses to the MDS
device.
Table 5-1. Summary of relevant simulation models

  Library  Component                      Description
  DM       High-Availability Middleware   Provides reliable communication between nodes in the system.
           MDS Server                     Handles data access requests to the mass data store.
  FASE     TCP Layer                      Provides the TCP protocol for reliable communication between nodes.
           IP Layer                       Provides the IP protocol for all network transfers.
           Ethernet NIC                   Provides the Ethernet protocol for all network transfers. Supports multiple ports.
           Ethernet Switch                Provides Ethernet connectivity between nodes. Supports a variety of backplane and routing options.
Table 5-2. Key system parameters
Parameter name Value
Processor power 1200 MIPS, 600 MFLOPS
MPI maximum throughput 57 MB/s
MPI message latency 13.6 ms
HAM buffer size 2000000 bytes
Network bandwidth Non-blocking 1000 Mb/s
Network switch latency 5 µs
MDS bandwidth (write/read) 60/40 MB/s
MDS latency (write/read) 300/500 µs
MDS open file overhead 8 ms
The HAM model acts as a source model for most data transfers between nodes.
The model receives data transactions from the application layer and fragments
each transaction into messages with a maximum size of 14000 bytes when using its
function-level implementation. During the training period, the round-trip time of each
message is calculated in order to calibrate the analytical model used in the fluid-based
implementation. The HAM model supports non-blocking communications via buffering
techniques.
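As a sketch of this calibration step (illustrative only; the internals of the MLD model are not reproduced here, and the averaging is an assumption), the round-trip times measured during training can be reduced to the message output rate that feeds the source's analytical model:

    HAM_MSG_MAX = 14000  # bytes, the HAM's maximum message size

    def calibrate_output_rate(round_trip_times):
        # Messages per second sustained during function-level training.
        avg_rtt = sum(round_trip_times) / len(round_trip_times)
        return 1.0 / avg_rtt

    # Ten training messages (F = 10 for the HAM, per Table 5-3):
    print(calibrate_output_rate([0.00025] * 10))  # -> 4000 messages/s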
The TCP, IP, Ethernet NIC, and Ethernet switch models are retrofitted with hybrid
mechanisms. Each of these components acts as a path component and therefore collects
statistics on the flows traversing through them in order to calculate delays incurred at the
device via micro-simulations. The TCP model also acts as a source model since it has the
capability to fragment the messages received from the HAM into TCP segments with a
maximum size of 1460 bytes. The TCP model is outfitted with an analytical model that
uses the congestion window, maximum window size, and acknowledgement rate to calculate
the time needed to transmit N bytes of data. Currently, our fluid model does not account
for TCP segments dropped in transit; however, the model can be extended to collect the
necessary metrics during training and the analytical model modified to account for the
effects of dropped segments.
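For intuition, a window-based transfer-time estimate of the kind described above can be sketched as follows. This is a deliberately simplified illustration under assumed steady-state conditions (fixed effective window, no segment loss, hypothetical names), not the exact analytical model used in the TCP source:

    TCP_SEG_MAX = 1460  # bytes, the TCP model's maximum segment size

    def tcp_transfer_time(n_bytes, window_bytes, ack_interval):
        # With an effective window of window_bytes acknowledged every
        # ack_interval seconds, steady-state throughput is roughly
        # window_bytes / ack_interval bytes per second.
        throughput = window_bytes / ack_interval
        return n_bytes / throughput

    # 1 MB transfer with a 64 KB window acknowledged every millisecond:
    print(tcp_transfer_time(2**20, 64 * 2**10, 0.001))  # -> 0.016 s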
Finally, the MDS model is also retrofitted with hybrid simulation mechanisms to
provide sequential accesses for submitted disk I/O jobs. The MDS model represents
a sink model and therefore uses micro-simulations to calculate delays incurred while
accessing the mass storage device. More specifically, the queue service rate of the MDS’s
micro-simulations is configured according to the bandwidth and latency of the storage
device while the arrival rates of each flow are calculated based on the performance metrics
gathered during each transaction’s training period. Due to the features of the MDS
server, read accesses retain exclusive ownership of the MDS storage device throughout the
duration of the access while write accesses can be interleaved in 14000-byte messages, the
maximum message size of the HAM.
For this study, the HAM and TCP source models are configured as shown in Table
5-3. The training values were chosen to best balance the accuracy and speed of the
hybrid simulations used to analyze the DM system. The following sections show that
these values provide sufficient training periods to produce fast yet accurate results from
each transaction. The minimum hybrid message size for each source model is initially
set to zero in order to evaluate the hybrid methodology for all message sizes. It should
be noted that each simulation presents a unique case for which these parameters should
be calibrated to best accommodate the characteristics of the systems and applications of
interest.
Table 5-3. Hybrid source model parameters

  Model  Parameter name                        Initial value
  HAM    Function-level training measurements  10
         Minimum hybrid message size           0 MB
  TCP    Function-level training measurements  100
         Minimum hybrid message size           0 MB
All simulations are conducted on dedicated compute nodes with a nominal number
of processes executing in the background to minimize the noise experienced during the
experiments. Each node is configured with a quad-core, 2.4 GHz Xeon processor with 2
GB of main memory running a 64-bit Linux variant using kernel 2.6.9-55.
5.4.2 Performance Modeling
The first study conducted uses two simple benchmarks, PingPong and MDSTest, to
show the accuracy and speed of the hybrid simulation approach under ideal conditions
(i.e., no resource contention and minimal system state changes). The PingPong benchmark
transfers data between two data nodes while the MDSTest benchmark writes and reads
data to and from the MDS node. For both programs, the data transfers range from one
byte to four gigabytes. The small-scale DM testbed system is used to collect experimental
measurements against which results from both the function-level and hybrid models are
compared. Consequently, the results not only showcase the capabilities of our modeling
approach, but also validate the DM model’s accuracy.
From Figure 5-5 and Figure 5-6, one can observe that both modeling approaches
closely reproduce the throughputs observed in the testbed system. For the PingPong
benchmark, we calculate nearly identical mean relative errors of 1.24% for the functional
and hybrid models when compared to the experimental measurements. For this particular
benchmark, the error found between the two modeling approaches was negligible. The
MDS benchmark tests produced similar results with mean relative errors of 1.50% for
writes and 3.01% for reads when using the hybrid approach as compared to experimentally
collected measurements. Again, the MDSTest benchmark showed negligible errors between
the two modeling approaches.
Figure 5-5. PingPong accuracy results
Speedup results for the PingPong benchmark are shown in Figure 5-7. From the
figure, we see that employing the hybrid simulation approach in systems executing
applications that transfer large data sizes can greatly improve simulation times. In fact,
we observe an order of magnitude speedup for datasets as small as 4 MB, while 4 GB
data transfers achieve a speedup of nearly 1850×. The large speedups observed at the
larger data sizes are directly proportional to the number of discrete events eliminated
through the use of the hybrid simulation approach. For example, a 4 GB function-level
transfer using TCP requires the creation, scheduling, and processing of approximately 2.9
million discrete events while the currently configured hybrid approach uses no more than
1000 discrete events to simulate the same transaction. By simply dividing the number of
Figure 5-6. MDSTest accuracy results
discrete events generated using both approaches, we calculate an ideal speedup of around
2940× thus verifying such large gains using our method.
The MDSTest speedup results (see Figure 5-8) show similar trends to those observed
in the PingPong benchmark. Over 10× speedups are achieved at 4 MB datasets and
maximum speedups of over 1700× and 1800× are observed at 4 GB datasets for writes and
reads, respectively. The MDSTest showed larger speedups for read operations due to a
slightly smaller amount of computation required to conduct micro-simulations at the MDS
since reads are sequentially executed with exclusive access to the device.
Both benchmarks show speedups that begin to level off past the 1 GB message size.
This behavior is due to a 1 GB message size limitation placed on data transactions that
use the TCP model in order to avoid problems associated with its 32-bit variables that
maintain its current state for mechanisms such as windowing and acknowledgements. It
should be noted that these restrictions are only placed on component models that have
inherent limitations that can potentially cause problems when considering very large
data sizes. Also, both benchmarks show increasing slowdowns between 1 KB and 256 KB
Figure 5-7. PingPong speedup results
Figure 5-8. MDSTest speedup results
messages, which signify additional overhead when using the proposed hybrid simulation
approach. This added overhead stems from gathering statistics on the
increasing number of fragments used in each transaction. However, once a message size of
512 KB is reached, our hybrid approach overcomes this logging penalty and shows positive
speedup.
The benchmarks used in this study represent best-case scenarios for using the hybrid
simulation technique. This quick study shows that our hybrid approach can provide
substantial reductions in simulation time while having little impact on the accuracy of the
results. However, neither benchmark experienced conditions that could potentially cause
inaccuracies in the hybrid models and thus represent the best-case numbers achievable
by the technique as it is currently configured. More specifically, neither benchmark
causes system state changes while previously configured transactions are executing. The
following section investigates the accuracy and speedup attainable by the hybrid method
when the system is exposed to contention and thus numerous state updates.
5.4.3 Contention Modeling
Now that we have shown the accuracy and speedups achieved using our hybrid
models under relatively ideal conditions (i.e., little to no external effects influencing
a transaction), we investigate the impacts of using the technique under more extreme
conditions. In this study, we introduce contention into the system to observe how the
hybrid models react with regards to simulation speed and accuracy. For this test, we
use the MDSTest benchmark while increasing the number of nodes that simultaneously
access the MDS from two to thirty-two. This scenario creates contention at two shared
resources within the system – the output port in the Ethernet switch attaching the MDS
node to the network and the MDS storage device. For this benchmark, each node involved
first writes N bytes of data to the MDS, where N ranges from 1 B to 128 MB, and then
reads the data back with a synchronization point between each operation to maximize the
amount of contention and thus the stress on the hybrid simulation approach.
Figure 5-9 illustrates the relative errors and speedups observed when comparing the
hybrid write and read access times to the results obtained via the function-level models.
Data sizes less than 64 KB are not displayed since the hybrid models use only their
function-level implementations, consequently producing identical results between the two
simulation approaches. From Figure 5-9a, one can see that the hybrid write accesses show
relatively large deviations in accuracy at smaller data sizes. In fact, a maximum error
of just over 46% was calculated for a 256 KB dataset regardless of the number of flows
tested. Furthermore, as the data size increases, the general trend observed is a decrease in
error. These observations suggest that although the fluid-based models do not adequately
represent the behavior of the source models at smaller data sizes, they are much more
accurate at larger datasets. Meanwhile, for larger flow counts, we observe a minor increase
in error when transitioning from 1 MB to 2 MB data transactions. This increase is likely
due to abstractions made within the hybrid HAM model with respect to its buffer size
(recall from Table 5-2 that the HAM’s buffer size was set to 2 MB) and the non-blocking
functionality provided by this buffer. Figure 5-9b shows that the hybrid read accesses
perform nearly identically to the function-level operations with a maximum observed error
of 0.26% occurring at 64 KB. From the figure, we also observe that the number of nodes
participating in the read portion of the MDSTest benchmark has a minimal effect on the
accuracy of the hybrid simulation approach due to the serialization of accesses conducted
by the MDS Server.
Figure 5-9c and Figure 5-9d show the speedups achieved for the hybrid write and
read accesses over the various message sizes and flow counts. From the figures, one can
see that both operations show significant gains in speedup as the data size increases. The
maximum speedups, 595× and 760×, for the write and read operations, respectively, are
observed at 128 MB. A minimum speedup of 0.75× (i.e., slowdown of 25%) is observed
for both writes and reads, which represents the overhead associated with the extra
computation required to calculate and log statistics when using the hybrid approach.
(a) Write errors (b) Read errors
(c) Write speedups (d) Read speedups
Figure 5-9. MDSTest accuracy and speedup results using hybrid modeling approach
While large errors are observed in the hybrid models when processing smaller write
operations, we must remember that the proposed approach is designed to model large
data transactions. As a result, we identify a cross-over point below which data transactions
within the DM system use only the function-level models rather than the hybrid models in
order to remedy the large errors at small data sizes while still achieving speedups for large
data sizes. To pinpoint this cross-over point, we calculate the speedup-to-error ratios
for the write accesses (see Figure 5-9a and Figure 5-9c) at each data size and select the
value that sustains a ratio greater than one for all flow counts. When this ratio is greater
than one, the speedup is larger than the error thus suggesting that the benefit of the
technique outweighs its inaccuracies. It should be noted that while this ratio is useful to
quickly identify a cross-over point, it can provide values that result in potentially large
inaccuracies in the cases where both the speedup and error values are large. For this study
we find that the 4 MB data size produced ratios of one or greater for all flow counts.
However, we select 8 MB as the minimum message size to use the hybrid models since
we desire single-digit errors for all flow counts as well. In the next section, we explore the
impact of configuring the HAM and TCP models to use this value as its minimum hybrid
message size on a real data-intensive application.
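Expressed as code, the cross-over selection is a small search over the measured curves. The sketch below captures only the ratio rule described above (the single-digit-error refinement is applied by inspection); the arrays are placeholders rather than the values plotted in Figure 5-9:

    def find_crossover(sizes, speedups, errors):
        # sizes: candidate data sizes in increasing order.
        # speedups[size][flows] and errors[size][flows] (error in percent).
        # Return the first size whose speedup-to-error ratio exceeds one
        # for every flow count tested.
        for size in sizes:
            if all(speedups[size][f] > errors[size][f]
                   for f in speedups[size]):
                return size
        return None

    # Placeholder data for two flow counts (2 and 32 nodes):
    speedups = {"2MB": {2: 1.5, 32: 0.9}, "4MB": {2: 4.0, 32: 2.5}}
    errors   = {"2MB": {2: 9.0, 32: 20.0}, "4MB": {2: 3.5, 32: 2.0}}
    print(find_crossover(["2MB", "4MB"], speedups, errors))  # -> "4MB"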
5.4.4 Case Study
Hyperspectral imaging is a technique that combines conventional imaging and
spectroscopy to identify and classify various objects within a 3D image. HSI is used in
applications that include mapping, reconnaissance and surveillance, and environmental
monitoring. Similar to other remote-sensing techniques, HSI typically deals with large
amounts of data that in some applications must be processed in real-time to provide
immediate assessment of potentially threatening scenarios. In this study, we apply our
hybrid simulation approach to an HSI application based on the algorithm presented in [75]
in order to showcase its capabilities when analyzing a real scientific application executing
on the DM system.
Figure 5-10 illustrates the dataflow diagram of the HSI application. Each
participating node acquires a slab of the image, calculates the autocorrelation sample
matrix (ACSM), and transmits the results to a single root node. The root node processes
the data collected from each node in the weight calculation stage and broadcasts C
classification constraints to each node. The nodes then classify the original image data
based on the constraints and save the resulting data to construct an output image. Table
5-4 displays the number of 4-byte elements transferred in each stage in terms of pixels
per row/column (N), spectral bands (L), number of processors (P), and number of
classification constraints (C).
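For a concrete image, the entries of Table 5-4 evaluate directly. The helper below is hypothetical; it uses the 4-byte element size noted above, and the value of C is chosen arbitrarily for illustration:

    ELEMENT_BYTES = 4  # each element is a 4-byte value

    def hsi_transfer_bytes(N, L, P, C):
        # Per-stage dataset sizes from Table 5-4, converted to bytes.
        return {
            "Get Data":  N * N * L * ELEMENT_BYTES // P,   # slab per node
            "Reduce":    L * L * ELEMENT_BYTES,            # ACSM reduction
            "Broadcast": C * L * ELEMENT_BYTES,            # constraints to nodes
            "Save Data": N * N * C * ELEMENT_BYTES // P,   # classified slab
        }

    # 1024x1024 pixels, 64 bands, 20 data nodes, C = 8 constraints (assumed):
    print(hsi_transfer_bytes(N=1024, L=64, P=20, C=8))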
For this case study, we explore the accuracy and speed of the hybrid version of
the DM flight system composed of twenty data nodes [53] as compared to its standard,
Figure 5-10. The HSI data decomposition and dataflow diagram
Table 5-4. Dataset sizes for each HSI data transaction

  Transaction   Dataset size (elements)
  Get Data      (N² × L) / P
  Reduce        L²
  Broadcast     C × L
  Save Data     (N² × C) / P
function-level counterpart. The study analyzes a total of ten image sizes (listed in Table
5-5). The first five datasets represent the images processed using current and emerging
implementations of the HSI application. The last five image sizes represent datasets that
may be analyzed in future versions of HSI and showcase the capabilities of the hybrid
technique when dealing with very large datasets. The horizontal rule in Table 5-5 denotes
the boundary beyond which extrapolation is used to approximate the simulation times
required by the standard, function-level approach to complete a single HSI iteration.
Extrapolation is employed as a means to quickly estimate the full functional simulation
time rather than occupy resources for such long periods of time (a sketch of one possible
fit follows Table 5-5). The study also considers two configurations of the hybrid
HAM and TCP models. The configuration labeled Hybrid-0MB uses the default values
listed in Table 5-3 while the Hybrid-8MB configuration sets the minimum hybrid message
size for both the HAM and TCP models to 8 MB. Recall from Section 5.4.3 that this
data size was determined to balance the hybrid method's speed against its accuracy for
smaller data transactions.
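Conceptually, the minimum hybrid message size acts as a dispatch threshold inside the communication models: transactions below it follow the accurate function-level path, while larger ones use the fast hybrid path. The sketch below illustrates this idea only; the names are ours and do not reflect the actual model library interfaces.

    MIN_HYBRID_BYTES = 8 * 2**20  # 8 MB, from the cross-over analysis in Section 5.4.3

    def simulate_transaction(size_bytes, function_level_model, hybrid_model):
        """Route one data transaction to the appropriate modeling path.

        function_level_model: detailed, event-per-fragment simulation (accurate)
        hybrid_model: calibrated analytical estimate of completion time (fast)
        """
        if size_bytes < MIN_HYBRID_BYTES:
            return function_level_model(size_bytes)  # accuracy matters most here
        return hybrid_model(size_bytes)              # speed matters most here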
Table 5-5. Simulation times for various HSI image sizes

    Image dimensions (elements)   Raw data size   Functional   Hybrid-0MB   Hybrid-8MB
    1024×1024×64                  256 MB          2.18 min     7.56 s       39.40 s
    1024×1024×128                 512 MB          3.92 min     8.08 s       40.25 s
    1024×1024×256                 1 GB            7.37 min     9.55 s       42.48 s
    1024×1024×512                 2 GB            14.31 min    15.34 s      50.88 s
    1024×1024×1024                4 GB            28.33 min    20.91 s      80.33 s
    ------------------------------------------------------------------------------
    2048×2048×1024                16 GB           1.88 hours   23.44 s      50.17 s
    4096×4096×1024                64 GB           7.51 hours   34.07 s      1.03 min
    8192×8192×1024                256 GB          1.25 days    1.25 min     1.74 min
    16384×16384×1024              1 TB            5.01 days    4.08 min     4.66 min
    32768×32768×1024              4 TB            20.03 days   15.53 min    16.29 min
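As one illustrative possibility (the sketch below is an assumption, not the exact procedure used to produce the table), the functional times beyond the rule can be projected with a least-squares power-law fit in log space to the five measured points.

    import math

    def fit_power_law(points):
        """Least-squares fit of t = a * size**b, linear in log-log space."""
        xs = [math.log(s) for s, _ in points]
        ys = [math.log(t) for _, t in points]
        n = len(points)
        mx, my = sum(xs) / n, sum(ys) / n
        b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
        a = math.exp(my - b * mx)
        return a, b

    # Measured functional times from Table 5-5: (raw data size in GB, minutes)
    measured = [(0.25, 2.18), (0.5, 3.92), (1, 7.37), (2, 14.31), (4, 28.33)]
    a, b = fit_power_law(measured)
    print(f"projected at 16 GB: {a * 16 ** b:.0f} min")  # ~100 min; the table lists 1.88 hours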
Table 5-5 shows simulation times required to complete a single iteration of HSI
processing the corresponding image while Figure 5-11 displays the error and speedup of
the two hybrid configurations versus the standard, function-level approach. The results
show that both hybrid configurations provide very accurate results as well as improved
simulation times for all image sizes. In fact, the maximum errors observed in the Hybrid-
0MB and Hybrid-8MB setups are 0.77% and 0.0032%, respectively. Similar to the previous
studies, we find that speedup increases with dataset size, though it begins to level off as
data sizes become very large. Maximum speedups of 1858× and 1771× are observed in the
Hybrid-0MB and Hybrid-8MB configurations, respectively. Note that Hybrid-8MB reports
less speedup for all data sizes due to the increased amount of fragmentation occurring
for small to medium data transactions. However, the accuracy of this configuration for
smaller image sizes is significantly improved, though even the Hybrid-0MB configuration
remains quite accurate.
Figure 5-11. The HSI accuracy and speedup results for two hybrid configurations: (a) error, (b) speedup
5.5 Conclusions
As data-intensive applications become increasingly prevalent, more efficient systems
must be designed to accommodate their special demands. In order to facilitate the design
of these systems, discrete-event simulation is often used to virtually prototype candidate
systems. However, the already lengthy analysis times of simulating complex systems are
further exacerbated when evaluating data-intensive applications due to the sheer volume
of data created, processed, and scheduled by the simulation environment. In this chapter, we
presented a novel approach for hybrid simulation to speed the analysis of applications
processing large datasets while retaining a high degree of accuracy. Our approach
featured two techniques, function-level training and micro-simulations, to calibrate
analytical models that depict the long-term, steady-state behaviors of the corresponding
components and account for changes in the system’s performance without the use of
feedback mechanisms. Details on our hybrid simulation approach were outlined and the
various implications of each technique used were discussed.
To showcase the capabilities of the proposed approach, we applied the techniques
to the NASA Dependable Multiprocessor. First, we observed the accuracy and speedup
achieved by the DM system models using the proposed techniques as compared to a pure
functional model while employing two low-level benchmarks. The PingPong benchmark
reported a mean relative error of 1.24% when using the hybrid simulation approach
while the MDSTest benchmark showed 1.64% and 3.01% errors for writes and reads,
respectively. Furthermore, our approach showed speedups up to 1850× in the PingPong
benchmark and over 1700× in the MDSTest. These large speedups were a result of the
drastic reduction of discrete events processed by the hybrid approach. However, the
outcomes observed from the initial tests represented best-case numbers. As a result, we
analyzed the hybrid DM system model while executing the MDSTest benchmark on two
to thirty-two nodes. This scenario caused contention at the MDS node, thereby exercising
the capabilities of our proposed methodology under more stressful conditions.
The results from this study showed errors up to 46%, though these larger errors occurred
for smaller data sizes. As the transaction size increased, the errors decreased to more
reasonable percentages and eventually to values less than 1%. Maximum speedups of the
hybrid approach were observed to be 595× and 790× for writes and reads, respectively, at
the maximum message size observed (i.e., 128 MB). While the proposed hybrid simulation
approach reported large errors in this study, they were observed at smaller transaction
sizes that displayed smaller speedups when employing the hybrid models. As a result,
we identified a cross-over point at 8 MB that supported the use of the function-level
models to ensure accurate results for small data sizes while transitioning to the hybrid
models for larger data sizes in order to support the speedy simulation of large
transactions.
Once our approach was validated and its potential demonstrated using low-level
benchmarks, we evaluated its accuracy and speed using a hyperspectral imaging (HSI)
application executing on the DM flight system. By analyzing various image sizes using
both the standard function-level and proposed hybrid simulations, we found that our
approach produced a maximum error of 0.77% while displaying a maximum speedup of
290×. The trends observed from the study showed larger errors for smaller datasets due
to inaccuracies in the analytical model while the observed speedup increased with larger
datasets. The analysis concluded by analyzing much larger datasets using the hybrid
simulation approach with extrapolations used to estimate the amount of time required
by the function-level models. The results showed a projected, maximum speedup of
1858×. Speedups of this magnitude suggest that our hybrid approach has the potential to
complete month-long simulations in mere minutes.
The work conducted during this final phase of research produced a hybrid
simulation approach that employed two novel techniques, function-level training and
micro-simulations, to reduce the analysis times of simulations of data-intensive
applications. These hybrid simulation mechanisms were incorporated into the FASE
framework and numerous pre-existing models within the FASE and DM model libraries
were retrofitted to accommodate these speedy features. Case studies demonstrated
the capabilities of the proposed approach as applied to the DM system executing a
data-intensive, hyperspectral imaging application. The contributions and accomplishments
of this work have been compiled into a manuscript that was submitted to ACM Transactions
on Modeling and Computer Simulation [76].
CHAPTER 6
CONCLUSIONS
This document presented three phases of research that show the wide range of topics
addressed by the FASE framework. The first phase analyzed the various aspects
involved with the design and development of a performance prediction infrastructure
using application characterization and discrete-event simulation in order to balance speed
and accuracy. The work laid out a generalized methodology to predict the performance
of applications executing on virtually prototyped systems. The methodology was then
realized through the use of an application characterization tool called Sequoia, which
traced MPI function calls and measured computation time between communication
events, and a pre-built model library created in MLDesigner, a hierarchical, discrete-event
simulation tool. Case studies were then conducted to observe simulation accuracy and
speed when compared to experimental measurements. The results showed prediction
errors within an acceptable threshold (within 25%) and simulation times no more
than three orders of magnitude slower than experimental processing times. After
validation, the potential of FASE was showcased with an in-depth study of the Sweep3D
algorithm executing on virtual systems composed of various network types, middleware
implementations, processing capabilities, and other degrees of freedom in the systems’
hardware and software configurations.
The framework developed in the first phase of this research created an ideal
environment to evaluate high-performance, embedded space systems. Consequently,
a NASA-sponsored project was used to explore the flexibility of FASE and extend
the framework to support not only scalability studies of the proposed space system,
but also availability studies. The work conducted in phase two expanded the FASE
pre-built models to include reliable middleware technologies that monitor system health,
schedule and deploy jobs, and react and recover from faults. A fault model library
was also developed to inject faults into the system in order to perform availability
studies. After the necessary toolkit was in place, a three-stage analysis procedure was
formulated for performance and availability evaluations. This approach allows users to
calibrate component models using experimental measurements, quickly identify workloads
and fault rates supported by the management software via Markov-reward models, and
thoroughly investigate specific applications executing on virtually prototyped systems
through discrete-event simulation. The novel analysis methodology and simulation
models were then applied to explore the scalability of the 2D FFT kernel executing on
the DM system. The scalability study revealed that the prime bottleneck of the system
was the centralized memory, and algorithmic and architectural variations were analyzed to
alleviate the problem. After the scalability analysis, the SAR application was used to
study the performability of the virtual flight system consisting of twenty data nodes.
The results showed good system throughput (i.e., between 300 and 600 images per orbit)
and performability (i.e., over 99.5% in low radiation environments and 54.0% in extreme
conditions) when the system was enhanced and augmented with improved data processors
and MDS storage devices as well as extra MDS nodes to mitigate the contention point
discovered in the 2D FFT case study and further substantiated in the SAR study.
The final phase of this research considered extensions to the existing FASE framework
in order to overcome scalability issues with simulation time when analyzing data-intensive
applications. To overcome these issues, we proposed a novel hybrid simulation approach
that employs two unique techniques, function-level training and micro-simulations, to
reduce the amount of time required to simulate systems executing applications with
large datasets. The proposed approach combines the accuracy of function-level models,
via function-level training, with the speed of analytical models and micro-simulations
in order to quickly and accurately approximate the time needed to complete a data
transaction. This combination drastically reduces the number of discrete events processed
and scheduled by the simulator, thus resulting in simulation speedups over the pure
function-level approach. The approach was then applied to the DM system executing
low-level benchmarks and a hyperspectral imaging application. The low-level benchmarks
showed relatively accurate results (errors under 7%) and order-of-magnitude speedups when
considering data transactions as small as 8 MB. Furthermore, the accuracy and speedup
of the hybrid simulation approach improved as the transaction size increased. In fact, the
simulations reported minimum errors of less than 1% and speedups over 1700× for the
low-level benchmarks and 1500× for the HSI application.
APPENDIX A
EXPERIMENTAL AND SIMULATIVE SETUP
This research project incorporates two realms to explore various aspects and proposed
techniques for fast and accurate performance prediction. These realms are experimental
and simulative. The experimental realm deals with physical hardware and software used
to construct a compute system to collect “real-world” values, which are compared to
the results gathered in the second realm, simulation. The simulation realm provides an
environment to explore computational systems unavailable to the researcher due to limited
funds, non-existent components, or future generations of components. More details on the
interactions between these realms as applied to performance modeling and prediction are
discussed in the following sections.
A.1 Experimental Setup
The work conducted during the course of this study employs equipment from the
High-performance Computing and Simulation (HCS) Lab at the University of Florida.
The HCS lab consists of nine compute clusters, each with different processor, interconnect,
main-memory, hard-disk, and software resources. Table A-1 lists the subset of clusters
used for this study along with their resource types, capabilities, and capacities.
Table A-1. Computation systems at the HCS Lab at UF

    Cluster name   CPU type   CPU speed   CPU count   Node count   Memory        Special features
    Alpha          Xeon       3.2 GHz     128         32           2 GB DDR667   EM64T, quad-core
    Delta          Xeon       3.2 GHz     16          16           2 GB DDR333   EM64T, PCI-Express
    Mu             Opteron    2.0 GHz     32          32           1 GB DDR400   PCI-Express, QsNetII
    Lambda         Opteron    1.4 GHz     32          16           1 GB DDR333   10 Gb/s InfiniBand
    Kappa          Xeon       2.4 GHz     70          35           1 GB DDR266
A.2 Simulation Setup
The modeling tool employed for this project was Mission-Level Designer (MLD)
developed by MLDesign Technologies, Inc. [27]. MLD is a block-oriented, discrete-event
simulation environment that supports modular and hierarchical designs. At its core,
MLD uses primitives: blocks of C++ code that each provide a specific function such as
arithmetic, dataflow switching, or data queuing. Larger modules and systems are constructed by
connecting two or more primitives and/or other modules via a graphical interface. In order
to further facilitate user design, MLD supplies numerous libraries with pre-built primitives
and modules. Figure A-1 shows the development environment of MLD.
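To illustrate the block-oriented, discrete-event style that MLD embodies, the following minimal sketch (our own Python illustration, not MLD's actual C++ API) shows toy primitives exchanging timestamped events through a central event queue.

    import heapq
    import itertools

    class Simulator:
        def __init__(self):
            self._queue = []               # entries: (time, seq, handler, payload)
            self._seq = itertools.count()  # tie-breaker for equal timestamps
            self.now = 0.0

        def schedule(self, delay, handler, payload=None):
            heapq.heappush(self._queue,
                           (self.now + delay, next(self._seq), handler, payload))

        def run(self):
            while self._queue:
                self.now, _, handler, payload = heapq.heappop(self._queue)
                handler(payload)

    # Two toy "primitives" wired together: a source feeding a delay block.
    sim = Simulator()

    def delay_block(pkt):
        print(f"t={sim.now:.1f}: delivered {pkt}")

    def source(_):
        sim.schedule(2.5, delay_block, "packet")  # model a 2.5-unit link latency

    sim.schedule(0.0, source)
    sim.run()  # prints: t=2.5: delivered packet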
Figure A-1. The MLD development environment
REFERENCES
[1] O. Lubeck, Y. Luo, H. Wasserman, and F. Bassetti, “An Empirical Hierarchical Memory Model Based on Hardware Performance Counters,” Proc. Int’l Conf. Parallel and Distributed Processing Techniques and Applications, Las Vegas, NV, July 13-16, 1998.
[2] D. Kerbyson, H. Wasserman, and A. Hoisie, “Exploring Advanced Architectures Using Performance Prediction,” Proc. Int’l Workshop on Innovative Architecture, Kohala Coast, Big Island, HI, Jan. 10-11, 2002.
[3] M. Salsburg, “A Statistical Approach to Computer Performance Modeling,” ACM SIGMETRICS Performance Evaluation Review, vol. 15, no. 1, pp. 155-162, May 1987.
[4] E. Strohmaier, “Statistical Performance Modeling: Case Study of the NPB 2.1 Results,” Proc. Third Int’l Euro-Par Conf. Parallel Processing, Passau, Germany, Aug. 26-29, 1997.
[5] R. Jain, The Art of Computer Systems Performance Analysis. John Wiley and Sons, 1991.
[6] A. Sampogna, D. Kaeli, D. Green, M. Silva, and C. Sniezek, “Performance Modeling Using Object-Oriented Execution-Driven Simulation,” Proc. 29th Simulation Symp., New Orleans, LA, Apr. 8-11, 1996.
[7] S. Dwarkadas, J. Jump, and J. Sinclair, “Execution-Driven Simulation of Multiprocessors: Address and Timing Analysis,” ACM Trans. Modeling and Computer Simulation, vol. 4, no. 4, pp. 314-338, Oct. 1994.
[8] R. Uhlig and T. Mudge, “Trace-driven Memory Simulation: A Survey,” ACM Computing Surveys, vol. 29, no. 2, pp. 128-170, June 1997.
[9] J. Flanagan, B. Nelson, J. Archibald, and G. Thompson, “The Inaccuracy of Trace-Driven Simulation Using Incomplete Multiprogramming Trace Data,” Proc. Fourth Int’l Workshop Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, San Jose, CA, Feb. 1-3, 1996.
[10] S. Moore, F. Wolf, J. Dongarra, S. Shende, P. Teller, and B. Mohr, “A Scalable Approach to MPI Application Analysis,” Proc. 12th European PVM/MPI Users’ Group Meeting, Sorrento, Italy, Sept. 18-21, 2005.
[11] D. Culler, J. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, 1998.
[12] MPI Forum, “MPI: A Message-Passing Interface Standard,” University of Tennessee, Version 1.1, June 1995.
[13] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, “A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard,” Parallel Computing, vol. 22, no. 6, pp. 789-828, Sept. 1996.
[14] A. George, R. Fogarty, J. Markwell, and M. Miars, “An Integrated Simulation Environment for Parallel and Distributed System Prototyping,” Simulation, vol. 72, no. 5, pp. 283-294, May 1999.
[15] E. Deelman, A. Dube, A. Hoisie, Y. Luo, R. Oliver, D. Sundaram-Stukel, H. Wasserman, V. Adve, R. Bagrodia, J. Browne, E. Houstis, O. Lubeck, J. Rice, P. Teller, and M. Vernon, “POEMS: End-to-End Performance Design of Large Parallel Adaptive Computational Systems,” IEEE Trans. Software Engineering, vol. 26, no. 11, pp. 1027-1048, Nov. 2000.
[16] R. Bagrodia, E. Deelman, S. Docy, and T. Phan, “Performance Prediction of Large Parallel Applications Using Parallel Simulations,” Proc. Seventh ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, pp. 151-161, Atlanta, GA, May 1999.
[17] M. Uysal, T. Kurc, A. Sussman, and J. Saltz, “A Performance Prediction Framework for Data Intensive Applications on Large Scale Parallel Machines,” Technical Report CS-TR-3918 and UMIACS-TR-98-39, University of Maryland, Department of Computer Science and UMIACS, July 1998.
[18] J. Cao, D. Kerbyson, E. Papaefstathiou, and G. Nudd, “Performance Modeling of Parallel and Distributed Computing Using PACE,” Proc. 19th IEEE Int’l Performance, Computing, and Communications Conf., pp. 485-492, Phoenix, AZ, Feb. 20-22, 2000.
[19] S. Pllana and T. Fahringer, “Performance Prophet: A Performance Modeling and Prediction Tool for Parallel and Distributed Programs,” Proc. Int’l Conf. Parallel Processing, Oslo, Norway, June 14-17, 2005.
[20] A. Snavely, L. Carrington, and N. Wolter, “A Framework for Performance Modeling and Prediction,” Proc. 15th Supercomputing Conf., Baltimore, MD, Nov. 16-22, 2002.
[21] D. Bailey and A. Snavely, “Performance Modeling: Understanding the Present and Predicting the Future,” Proc. Euro-Par Conf., Lisbon, Portugal, Aug. 30-Sept. 2, 2005.
[22] R. Badia, J. Labarta, J. Gimenez, and F. Escale, “DIMEMAS: Predicting MPI Applications Behavior in Grid Environments,” Proc. Workshop on Grid Applications and Programming Tools, Seattle, WA, June 25, 2003.
[23] S. Moore, D. Cronk, K. London, and J. Dongarra, “Review of Performance Analysis Tools for MPI Parallel Programs,” Technical Report, University of Tennessee, Computer Science Department, 1998.
[24] L. DeRose and D. Reed, “SvPablo: A Multi-Language Architecture-Independent Performance Analysis System,” Proc. Int’l Conf. Parallel Processing, Fukushima, Japan, Sept. 1999.
[25] B. Miller, M. Callaghan, J. Cargille, J. Hollingsworth, R. Irvin, K. Karavanic, K. Kunchithapadam, and T. Newhall, “The Paradyn Parallel Performance Measurement Tool,” IEEE Computer, vol. 28, no. 11, pp. 37-46, Nov. 1995.
[26] S. Shende and A. Malony, “The TAU Parallel Performance System,” Int’l Journal of High Performance Computing Applications, vol. 20, no. 2, pp. 287-331, Summer 2006.
[27] G. Schorcht, I. Troxel, K. Farhangian, P. Unger, D. Zinn, C. Mick, A. George, and H. Salzwedel, “System-Level Simulation Modeling with MLDesigner,” Proc. 11th Int’l Symp. Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pp. 207-212, Orlando, FL, Oct. 12-15, 2003.
[28] J. Vetter, N. Bhatia, E. Grobelny, and P. Roth, “Capturing Petascale Application Characteristics with the Sequoia Toolkit,” Proc. Parallel Computing, Malaga, Spain, Sept. 13-16, 2005.
[29] S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci, “A Scalable Cross-Platform Infrastructure for Application Performance Tuning Using Hardware Counters,” Proc. 13th Supercomputing Conf., Dallas, TX, Nov. 4-10, 2000.
[30] E. Grobelny and J. Vetter, “Extrapolating Communication Patterns of Large-scale Scientific Applications,” Technical Report, University of Florida and Oak Ridge National Laboratory, 2006.
[31] O. Zaki, E. Lusk, W. Gropp, and D. Swider, “Toward Scalable Performance Visualization with Jumpshot,” The Int’l Journal of High Performance Computing Applications, vol. 13, no. 2, pp. 277-288, Fall 1999.
[32] D. Gustavson and Q. Li, “The Scalable Coherent Interface (SCI),” IEEE Communications Magazine, vol. 34, no. 8, pp. 52-63, Aug. 1996.
[33] K. Koch, R. Baker, and R. Alcouffe, “Solution of the First-Order Form of the 3-D Discrete Ordinates Equation on a Massively Parallel Processor,” Trans. of the American Nuclear Society, vol. 65, no. 198, 1992.
[34] J. Vetter and A. Yoo, “An Empirical Performance Evaluation of Scalable Scientific Applications,” Proc. 15th Supercomputing Conf., Baltimore, MD, Nov. 16-22, 2002.
[35] E. Grobelny, D. Bueno, I. Troxel, A. George, and J. Vetter, “FASE: A Framework for Scalable Performance Prediction of HPC Systems and Applications,” Simulation: Transactions of The Society for Modeling and Simulation International, vol. 83, no. 10, pp. 721-745, Oct. 2007.
[36] M. Griffin, “NASA 2006 Strategic Plan,” National Aeronautics and Space Administration, NP-2006-02-423-HQ, Washington DC, Feb. 2006.
[37] J. Ramos, J. Samson, D. Lupia, I. Troxel, R. Subramaniyan, A. Jacobs, J. Greco, G. Cieslewski, J. Curreri, M. Fischer, E. Grobelny, A. George, V. Aggarwal, M. Patel, and R. Some, “High-Performance, Dependable Multiprocessor,” Proc. IEEE/AIAA Aerospace Conf., Big Sky, MT, Mar. 4-11, 2006.
[38] D. Dechant, “The Advanced Onboard Signal Processor (AOSP),” Advances in VLSI and Computer Systems, vol. 2, no. 2, pp. 69-78, Oct. 1990.
[39] M. Iacoponi and D. Vail, “The Fault Tolerance Approach of the Advanced Architecture On-Board Processor,” Proc. Symp. Fault-Tolerant Computing, Chicago, IL, June 21-23, 1989.
[40] F. Chen, L. Craymer, J. Deifik, A. Fogel, D. Katz, A. Silliman Jr., R. Some, S. Upchurch, and K. Whisnant, “Demonstration of the Remote Exploration and Experimentation (REE) Fault-Tolerant Parallel-Processing Supercomputer for Spacecraft Onboard Scientific Data Processing,” Proc. Int’l Conf. Dependable Systems and Networks, New York, NY, June 25-28, 2000.
[41] E. Prado, P. Prewitt, and E. Ille, “A Standard Approach to Spaceborne Payload Data Processing,” Proc. IEEE Aerospace Conf., Big Sky, MT, March 10-17, 2001.
[42] S. Fuller, RapidIO: The Embedded System Interconnect. John Wiley & Sons, 2005.
[43] J. Meyer, “On Evaluating the Performability of Degradable Computing Systems,” IEEE Trans. Computers, vol. C-29, no. 8, pp. 720-731, Aug. 1980.
[44] R. Subramaniyan, V. Aggarwal, A. Jacobs, and A. George, “FEMPI: A Lightweight Fault-tolerant MPI for Embedded Cluster Systems,” Proc. Int’l Conf. Embedded Systems and Applications, Las Vegas, NV, June 26-29, 2006.
[45] R. Smith, K. Trivedi, and A. Ramesh, “Performability Analysis: Measures, an Algorithm and a Case Study,” IEEE Trans. Computers, vol. 37, no. 4, pp. 406-417, Apr. 1988.
[46] B. Haverkort, R. Marie, G. Rubino, and K. Trivedi (editors), Performability Modeling: Techniques and Tools. Wiley, 2001.
[47] C. Hirel, R. Sahner, X. Zang, and K. Trivedi, “Reliability and Performability Modeling Using SHARPE 2000,” Proc. Int’l Conf. Computer Performance Evaluation: Modeling Techniques and Tools, Schaumburg, IL, Mar. 27-31, 2000.
[48] I. Troxel, E. Grobelny, and A. George, “System Management Services for High-Performance In-situ Aerospace Computing,” AIAA Journal of Aerospace Computing, Information, and Communication, vol. 4, no. 2, pp. 636-656, Feb. 2007.
[49] D. Bueno, C. Conger, A. Leko, I. Troxel, and A. George, “RapidIO-based Space System Architectures for Synthetic Aperture Radar and Ground Moving Target Indicator,” High-Performance Embedded Computing Workshop, MIT Lincoln Lab, Lexington, MA, Sept. 20-22, 2005.
[50] P. Meisl, M. Ito, and I. Cumming, “Parallel Synthetic Aperture Radar Processing on Workstation Networks,” Proc. 10th Int’l Parallel Processing Symp., pp. 716-723, Honolulu, HI, Apr. 15-19, 1996.
[51] C. Miller, D. Payne, T. Phung, H. Siegel, and R. Williams, “Parallel Processing of Spaceborne Imaging Radar Data,” Proc. Eighth Supercomputing Conf., San Diego, CA, Dec. 4-8, 1995.
[52] D. Sandwell, “SAR Image Formation: ERS SAR Processor Coded in MATLAB,” http://www.geo.uzh.ch/rsl/research/SARLab/GMTILiterature/PDF/San02d.pdf, 2002.
[53] J. Samson, G. Gardner, D. Lupia, M. Patel, P. Davis, V. Aggarwal, A. George, Z. Kalbarcyzk, and R. Some, “Technology Validation: NMP ST8 Dependable Multiprocessor Project II,” Proc. IEEE Aerospace Conf., Big Sky, MT, Mar. 3-10, 2007.
[54] E. Grobelny, G. Cieslewski, I. Troxel, and A. George, “Predicting the Performance of Radiation-Susceptible Aerospace Computing Systems and Applications,” submitted to ACM Trans. Embedded Computing Systems.
[55] M. Cannataro, D. Talia, and P. Srimani, “Parallel Data Intensive Computing in Scientific and Commercial Applications,” Parallel Computing, vol. 28, no. 5, pp. 673-704, May 2002.
[56] U. Fayyad, “Data Mining and Knowledge Discovery: Making Sense Out Of Data,” IEEE Expert: Intelligent Systems and Their Applications, vol. 11, no. 5, pp. 20-25, Oct. 1996.
[57] Data-Intensive Computing Initiative (DICI), Pacific Northwest National Laboratory, http://dicomputing.pnl.gov/.
[58] K. Walsh and E. Sirer, “Staged Simulation: A General Technique For Improving Simulation Scale and Performance,” ACM Trans. Modeling and Computer Simulation, vol. 14, no. 2, pp. 170-195, Apr. 2004.
[59] R. Fujimoto, “Parallel Simulation: Parallel and Distributed Simulation Systems,” Proc. Winter Simulation Conf., pp. 147-157, Arlington, VA, Dec. 9-12, 2001.
[60] D. Nicol, “Principles of Conservative Parallel Simulation,” Proc. Winter Simulation Conf., pp. 128-135, Coronado, CA, Dec. 8-11, 1996.
[61] B. Groselj, “CPSim: A Tool for Creating Scalable Discrete-Event Simulations,” Proc. Winter Simulation Conf., pp. 579-583, Arlington, VA, Dec. 3-6, 1995.
[62] R. Bagrodia, R. Meyer, M. Takai, Y. Chen, X. Zeng, J. Martin, B. Park, and H. Song, “Parsec: A Parallel Simulation Environment for Complex Systems,” IEEE Computer, vol. 31, no. 10, pp. 77-85, Oct. 1998.
[63] R. Fujimoto, “Optimistic Approaches to Parallel Discrete-event Simulation,” Trans. of the Society for Computer Simulation International, vol. 7, no. 2, pp. 153-191, June 1990.
[64] Y. Teo, S. Tay, and S. Kong, “SPaDES: An Environment for Structured Parallel Simulation,” Technical Report TR20/96, Department of Information Systems and Computer Science, National University of Singapore, Singapore, Oct. 1996.
[65] D. Martin, T. McBrayer, and P. Wilsey, “WARPED: A Time Warp Simulation Kernel for Analysis and Application Development,” Proc. 29th Hawaii Int’l Conf. System Sciences Volume 1: Software Technology and Architecture, pp. 383-386, Jan. 3-6, 1996.
[66] J. West and A. Mullarney, “ModSim: A Language for Distributed Simulation,” Proc. SCS Multiconf. Distributed Simulation, pp. 155-159, San Diego, CA, Feb. 3-5, 1988.
[67] Y. Liu, F. Presti, V. Misra, D. Towsley, and Y. Gu, “Fluid Models and Solutions for Large-scale IP Networks,” Proc. ACM SIGMETRICS Int’l Conf. Measurement and Modeling Computer Systems, pp. 91-101, San Diego, CA, June 10-14, 2003.
[68] A. Yan and W. Gong, “Time-driven Fluid Simulation for High Speed Networks,” IEEE Trans. Information Theory, vol. 45, no. 5, pp. 1588-1599, June 1999.
[69] B. Liu, Y. Gao, J. Kurose, D. Towsley, and W. Gong, “Fluid Simulation of Large-scale Networks: Issues and Tradeoffs,” Proc. Int’l Conf. Parallel and Distributed Processing Techniques and Applications, pp. 2136-2142, Las Vegas, NV, June 28-July 1, 1999.
[70] G. Kesidis, A. Singh, D. Cheung, and W. Kwok, “Feasibility of Fluid-Driven Simulation for ATM Network,” Proc. IEEE Global Telecommunications Conf., pp. 2013-2017, London, England, Nov. 18-22, 1996.
[71] B. Melamed and S. Pan, “HNS: A Streamlined Hybrid Network Simulator,” ACM Trans. Modeling and Computer Simulation, vol. 14, no. 3, pp. 251-277, July 2004.
[72] G. Riley, T. Jafaar, and R. Fujimoto, “Integrated Fluid and Packet Network Simulations,” Proc. IEEE Int’l Symp. Modeling, Analysis and Simulation of Computer and Telecommunications Systems, pp. 511-518, Fort Worth, TX, Oct. 11-16, 2002.
[73] C. Kiddle, R. Simmonds, C. Williamson, and B. Unger, “Hybrid Packet/Fluid Flow Network Simulation,” Proc. 17th Workshop Parallel and Distributed Simulation, pp. 143-152, San Diego, CA, June 10-13, 2003.
[74] M. Uysal, T. Kurc, A. Sussman, and J. Saltz, “A Performance Prediction Framework for Data Intensive Applications on Large-Scale Parallel Machines,” Proc. Fourth Int’l Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers, pp. 243-258, Pittsburgh, PA, May 28-30, 1998.
[75] C. Chang, H. Ren, and S. Chiang, “Real-time Processing Algorithms for Target Detection and Classification in Hyperspectral Imagery,” IEEE Trans. Geoscience and Remote Sensing, vol. 39, no. 4, pp. 760-768, Apr. 2001.
[76] E. Grobelny, C. Reardon, and A. George, “A Hybrid Simulation Approach to Reduce Analysis Time of Data-Intensive Applications,” submitted to ACM Trans. Modeling and Computer Simulation.
BIOGRAPHICAL SKETCH
Eric Grobelny began attending the University of Florida in Fall 1998 and received
his B.S. in 2002 and M.E. in 2004. He conducted research at the High-performance
Computing and Simulation (HCS) Laboratory under the supervision of Dr. Alan George
for six years focusing on performance analysis and prediction for high-performance
computing systems and applications. His other interests include high-performance
embedded computing for aerospace systems and applications, simulation-based fault
injection, and high-performance interconnect technologies. As a member of the HCS lab,
he also worked on numerous side projects including developing an MPI communication
layer for satellite systems, investigating performance enhancements for low-locality
applications, and exploring techniques and best practices for disaster recovery and
mission assurance in dynamic, high-performance environments. Eric has accepted a job at
Honeywell Space Systems in Clearwater, Florida.